DPDK patches and discussions
* [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control
@ 2018-06-08  9:57 Liang Ma
  2018-06-08  9:57 ` [dpdk-dev] [PATCH v1 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
                   ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Liang Ma @ 2018-06-08  9:57 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, radu.nicolau, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core is matters
for the following reasons:

   * Without it, there is no indication of overload conditions.

   * Users do not know how much real load is on a system, which results in
     wasted energy because no power management is utilized.

Tried and failed schemes include calculating the cycles required from
the load on the core, in other words the busyness: for example, measuring
how many cycles it costs to handle each packet and determining the frequency
cost per core. Because traffic, frame types and per-packet cycle costs all
vary, this mechanism quickly becomes complex, whereas a simple scheme is
required to solve the problems.

2. Proposed solution

For all polling mechanisms, the proposed solution focuses on how many times
an empty poll is executed instead of calculating how many cycles it costs to
handle each packet. A low empty poll count means the current core is busy
processing its workload, so a higher frequency is needed. A high empty poll
count indicates the current core has lots of spare time, so the frequency
can be lowered.
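
The counting idea above can be sketched as follows. This is a minimal,
self-contained illustration and not code from the patch: struct
poll_counters, fake_rx_burst() and run_polls() are hypothetical names, and
the receive call is replaced by a fixed pattern so that the sketch needs no
NIC or DPDK headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core counters; the names are illustrative only. */
struct poll_counters {
	uint64_t empty_polls;	/* polls that returned no packets */
	uint64_t packets;	/* packets returned by non-empty polls */
};

/* Stand-in for a receive burst: replays a fixed pattern of burst sizes. */
static unsigned int
fake_rx_burst(const unsigned int *pattern, size_t len, size_t poll_idx)
{
	return pattern[poll_idx % len];
}

/* Poll nb_polls times and classify each poll as empty or non-empty. */
static void
run_polls(struct poll_counters *c, const unsigned int *pattern,
		size_t len, size_t nb_polls)
{
	size_t i;

	assert(c != NULL && pattern != NULL && len > 0);

	for (i = 0; i < nb_polls; i++) {
		unsigned int nb_rx = fake_rx_burst(pattern, len, i);

		if (nb_rx == 0)
			c->empty_polls++;	/* idle poll: core had spare time */
		else
			c->packets += nb_rx;	/* valid poll: core did real work */
	}
}
```

With the pattern {0, 0, 0, 4} and 1000 polls, three polls in four are empty,
so the counters end at 750 empty polls and 1000 packets. A busier pattern
such as {32, 32, 0, 32} yields only 250 empty polls over the same number of
polls, which is the signal used to ask for a higher frequency.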

2.1 Power state definition:

	LOW:  the frequency is used for purge mode.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a. Initialization/Training phase. With no traffic passing through,
	   the system measures the average empty poll count in each of the
	   LOW/MED/HIGH power states. Those averages become the baseline
	   for the normal phase. The system collects every core's counters
	   every 100ms. The Training phase will take 5 seconds.

	b. Normal phase. When real traffic passes through, the system
	   compares the run-time moving average of the empty poll count with
	   the baseline, then decides whether to move to the HIGH power state
	   or the MED power state. The system collects every core's counters
	   every 100ms.
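
The two phases above can be sketched as a small model. This is a simplified
illustration under stated assumptions, not the patch's implementation: the
names core_model, train_baseline(), busy_percent(), enter_state() and step()
are hypothetical; integer arithmetic is used; the moving-average window is
seeded with the baseline on every state change (a choice made for this
sketch); and a switch happens once the threshold has been crossed for
HOLD_INTERVALS consecutive intervals. The 70/30 thresholds, the 4-entry
window and the 0.05% margin are the defaults that appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define BINS		4	/* moving-average window size */
#define UP_THRESHOLD	70	/* MED -> HIGH when busy percent is above this */
#define DOWN_THRESHOLD	30	/* HIGH -> MED when busy percent is below this */
#define HOLD_INTERVALS	10	/* consecutive intervals needed before a switch */

enum state { ST_MED, ST_HIGH };

struct core_model {
	uint64_t baseline[2];	/* trained empty polls per interval, per state */
	uint64_t window[BINS];	/* recent empty polls per interval */
	uint32_t next;		/* next window slot to overwrite */
	uint32_t hold;		/* consecutive intervals past the threshold */
	enum state st;
};

/* Training: average the idle samples, then add a 0.05% margin. */
static uint64_t
train_baseline(const uint64_t *samples, uint32_t n)
{
	uint64_t sum = 0;
	uint32_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	sum /= n;
	return sum + sum / 2000;
}

/* Busy percent: how far the empty poll average fell below the baseline. */
static uint32_t
busy_percent(uint64_t avg_empty, uint64_t baseline)
{
	assert(baseline > 0);
	if (avg_empty >= baseline)
		return 0;
	return (uint32_t)(100 - (avg_empty * 100) / baseline);
}

static void
enter_state(struct core_model *m, enum state st)
{
	uint32_t i;

	m->st = st;
	m->hold = 0;
	m->next = 0;
	for (i = 0; i < BINS; i++)
		m->window[i] = m->baseline[st];	/* start from "idle" */
}

/* Normal phase: feed one interval's empty poll count, return the state. */
static enum state
step(struct core_model *m, uint64_t empty_this_interval)
{
	uint64_t sum = 0;
	uint32_t i, pct;
	int crossed;

	m->window[m->next++ % BINS] = empty_this_interval;
	for (i = 0; i < BINS; i++)
		sum += m->window[i];

	pct = busy_percent(sum / BINS, m->baseline[m->st]);
	crossed = (m->st == ST_MED) ? (pct > UP_THRESHOLD)
				    : (pct < DOWN_THRESHOLD);
	if (!crossed) {
		m->hold = 0;	/* reset: the condition must be sustained */
		return m->st;
	}
	if (++m->hold >= HOLD_INTERVALS)
		enter_state(m, m->st == ST_MED ? ST_HIGH : ST_MED);
	return m->st;
}
```

Requiring the threshold to be crossed for several consecutive intervals, and
using different thresholds for moving up and moving down, keeps a core from
flapping between MED and HIGH on short bursts.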

3. Proposed API

1.  rte_empty_poll_stat_init(void);
which is used to initialize the power management system.

2.  rte_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
which allows the user to customize the frequency of a power state.

8.  rte_empty_poll_setup_timer(void);
which is used to set up the timer/callback that processes all the above
counters.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 lib/librte_power/Makefile         |   3 +-
 lib/librte_power/meson.build      |   5 +-
 lib/librte_power/rte_empty_poll.c | 529 ++++++++++++++++++++++++++++++++++++++
 lib/librte_power/rte_empty_poll.h | 135 ++++++++++
 4 files changed, 669 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_empty_poll.c
 create mode 100644 lib/librte_power/rte_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..dbc175a 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -16,8 +16,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..5270fa3 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_empty_poll.c')
+headers = files('rte_power.h','rte_empty_poll.h')
diff --git a/lib/librte_power/rte_empty_poll.c b/lib/librte_power/rte_empty_poll.c
new file mode 100644
index 0000000..57bd63b
--- /dev/null
+++ b/lib/librte_power/rte_empty_poll.c
@@ -0,0 +1,529 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+
+
+
+#include "rte_power.h"
+#include "rte_empty_poll.h"
+
+
+
+
+#define INTERVALS_PER_SECOND 10     /* (100ms) */
+#define SECONDS_TO_TRAIN_FOR  5
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
+		return 1000UL; /* Value to make us fail */
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent > poll_stats->thresh[poll_stats->cur_freq].
+				threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, HGH_BUSY);
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent < poll_stats->thresh[poll_stats->cur_freq].
+				threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_trainning(struct priority_worker *poll_stats,
+		uint32_t max_train_iter) {
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+static void
+empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_trainning(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int
+rte_empty_poll_stat_init(void)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	freq_index[LOW] = 14;
+	freq_index[MED] = 9;
+	freq_index[HGH] = 1;
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void
+rte_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int
+rte_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int
+rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t
+rte_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t
+rte_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
+
+void
+rte_empty_poll_set_freq(enum freq_val index, uint32_t limit)
+{
+	switch (index) {
+
+	case LOW:
+		freq_index[LOW] = limit;
+		break;
+
+	case MED:
+		freq_index[MED] = limit;
+		break;
+
+	case HGH:
+		freq_index[HGH] = limit;
+		break;
+	default:
+		break;
+	}
+}
+
+void
+rte_empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			empty_poll_detection,
+			(void *)ep_ptr);
+
+}
diff --git a/lib/librte_power/rte_empty_poll.h b/lib/librte_power/rte_empty_poll.h
new file mode 100644
index 0000000..7e036ee
--- /dev/null
+++ b/lib/librte_power/rte_empty_poll.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* has to be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         31 /*any reasonable prime number should work*/
+
+
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* queue stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+struct ep_params {
+
+	/* timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+	struct rte_timer timer0;
+
+	struct stats_data wrk_data;
+};
+
+
+int rte_empty_poll_stat_init(void);
+
+void rte_empty_poll_stat_free(void);
+
+int rte_empty_poll_stat_update(unsigned int lcore_id);
+
+int rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+uint64_t rte_empty_poll_stat_fetch(unsigned int lcore_id);
+
+uint64_t rte_poll_stat_fetch(unsigned int lcore_id);
+
+void rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
+
+void rte_empty_poll_setup_timer(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
-- 
2.7.5


* [dpdk-dev] [PATCH v1 2/2] examples/l3fwd-power: simple app update to support new API
  2018-06-08  9:57 [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
@ 2018-06-08  9:57 ` Liang Ma
  2018-06-19 10:31   ` Hunt, David
  2018-06-08 15:26 ` [dpdk-dev] [PATCH v2 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-06-14 10:59 ` [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Hunt, David
  2 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-06-08  9:57 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, radu.nicolau, Liang Ma

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 examples/l3fwd-power/main.c | 229 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 211 insertions(+), 18 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 596d645..22a0e4e 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_empty_poll.h>
 
 #define RTE_LOGTYPE_L3FWD_POWER RTE_LOGTYPE_USER1
 
@@ -129,6 +130,9 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+volatile bool empty_poll_stop;
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -336,6 +340,10 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -344,7 +352,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -353,20 +369,23 @@ signal_exit_now(int sigtype)
 			ret = rte_power_exit(lcore_id);
 			if (ret)
 				rte_exit(EXIT_FAILURE, "Power management "
-					"library de-initialization failed on "
-							"core%u\n", lcore_id);
+						"library de-initialization failed on "
+						"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -831,6 +850,108 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	RTE_LOG(INFO, L3FWD_POWER, "entering main empty_poll loop on lcore %u\n", lcore_id);
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 
 /* main processing loop */
 static int
@@ -1128,7 +1249,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1231,10 +1353,12 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
+		{"empty-poll", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
@@ -1259,7 +1383,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_portmask(optarg);
+			rte_empty_poll_set_freq(LOW, limit);
+			break;
+		case 'm':
+			limit = parse_portmask(optarg);
+			rte_empty_poll_set_freq(MED, limit);
+			break;
+		case 'h':
+			limit = parse_portmask(optarg);
+			rte_empty_poll_set_freq(HGH, limit);
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1278,6 +1413,12 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1609,6 +1750,41 @@ static int check_ptype(uint16_t portid)
 
 }
 
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	rte_empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -1780,14 +1956,15 @@ main(int argc, char **argv)
 				"Library initialization failed on core %u\n", lcore_id);
 
 		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
-		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
+		printf("\nInitializing rx queues on lcore %u ...\n", lcore_id);
 		fflush(stdout);
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
@@ -1856,12 +2033,28 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true)
+		rte_empty_poll_stat_init();
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_empty_poll_stat_free();
+
 	return 0;
 }
-- 
2.7.5


* [dpdk-dev] [PATCH v2 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-08  9:57 [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-06-08  9:57 ` [dpdk-dev] [PATCH v1 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
@ 2018-06-08 15:26 ` Liang Ma
  2018-06-08 15:26   ` [dpdk-dev] [PATCH v2 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
  2018-06-14 10:59 ` [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Hunt, David
  2 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-06-08 15:26 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, radu.nicolau, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core is matters
for the following reasons:

   * Without it, there is no indication of overload conditions.

   * Users do not know how much real load is on a system, which results in
     wasted energy because no power management is utilized.

Tried and failed schemes include calculating the cycles required from
the load on the core, in other words the busyness: for example, measuring
how many cycles it costs to handle each packet and determining the frequency
cost per core. Because traffic, frame types and per-packet cycle costs all
vary, this mechanism quickly becomes complex, whereas a simple scheme is
required to solve the problems.

2. Proposed solution

For all polling mechanisms, the proposed solution focuses on how many times
an empty poll is executed instead of calculating how many cycles it costs to
handle each packet. A low empty poll count means the current core is busy
processing its workload, so a higher frequency is needed. A high empty poll
count indicates the current core has lots of spare time, so the frequency
can be lowered.

2.1 Power state definition:

	LOW:  the frequency is used for purge mode.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a. Initialization/Training phase. No traffic passes through;
	  the system measures the average empty poll count in each of the
	  LOW/MED/HIGH power states. Those averages become the baseline
	  for the normal phase. The system collects every core's counters
	  every 100ms. The training phase takes 5 seconds.

	b. Normal phase. When real traffic passes through, the system
	  compares the run-time moving average of empty polls with the
	  baseline, then decides whether to move to the HIGH power state
	  or the MED power state. The system collects every core's
	  counters every 100ms.
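
The two phases can be modelled as follows. This is a simplified sketch
under stated assumptions, not the patch code: struct ep_model and the
ep_model_* functions are hypothetical names, one sample is assumed per
100ms interval, and the 0.05% margin added to the baseline follows the
comment in rte_empty_poll.c.

```c
#include <stdint.h>

#define TRAIN_SAMPLES 50   /* 5 s of training at one sample per 100 ms */
#define AVG_BINS 4         /* window of the run-time moving average */

/* Hypothetical per-core state; the field names are illustrative only. */
struct ep_model {
	uint64_t train_sum;      /* empty polls accumulated during training */
	uint32_t train_count;    /* training samples taken so far */
	uint64_t baseline;       /* average empty polls per idle interval */
	uint64_t bins[AVG_BINS]; /* most recent run-time samples */
	uint32_t head;           /* next bin to overwrite */
	uint32_t filled;         /* number of valid bins, at most AVG_BINS */
};

/* Training phase: feed one idle sample; returns 1 once the baseline is set. */
int
ep_model_train(struct ep_model *m, uint64_t empty_polls)
{
	if (m->train_count < TRAIN_SAMPLES) {
		m->train_sum += empty_polls;
		m->train_count++;
	}
	if (m->train_count < TRAIN_SAMPLES)
		return 0;

	m->baseline = m->train_sum / TRAIN_SAMPLES;
	/* Add 0.05% so an idle core is not misread as slightly busy. */
	m->baseline += m->baseline / 2000;
	return 1;
}

/* Normal phase: record one sample and return the moving average. */
uint64_t
ep_model_sample(struct ep_model *m, uint64_t empty_polls)
{
	uint64_t sum = 0;
	uint32_t i;

	m->bins[m->head] = empty_polls;
	m->head = (m->head + 1) % AVG_BINS;
	if (m->filled < AVG_BINS)
		m->filled++;

	for (i = 0; i < m->filled; i++)
		sum += m->bins[i];
	return sum / m->filled;
}
```

Unlike the patch, this sketch divides by the number of bins actually
filled, so the average is not biased low during the first few intervals
after a state change.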

3. Proposed API

1.  rte_empty_poll_stat_init(void);
which is used to initialize the power management system.

2.  rte_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
which allows the user to customize the frequency of a power state.

8.  rte_empty_poll_setup_timer(void);
which is used to set up the timer/callback that processes all of the above counters.
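
A worker loop is expected to call one of the two update functions after
every receive burst. The sketch below only models that call pattern: the
demo_* functions and DEMO_MAX_LCORE are hypothetical stand-ins that mirror
the proposed signatures so the example builds without DPDK; they are not
the library implementation.

```c
#include <stdint.h>

#define DEMO_MAX_LCORE 8   /* hypothetical table size for this sketch */

/* Stand-in counters; in the patch these live inside the power library. */
struct demo_poll_stats {
	uint64_t empty_polls;  /* bursts that returned no packets */
	uint64_t rx_pkts;      /* packets returned by non-empty bursts */
};

static struct demo_poll_stats demo_stats[DEMO_MAX_LCORE];

/* Mirrors rte_empty_poll_stat_update(): count one empty poll. */
int
demo_empty_poll_stat_update(unsigned int lcore_id)
{
	if (lcore_id >= DEMO_MAX_LCORE)
		return -1;
	demo_stats[lcore_id].empty_polls++;
	return 0;
}

/* Mirrors rte_poll_stat_update(): count the packets of a valid poll. */
int
demo_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
{
	if (lcore_id >= DEMO_MAX_LCORE)
		return -1;
	demo_stats[lcore_id].rx_pkts += nb_pkt;
	return 0;
}

/* Mirrors rte_empty_poll_stat_fetch(): read the empty poll counter. */
uint64_t
demo_empty_poll_stat_fetch(unsigned int lcore_id)
{
	return lcore_id < DEMO_MAX_LCORE ? demo_stats[lcore_id].empty_polls : 0;
}

/* Mirrors rte_poll_stat_fetch(): read the valid poll packet counter. */
uint64_t
demo_poll_stat_fetch(unsigned int lcore_id)
{
	return lcore_id < DEMO_MAX_LCORE ? demo_stats[lcore_id].rx_pkts : 0;
}

/* One step of a worker loop: nb_rx is what the receive burst returned. */
int
demo_account_burst(unsigned int lcore_id, uint8_t nb_rx)
{
	if (nb_rx == 0)
		return demo_empty_poll_stat_update(lcore_id);
	return demo_poll_stat_update(lcore_id, nb_rx);
}
```

Because the counters are plain increments without atomics, each lcore
should only update its own entry, which matches the "not thread safe"
note on the update functions.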

ChangeLog:
v2: fix some coding style issues

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 lib/librte_power/Makefile         |   3 +-
 lib/librte_power/meson.build      |   5 +-
 lib/librte_power/rte_empty_poll.c | 521 ++++++++++++++++++++++++++++++++++++++
 lib/librte_power/rte_empty_poll.h | 135 ++++++++++
 4 files changed, 661 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_empty_poll.c
 create mode 100644 lib/librte_power/rte_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..dbc175a 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -16,8 +16,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..5270fa3 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_empty_poll.c')
+headers = files('rte_power.h','rte_empty_poll.h')
diff --git a/lib/librte_power/rte_empty_poll.c b/lib/librte_power/rte_empty_poll.c
new file mode 100644
index 0000000..8264a01
--- /dev/null
+++ b/lib/librte_power/rte_empty_poll.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 10     /* (100ms) */
+#define SECONDS_TO_TRAIN_FOR  5
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	snprintf(pfi_str, sizeof(pfi_str), "%02u", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
+		return 1000UL; /* Value to make us fail */
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, HGH_BUSY);
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_trainning(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+static void
+empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_trainning(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int
+rte_empty_poll_stat_init(void)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	freq_index[LOW] = 14;
+	freq_index[MED] = 9;
+	freq_index[HGH] = 1;
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void
+rte_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int
+rte_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int
+rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t
+rte_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t
+rte_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
+
+void
+rte_empty_poll_set_freq(enum freq_val index, uint32_t limit)
+{
+	switch (index) {
+
+	case LOW:
+		freq_index[LOW] = limit;
+		break;
+
+	case MED:
+		freq_index[MED] = limit;
+		break;
+
+	case HGH:
+		freq_index[HGH] = limit;
+		break;
+	default:
+		break;
+	}
+}
+
+void
+rte_empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			empty_poll_detection,
+			(void *)ep_ptr);
+
+}
diff --git a/lib/librte_power/rte_empty_poll.h b/lib/librte_power/rte_empty_poll.h
new file mode 100644
index 0000000..7e036ee
--- /dev/null
+++ b/lib/librte_power/rte_empty_poll.h
@@ -0,0 +1,135 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* has to be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         31 /* any reasonable prime number should work */
+
+
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* queue stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+struct ep_params {
+
+	/* timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+	struct rte_timer timer0;
+
+	struct stats_data wrk_data;
+};
+
+
+int rte_empty_poll_stat_init(void);
+
+void rte_empty_poll_stat_free(void);
+
+int rte_empty_poll_stat_update(unsigned int lcore_id);
+
+int rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+uint64_t rte_empty_poll_stat_fetch(unsigned int lcore_id);
+
+uint64_t rte_poll_stat_fetch(unsigned int lcore_id);
+
+void rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
+
+void rte_empty_poll_setup_timer(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
-- 
2.7.5

* [dpdk-dev] [PATCH v2 2/2] examples/l3fwd-power: simple app update to support new API
  2018-06-08 15:26 ` [dpdk-dev] [PATCH v2 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
@ 2018-06-08 15:26   ` Liang Ma
  2018-06-20 14:44     ` [dpdk-dev] [PATCH v3 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
  0 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-06-08 15:26 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, radu.nicolau, Liang Ma

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll

ChangeLog:
v2: fix some coding style issues

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 examples/l3fwd-power/main.c | 228 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 210 insertions(+), 18 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 596d645..21f2aee 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_empty_poll.h>
 
 #define RTE_LOGTYPE_L3FWD_POWER RTE_LOGTYPE_USER1
 
@@ -129,6 +130,9 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* empty-poll is disabled by default. */
+static bool empty_poll_on;
+volatile bool empty_poll_stop;
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -336,6 +340,10 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -344,7 +352,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -353,20 +369,23 @@ signal_exit_now(int sigtype)
 			ret = rte_power_exit(lcore_id);
 			if (ret)
 				rte_exit(EXIT_FAILURE, "Power management "
-					"library de-initialization failed on "
-							"core%u\n", lcore_id);
+						"library de-initialization failed on "
+						"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -831,6 +850,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 
 /* main processing loop */
 static int
@@ -1128,7 +1248,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1231,10 +1352,12 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
+		{"empty-poll", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
@@ -1259,7 +1382,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_portmask(optarg);
+			rte_empty_poll_set_freq(LOW, limit);
+			break;
+		case 'm':
+			limit = parse_portmask(optarg);
+			rte_empty_poll_set_freq(MED, limit);
+			break;
+		case 'h':
+			limit = parse_portmask(optarg);
+			rte_empty_poll_set_freq(HGH, limit);
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1278,6 +1412,12 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1609,6 +1749,41 @@ static int check_ptype(uint16_t portid)
 
 }
 
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	rte_empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -1780,14 +1955,15 @@ main(int argc, char **argv)
 				"Library initialization failed on core %u\n", lcore_id);
 
 		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
-		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
+		printf("\nInitializing rx queues on lcore %u ...\n", lcore_id);
 		fflush(stdout);
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
@@ -1856,12 +2032,28 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true)
+		rte_empty_poll_stat_init();
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_empty_poll_stat_free();
+
 	return 0;
 }
-- 
2.7.5

* Re: [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-08  9:57 [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-06-08  9:57 ` [dpdk-dev] [PATCH v1 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
  2018-06-08 15:26 ` [dpdk-dev] [PATCH v2 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
@ 2018-06-14 10:59 ` Hunt, David
  2018-06-18 16:11   ` Liang, Ma
  2 siblings, 1 reply; 79+ messages in thread
From: Hunt, David @ 2018-06-14 10:59 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, radu.nicolau

Hi Liang



On 8/6/2018 10:57 AM, Liang Ma wrote:
> 1. Abstract
>
> For packet processing workloads such as DPDK polling is continuous.
> This means CPU cores always show 100% busy independent of how much work
> those cores are doing. It is critical to accurately determine how busy
> a core is hugely important for the following reasons:
>
>     * No indication of overload conditions
>
>     * User do not know how much real load is on a system meaning resulted in
>       wasted energy as no power management is utilized
>
> Tried and failed schemes include calculating the cycles required from
> the load on the core, in other words the business. For example,

Typo, I think this should be busyness. :)

> how many cycles it costs to handle each packet and determining the frequency
> cost per core. Due to the varying nature of traffic, types of frames and cost
> in cycles to process, this mechanism becomes complex quickly where
> a simple scheme is required to solve the problems.
>
> 2. Proposed solution
>
> For all polling mechanism, the proposed solution focus on how many times
> empty poll executed instead of calculating how many cycles it cost to handle
> each packet. The less empty poll number means current core is busy with
> processing workload, therefore,  the higher frequency is needed. The high
> empty poll number indicate current core has lots spare time, therefore,
> we can lower the frequency.
>
> 2.1 Power state definition:
>
> 	LOW:  the frequency is used for purge mode.
>
> 	MED:  the frequency is used to process modest traffic workload.
>
> 	HIGH: the frequency is used to process busy traffic workload.
>
> 2.2 There are two phases to establish the power management system:
>
> 	a.Initialization/Training phase. There is no traffic pass-through,
> 	  the system will test average empty poll numbers  with LOW/MED/HIGH
> 	  power state. Those average empty poll numbers  will be the baseline
> 	  for the normal phase. The system will collect all core's counter
> 	  every 100ms. The Training phase will take 5 seconds.

I suggest that you add in that all Rx packets are blocked during this 
phase, and we measure the number of empty polls possible for each of the 
frequencies (low, medium and high).

> 	b.Normal phase. When the real traffic pass-though, the system will
> 	  compare run-time empty poll moving average value with base line
> 	  then make decision to move to HIGH power state of MED  power state.
> 	  The system will collect all core's counter every 100ms.

I think this may need to be reduced from 100ms. The reaction to an 
increase in traffic would need
to be quicker than this to avoid buffer overflow.
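
A rough back-of-the-envelope check of that concern, with every figure an
assumption chosen only for illustration (a 1024-descriptor receive ring,
1,000,000 packets/s arriving, 500,000 packets/s drained at the current
frequency). usecs_until_overflow() is a hypothetical helper, not part of
the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: microseconds until a receive ring of ring_size
 * descriptors overflows when packets arrive faster than they are drained.
 * Returns UINT64_MAX when the core keeps up with the arrival rate.
 */
uint64_t
usecs_until_overflow(uint64_t ring_size, uint64_t arrival_pps,
		uint64_t service_pps)
{
	if (arrival_pps <= service_pps)
		return UINT64_MAX;

	/* Backlog grows at (arrival - service) packets per second. */
	return ring_size * 1000000ULL / (arrival_pps - service_pps);
}
```

Under those assumed figures the ring fills in about 2 ms, well inside a
single 100ms sampling interval, so packets would already be dropped before
the next counter collection.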

>
> 3. Proposed  API
>
> 1.  rte_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>   
> 2.  rte_empty_poll_stat_free(void);
> which is used to free the resource hold by power management system.
>   
> 3.  rte_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>   
> 4.  rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>   
> 5.  rte_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>   
> 6.  rte_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
>
> 7.  rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> which allow user customize the frequency of power state.
>
> 8.  rte_empty_poll_setup_timer(void);
> which is used to setup the timer/callback to process all above counter.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>   lib/librte_power/Makefile         |   3 +-
>   lib/librte_power/meson.build      |   5 +-
>   lib/librte_power/rte_empty_poll.c | 529 ++++++++++++++++++++++++++++++++++++++
>   lib/librte_power/rte_empty_poll.h | 135 ++++++++++
>   4 files changed, 669 insertions(+), 3 deletions(-)
>   create mode 100644 lib/librte_power/rte_empty_poll.c
>   create mode 100644 lib/librte_power/rte_empty_poll.h
>
> diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> index 6f85e88..dbc175a 100644
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -16,8 +16,9 @@ LIBABIVER := 1
>   # all source are stored in SRCS-y
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_empty_poll.c
>   
>   # install this header file
> -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_empty_poll.h
>   

I would propose renaming the .h and .c files to
rte_power_empty_poll.[h/c], so we can associate them with the power
library.


>   include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 253173f..5270fa3 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
>   	build = false
>   endif
>   sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> -		'power_kvm_vm.c', 'guest_channel.c')
> -headers = files('rte_power.h')
> +		'power_kvm_vm.c', 'guest_channel.c',
> +		'rte_empty_poll.c')
> +headers = files('rte_power.h','rte_empty_poll.h')
> diff --git a/lib/librte_power/rte_empty_poll.c b/lib/librte_power/rte_empty_poll.c
> new file mode 100644
> index 0000000..57bd63b
> --- /dev/null
> +++ b/lib/librte_power/rte_empty_poll.c
> @@ -0,0 +1,529 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +
> +
> +
> +
> +#include "rte_power.h"
> +#include "rte_empty_poll.h"
> +
> +
> +
> +
> +#define INTERVALS_PER_SECOND 10     /* (100ms) */
> +#define SECONDS_TO_TRAIN_FOR  5
> +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> +#define DEFAULT_CYCLES_PER_PACKET 800
> +
> +
> +static struct ep_params *ep_params;
> +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> +
> +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> +
> +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> +
> +static uint32_t freq_index[NUM_FREQ];
> +
> +
> +
> +static uint32_t
> +get_freq_index(enum freq_val index)
> +{
> +	return freq_index[index];
> +}
> +
> +
> +static int
> +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> +{
> +	int err = 0;
> +	uint32_t power_freq_index;
> +	if (!specific_freq)
> +		power_freq_index = get_freq_index(freq);
> +	else
> +		power_freq_index = freq;
> +
> +	err = rte_power_set_freq(lcore_id, power_freq_index);
> +
> +	return err;
> +}
> +
> +
> +static inline void __attribute__((always_inline))
> +exit_training_state(struct priority_worker *poll_stats)
> +{
> +	RTE_SET_USED(poll_stats);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_training_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->cur_freq = LOW;
> +	poll_stats->queue_state = TRAINING;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_normal_state(struct priority_worker *poll_stats)
> +{
> +	/* Clear the averages arrays and strs */
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = MED;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = MED_NORMAL;
> +	set_power_freq(poll_stats->lcore_id, MED, false);
> +
> +	/* Try here */
> +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_busy_state(struct priority_worker *poll_stats)
> +{
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = HGH;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = HGH_BUSY;
> +	set_power_freq(poll_stats->lcore_id, HGH, false);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_purge_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->queue_state = LOW_PURGE;
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_state(struct priority_worker *poll_stats,
> +		enum queue_state new_state)
> +{
> +	enum queue_state old_state = poll_stats->queue_state;
> +	if (old_state != new_state) {
> +
> +		/* Call any old state exit functions */
> +		if (old_state == TRAINING)
> +			exit_training_state(poll_stats);
> +
> +		/* Call any new state entry functions */
> +		if (new_state == TRAINING)
> +			enter_training_state(poll_stats);
> +		if (new_state == MED_NORMAL)
> +			enter_normal_state(poll_stats);
> +		if (new_state == HGH_BUSY)
> +			enter_busy_state(poll_stats);
> +		if (new_state == LOW_PURGE)
> +			enter_purge_state(poll_stats);
> +	}
> +}
> +
> +
> +static void
> +update_training_stats(struct priority_worker *poll_stats,
> +		uint32_t freq,
> +		bool specific_freq,
> +		uint32_t max_train_iter)
> +{
> +	RTE_SET_USED(specific_freq);
> +
> +	char pfi_str[32];
> +	uint64_t p0_empty_deq;
> +
> +	sprintf(pfi_str, "%02d", freq);
> +
> +	if (poll_stats->cur_freq == freq &&
> +			poll_stats->thresh[freq].trained == false) {
> +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> +
> +			set_power_freq(poll_stats->lcore_id,
> +					freq, specific_freq);
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +			return;
> +		} else if (poll_stats->thresh[freq].cur_train_iter
> +				<= max_train_iter) {
> +
> +			p0_empty_deq = poll_stats->empty_dequeues -
> +				poll_stats->empty_dequeues_prev;
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +		} else {
> +			if (poll_stats->thresh[freq].trained == false) {
> +				poll_stats->thresh[freq].base_edpi =
> +					poll_stats->thresh[freq].base_edpi /
> +					max_train_iter;
> +
> +				/* Add on a factor of 0.05%, this should remove any */
> +				/* false negatives when the system is 0% busy */
> +				poll_stats->thresh[freq].base_edpi +=
> +					poll_stats->thresh[freq].base_edpi / 2000;

Please use #define for magic numbers
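
Something along these lines would do, as a minimal sketch. The macro and
helper names are only placeholders, not existing symbols:

```c
#include <stdint.h>

/* Hypothetical name: the baseline is raised by 1/2000, i.e. 0.05%. */
#define BASE_EDPI_MARGIN_DIVISOR 2000

/* Return the trained baseline with the safety margin added. */
static uint64_t
add_baseline_margin(uint64_t base_edpi)
{
	return base_edpi + base_edpi / BASE_EDPI_MARGIN_DIVISOR;
}
```

Note that with integer division the margin is zero for any baseline below
2000 empty polls per interval.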

> +
> +				poll_stats->thresh[freq].trained = true;
> +				poll_stats->cur_freq++;
> +
> +			}
> +		}
> +	}
> +}
> +
> +static inline uint32_t __attribute__((always_inline))
> +update_stats(struct priority_worker *poll_stats)
> +{
> +	uint64_t tot_edpi = 0, tot_ppi = 0;
> +	uint32_t j, percent;
> +
> +	struct priority_worker *s = poll_stats;
> +
> +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> +
> +	s->empty_dequeues_prev = s->empty_dequeues;
> +
> +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> +
> +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> +
> +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
> +		return 1000UL; /* Value to make us fail */
> +
> +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> +
> +	for (j = 0; j < BINS_AV; j++) {
> +		tot_edpi += s->edpi_av[j];
> +		tot_ppi += s->ppi_av[j];
> +	}
> +
> +	tot_edpi = tot_edpi / BINS_AV;
> +
> +	percent = 100 - (uint32_t)((float)tot_edpi /
> +			(float)s->thresh[s->cur_freq].base_edpi * 100);
> +
> +	return (uint32_t)percent;
> +}
> +
> +
> +static inline void  __attribute__((always_inline))
> +update_stats_normal(struct priority_worker *poll_stats)
> +{
> +	uint32_t percent;
> +
> +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> +		return;
> +
> +	percent = update_stats(poll_stats);
> +
> +	if (percent > 100)
> +		return;
> +
> +	if (poll_stats->cur_freq == LOW)
> +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> +	else if (poll_stats->cur_freq == MED) {
> +
> +		if (percent > poll_stats->thresh[poll_stats->cur_freq].
> +				threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, HGH_BUSY);
> +
> +		} else {
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	} else if (poll_stats->cur_freq == HGH) {
> +
> +		if (percent < poll_stats->thresh[poll_stats->cur_freq].
> +				threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, MED_NORMAL);
> +		} else
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +
> +
> +	}
> +}
> +
> +static int
> +empty_poll_trainning(struct priority_worker *poll_stats,
> +		uint32_t max_train_iter) {
> +
> +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> +		poll_stats->iter_counter++;
> +		return 0;
> +	}
> +
> +
> +	update_training_stats(poll_stats,
> +			LOW,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			MED,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			HGH,
> +			false,
> +			max_train_iter);
> +
> +
> +	if (poll_stats->thresh[LOW].trained == true
> +			&& poll_stats->thresh[MED].trained == true
> +			&& poll_stats->thresh[HGH].trained == true) {
> +
> +		set_state(poll_stats, MED_NORMAL);
> +
> +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> +				poll_stats->lcore_id);
> +	}
> +
> +	return 0;
> +}
> +
> +static void
> +empty_poll_detection(struct rte_timer *tim,
> +		void *arg)
> +{
> +
> +	uint32_t i;
> +
> +	struct priority_worker *poll_stats;
> +
> +	RTE_SET_USED(tim);
> +
> +	RTE_SET_USED(arg);
> +
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> +
> +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> +			continue;
> +
> +		switch (poll_stats->queue_state) {
> +		case(TRAINING):
> +			empty_poll_trainning(poll_stats,
> +					ep_params->max_train_iter);
> +			break;
> +
> +		case(HGH_BUSY):
> +		case(MED_NORMAL):
> +			update_stats_normal(poll_stats);
> +
> +			break;
> +
> +		case(LOW_PURGE):
> +			break;
> +		default:
> +			break;
> +
> +		}
> +
> +	}
> +
> +}
> +
> +int
> +rte_empty_poll_stat_init(void)
> +{
> +	uint32_t i;
> +	/* Allocate the ep_params structure */
> +	ep_params = rte_zmalloc_socket(NULL,
> +			sizeof(struct ep_params),
> +			0,
> +			rte_socket_id());
> +
> +	if (!ep_params)
> +		rte_panic("Cannot allocate heap memory for ep_params "
> +				"for socket %d\n", rte_socket_id());
> +
> +	freq_index[LOW] = 14;
> +	freq_index[MED] = 9;
> +	freq_index[HGH] = 1;
> +

Are these default frequency indexes? Can they be changed? Maybe say so 
in a comment.
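
For example, a sketch of what I have in mind. The macro names and the
helper are hypothetical, and I am assuming that valid rte_power frequency
indexes run from 0 to nb_freqs - 1, in which case the "if (LOW index >
total)" check below probably wants to be ">=":

```c
#include <stdint.h>

/* Hypothetical defaults; can be overridden via rte_empty_poll_set_freq(). */
#define DEFAULT_FREQ_IDX_LOW  14
#define DEFAULT_FREQ_IDX_MED   9
#define DEFAULT_FREQ_IDX_HIGH  1

/*
 * Return 0 if all three indexes fit a core reporting nb_freqs
 * frequencies (valid indexes assumed to be 0 .. nb_freqs - 1),
 * -1 otherwise.
 */
static int
check_freq_indexes(uint32_t low, uint32_t med, uint32_t high,
		uint32_t nb_freqs)
{
	if (low >= nb_freqs || med >= nb_freqs || high >= nb_freqs)
		return -1;
	return 0;
}
```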

> +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> +
> +	/* 5 seconds worth of training */
> +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> +
> +	struct stats_data *w = &ep_params->wrk_data;
> +
> +	/* initialize all wrk_stats state */
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		if (rte_lcore_is_enabled(i) == 0)
> +			continue;
> +
> +		set_state(&w->wrk_stats[i], TRAINING);
> +		/*init the freqs table */
> +		total_avail_freqs[i] = rte_power_freqs(i,
> +				avail_freqs[i],
> +				NUM_FREQS);
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +	}
> +
> +
> +	return 0;
> +}
> +
> +void
> +rte_empty_poll_stat_free(void)
> +{
> +
> +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> +
> +	if (ep_params != NULL)
> +		rte_free(ep_params);
> +}
> +
> +int
> +rte_empty_poll_stat_update(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->empty_dequeues++;
> +
> +	return 0;
> +}
> +
> +int
> +rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> +{
> +
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->num_dequeue_pkts += nb_pkt;
> +
> +	return 0;
> +}
> +
> +
> +uint64_t
> +rte_empty_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->empty_dequeues;
> +}
> +
> +uint64_t
> +rte_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->num_dequeue_pkts;
> +}
> +
> +void
> +rte_empty_poll_set_freq(enum freq_val index, uint32_t limit)
> +{
> +	switch (index) {
> +
> +	case LOW:
> +		freq_index[LOW] = limit;
> +		break;
> +
> +	case MED:
> +		freq_index[MED] = limit;
> +		break;
> +
> +	case HGH:
> +		freq_index[HGH] = limit;
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
> +void
> +rte_empty_poll_setup_timer(void)
> +{
> +	int lcore_id = rte_lcore_id();
> +	uint64_t hz = rte_get_timer_hz();
> +
> +	struct  ep_params *ep_ptr = ep_params;
> +
> +	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
> +
> +	rte_timer_reset_sync(&ep_ptr->timer0,
> +			ep_ptr->interval_ticks,
> +			PERIODICAL,
> +			lcore_id,
> +			empty_poll_detection,
> +			(void *)ep_ptr);
> +
> +}
> diff --git a/lib/librte_power/rte_empty_poll.h b/lib/librte_power/rte_empty_poll.h
> new file mode 100644
> index 0000000..7e036ee
> --- /dev/null
> +++ b/lib/librte_power/rte_empty_poll.h
> @@ -0,0 +1,135 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_EMPTY_POLL_H
> +#define _RTE_EMPTY_POLL_H
> +
> +/**
> + * @file
> + * RTE Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +#include <sys/queue.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_string_fns.h>
> +#include <rte_power.h>
> +#include <rte_timer.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define NUM_FREQS 20

Should probably be RTE_MAX_LCORE_FREQS

> +
> +#define BINS_AV 4 /* has to be ^2 */
> +
> +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> +
> +#define NUM_PRIORITIES          2
> +
> +#define NUM_NODES         31 /*any reseanable prime number should work*/
> +
> +
> +enum freq_val {
> +	LOW,
> +	MED,
> +	HGH,

HGH should be HIGH. Where we see this on its own in the code, it's not
obvious that it's related to MEDIUM and LOW. Also, maybe MED should be
MEDIUM.

Would suggest prefixing these with FREQ_. Makes the code more readable.

> +	NUM_FREQ = NUM_FREQS

This should probably be the number of elements in freq_val rather than 
NUM_FREQS

I think there is some confusion in the code about the number of 
frequencies. Any array where it's holding all the possible frequencies 
should use RTE_MAX_LCORE_FREQS.
But it looks to me that the freq_index array only holds three values,
the indexes for Low, Medium and High.
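
A minimal sketch of the layout I mean, with the enum sized by its own last
element. The FREQ_ spellings are only my suggestion, and the index values
are the ones from rte_empty_poll_stat_init():

```c
#include <stdint.h>

/* Suggested spelling only; the patch uses LOW/MED/HGH. */
enum freq_val {
	FREQ_LOW,
	FREQ_MEDIUM,
	FREQ_HIGH,
	NUM_FREQ	/* number of power states, not of CPU frequencies */
};

/* One rte_power frequency index per power state. */
static uint32_t freq_index[NUM_FREQ] = {
	[FREQ_LOW] = 14,
	[FREQ_MEDIUM] = 9,
	[FREQ_HIGH] = 1,
};

static uint32_t
get_freq_index(enum freq_val state)
{
	return freq_index[state];
}
```

This keeps freq_index and the per-worker thresh array at three entries,
and leaves RTE_MAX_LCORE_FREQS for the avail_freqs table only.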


> +};
> +
> +
> +enum queue_state {
> +	TRAINING, /* NO TRAFFIC */
> +	MED_NORMAL,   /* MED */
> +	HGH_BUSY,     /* HIGH */
> +	LOW_PURGE,    /* LOW */
> +};

I'm not sure that the NORMAL, BUSY and PURGE words add any value here. 
How about MEDIUM, HIGH and LOW?

> +
> +/* queue stats */
> +struct freq_threshold {
> +
> +	uint64_t base_edpi;
> +	bool trained;
> +	uint32_t threshold_percent;
> +	uint32_t cur_train_iter;
> +};
> +
> +
> +struct priority_worker {
> +
> +	/* Current dequeue and throughput counts */
> +	/* These 2 are written to by the worker threads */
> +	/* So keep them on their own cache line */
> +	uint64_t empty_dequeues;
> +	uint64_t num_dequeue_pkts;
> +
> +	enum queue_state queue_state;
> +
> +	uint64_t empty_dequeues_prev;
> +	uint64_t num_dequeue_pkts_prev;
> +
> +	/* Used for training only */
> +	struct freq_threshold thresh[NUM_FREQ];
> +	enum freq_val cur_freq;
> +
> +	/* bucket arrays to calculate the averages */
> +	uint64_t edpi_av[BINS_AV];
> +	uint32_t  ec;
> +	uint64_t ppi_av[BINS_AV];
> +	uint32_t  pc;

Suggest a comment per line explaining what the names of these variables 
mean. edpi, ppi, etc.
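
As an illustration, here is how I read them from update_stats(). The
expansions of edpi and ppi are my guess and should be confirmed, and
struct poll_avg and edpi_push() are made up for the sketch:

```c
#include <stdint.h>

#define BINS_AV 4	/* number of samples in the moving average */

struct poll_avg {
	uint64_t edpi_av[BINS_AV];	/* empty dequeues per interval, last BINS_AV samples */
	uint32_t ec;			/* running count of edpi samples, selects the bin */
	uint64_t ppi_av[BINS_AV];	/* packets per interval, last BINS_AV samples */
	uint32_t pc;			/* running count of ppi samples, selects the bin */
};

/* Store one sample and return the mean over all edpi bins. */
static uint64_t
edpi_push(struct poll_avg *a, uint64_t cur_edpi)
{
	uint64_t tot = 0;
	uint32_t j;

	a->edpi_av[a->ec++ % BINS_AV] = cur_edpi;
	for (j = 0; j < BINS_AV; j++)
		tot += a->edpi_av[j];
	return tot / BINS_AV;
}
```

Since the bins start at zero, the mean is low for the first BINS_AV
samples after a state change, which may be worth a comment as well.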


> +
> +	uint32_t lcore_id;
> +	uint32_t iter_counter;
> +	uint32_t threshold_ctr;
> +	uint32_t display_ctr;
> +	uint8_t  dev_id;
> +
> +} __rte_cache_aligned;
> +
> +
> +struct stats_data {
> +
> +	struct priority_worker wrk_stats[NUM_NODES];
> +
> +	/* flag to stop rx threads processing packets until training over */
> +	bool start_rx;
> +
> +};
> +
> +struct ep_params {
> +
> +	/* timer related stuff */
> +	uint64_t interval_ticks;
> +	uint32_t max_train_iter;
> +	struct rte_timer timer0;
> +
> +	struct stats_data wrk_data;
> +};
> +
> +
> +int rte_empty_poll_stat_init(void);
> +
> +void rte_empty_poll_stat_free(void);
> +
> +int rte_empty_poll_stat_update(unsigned int lcore_id);
> +
> +int rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> +
> +uint64_t rte_empty_poll_stat_fetch(unsigned int lcore_id);
> +
> +uint64_t rte_poll_stat_fetch(unsigned int lcore_id);
> +
> +void rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> +
> +void rte_empty_poll_setup_timer(void);

All the function prototypes need documentation. Please see rte_power.h.
And please make sure that Doxygen documentation generates correctly
from the comments.
Suggest prefixing the functions with rte_power_
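
A sketch of the comment style, following the layout used in rte_power.h.
The rte_power_ prefixed name is only the suggested spelling, and the body
is a stand-alone stand-in for the real one so that the example compiles
on its own:

```c
#include <stdint.h>

#define NUM_NODES 31	/* taken from the patch */

/* Stand-in for the per-lcore counters kept in ep_params. */
static uint64_t empty_dequeues[NUM_NODES];

/**
 * Count one empty poll for the given lcore.
 *
 * Not thread safe; meant to be called by the polling lcore itself.
 *
 * @param lcore_id
 *  lcore id, must be lower than NUM_NODES.
 *
 * @return
 *  - 0 on success.
 *  - -1 if lcore_id is out of range.
 */
static int
rte_power_empty_poll_stat_update(unsigned int lcore_id)
{
	if (lcore_id >= NUM_NODES)
		return -1;
	empty_dequeues[lcore_id]++;
	return 0;
}
```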

> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif

A few more comments throughout the code would be good.


* Re: [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-14 10:59 ` [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Hunt, David
@ 2018-06-18 16:11   ` Liang, Ma
  0 siblings, 0 replies; 79+ messages in thread
From: Liang, Ma @ 2018-06-18 16:11 UTC (permalink / raw)
  To: Hunt, David; +Cc: dev, radu.nicolau

On 14 Jun 11:59, Hunt, David wrote:
> Hi Liang
> 
> 
> 
> On 8/6/2018 10:57 AM, Liang Ma wrote:
> > 1. Abstract
> > 
> > For packet processing workloads such as DPDK polling is continuous.
> > This means CPU cores always show 100% busy independent of how much work
> > those cores are doing. It is critical to accurately determine how busy
> > a core is hugely important for the following reasons:
> > 
> >     * No indication of overload conditions
> > 
> >     * User do not know how much real load is on a system meaning resulted in
> >       wasted energy as no power management is utilized
> > 
> > Tried and failed schemes include calculating the cycles required from
> > the load on the core, in other words the business. For example,
> 
> Typo, I think this should be busyness. :)
Agreed, I will update it in the next patch.
> 
> > how many cycles it costs to handle each packet and determining the frequency
> > cost per core. Due to the varying nature of traffic, types of frames and cost
> > in cycles to process, this mechanism becomes complex quickly where
> > a simple scheme is required to solve the problems.
> > 
> > 2. Proposed solution
> > 
> > For all polling mechanism, the proposed solution focus on how many times
> > empty poll executed instead of calculating how many cycles it cost to handle
> > each packet. The less empty poll number means current core is busy with
> > processing workload, therefore,  the higher frequency is needed. The high
> > empty poll number indicate current core has lots spare time, therefore,
> > we can lower the frequency.
> > 
> > 2.1 Power state definition:
> > 
> > 	LOW:  the frequency is used for purge mode.
> > 
> > 	MED:  the frequency is used to process modest traffic workload.
> > 
> > 	HIGH: the frequency is used to process busy traffic workload.
> > 
> > 2.2 There are two phases to establish the power management system:
> > 
> > 	a.Initialization/Training phase. There is no traffic pass-through,
> > 	  the system will test average empty poll numbers  with LOW/MED/HIGH
> > 	  power state. Those average empty poll numbers  will be the baseline
> > 	  for the normal phase. The system will collect all core's counter
> > 	  every 100ms. The Training phase will take 5 seconds.
> 
> I suggest that you add in that all Rx packets are blocked during this phase,
> and we measure the number of empty polls possible for each of the
> frequencies (low, medium and high).
I don't think Rx packets are blocked; currently there is simply no traffic passing through during the training phase.
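To illustrate why the baseline has to be taken with no traffic, here is a
rough sketch of how the trained value is used afterwards. It is a
simplified integer version of update_stats(), not the patch code: the
patch returns 1000 when a sample exceeds the baseline, while this
hypothetical helper returns 0 in that case.

```c
#include <stdint.h>

/*
 * Busyness in percent: 0 when the core sees as many empty polls as the
 * no-traffic baseline, 100 when it sees none. Returns 0 if there is no
 * baseline yet or the sample is at or above it.
 */
static uint32_t
busy_percent(uint64_t avg_edpi, uint64_t base_edpi)
{
	if (base_edpi == 0 || avg_edpi >= base_edpi)
		return 0;
	return 100 - (uint32_t)(avg_edpi * 100 / base_edpi);
}
```

Any packets arriving during training would lower the baseline, and every
later busyness value would then come out lower than it should.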
> 
> > 	b.Normal phase. When real traffic passes through, the system will
> > 	  compare the run-time moving average of empty polls with the baseline,
> > 	  then decide whether to move to the HIGH or the MED power state.
> > 	  The system will collect all cores' counters every 100ms.
> 
> I think this may need to be reduced from 100ms. The reaction to an increase
> in traffic would need
> to be quicker than this to avoid buffer overflow.
> 
Agreed, I will reduce it to 10ms in the next patch.
> > 
> > 3. Proposed  API
> > 
> > 1.  rte_empty_poll_stat_init(void);
> > which is used to initialize the power management system.
> > 2.  rte_empty_poll_stat_free(void);
> > which is used to free the resource hold by power management system.
> > 3.  rte_empty_poll_stat_update(unsigned int lcore_id);
> > which is used to update specific core empty poll counter, not thread safe
> > 4.  rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > which is used to update specific core valid poll counter, not thread safe
> > 5.  rte_empty_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core empty poll counter.
> > 6.  rte_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core valid poll counter.
> > 
> > 7.  rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > which allow user customize the frequency of power state.
> > 
> > 8.  rte_empty_poll_setup_timer(void);
> > which is used to setup the timer/callback to process all above counter.
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > ---
> >   lib/librte_power/Makefile         |   3 +-
> >   lib/librte_power/meson.build      |   5 +-
> >   lib/librte_power/rte_empty_poll.c | 529 ++++++++++++++++++++++++++++++++++++++
> >   lib/librte_power/rte_empty_poll.h | 135 ++++++++++
> >   4 files changed, 669 insertions(+), 3 deletions(-)
> >   create mode 100644 lib/librte_power/rte_empty_poll.c
> >   create mode 100644 lib/librte_power/rte_empty_poll.h
> > 
> > diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> > index 6f85e88..dbc175a 100644
> > --- a/lib/librte_power/Makefile
> > +++ b/lib/librte_power/Makefile
> > @@ -16,8 +16,9 @@ LIBABIVER := 1
> >   # all source are stored in SRCS-y
> >   SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
> >   SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> > +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_empty_poll.c
> >   # install this header file
> > -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> > +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_empty_poll.h
> 
> I would propose renaming the .h and .c files to
> rte_power_empty_poll.[h/c], so we can associate them with the power
> library.
> 
> 
> >   include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> > index 253173f..5270fa3 100644
> > --- a/lib/librte_power/meson.build
> > +++ b/lib/librte_power/meson.build
> > @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
> >   	build = false
> >   endif
> >   sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> > -		'power_kvm_vm.c', 'guest_channel.c')
> > -headers = files('rte_power.h')
> > +		'power_kvm_vm.c', 'guest_channel.c',
> > +		'rte_empty_poll.c')
> > +headers = files('rte_power.h','rte_empty_poll.h')
> > diff --git a/lib/librte_power/rte_empty_poll.c b/lib/librte_power/rte_empty_poll.c
> > new file mode 100644
> > index 0000000..57bd63b
> > --- /dev/null
> > +++ b/lib/librte_power/rte_empty_poll.c
> > @@ -0,0 +1,529 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2018 Intel Corporation
> > + */
> > +
> > +#include <string.h>
> > +
> > +#include <rte_lcore.h>
> > +#include <rte_cycles.h>
> > +#include <rte_atomic.h>
> > +#include <rte_malloc.h>
> > +
> > +
> > +
> > +
> > +#include "rte_power.h"
> > +#include "rte_empty_poll.h"
> > +
> > +
> > +
> > +
> > +#define INTERVALS_PER_SECOND 10     /* (100ms) */
> > +#define SECONDS_TO_TRAIN_FOR  5
> > +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> > +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> > +#define DEFAULT_CYCLES_PER_PACKET 800
> > +
> > +
> > +static struct ep_params *ep_params;
> > +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> > +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> > +
> > +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> > +
> > +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> > +
> > +static uint32_t freq_index[NUM_FREQ];
> > +
> > +
> > +
> > +static uint32_t
> > +get_freq_index(enum freq_val index)
> > +{
> > +	return freq_index[index];
> > +}
> > +
> > +
> > +static int
> > +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> > +{
> > +	int err = 0;
> > +	uint32_t power_freq_index;
> > +	if (!specific_freq)
> > +		power_freq_index = get_freq_index(freq);
> > +	else
> > +		power_freq_index = freq;
> > +
> > +	err = rte_power_set_freq(lcore_id, power_freq_index);
> > +
> > +	return err;
> > +}
> > +
> > +
> > +static inline void __attribute__((always_inline))
> > +exit_training_state(struct priority_worker *poll_stats)
> > +{
> > +	RTE_SET_USED(poll_stats);
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_training_state(struct priority_worker *poll_stats)
> > +{
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->cur_freq = LOW;
> > +	poll_stats->queue_state = TRAINING;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_normal_state(struct priority_worker *poll_stats)
> > +{
> > +	/* Clear the averages arrays and strs */
> > +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> > +	poll_stats->ec = 0;
> > +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> > +	poll_stats->pc = 0;
> > +
> > +	poll_stats->cur_freq = MED;
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->threshold_ctr = 0;
> > +	poll_stats->queue_state = MED_NORMAL;
> > +	set_power_freq(poll_stats->lcore_id, MED, false);
> > +
> > +	/* Try here */
> > +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> > +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_busy_state(struct priority_worker *poll_stats)
> > +{
> > +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> > +	poll_stats->ec = 0;
> > +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> > +	poll_stats->pc = 0;
> > +
> > +	poll_stats->cur_freq = HGH;
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->threshold_ctr = 0;
> > +	poll_stats->queue_state = HGH_BUSY;
> > +	set_power_freq(poll_stats->lcore_id, HGH, false);
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_purge_state(struct priority_worker *poll_stats)
> > +{
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->queue_state = LOW_PURGE;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +set_state(struct priority_worker *poll_stats,
> > +		enum queue_state new_state)
> > +{
> > +	enum queue_state old_state = poll_stats->queue_state;
> > +	if (old_state != new_state) {
> > +
> > +		/* Call any old state exit functions */
> > +		if (old_state == TRAINING)
> > +			exit_training_state(poll_stats);
> > +
> > +		/* Call any new state entry functions */
> > +		if (new_state == TRAINING)
> > +			enter_training_state(poll_stats);
> > +		if (new_state == MED_NORMAL)
> > +			enter_normal_state(poll_stats);
> > +		if (new_state == HGH_BUSY)
> > +			enter_busy_state(poll_stats);
> > +		if (new_state == LOW_PURGE)
> > +			enter_purge_state(poll_stats);
> > +	}
> > +}
> > +
> > +
> > +static void
> > +update_training_stats(struct priority_worker *poll_stats,
> > +		uint32_t freq,
> > +		bool specific_freq,
> > +		uint32_t max_train_iter)
> > +{
> > +	RTE_SET_USED(specific_freq);
> > +
> > +	char pfi_str[32];
> > +	uint64_t p0_empty_deq;
> > +
> > +	sprintf(pfi_str, "%02d", freq);
> > +
> > +	if (poll_stats->cur_freq == freq &&
> > +			poll_stats->thresh[freq].trained == false) {
> > +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> > +
> > +			set_power_freq(poll_stats->lcore_id,
> > +					freq, specific_freq);
> > +
> > +			poll_stats->empty_dequeues_prev =
> > +				poll_stats->empty_dequeues;
> > +
> > +			poll_stats->thresh[freq].cur_train_iter++;
> > +
> > +			return;
> > +		} else if (poll_stats->thresh[freq].cur_train_iter
> > +				<= max_train_iter) {
> > +
> > +			p0_empty_deq = poll_stats->empty_dequeues -
> > +				poll_stats->empty_dequeues_prev;
> > +
> > +			poll_stats->empty_dequeues_prev =
> > +				poll_stats->empty_dequeues;
> > +
> > +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> > +			poll_stats->thresh[freq].cur_train_iter++;
> > +
> > +		} else {
> > +			if (poll_stats->thresh[freq].trained == false) {
> > +				poll_stats->thresh[freq].base_edpi =
> > +					poll_stats->thresh[freq].base_edpi /
> > +					max_train_iter;
> > +
> > +				/* Add on a factor of 0.05%, this should remove any */
> > +				/* false negatives when the system is 0% busy */
> > +				poll_stats->thresh[freq].base_edpi +=
> > +					poll_stats->thresh[freq].base_edpi / 2000;
> 
> Please use #define for magic numbers
> 
> > +
> > +				poll_stats->thresh[freq].trained = true;
> > +				poll_stats->cur_freq++;
> > +
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +static inline uint32_t __attribute__((always_inline))
> > +update_stats(struct priority_worker *poll_stats)
> > +{
> > +	uint64_t tot_edpi = 0, tot_ppi = 0;
> > +	uint32_t j, percent;
> > +
> > +	struct priority_worker *s = poll_stats;
> > +
> > +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> > +
> > +	s->empty_dequeues_prev = s->empty_dequeues;
> > +
> > +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> > +
> > +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> > +
> > +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
> > +		return 1000UL; /* Value to make us fail */
> > +
> > +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> > +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> > +
> > +	for (j = 0; j < BINS_AV; j++) {
> > +		tot_edpi += s->edpi_av[j];
> > +		tot_ppi += s->ppi_av[j];
> > +	}
> > +
> > +	tot_edpi = tot_edpi / BINS_AV;
> > +
> > +	percent = 100 - (uint32_t)((float)tot_edpi /
> > +			(float)s->thresh[s->cur_freq].base_edpi * 100);
> > +
> > +	return (uint32_t)percent;
> > +}
> > +
> > +
> > +static inline void  __attribute__((always_inline))
> > +update_stats_normal(struct priority_worker *poll_stats)
> > +{
> > +	uint32_t percent;
> > +
> > +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> > +		return;
> > +
> > +	percent = update_stats(poll_stats);
> > +
> > +	if (percent > 100)
> > +		return;
> > +
> > +	if (poll_stats->cur_freq == LOW)
> > +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> > +	else if (poll_stats->cur_freq == MED) {
> > +
> > +		if (percent > poll_stats->thresh[poll_stats->cur_freq].
> > +				threshold_percent) {
> > +
> > +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> > +				poll_stats->threshold_ctr++;
> > +			else
> > +				set_state(poll_stats, HGH_BUSY);
> > +
> > +		} else {
> > +			/* reset */
> > +			poll_stats->threshold_ctr = 0;
> > +		}
> > +
> > +	} else if (poll_stats->cur_freq == HGH) {
> > +
> > +		if (percent < poll_stats->thresh[poll_stats->cur_freq].
> > +				threshold_percent) {
> > +
> > +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> > +				poll_stats->threshold_ctr++;
> > +			else
> > +				set_state(poll_stats, MED_NORMAL);
> > +		} else
> > +			/* reset */
> > +			poll_stats->threshold_ctr = 0;
> > +
> > +
> > +	}
> > +}
> > +
> > +static int
> > +empty_poll_trainning(struct priority_worker *poll_stats,
> > +		uint32_t max_train_iter) {
> > +
> > +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> > +		poll_stats->iter_counter++;
> > +		return 0;
> > +	}
> > +
> > +
> > +	update_training_stats(poll_stats,
> > +			LOW,
> > +			false,
> > +			max_train_iter);
> > +
> > +	update_training_stats(poll_stats,
> > +			MED,
> > +			false,
> > +			max_train_iter);
> > +
> > +	update_training_stats(poll_stats,
> > +			HGH,
> > +			false,
> > +			max_train_iter);
> > +
> > +
> > +	if (poll_stats->thresh[LOW].trained == true
> > +			&& poll_stats->thresh[MED].trained == true
> > +			&& poll_stats->thresh[HGH].trained == true) {
> > +
> > +		set_state(poll_stats, MED_NORMAL);
> > +
> > +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> > +				poll_stats->lcore_id);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void
> > +empty_poll_detection(struct rte_timer *tim,
> > +		void *arg)
> > +{
> > +
> > +	uint32_t i;
> > +
> > +	struct priority_worker *poll_stats;
> > +
> > +	RTE_SET_USED(tim);
> > +
> > +	RTE_SET_USED(arg);
> > +
> > +	for (i = 0; i < NUM_NODES; i++) {
> > +
> > +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> > +
> > +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> > +			continue;
> > +
> > +		switch (poll_stats->queue_state) {
> > +		case(TRAINING):
> > +			empty_poll_trainning(poll_stats,
> > +					ep_params->max_train_iter);
> > +			break;
> > +
> > +		case(HGH_BUSY):
> > +		case(MED_NORMAL):
> > +			update_stats_normal(poll_stats);
> > +
> > +			break;
> > +
> > +		case(LOW_PURGE):
> > +			break;
> > +		default:
> > +			break;
> > +
> > +		}
> > +
> > +	}
> > +
> > +}
> > +
> > +int
> > +rte_empty_poll_stat_init(void)
> > +{
> > +	uint32_t i;
> > +	/* Allocate the ep_params structure */
> > +	ep_params = rte_zmalloc_socket(NULL,
> > +			sizeof(struct ep_params),
> > +			0,
> > +			rte_socket_id());
> > +
> > +	if (!ep_params)
> > +		rte_panic("Cannot allocate heap memory for ep_params "
> > +				"for socket %d\n", rte_socket_id());
> > +
> > +	freq_index[LOW] = 14;
> > +	freq_index[MED] = 9;
> > +	freq_index[HGH] = 1;
> > +
> 
> Are these default frequency indexes? Can they be changed? Maybe say so in a
> comment.
> 
> > +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> > +
> > +	/* 5 seconds worth of training */
> > +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> > +
> > +	struct stats_data *w = &ep_params->wrk_data;
> > +
> > +	/* initialize all wrk_stats state */
> > +	for (i = 0; i < NUM_NODES; i++) {
> > +
> > +		if (rte_lcore_is_enabled(i) == 0)
> > +			continue;
> > +
> > +		set_state(&w->wrk_stats[i], TRAINING);
> > +		/*init the freqs table */
> > +		total_avail_freqs[i] = rte_power_freqs(i,
> > +				avail_freqs[i],
> > +				NUM_FREQS);
> > +
> > +		if (get_freq_index(LOW) > total_avail_freqs[i])
> > +			return -1;
> > +
> > +	}
> > +
> > +
> > +	return 0;
> > +}
> > +
> > +void
> > +rte_empty_poll_stat_free(void)
> > +{
> > +
> > +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> > +
> > +	if (ep_params != NULL)
> > +		rte_free(ep_params);
> > +}
> > +
> > +int
> > +rte_empty_poll_stat_update(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	poll_stats->empty_dequeues++;
> > +
> > +	return 0;
> > +}
> > +
> > +int
> > +rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> > +{
> > +
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	poll_stats->num_dequeue_pkts += nb_pkt;
> > +
> > +	return 0;
> > +}
> > +
> > +
> > +uint64_t
> > +rte_empty_poll_stat_fetch(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	return poll_stats->empty_dequeues;
> > +}
> > +
> > +uint64_t
> > +rte_poll_stat_fetch(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	return poll_stats->num_dequeue_pkts;
> > +}
> > +
> > +void
> > +rte_empty_poll_set_freq(enum freq_val index, uint32_t limit)
> > +{
> > +	switch (index) {
> > +
> > +	case LOW:
> > +		freq_index[LOW] = limit;
> > +		break;
> > +
> > +	case MED:
> > +		freq_index[MED] = limit;
> > +		break;
> > +
> > +	case HGH:
> > +		freq_index[HGH] = limit;
> > +		break;
> > +	default:
> > +		break;
> > +	}
> > +}
> > +
> > +void
> > +rte_empty_poll_setup_timer(void)
> > +{
> > +	int lcore_id = rte_lcore_id();
> > +	uint64_t hz = rte_get_timer_hz();
> > +
> > +	struct  ep_params *ep_ptr = ep_params;
> > +
> > +	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
> > +
> > +	rte_timer_reset_sync(&ep_ptr->timer0,
> > +			ep_ptr->interval_ticks,
> > +			PERIODICAL,
> > +			lcore_id,
> > +			empty_poll_detection,
> > +			(void *)ep_ptr);
> > +
> > +}
> > diff --git a/lib/librte_power/rte_empty_poll.h b/lib/librte_power/rte_empty_poll.h
> > new file mode 100644
> > index 0000000..7e036ee
> > --- /dev/null
> > +++ b/lib/librte_power/rte_empty_poll.h
> > @@ -0,0 +1,135 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2018 Intel Corporation
> > + */
> > +
> > +#ifndef _RTE_EMPTY_POLL_H
> > +#define _RTE_EMPTY_POLL_H
> > +
> > +/**
> > + * @file
> > + * RTE Power Management
> > + */
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +#include <sys/queue.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_byteorder.h>
> > +#include <rte_log.h>
> > +#include <rte_string_fns.h>
> > +#include <rte_power.h>
> > +#include <rte_timer.h>
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#define NUM_FREQS 20
> 
> Should probably be RTE_MAX_LCORE_FREQS
> 
> > +
> > +#define BINS_AV 4 /* has to be ^2 */
> > +
> > +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> > +
> > +#define NUM_PRIORITIES          2
> > +
> > +#define NUM_NODES         31 /*any reseanable prime number should work*/
> > +
> > +
> > +enum freq_val {
> > +	LOW,
> > +	MED,
> > +	HGH,
> 
> HGH should be HIGH. Where we see this on its own in the code, it's not
> obvious that it's related to MEDIUM and LOW.  Also, maybe MED should be
> MEDIUM.
> 
> Would suggest prefixing these with FREQ_. Makes the code more readable.
> 
> > +	NUM_FREQ = NUM_FREQS
> 
> This should probably be the number of elements in freq_val rather than
> NUM_FREQS
> 
> I think there is some confusion in the code about the number of frequencies.
> Any array where it's holding all the possible frequencies should use
> RTE_MAX_LCORE_FREQS.
> But it looks to me that the freq_index array only holds three values, the
> indexes for Low, Medium and High.
> 
> 
> > +};
> > +
> > +
> > +enum queue_state {
> > +	TRAINING, /* NO TRAFFIC */
> > +	MED_NORMAL,   /* MED */
> > +	HGH_BUSY,     /* HIGH */
> > +	LOW_PURGE,    /* LOW */
> > +};
> 
> I'm not sure that the NORMAL, BUSY and PURGE words add any value here. How
> about MEDIUM, HIGH and LOW?
> 
> > +
> > +/* queue stats */
> > +struct freq_threshold {
> > +
> > +	uint64_t base_edpi;
> > +	bool trained;
> > +	uint32_t threshold_percent;
> > +	uint32_t cur_train_iter;
> > +};
> > +
> > +
> > +struct priority_worker {
> > +
> > +	/* Current dequeue and throughput counts */
> > +	/* These 2 are written to by the worker threads */
> > +	/* So keep them on their own cache line */
> > +	uint64_t empty_dequeues;
> > +	uint64_t num_dequeue_pkts;
> > +
> > +	enum queue_state queue_state;
> > +
> > +	uint64_t empty_dequeues_prev;
> > +	uint64_t num_dequeue_pkts_prev;
> > +
> > +	/* Used for training only */
> > +	struct freq_threshold thresh[NUM_FREQ];
> > +	enum freq_val cur_freq;
> > +
> > +	/* bucket arrays to calculate the averages */
> > +	uint64_t edpi_av[BINS_AV];
> > +	uint32_t  ec;
> > +	uint64_t ppi_av[BINS_AV];
> > +	uint32_t  pc;
> 
> Suggest a comment per line explaining what the names of these variables
> mean. edpi, ppi, etc.
> 
> 
> > +
> > +	uint32_t lcore_id;
> > +	uint32_t iter_counter;
> > +	uint32_t threshold_ctr;
> > +	uint32_t display_ctr;
> > +	uint8_t  dev_id;
> > +
> > +} __rte_cache_aligned;
> > +
> > +
> > +struct stats_data {
> > +
> > +	struct priority_worker wrk_stats[NUM_NODES];
> > +
> > +	/* flag to stop rx threads processing packets until training over */
> > +	bool start_rx;
> > +
> > +};
> > +
> > +struct ep_params {
> > +
> > +	/* timer related stuff */
> > +	uint64_t interval_ticks;
> > +	uint32_t max_train_iter;
> > +	struct rte_timer timer0;
> > +
> > +	struct stats_data wrk_data;
> > +};
> > +
> > +
> > +int rte_empty_poll_stat_init(void);
> > +
> > +void rte_empty_poll_stat_free(void);
> > +
> > +int rte_empty_poll_stat_update(unsigned int lcore_id);
> > +
> > +int rte_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > +
> > +uint64_t rte_empty_poll_stat_fetch(unsigned int lcore_id);
> > +
> > +uint64_t rte_poll_stat_fetch(unsigned int lcore_id);
> > +
> > +void rte_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > +
> > +void rte_empty_poll_setup_timer(void);
> 
> All the function prototypes need documentation. Please see rte_power.h.
> And please make sure that Doxygen documentation generates correctly from
> the comments.
> Suggest prefixing the functions with rte_power_
> 
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif
> 
> A few more comments throughout the code would be good.
> 
> 
> 


* Re: [dpdk-dev] [PATCH v1 2/2] examples/l3fwd-power: simple app update to support new API
  2018-06-08  9:57 ` [dpdk-dev] [PATCH v1 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
@ 2018-06-19 10:31   ` Hunt, David
  0 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-06-19 10:31 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, radu.nicolau

Hi Liang,


On 8/6/2018 10:57 AM, Liang Ma wrote:
> Add the support for new traffic pattern aware power control
> power management API.
>
> Example:
> ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
> -P --config="(0,0,xx),(1,0,xx)" --empty-poll

Suggest expanding out each of the above options and explaining what each 
one does.

Maybe also explain what traffic pattern aware power control is, i.e. stats
are gathered on every poll that count how many packets were received. Then,
every 100 ms, these stats are analysed and a decision is taken to scale the
core frequency up or down.
Maybe also mention there's a sliding window to reduce excessive hysteresis.


> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>   examples/l3fwd-power/main.c | 229 ++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 211 insertions(+), 18 deletions(-)
>
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 596d645..22a0e4e 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -43,6 +43,7 @@
>   #include <rte_timer.h>
>   #include <rte_power.h>
>   #include <rte_spinlock.h>
> +#include <rte_empty_poll.h>
>   
>   #define RTE_LOGTYPE_L3FWD_POWER RTE_LOGTYPE_USER1
>   
> @@ -129,6 +130,9 @@ static uint32_t enabled_port_mask = 0;
>   static int promiscuous_on = 0;
>   /* NUMA is enabled by default. */
>   static int numa_on = 1;
> +/* emptypoll is disabled by default. */
> +static bool empty_poll_on;
> +volatile bool empty_poll_stop;
>   static int parse_ptype; /**< Parse packet type using rx callback, and */
>   			/**< disabled by default */
>   
> @@ -336,6 +340,10 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
>   static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
>   		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
>   
> +static int is_done(void)
> +{
> +	return empty_poll_stop;
> +}
>   /* exit signal handler */
>   static void
>   signal_exit_now(int sigtype)
> @@ -344,7 +352,15 @@ signal_exit_now(int sigtype)
>   	unsigned int portid;
>   	int ret;
>   
> +	RTE_SET_USED(lcore_id);
> +	RTE_SET_USED(portid);
> +	RTE_SET_USED(ret);
> +
>   	if (sigtype == SIGINT) {
> +		if (empty_poll_on)
> +			empty_poll_stop = true;
> +
> +
>   		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
>   			if (rte_lcore_is_enabled(lcore_id) == 0)
>   				continue;
> @@ -353,20 +369,23 @@ signal_exit_now(int sigtype)
>   			ret = rte_power_exit(lcore_id);
>   			if (ret)
>   				rte_exit(EXIT_FAILURE, "Power management "
> -					"library de-initialization failed on "
> -							"core%u\n", lcore_id);
> +						"library de-initialization failed on "
> +						"core%u\n", lcore_id);
>   		}
>   
> -		RTE_ETH_FOREACH_DEV(portid) {
> -			if ((enabled_port_mask & (1 << portid)) == 0)
> -				continue;
> +		if (!empty_poll_on) {
> +			RTE_ETH_FOREACH_DEV(portid) {
> +				if ((enabled_port_mask & (1 << portid)) == 0)
> +					continue;
>   
> -			rte_eth_dev_stop(portid);
> -			rte_eth_dev_close(portid);
> +				rte_eth_dev_stop(portid);
> +				rte_eth_dev_close(portid);
> +			}
>   		}
>   	}
>   
> -	rte_exit(EXIT_SUCCESS, "User forced exit\n");
> +	if (!empty_poll_on)
> +		rte_exit(EXIT_SUCCESS, "User forced exit\n");
>   }
>   
>   /*  Freqency scale down timer callback */
> @@ -831,6 +850,108 @@ static int event_register(struct lcore_conf *qconf)
>   
>   	return 0;
>   }
> +/* main processing loop */
> +static int
> +main_empty_poll_loop(__attribute__((unused)) void *dummy)
> +{
> +	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +	unsigned int lcore_id;
> +	uint64_t prev_tsc, diff_tsc, cur_tsc;
> +	int i, j, nb_rx;
> +	uint8_t queueid;
> +	uint16_t portid;
> +	struct lcore_conf *qconf;
> +	struct lcore_rx_queue *rx_queue;
> +
> +	const uint64_t drain_tsc = (rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
> +
> +	prev_tsc = 0;
> +
> +	lcore_id = rte_lcore_id();
> +	qconf = &lcore_conf[lcore_id];
> +
> +	if (qconf->n_rx_queue == 0) {
> +		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
> +		return 0;
> +	}
> +
> +	RTE_LOG(INFO, L3FWD_POWER, "entering main empty_poll loop on lcore %u\n", lcore_id);
> +
> +	for (i = 0; i < qconf->n_rx_queue; i++) {
> +		portid = qconf->rx_queue_list[i].port_id;
> +		queueid = qconf->rx_queue_list[i].queue_id;
> +		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
> +				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
> +	}
> +
> +	while (!is_done()) {
> +		stats[lcore_id].nb_iteration_looped++;
> +
> +		cur_tsc = rte_rdtsc();
> +		/*
> +		 * TX burst queue drain
> +		 */
> +		diff_tsc = cur_tsc - prev_tsc;
> +		if (unlikely(diff_tsc > drain_tsc)) {
> +			for (i = 0; i < qconf->n_tx_port; ++i) {
> +				portid = qconf->tx_port_id[i];
> +				rte_eth_tx_buffer_flush(portid,
> +						qconf->tx_queue_id[portid],
> +						qconf->tx_buffer[portid]);
> +			}
> +			prev_tsc = cur_tsc;
> +		}
> +
> +		/*
> +		 * Read packet from RX queues
> +		 */
> +		for (i = 0; i < qconf->n_rx_queue; ++i) {
> +			rx_queue = &(qconf->rx_queue_list[i]);
> +			rx_queue->idle_hint = 0;
> +			portid = rx_queue->port_id;
> +			queueid = rx_queue->queue_id;
> +
> +			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
> +					MAX_PKT_BURST);
> +
> +			stats[lcore_id].nb_rx_processed += nb_rx;
> +
> +			if (nb_rx == 0) {
> +
> +				rte_empty_poll_stat_update(lcore_id);
> +
> +				continue;
> +			} else {
> +				rte_poll_stat_update(lcore_id, nb_rx);
> +			}
> +
> +
> +			/* Prefetch first packets */
> +			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
> +				rte_prefetch0(rte_pktmbuf_mtod(
> +							pkts_burst[j], void *));
> +			}
> +
> +			/* Prefetch and forward already prefetched packets */
> +			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> +				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> +							j + PREFETCH_OFFSET], void *));
> +				l3fwd_simple_forward(pkts_burst[j], portid,
> +						qconf);
> +			}
> +
> +			/* Forward remaining prefetched packets */
> +			for (; j < nb_rx; j++) {
> +				l3fwd_simple_forward(pkts_burst[j], portid,
> +						qconf);
> +			}
> +
> +		}
> +
> +	}
> +
> +	return 0;
> +}
>   

I see there's a new main loop added - main_empty_poll_loop. I think this is OK for the moment, but we might think about
separating this out into another file in the future. There may be other power saving mechanisms that we'd want to
implement in this way as well, rather than having a different l3-fwd app for each one. OK for the moment, IMO.

>   /* main processing loop */
>   static int
> @@ -1128,7 +1249,8 @@ print_usage(const char *prgname)
>   		"  --no-numa: optional, disable numa awareness\n"
>   		"  --enable-jumbo: enable jumbo frame"
>   		" which max packet len is PKTLEN in decimal (64-9600)\n"
> -		"  --parse-ptype: parse packet type by software\n",
> +		"  --parse-ptype: parse packet type by software\n"
> +		"  --empty=poll: enable empty poll detection\n",

typo, should be empty-poll

>   		prgname);
>   }
>   
> @@ -1231,10 +1353,12 @@ parse_args(int argc, char **argv)
>   	int opt, ret;
>   	char **argvopt;
>   	int option_index;
> +	uint32_t limit;
>   	char *prgname = argv[0];
>   	static struct option lgopts[] = {
>   		{"config", 1, 0, 0},
>   		{"no-numa", 0, 0, 0},
> +		{"empty-poll", 0, 0, 0},
>   		{"enable-jumbo", 0, 0, 0},
>   		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
>   		{NULL, 0, 0, 0}
> @@ -1259,7 +1383,18 @@ parse_args(int argc, char **argv)
>   			printf("Promiscuous mode selected\n");
>   			promiscuous_on = 1;
>   			break;
> -
> +		case 'l':
> +			limit = parse_portmask(optarg);
> +			rte_empty_poll_set_freq(LOW, limit);
> +			break;
> +		case 'm':
> +			limit = parse_portmask(optarg);
> +			rte_empty_poll_set_freq(MED, limit);
> +			break;
> +		case 'h':
> +			limit = parse_portmask(optarg);
> +			rte_empty_poll_set_freq(HGH, limit);
> +			break;
>   		/* long options */
>   		case 0:
>   			if (!strncmp(lgopts[option_index].name, "config", 6)) {
> @@ -1278,6 +1413,12 @@ parse_args(int argc, char **argv)
>   			}
>   
>   			if (!strncmp(lgopts[option_index].name,
> +						"empty-poll", 10)) {
> +				printf("empty-poll is enabled\n");
> +				empty_poll_on = true;
> +			}
> +
> +			if (!strncmp(lgopts[option_index].name,
>   					"enable-jumbo", 12)) {
>   				struct option lenopts =
>   					{"max-pkt-len", required_argument, \
> @@ -1609,6 +1750,41 @@ static int check_ptype(uint16_t portid)
>   
>   }
>   
> +static int
> +launch_timer(unsigned int lcore_id)
> +{
> +	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
> +
> +	RTE_SET_USED(lcore_id);
> +
> +
> +	if (rte_get_master_lcore() != lcore_id) {
> +		rte_panic("timer on lcore:%d which is not master core:%d\n",
> +				lcore_id,
> +				rte_get_master_lcore());
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
> +
> +	rte_empty_poll_setup_timer();
> +
> +	cycles_10ms = rte_get_timer_hz() / 100;
> +
> +	while (!is_done()) {
> +		cur_tsc = rte_rdtsc();
> +		diff_tsc = cur_tsc - prev_tsc;
> +		if (diff_tsc > cycles_10ms) {
> +			rte_timer_manage();
> +			prev_tsc = cur_tsc;
> +			cycles_10ms = rte_get_timer_hz() / 100;
> +		}
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
> +
> +	return 0;
> +}
> +
>   int
>   main(int argc, char **argv)
>   {
> @@ -1780,14 +1956,15 @@ main(int argc, char **argv)
>   				"Library initialization failed on core %u\n", lcore_id);
>   
>   		/* init timer structures for each enabled lcore */
> -		rte_timer_init(&power_timers[lcore_id]);
> -		hz = rte_get_timer_hz();
> -		rte_timer_reset(&power_timers[lcore_id],
> -			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
> -						power_timer_cb, NULL);
> -
> +		if (empty_poll_on == false) {
> +			rte_timer_init(&power_timers[lcore_id]);
> +			hz = rte_get_timer_hz();
> +			rte_timer_reset(&power_timers[lcore_id],
> +					hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
> +					power_timer_cb, NULL);
> +		}
>   		qconf = &lcore_conf[lcore_id];
> -		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
> +		printf("\nInitializing rx queues on lcore %u ...\n", lcore_id);
>   		fflush(stdout);
>   		/* init RX queues */
>   		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
> @@ -1856,12 +2033,28 @@ main(int argc, char **argv)
>   
>   	check_all_ports_link_status(enabled_port_mask);
>   
> +	if (empty_poll_on == true)
> +		rte_empty_poll_stat_init();
> +
> +
>   	/* launch per-lcore init on every lcore */
> -	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> +	if (empty_poll_on == false) {
> +		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> +	} else {
> +		empty_poll_stop = false;
> +		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
> +	}
> +
> +	if (empty_poll_on == true)
> +		launch_timer(rte_lcore_id());
> +
>   	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
>   		if (rte_eal_wait_lcore(lcore_id) < 0)
>   			return -1;
>   	}
>   
> +	if (empty_poll_on)
> +		rte_empty_poll_stat_free();
> +
>   	return 0;
>   }


* [dpdk-dev] [PATCH v3 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-08 15:26   ` [dpdk-dev] [PATCH v2 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
@ 2018-06-20 14:44     ` Liang Ma
  2018-06-20 14:44       ` [dpdk-dev] [PATCH v3 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
  0 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-06-20 14:44 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, radu.nicolau, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core is matters
because, without that information:

   * There is no indication of overload conditions

   * Users do not know how much real load is on a system, which results in
     wasted energy as no power management is utilized

Schemes that were tried and failed include calculating the cycles required
from the load on the core, in other words the busyness: for example,
measuring how many cycles it costs to handle each packet and determining
the frequency cost per core. Due to the varying nature of traffic, types of
frames and cost in cycles to process, this mechanism becomes complex
quickly, whereas a simple scheme is required to solve the problem.

2. Proposed solution

For all polling mechanisms, the proposed solution focuses on how many times
an empty poll is executed instead of calculating how many cycles it costs
to handle each packet. A lower empty poll count means the current core is
busy processing its workload, so a higher frequency is needed. A high
empty poll count indicates the current core has a lot of spare time, so
the frequency can be lowered.

2.1 Power state definition:

	LOW:  the frequency is used for purge mode.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a. Initialization/Training phase. No traffic passes through; the
	   system measures the average empty poll count in each of the
	   LOW/MED/HIGH power states. Those averages become the baseline
	   for the normal phase. The system collects every core's counters
	   every 100ms. The training phase takes 5 seconds.

	b. Normal phase. When real traffic passes through, the system
	   compares the run-time moving average of empty polls with the
	   baseline, then decides whether to move to the HIGH power state
	   or the MED power state. The system collects every core's
	   counters every 10ms.

3. Proposed  API

1.  rte_power_empty_poll_stat_init(void);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread
safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread
safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
which allows the user to customize the frequency of a power state.

8.  rte_power_empty_poll_setup_timer(void);
which is used to set up the timer/callback that processes all of the above
counters.

ChangeLog:
v2: fix some coding style issues
v3: rename the filename, API name.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 lib/librte_power/Makefile               |   3 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
 4 files changed, 728 insertions(+), 3 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..1053e1b 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -16,8 +16,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..923226b
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR  5
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
+		return 1000UL; /* Value to make us fail */
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, HGH_BUSY);
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_trainning(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+static void
+empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_trainning(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int
+rte_power_empty_poll_stat_init(void)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	freq_index[LOW] = 14;
+	freq_index[MED] = 9;
+	freq_index[HGH] = 1;
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
+
+void
+rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit)
+{
+	switch (index) {
+
+	case LOW:
+		freq_index[LOW] = limit;
+		break;
+
+	case MED:
+		freq_index[MED] = limit;
+		break;
+
+	case HGH:
+		freq_index[HGH] = limit;
+		break;
+	default:
+		break;
+	}
+}
+
+void
+rte_power_empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			empty_poll_detection,
+			(void *)ep_ptr);
+
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..c814784
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* Must be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         31 /* Any reasonable prime number should work */
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Per worker thread empty poll stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param void
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int rte_power_empty_poll_stat_init(void);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The packet number of one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Allow the user to customize the index of a power state.
+ *
+ * @param  index
+ *  The type of the queue power state
+ * @param  limit
+ *  The processor P-state offset value for that power state.
+ */
+void rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
+
+/**
+ * Setup the timer/callback to process empty poll counter.
+ * Default timer resolution is 10ms.
+ */
+void rte_power_empty_poll_setup_timer(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
-- 
2.7.5

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v3 2/2] examples/l3fwd-power: simple app update to support new API
  2018-06-20 14:44     ` [dpdk-dev] [PATCH v3 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
@ 2018-06-20 14:44       ` Liang Ma
  2018-06-26 11:40         ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Radu Nicolau
  0 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-06-20 14:44 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, radu.nicolau, Liang Ma

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll

Please refer to the l3fwd-power documentation for all parameters except
empty-poll.

Once empty-poll is enabled, the system starts with a training phase.
No traffic should pass through during the training phase.
When the training phase completes, the system moves to the normal phase.

The system runs in the modest power state at the beginning.
If the system busyness percentage rises above 70%, the system moves
to the High power state. If the traffic becomes lower (e.g. the
system busyness percentage drops below 30%), the system falls back
to the modest power state.
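
The 70%/30% switching rule can be sketched as a small two-state machine.
This is only an illustration, not the library code: the demo_* names are
hypothetical, and the hold count of 100 intervals is an assumption that
mirrors the INTERVALS_PER_SECOND check in the library patch of this
series, where a threshold has to stay crossed for that many timer
intervals before the state changes.

```c
#include <stdint.h>

/* Assumed values, taken from the defaults described in this thread. */
#define MED_TO_HIGH_PERCENT 70
#define HIGH_TO_MED_PERCENT 30
#define HOLD_INTERVALS 100 /* 100 x 10ms = 1 second */

enum demo_state { DEMO_MED, DEMO_HIGH };

struct demo_core {
	enum demo_state state;
	uint32_t hold_ctr; /* consecutive intervals past the threshold */
};

/* Feed one busyness sample (0..100) and return the resulting state. */
static enum demo_state
demo_step(struct demo_core *c, uint32_t busy_percent)
{
	int crossed;

	if (c->state == DEMO_MED)
		crossed = busy_percent > MED_TO_HIGH_PERCENT;
	else
		crossed = busy_percent < HIGH_TO_MED_PERCENT;

	if (!crossed) {
		/* Back inside the band: forget any partial run. */
		c->hold_ctr = 0;
		return c->state;
	}

	if (c->hold_ctr < HOLD_INTERVALS) {
		c->hold_ctr++;
		return c->state;
	}

	/* Threshold held long enough: switch state and start over. */
	c->state = (c->state == DEMO_MED) ? DEMO_HIGH : DEMO_MED;
	c->hold_ctr = 0;
	return c->state;
}
```

Using two separate thresholds leaves a band between 30% and 70% in which
the state never changes, so short traffic bursts do not make the
frequency flap.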

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.

ChangeLog:
v2 fix some coding style issues
v3 rename the API.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 examples/l3fwd-power/main.c | 232 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 214 insertions(+), 18 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 596d645..953a2ed 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -42,6 +42,7 @@
 #include <rte_string_fns.h>
 #include <rte_timer.h>
 #include <rte_power.h>
+#include <rte_power_empty_poll.h>
 #include <rte_spinlock.h>
 
 #define RTE_LOGTYPE_L3FWD_POWER RTE_LOGTYPE_USER1
@@ -129,6 +130,9 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+volatile bool empty_poll_stop;
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -336,6 +340,10 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -344,7 +352,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -353,20 +369,23 @@ signal_exit_now(int sigtype)
 			ret = rte_power_exit(lcore_id);
 			if (ret)
 				rte_exit(EXIT_FAILURE, "Power management "
-					"library de-initialization failed on "
-							"core%u\n", lcore_id);
+						"library de-initialization failed on "
+						"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -831,6 +850,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 
 /* main processing loop */
 static int
@@ -1128,7 +1248,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1231,10 +1352,12 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
+		{"empty-poll", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
@@ -1259,7 +1382,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(LOW, limit);
+			break;
+		case 'm':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(MED, limit);
+			break;
+		case 'h':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(HGH, limit);
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1278,6 +1412,12 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1609,6 +1749,41 @@ static int check_ptype(uint16_t portid)
 
 }
 
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	rte_power_empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -1693,6 +1868,10 @@ main(int argc, char **argv)
 		if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
 			local_port_conf.txmode.offloads |=
 				DEV_TX_OFFLOAD_MBUF_FAST_FREE;
+
+		local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
+			dev_info.flow_type_rss_offloads;
+
 		ret = rte_eth_dev_configure(portid, nb_rx_queue,
 					(uint16_t)n_tx_queue, &local_port_conf);
 		if (ret < 0)
@@ -1780,14 +1959,15 @@ main(int argc, char **argv)
 				"Library initialization failed on core %u\n", lcore_id);
 
 		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
-		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
+		printf("\nInitializing rx queues on lcore %u ...\n", lcore_id);
 		fflush(stdout);
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
@@ -1856,12 +2036,28 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true)
+		rte_power_empty_poll_stat_init();
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
-- 
2.7.5

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-20 14:44       ` [dpdk-dev] [PATCH v3 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
@ 2018-06-26 11:40         ` Radu Nicolau
  2018-06-26 11:40           ` [dpdk-dev] [PATCH v4 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
                             ` (3 more replies)
  0 siblings, 4 replies; 79+ messages in thread
From: Radu Nicolau @ 2018-06-26 11:40 UTC (permalink / raw)
  To: dev; +Cc: liang.j.ma, david.hunt, Radu Nicolau

From: Liang Ma <liang.j.ma@intel.com>

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core is matters,
because without that information:

   * There is no indication of overload conditions

   * Users do not know how much real load is on a system, which results in
     wasted energy as no power management is utilized

Tried and failed schemes include calculating the cycles required from
the load on the core, in other words the busyness: for example,
measuring how many cycles it costs to handle each packet and determining
the frequency cost per core. Because traffic, frame types and processing
cost in cycles all vary, this mechanism quickly becomes complex, whereas
a simple scheme is required to solve the problems.

2. Proposed solution

For all polling mechanisms, the proposed solution focuses on how many
empty polls are executed instead of calculating how many cycles it costs
to handle each packet. A low empty poll count means the current core is
busy processing its workload, therefore a higher frequency is needed. A
high empty poll count indicates the current core has a lot of spare time,
therefore we can lower the frequency.
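
As a rough illustration of the idea, busyness can be derived from the
empty poll count of one interval and the idle baseline for the current
frequency. The sketch below is a simplified model under assumptions, not
the patch itself: demo_busy_percent is a hypothetical name, and it uses
integer arithmetic where the patch uses floats. The out-of-range marker
1000 follows the value the patch returns when the sample exceeds the
baseline.

```c
#include <stdint.h>

#define DEMO_INVALID 1000u /* out-of-range marker for unusable samples */

/*
 * Busyness in percent for one interval.
 * empty_polls: empty polls counted in the interval.
 * baseline:    empty polls per interval measured with no traffic.
 */
static uint32_t
demo_busy_percent(uint64_t empty_polls, uint64_t baseline)
{
	/* No baseline yet, or more empty polls than an idle core. */
	if (baseline == 0 || empty_polls > baseline)
		return DEMO_INVALID;

	return (uint32_t)(100 - (empty_polls * 100) / baseline);
}
```

For example, with a baseline of 1000 empty polls per interval, an
interval with 300 empty polls gives a busyness of 70%.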

2.1 Power state definition:

	LOW:  the frequency is used for purge mode.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a.Initialization/Training phase. No traffic passes through; the
	  system measures the average empty poll count in each of the
	  LOW/MED/HIGH power states. Those averages become the baseline
	  for the normal phase. The system will collect all cores' counters
	  every 100ms. The Training phase will take 5 seconds.

	b.Normal phase. When real traffic passes through, the system will
	  compare the run-time empty poll moving average with the baseline,
	  then decide whether to move to the HIGH power state or the MED
	  power state. The system will collect all cores' counters every
	  10ms.
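
The two phases can be sketched with two small helpers: one that turns
the training total into a baseline, and one that keeps the run-time
moving average. This is a minimal sketch under assumptions; the demo_*
names are hypothetical, the 0.05% margin and the 4-entry window follow
the values used in the patch (base_edpi / 2000 and BINS_AV), and
everything else is simplified.

```c
#include <stdint.h>

#define DEMO_BINS 4 /* moving average window, a power of two */

/* Baseline: mean empty polls per interval plus a 0.05% margin. */
static uint64_t
demo_baseline(uint64_t total_empty_polls, uint32_t intervals)
{
	uint64_t mean;

	if (intervals == 0)
		return 0;

	mean = total_empty_polls / intervals;
	return mean + mean / 2000;
}

struct demo_window {
	uint64_t bins[DEMO_BINS];
	uint32_t next; /* running write index */
};

/* Store one sample and return the mean over the whole window. */
static uint64_t
demo_window_push(struct demo_window *w, uint64_t sample)
{
	uint64_t sum = 0;
	uint32_t i;

	w->bins[w->next++ % DEMO_BINS] = sample;
	for (i = 0; i < DEMO_BINS; i++)
		sum += w->bins[i];

	return sum / DEMO_BINS;
}
```

Note that a freshly cleared window reports a low average until all bins
have been filled, which reads as high busyness for the first few
intervals after a state change.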

3. Proposed  API

1.  rte_power_empty_poll_stat_init(void);
which is used to initialize the power management system.
 
2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.
 
3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update specific core empty poll counter, not thread safe
 
4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update specific core valid poll counter, not thread safe
 
5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get specific core empty poll counter.
 
6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get specific core valid poll counter.

7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
which allows the user to customize the frequency of a power state.

8.  rte_power_empty_poll_setup_timer(void);
which is used to setup the timer/callback to process all above counter.
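
A worker is expected to call one of the two update functions after every
receive attempt. The sketch below models that accounting with local
stand-ins so that it builds without DPDK: demo_empty_poll_update and
demo_poll_update are hypothetical substitutes for
rte_power_empty_poll_stat_update() and rte_power_poll_stat_update(), and
DEMO_NUM_CORES is an arbitrary size chosen for the example.

```c
#include <stdint.h>

#define DEMO_NUM_CORES 4 /* arbitrary size for this sketch */

struct demo_counters {
	uint64_t empty_polls;
	uint64_t packets;
};

static struct demo_counters demo_stats[DEMO_NUM_CORES];

/* Stand-in for rte_power_empty_poll_stat_update(). */
static int
demo_empty_poll_update(unsigned int core)
{
	if (core >= DEMO_NUM_CORES)
		return -1;

	demo_stats[core].empty_polls++;
	return 0;
}

/* Stand-in for rte_power_poll_stat_update(). */
static int
demo_poll_update(unsigned int core, uint8_t nb_pkt)
{
	if (core >= DEMO_NUM_CORES)
		return -1;

	demo_stats[core].packets += nb_pkt;
	return 0;
}

/* Account for one receive attempt that returned nb_rx packets. */
static int
demo_account_rx(unsigned int core, uint16_t nb_rx)
{
	if (nb_rx == 0)
		return demo_empty_poll_update(core);

	/* nb_pkt is 8 bits wide, so bursts above 255 would wrap. */
	return demo_poll_update(core, (uint8_t)nb_rx);
}
```

Because the counters are plain increments, each core should only update
its own entry, which matches the "not thread safe" note above.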

ChangeLog:
v2: fix some coding style issues
v3: rename the filename, API name.
v4: updated makefile and symbol list

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Radu Nicolau <radu.nicolau@intel.com>
---
 lib/librte_power/Makefile               |   5 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
 lib/librte_power/rte_power_version.map  |  14 +-
 5 files changed, 742 insertions(+), 5 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..9bec668 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -7,7 +7,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_power.a
 
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +16,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..923226b
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR  5
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	snprintf(pfi_str, sizeof(pfi_str), "%02u", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
+		return 1000U; /* Out of range (> 100): caller discards the sample */
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, HGH_BUSY);
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_trainning(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+static void
+empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_trainning(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int
+rte_power_empty_poll_stat_init(void)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	freq_index[LOW] = 14;
+	freq_index[MED] = 9;
+	freq_index[HGH] = 1;
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/* init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
+
+void
+rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit)
+{
+	switch (index) {
+
+	case LOW:
+		freq_index[LOW] = limit;
+		break;
+
+	case MED:
+		freq_index[MED] = limit;
+		break;
+
+	case HGH:
+		freq_index[HGH] = limit;
+		break;
+	default:
+		break;
+	}
+}
+
+void
+rte_power_empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			empty_poll_detection,
+			(void *)ep_ptr);
+
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..c814784
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* Must be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         31 /* Any reasonable prime number should work */
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Per-worker-thread empty poll stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param void
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int rte_power_empty_poll_stat_init(void);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update the empty poll counter of a specific core.
+ * Not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update the valid poll counter of a specific core. Not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The number of packets received by one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch the empty poll counter of a specific core.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch the valid poll counter of a specific core.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Allow the user to customize the frequency index of a power state.
+ *
+ * @param  index
+ *  The type of the queue power state
+ * @param  limit
+ *  The processor P-state offset value for that power state.
+ */
+void rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
+
+/**
+ * Setup the timer/callback to process empty poll counter.
+ * Default timer resolution is 10ms.
+ */
+void rte_power_empty_poll_setup_timer(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 96dc42e..8e393e0 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -25,4 +25,16 @@ DPDK_17.11 {
 	rte_power_freq_enable_turbo;
 	rte_power_turbo_status;
 
-} DPDK_2.0;
\ No newline at end of file
+} DPDK_2.0;
+
+DPDK_18.08 {
+	global:
+
+	rte_power_empty_poll_set_freq;
+	rte_power_empty_poll_setup_timer;
+	rte_power_empty_poll_stat_free;
+	rte_power_empty_poll_stat_init;
+	rte_power_empty_poll_stat_update;
+	rte_power_poll_stat_update;
+
+} DPDK_17.11;
-- 
2.7.5


* [dpdk-dev] [PATCH v4 2/2] examples/l3fwd-power: simple app update to support new API
  2018-06-26 11:40         ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Radu Nicolau
@ 2018-06-26 11:40           ` Radu Nicolau
  2018-06-26 13:03             ` Hunt, David
  2018-06-26 13:03           ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Hunt, David
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 79+ messages in thread
From: Radu Nicolau @ 2018-06-26 11:40 UTC (permalink / raw)
  To: dev; +Cc: liang.j.ma, david.hunt

From: Liang Ma <liang.j.ma@intel.com>

Add support for the new traffic-pattern-aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll

Please refer to the l3fwd-power documentation for all parameters except
--empty-poll.

Once empty-poll is enabled, the system starts in a training phase.
No traffic should pass through during the training phase.
When the training phase completes, the system moves to the normal phase.
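
As a rough illustration of what training produces, the sketch below
follows the arithmetic in the library patch: the per-interval empty poll
counts seen with no traffic are averaged into a baseline with a 0.05%
margin, and busyness is later measured against that baseline. The helper
names train_baseline() and busyness_percent() are hypothetical and are
not part of the API. Unlike update_stats() in the patch, which returns an
out-of-range value so the sample is discarded, this simplified version
reports 0% when the empty poll count exceeds the baseline, and it uses
integer rather than float arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: average the empty poll counts observed in each
 * idle training interval, then add a 0.05% margin (base / 2000).
 */
static uint64_t
train_baseline(const uint64_t *empty_polls, size_t n_intervals)
{
	uint64_t sum = 0;
	uint64_t base;
	size_t i;

	if (n_intervals == 0)
		return 0;

	for (i = 0; i < n_intervals; i++)
		sum += empty_polls[i];

	base = sum / n_intervals;
	return base + base / 2000;
}

/*
 * Hypothetical helper: busyness is the share of the idle baseline that
 * is no longer spent on empty polls. Simplification: a sample at or
 * above the baseline is reported as 0% busy instead of being discarded.
 */
static uint32_t
busyness_percent(uint64_t avg_empty_polls, uint64_t baseline)
{
	if (baseline == 0 || avg_empty_polls >= baseline)
		return 0;

	return 100 - (uint32_t)(avg_empty_polls * 100 / baseline);
}
```

For example, with a trained baseline of 4000 empty polls per interval, an
interval average of 1000 empty polls corresponds to 75% busyness.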

The system initially runs in the modest power state.
If the system busyness percentage rises above 70%, the system moves
to the high power state. If the traffic drops (e.g. the system
busyness percentage falls below 30%), the system falls back
to the modest power state.
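
The two thresholds form a hysteresis band, and in the library patch a
transition only happens after the threshold has been exceeded for a run
of consecutive timer intervals. The sketch below shows that behaviour
under stated assumptions: struct scaler, scaler_step() and the constant
HOLD_INTERVALS are hypothetical names, and the value 100 (about one
second at the 10 ms timer resolution) is an assumption, since the
definition of INTERVALS_PER_SECOND is not visible here.

```c
#include <stdint.h>

enum power_state { STATE_MED, STATE_HIGH };

struct scaler {
	enum power_state state;
	uint32_t ctr; /* consecutive intervals beyond the active threshold */
};

#define UP_THRESHOLD_PERCENT   70  /* MED -> HIGH when busyness is above */
#define DOWN_THRESHOLD_PERCENT 30  /* HIGH -> MED when busyness is below */
#define HOLD_INTERVALS         100 /* assumed: ~1 s of 10 ms intervals */

/* Feed one busyness sample; returns the state after this interval. */
static enum power_state
scaler_step(struct scaler *s, uint32_t busy_percent)
{
	int beyond = (s->state == STATE_MED) ?
		(busy_percent > UP_THRESHOLD_PERCENT) :
		(busy_percent < DOWN_THRESHOLD_PERCENT);

	if (!beyond) {
		/* Back inside the band: restart the count. */
		s->ctr = 0;
	} else if (s->ctr < HOLD_INTERVALS) {
		s->ctr++;
	} else {
		/* Threshold held long enough: switch state. */
		s->state = (s->state == STATE_MED) ? STATE_HIGH : STATE_MED;
		s->ctr = 0;
	}

	return s->state;
}
```

A sample between 30% and 70% never triggers a transition in either
direction, which keeps the frequency from flapping on bursty traffic.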

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v4 no change

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 examples/l3fwd-power/main.c | 232 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 214 insertions(+), 18 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 596d645..953a2ed 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -42,6 +42,7 @@
 #include <rte_string_fns.h>
 #include <rte_timer.h>
 #include <rte_power.h>
+#include <rte_power_empty_poll.h>
 #include <rte_spinlock.h>
 
 #define RTE_LOGTYPE_L3FWD_POWER RTE_LOGTYPE_USER1
@@ -129,6 +130,9 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* empty-poll is disabled by default. */
+static bool empty_poll_on;
+volatile bool empty_poll_stop;
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -336,6 +340,10 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -344,7 +352,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -353,20 +369,23 @@ signal_exit_now(int sigtype)
 			ret = rte_power_exit(lcore_id);
 			if (ret)
 				rte_exit(EXIT_FAILURE, "Power management "
-					"library de-initialization failed on "
-							"core%u\n", lcore_id);
+						"library de-initialization failed on "
+						"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -831,6 +850,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 
 /* main processing loop */
 static int
@@ -1128,7 +1248,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1231,10 +1352,12 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
+		{"empty-poll", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
@@ -1259,7 +1382,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(LOW, limit);
+			break;
+		case 'm':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(MED, limit);
+			break;
+		case 'h':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(HGH, limit);
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1278,6 +1412,12 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1609,6 +1749,41 @@ static int check_ptype(uint16_t portid)
 
 }
 
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%u which is not master core:%u\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	rte_power_empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -1693,6 +1868,10 @@ main(int argc, char **argv)
 		if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
 			local_port_conf.txmode.offloads |=
 				DEV_TX_OFFLOAD_MBUF_FAST_FREE;
+
+		local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
+			dev_info.flow_type_rss_offloads;
+
 		ret = rte_eth_dev_configure(portid, nb_rx_queue,
 					(uint16_t)n_tx_queue, &local_port_conf);
 		if (ret < 0)
@@ -1780,14 +1959,15 @@ main(int argc, char **argv)
 				"Library initialization failed on core %u\n", lcore_id);
 
 		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
-		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
+		printf("\nInitializing rx queues on lcore %u ...\n", lcore_id);
 		fflush(stdout);
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
@@ -1856,12 +2036,28 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true)
+		rte_power_empty_poll_stat_init();
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
-- 
2.7.5


* Re: [dpdk-dev] [PATCH v4 2/2] examples/l3fwd-power: simple app update to support new API
  2018-06-26 11:40           ` [dpdk-dev] [PATCH v4 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
@ 2018-06-26 13:03             ` Hunt, David
  0 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-06-26 13:03 UTC (permalink / raw)
  To: Radu Nicolau, dev; +Cc: liang.j.ma



On 26/6/2018 12:40 PM, Radu Nicolau wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> Add the support for new traffic pattern aware power control
> power management API.
>
> Example:
> ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
> -P --config="(0,0,xx),(1,0,xx)" --empty-poll
>
> Please Reference l3fwd-power document for all parameter except
> empty-poll.
>
> Once enable empty-poll. The system will start with training phase.
> There should not has any traffic pass-through during training phase.
> When training phase complete, system transfer to normal phase.
>
> System will running with modest power stat at beginning.
> If the system busyness percentage above 70%, then system will adjust
> power state move to High power state. If the traffic become lower(eg. The
> system busyness percentage drop below 30%), system will fallback
> to the modest power state.
>
> Example code use master thread to monitoring worker thread busyness.
> the default timer resolution is 10ms.
>
> ChangeLog:
> v2 fix some coding style issues
> v3 rename the API.
> v4 no change
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>   examples/l3fwd-power/main.c | 232 ++++++++++++++++++++++++++++++++++++++++----
>   1 file changed, 214 insertions(+), 18 deletions(-)
>
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 596d645..953a2ed 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -42,6 +42,7 @@
>   #include <rte_string_fns.h>
>   #include <rte_timer.h>
>   #include <rte_power.h>
> +#include <rte_power_empty_poll.h>
>   #include <rte_spinlock.h>
>   
>   #define RTE_LOGTYPE_L3FWD_POWER RTE_LOGTYPE_USER1
> @@ -129,6 +130,9 @@ static uint32_t enabled_port_mask = 0;
>   static int promiscuous_on = 0;
>   /* NUMA is enabled by default. */
>   static int numa_on = 1;
> +/* emptypoll is disabled by default. */
> +static bool empty_poll_on;
> +volatile bool empty_poll_stop;
>   static int parse_ptype; /**< Parse packet type using rx callback, and */
>   			/**< disabled by default */
>   
> @@ -336,6 +340,10 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
>   static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
>   		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
>   
> +static int is_done(void)
> +{
> +	return empty_poll_stop;
> +}
>   /* exit signal handler */
>   static void
>   signal_exit_now(int sigtype)
> @@ -344,7 +352,15 @@ signal_exit_now(int sigtype)
>   	unsigned int portid;
>   	int ret;
>   
> +	RTE_SET_USED(lcore_id);
> +	RTE_SET_USED(portid);
> +	RTE_SET_USED(ret);
> +
>   	if (sigtype == SIGINT) {
> +		if (empty_poll_on)
> +			empty_poll_stop = true;
> +
> +
>   		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
>   			if (rte_lcore_is_enabled(lcore_id) == 0)
>   				continue;
> @@ -353,20 +369,23 @@ signal_exit_now(int sigtype)
>   			ret = rte_power_exit(lcore_id);
>   			if (ret)
>   				rte_exit(EXIT_FAILURE, "Power management "
> -					"library de-initialization failed on "
> -							"core%u\n", lcore_id);
> +						"library de-initialization failed on "
> +						"core%u\n", lcore_id);
>   		}
>   
> -		RTE_ETH_FOREACH_DEV(portid) {
> -			if ((enabled_port_mask & (1 << portid)) == 0)
> -				continue;
> +		if (!empty_poll_on) {
> +			RTE_ETH_FOREACH_DEV(portid) {
> +				if ((enabled_port_mask & (1 << portid)) == 0)
> +					continue;
>   
> -			rte_eth_dev_stop(portid);
> -			rte_eth_dev_close(portid);
> +				rte_eth_dev_stop(portid);
> +				rte_eth_dev_close(portid);
> +			}
>   		}
>   	}
>   
> -	rte_exit(EXIT_SUCCESS, "User forced exit\n");
> +	if (!empty_poll_on)
> +		rte_exit(EXIT_SUCCESS, "User forced exit\n");
>   }
>   
>   /*  Freqency scale down timer callback */
> @@ -831,6 +850,107 @@ static int event_register(struct lcore_conf *qconf)
>   
>   	return 0;
>   }
> +/* main processing loop */
> +static int
> +main_empty_poll_loop(__attribute__((unused)) void *dummy)
> +{
> +	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +	unsigned int lcore_id;
> +	uint64_t prev_tsc, diff_tsc, cur_tsc;
> +	int i, j, nb_rx;
> +	uint8_t queueid;
> +	uint16_t portid;
> +	struct lcore_conf *qconf;
> +	struct lcore_rx_queue *rx_queue;
> +
> +	const uint64_t drain_tsc =
> +		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
> +
> +	prev_tsc = 0;
> +
> +	lcore_id = rte_lcore_id();
> +	qconf = &lcore_conf[lcore_id];
> +
> +	if (qconf->n_rx_queue == 0) {
> +		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
> +		return 0;
> +	}
> +
> +	for (i = 0; i < qconf->n_rx_queue; i++) {
> +		portid = qconf->rx_queue_list[i].port_id;
> +		queueid = qconf->rx_queue_list[i].queue_id;
> +		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
> +				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
> +	}
> +
> +	while (!is_done()) {
> +		stats[lcore_id].nb_iteration_looped++;
> +
> +		cur_tsc = rte_rdtsc();
> +		/*
> +		 * TX burst queue drain
> +		 */
> +		diff_tsc = cur_tsc - prev_tsc;
> +		if (unlikely(diff_tsc > drain_tsc)) {
> +			for (i = 0; i < qconf->n_tx_port; ++i) {
> +				portid = qconf->tx_port_id[i];
> +				rte_eth_tx_buffer_flush(portid,
> +						qconf->tx_queue_id[portid],
> +						qconf->tx_buffer[portid]);
> +			}
> +			prev_tsc = cur_tsc;
> +		}
> +
> +		/*
> +		 * Read packet from RX queues
> +		 */
> +		for (i = 0; i < qconf->n_rx_queue; ++i) {
> +			rx_queue = &(qconf->rx_queue_list[i]);
> +			rx_queue->idle_hint = 0;
> +			portid = rx_queue->port_id;
> +			queueid = rx_queue->queue_id;
> +
> +			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
> +					MAX_PKT_BURST);
> +
> +			stats[lcore_id].nb_rx_processed += nb_rx;
> +
> +			if (nb_rx == 0) {
> +
> +				rte_power_empty_poll_stat_update(lcore_id);
> +
> +				continue;
> +			} else {
> +				rte_power_poll_stat_update(lcore_id, nb_rx);
> +			}
> +
> +
> +			/* Prefetch first packets */
> +			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
> +				rte_prefetch0(rte_pktmbuf_mtod(
> +							pkts_burst[j], void *));
> +			}
> +
> +			/* Prefetch and forward already prefetched packets */
> +			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> +				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> +							j + PREFETCH_OFFSET], void *));
> +				l3fwd_simple_forward(pkts_burst[j], portid,
> +						qconf);
> +			}
> +
> +			/* Forward remaining prefetched packets */
> +			for (; j < nb_rx; j++) {
> +				l3fwd_simple_forward(pkts_burst[j], portid,
> +						qconf);
> +			}
> +
> +		}
> +
> +	}
> +
> +	return 0;
> +}
>   
>   /* main processing loop */
>   static int
> @@ -1128,7 +1248,8 @@ print_usage(const char *prgname)
>   		"  --no-numa: optional, disable numa awareness\n"
>   		"  --enable-jumbo: enable jumbo frame"
>   		" which max packet len is PKTLEN in decimal (64-9600)\n"
> -		"  --parse-ptype: parse packet type by software\n",
> +		"  --parse-ptype: parse packet type by software\n"
> +		"  --empty-poll: enable empty poll detection\n",
>   		prgname);
>   }
>   
> @@ -1231,10 +1352,12 @@ parse_args(int argc, char **argv)
>   	int opt, ret;
>   	char **argvopt;
>   	int option_index;
> +	uint32_t limit;
>   	char *prgname = argv[0];
>   	static struct option lgopts[] = {
>   		{"config", 1, 0, 0},
>   		{"no-numa", 0, 0, 0},
> +		{"empty-poll", 0, 0, 0},
>   		{"enable-jumbo", 0, 0, 0},
>   		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
>   		{NULL, 0, 0, 0}
> @@ -1259,7 +1382,18 @@ parse_args(int argc, char **argv)
>   			printf("Promiscuous mode selected\n");
>   			promiscuous_on = 1;
>   			break;
> -
> +		case 'l':
> +			limit = parse_portmask(optarg);
> +			rte_power_empty_poll_set_freq(LOW, limit);
> +			break;
> +		case 'm':
> +			limit = parse_portmask(optarg);
> +			rte_power_empty_poll_set_freq(MED, limit);
> +			break;
> +		case 'h':
> +			limit = parse_portmask(optarg);
> +			rte_power_empty_poll_set_freq(HGH, limit);
> +			break;
>   		/* long options */
>   		case 0:
>   			if (!strncmp(lgopts[option_index].name, "config", 6)) {
> @@ -1278,6 +1412,12 @@ parse_args(int argc, char **argv)
>   			}
>   
>   			if (!strncmp(lgopts[option_index].name,
> +						"empty-poll", 10)) {
> +				printf("empty-poll is enabled\n");
> +				empty_poll_on = true;
> +			}
> +
> +			if (!strncmp(lgopts[option_index].name,
>   					"enable-jumbo", 12)) {
>   				struct option lenopts =
>   					{"max-pkt-len", required_argument, \
> @@ -1609,6 +1749,41 @@ static int check_ptype(uint16_t portid)
>   
>   }
>   
> +static int
> +launch_timer(unsigned int lcore_id)
> +{
> +	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
> +
> +	RTE_SET_USED(lcore_id);
> +
> +
> +	if (rte_get_master_lcore() != lcore_id) {
> +		rte_panic("timer on lcore:%d which is not master core:%d\n",
> +				lcore_id,
> +				rte_get_master_lcore());
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
> +
> +	rte_power_empty_poll_setup_timer();
> +
> +	cycles_10ms = rte_get_timer_hz() / 100;
> +
> +	while (!is_done()) {
> +		cur_tsc = rte_rdtsc();
> +		diff_tsc = cur_tsc - prev_tsc;
> +		if (diff_tsc > cycles_10ms) {
> +			rte_timer_manage();
> +			prev_tsc = cur_tsc;
> +			cycles_10ms = rte_get_timer_hz() / 100;
> +		}
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
> +
> +	return 0;
> +}
> +
>   int
>   main(int argc, char **argv)
>   {
> @@ -1693,6 +1868,10 @@ main(int argc, char **argv)
>   		if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
>   			local_port_conf.txmode.offloads |=
>   				DEV_TX_OFFLOAD_MBUF_FAST_FREE;
> +
> +		local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
> +			dev_info.flow_type_rss_offloads;
> +
>   		ret = rte_eth_dev_configure(portid, nb_rx_queue,
>   					(uint16_t)n_tx_queue, &local_port_conf);
>   		if (ret < 0)
> @@ -1780,14 +1959,15 @@ main(int argc, char **argv)
>   				"Library initialization failed on core %u\n", lcore_id);
>   
>   		/* init timer structures for each enabled lcore */
> -		rte_timer_init(&power_timers[lcore_id]);
> -		hz = rte_get_timer_hz();
> -		rte_timer_reset(&power_timers[lcore_id],
> -			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
> -						power_timer_cb, NULL);
> -
> +		if (empty_poll_on == false) {
> +			rte_timer_init(&power_timers[lcore_id]);
> +			hz = rte_get_timer_hz();
> +			rte_timer_reset(&power_timers[lcore_id],
> +					hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
> +					power_timer_cb, NULL);
> +		}
>   		qconf = &lcore_conf[lcore_id];
> -		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
> +		printf("\nInitializing rx queues on lcore %u ...\n", lcore_id);
>   		fflush(stdout);
>   		/* init RX queues */
>   		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
> @@ -1856,12 +2036,28 @@ main(int argc, char **argv)
>   
>   	check_all_ports_link_status(enabled_port_mask);
>   
> +	if (empty_poll_on == true)
> +		rte_power_empty_poll_stat_init();
> +
> +
>   	/* launch per-lcore init on every lcore */
> -	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> +	if (empty_poll_on == false) {
> +		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> +	} else {
> +		empty_poll_stop = false;
> +		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
> +	}
> +
> +	if (empty_poll_on == true)
> +		launch_timer(rte_lcore_id());
> +
>   	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
>   		if (rte_eal_wait_lcore(lcore_id) < 0)
>   			return -1;
>   	}
>   
> +	if (empty_poll_on)
> +		rte_power_empty_poll_stat_free();
> +
>   	return 0;
>   }
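
For readers following the thread, the per-poll accounting that
main_empty_poll_loop() performs above can be looked at in isolation. This is
only a minimal stand-alone sketch, not the patch code: struct poll_counters
and account_poll() are hypothetical names, and the real application calls
rte_power_empty_poll_stat_update() and rte_power_poll_stat_update() instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-lcore counters kept by the library. */
struct poll_counters {
	uint64_t empty_polls;	/* polls that returned no packets */
	uint64_t rx_packets;	/* packets returned by non-empty polls */
};

/*
 * Record the outcome of one RX burst.
 * Returns 1 if there are packets to process, 0 if the poll was empty.
 */
static int
account_poll(struct poll_counters *c, uint16_t nb_rx)
{
	assert(c != NULL);

	if (nb_rx == 0) {
		c->empty_polls++;
		return 0;
	}

	c->rx_packets += nb_rx;
	return 1;
}
```

The sketch takes a uint16_t because that is what rte_eth_rx_burst() returns;
the proposed rte_power_poll_stat_update() takes a uint8_t, which is enough
only while the burst size stays below 256.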

Acked-by: David Hunt <david.hunt@intel.com>


* Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-26 11:40         ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Radu Nicolau
  2018-06-26 11:40           ` [dpdk-dev] [PATCH v4 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
@ 2018-06-26 13:03           ` Hunt, David
  2018-06-27 17:33           ` Kevin Traynor
  2018-07-10 16:04           ` [dpdk-dev] [PATCH v5 " Radu Nicolau
  3 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-06-26 13:03 UTC (permalink / raw)
  To: Radu Nicolau, dev; +Cc: liang.j.ma



On 26/6/2018 12:40 PM, Radu Nicolau wrote:
> From: Liang Ma <liang.j.ma@intel.com>
>
> 1. Abstract
>
> For packet processing workloads such as DPDK, polling is continuous.
> This means CPU cores always show 100% busy, independent of how much work
> those cores are doing. Accurately determining how busy a core is matters
> for the following reasons:
>
>     * There is no indication of overload conditions.
>
>     * Users do not know how much real load is on a system, which results
>       in wasted energy because no power management is utilized.
>
> Schemes that were tried and failed include calculating the cycles
> required from the load on the core, in other words the busyness: for
> example, working out how many cycles it costs to handle each packet and
> determining the frequency cost per core. Because traffic, frame types
> and per-packet cycle costs all vary, this mechanism quickly becomes
> complex, whereas a simple scheme is required to solve the problems.
>
> 2. Proposed solution
>
> For all polling mechanisms, the proposed solution focuses on how many
> times an empty poll is executed instead of calculating how many cycles it
> costs to handle each packet. A lower empty poll count means the current
> core is busy processing its workload, so a higher frequency is needed. A
> high empty poll count indicates the current core has plenty of spare
> time, so the frequency can be lowered.
>
> 2.1 Power state definition:
>
> 	LOW:  the frequency is used for purge mode.
>
> 	MED:  the frequency is used to process modest traffic workload.
>
> 	HIGH: the frequency is used to process busy traffic workload.
>
> 2.2 There are two phases to establish the power management system:
>
> 	a. Initialization/Training phase. With no traffic passing through,
> 	  the system measures the average empty poll count in each of the
> 	  LOW/MED/HIGH power states. Those averages become the baseline
> 	  for the normal phase. The system collects every core's counters
> 	  every 100ms. The training phase takes 5 seconds.
>
> 	b. Normal phase. When real traffic passes through, the system
> 	  compares the run-time moving average of empty polls with the
> 	  baseline, then decides whether to move to the HIGH or the MED
> 	  power state. The system collects every core's counters every 10ms.
>
> 3. Proposed  API
>
> 1.  rte_power_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>   
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resources held by the power management system.
>   
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>   
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>   
> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>   
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
>
> 7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> which allows the user to customize the frequency of a power state.
>
> 8.  rte_power_empty_poll_setup_timer(void);
> which is used to set up the timer/callback that processes all of the above counters.
>
> ChangeLog:
> v2: fix some coding style issues
> v3: rename the filename, API name.
> v4: updated makefile and symbol list
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Radu Nicolau <radu.nicolau@intel.com>
> ---
>   lib/librte_power/Makefile               |   5 +-
>   lib/librte_power/meson.build            |   5 +-
>   lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
>   lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
>   lib/librte_power/rte_power_version.map  |  14 +-
>   5 files changed, 742 insertions(+), 5 deletions(-)
>   create mode 100644 lib/librte_power/rte_power_empty_poll.c
>   create mode 100644 lib/librte_power/rte_power_empty_poll.h
>
> diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> index 6f85e88..9bec668 100644
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -7,7 +7,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
>   LIB = librte_power.a
>   
>   CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> -LDLIBS += -lrte_eal
> +LDLIBS += -lrte_eal -lrte_timer
>   
>   EXPORT_MAP := rte_power_version.map
>   
> @@ -16,8 +16,9 @@ LIBABIVER := 1
>   # all source are stored in SRCS-y
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
>   
>   # install this header file
> -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
>   
>   include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 253173f..63957eb 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
>   	build = false
>   endif
>   sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> -		'power_kvm_vm.c', 'guest_channel.c')
> -headers = files('rte_power.h')
> +		'power_kvm_vm.c', 'guest_channel.c',
> +		'rte_power_empty_poll.c')
> +headers = files('rte_power.h','rte_power_empty_poll.h')
> diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
> new file mode 100644
> index 0000000..923226b
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.c
> @@ -0,0 +1,521 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +
> +#include "rte_power.h"
> +#include "rte_power_empty_poll.h"
> +
> +#define INTERVALS_PER_SECOND 100     /* (10ms) */
> +#define SECONDS_TO_TRAIN_FOR  5
> +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> +#define DEFAULT_CYCLES_PER_PACKET 800
> +
> +static struct ep_params *ep_params;
> +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> +
> +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> +
> +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> +
> +static uint32_t freq_index[NUM_FREQ];
> +
> +static uint32_t
> +get_freq_index(enum freq_val index)
> +{
> +	return freq_index[index];
> +}
> +
> +
> +static int
> +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> +{
> +	int err = 0;
> +	uint32_t power_freq_index;
> +	if (!specific_freq)
> +		power_freq_index = get_freq_index(freq);
> +	else
> +		power_freq_index = freq;
> +
> +	err = rte_power_set_freq(lcore_id, power_freq_index);
> +
> +	return err;
> +}
> +
> +
> +static inline void __attribute__((always_inline))
> +exit_training_state(struct priority_worker *poll_stats)
> +{
> +	RTE_SET_USED(poll_stats);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_training_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->cur_freq = LOW;
> +	poll_stats->queue_state = TRAINING;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_normal_state(struct priority_worker *poll_stats)
> +{
> +	/* Clear the averages arrays and strs */
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = MED;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = MED_NORMAL;
> +	set_power_freq(poll_stats->lcore_id, MED, false);
> +
> +	/* Try here */
> +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_busy_state(struct priority_worker *poll_stats)
> +{
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = HGH;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = HGH_BUSY;
> +	set_power_freq(poll_stats->lcore_id, HGH, false);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_purge_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->queue_state = LOW_PURGE;
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_state(struct priority_worker *poll_stats,
> +		enum queue_state new_state)
> +{
> +	enum queue_state old_state = poll_stats->queue_state;
> +	if (old_state != new_state) {
> +
> +		/* Call any old state exit functions */
> +		if (old_state == TRAINING)
> +			exit_training_state(poll_stats);
> +
> +		/* Call any new state entry functions */
> +		if (new_state == TRAINING)
> +			enter_training_state(poll_stats);
> +		if (new_state == MED_NORMAL)
> +			enter_normal_state(poll_stats);
> +		if (new_state == HGH_BUSY)
> +			enter_busy_state(poll_stats);
> +		if (new_state == LOW_PURGE)
> +			enter_purge_state(poll_stats);
> +	}
> +}
> +
> +
> +static void
> +update_training_stats(struct priority_worker *poll_stats,
> +		uint32_t freq,
> +		bool specific_freq,
> +		uint32_t max_train_iter)
> +{
> +	RTE_SET_USED(specific_freq);
> +
> +	char pfi_str[32];
> +	uint64_t p0_empty_deq;
> +
> +	sprintf(pfi_str, "%02d", freq);
> +
> +	if (poll_stats->cur_freq == freq &&
> +			poll_stats->thresh[freq].trained == false) {
> +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> +
> +			set_power_freq(poll_stats->lcore_id,
> +					freq, specific_freq);
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +			return;
> +		} else if (poll_stats->thresh[freq].cur_train_iter
> +				<= max_train_iter) {
> +
> +			p0_empty_deq = poll_stats->empty_dequeues -
> +				poll_stats->empty_dequeues_prev;
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +		} else {
> +			if (poll_stats->thresh[freq].trained == false) {
> +				poll_stats->thresh[freq].base_edpi =
> +					poll_stats->thresh[freq].base_edpi /
> +					max_train_iter;
> +
> +				/* Add on a factor of 0.05%, this should remove any */
> +				/* false negatives when the system is 0% busy */
> +				poll_stats->thresh[freq].base_edpi +=
> +					poll_stats->thresh[freq].base_edpi / 2000;
> +
> +				poll_stats->thresh[freq].trained = true;
> +				poll_stats->cur_freq++;
> +
> +			}
> +		}
> +	}
> +}
> +
> +static inline uint32_t __attribute__((always_inline))
> +update_stats(struct priority_worker *poll_stats)
> +{
> +	uint64_t tot_edpi = 0, tot_ppi = 0;
> +	uint32_t j, percent;
> +
> +	struct priority_worker *s = poll_stats;
> +
> +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> +
> +	s->empty_dequeues_prev = s->empty_dequeues;
> +
> +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> +
> +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> +
> +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
> +		return 1000UL; /* Value to make us fail */
> +
> +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> +
> +	for (j = 0; j < BINS_AV; j++) {
> +		tot_edpi += s->edpi_av[j];
> +		tot_ppi += s->ppi_av[j];
> +	}
> +
> +	tot_edpi = tot_edpi / BINS_AV;
> +
> +	percent = 100 - (uint32_t)((float)tot_edpi /
> +			(float)s->thresh[s->cur_freq].base_edpi * 100);
> +
> +	return (uint32_t)percent;
> +}
> +
> +
> +static inline void  __attribute__((always_inline))
> +update_stats_normal(struct priority_worker *poll_stats)
> +{
> +	uint32_t percent;
> +
> +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> +		return;
> +
> +	percent = update_stats(poll_stats);
> +
> +	if (percent > 100)
> +		return;
> +
> +	if (poll_stats->cur_freq == LOW)
> +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> +	else if (poll_stats->cur_freq == MED) {
> +
> +		if (percent >
> +			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, HGH_BUSY);
> +
> +		} else {
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	} else if (poll_stats->cur_freq == HGH) {
> +
> +		if (percent <
> +			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, MED_NORMAL);
> +		} else
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +
> +
> +	}
> +}
> +
> +static int
> +empty_poll_trainning(struct priority_worker *poll_stats,
> +		uint32_t max_train_iter)
> +{
> +
> +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> +		poll_stats->iter_counter++;
> +		return 0;
> +	}
> +
> +
> +	update_training_stats(poll_stats,
> +			LOW,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			MED,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			HGH,
> +			false,
> +			max_train_iter);
> +
> +
> +	if (poll_stats->thresh[LOW].trained == true
> +			&& poll_stats->thresh[MED].trained == true
> +			&& poll_stats->thresh[HGH].trained == true) {
> +
> +		set_state(poll_stats, MED_NORMAL);
> +
> +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> +				poll_stats->lcore_id);
> +	}
> +
> +	return 0;
> +}
> +
> +static void
> +empty_poll_detection(struct rte_timer *tim,
> +		void *arg)
> +{
> +
> +	uint32_t i;
> +
> +	struct priority_worker *poll_stats;
> +
> +	RTE_SET_USED(tim);
> +
> +	RTE_SET_USED(arg);
> +
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> +
> +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> +			continue;
> +
> +		switch (poll_stats->queue_state) {
> +		case(TRAINING):
> +			empty_poll_trainning(poll_stats,
> +					ep_params->max_train_iter);
> +			break;
> +
> +		case(HGH_BUSY):
> +		case(MED_NORMAL):
> +			update_stats_normal(poll_stats);
> +
> +			break;
> +
> +		case(LOW_PURGE):
> +			break;
> +		default:
> +			break;
> +
> +		}
> +
> +	}
> +
> +}
> +
> +int
> +rte_power_empty_poll_stat_init(void)
> +{
> +	uint32_t i;
> +	/* Allocate the ep_params structure */
> +	ep_params = rte_zmalloc_socket(NULL,
> +			sizeof(struct ep_params),
> +			0,
> +			rte_socket_id());
> +
> +	if (!ep_params)
> +		rte_panic("Cannot allocate heap memory for ep_params "
> +				"for socket %d\n", rte_socket_id());
> +
> +	freq_index[LOW] = 14;
> +	freq_index[MED] = 9;
> +	freq_index[HGH] = 1;
> +
> +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> +
> +	/* 5 seconds worth of training */
> +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> +
> +	struct stats_data *w = &ep_params->wrk_data;
> +
> +	/* initialize all wrk_stats state */
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		if (rte_lcore_is_enabled(i) == 0)
> +			continue;
> +
> +		set_state(&w->wrk_stats[i], TRAINING);
> +		/*init the freqs table */
> +		total_avail_freqs[i] = rte_power_freqs(i,
> +				avail_freqs[i],
> +				NUM_FREQS);
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +	}
> +
> +
> +	return 0;
> +}
> +
> +void
> +rte_power_empty_poll_stat_free(void)
> +{
> +
> +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> +
> +	if (ep_params != NULL)
> +		rte_free(ep_params);
> +}
> +
> +int
> +rte_power_empty_poll_stat_update(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->empty_dequeues++;
> +
> +	return 0;
> +}
> +
> +int
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> +{
> +
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->num_dequeue_pkts += nb_pkt;
> +
> +	return 0;
> +}
> +
> +
> +uint64_t
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->empty_dequeues;
> +}
> +
> +uint64_t
> +rte_power_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->num_dequeue_pkts;
> +}
> +
> +void
> +rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit)
> +{
> +	switch (index) {
> +
> +	case LOW:
> +		freq_index[LOW] = limit;
> +		break;
> +
> +	case MED:
> +		freq_index[MED] = limit;
> +		break;
> +
> +	case HGH:
> +		freq_index[HGH] = limit;
> +		break;
> +	default:
> +		break;
> +	}
> +}
> +
> +void
> +rte_power_empty_poll_setup_timer(void)
> +{
> +	int lcore_id = rte_lcore_id();
> +	uint64_t hz = rte_get_timer_hz();
> +
> +	struct  ep_params *ep_ptr = ep_params;
> +
> +	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
> +
> +	rte_timer_reset_sync(&ep_ptr->timer0,
> +			ep_ptr->interval_ticks,
> +			PERIODICAL,
> +			lcore_id,
> +			empty_poll_detection,
> +			(void *)ep_ptr);
> +
> +}
> diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
> new file mode 100644
> index 0000000..c814784
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.h
> @@ -0,0 +1,202 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_EMPTY_POLL_H
> +#define _RTE_EMPTY_POLL_H
> +
> +/**
> + * @file
> + * RTE Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +#include <sys/queue.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_string_fns.h>
> +#include <rte_power.h>
> +#include <rte_timer.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define NUM_FREQS 20
> +
> +#define BINS_AV 4 /* Has to be a power of 2 */
> +
> +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> +
> +#define NUM_PRIORITIES          2
> +
> +#define NUM_NODES         31 /* Any reasonable prime number should work */
> +
> +/* Processor Power State */
> +enum freq_val {
> +	LOW,
> +	MED,
> +	HGH,
> +	NUM_FREQ = NUM_FREQS
> +};
> +
> +
> +/* Queue Polling State */
> +enum queue_state {
> +	TRAINING, /* NO TRAFFIC */
> +	MED_NORMAL,   /* MED */
> +	HGH_BUSY,     /* HIGH */
> +	LOW_PURGE,    /* LOW */
> +};
> +
> +/* Queue Stats */
> +struct freq_threshold {
> +
> +	uint64_t base_edpi;
> +	bool trained;
> +	uint32_t threshold_percent;
> +	uint32_t cur_train_iter;
> +};
> +
> +/* Per-worker-thread empty poll stats */
> +struct priority_worker {
> +
> +	/* Current dequeue and throughput counts */
> +	/* These 2 are written to by the worker threads */
> +	/* So keep them on their own cache line */
> +	uint64_t empty_dequeues;
> +	uint64_t num_dequeue_pkts;
> +
> +	enum queue_state queue_state;
> +
> +	uint64_t empty_dequeues_prev;
> +	uint64_t num_dequeue_pkts_prev;
> +
> +	/* Used for training only */
> +	struct freq_threshold thresh[NUM_FREQ];
> +	enum freq_val cur_freq;
> +
> +	/* bucket arrays to calculate the averages */
> +	uint64_t edpi_av[BINS_AV];
> +	uint32_t  ec;
> +	uint64_t ppi_av[BINS_AV];
> +	uint32_t  pc;
> +
> +	uint32_t lcore_id;
> +	uint32_t iter_counter;
> +	uint32_t threshold_ctr;
> +	uint32_t display_ctr;
> +	uint8_t  dev_id;
> +
> +} __rte_cache_aligned;
> +
> +
> +struct stats_data {
> +
> +	struct priority_worker wrk_stats[NUM_NODES];
> +
> +	/* flag to stop rx threads processing packets until training over */
> +	bool start_rx;
> +
> +};
> +
> +/* Empty Poll Parameters */
> +struct ep_params {
> +
> +	/* Timer related stuff */
> +	uint64_t interval_ticks;
> +	uint32_t max_train_iter;
> +
> +	struct rte_timer timer0;
> +	struct stats_data wrk_data;
> +};
> +
> +
> +/**
> + * Initialize the power management system.
> + *
> + * @param void
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int rte_power_empty_poll_stat_init(void);
> +
> +/**
> + * Free the resources held by the power management system.
> + */
> +void rte_power_empty_poll_stat_free(void);
> +
> +/**
> + * Update specific core empty poll counter
> + * It's not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int rte_power_empty_poll_stat_update(unsigned int lcore_id);
> +
> +/**
> + * Update specific core valid poll counter, not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id.
> + * @param nb_pkt
> + *  The packet number of one valid poll.
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> +
> +/**
> + * Fetch specific core empty poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore empty poll counter value.
> + */
> +uint64_t rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Fetch specific core valid poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore valid poll counter value.
> + */
> +uint64_t rte_power_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Allow the user to customize the frequency index of a power state.
> + *
> + * @param  index
> + *  The type of the queue power state
> + * @param  limit
> + *  The processor P-state offset value for that power state.
> + */
> +void rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> +
> +/**
> + * Set up the timer/callback that processes the empty poll counters.
> + * Default timer resolution is 10ms.
> + */
> +void rte_power_empty_poll_setup_timer(void);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index 96dc42e..8e393e0 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -25,4 +25,16 @@ DPDK_17.11 {
>   	rte_power_freq_enable_turbo;
>   	rte_power_turbo_status;
>   
> -} DPDK_2.0;
> \ No newline at end of file
> +} DPDK_2.0;
> +
> +DPDK_18.08 {
> +	global:
> +
> +	rte_power_empty_poll_set_freq;
> +	rte_power_empty_poll_setup_timer;
> +	rte_power_empty_poll_stat_free;
> +	rte_power_empty_poll_stat_init;
> +	rte_power_empty_poll_stat_update;
> +	rte_power_poll_stat_update;
> +
> +} DPDK_17.11;
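
The busyness estimate computed in update_stats() above is easier to follow
when pulled out on its own. The following is a simplified sketch of that
calculation, not the patch code: struct ep_window and busy_percent() are
hypothetical names, BINS mirrors BINS_AV, and integer arithmetic is used here
where the patch uses float, so rounding may differ slightly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BINS 4	/* same window size as BINS_AV in the patch */

struct ep_window {
	uint64_t bins[BINS];	/* empty polls in each of the last BINS intervals */
	uint32_t next;		/* running insert counter */
};

/*
 * Store the latest per-interval empty poll count and return a busyness
 * estimate in percent, relative to the idle baseline found in training.
 * Returns 1000 (an out-of-range value, as the patch does) when the sample
 * exceeds the baseline; such a sample is not stored.
 */
static uint32_t
busy_percent(struct ep_window *w, uint64_t cur_empty, uint64_t baseline)
{
	uint64_t avg = 0;
	uint32_t i;

	assert(w != NULL && baseline > 0);

	if (cur_empty > baseline)
		return 1000;

	w->bins[w->next++ % BINS] = cur_empty;

	for (i = 0; i < BINS; i++)
		avg += w->bins[i];
	avg /= BINS;

	/* Fewer empty polls than the idle baseline means a busier core. */
	return 100 - (uint32_t)(avg * 100 / baseline);
}
```

Because the window starts out zeroed, the first few intervals after a state
change report the core as busier than it is until all BINS slots are filled.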

Acked-by: David Hunt <david.hunt@intel.com>


* Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-26 11:40         ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Radu Nicolau
  2018-06-26 11:40           ` [dpdk-dev] [PATCH v4 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
  2018-06-26 13:03           ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Hunt, David
@ 2018-06-27 17:33           ` Kevin Traynor
  2018-07-05 14:45             ` Liang, Ma
  2018-09-11  9:19             ` Hunt, David
  2018-07-10 16:04           ` [dpdk-dev] [PATCH v5 " Radu Nicolau
  3 siblings, 2 replies; 79+ messages in thread
From: Kevin Traynor @ 2018-06-27 17:33 UTC (permalink / raw)
  To: Radu Nicolau, dev, liang.j.ma; +Cc: david.hunt

On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> From: Liang Ma <liang.j.ma@intel.com>
> 
> 1. Abstract
> 
> For packet processing workloads such as DPDK, polling is continuous.
> This means CPU cores always show 100% busy, independent of how much work
> those cores are doing. Accurately determining how busy a core is matters
> for the following reasons:
> 
>    * There is no indication of overload conditions.
> 
>    * Users do not know how much real load is on a system, which results
>      in wasted energy because no power management is utilized.
> 
> Schemes that were tried and failed include calculating the cycles
> required from the load on the core, in other words the busyness: for
> example, working out how many cycles it costs to handle each packet and
> determining the frequency cost per core. Because traffic, frame types
> and per-packet cycle costs all vary, this mechanism quickly becomes
> complex, whereas a simple scheme is required to solve the problems.
> 
> 2. Proposed solution
> 
> For all polling mechanisms, the proposed solution focuses on how many
> times an empty poll is executed instead of calculating how many cycles it
> costs to handle each packet. A lower empty poll count means the current
> core is busy processing its workload, so a higher frequency is needed. A
> high empty poll count indicates the current core has plenty of spare
> time, so the frequency can be lowered.
> 

Hi Liang/Radu,

I can see the benefit of providing an API for the application to report
the number of packets received by each poll, and then having the library
step the freq down/up based on that. However, I'm not sure I follow why
you are adding the complexity of defining power states and training modes.

> 2.1 Power state definition:
> 
> 	LOW:  the frequency is used for purge mode.
> 
> 	MED:  the frequency is used to process modest traffic workload.
> 
> 	HIGH: the frequency is used to process busy traffic workload.
> 

Why does there need to be user defined freq levels? Why not just keep
stepping down the freq until there is some user-defined threshold of
zero polls reached. e.g. keep stepping down until 10% of polls are zero
poll and have a tail of some time (perhaps user defined) for the step down.

> 2.2 There are two phases to establish the power management system:
> 
> 	a.Initialization/Training phase. There is no traffic pass-through,
> 	  the system will test average empty poll numbers  with
> 	  LOW/MED/HIGH  power state. Those average empty poll numbers
> 	  will be the baseline
> 	  for the normal phase. The system will collect all core's counter
> 	  every 100ms. The Training phase will take 5 seconds.
> 

This is requiring an application to sit for 5 secs in order to train and
align poll numbers with states? That doesn't seem realistic to me.

> 	b.Normal phase. When the real traffic pass-though, the system will
> 	  compare run-time empty poll moving average value with base line
> 	  then make decision to move to HIGH power state of MED  power
> 	  state. The system will collect all core's counter every 10ms.
> 

I only reviewed this commit msg and API usage, so maybe I didn't fully
get the use case or details, but it seems quite awkward from an
application perspective IMHO.

> 3. Proposed  API
> 
> 1.  rte_power_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>  
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resource hold by power management system.
>  
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>  
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>  

I think 4 could be dropped and 3 used instead. It could be a simple API
that takes in the core and nb_pkts from a poll. Seems clearer than
making a separate API for a special value of nb_pkts (i.e. 0) and the
application having to check to know which API should be called.

> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>  
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
> 
> 7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> which allow user customize the frequency of power state.
> 
> 8.  rte_power_empty_poll_setup_timer(void);
> which is used to setup the timer/callback to process all above counter.
> 

The new API should be experimental

> ChangeLog:
> v2: fix some coding style issues
> v3: rename the filename, API name.
> v4: updated makefile and symbol list
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> Signed-off-by: Radu Nicolau <radu.nicolau@intel.com>
> ---
>  lib/librte_power/Makefile               |   5 +-
>  lib/librte_power/meson.build            |   5 +-
>  lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
>  lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
>  lib/librte_power/rte_power_version.map  |  14 +-
>  5 files changed, 742 insertions(+), 5 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_empty_poll.c
>  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> 

Is there any in-tree documentation planned?

Kevin.

* Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-27 17:33           ` Kevin Traynor
@ 2018-07-05 14:45             ` Liang, Ma
  2018-07-12 17:30               ` Thomas Monjalon
  2018-09-11  9:19             ` Hunt, David
  1 sibling, 1 reply; 79+ messages in thread
From: Liang, Ma @ 2018-07-05 14:45 UTC (permalink / raw)
  To: Kevin Traynor; +Cc: Radu Nicolau, dev, david.hunt

On 27 Jun 18:33, Kevin Traynor wrote:
> On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> > From: Liang Ma <liang.j.ma@intel.com>
> > 
> > 1. Abstract
> > 
> > For packet processing workloads such as DPDK polling is continuous.
> > This means CPU cores always show 100% busy independent of how much work
> > those cores are doing. It is critical to accurately determine how busy
> > a core is hugely important for the following reasons:
> > 
> >    * No indication of overload conditions
> > 
> >    * User do not know how much real load is on a system meaning resulted in
> >      wasted energy as no power management is utilized
> > 
> > Tried and failed schemes include calculating the cycles required from
> > the load on the core, in other words the busyness. For example,
> > how many cycles it costs to handle each packet and determining the
> > frequency cost per core. Due to the varying nature of traffic, types of
> > frames and cost in cycles to process, this mechanism becomes complex
> > quickly where a simple scheme is required to solve the problems.
> > 
> > 2. Proposed solution
> > 
> > For all polling mechanism, the proposed solution focus on how many times
> > empty poll executed instead of calculating how many cycles it cost to
> > handle each packet. The less empty poll number means current core is busy
> > with processing workload, therefore,  the higher frequency is needed. The
> > high empty poll number indicate current core has lots spare time,
> > therefore, we can lower the frequency.
> > 
> 
> Hi Liang/Radu,
> 
> I can see the benefit of providing an API for the application to provide
> the num rx from each poll, and then have the library step down/up the
> freq based on that. However, not sure I follow why you are adding the
> complexity of defining power states and training modes.
> 
> > 2.1 Power state definition:
> > 
> > 	LOW:  the frequency is used for purge mode.
> > 
> > 	MED:  the frequency is used to process modest traffic workload.
> > 
> > 	HIGH: the frequency is used to process busy traffic workload.
> > 
> 
> Why does there need to be user defined freq levels? Why not just keep
> stepping down the freq until there is some user-defined threshold of
> zero polls reached. e.g. keep stepping down until 10% of polls are zero
> poll and have a tail of some time (perhaps user defined) for the step down.
Transferring from one P-state to another requires an MSR update, which is
expensive, and swapping states too many times will disturb worker core
performance.
> 
> > 2.2 There are two phases to establish the power management system:
> > 
> > 	a.Initialization/Training phase. There is no traffic pass-through,
> > 	  the system will test average empty poll numbers  with
> > 	  LOW/MED/HIGH  power state. Those average empty poll numbers
> > 	  will be the baseline
> > 	  for the normal phase. The system will collect all core's counter
> > 	  every 100ms. The Training phase will take 5 seconds.
> > 
> 
> This is requiring an application to sit for 5 secs in order to train and
> align poll numbers with states? That doesn't seem realistic to me.
Because each CPU SKU has a different configuration, micro-architecture,
cache size, number of power states, etc., it has to be tested in the
Training phase to find the baseline. A simple app can block RX for the
first 5 secs.
> 
> > 	b.Normal phase. When the real traffic pass-though, the system will
> > 	  compare run-time empty poll moving average value with base line
> > 	  then make decision to move to HIGH power state of MED  power
> > 	  state. The system will collect all core's counter every 10ms.
> > 
> 
> I only reviewed this commit msg and API usage, so maybe I didn't fully
> get the use case or details, but it seems quite awkward from an
> application perspective IMHO.
> 
> > 3. Proposed  API
> > 
> > 1.  rte_power_empty_poll_stat_init(void);
> > which is used to initialize the power management system.
> >  
> > 2.  rte_power_empty_poll_stat_free(void);
> > which is used to free the resource hold by power management system.
> >  
> > 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> > which is used to update specific core empty poll counter, not thread safe
> >  
> > 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > which is used to update specific core valid poll counter, not thread safe
> >  
> 
> I think 4 could be dropped and 3 used instead. It could be a simple API
> that takes in the core and nb_pkts from a poll. Seems clearer than
> making a separate API for a special value of nb_pkts (i.e. 0) and the
> application having to check to know which API should be called.
Agree.
> 
> > 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core empty poll counter.
> >  
> > 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core valid poll counter.
> > 
> > 7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > which allow user customize the frequency of power state.
> > 
> > 8.  rte_power_empty_poll_setup_timer(void);
> > which is used to setup the timer/callback to process all above counter.
> > 
> 
> The new API should be experimental
> 
> > ChangeLog:
> > v2: fix some coding style issues
> > v3: rename the filename, API name.
> > v4: updated makefile and symbol list
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > Signed-off-by: Radu Nicolau <radu.nicolau@intel.com>
> > ---
> >  lib/librte_power/Makefile               |   5 +-
> >  lib/librte_power/meson.build            |   5 +-
> >  lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
> >  lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
> >  lib/librte_power/rte_power_version.map  |  14 +-
> >  5 files changed, 742 insertions(+), 5 deletions(-)
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> > 
> 
> Is there any in-tree documentation planned?
> 
> Kevin.

* [dpdk-dev] [PATCH v5 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-26 11:40         ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Radu Nicolau
                             ` (2 preceding siblings ...)
  2018-06-27 17:33           ` Kevin Traynor
@ 2018-07-10 16:04           ` Radu Nicolau
  2018-07-10 16:04             ` [dpdk-dev] [PATCH v5 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  3 siblings, 2 replies; 79+ messages in thread
From: Radu Nicolau @ 2018-07-10 16:04 UTC (permalink / raw)
  To: dev; +Cc: liang.j.ma, david.hunt, Radu Nicolau

From: Liang Ma <liang.j.ma@intel.com>

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core is matters
for the following reasons:

   * There is no indication of overload conditions.

   * Users do not know how much real load is on a system, which results in
     wasted energy because no power management is utilized.

Schemes that were tried and failed include calculating the cycles required
from the load on the core, in other words the busyness: for example,
measuring how many cycles it costs to handle each packet and determining
the frequency cost per core. Because traffic, frame types and per-packet
cycle cost all vary, this mechanism becomes complex quickly, whereas a
simple scheme is required to solve the problems.

2. Proposed solution

For all polling mechanisms, the proposed solution focuses on how many
empty polls are executed instead of calculating how many cycles it costs
to handle each packet. A low empty poll count means the current core is
busy processing its workload, so a higher frequency is needed. A high
empty poll count indicates the current core has a lot of spare time, so
the frequency can be lowered.
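
As a rough illustration of the signal described above, the sketch below
counts empty and non-empty polls and reports the share of polls that were
empty. It is a simplification, not the code in this patch: the patch
compares empty polls per interval against a trained baseline rather than
against the total poll count. The names poll_counters, record_poll and
empty_poll_percent are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core counters; names are illustrative only. */
struct poll_counters {
	uint64_t empty_polls;	/* polls that returned zero packets */
	uint64_t valid_polls;	/* polls that returned at least one packet */
	uint64_t packets;	/* total packets seen across valid polls */
};

/* Record the result of one RX poll that returned nb_rx packets. */
static void
record_poll(struct poll_counters *c, uint16_t nb_rx)
{
	assert(c != NULL);

	if (nb_rx == 0) {
		c->empty_polls++;
	} else {
		c->valid_polls++;
		c->packets += nb_rx;
	}
}

/*
 * Percentage (0..100) of recorded polls that were empty.
 * Returns 0 when no poll has been recorded yet.
 */
static uint32_t
empty_poll_percent(const struct poll_counters *c)
{
	uint64_t total;

	assert(c != NULL);

	total = c->empty_polls + c->valid_polls;
	if (total == 0)
		return 0;

	return (uint32_t)(c->empty_polls * 100 / total);
}
```

A core that reports a high percentage here has spare time and is a
candidate for a lower frequency; a low percentage suggests the opposite.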

2.1 Power state definition:

	LOW:  the frequency is used for purge mode.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a. Initialization/Training phase. No traffic passes through; the
	  system measures the average empty poll count in each of the
	  LOW/MED/HIGH power states. Those averages become the baseline
	  for the normal phase. The system collects every core's counters
	  every 100ms. The Training phase takes 5 seconds.

	b. Normal phase. When real traffic passes through, the system
	  compares the run-time moving average of empty polls with the
	  baseline, then decides whether to move to the HIGH power state
	  or the MED power state. The system collects every core's
	  counters every 10ms.
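
The two phases can be modelled in a few lines of standalone C. This is a
hedged sketch, not the library code: core_model, train_baseline,
busy_percent and step are hypothetical names, integer arithmetic replaces
the float math in the patch, and only the MED and HIGH states are
modelled. The constants mirror the defaults in rte_power_empty_poll.c
(4 averaging bins, 70%/30% thresholds, 100 intervals of sustained
crossing, 0.05% baseline margin).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BINS 4			/* moving-average window (BINS_AV) */
#define MED_TO_HIGH_PCT 70	/* busy percent above which MED goes HIGH */
#define HIGH_TO_MED_PCT 30	/* busy percent below which HIGH goes MED */
#define SUSTAIN_INTERVALS 100	/* crossings required before switching */
#define OUT_OF_RANGE 1000u	/* sentinel: sample exceeds the baseline */

enum level { LVL_MED, LVL_HIGH };

struct core_model {
	uint64_t baseline[2];	/* trained empty polls per interval */
	uint64_t bins[BINS];	/* recent empty-poll samples */
	uint32_t next_bin;
	uint32_t sustain;	/* consecutive threshold crossings */
	enum level cur;
};

/* Training: mean empty polls per interval plus a 0.05% margin. */
static uint64_t
train_baseline(const uint64_t *samples, uint32_t n)
{
	uint64_t sum = 0;
	uint32_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	sum /= n;

	return sum + sum / 2000;
}

/* Normal phase: busy percent derived from the moving average. */
static uint32_t
busy_percent(struct core_model *m, uint64_t empty_polls)
{
	uint64_t base = m->baseline[m->cur];
	uint64_t avg = 0;
	uint32_t i;

	if (base == 0 || empty_polls > base)
		return OUT_OF_RANGE;

	m->bins[m->next_bin++ % BINS] = empty_polls;
	for (i = 0; i < BINS; i++)
		avg += m->bins[i];
	avg /= BINS;

	return (uint32_t)(100 - avg * 100 / base);
}

/* Feed one interval's empty-poll count; returns the resulting level. */
static enum level
step(struct core_model *m, uint64_t empty_polls)
{
	uint32_t pct = busy_percent(m, empty_polls);
	int crossed;

	if (pct > 100)
		return m->cur;	/* ignore out-of-range samples */

	crossed = (m->cur == LVL_MED) ?
		(pct > MED_TO_HIGH_PCT) : (pct < HIGH_TO_MED_PCT);
	if (!crossed) {
		m->sustain = 0;
		return m->cur;
	}
	if (m->sustain < SUSTAIN_INTERVALS) {
		m->sustain++;
		return m->cur;
	}

	/* Sustained crossing: switch level and restart the averages. */
	m->cur = (m->cur == LVL_MED) ? LVL_HIGH : LVL_MED;
	m->sustain = 0;
	m->next_bin = 0;
	memset(m->bins, 0, sizeof(m->bins));

	return m->cur;
}
```

Requiring a sustained crossing before switching keeps the number of
P-state transitions low, which matters because each transition is
comparatively expensive.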

3. Proposed  API

1.  rte_power_empty_poll_stat_init(void);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread
safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread
safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
which allows the user to customize the frequency of a power state.

8.  rte_power_empty_poll_setup_timer(void);
which is used to set up the timer/callback that processes all of the
above counters.
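
For illustration only, a worker's RX loop might choose between calls 3
and 4 as sketched below. So that the sketch builds without DPDK, the two
stub_* functions are local stand-ins that only mimic the counting done by
rte_power_empty_poll_stat_update() and rte_power_poll_stat_update();
report_rx_burst is a hypothetical helper and is not part of this patch.

```c
#include <stdint.h>

#define MAX_CORES 31	/* mirrors NUM_NODES in the patch */

/* Stand-in counters so the sketch builds without DPDK. */
static uint64_t empty_polls[MAX_CORES];
static uint64_t valid_pkts[MAX_CORES];

/* Stand-in for call 3, rte_power_empty_poll_stat_update(). */
static int
stub_empty_poll_update(unsigned int lcore_id)
{
	if (lcore_id >= MAX_CORES)
		return -1;
	empty_polls[lcore_id]++;
	return 0;
}

/* Stand-in for call 4, rte_power_poll_stat_update(). */
static int
stub_poll_update(unsigned int lcore_id, uint8_t nb_pkt)
{
	if (lcore_id >= MAX_CORES)
		return -1;
	valid_pkts[lcore_id] += nb_pkt;
	return 0;
}

/*
 * Hypothetical helper: call once per RX burst with the number of
 * packets the burst returned. An empty burst goes to call 3, any
 * other burst goes to call 4.
 */
static int
report_rx_burst(unsigned int lcore_id, uint16_t nb_rx)
{
	if (nb_rx == 0)
		return stub_empty_poll_update(lcore_id);

	/* Call 4 takes a uint8_t, so larger bursts are clamped here. */
	if (nb_rx > UINT8_MAX)
		nb_rx = UINT8_MAX;

	return stub_poll_update(lcore_id, (uint8_t)nb_rx);
}
```

Because neither update call is thread safe, each lcore would report only
its own bursts.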

ChangeLog:
v2: fix some coding style issues
v3: rename the filename, API name.
v4: updated makefile and symbol list
v5: marked API as experimental

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Signed-off-by: Radu Nicolau <radu.nicolau@intel.com>
Acked-by: David Hunt <david.hunt@intel.com>
---
 lib/librte_power/Makefile               |   5 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 214 +++++++++++++
 lib/librte_power/rte_power_version.map  |  15 +-
 5 files changed, 755 insertions(+), 5 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..9bec668 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -7,7 +7,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 LIB = librte_power.a
 
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +16,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..923226b
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,521 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR  5
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi)
+		return 1000UL; /* Value to make us fail */
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, HGH_BUSY);
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+			poll_stats->thresh[poll_stats->cur_freq].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_trainning(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+static void
+empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_trainning(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int
+rte_power_empty_poll_stat_init(void)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	freq_index[LOW] = 14;
+	freq_index[MED] = 9;
+	freq_index[HGH] = 1;
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
+
+void
+rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit)
+{
+	switch (index) {
+
+	case LOW:
+		freq_index[LOW] = limit;
+		break;
+
+	case MED:
+		freq_index[MED] = limit;
+		break;
+
+	case HGH:
+		freq_index[HGH] = limit;
+		break;
+	default:
+		break;
+	}
+}
+
+void
+rte_power_empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			empty_poll_detection,
+			(void *)ep_ptr);
+
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..e276ee6
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,214 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ *
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ */
+
+#include <stdint.h>
+#include <stdbool.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* Has to be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         31 /* Any reasonable prime number should work */
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Empty poll stats for each worker thread */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param void
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(void);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The packet number of one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Allow the user to customize the index of a power state.
+ *
+ * @param  index
+ *  The type of the queue power state
+ * @param  limit
+ *  The processor P-state offset value for that power state.
+ */
+void __rte_experimental
+rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
+
+/**
+ * Setup the timer/callback to process empty poll counter.
+ * Default timer resolution is 10ms.
+ */
+void __rte_experimental
+rte_power_empty_poll_setup_timer(void);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index 96dc42e..08470c6 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -25,4 +25,17 @@ DPDK_17.11 {
 	rte_power_freq_enable_turbo;
 	rte_power_turbo_status;
 
-} DPDK_2.0;
\ No newline at end of file
+} DPDK_2.0;
+
+EXPERIMENTAL {
+	global:
+
+	rte_power_empty_poll_set_freq;
+	rte_power_empty_poll_setup_timer;
+	rte_power_empty_poll_stat_fetch;
+	rte_power_empty_poll_stat_free;
+	rte_power_empty_poll_stat_init;
+	rte_power_empty_poll_stat_update;
+	rte_power_poll_stat_fetch;
+	rte_power_poll_stat_update;
+};
-- 
2.7.5


* [dpdk-dev] [PATCH v5 2/2] examples/l3fwd-power: simple app update to support new API
  2018-07-10 16:04           ` [dpdk-dev] [PATCH v5 " Radu Nicolau
@ 2018-07-10 16:04             ` Radu Nicolau
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  1 sibling, 0 replies; 79+ messages in thread
From: Radu Nicolau @ 2018-07-10 16:04 UTC (permalink / raw)
  To: dev; +Cc: liang.j.ma, david.hunt

From: Liang Ma <liang.j.ma@intel.com>

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll

Please refer to the l3fwd-power documentation for all parameters except
empty-poll.

Once empty-poll is enabled, the system starts in the training phase.
No traffic should pass through during the training phase.
When the training phase completes, the system moves to the normal phase.

The system starts in the modest power state.
If the system busyness percentage rises above 70%, the system moves to
the High power state. If the traffic drops (e.g. the system busyness
percentage falls below 30%), the system falls back to the modest power
state.

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.
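
The 70%/30% switching described above can be sketched as a small state
machine. This is only an illustration, not the library code: every demo_*
name is hypothetical, and the hold period of 100 samples (about one second
at the 10ms resolution) is an assumption made for the sketch.

```c
#include <stdint.h>

/* Hypothetical names for this sketch; not the library's identifiers. */
enum demo_state { DEMO_MED, DEMO_HIGH };

struct demo_core {
	enum demo_state state;
	uint32_t hold_ctr;	/* consecutive samples past the threshold */
};

#define DEMO_UP_PERCENT    70	/* MED -> HIGH when busyness exceeds this */
#define DEMO_DOWN_PERCENT  30	/* HIGH -> MED when busyness drops below this */
#define DEMO_HOLD_SAMPLES  100	/* assumed: 100 samples of 10ms, about 1s */

/* Feed one busyness sample (0..100); returns the state after the sample. */
static enum demo_state
demo_step(struct demo_core *c, uint32_t busy_percent)
{
	int past = (c->state == DEMO_MED) ?
		(busy_percent > DEMO_UP_PERCENT) :
		(busy_percent < DEMO_DOWN_PERCENT);

	if (!past) {
		c->hold_ctr = 0;	/* condition broken, start over */
		return c->state;
	}
	if (c->hold_ctr < DEMO_HOLD_SAMPLES) {
		c->hold_ctr++;		/* condition holds, keep waiting */
		return c->state;
	}
	/* condition held long enough: switch state and reset the counter */
	c->state = (c->state == DEMO_MED) ? DEMO_HIGH : DEMO_MED;
	c->hold_ctr = 0;
	return c->state;
}
```

The gap between the two thresholds, together with the hold counter, keeps a
core whose load hovers around one value from flipping between states on
every sample.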

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v4 no change
v5 no change

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Acked-by: David Hunt<david.hunt@intel.com>
---
 examples/l3fwd-power/main.c | 232 ++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 214 insertions(+), 18 deletions(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 596d645..953a2ed 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -42,6 +42,7 @@
 #include <rte_string_fns.h>
 #include <rte_timer.h>
 #include <rte_power.h>
+#include <rte_power_empty_poll.h>
 #include <rte_spinlock.h>
 
 #define RTE_LOGTYPE_L3FWD_POWER RTE_LOGTYPE_USER1
@@ -129,6 +130,9 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* empty-poll is disabled by default. */
+static bool empty_poll_on;
+volatile bool empty_poll_stop;
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -336,6 +340,10 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -344,7 +352,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -353,20 +369,23 @@ signal_exit_now(int sigtype)
 			ret = rte_power_exit(lcore_id);
 			if (ret)
 				rte_exit(EXIT_FAILURE, "Power management "
-					"library de-initialization failed on "
-							"core%u\n", lcore_id);
+						"library de-initialization failed on "
+						"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -831,6 +850,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 
 /* main processing loop */
 static int
@@ -1128,7 +1248,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1231,10 +1352,12 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
+		{"empty-poll", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
@@ -1259,7 +1382,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(LOW, limit);
+			break;
+		case 'm':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(MED, limit);
+			break;
+		case 'h':
+			limit = parse_portmask(optarg);
+			rte_power_empty_poll_set_freq(HGH, limit);
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1278,6 +1412,12 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1609,6 +1749,41 @@ static int check_ptype(uint16_t portid)
 
 }
 
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	rte_power_empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -1693,6 +1868,10 @@ main(int argc, char **argv)
 		if (dev_info.tx_offload_capa & DEV_TX_OFFLOAD_MBUF_FAST_FREE)
 			local_port_conf.txmode.offloads |=
 				DEV_TX_OFFLOAD_MBUF_FAST_FREE;
+
+		local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
+			dev_info.flow_type_rss_offloads;
+
 		ret = rte_eth_dev_configure(portid, nb_rx_queue,
 					(uint16_t)n_tx_queue, &local_port_conf);
 		if (ret < 0)
@@ -1780,14 +1959,15 @@ main(int argc, char **argv)
 				"Library initialization failed on core %u\n", lcore_id);
 
 		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
-		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
+		printf("\nInitializing rx queues on lcore %u ...\n", lcore_id);
 		fflush(stdout);
 		/* init RX queues */
 		for(queue = 0; queue < qconf->n_rx_queue; ++queue) {
@@ -1856,12 +2036,28 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true)
+		rte_power_empty_poll_stat_init();
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
-- 
2.7.5


* Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-07-05 14:45             ` Liang, Ma
@ 2018-07-12 17:30               ` Thomas Monjalon
  0 siblings, 0 replies; 79+ messages in thread
From: Thomas Monjalon @ 2018-07-12 17:30 UTC (permalink / raw)
  To: Liang, Ma, Radu Nicolau; +Cc: dev, Kevin Traynor, david.hunt

05/07/2018 16:45, Liang, Ma:
> On 27 Jun 18:33, Kevin Traynor wrote:
> > On 06/26/2018 12:40 PM, Radu Nicolau wrote:
> > > From: Liang Ma <liang.j.ma@intel.com>
> > > 
> > > 1. Abstract
> > > 
> > > For packet processing workloads such as DPDK polling is continuous.
> > > This means CPU cores always show 100% busy independent of how much work
> > > those cores are doing. It is critical to accurately determine how busy
> > > a core is hugely important for the following reasons:
> > > 
> > >    * No indication of overload conditions
> > > 
> > >    * User do not know how much real load is on a system meaning resulted in
> > >      wasted energy as no power management is utilized
> > > 
> > > Tried and failed schemes include calculating the cycles required from
> > > the load on the core, in other words the busyness. For example,
> > > how many cycles it costs to handle each packet and determining the
> > > frequency cost per core. Due to the varying nature of traffic, types of
> > > frames and cost in cycles to process, this mechanism becomes complex
> > > quickly where a simple scheme is required to solve the problems.
> > > 
> > > 2. Proposed solution
> > > 
> > > For all polling mechanism, the proposed solution focus on how many times
> > > empty poll executed instead of calculating how many cycles it cost to
> > > handle each packet. The less empty poll number means current core is busy
> > > with processing workload, therefore,  the higher frequency is needed. The
> > > high empty poll number indicate current core has lots spare time,
> > > therefore, we can lower the frequency.
> > > 
> > 
> > Hi Liang/Radu,
> > 
> > I can see the benefit of providing an API for the application to provide
> > the num rx from each poll, and then have the library step down/up the
> > freq based on that. However, not sure I follow why you are adding the
> > complexity of defining power states and training modes.
> > 
> > > 2.1 Power state definition:
> > > 
> > > 	LOW:  the frequency is used for purge mode.
> > > 
> > > 	MED:  the frequency is used to process modest traffic workload.
> > > 
> > > 	HIGH: the frequency is used to process busy traffic workload.
> > > 
> > 
> > Why does there need to be user defined freq levels? Why not just keep
> > stepping down the freq until there is some user-defined threshold of
> > zero polls reached. e.g. keep stepping down until 10% of polls are zero
> > poll and have a tail of some time (perhaps user defined) for the step down.
> Transferring from one P-state to another requires an MSR update, which is
> expensive. Swapping the state too many times will also disturb worker core
> performance.
> > 
> > > 2.2 There are two phases to establish the power management system:
> > > 
> > > 	a.Initialization/Training phase. There is no traffic pass-through,
> > > 	  the system will test average empty poll numbers  with
> > > 	  LOW/MED/HIGH  power state. Those average empty poll numbers
> > > 	  will be the baseline
> > > 	  for the normal phase. The system will collect all core's counter
> > > 	  every 100ms. The Training phase will take 5 seconds.
> > > 
> > 
> > This is requiring an application to sit for 5 secs in order to train and
> > align poll numbers with states? That doesn't seem realistic to me.
> Because each CPU SKU has a different configuration, micro-arch, cache size,
> number of power states etc., it has to be tested in the training phase to
> find the baseline. A simple app can block RX for the first 5 secs.
> > 
> > > 	b.Normal phase. When the real traffic pass-though, the system will
> > > 	  compare run-time empty poll moving average value with base line
> > > 	  then make decision to move to HIGH power state of MED  power
> > > 	  state. The system will collect all core's counter every 10ms.
> > > 
> > 
> > I only reviewed this commit msg and API usage, so maybe I didn't fully
> > get the use case or details, but it seems quite awkward from an
> > application perspective IMHO.
> > 
> > > 3. Proposed  API
> > > 
> > > 1.  rte_power_empty_poll_stat_init(void);
> > > which is used to initialize the power management system.
> > >  
> > > 2.  rte_power_empty_poll_stat_free(void);
> > > which is used to free the resource hold by power management system.
> > >  
> > > 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> > > which is used to update specific core empty poll counter, not thread safe
> > >  
> > > 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > > which is used to update specific core valid poll counter, not thread safe
> > >  
> > 
> > I think 4 could be dropped and 3 used instead. It could be a simple API
> > that takes in the core and nb_pkts from a poll. Seems clearer than
> > making a separate API for a special value of nb_pkts (i.e. 0) and the
> > application having to check to know which API should be called.
> Agree.
> > 
> > > 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> > > which is used to get specific core empty poll counter.
> > >  
> > > 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> > > which is used to get specific core valid poll counter.
> > > 
> > > 7.  rte_power_empty_poll_set_freq(enum freq_val index, uint32_t limit);
> > > which allow user customize the frequency of power state.
> > > 
> > > 8.  rte_power_empty_poll_setup_timer(void);
> > > which is used to setup the timer/callback to process all above counter.
> > > 
> > 
> > The new API should be experimental
> > 
> > > ChangeLog:
> > > v2: fix some coding style issues
> > > v3: rename the filename, API name.
> > > v4: updated makefile and symbol list
> > > 
> > > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > > Signed-off-by: Radu Nicolau <radu.nicolau@intel.com>
> > > ---
> > >  lib/librte_power/Makefile               |   5 +-
> > >  lib/librte_power/meson.build            |   5 +-
> > >  lib/librte_power/rte_power_empty_poll.c | 521 ++++++++++++++++++++++++++++++++
> > >  lib/librte_power/rte_power_empty_poll.h | 202 +++++++++++++
> > >  lib/librte_power/rte_power_version.map  |  14 +-
> > >  5 files changed, 742 insertions(+), 5 deletions(-)
> > >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> > >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> > > 
> > 
> > Is there any in-tree documentation planned?

You did not reply to this question.
Usually, a new API must be provided with documentation in the programmer's guide.

It would be interesting to have an opinion about this complicated API
outside of Intel.

This feature should wait for 18.11.


* [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control
  2018-07-10 16:04           ` [dpdk-dev] [PATCH v5 " Radu Nicolau
  2018-07-10 16:04             ` [dpdk-dev] [PATCH v5 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
@ 2018-08-31 15:04             ` Liang Ma
  2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
                                 ` (5 more replies)
  1 sibling, 6 replies; 79+ messages in thread
From: Liang Ma @ 2018-08-31 15:04 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core is matters
for the following reasons:

   * Without it, there is no indication of overload conditions

   * Users do not know how much real load is on a system, which results in
     wasted energy as no power management is utilized

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core
frequency. As a result, the application does not stop polling the device,
which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the
core frequency (including turbo) to make a best effort for intensive traffic.
This gives us more flexible and balanced traffic awareness than the standard
l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many times empty polls are executed.
A low number of empty polls means the current core is busy processing its
workload, so a higher frequency is needed. A high number of empty polls
indicates the current core is not doing any real work, so we can lower the
frequency to save power.

In the current implementation, each core has 1 empty-poll counter, which
assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
future to support multiple queues per core.

2.1 Power state definition:

	LOW:  Not currently used, reserved for future use.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a.Initialization/Training phase. The training phase is necessary
	  in order to figure out the system polling baseline numbers from
	  idle to busy. The highest poll count will be during idle, where all
	  polls are empty. These poll counts will be different between
	  systems due to the many possible processor micro-arch, cache
	  and device configurations, hence the training phase.
	  In the training phase, traffic is blocked so the training algorithm
	  can average the empty-poll numbers for the LOW, MED and
	  HIGH power states in order to create a baseline.
	  The cores' counters are collected every 10ms, and the training
	  phase will take 2 seconds.

	b.Normal phase. When the training phase is complete, traffic is
	  started. The run-time poll counts are compared with the
	  baseline and the decision will be taken to move to MED power
	  state or HIGH power state. The counters are calculated every 10ms.
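
As a rough illustration of the two phases, the sketch below averages idle
empty-poll counts into a baseline and then turns a run-time count into a
busyness percentage. The demo_* names are hypothetical, and reporting 0 when
the run-time count exceeds the baseline is a simplification made for this
sketch rather than the behaviour of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Training: average the per-interval empty-poll counts seen while idle. */
static uint64_t
demo_baseline(const uint64_t *idle_counts, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	if (n == 0)
		return 0;
	for (i = 0; i < n; i++)
		sum += idle_counts[i];
	return sum / n;
}

/*
 * Normal phase: fewer empty polls than the baseline means a busier core.
 * Returns 0 when there is no baseline or the count exceeds it.
 */
static uint32_t
demo_busyness(uint64_t empty_polls, uint64_t baseline)
{
	if (baseline == 0 || empty_polls >= baseline)
		return 0;
	return (uint32_t)(100 - (empty_polls * 100) / baseline);
}
```

For example, with a baseline of 1000 empty polls per interval, an interval
with 250 empty polls gives 100 - 25 = 75% busy.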

3. Proposed  API

1.  rte_power_empty_poll_stat_init(void);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
which is the timer callback used to detect empty poll state changes.
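
To make the intent of calls 3 and 4 concrete, here is a minimal standalone
model of the per-core counters they maintain. It does not use the DPDK API:
the demo_* names and the fixed core count are hypothetical, and accumulating
the packet count on a valid poll is an assumption about what the counter
holds.

```c
#include <stdint.h>

#define DEMO_MAX_CORES 4	/* hypothetical size for the sketch */

struct demo_poll_stats {
	uint64_t empty_polls;	/* polls that returned no packets */
	uint64_t rx_packets;	/* packets returned by non-empty polls */
};

static struct demo_poll_stats demo_stats[DEMO_MAX_CORES];

/*
 * Record the outcome of one RX poll for a core.
 * Returns 0 on success, -1 on an out-of-range core id.
 * Like the proposed API, this is not thread safe: each core
 * is expected to update only its own entry.
 */
static int
demo_record_poll(unsigned int core, uint16_t nb_rx)
{
	if (core >= DEMO_MAX_CORES)
		return -1;
	if (nb_rx == 0)
		demo_stats[core].empty_polls++;
	else
		demo_stats[core].rx_packets += nb_rx;
	return 0;
}
```

A worker would call demo_record_poll() once after every RX burst, and the
periodic timer callback would read the counters to estimate busyness.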

ChangeLog:
v2: fix some coding style issues
v3: rename the filename, API name.
v4: no change
v5: no change
v6: re-work the code layout, update API

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 lib/librte_power/Makefile               |   6 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 500 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 205 +++++++++++++
 lib/librte_power/rte_power_version.map  |  13 +
 5 files changed, 725 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..a8f1301 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_power.a
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +17,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..3dac654
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,500 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR 2
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
+				"cur edpi %ld "
+				"base edpi %ld\n",
+				cur_edpi,
+				s->thresh[s->cur_freq].base_edpi);
+		/* Out-of-range value (> 100) so the caller skips this sample */
+		return 1000UL;
+	}
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)(((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[MED].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, HGH_BUSY);
+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
+			}
+
+		} else {
+			/* reset */
+			/* need debug log */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+				poll_stats->thresh[HGH].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			/* need debug log */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_training(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+void
+rte_empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_training(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	if (freq_tlb == NULL) {
+		freq_index[LOW] = 14;
+		freq_index[MED] = 9;
+		freq_index[HGH] = 1;
+	} else {
+		freq_index[LOW] = freq_tlb[LOW];
+		freq_index[MED] = freq_tlb[MED];
+		freq_index[HGH] = freq_tlb[HGH];
+	}
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	*eptr = ep_params;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void __rte_experimental
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..e43981f
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* Must be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         31 /* Any reasonable prime number should work */
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Each Worker Thread Empty Poll Stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param eptr
+ *   the structure of empty poll configuration
+ * @param freq_tlb
+ *   the power state/frequency mapping table
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The packet number of one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Empty poll  state change detection function
+ *
+ * @param  tim
+ *  The timer structure
+ * @param  arg
+ *  The customized parameter
+ */
+void  __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index dd587df..11ffdfb 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -33,3 +33,16 @@ DPDK_18.08 {
 	rte_power_get_capabilities;
 
 } DPDK_17.11;
+
+EXPERIMENTAL {
+        global:
+
+        rte_power_empty_poll_stat_init;
+        rte_power_empty_poll_stat_free;
+        rte_power_empty_poll_stat_update;
+        rte_power_empty_poll_stat_fetch;
+        rte_power_poll_stat_fetch;
+        rte_power_poll_stat_update;
+        rte_empty_poll_detection;
+
+};
-- 
2.7.5


* [dpdk-dev] [PATCH v6 2/4] examples/l3fwd-power: simple app update for new API
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
@ 2018-08-31 15:04               ` Liang Ma
  2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 3/4] doc/guides/proguides/power-man: update the power API Liang Ma
                                 ` (4 subsequent siblings)
  5 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-08-31 15:04 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

Add support for the new traffic-pattern-aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for all parameters except
--empty-poll.

The options -l, -m and -h set the frequency index for the LOW, MED and
HIGH power states. They are only meaningful when --empty-poll is enabled.

Once empty-poll is enabled, the system starts in the training phase.
No traffic should pass through during the training phase.
When the training phase completes, the system moves to the normal phase.
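
As an illustration of the training step, the idle baseline could be derived roughly as follows. This is a minimal sketch, not the library code: `struct train_state`, `train_sample` and `train_finish` are hypothetical names, and the 0.05% headroom mirrors the `base_edpi / 2000` adjustment in the library patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical training accumulator for one frequency level. */
struct train_state {
	uint64_t sum_edpi;  /* sum of empty polls per interval */
	uint32_t samples;   /* number of intervals accumulated */
	uint64_t base_edpi; /* resulting baseline, valid once trained */
	bool trained;
};

/* Add one interval's empty-poll count (a counter delta) to the sum. */
static void
train_sample(struct train_state *t, uint64_t empty_polls)
{
	t->sum_edpi += empty_polls;
	t->samples++;
}

/*
 * Average the collected samples and add 0.05% headroom so that a
 * fully idle core does not overshoot its own baseline.
 * Returns false if no samples were collected.
 */
static bool
train_finish(struct train_state *t)
{
	if (t->samples == 0)
		return false;

	t->base_edpi = t->sum_edpi / t->samples;
	t->base_edpi += t->base_edpi / 2000;
	t->trained = true;
	return true;
}
```

Because the baseline is an idle measurement, any traffic arriving during training lowers the empty-poll counts and skews the result, which is why traffic should be held back until training completes.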

The system starts out in the modest (MED) power state.
If the system busyness percentage rises above 70%, the system moves to
the HIGH power state. If traffic drops (e.g. the system busyness
percentage falls below 30%), the system falls back to the modest power
state.
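
The transition rule can be sketched as a small state machine. This is an illustrative sketch only: `struct scaler`, `scaler_step` and the three constants are hypothetical, and the one-second hold (100 intervals of 10ms) is an assumption modelled on the `threshold_ctr < INTERVALS_PER_SECOND` check in the library patch.

```c
#include <stdint.h>

#define UP_THRESHOLD_PCT   70  /* MED -> HIGH above this busyness */
#define DOWN_THRESHOLD_PCT 30  /* HIGH -> MED below this busyness */
#define HOLD_INTERVALS     100 /* assumed: 100 x 10ms = 1 second */

enum power_state { STATE_MED, STATE_HIGH };

/* Hypothetical per-core scaling state. */
struct scaler {
	enum power_state state;
	uint32_t ctr; /* consecutive intervals beyond the threshold */
};

/*
 * Feed one busyness sample (in percent) and return the resulting state.
 * A transition happens only after the threshold has been exceeded for
 * HOLD_INTERVALS consecutive samples; any sample back inside the
 * threshold resets the counter.
 */
static enum power_state
scaler_step(struct scaler *s, uint32_t busy_pct)
{
	int beyond = (s->state == STATE_MED) ?
		(busy_pct > UP_THRESHOLD_PCT) :
		(busy_pct < DOWN_THRESHOLD_PCT);

	if (!beyond) {
		s->ctr = 0;
		return s->state;
	}

	if (s->ctr < HOLD_INTERVALS) {
		s->ctr++;
		return s->state;
	}

	s->state = (s->state == STATE_MED) ? STATE_HIGH : STATE_MED;
	s->ctr = 0;
	return s->state;
}
```

The gap between the two thresholds plus the hold counter keeps a core from flapping between frequencies when traffic hovers around a single cut-off.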

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 examples/l3fwd-power/Makefile    |   3 +
 examples/l3fwd-power/main.c      | 253 ++++++++++++++++++++++++++++++++++++---
 examples/l3fwd-power/meson.build |   1 +
 3 files changed, 240 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
index d7e39a3..772ec7b 100644
--- a/examples/l3fwd-power/Makefile
+++ b/examples/l3fwd-power/Makefile
@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
 LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
 LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
 	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
 
@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
 all:
 else
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d15cd52..f1e254b 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_power_empty_poll.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -55,6 +56,8 @@
 
 /* 100 ms interval */
 #define TIMER_NUMBER_PER_SECOND           10
+/* (10ms) */
+#define INTERVALS_PER_SECOND             100
 /* 100000 us */
 #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
 #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
@@ -117,6 +120,9 @@
  */
 #define RTE_TEST_RX_DESC_DEFAULT 1024
 #define RTE_TEST_TX_DESC_DEFAULT 1024
+
+
+
 static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
 static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
 
@@ -132,6 +138,10 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+volatile bool empty_poll_stop;
+static struct  ep_params *ep_params;
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -331,6 +341,13 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static uint8_t  freq_tlb[] = {14, 9, 1};
+
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
+
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -339,7 +356,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -352,16 +377,19 @@ signal_exit_now(int sigtype)
 							"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -826,7 +854,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
 
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 /* main processing loop */
 static int
 main_loop(__attribute__((unused)) void *dummy)
@@ -1128,7 +1256,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1231,6 +1360,7 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
@@ -1238,13 +1368,14 @@ parse_args(int argc, char **argv)
 		{"high-perf-cores", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
+		{"empty-poll", 0, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
 	argvopt = argv;
 
-	while ((opt = getopt_long(argc, argvopt, "p:P",
+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
 				lgopts, &option_index)) != EOF) {
 
 		switch (opt) {
@@ -1261,7 +1392,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[LOW] = limit;
+			break;
+		case 'm':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[MED] = limit;
+			break;
+		case 'h':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[HGH] = limit;
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1300,6 +1442,12 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1647,6 +1795,59 @@ init_power_library(void)
 	}
 	return ret;
 }
+static void
+empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			rte_empty_poll_detection,
+			(void *)ep_ptr);
+
+}
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 
 int
 main(int argc, char **argv)
@@ -1829,13 +2030,15 @@ main(int argc, char **argv)
 		if (rte_lcore_is_enabled(lcore_id) == 0)
 			continue;
 
-		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			/* init timer structures for each enabled lcore */
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND,
+					SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
@@ -1906,12 +2109,28 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true)
+		rte_power_empty_poll_stat_init(&ep_params, freq_tlb);
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
index 20c8054..a3c5c2f 100644
--- a/examples/l3fwd-power/meson.build
+++ b/examples/l3fwd-power/meson.build
@@ -9,6 +9,7 @@
 if host_machine.system() != 'linux'
 	build = false
 endif
+allow_experimental_apis = true
 deps += ['power', 'timer', 'lpm', 'hash']
 sources = files(
 	'main.c', 'perf_core.c'
-- 
2.7.5


* [dpdk-dev] [PATCH v6 3/4] doc/guides/proguides/power-man: update the power API
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-08-31 15:04               ` Liang Ma
  2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
                                 ` (3 subsequent siblings)
  5 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-08-31 15:04 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

Update the documentation for the empty poll API.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 87 +++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index eba1cc6..d8a4ef7 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -106,6 +106,93 @@ User Cases
 
 The power management mechanism is used to save power when performing L3 forwarding.
 
+
+Empty Poll API
+--------------
+
+Abstract
+~~~~~~~~
+
+For packet processing workloads such as DPDK, polling is continuous.
+This means CPU cores always show 100% busy, independent of how much work
+those cores are doing. Accurately determining how busy a core is
+is therefore critical; without it:
+
+        * There is no indication of overload conditions.
+        * Users do not know how much real load is on a system, which
+          results in wasted energy as no power management is utilized.
+
+Compared to the original l3fwd-power design, instead of going to sleep
+after detecting an empty poll, the new mechanism just lowers the core frequency.
+As a result, the application does not stop polling the device, which leads
+to improved handling of bursts of traffic.
+
+When the system becomes busy, the empty poll mechanism can also increase the core
+frequency (including turbo) to make a best effort for intensive traffic. This gives
+us more flexible and balanced traffic awareness over the standard l3fwd-power
+application.
+
+
+Proposed Solution
+~~~~~~~~~~~~~~~~~
+The proposed solution focuses on how many times empty polls are executed.
+A low number of empty polls means the current core is busy processing its
+workload, therefore a higher frequency is needed. A high number of empty polls
+indicates the current core is not doing any real work, therefore we can lower
+the frequency to save power.
+
+In the current implementation, each core has 1 empty-poll counter, which assumes
+1 core is dedicated to 1 queue. This will need to be expanded in the future to
+support multiple queues per core.
+
+Power state definition:
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* LOW:  Not currently used, reserved for future use.
+
+* MED:  the frequency is used to process modest traffic workload.
+
+* HIGH: the frequency is used to process busy traffic workload.
+
+There are two phases to establish the power management system:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* Initialization/Training phase. The training phase is necessary
+  in order to figure out the system polling baseline numbers from
+  idle to busy. The highest poll count will be during idle, where all
+  polls are empty. These poll counts will be different between
+  systems due to the many possible processor micro-arch, cache
+  and device configurations, hence the training phase.
+  In the training phase, traffic is blocked so the training algorithm
+  can average the empty-poll numbers for the LOW, MED and
+  HIGH  power states in order to create a baseline.
+  Each core's counters are collected every 10ms, and the training
+  phase takes 2 seconds.
+
+* Normal phase. When the training phase is complete, traffic is
+  started. The run-time poll counts are compared with the
+  baseline and the decision will be taken to move to MED power
+  state or HIGH power state. The counters are calculated every
+  10ms.
+
+
+API Overview for Empty Poll Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **State Init**: initialize the power management system.
+
+* **State Free**: free the resources held by the power management system.
+
+* **Update Empty Poll Counter**: update the empty poll counter.
+
+* **Update Valid Poll Counter**: update the valid poll counter.
+
+* **Set the Frequency Index**: update the power state/frequency mapping.
+
+* **Detect empty poll state change**: empty poll state change detection algorithm.
+
+User Cases
+----------
+The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
+
 References
 ----------
 
-- 
2.7.5


* [dpdk-dev] [PATCH v6 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
  2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 3/4] doc/guides/proguides/power-man: update the power API Liang Ma
@ 2018-08-31 15:04               ` Liang Ma
  2018-09-04  1:11               ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Yao, Lei A
                                 ` (2 subsequent siblings)
  5 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-08-31 15:04 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

add empty poll mode command line example

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/sample_app_ug/l3_forward_power_man.rst | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 795a570..7bea0a8 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -362,3 +362,24 @@ The algorithm has the following sleeping behavior depending on the idle counter:
 If a thread polls multiple Rx queues and different queue returns different sleep duration values,
 the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
 in order to avoid a potential performance impact.
+
+Empty Poll Mode
+---------------
+Empty poll mode is a recently added mode. It can be enabled with the
+command option --empty-poll.
+
+See "Power Management" chapter in the DPDK Programmer's Guide for empty poll mode details.
+
+.. code-block:: console
+
+    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll -l 14 -m 9 -h 1
+
+Where,
+
+--empty-poll: Enable the empty poll mode instead of the original algorithm
+
+-l : optional, set up the LOW power state frequency index
+
+-m : optional, set up the MED power state frequency index
+
+-h : optional, set up the HIGH power state frequency index
-- 
2.7.5
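
Since -l, -m and -h each take a frequency index, an application may want to validate the value before storing it in the mapping table. The following is a hedged sketch: `parse_freq_index` and `MAX_FREQ_INDEX` are hypothetical names and not part of the patch, and the upper bound assumes indexes run from 0 to NUM_FREQS - 1 (NUM_FREQS is 20 in the library header).

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_FREQ_INDEX 19 /* assumed: NUM_FREQS (20) - 1 */

/*
 * Parse a decimal frequency index from a command-line argument.
 * Returns 0 and stores the value in *out on success, or -1 if the
 * argument is empty, has trailing characters, or is out of range.
 */
static int
parse_freq_index(const char *arg, uint8_t *out)
{
	char *end = NULL;
	unsigned long v;

	if (arg == NULL || *arg == '\0')
		return -1;

	errno = 0;
	v = strtoul(arg, &end, 10);
	if (errno != 0 || *end != '\0' || v > MAX_FREQ_INDEX)
		return -1;

	*out = (uint8_t)v;
	return 0;
}
```

A caller would typically reject the option and print the usage text when the helper returns -1, rather than silently truncating the value into a uint8_t.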


* Re: [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
                                 ` (2 preceding siblings ...)
  2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
@ 2018-09-04  1:11               ` Yao, Lei A
  2018-09-04  2:09               ` Yao, Lei A
  2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
  5 siblings, 0 replies; 79+ messages in thread
From: Yao, Lei A @ 2018-09-04  1:11 UTC (permalink / raw)
  To: Ma, Liang J, Hunt, David
  Cc: dev, Nicolau, Radu, Burakov, Anatoly, Geary, John



> -----Original Message-----
> From: Ma, Liang J
> Sent: Friday, August 31, 2018 11:04 PM
> To: Hunt, David <david.hunt@intel.com>
> Cc: dev@dpdk.org; Yao, Lei A <lei.a.yao@intel.com>; Nicolau, Radu
> <radu.nicolau@intel.com>; Burakov, Anatoly <anatoly.burakov@intel.com>;
> Geary, John <john.geary@intel.com>; Ma, Liang J <liang.j.ma@intel.com>
> Subject: [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control
> 
> 1. Abstract
> 
> For packet processing workloads such as DPDK, polling is continuous.
> This means CPU cores always show 100% busy, independent of how much
> work those cores are doing. Accurately determining how busy a core is
> matters because, without that information:
> 
>    * There is no indication of overload conditions
> 
>    * Users do not know how much real load is on a system, which results in
>      wasted energy as no power management is utilized
> 
> Compared to the original l3fwd-power design, instead of going to sleep
> after detecting an empty poll, the new mechanism just lowers the core
> frequency. As a result, the application does not stop polling the device,
> which leads to improved handling of bursts of traffic.
> 
> When the system becomes busy, the empty poll mechanism can also increase
> the core frequency (including turbo) on a best-effort basis for intensive
> traffic. This gives us more flexible and balanced traffic awareness than
> the standard l3fwd-power application.
> 
> 2. Proposed solution
> 
> The proposed solution focuses on how many times empty polls are executed.
> A lower number of empty polls means the current core is busy processing
> its workload, so a higher frequency is needed. A high number of empty
> polls indicates the current core is not doing any real work, so we can
> lower the frequency to save power.
> 
> In the current implementation, each core has 1 empty-poll counter, which
> assumes 1 core is dedicated to 1 queue. This will need to be expanded in
> the future to support multiple queues per core.
> 
> 2.1 Power state definition:
> 
> 	LOW:  Not currently used, reserved for future use.
> 
> 	MED:  the frequency is used to process modest traffic workload.
> 
> 	HIGH: the frequency is used to process busy traffic workload.
> 
> 2.2 There are two phases to establish the power management system:
> 
> 	a. Initialization/Training phase. The training phase is necessary
> 	   in order to figure out the system polling baseline numbers from
> 	   idle to busy. The highest poll count will be during idle, where all
> 	   polls are empty. These poll counts will be different between
> 	   systems due to the many possible processor micro-arch, cache
> 	   and device configurations, hence the training phase.
> 	   In the training phase, traffic is blocked so the training algorithm
> 	   can average the empty-poll numbers for the LOW, MED and
> 	   HIGH power states in order to create a baseline.
> 	   Each core's counters are collected every 10ms, and the training
> 	   phase takes 2 seconds.
> 
> 	b. Normal phase. When the training phase is complete, traffic is
> 	   started. The run-time poll counts are compared with the
> 	   baseline and a decision is taken to move to the MED power
> 	   state or the HIGH power state. The counters are calculated every 10ms.
> 
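For readers following the training description in 2.2(a), the baseline arithmetic can be sketched roughly as below. This is only an illustration under stated assumptions, not the patch's code: train_baseline() and its parameters are hypothetical names, each array element stands for the empty-poll delta seen in one 10ms interval, and the 0.05% margin mirrors the base_edpi / 2000 step in rte_power_empty_poll.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: average the per-interval empty-poll deltas
 * collected while traffic is blocked, then add a 0.05% margin so an
 * idle core does not read as slightly busy.
 */
static uint64_t
train_baseline(const uint64_t *empty_polls_per_interval, size_t n)
{
	uint64_t sum = 0;
	uint64_t base;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		sum += empty_polls_per_interval[i];

	base = sum / n;

	/* 0.05% of the average, i.e. base / 2000 */
	return base + base / 2000;
}
```

With 200 samples (2 seconds at 10ms per interval) the result would play the role of base_edpi for one power state; the same procedure would be repeated per state.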
> 3. Proposed  API
> 
> 1.  rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
> which is used to initialize the power management system.
> 
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resources held by the power management system.
> 
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update a specific core's empty poll counter; not thread safe.
> 
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update a specific core's valid poll counter; not thread safe.
> 
> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get a specific core's empty poll counter.
> 
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get a specific core's valid poll counter.
> 
> 7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> which is used to detect empty poll state changes.
> 
> ChangeLog:
> v2: fix some coding style issues
> v3: rename the filename, API name.
> v4: no change
> v5: no change
> v6: re-work the code layout, update API
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  lib/librte_power/Makefile               |   6 +-
>  lib/librte_power/meson.build            |   5 +-
>  lib/librte_power/rte_power_empty_poll.c | 500
> ++++++++++++++++++++++++++++++++
>  lib/librte_power/rte_power_empty_poll.h | 205 +++++++++++++
>  lib/librte_power/rte_power_version.map  |  13 +
>  5 files changed, 725 insertions(+), 4 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_empty_poll.c
>  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> 
> diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> index 6f85e88..a8f1301 100644
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
>  # library name
>  LIB = librte_power.a
> 
> +CFLAGS += -DALLOW_EXPERIMENTAL_API
>  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> -LDLIBS += -lrte_eal
> +LDLIBS += -lrte_eal -lrte_timer
> 
>  EXPORT_MAP := rte_power_version.map
> 
> @@ -16,8 +17,9 @@ LIBABIVER := 1
>  # all source are stored in SRCS-y
>  SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
>  SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
> 
>  # install this header file
> -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h rte_power_empty_poll.h
> 
>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 253173f..63957eb 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
>  	build = false
>  endif
>  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> -		'power_kvm_vm.c', 'guest_channel.c')
> -headers = files('rte_power.h')
> +		'power_kvm_vm.c', 'guest_channel.c',
> +		'rte_power_empty_poll.c')
> +headers = files('rte_power.h','rte_power_empty_poll.h')
> diff --git a/lib/librte_power/rte_power_empty_poll.c
> b/lib/librte_power/rte_power_empty_poll.c
> new file mode 100644
> index 0000000..3dac654
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.c
> @@ -0,0 +1,500 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +
> +#include "rte_power.h"
> +#include "rte_power_empty_poll.h"
> +
> +#define INTERVALS_PER_SECOND 100     /* (10ms) */
> +#define SECONDS_TO_TRAIN_FOR 2
> +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> +#define DEFAULT_CYCLES_PER_PACKET 800
> +
> +static struct ep_params *ep_params;
> +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> +
> +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> +
> +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> +
> +static uint32_t freq_index[NUM_FREQ];
> +
> +static uint32_t
> +get_freq_index(enum freq_val index)
> +{
> +	return freq_index[index];
> +}
> +
> +
> +static int
> +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> +{
> +	int err = 0;
> +	uint32_t power_freq_index;
> +	if (!specific_freq)
> +		power_freq_index = get_freq_index(freq);
> +	else
> +		power_freq_index = freq;
> +
> +	err = rte_power_set_freq(lcore_id, power_freq_index);
> +
> +	return err;
> +}
> +
> +
> +static inline void __attribute__((always_inline))
> +exit_training_state(struct priority_worker *poll_stats)
> +{
> +	RTE_SET_USED(poll_stats);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_training_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->cur_freq = LOW;
> +	poll_stats->queue_state = TRAINING;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_normal_state(struct priority_worker *poll_stats)
> +{
> +	/* Clear the averages arrays and strs */
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = MED;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = MED_NORMAL;
> +	set_power_freq(poll_stats->lcore_id, MED, false);
> +
> +	/* Try here */
> +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_busy_state(struct priority_worker *poll_stats)
> +{
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = HGH;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = HGH_BUSY;
> +	set_power_freq(poll_stats->lcore_id, HGH, false);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_purge_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->queue_state = LOW_PURGE;
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_state(struct priority_worker *poll_stats,
> +		enum queue_state new_state)
> +{
> +	enum queue_state old_state = poll_stats->queue_state;
> +	if (old_state != new_state) {
> +
> +		/* Call any old state exit functions */
> +		if (old_state == TRAINING)
> +			exit_training_state(poll_stats);
> +
> +		/* Call any new state entry functions */
> +		if (new_state == TRAINING)
> +			enter_training_state(poll_stats);
> +		if (new_state == MED_NORMAL)
> +			enter_normal_state(poll_stats);
> +		if (new_state == HGH_BUSY)
> +			enter_busy_state(poll_stats);
> +		if (new_state == LOW_PURGE)
> +			enter_purge_state(poll_stats);
> +	}
> +}
> +
> +
> +static void
> +update_training_stats(struct priority_worker *poll_stats,
> +		uint32_t freq,
> +		bool specific_freq,
> +		uint32_t max_train_iter)
> +{
> +	RTE_SET_USED(specific_freq);
> +
> +	char pfi_str[32];
> +	uint64_t p0_empty_deq;
> +
> +	sprintf(pfi_str, "%02d", freq);
> +
> +	if (poll_stats->cur_freq == freq &&
> +			poll_stats->thresh[freq].trained == false) {
> +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> +
> +			set_power_freq(poll_stats->lcore_id,
> +					freq, specific_freq);
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +			return;
> +		} else if (poll_stats->thresh[freq].cur_train_iter
> +				<= max_train_iter) {
> +
> +			p0_empty_deq = poll_stats->empty_dequeues -
> +				poll_stats->empty_dequeues_prev;
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +		} else {
> +			if (poll_stats->thresh[freq].trained == false) {
> +				poll_stats->thresh[freq].base_edpi =
> +					poll_stats->thresh[freq].base_edpi /
> +					max_train_iter;
> +
> +				/* Add on a factor of 0.05%, this should remove any */
> +				/* false negatives when the system is 0% busy */
> +				poll_stats->thresh[freq].base_edpi +=
> +					poll_stats->thresh[freq].base_edpi / 2000;
> +
> +				poll_stats->thresh[freq].trained = true;
> +				poll_stats->cur_freq++;
> +
> +			}
> +		}
> +	}
> +}
> +
> +static inline uint32_t __attribute__((always_inline))
> +update_stats(struct priority_worker *poll_stats)
> +{
> +	uint64_t tot_edpi = 0, tot_ppi = 0;
> +	uint32_t j, percent;
> +
> +	struct priority_worker *s = poll_stats;
> +
> +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> +
> +	s->empty_dequeues_prev = s->empty_dequeues;
> +
> +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> +
> +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> +
> +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> +		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> +				"cur edpi %ld "
> +				"base epdi %ld\n",
> +				cur_edpi,
> +				s->thresh[s->cur_freq].base_edpi);
> +		/* Value to make us fail need debug log*/
> +		return 1000UL;
> +	}
> +
> +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> +
> +	for (j = 0; j < BINS_AV; j++) {
> +		tot_edpi += s->edpi_av[j];
> +		tot_ppi += s->ppi_av[j];
> +	}
> +
> +	tot_edpi = tot_edpi / BINS_AV;
> +
> +	percent = 100 - (uint32_t)(((float)tot_edpi /
> +			(float)s->thresh[s->cur_freq].base_edpi) * 100);
> +
> +	return (uint32_t)percent;
> +}
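As an aside on update_stats(): the busy percentage is 100 minus the ratio of the averaged empty-poll delta to the trained baseline. A minimal standalone sketch follows, assuming integer arithmetic in place of the float cast used in the patch. struct avg_window, busy_percent() and BINS are hypothetical names introduced only for illustration; 1000 is reused as the out-of-range marker the patch returns.

```c
#include <stdint.h>

#define BINS 4 /* power of two, mirrors BINS_AV in the patch */

/* Hypothetical sliding window over the last BINS interval samples. */
struct avg_window {
	uint64_t bins[BINS];
	uint32_t next;
};

/*
 * Record one empty-poll delta and return how busy the core looks,
 * in percent of the trained baseline. Returns 1000 when the sample
 * cannot be compared against the baseline.
 */
static uint32_t
busy_percent(struct avg_window *w, uint64_t cur_edpi, uint64_t base_edpi)
{
	uint64_t avg = 0;
	uint64_t idle;
	uint32_t j;

	if (base_edpi == 0 || cur_edpi > base_edpi)
		return 1000;

	/* Masking works because BINS is a power of two. */
	w->bins[w->next++ & (BINS - 1)] = cur_edpi;

	for (j = 0; j < BINS; j++)
		avg += w->bins[j];
	avg /= BINS;

	idle = (avg * 100) / base_edpi;
	if (idle > 100)
		idle = 100; /* clamp if the caller switched baselines */

	return (uint32_t)(100 - idle);
}
```

Note that a zero-filled window makes the first few samples read busier than they are (75%, 50%, 25% for a fully idle core in this sketch). The consecutive-interval counter in update_stats_normal() would appear to absorb that, since a state change needs far more than BINS samples in a row.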
> +
> +
> +static inline void  __attribute__((always_inline))
> +update_stats_normal(struct priority_worker *poll_stats)
> +{
> +	uint32_t percent;
> +
> +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> +		return;
> +
> +	percent = update_stats(poll_stats);
> +
> +	if (percent > 100)
> +		return;
> +
> +	if (poll_stats->cur_freq == LOW)
> +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> +	else if (poll_stats->cur_freq == MED) {
> +
> +		if (percent >
> +			poll_stats->thresh[MED].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else {
> +				set_state(poll_stats, HGH_BUSY);
> +				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> +			}
> +
> +		} else {
> +			/* reset */
> +			/* need debug log */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	} else if (poll_stats->cur_freq == HGH) {
> +
> +		if (percent <
> +				poll_stats->thresh[HGH].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, MED_NORMAL);
Suggest adding a log here as well, for the state change from HGH to MED:
RTE_LOG(INFO, POWER, "MOVE to MED\n");
> +		} else
> +			/* reset */
> +			/* need debug log */
> +			poll_stats->threshold_ctr = 0;
> +
> +
> +	}
> +}
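To make the MED/HGH switching in update_stats_normal() easier to reason about, here is a small standalone model of the hysteresis. It is a sketch under assumptions, not a drop-in replacement: struct governor, governor_step() and governor_feed() are hypothetical names, and the constants copy the defaults in the patch (70% up, 30% down, 100 consecutive 10ms intervals).

```c
#include <stdint.h>

enum pstate { ST_MED, ST_HGH };

/* Hypothetical per-core governor state. */
struct governor {
	enum pstate state;
	uint32_t ctr; /* consecutive intervals past the threshold */
};

#define UP_THRESHOLD   70  /* MED -> HGH when busier than this */
#define DOWN_THRESHOLD 30  /* HGH -> MED when less busy than this */
#define HOLD_INTERVALS 100 /* samples required before switching */

/* Feed one busy-percent sample; returns 1 if the state changed. */
static int
governor_step(struct governor *g, uint32_t busy_percent)
{
	int crossing;

	if (busy_percent > 100)
		return 0; /* invalid sample, ignored as in the patch */

	if (g->state == ST_MED)
		crossing = busy_percent > UP_THRESHOLD;
	else
		crossing = busy_percent < DOWN_THRESHOLD;

	if (!crossing) {
		g->ctr = 0; /* streak broken, start over */
		return 0;
	}

	if (g->ctr < HOLD_INTERVALS) {
		g->ctr++;
		return 0;
	}

	g->state = (g->state == ST_MED) ? ST_HGH : ST_MED;
	g->ctr = 0;
	return 1;
}

/* Feed n identical samples; returns the number of state changes. */
static int
governor_feed(struct governor *g, uint32_t busy_percent, uint32_t n)
{
	int changes = 0;

	while (n--)
		changes += governor_step(g, busy_percent);
	return changes;
}
```

In this model the switch happens on the 101st consecutive sample past the threshold, so a burst shorter than roughly one second would not change the frequency. Samples between 30% and 70% never move the state in either direction.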
> +
> +static int
> +empty_poll_training(struct priority_worker *poll_stats,
> +		uint32_t max_train_iter)
> +{
> +
> +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> +		poll_stats->iter_counter++;
> +		return 0;
> +	}
> +
> +
> +	update_training_stats(poll_stats,
> +			LOW,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			MED,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			HGH,
> +			false,
> +			max_train_iter);
> +
> +
> +	if (poll_stats->thresh[LOW].trained == true
> +			&& poll_stats->thresh[MED].trained == true
> +			&& poll_stats->thresh[HGH].trained == true) {
> +
> +		set_state(poll_stats, MED_NORMAL);
> +
> +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> +				poll_stats->lcore_id);
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +rte_empty_poll_detection(struct rte_timer *tim,
> +		void *arg)
> +{
> +
> +	uint32_t i;
> +
> +	struct priority_worker *poll_stats;
> +
> +	RTE_SET_USED(tim);
> +
> +	RTE_SET_USED(arg);
> +
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> +
> +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> +			continue;
> +
> +		switch (poll_stats->queue_state) {
> +		case(TRAINING):
> +			empty_poll_training(poll_stats,
> +					ep_params->max_train_iter);
> +			break;
> +
> +		case(HGH_BUSY):
> +		case(MED_NORMAL):
> +			update_stats_normal(poll_stats);
> +
> +			break;
> +
> +		case(LOW_PURGE):
> +			break;
> +		default:
> +			break;
> +
> +		}
> +
> +	}
> +
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb)
> +{
> +	uint32_t i;
> +	/* Allocate the ep_params structure */
> +	ep_params = rte_zmalloc_socket(NULL,
> +			sizeof(struct ep_params),
> +			0,
> +			rte_socket_id());
> +
> +	if (!ep_params)
> +		rte_panic("Cannot allocate heap memory for ep_params "
> +				"for socket %d\n", rte_socket_id());
> +
> +	if (freq_tlb == NULL) {
> +		freq_index[LOW] = 14;
> +		freq_index[MED] = 9;
> +		freq_index[HGH] = 1;
> +	} else {
> +		freq_index[LOW] = freq_tlb[LOW];
> +		freq_index[MED] = freq_tlb[MED];
> +		freq_index[HGH] = freq_tlb[HGH];
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> +
> +	/* 2 seconds worth of training */
> +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> +
> +	struct stats_data *w = &ep_params->wrk_data;
> +
> +	*eptr = ep_params;
> +
> +	/* initialize all wrk_stats state */
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		if (rte_lcore_is_enabled(i) == 0)
> +			continue;
> +
> +		set_state(&w->wrk_stats[i], TRAINING);
> +		/*init the freqs table */
> +		total_avail_freqs[i] = rte_power_freqs(i,
> +				avail_freqs[i],
> +				NUM_FREQS);
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +	}
> +
> +
> +	return 0;
> +}
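On the frequency table handling in the init function: only the LOW index is compared against the number of available frequencies, and the comparison uses ">". A stricter check could look like the sketch below. check_freq_table() and the IDX_* names are hypothetical, and the ordering test assumes that a lower index means a higher frequency, which is how I understand the librte_power frequency list to be sorted; please treat that as an assumption.

```c
#include <stdint.h>

enum { IDX_LOW, IDX_MED, IDX_HGH, IDX_COUNT };

/*
 * Hypothetical validation of a LOW/MED/HGH index table against the
 * number of frequencies a core reports. Returns 0 if usable, -1 if not.
 */
static int
check_freq_table(const uint8_t table[IDX_COUNT], uint32_t n_avail)
{
	int i;

	/* Valid indexes are 0 .. n_avail - 1. */
	for (i = 0; i < IDX_COUNT; i++)
		if (table[i] >= n_avail)
			return -1;

	/* Assumed ordering: index 0 is the highest frequency. */
	if (table[IDX_HGH] > table[IDX_MED] ||
			table[IDX_MED] > table[IDX_LOW])
		return -1;

	return 0;
}
```

With the defaults from the patch (LOW 14, MED 9, HGH 1) this would accept a core exposing 15 or more frequencies and reject one exposing 14.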
> +
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void)
> +{
> +
> +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> +
> +	if (ep_params != NULL)
> +		rte_free(ep_params);
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->empty_dequeues++;
> +
> +	return 0;
> +}
> +
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> +{
> +
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->num_dequeue_pkts += nb_pkt;
> +
> +	return 0;
> +}
> +
> +
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->empty_dequeues;
> +}
> +
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->num_dequeue_pkts;
> +}
> diff --git a/lib/librte_power/rte_power_empty_poll.h
> b/lib/librte_power/rte_power_empty_poll.h
> new file mode 100644
> index 0000000..e43981f
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_EMPTY_POLL_H
> +#define _RTE_EMPTY_POLL_H
> +
> +/**
> + * @file
> + * RTE Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_string_fns.h>
> +#include <rte_power.h>
> +#include <rte_timer.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define NUM_FREQS 20
> +
> +#define BINS_AV 4 /* Has to be a power of 2 */
> +
> +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> +
> +#define NUM_PRIORITIES          2
> +
> +#define NUM_NODES         31 /* Any reasonable prime number should work */
> +
> +/* Processor Power State */
> +enum freq_val {
> +	LOW,
> +	MED,
> +	HGH,
> +	NUM_FREQ = NUM_FREQS
> +};
> +
> +
> +/* Queue Polling State */
> +enum queue_state {
> +	TRAINING, /* NO TRAFFIC */
> +	MED_NORMAL,   /* MED */
> +	HGH_BUSY,     /* HIGH */
> +	LOW_PURGE,    /* LOW */
> +};
> +
> +/* Queue Stats */
> +struct freq_threshold {
> +
> +	uint64_t base_edpi;
> +	bool trained;
> +	uint32_t threshold_percent;
> +	uint32_t cur_train_iter;
> +};
> +
> +/* Each Worker Thread Empty Poll Stats */
> +struct priority_worker {
> +
> +	/* Current dequeue and throughput counts */
> +	/* These 2 are written to by the worker threads */
> +	/* So keep them on their own cache line */
> +	uint64_t empty_dequeues;
> +	uint64_t num_dequeue_pkts;
> +
> +	enum queue_state queue_state;
> +
> +	uint64_t empty_dequeues_prev;
> +	uint64_t num_dequeue_pkts_prev;
> +
> +	/* Used for training only */
> +	struct freq_threshold thresh[NUM_FREQ];
> +	enum freq_val cur_freq;
> +
> +	/* bucket arrays to calculate the averages */
> +	uint64_t edpi_av[BINS_AV];
> +	uint32_t  ec;
> +	uint64_t ppi_av[BINS_AV];
> +	uint32_t  pc;
> +
> +	uint32_t lcore_id;
> +	uint32_t iter_counter;
> +	uint32_t threshold_ctr;
> +	uint32_t display_ctr;
> +	uint8_t  dev_id;
> +
> +} __rte_cache_aligned;
> +
> +
> +struct stats_data {
> +
> +	struct priority_worker wrk_stats[NUM_NODES];
> +
> +	/* flag to stop rx threads processing packets until training over */
> +	bool start_rx;
> +
> +};
> +
> +/* Empty Poll Parameters */
> +struct ep_params {
> +
> +	/* Timer related stuff */
> +	uint64_t interval_ticks;
> +	uint32_t max_train_iter;
> +
> +	struct rte_timer timer0;
> +	struct stats_data wrk_data;
> +};
> +
> +
> +/**
> + * Initialize the power management system.
> + *
> + * @param eptr
> + *   the structure of empty poll configuration
> + * @param freq_tlb
> + *   the power state/frequency  mapping table
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
> +
> +/**
> + * Free the resources held by the power management system.
> + */
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void);
> +
> +/**
> + * Update specific core empty poll counter
> + * It's not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id);
> +
> +/**
> + * Update specific core valid poll counter, not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id.
> + * @param nb_pkt
> + *  The packet number of one valid poll.
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> +
> +/**
> + * Fetch specific core empty poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore empty poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Fetch specific core valid poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore valid poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Empty poll  state change detection function
> + *
> + * @param  tim
> + *  The timer structure
> + * @param  arg
> + *  The customized parameter
> + */
> +void  __rte_experimental
> +rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_version.map
> b/lib/librte_power/rte_power_version.map
> index dd587df..11ffdfb 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -33,3 +33,16 @@ DPDK_18.08 {
>  	rte_power_get_capabilities;
> 
>  } DPDK_17.11;
> +
> +EXPERIMENTAL {
> +        global:
> +
> +        rte_power_empty_poll_stat_init;
> +        rte_power_empty_poll_stat_free;
> +        rte_power_empty_poll_stat_update;
> +        rte_power_empty_poll_stat_fetch;
> +        rte_power_poll_stat_fetch;
> +        rte_power_poll_stat_update;
> +        rte_empty_poll_detection;
> +
> +};
> --
> 2.7.5


* Re: [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
                                 ` (3 preceding siblings ...)
  2018-09-04  1:11               ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Yao, Lei A
@ 2018-09-04  2:09               ` Yao, Lei A
  2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
  5 siblings, 0 replies; 79+ messages in thread
From: Yao, Lei A @ 2018-09-04  2:09 UTC (permalink / raw)
  To: Ma, Liang J, Hunt, David
  Cc: dev, Nicolau, Radu, Burakov, Anatoly, Geary, John



> -----Original Message-----
> From: Ma, Liang J
> Sent: Friday, August 31, 2018 11:04 PM
> To: Hunt, David <david.hunt@intel.com>
> Cc: dev@dpdk.org; Yao, Lei A <lei.a.yao@intel.com>; Nicolau, Radu
> <radu.nicolau@intel.com>; Burakov, Anatoly <anatoly.burakov@intel.com>;
> Geary, John <john.geary@intel.com>; Ma, Liang J <liang.j.ma@intel.com>
> Subject: [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control
> 
> 1. Abstract
> 
> For packet processing workloads such as DPDK, polling is continuous.
> This means CPU cores always show 100% busy, independent of how much
> work those cores are doing. Accurately determining how busy a core is
> matters because, without that information:
> 
>    * There is no indication of overload conditions
> 
>    * Users do not know how much real load is on a system, which results in
>      wasted energy as no power management is utilized
> 
> Compared to the original l3fwd-power design, instead of going to sleep
> after detecting an empty poll, the new mechanism just lowers the core
> frequency. As a result, the application does not stop polling the device,
> which leads to improved handling of bursts of traffic.
> 
> When the system becomes busy, the empty poll mechanism can also increase
> the core frequency (including turbo) on a best-effort basis for intensive
> traffic. This gives us more flexible and balanced traffic awareness than
> the standard l3fwd-power application.
> 
> 2. Proposed solution
> 
> The proposed solution focuses on how many times empty polls are executed.
> A lower number of empty polls means the current core is busy processing
> its workload, so a higher frequency is needed. A high number of empty
> polls indicates the current core is not doing any real work, so we can
> lower the frequency to save power.
> 
> In the current implementation, each core has 1 empty-poll counter, which
> assumes 1 core is dedicated to 1 queue. This will need to be expanded in
> the future to support multiple queues per core.
> 
> 2.1 Power state definition:
> 
> 	LOW:  Not currently used, reserved for future use.
> 
> 	MED:  the frequency is used to process modest traffic workload.
> 
> 	HIGH: the frequency is used to process busy traffic workload.
> 
> 2.2 There are two phases to establish the power management system:
> 
> 	a. Initialization/Training phase. The training phase is necessary
> 	   in order to figure out the system polling baseline numbers from
> 	   idle to busy. The highest poll count will be during idle, where all
> 	   polls are empty. These poll counts will be different between
> 	   systems due to the many possible processor micro-arch, cache
> 	   and device configurations, hence the training phase.
> 	   In the training phase, traffic is blocked so the training algorithm
> 	   can average the empty-poll numbers for the LOW, MED and
> 	   HIGH power states in order to create a baseline.
> 	   Each core's counters are collected every 10ms, and the training
> 	   phase takes 2 seconds.
> 
> 	b. Normal phase. When the training phase is complete, traffic is
> 	   started. The run-time poll counts are compared with the
> 	   baseline and a decision is taken to move to the MED power
> 	   state or the HIGH power state. The counters are calculated every 10ms.
> 
> 3. Proposed  API
> 
> 1.  rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
> which is used to initialize the power management system.
> 
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resources held by the power management system.
> 
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update a specific core's empty poll counter; not thread safe.
> 
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update a specific core's valid poll counter; not thread safe.
> 
> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get a specific core's empty poll counter.
> 
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get a specific core's valid poll counter.
> 
> 7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> which is used to detect empty poll state changes.
> 
> ChangeLog:
> v2: fix some coding style issues
> v3: rename the filename, API name.
> v4: no change
> v5: no change
> v6: re-work the code layout, update API
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> ---
>  lib/librte_power/Makefile               |   6 +-
>  lib/librte_power/meson.build            |   5 +-
>  lib/librte_power/rte_power_empty_poll.c | 500
> ++++++++++++++++++++++++++++++++
>  lib/librte_power/rte_power_empty_poll.h | 205 +++++++++++++
>  lib/librte_power/rte_power_version.map  |  13 +
>  5 files changed, 725 insertions(+), 4 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_empty_poll.c
>  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> 
> diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> index 6f85e88..a8f1301 100644
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
>  # library name
>  LIB = librte_power.a
> 
> +CFLAGS += -DALLOW_EXPERIMENTAL_API
>  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> -LDLIBS += -lrte_eal
> +LDLIBS += -lrte_eal -lrte_timer
> 
>  EXPORT_MAP := rte_power_version.map
> 
> @@ -16,8 +17,9 @@ LIBABIVER := 1
>  # all source are stored in SRCS-y
>  SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
>  SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
> 
>  # install this header file
> -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h rte_power_empty_poll.h
> 
>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 253173f..63957eb 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
>  	build = false
>  endif
>  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> -		'power_kvm_vm.c', 'guest_channel.c')
> -headers = files('rte_power.h')
> +		'power_kvm_vm.c', 'guest_channel.c',
> +		'rte_power_empty_poll.c')
> +headers = files('rte_power.h','rte_power_empty_poll.h')
> diff --git a/lib/librte_power/rte_power_empty_poll.c
> b/lib/librte_power/rte_power_empty_poll.c
> new file mode 100644
> index 0000000..3dac654
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.c
> @@ -0,0 +1,500 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +
> +#include "rte_power.h"
> +#include "rte_power_empty_poll.h"
> +
> +#define INTERVALS_PER_SECOND 100     /* (10ms) */
> +#define SECONDS_TO_TRAIN_FOR 2
> +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> +#define DEFAULT_CYCLES_PER_PACKET 800
> +
> +static struct ep_params *ep_params;
> +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> +
> +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> +
> +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> +
> +static uint32_t freq_index[NUM_FREQ];
> +
> +static uint32_t
> +get_freq_index(enum freq_val index)
> +{
> +	return freq_index[index];
> +}
> +
> +
> +static int
> +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> +{
> +	int err = 0;
> +	uint32_t power_freq_index;
> +	if (!specific_freq)
> +		power_freq_index = get_freq_index(freq);
> +	else
> +		power_freq_index = freq;
> +
> +	err = rte_power_set_freq(lcore_id, power_freq_index);
> +
> +	return err;
> +}
> +
> +
> +static inline void __attribute__((always_inline))
> +exit_training_state(struct priority_worker *poll_stats)
> +{
> +	RTE_SET_USED(poll_stats);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_training_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->cur_freq = LOW;
> +	poll_stats->queue_state = TRAINING;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_normal_state(struct priority_worker *poll_stats)
> +{
> +	/* Clear the averages arrays and strs */
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = MED;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = MED_NORMAL;
> +	set_power_freq(poll_stats->lcore_id, MED, false);
> +
> +	/* Try here */
> +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_busy_state(struct priority_worker *poll_stats)
> +{
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = HGH;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = HGH_BUSY;
> +	set_power_freq(poll_stats->lcore_id, HGH, false);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_purge_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->queue_state = LOW_PURGE;
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_state(struct priority_worker *poll_stats,
> +		enum queue_state new_state)
> +{
> +	enum queue_state old_state = poll_stats->queue_state;
> +	if (old_state != new_state) {
> +
> +		/* Call any old state exit functions */
> +		if (old_state == TRAINING)
> +			exit_training_state(poll_stats);
> +
> +		/* Call any new state entry functions */
> +		if (new_state == TRAINING)
> +			enter_training_state(poll_stats);
> +		if (new_state == MED_NORMAL)
> +			enter_normal_state(poll_stats);
> +		if (new_state == HGH_BUSY)
> +			enter_busy_state(poll_stats);
> +		if (new_state == LOW_PURGE)
> +			enter_purge_state(poll_stats);
> +	}
> +}
> +
> +
> +static void
> +update_training_stats(struct priority_worker *poll_stats,
> +		uint32_t freq,
> +		bool specific_freq,
> +		uint32_t max_train_iter)
> +{
> +	RTE_SET_USED(specific_freq);
> +
> +	char pfi_str[32];
> +	uint64_t p0_empty_deq;
> +
> +	sprintf(pfi_str, "%02d", freq);
> +
> +	if (poll_stats->cur_freq == freq &&
> +			poll_stats->thresh[freq].trained == false) {
> +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> +
> +			set_power_freq(poll_stats->lcore_id,
> +					freq, specific_freq);
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +			return;
> +		} else if (poll_stats->thresh[freq].cur_train_iter
> +				<= max_train_iter) {
> +
> +			p0_empty_deq = poll_stats->empty_dequeues -
> +				poll_stats->empty_dequeues_prev;
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +		} else {
> +			if (poll_stats->thresh[freq].trained == false) {
> +				poll_stats->thresh[freq].base_edpi =
> +					poll_stats->thresh[freq].base_edpi /
> +					max_train_iter;
> +
> +				/* Add on a factor of 0.05%, this should remove any */
> +				/* false negatives when the system is 0% busy */
> +				poll_stats->thresh[freq].base_edpi +=
> +					poll_stats->thresh[freq].base_edpi / 2000;
> +
> +				poll_stats->thresh[freq].trained = true;
> +				poll_stats->cur_freq++;
> +
> +			}
> +		}
> +	}
> +}
> +
> +static inline uint32_t __attribute__((always_inline))
> +update_stats(struct priority_worker *poll_stats)
> +{
> +	uint64_t tot_edpi = 0, tot_ppi = 0;
> +	uint32_t j, percent;
> +
> +	struct priority_worker *s = poll_stats;
> +
> +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> +
> +	s->empty_dequeues_prev = s->empty_dequeues;
> +
> +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> +
> +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> +
> +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> +		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> +				"cur edpi %ld "
> +				"base epdi %ld\n",
> +				cur_edpi,
> +				s->thresh[s->cur_freq].base_edpi);
> +		/* Value to make us fail need debug log*/
> +		return 1000UL;
> +	}
> +
> +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> +
> +	for (j = 0; j < BINS_AV; j++) {
> +		tot_edpi += s->edpi_av[j];
> +		tot_ppi += s->ppi_av[j];
> +	}
> +
> +	tot_edpi = tot_edpi / BINS_AV;
> +
> +	percent = 100 - (uint32_t)(((float)tot_edpi /
> +			(float)s->thresh[s->cur_freq].base_edpi) * 100);
> +
> +	return (uint32_t)percent;
> +}
> +
> +
> +static inline void  __attribute__((always_inline))
> +update_stats_normal(struct priority_worker *poll_stats)
> +{
> +	uint32_t percent;
> +
> +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> +		return;
> +
> +	percent = update_stats(poll_stats);
> +
> +	if (percent > 100)
> +		return;
> +
> +	if (poll_stats->cur_freq == LOW)
> +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> +	else if (poll_stats->cur_freq == MED) {
> +
> +		if (percent >
> +			poll_stats->thresh[MED].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else {
> +				set_state(poll_stats, HGH_BUSY);
> +				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> +			}
> +
> +		} else {
> +			/* reset */
> +			/* need debug log */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	} else if (poll_stats->cur_freq == HGH) {
> +
> +		if (percent <
> +				poll_stats->thresh[HGH].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, MED_NORMAL);
> +		} else
> +			/* reset */
> +			/* need debug log */
> +			poll_stats->threshold_ctr = 0;
> +
> +
> +	}
> +}
> +
> +static int
> +empty_poll_training(struct priority_worker *poll_stats,
> +		uint32_t max_train_iter)
> +{
> +
> +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> +		poll_stats->iter_counter++;
> +		return 0;
> +	}
> +
> +
> +	update_training_stats(poll_stats,
> +			LOW,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			MED,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			HGH,
> +			false,
> +			max_train_iter);
> +
> +
> +	if (poll_stats->thresh[LOW].trained == true
> +			&& poll_stats->thresh[MED].trained == true
> +			&& poll_stats->thresh[HGH].trained == true) {
> +
> +		set_state(poll_stats, MED_NORMAL);
> +
> +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> +				poll_stats->lcore_id);
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +rte_empty_poll_detection(struct rte_timer *tim,
> +		void *arg)
> +{
> +
> +	uint32_t i;
> +
> +	struct priority_worker *poll_stats;
> +
> +	RTE_SET_USED(tim);
> +
> +	RTE_SET_USED(arg);
> +
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> +
> +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> +			continue;
> +
> +		switch (poll_stats->queue_state) {
> +		case(TRAINING):
> +			empty_poll_training(poll_stats,
> +					ep_params->max_train_iter);
> +			break;
> +
> +		case(HGH_BUSY):
> +		case(MED_NORMAL):
> +			update_stats_normal(poll_stats);
> +
> +			break;
> +
> +		case(LOW_PURGE):
> +			break;
> +		default:
> +			break;
> +
> +		}
> +
> +	}
> +
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb)
> +{
> +	uint32_t i;
> +	/* Allocate the ep_params structure */
> +	ep_params = rte_zmalloc_socket(NULL,
> +			sizeof(struct ep_params),
> +			0,
> +			rte_socket_id());
> +
> +	if (!ep_params)
> +		rte_panic("Cannot allocate heap memory for ep_params "
> +				"for socket %d\n", rte_socket_id());
> +
> +	if (freq_tlb == NULL) {
> +		freq_index[LOW] = 14;
> +		freq_index[MED] = 9;
> +		freq_index[HGH] = 1;
> +	} else {
> +		freq_index[LOW] = freq_tlb[LOW];
> +		freq_index[MED] = freq_tlb[MED];
> +		freq_index[HGH] = freq_tlb[HGH];
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> +
> +	/* 5 seconds worth of training */
> +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> +
> +	struct stats_data *w = &ep_params->wrk_data;
> +
> +	*eptr = ep_params;
> +
> +	/* initialize all wrk_stats state */
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		if (rte_lcore_is_enabled(i) == 0)
> +			continue;
> +
> +		set_state(&w->wrk_stats[i], TRAINING);
> +		/*init the freqs table */
> +		total_avail_freqs[i] = rte_power_freqs(i,
> +				avail_freqs[i],
> +				NUM_FREQS);
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +	}
> +
> +
> +	return 0;
> +}
> +
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void)
> +{
> +
> +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> +
> +	if (ep_params != NULL)
> +		rte_free(ep_params);
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->empty_dequeues++;
> +
> +	return 0;
> +}
> +
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> +{
> +
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->num_dequeue_pkts += nb_pkt;
> +
> +	return 0;
> +}
> +
> +
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->empty_dequeues;
> +}
> +
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->num_dequeue_pkts;
> +}
> diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
> new file mode 100644
> index 0000000..e43981f
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_EMPTY_POLL_H
> +#define _RTE_EMPTY_POLL_H
> +
> +/**
> + * @file
> + * RTE Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_string_fns.h>
> +#include <rte_power.h>
> +#include <rte_timer.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define NUM_FREQS 20
> +
> +#define BINS_AV 4 /* Has to be ^2 */
> +
> +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> +
> +#define NUM_PRIORITIES          2
> +
> +#define NUM_NODES         31 /* Any reasonable prime number should work */
> +
> +/* Processor Power State */
> +enum freq_val {
> +	LOW,
> +	MED,
> +	HGH,
> +	NUM_FREQ = NUM_FREQS
> +};
> +
> +
> +/* Queue Polling State */
> +enum queue_state {
> +	TRAINING, /* NO TRAFFIC */
> +	MED_NORMAL,   /* MED */
> +	HGH_BUSY,     /* HIGH */
> +	LOW_PURGE,    /* LOW */
> +};
> +
> +/* Queue Stats */
> +struct freq_threshold {
> +
> +	uint64_t base_edpi;
> +	bool trained;
> +	uint32_t threshold_percent;
> +	uint32_t cur_train_iter;
> +};
> +
> +/* Each Worker Thread Empty Poll Stats */
> +struct priority_worker {
> +
> +	/* Current dequeue and throughput counts */
> +	/* These 2 are written to by the worker threads */
> +	/* So keep them on their own cache line */
> +	uint64_t empty_dequeues;
> +	uint64_t num_dequeue_pkts;
> +
> +	enum queue_state queue_state;
> +
> +	uint64_t empty_dequeues_prev;
> +	uint64_t num_dequeue_pkts_prev;
> +
> +	/* Used for training only */
> +	struct freq_threshold thresh[NUM_FREQ];
> +	enum freq_val cur_freq;
> +
> +	/* bucket arrays to calculate the averages */
> +	uint64_t edpi_av[BINS_AV];
> +	uint32_t  ec;
> +	uint64_t ppi_av[BINS_AV];
> +	uint32_t  pc;
> +
> +	uint32_t lcore_id;
> +	uint32_t iter_counter;
> +	uint32_t threshold_ctr;
> +	uint32_t display_ctr;
> +	uint8_t  dev_id;
> +
> +} __rte_cache_aligned;
> +
> +
> +struct stats_data {
> +
> +	struct priority_worker wrk_stats[NUM_NODES];
> +
> +	/* flag to stop rx threads processing packets until training over */
> +	bool start_rx;
> +
> +};
> +
> +/* Empty Poll Parameters */
> +struct ep_params {
> +
> +	/* Timer related stuff */
> +	uint64_t interval_ticks;
> +	uint32_t max_train_iter;
> +
> +	struct rte_timer timer0;
> +	struct stats_data wrk_data;
> +};
> +
> +
> +/**
> + * Initialize the power management system.
> + *
> + * @param eptr
> + *   the structure of empty poll configuration
> + * @param freq_tlb
> + *   the power state/frequency  mapping table
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
> +
> +/**
> + * Free the resources held by the power management system.
> + */
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void);
> +
> +/**
> + * Update specific core empty poll counter
> + * It's not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id);
> +
> +/**
> + * Update specific core valid poll counter, not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id.
> + * @param nb_pkt
> + *  The packet number of one valid poll.
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> +
> +/**
> + * Fetch specific core empty poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore empty poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Fetch specific core valid poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore valid poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Empty poll  state change detection function
> + *
> + * @param  tim
> + *  The timer structure
> + * @param  arg
> + *  The customized parameter
> + */
> +void  __rte_experimental
> +rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index dd587df..11ffdfb 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -33,3 +33,16 @@ DPDK_18.08 {
>  	rte_power_get_capabilities;
> 
>  } DPDK_17.11;
> +
> +EXPERIMENTAL {
> +        global:
> +
> +        rte_power_empty_poll_stat_init;
> +        rte_power_empty_poll_stat_free;
> +        rte_power_empty_poll_stat_update;
> +        rte_power_empty_poll_stat_fetch;
> +        rte_power_poll_stat_fetch;
> +        rte_power_poll_stat_update;
> +        rte_empty_poll_detection;
> +
> +};
> --
> 2.7.5



* [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control
  2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
                                 ` (4 preceding siblings ...)
  2018-09-04  2:09               ` Yao, Lei A
@ 2018-09-04 14:10               ` Liang Ma
  2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
                                   ` (4 more replies)
  5 siblings, 5 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-04 14:10 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Being unable to determine accurately how busy a
core is causes the following problems:

   * No indication of overload conditions

   * Users do not know how much real load is on a system, resulting in
     wasted energy as no power management is utilized

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core
frequency. As a result, the application does not stop polling the device,
which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the
core frequency (including turbo) on a best-effort basis for intensive traffic.
This gives us more flexible and balanced traffic awareness over the
standard l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many times empty polls are executed.
A low number of empty polls means the current core is busy processing its
workload, so a higher frequency is needed. A high number of empty polls
indicates the current core is not doing any real work, so we can lower
the frequency to save power.

In the current implementation, each core has 1 empty-poll counter, which
assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
future to support multiple queues per core.
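
As a rough illustration of the counting described above, the sketch below
classifies each poll of a core's single queue as empty or valid. The struct
and function names (poll_counters, record_poll) are hypothetical and are not
part of this patch; in the patch the equivalent bookkeeping is done through
rte_power_empty_poll_stat_update() and rte_power_poll_stat_update().

```c
#include <stdint.h>

/* Hypothetical per-core counters. The patch keeps the equivalents in
 * struct priority_worker (empty_dequeues, num_dequeue_pkts). */
struct poll_counters {
	uint64_t empty_polls;	/* polls that returned no packets */
	uint64_t rx_pkts;	/* packets returned by non-empty polls */
};

/* Record the result of one poll of the core's single RX queue.
 * nb_rx is the number of packets the poll returned. */
static inline void
record_poll(struct poll_counters *c, uint16_t nb_rx)
{
	if (nb_rx == 0)
		c->empty_polls++;
	else
		c->rx_pkts += nb_rx;
}
```

A worker would call record_poll() once per RX burst, so the hot path only
pays for one branch and one increment; all averaging is left to the periodic
timer callback.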

2.1 Power state definition:

	LOW:  Not currently used, reserved for future use.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a. Initialization/Training phase. The training phase is necessary
	   in order to figure out the system polling baseline numbers from
	   idle to busy. The highest poll count will be during idle, where
	   all polls are empty. These poll counts will be different between
	   systems due to the many possible processor micro-arch, cache
	   and device configurations, hence the training phase.
	   In the training phase, traffic is blocked so the training
	   algorithm can average the empty-poll numbers for the LOW, MED and
	   HIGH power states in order to create a baseline.
	   The core's counters are collected every 10ms, and the training
	   phase will take 2 seconds.

	b. Normal phase. When the training phase is complete, traffic is
	   started. The run-time poll counts are compared with the
	   baseline and the decision will be taken to move to MED power
	   state or HIGH power state. The counters are calculated every 10ms.
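
The two phases can be sketched roughly as follows. This is a simplified
model, not the code in this patch: the names train_baseline, busy_percent,
next_state, core_ctl and the STATE_* values are hypothetical, integer
arithmetic is used where the patch uses float, and the patch's averaging
over BINS_AV intervals is left out. The 70%/30% thresholds, the 0.05%
margin and the requirement that a threshold stays crossed for
INTERVALS_PER_SECOND intervals are taken from rte_power_empty_poll.c.

```c
#include <stdint.h>

#define INTERVALS_PER_SECOND 100	/* one sample every 10ms */
#define MED_TO_HIGH_PERCENT   70	/* default in the patch */
#define HIGH_TO_MED_PERCENT   30	/* default in the patch */

enum power_state { STATE_MED, STATE_HIGH };	/* hypothetical names */

struct core_ctl {
	enum power_state state;
	uint32_t sustained;	/* consecutive intervals past the threshold */
};

/* Training: average the idle empty-poll samples, then add a 0.05% margin
 * so that a fully idle core is not misread as slightly busy. */
static uint64_t
train_baseline(const uint64_t *samples, uint32_t n)
{
	uint64_t sum = 0;
	uint32_t i;

	if (n == 0)
		return 0;
	for (i = 0; i < n; i++)
		sum += samples[i];
	sum /= n;
	return sum + sum / 2000;
}

/* Normal phase: 0 when every poll is empty, 100 when none is.
 * Returns -1 when the sample cannot be compared with the baseline. */
static int
busy_percent(uint64_t empty_polls, uint64_t baseline)
{
	if (baseline == 0 || empty_polls > baseline)
		return -1;
	return (int)(100 - (empty_polls * 100) / baseline);
}

/* Change state only after the threshold has been crossed for more than
 * INTERVALS_PER_SECOND consecutive intervals. */
static enum power_state
next_state(struct core_ctl *c, int percent)
{
	int crossing;

	if (percent < 0)
		return c->state;	/* unusable sample, keep everything */

	crossing = (c->state == STATE_MED) ?
		(percent > MED_TO_HIGH_PERCENT) :
		(percent < HIGH_TO_MED_PERCENT);

	if (!crossing) {
		c->sustained = 0;
	} else if (c->sustained < INTERVALS_PER_SECOND) {
		c->sustained++;
	} else {
		c->state = (c->state == STATE_MED) ? STATE_HIGH : STATE_MED;
		c->sustained = 0;
	}
	return c->state;
}
```

Requiring the threshold to hold for about a second before switching keeps
short bursts from toggling the frequency on every 10ms sample.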

3. Proposed  API

1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
                                   uint8_t *freq_tlb);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
which is used to detect empty poll state changes.

ChangeLog:
v2: fix some coding style issues.
v3: rename the filename, API name.
v4: no change.
v5: no change.
v6: re-work the code layout, update API.
v7: fix minor typo and lift node num limit.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 lib/librte_power/Makefile               |   6 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 500 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 205 +++++++++++++
 lib/librte_power/rte_power_version.map  |  13 +
 5 files changed, 725 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..a8f1301 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_power.a
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +17,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..3dac654
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,500 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR 2
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%, this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
+				"cur edpi %ld "
+				"base epdi %ld\n",
+				cur_edpi,
+				s->thresh[s->cur_freq].base_edpi);
+		/* Value to make us fail need debug log*/
+		return 1000UL;
+	}
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)(((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
+		return;
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100)
+		return;
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[MED].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, HGH_BUSY);
+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
+			}
+
+		} else {
+			/* reset */
+			/* need debug log */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+				poll_stats->thresh[HGH].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else
+				set_state(poll_stats, MED_NORMAL);
+		} else
+			/* reset */
+			/* need debug log */
+			poll_stats->threshold_ctr = 0;
+
+
+	}
+}
+
+static int
+empty_poll_training(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+void
+rte_empty_poll_detection(struct rte_timer *tim,
+		void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_training(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	if (freq_tlb == NULL) {
+		freq_index[LOW] = 14;
+		freq_index[MED] = 9;
+		freq_index[HGH] = 1;
+	} else {
+		freq_index[LOW] = freq_tlb[LOW];
+		freq_index[MED] = freq_tlb[MED];
+		freq_index[HGH] = freq_tlb[HGH];
+	}
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	*eptr = ep_params;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+
+		set_state(&w->wrk_stats[i], TRAINING);
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+	}
+
+
+	return 0;
+}
+
+void __rte_experimental
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..0aca1f0
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* Has to be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         256  /* Max core number*/
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Each Worker Thread Empty Poll Stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param eptr
+ *   the structure of empty poll configuration
+ * @param freq_tlb
+ *   the power state/frequency  mapping table
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The packet number of one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Empty poll  state change detection function
+ *
+ * @param  tim
+ *  The timer structure
+ * @param  arg
+ *  The customized parameter
+ */
+void  __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index dd587df..11ffdfb 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -33,3 +33,16 @@ DPDK_18.08 {
 	rte_power_get_capabilities;
 
 } DPDK_17.11;
+
+EXPERIMENTAL {
+        global:
+
+        rte_power_empty_poll_stat_init;
+        rte_power_empty_poll_stat_free;
+        rte_power_empty_poll_stat_update;
+        rte_power_empty_poll_stat_fetch;
+        rte_power_poll_stat_fetch;
+        rte_power_poll_stat_update;
+        rte_empty_poll_detection;
+
+};
-- 
2.7.5


* [dpdk-dev] [PATCH v7 2/4] examples/l3fwd-power: simple app update for new API
  2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
@ 2018-09-04 14:10                 ` Liang Ma
  2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 3/4] doc/guides/proguides/power-man: update the power API Liang Ma
                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-04 14:10 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

Add support for the new traffic pattern aware power control
power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for all parameters except
empty-poll.

The options "l", "m" and "h" set the frequency index for the
LOW, MED and HIGH power states. They only take effect when empty-poll is enabled.
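
The patch parses these three options with parse_max_pkt_len(). A dedicated
parser with a range check could look like the sketch below. The helper name
and the max_index bound are hypothetical; one plausible bound is the number
of frequencies reported by rte_power_freqs() minus one.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a frequency index given on the command line.
 * Hypothetical helper, not part of the patch. Returns 0 on success and
 * -1 when the text is not a plain decimal number or exceeds max_index.
 */
static int
sketch_parse_freq_index(const char *arg, uint8_t max_index, uint8_t *out)
{
	char *end = NULL;
	unsigned long v;

	if (arg == NULL || *arg == '\0')
		return -1;

	errno = 0;
	v = strtoul(arg, &end, 10);
	/* Reject trailing garbage, overflow, and out-of-range values. */
	if (errno != 0 || *end != '\0' || v > max_index)
		return -1;

	*out = (uint8_t)v;
	return 0;
}
```

With this shape, a bad "-l", "-m" or "-h" value can be rejected while parsing
arguments instead of being truncated into the uint8_t table.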

Once empty-poll is enabled, the system starts with the training phase.
No traffic should pass through during the training phase.
When the training phase completes, the system moves to the normal phase.
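
The doc patch in this series describes training as averaging the empty-poll
numbers per power state to create a baseline. A minimal sketch of such an
average follows. All names are hypothetical, and the library's own
update_training_stats() may compute its baseline differently.

```c
#include <stdint.h>

/* Hypothetical accumulator for one power state during training. */
struct sketch_baseline {
	uint64_t sum;      /* total empty polls across the sampled intervals */
	uint32_t samples;  /* number of timer intervals sampled so far */
};

/* Record the empty-poll count observed during one timer interval. */
static void
sketch_baseline_add(struct sketch_baseline *b, uint64_t empty_polls)
{
	b->sum += empty_polls;
	b->samples++;
}

/* Mean empty polls per interval; 0 means "not trained yet". */
static uint64_t
sketch_baseline_mean(const struct sketch_baseline *b)
{
	return b->samples ? b->sum / b->samples : 0;
}
```

Treating 0 as "not trained" matches the library code, where
update_stats_normal() returns early while base_edpi is 0.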

The system starts in the modest (MED) power state.
If the system busyness percentage rises above 70%, the system moves
to the HIGH power state. If the traffic drops (e.g. the
system busyness percentage falls below 30%), the system falls back
to the modest power state.
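
The MED/HIGH switching can be sketched as a small state machine with
hysteresis. The 70% and 30% thresholds come from the text above. The
requirement that a threshold stays crossed for about one second of 10ms ticks
mirrors the threshold_ctr pattern in update_stats_normal(). All sketch_* names
are hypothetical.

```c
#include <stdint.h>

enum sketch_state { SKETCH_MED, SKETCH_HGH };

#define SKETCH_TICKS_PER_SECOND 100  /* one tick every 10 ms */
#define SKETCH_UP_PERCENT        70  /* MED -> HGH above this */
#define SKETCH_DOWN_PERCENT      30  /* HGH -> MED below this */

struct sketch_core {
	enum sketch_state state;
	uint32_t streak;  /* consecutive ticks beyond the active threshold */
};

/* Feed one busyness sample (0..100); returns the state after this tick. */
static enum sketch_state
sketch_tick(struct sketch_core *c, uint32_t busy_percent)
{
	int beyond;

	if (busy_percent > 100)
		return c->state;  /* ignore implausible samples */

	beyond = (c->state == SKETCH_MED)
		? busy_percent > SKETCH_UP_PERCENT
		: busy_percent < SKETCH_DOWN_PERCENT;

	if (!beyond) {
		c->streak = 0;    /* condition broken, start counting again */
		return c->state;
	}

	if (c->streak < SKETCH_TICKS_PER_SECOND) {
		c->streak++;
		return c->state;
	}

	/* Threshold held long enough: switch state and reset the streak. */
	c->state = (c->state == SKETCH_MED) ? SKETCH_HGH : SKETCH_MED;
	c->streak = 0;
	return c->state;
}
```

Samples between 30% and 70% never cause a transition, so a core sitting near
one threshold does not flip between frequencies on every tick.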

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.
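
One way the monitoring thread could turn the workers' ever-increasing
counters into a per-interval busyness figure is sketched below. The
current/previous pair mirrors the empty_dequeues and empty_dequeues_prev
fields in struct priority_worker. The percentage formula is an assumption
made for illustration and is not taken from the library's update_stats().

```c
#include <stdint.h>

struct sketch_counter {
	uint64_t total;  /* written by the worker, only ever increases */
	uint64_t prev;   /* value the monitor saw at the previous tick */
};

/* Empty polls since the previous tick; remembers the new reference point. */
static uint64_t
sketch_interval_delta(struct sketch_counter *c)
{
	uint64_t delta = c->total - c->prev;

	c->prev = c->total;
	return delta;
}

/*
 * Busyness in percent relative to the idle baseline (empty polls per
 * interval measured with no traffic). Assumed formula: the fewer empty
 * polls compared with the baseline, the busier the core.
 */
static uint32_t
sketch_busy_percent(uint64_t empty_delta, uint64_t baseline)
{
	if (baseline == 0 || empty_delta >= baseline)
		return 0;  /* untrained, or at least as idle as the baseline */

	return (uint32_t)(100 - (empty_delta * 100) / baseline);
}
```

Called once per 10ms tick, the resulting percentage is the kind of value
that would be compared against the 70% and 30% thresholds.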

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.
v7 no change.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 examples/l3fwd-power/Makefile    |   3 +
 examples/l3fwd-power/main.c      | 253 ++++++++++++++++++++++++++++++++++++---
 examples/l3fwd-power/meson.build |   1 +
 3 files changed, 240 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
index d7e39a3..772ec7b 100644
--- a/examples/l3fwd-power/Makefile
+++ b/examples/l3fwd-power/Makefile
@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
 LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
 LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
 	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
 
@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
 all:
 else
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index d15cd52..f1e254b 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_power_empty_poll.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -55,6 +56,8 @@
 
 /* 100 ms interval */
 #define TIMER_NUMBER_PER_SECOND           10
+/* (10ms) */
+#define INTERVALS_PER_SECOND             100
 /* 100000 us */
 #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
 #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
@@ -117,6 +120,9 @@
  */
 #define RTE_TEST_RX_DESC_DEFAULT 1024
 #define RTE_TEST_TX_DESC_DEFAULT 1024
+
+
+
 static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
 static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
 
@@ -132,6 +138,10 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+volatile bool empty_poll_stop;
+static struct  ep_params *ep_params;
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -331,6 +341,13 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static uint8_t  freq_tlb[] = {14, 9, 1};
+
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
+
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -339,7 +356,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -352,16 +377,19 @@ signal_exit_now(int sigtype)
 							"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -826,7 +854,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
 
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 /* main processing loop */
 static int
 main_loop(__attribute__((unused)) void *dummy)
@@ -1128,7 +1256,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1231,6 +1360,7 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
@@ -1238,13 +1368,14 @@ parse_args(int argc, char **argv)
 		{"high-perf-cores", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
+		{"empty-poll", 0, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
 	argvopt = argv;
 
-	while ((opt = getopt_long(argc, argvopt, "p:P",
+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
 				lgopts, &option_index)) != EOF) {
 
 		switch (opt) {
@@ -1261,7 +1392,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[LOW] = limit;
+			break;
+		case 'm':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[MED] = limit;
+			break;
+		case 'h':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[HGH] = limit;
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1300,6 +1442,12 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1647,6 +1795,59 @@ init_power_library(void)
 	}
 	return ret;
 }
+static void
+empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			rte_empty_poll_detection,
+			(void *)ep_ptr);
+
+}
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 
 int
 main(int argc, char **argv)
@@ -1829,13 +2030,15 @@ main(int argc, char **argv)
 		if (rte_lcore_is_enabled(lcore_id) == 0)
 			continue;
 
-		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			/* init timer structures for each enabled lcore */
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND,
+					SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
@@ -1906,12 +2109,28 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true)
+		rte_power_empty_poll_stat_init(&ep_params, freq_tlb);
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
index 20c8054..a3c5c2f 100644
--- a/examples/l3fwd-power/meson.build
+++ b/examples/l3fwd-power/meson.build
@@ -9,6 +9,7 @@
 if host_machine.system() != 'linux'
 	build = false
 endif
+allow_experimental_apis = true
 deps += ['power', 'timer', 'lpm', 'hash']
 sources = files(
 	'main.c', 'perf_core.c'
-- 
2.7.5


* [dpdk-dev] [PATCH v7 3/4] doc/guides/proguides/power-man: update the power API
  2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
  2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-09-04 14:10                 ` Liang Ma
  2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-04 14:10 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

Update the documentation for the empty poll API.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 87 +++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index eba1cc6..d8a4ef7 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -106,6 +106,93 @@ User Cases
 
 The power management mechanism is used to save power when performing L3 forwarding.
 
+
+Empty Poll API
+--------------
+
+Abstract
+~~~~~~~~
+
+For packet processing workloads such as DPDK, polling is continuous.
+This means CPU cores always show 100% busy independent of how much work
+those cores are doing. It is critical to accurately determine how busy
+a core is, for the following reasons:
+
+        * There is no indication of overload conditions.
+        * Users do not know how much real load is on a system, which
+          results in wasted energy as no power management is utilized.
+
+Compared to the original l3fwd-power design, instead of going to sleep
+after detecting an empty poll, the new mechanism just lowers the core frequency.
+As a result, the application does not stop polling the device, which leads
+to improved handling of bursts of traffic.
+
+When the system becomes busy, the empty poll mechanism can also increase the core
+frequency (including turbo) to do best effort for intensive traffic. This gives
+us more flexible and balanced traffic awareness over the standard l3fwd-power
+application.
+
+
+Proposed Solution
+~~~~~~~~~~~~~~~~~
+The proposed solution focuses on how many times empty polls are executed.
+A low number of empty polls means the current core is busy processing its
+workload, so a higher frequency is needed. A high number of empty polls
+indicates the current core is not doing any real work, so we can lower the
+frequency to save power.
+
+In the current implementation, each core has 1 empty-poll counter, which assumes
+1 core is dedicated to 1 queue. This will need to be expanded in the future to
+support multiple queues per core.
+
+Power state definition:
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* LOW:  Not currently used, reserved for future use.
+
+* MED:  the frequency is used to process modest traffic workload.
+
+* HIGH: the frequency is used to process busy traffic workload.
+
+There are two phases to establish the power management system:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* Initialization/Training phase. The training phase is necessary
+  in order to figure out the system polling baseline numbers from
+  idle to busy. The highest poll count will be during idle, where all
+  polls are empty. These poll counts will be different between
+  systems due to the many possible processor micro-arch, cache
+  and device configurations, hence the training phase.
+  In the training phase, traffic is blocked so the training algorithm
+  can average the empty-poll numbers for the LOW, MED and
+  HIGH  power states in order to create a baseline.
+  Each core's counters are collected every 10ms, and the training
+  phase takes 2 seconds.
+
+* Normal phase. When the training phase is complete, traffic is
+  started. The run-time poll counts are compared with the
+  baseline and the decision will be taken to move to MED power
+  state or HIGH power state. The counters are calculated every
+  10ms.
+
+
+API Overview for Empty Poll Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **State Init**: initialize the power management system.
+
+* **State Free**: free the resources held by the power management system.
+
+* **Update Empty Poll Counter**: update the empty poll counter.
+
+* **Update Valid Poll Counter**: update the valid poll counter.
+
+* **Set the Frequency Index**: update the power state/frequency mapping.
+
+* **Detect empty poll state change**: empty poll state change detection algorithm.
+
+User Cases
+----------
+The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
+
 References
 ----------
 
-- 
2.7.5


* [dpdk-dev] [PATCH v7 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update
  2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
  2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
  2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 3/4] doc/guides/proguides/power-man: update the power API Liang Ma
@ 2018-09-04 14:10                 ` Liang Ma
  2018-09-13 10:54                 ` [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control Kevin Traynor
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
  4 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-04 14:10 UTC (permalink / raw)
  To: david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary, Liang Ma

Add an empty poll mode command line example.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/sample_app_ug/l3_forward_power_man.rst | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 795a570..7bea0a8 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -362,3 +362,24 @@ The algorithm has the following sleeping behavior depending on the idle counter:
 If a thread polls multiple Rx queues and different queue returns different sleep duration values,
 the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
 in order to avoid a potential performance impact.
+
+Empty Poll Mode
+-------------------------
+Empty poll mode is a recently added mode. It can be enabled with the
+command line option --empty-poll.
+
+See "Power Management" chapter in the DPDK Programmer's Guide for empty poll mode details.
+
+.. code-block:: console
+
+    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll -l 14 -m 9 -h 1
+
+Where,
+
+--empty-poll: Enable the empty poll mode instead of the original algorithm
+
+-l : optional, set up the LOW power state frequency index
+
+-m : optional, set up the MED power state frequency index
+
+-h : optional, set up the HIGH power state frequency index
-- 
2.7.5


* Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-06-27 17:33           ` Kevin Traynor
  2018-07-05 14:45             ` Liang, Ma
@ 2018-09-11  9:19             ` Hunt, David
  2018-09-13  9:46               ` Kevin Traynor
  1 sibling, 1 reply; 79+ messages in thread
From: Hunt, David @ 2018-09-11  9:19 UTC (permalink / raw)
  To: Kevin Traynor, Radu Nicolau, dev, liang.j.ma

Hi Kevin,


On 27/6/2018 6:33 PM, Kevin Traynor wrote:
> On 06/26/2018 12:40 PM, Radu Nicolau wrote:
>> From: Liang Ma <liang.j.ma@intel.com>
>>
>> 1. Abstract

--snip--

>> 2.2 There are two phases to establish the power management system:
>>
>> 	a.Initialization/Training phase. There is no traffic pass-through,
>> 	  the system will test average empty poll numbers  with
>> 	  LOW/MED/HIGH  power state. Those average empty poll numbers
>> 	  will be the baseline
>> 	  for the normal phase. The system will collect all core's counter
>> 	  every 100ms. The Training phase will take 5 seconds.
>>
> This is requiring an application to sit for 5 secs in order to train and
> align poll numbers with states? That doesn't seem realistic to me.


Thanks for the discussion at DPDK Userspace conference. Since we got
back, Liang and I have discussed the feedback we received, and we have
a proposal.

We can split out the training phase into a separate run of the application
which does the training, spits out the threshold numbers, and then
the actual runs will start instantly once the threshold parameters are
provided on the command line, or falls back to hard-coded defaults if
no command line parameters are given.

So there are three ways of running the app
   1. Run without any threshold parameters, in which case the algorithm
      runs with default numbers calculated based on the min and max
      available frequency.
   2. Run with --train option, which requires no traffic on the NICS, and
      runs the training algorithm, prints out the thresholds for the
      host CPU, and exits.
   3. Take the output of the train phase, and provide the thresholds on
      the command line, and the app runs with the best fit to the
      running CPU.

That would eliminate the training period at startup, unless the user
wanted to fine-tune for a particular host CPU.

Would that be an adequate solution to the training period concerns?

Regards,
Dave.


* Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-09-11  9:19             ` Hunt, David
@ 2018-09-13  9:46               ` Kevin Traynor
  2018-09-13 13:30                 ` Liang, Ma
  0 siblings, 1 reply; 79+ messages in thread
From: Kevin Traynor @ 2018-09-13  9:46 UTC (permalink / raw)
  To: Hunt, David, Radu Nicolau, dev, liang.j.ma

On 09/11/2018 10:19 AM, Hunt, David wrote:
> Hi Kevin,
> 

Hi Dave,

> 
> On 27/6/2018 6:33 PM, Kevin Traynor wrote:
>> On 06/26/2018 12:40 PM, Radu Nicolau wrote:
>>> From: Liang Ma <liang.j.ma@intel.com>
>>>
>>> 1. Abstract
> 
> --snip--
> 
>>> 2.2 There are two phases to establish the power management system:
>>>
>>>     a.Initialization/Training phase. There is no traffic pass-through,
>>>       the system will test average empty poll numbers  with
>>>       LOW/MED/HIGH  power state. Those average empty poll numbers
>>>       will be the baseline
>>>       for the normal phase. The system will collect all core's counter
>>>       every 100ms. The Training phase will take 5 seconds.
>>>
>> This is requiring an application to sit for 5 secs in order to train and
>> align poll numbers with states? That doesn't seem realistic to me.
> 
> 
> Thanks for the discussion at DPDK Userspace conference. Since we got
> back, Liang and
> I have discussed the feedback we received, and we have a proposal.
> 
> We can split out the training phase into a separate run of the application
> which does the training, spits out the threshold numbers, and then
> the actual runs will start instantly once the threshold parameters are
> provided on the command line, or falls back to hard-coded defaults if
> no command line parameters are given.
> 
> So there are three ways of running the app
>   1. Run without any threshold parameters, in which case the algorithm
>       runs with default numbers calculated based on the min and max
> available frequency.
>   2. Run with --train option, which requires no traffic on the NICS, and
>       runs the training algorithm, prints out the thresholds for the
> host CPU, and exits.
>   3. Take the output of the train phase, and provide the thresholds on
> the command
>       line, and the app runs with the best fit to the running CPU.
> 
> That would eliminate the training period at startup, unless the user
> wanted to fine-tune
> for a particular host CPU.
> 
> Would that be an adequate solution to the training period concerns?
> 

Thanks for following up. This allows it to run without a training
phase, which is what I thought could be problematic from an application
view, so that's nice. I'm not sure whether it's much less effective without
that training phase, but the comment was about having a forced
training phase, and that is resolved now that training is not required.

I'm still not sure I see the use cases for the options where there *is*
a training type phase, but it's difficult to know. Considering it's
experimental, if you feel there are some potential use cases and
justification to add it, then that's fine with me.

I have a few comments on the API, which I'll send as a direct reply to
the patch.

thanks,
Kevin.

> Regards,
> Dave.


* Re: [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
                                   ` (2 preceding siblings ...)
  2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
@ 2018-09-13 10:54                 ` Kevin Traynor
  2018-09-13 13:37                   ` Liang, Ma
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
  4 siblings, 1 reply; 79+ messages in thread
From: Kevin Traynor @ 2018-09-13 10:54 UTC (permalink / raw)
  To: Liang Ma, david.hunt
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary

On 09/04/2018 03:10 PM, Liang Ma wrote:
> 1. Abstract
> 

Hi Liang,

I didn't review the code, but I have some comments on the API below.

> For packet processing workloads such as DPDK, polling is continuous.
> This means CPU cores always show 100% busy, independent of how much work
> those cores are doing. Accurately determining how busy a core is
> matters because, without it:
> 
>    * There is no indication of overload conditions
> 
>    * Users do not know how much real load is on a system, resulting in
>      wasted energy as no power management is utilized
> 
> Compared to the original l3fwd-power design, instead of going to sleep
> after detecting an empty poll, the new mechanism just lowers the core
> frequency. As a result, the application does not stop polling the device,
> which leads to improved handling of bursts of traffic.
> 
> When the system becomes busy, the empty poll mechanism can also increase the
> core frequency (including turbo) to do best effort for intensive traffic.
> This gives us more flexible and balanced traffic awareness over the
> standard l3fwd-power application.
> 
> 2. Proposed solution
> 
> The proposed solution focuses on how many times empty polls are executed.
> A low number of empty polls means the current core is busy processing
> its workload, so a higher frequency is needed. A high number of empty
> polls indicates the current core is not doing any real work, so we can
> lower the frequency to save power.
> 
> In the current implementation, each core has 1 empty-poll counter which
> assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
> future to support multiple queues per core.
> 
> 2.1 Power state definition:
> 
> 	LOW:  Not currently used, reserved for future use.
> 
> 	MED:  the frequency is used to process modest traffic workload.
> 
> 	HIGH: the frequency is used to process busy traffic workload.
> 
> 2.2 There are two phases to establish the power management system:
> 
> 	a.Initialization/Training phase. The training phase is necessary
> 	  in order to figure out the system polling baseline numbers from
> 	  idle to busy. The highest poll count will be during idle, where
> 	  all polls are empty. These poll counts will be different between
> 	  systems due to the many possible processor micro-arch, cache
> 	  and device configurations, hence the training phase.
>   	  In the training phase, traffic is blocked so the training

When you say 'traffic is blocked', is this something that the application
can do through the DPDK API, or do you mean no external packets are sent
into that port?

>   	  algorithm can average the empty-poll numbers for the LOW, MED and
>  	  HIGH  power states in order to create a baseline.
>   	  The core's counter are collected every 10ms, and the Training
>  	   phase will take 2 seconds.
> 
> 	b.Normal phase. When the training phase is complete, traffic is
>   	  started. The run-time poll counts are compared with the
> 	  baseline and the decision will be taken to move to MED power
>   	  state or HIGH power state. The counters are calculated every 10ms.
> 
> 3. Proposed  API
> 
> 1.  rte_power_empty_poll_stat_init(void);
> which is used to initialize the power management system.
>  
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resources held by the power management system.
>  
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>  
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>  

Is uint8_t enough to cover the max burst size for an rx poll? I didn't check.
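
For what it's worth, rte_eth_rx_burst() takes and returns a uint16_t
packet count, so an application using a burst size above 255 would have its
count silently truncated by a uint8_t parameter. A minimal sketch of the
effect, using hypothetical helper names rather than the patch's functions:

```c
#include <stdint.h>

/*
 * Hypothetical stand-ins for the packet counter update, differing only in
 * the width of the nb_pkt parameter.
 */
uint64_t
count_pkts_u8(uint64_t total, uint8_t nb_pkt)
{
	/* A caller passing 300 arrives here as 300 % 256 == 44. */
	return total + nb_pkt;
}

uint64_t
count_pkts_u16(uint64_t total, uint16_t nb_pkt)
{
	/* Matches the width of the rx burst return value. */
	return total + nb_pkt;
}
```

With the common burst size of 32 both variants behave the same; the
difference only shows up once the burst size exceeds 255.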

> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>  
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
> 

How about replacing 1-6 with something like the below? (I'm not sure what
the best prefix would be.)

rte_power_poll_stat_init(void);
rte_power_poll_stat_free(void);
rte_power_poll_stat_update(unsigned int lcore, uint8_t nb_pkts);
rte_power_poll_stat_fetch(unsigned int lcore, uint8_t stat);

This would mean combining 3./4. as per previous suggestion so the
application could just call a single function with nb_pkt (which could
be 0). It also makes it more extensible if you want to add new stats in
the future, rather than having to add more functions.

It doesn't do anything different, so perhaps it's just a matter of
personal taste.
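
To make the idea concrete, here is a rough sketch of how a combined
update/fetch pair might look. Everything prefixed sketch_ or SKETCH_ is
hypothetical and only for illustration. It also assumes a uint16_t packet
count and an enum stat selector with an out-parameter, which differ from
the prototypes above.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 256 /* mirrors NUM_NODES in the patch */

/* Hypothetical stat selector; new stats can be added before the MAX entry. */
enum sketch_poll_stat {
	SKETCH_STAT_EMPTY_POLLS, /* polls that returned no packets */
	SKETCH_STAT_PKTS,        /* packets returned by non-empty polls */
	SKETCH_STAT_MAX
};

static uint64_t sketch_stats[SKETCH_MAX_LCORE][SKETCH_STAT_MAX];

/* One call per rx poll; nb_pkts == 0 records an empty poll. */
int
sketch_poll_stat_update(unsigned int lcore, uint16_t nb_pkts)
{
	if (lcore >= SKETCH_MAX_LCORE)
		return -1;

	if (nb_pkts == 0)
		sketch_stats[lcore][SKETCH_STAT_EMPTY_POLLS]++;
	else
		sketch_stats[lcore][SKETCH_STAT_PKTS] += nb_pkts;
	return 0;
}

/* Returns 0 and stores the counter in *value, or -1 on bad arguments. */
int
sketch_poll_stat_fetch(unsigned int lcore, enum sketch_poll_stat stat,
		uint64_t *value)
{
	if (lcore >= SKETCH_MAX_LCORE ||
			(unsigned int)stat >= SKETCH_STAT_MAX || value == NULL)
		return -1;

	*value = sketch_stats[lcore][stat];
	return 0;
}
```

Returning the counter through an out-parameter keeps the error code separate
from the value, so a failed fetch cannot be mistaken for a very large count.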

> 7.  rte_empty_poll_detection(void);
> which is used to detect empty poll state changes.
> 

s/rte/rte_power/ ?

I think this needs some better docs/doxygen and maybe a better name. It
seems not just to detect a state change but also to take some actions, so
that should be documented, even in a general way.

thanks,
Kevin.

> ChangeLog:
> v2: fix some coding style issues.
> v3: rename the filename, API name.
> v4: no change.
> v5: no change.
> v6: re-work the code layout, update API.
> v7: fix minor typo and lift node num limit.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> 
> Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> ---
>  lib/librte_power/Makefile               |   6 +-
>  lib/librte_power/meson.build            |   5 +-
>  lib/librte_power/rte_power_empty_poll.c | 500 ++++++++++++++++++++++++++++++++
>  lib/librte_power/rte_power_empty_poll.h | 205 +++++++++++++
>  lib/librte_power/rte_power_version.map  |  13 +
>  5 files changed, 725 insertions(+), 4 deletions(-)
>  create mode 100644 lib/librte_power/rte_power_empty_poll.c
>  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> 
> diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> index 6f85e88..a8f1301 100644
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
>  # library name
>  LIB = librte_power.a
>  
> +CFLAGS += -DALLOW_EXPERIMENTAL_API
>  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> -LDLIBS += -lrte_eal
> +LDLIBS += -lrte_eal -lrte_timer
>  
>  EXPORT_MAP := rte_power_version.map
>  
> @@ -16,8 +17,9 @@ LIBABIVER := 1
>  # all source are stored in SRCS-y
>  SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
>  SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
>  
>  # install this header file
> -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
>  
>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 253173f..63957eb 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
>  	build = false
>  endif
>  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> -		'power_kvm_vm.c', 'guest_channel.c')
> -headers = files('rte_power.h')
> +		'power_kvm_vm.c', 'guest_channel.c',
> +		'rte_power_empty_poll.c')
> +headers = files('rte_power.h','rte_power_empty_poll.h')
> diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
> new file mode 100644
> index 0000000..3dac654
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.c
> @@ -0,0 +1,500 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +
> +#include "rte_power.h"
> +#include "rte_power_empty_poll.h"
> +
> +#define INTERVALS_PER_SECOND 100     /* (10ms) */
> +#define SECONDS_TO_TRAIN_FOR 2
> +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> +#define DEFAULT_CYCLES_PER_PACKET 800
> +
> +static struct ep_params *ep_params;
> +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> +
> +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> +
> +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> +
> +static uint32_t freq_index[NUM_FREQ];
> +
> +static uint32_t
> +get_freq_index(enum freq_val index)
> +{
> +	return freq_index[index];
> +}
> +
> +
> +static int
> +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> +{
> +	int err = 0;
> +	uint32_t power_freq_index;
> +	if (!specific_freq)
> +		power_freq_index = get_freq_index(freq);
> +	else
> +		power_freq_index = freq;
> +
> +	err = rte_power_set_freq(lcore_id, power_freq_index);
> +
> +	return err;
> +}
> +
> +
> +static inline void __attribute__((always_inline))
> +exit_training_state(struct priority_worker *poll_stats)
> +{
> +	RTE_SET_USED(poll_stats);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_training_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->cur_freq = LOW;
> +	poll_stats->queue_state = TRAINING;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_normal_state(struct priority_worker *poll_stats)
> +{
> +	/* Clear the averages arrays and strs */
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = MED;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = MED_NORMAL;
> +	set_power_freq(poll_stats->lcore_id, MED, false);
> +
> +	/* Try here */
> +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_busy_state(struct priority_worker *poll_stats)
> +{
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = HGH;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = HGH_BUSY;
> +	set_power_freq(poll_stats->lcore_id, HGH, false);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_purge_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->queue_state = LOW_PURGE;
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_state(struct priority_worker *poll_stats,
> +		enum queue_state new_state)
> +{
> +	enum queue_state old_state = poll_stats->queue_state;
> +	if (old_state != new_state) {
> +
> +		/* Call any old state exit functions */
> +		if (old_state == TRAINING)
> +			exit_training_state(poll_stats);
> +
> +		/* Call any new state entry functions */
> +		if (new_state == TRAINING)
> +			enter_training_state(poll_stats);
> +		if (new_state == MED_NORMAL)
> +			enter_normal_state(poll_stats);
> +		if (new_state == HGH_BUSY)
> +			enter_busy_state(poll_stats);
> +		if (new_state == LOW_PURGE)
> +			enter_purge_state(poll_stats);
> +	}
> +}
> +
> +
> +static void
> +update_training_stats(struct priority_worker *poll_stats,
> +		uint32_t freq,
> +		bool specific_freq,
> +		uint32_t max_train_iter)
> +{
> +	RTE_SET_USED(specific_freq);
> +
> +	char pfi_str[32];
> +	uint64_t p0_empty_deq;
> +
> +	sprintf(pfi_str, "%02d", freq);
> +
> +	if (poll_stats->cur_freq == freq &&
> +			poll_stats->thresh[freq].trained == false) {
> +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> +
> +			set_power_freq(poll_stats->lcore_id,
> +					freq, specific_freq);
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +			return;
> +		} else if (poll_stats->thresh[freq].cur_train_iter
> +				<= max_train_iter) {
> +
> +			p0_empty_deq = poll_stats->empty_dequeues -
> +				poll_stats->empty_dequeues_prev;
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +		} else {
> +			if (poll_stats->thresh[freq].trained == false) {
> +				poll_stats->thresh[freq].base_edpi =
> +					poll_stats->thresh[freq].base_edpi /
> +					max_train_iter;
> +
> +				/* Add on a factor of 0.05%, this should remove any */
> +				/* false negatives when the system is 0% busy */
> +				poll_stats->thresh[freq].base_edpi +=
> +					poll_stats->thresh[freq].base_edpi / 2000;
> +
> +				poll_stats->thresh[freq].trained = true;
> +				poll_stats->cur_freq++;
> +
> +			}
> +		}
> +	}
> +}
> +
> +static inline uint32_t __attribute__((always_inline))
> +update_stats(struct priority_worker *poll_stats)
> +{
> +	uint64_t tot_edpi = 0, tot_ppi = 0;
> +	uint32_t j, percent;
> +
> +	struct priority_worker *s = poll_stats;
> +
> +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> +
> +	s->empty_dequeues_prev = s->empty_dequeues;
> +
> +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> +
> +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> +
> +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> +		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> +				"cur edpi %ld "
> +				"base epdi %ld\n",
> +				cur_edpi,
> +				s->thresh[s->cur_freq].base_edpi);
> +		/* Value to make us fail need debug log*/
> +		return 1000UL;
> +	}
> +
> +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> +
> +	for (j = 0; j < BINS_AV; j++) {
> +		tot_edpi += s->edpi_av[j];
> +		tot_ppi += s->ppi_av[j];
> +	}
> +
> +	tot_edpi = tot_edpi / BINS_AV;
> +
> +	percent = 100 - (uint32_t)(((float)tot_edpi /
> +			(float)s->thresh[s->cur_freq].base_edpi) * 100);
> +
> +	return (uint32_t)percent;
> +}
> +
> +
> +static inline void  __attribute__((always_inline))
> +update_stats_normal(struct priority_worker *poll_stats)
> +{
> +	uint32_t percent;
> +
> +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> +		return;
> +
> +	percent = update_stats(poll_stats);
> +
> +	if (percent > 100)
> +		return;
> +
> +	if (poll_stats->cur_freq == LOW)
> +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> +	else if (poll_stats->cur_freq == MED) {
> +
> +		if (percent >
> +			poll_stats->thresh[MED].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else {
> +				set_state(poll_stats, HGH_BUSY);
> +				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> +			}
> +
> +		} else {
> +			/* reset */
> +			/* need debug log */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	} else if (poll_stats->cur_freq == HGH) {
> +
> +		if (percent <
> +				poll_stats->thresh[HGH].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else
> +				set_state(poll_stats, MED_NORMAL);
> +		} else
> +			/* reset */
> +			/* need debug log */
> +			poll_stats->threshold_ctr = 0;
> +
> +
> +	}
> +}
> +
> +static int
> +empty_poll_training(struct priority_worker *poll_stats,
> +		uint32_t max_train_iter)
> +{
> +
> +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> +		poll_stats->iter_counter++;
> +		return 0;
> +	}
> +
> +
> +	update_training_stats(poll_stats,
> +			LOW,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			MED,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			HGH,
> +			false,
> +			max_train_iter);
> +
> +
> +	if (poll_stats->thresh[LOW].trained == true
> +			&& poll_stats->thresh[MED].trained == true
> +			&& poll_stats->thresh[HGH].trained == true) {
> +
> +		set_state(poll_stats, MED_NORMAL);
> +
> +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> +				poll_stats->lcore_id);
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +rte_empty_poll_detection(struct rte_timer *tim,
> +		void *arg)
> +{
> +
> +	uint32_t i;
> +
> +	struct priority_worker *poll_stats;
> +
> +	RTE_SET_USED(tim);
> +
> +	RTE_SET_USED(arg);
> +
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> +
> +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> +			continue;
> +
> +		switch (poll_stats->queue_state) {
> +		case(TRAINING):
> +			empty_poll_training(poll_stats,
> +					ep_params->max_train_iter);
> +			break;
> +
> +		case(HGH_BUSY):
> +		case(MED_NORMAL):
> +			update_stats_normal(poll_stats);
> +
> +			break;
> +
> +		case(LOW_PURGE):
> +			break;
> +		default:
> +			break;
> +
> +		}
> +
> +	}
> +
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb)
> +{
> +	uint32_t i;
> +	/* Allocate the ep_params structure */
> +	ep_params = rte_zmalloc_socket(NULL,
> +			sizeof(struct ep_params),
> +			0,
> +			rte_socket_id());
> +
> +	if (!ep_params)
> +		rte_panic("Cannot allocate heap memory for ep_params "
> +				"for socket %d\n", rte_socket_id());
> +
> +	if (freq_tlb == NULL) {
> +		freq_index[LOW] = 14;
> +		freq_index[MED] = 9;
> +		freq_index[HGH] = 1;
> +	} else {
> +		freq_index[LOW] = freq_tlb[LOW];
> +		freq_index[MED] = freq_tlb[MED];
> +		freq_index[HGH] = freq_tlb[HGH];
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> +
> +	/* Train for SECONDS_TO_TRAIN_FOR seconds' worth of intervals */
> +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> +
> +	struct stats_data *w = &ep_params->wrk_data;
> +
> +	*eptr = ep_params;
> +
> +	/* initialize all wrk_stats state */
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		if (rte_lcore_is_enabled(i) == 0)
> +			continue;
> +
> +		set_state(&w->wrk_stats[i], TRAINING);
> +		/*init the freqs table */
> +		total_avail_freqs[i] = rte_power_freqs(i,
> +				avail_freqs[i],
> +				NUM_FREQS);
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +	}
> +
> +
> +	return 0;
> +}
> +
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void)
> +{
> +
> +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> +
> +	if (ep_params != NULL)
> +		rte_free(ep_params);
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->empty_dequeues++;
> +
> +	return 0;
> +}
> +
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> +{
> +
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->num_dequeue_pkts += nb_pkt;
> +
> +	return 0;
> +}
> +
> +
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->empty_dequeues;
> +}
> +
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->num_dequeue_pkts;
> +}
> diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
> new file mode 100644
> index 0000000..0aca1f0
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_EMPTY_POLL_H
> +#define _RTE_EMPTY_POLL_H
> +
> +/**
> + * @file
> + * RTE Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_string_fns.h>
> +#include <rte_power.h>
> +#include <rte_timer.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define NUM_FREQS 20
> +
> +#define BINS_AV 4 /* Has to be a power of 2 */
> +
> +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> +
> +#define NUM_PRIORITIES          2
> +
> +#define NUM_NODES         256  /* Max core number*/
> +
> +/* Processor Power State */
> +enum freq_val {
> +	LOW,
> +	MED,
> +	HGH,
> +	NUM_FREQ = NUM_FREQS
> +};
> +
> +
> +/* Queue Polling State */
> +enum queue_state {
> +	TRAINING, /* NO TRAFFIC */
> +	MED_NORMAL,   /* MED */
> +	HGH_BUSY,     /* HIGH */
> +	LOW_PURGE,    /* LOW */
> +};
> +
> +/* Queue Stats */
> +struct freq_threshold {
> +
> +	uint64_t base_edpi;
> +	bool trained;
> +	uint32_t threshold_percent;
> +	uint32_t cur_train_iter;
> +};
> +
> +/* Each Worker Thread Empty Poll Stats */
> +struct priority_worker {
> +
> +	/* Current dequeue and throughput counts */
> +	/* These 2 are written to by the worker threads */
> +	/* So keep them on their own cache line */
> +	uint64_t empty_dequeues;
> +	uint64_t num_dequeue_pkts;
> +
> +	enum queue_state queue_state;
> +
> +	uint64_t empty_dequeues_prev;
> +	uint64_t num_dequeue_pkts_prev;
> +
> +	/* Used for training only */
> +	struct freq_threshold thresh[NUM_FREQ];
> +	enum freq_val cur_freq;
> +
> +	/* bucket arrays to calculate the averages */
> +	uint64_t edpi_av[BINS_AV];
> +	uint32_t  ec;
> +	uint64_t ppi_av[BINS_AV];
> +	uint32_t  pc;
> +
> +	uint32_t lcore_id;
> +	uint32_t iter_counter;
> +	uint32_t threshold_ctr;
> +	uint32_t display_ctr;
> +	uint8_t  dev_id;
> +
> +} __rte_cache_aligned;
> +
> +
> +struct stats_data {
> +
> +	struct priority_worker wrk_stats[NUM_NODES];
> +
> +	/* flag to stop rx threads processing packets until training over */
> +	bool start_rx;
> +
> +};
> +
> +/* Empty Poll Parameters */
> +struct ep_params {
> +
> +	/* Timer related stuff */
> +	uint64_t interval_ticks;
> +	uint32_t max_train_iter;
> +
> +	struct rte_timer timer0;
> +	struct stats_data wrk_data;
> +};
> +
> +
> +/**
> + * Initialize the power management system.
> + *
> + * @param eptr
> + *   the structure of empty poll configuration
> + * @param freq_tlb
> + *   the power state/frequency  mapping table
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
> +
> +/**
> + * Free the resources held by the power management system.
> + */
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void);
> +
> +/**
> + * Update specific core empty poll counter
> + * It's not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id);
> +
> +/**
> + * Update specific core valid poll counter, not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id.
> + * @param nb_pkt
> + *  The packet number of one valid poll.
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> +
> +/**
> + * Fetch specific core empty poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore empty poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Fetch specific core valid poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore valid poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Empty poll  state change detection function
> + *
> + * @param  tim
> + *  The timer structure
> + * @param  arg
> + *  The customized parameter
> + */
> +void  __rte_experimental
> +rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index dd587df..11ffdfb 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -33,3 +33,16 @@ DPDK_18.08 {
>  	rte_power_get_capabilities;
>  
>  } DPDK_17.11;
> +
> +EXPERIMENTAL {
> +        global:
> +
> +        rte_power_empty_poll_stat_init;
> +        rte_power_empty_poll_stat_free;
> +        rte_power_empty_poll_stat_update;
> +        rte_power_empty_poll_stat_fetch;
> +        rte_power_poll_stat_fetch;
> +        rte_power_poll_stat_update;
> +        rte_empty_poll_detection;
> +
> +};
> 


* Re: [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control
  2018-09-13  9:46               ` Kevin Traynor
@ 2018-09-13 13:30                 ` Liang, Ma
  0 siblings, 0 replies; 79+ messages in thread
From: Liang, Ma @ 2018-09-13 13:30 UTC (permalink / raw)
  To: Kevin Traynor; +Cc: Hunt, David, Radu Nicolau, dev

Hi Kevin,
     Many thanks for your feedback.
     Please check my comments below. 

On 13 Sep 10:46, Kevin Traynor wrote:
> 
> Thanks for following up. It's allowing it to run without a training
> phase which is what I thought could be problematic from an application
> view, so that's nice. I'm not sure if it's much less effective without
> that training phase etc, but the comment was focused on having a forced
> training phase, so that is resolved now as it is not required.
> 
I have re-worked the patch so that the simple app runs without training
by default.
> I'm still not sure I see the use cases for the options where there *is*
> a training type phase but it's difficult to know and considering it's
> experimental, if you feel there are some potential use cases and
> justification to add it, then fine with me.
> 
However, the training is still necessary.
The mechanism needs 2 anchor points (1 and 2 below) and a switch point
between them (3):
1.  The maximum empty poll count the system can reach without any real
    traffic. That is the maximum capability, which we use as a baseline.
2.  When the empty poll count drops to zero, that indicates the system
    is 100% busy because it always has work to do.
3.  When the empty poll count drops to a certain point (e.g. 30% of max),
    the mechanism will move to the next frequency.
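
As a rough illustration of how the anchor points turn into a decision, the
sketch below maps an empty poll count onto a busy percentage and compares it
against the 70/30 default thresholds from the v7 patch quoted earlier in
this thread. The function and enum names are hypothetical. It uses integer
arithmetic and ignores the patch's requirement that a threshold be exceeded
for a number of consecutive intervals before switching.

```c
#include <stdint.h>

/* Default thresholds as defined in the quoted v7 patch. */
#define MED_TO_HIGH_PERCENT 70
#define HIGH_TO_MED_PERCENT 30

enum sketch_freq { SKETCH_MED, SKETCH_HIGH };

/*
 * Busy percentage relative to the trained baseline:
 *   empty_polls == baseline -> 0   (anchor 1: idle, all polls empty)
 *   empty_polls == 0        -> 100 (anchor 2: always work to do)
 */
uint32_t
busy_percent(uint64_t empty_polls, uint64_t baseline)
{
	if (baseline == 0 || empty_polls >= baseline)
		return 0;
	return (uint32_t)(100 - (empty_polls * 100) / baseline);
}

/* Single-interval decision; hysteresis over several intervals is omitted. */
enum sketch_freq
next_freq(enum sketch_freq cur, uint32_t busy)
{
	if (cur == SKETCH_MED && busy > MED_TO_HIGH_PERCENT)
		return SKETCH_HIGH;
	if (cur == SKETCH_HIGH && busy < HIGH_TO_MED_PERCENT)
		return SKETCH_MED;
	return cur;
}
```

With a baseline of 1000 empty polls per interval, a reading of 300 (30% of
max) gives 70% busy, which sits right at the MED to HIGH boundary.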

Without the training phase, it's very hard to use a normal traffic payload
to get the absolute anchor points: traffic type, payload size, processor
micro-arch, cache size etc. are too many moving parts.

I use default values which I think will be OK for mainstream Xeon.
If users run a very different system (e.g. Arm), they still need to re-run
the training phase.

The training can be triggered by a command line option.

> I have a few comments on the API, which I'll reply directly to the patch.
> 
> thanks,
> Kevin.
> 
> > Regards,
> > Dave.


* Re: [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-13 10:54                 ` [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control Kevin Traynor
@ 2018-09-13 13:37                   ` Liang, Ma
  2018-09-13 14:05                     ` Hunt, David
  0 siblings, 1 reply; 79+ messages in thread
From: Liang, Ma @ 2018-09-13 13:37 UTC (permalink / raw)
  To: Kevin Traynor
  Cc: david.hunt, dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary

Hi Kevin,
   Many thanks for your comments.
   I will send v8 patch soon. 
   Please check comments below. 

On 13 Sep 11:54, Kevin Traynor wrote:
> On 09/04/2018 03:10 PM, Liang Ma wrote:
> > 1. Abstract
> > 
> 
> Hi Liang,
> 
> I didn't review the code, but some comments on API below,
> 
> > For packet processing workloads such as DPDK, polling is continuous.
> > This means CPU cores always show 100% busy, independent of how much work
> > those cores are doing. Accurately determining how busy a core is
> > matters because, without it:
> > 
> >    * There is no indication of overload conditions
> > 
> >    * Users do not know how much real load is on a system, resulting in
> >      wasted energy as no power management is utilized
> > 
> > Compared to the original l3fwd-power design, instead of going to sleep
> > after detecting an empty poll, the new mechanism just lowers the core
> > frequency. As a result, the application does not stop polling the device,
> > which leads to improved handling of bursts of traffic.
> > 
> > When the system becomes busy, the empty poll mechanism can also increase the
> > core frequency (including turbo) to do best effort for intensive traffic.
> > This gives us more flexible and balanced traffic awareness over the
> > standard l3fwd-power application.
> > 
> > 2. Proposed solution
> > 
> > The proposed solution focuses on how many times empty polls are executed.
> > A low number of empty polls means the current core is busy processing
> > its workload, so a higher frequency is needed. A high number of empty
> > polls indicates the current core is not doing any real work, so we can
> > lower the frequency to save power.
> > 
> > In the current implementation, each core has 1 empty-poll counter, which
> > assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
> > future to support multiple queues per core.
> > 
> > 2.1 Power state definition:
> > 
> > 	LOW:  Not currently used, reserved for future use.
> > 
> > 	MED:  the frequency is used to process modest traffic workload.
> > 
> > 	HIGH: the frequency is used to process busy traffic workload.
> > 
> > 2.2 There are two phases to establish the power management system:
> > 
> > 	a.Initialization/Training phase. The training phase is necessary
> > 	  in order to figure out the system polling baseline numbers from
> > 	  idle to busy. The highest poll count will be during idle, where
> > 	  all polls are empty. These poll counts will be different between
> > 	  systems due to the many possible processor micro-arch, cache
> > 	  and device configurations, hence the training phase.
> >   	  In the training phase, traffic is blocked so the training
> 
> When you say 'traffic is blocked' is this something that the application
> can do through DPDK API, or you mean no external packets are sent into
> that port?
Training is disabled by default. If the user enables it, that means no external
packets are sent to the port.

> 
> >   	  algorithm can average the empty-poll numbers for the LOW, MED and
> >  	  HIGH  power states in order to create a baseline.
> > 	  The cores' counters are collected every 10ms, and the training
> > 	  phase will take 2 seconds.
> > 
> > 	b.Normal phase. When the training phase is complete, traffic is
> >   	  started. The run-time poll counts are compared with the
> > 	  baseline and the decision will be taken to move to MED power
> >   	  state or HIGH power state. The counters are calculated every 10ms.
> > 
> > 3. Proposed  API
> > 
> > 1.  rte_power_empty_poll_stat_init(void);
> > which is used to initialize the power management system.
> >  
> > 2.  rte_power_empty_poll_stat_free(void);
> > which is used to free the resources held by the power management system.
> >  
> > 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> > which is used to update specific core empty poll counter, not thread safe
> >  
> > 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > which is used to update specific core valid poll counter, not thread safe
> >  
> 
> is uint8_t enough to cover the max burst size for an rx poll? I didn't check
The most common max burst size is 32. Correct me if I'm wrong.
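
To illustrate the concern, here is a minimal sketch of what happens when a
burst count wider than 8 bits reaches a uint8_t parameter. rte_eth_rx_burst()
returns a uint16_t, so a count of 256 would silently become 0. The names
below (sketch_num_dequeue_pkts, sketch_update_u8, sketch_update_burst) are
hypothetical stand-ins and not part of the patch; the chunking guard is just
one possible caller-side workaround.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one core's valid-poll packet counter. */
static uint64_t sketch_num_dequeue_pkts;

/* Same parameter shape as rte_power_poll_stat_update(): nb_pkt is uint8_t. */
static void
sketch_update_u8(uint8_t nb_pkt)
{
	sketch_num_dequeue_pkts += nb_pkt;
}

/* Caller-side guard: feed a uint16_t burst count in chunks of at most 255. */
static void
sketch_update_burst(uint16_t nb_rx)
{
	while (nb_rx > UINT8_MAX) {
		sketch_update_u8(UINT8_MAX);
		nb_rx -= UINT8_MAX;
	}
	assert(nb_rx <= UINT8_MAX);
	sketch_update_u8((uint8_t)nb_rx);
}
```

With a burst size of 32 the guard never loops, so the cost in the common
case is a single compare.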
> 
> > 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core empty poll counter.
> >  
> > 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> > which is used to get specific core valid poll counter.
> > 
> 
> How about replacing 1-6 with something like below..(not sure what would
> be best prefix)
> 
> rte_power_poll_stat_init(void);
> rte_power_poll_stat_free(void);
> rte_power_poll_stat_update(unsigned int lcore, uint8_t nb_pkts)
> rte_power_poll_stat_fetch(unsigned int lcore, uint8_t stat)
> 
> This would mean combining 3./4. as per previous suggestion so the
> application could just call a single function with nb_pkt (which could
> be 0). It also makes it more extensible if you want to add new stats in
> the future, rather than having to add more functions.
> 
> It doesn't do anything different, so perhaps it's just a matter of
> personal tastes.
We use the empty-poll counter, which is separate from the valid-poll counter.
That's the fundamental difference. I prefer to leave it as is for the moment.
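
For reference, the combined shape suggested above could look roughly like
the sketch below: a single update call where nb_pkts == 0 counts as an empty
poll, and a single fetch call that takes a stat selector. The two counters
stay separate internally. Everything prefixed sketch_/SKETCH_ is hypothetical
and only models the idea; it is not the patch code.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 4 /* hypothetical; kept small for illustration */

enum sketch_stat {
	SKETCH_STAT_EMPTY_POLLS, /* polls that returned no packets */
	SKETCH_STAT_VALID_PKTS,  /* packets returned by non-empty polls */
	SKETCH_STAT_NUM
};

static uint64_t sketch_stats[SKETCH_MAX_LCORE][SKETCH_STAT_NUM];

/* One entry point for both cases; returns 0 on success, -1 on a bad lcore. */
static int
sketch_poll_stat_update(unsigned int lcore, uint8_t nb_pkts)
{
	if (lcore >= SKETCH_MAX_LCORE)
		return -1;
	if (nb_pkts == 0)
		sketch_stats[lcore][SKETCH_STAT_EMPTY_POLLS]++;
	else
		sketch_stats[lcore][SKETCH_STAT_VALID_PKTS] += nb_pkts;
	return 0;
}

/* The return type is unsigned, so bad arguments are asserted, not returned. */
static uint64_t
sketch_poll_stat_fetch(unsigned int lcore, enum sketch_stat stat)
{
	assert(lcore < SKETCH_MAX_LCORE && stat < SKETCH_STAT_NUM);
	return sketch_stats[lcore][stat];
}
```

Adding a new statistic in this shape means adding an enum value rather than
a new pair of functions.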
> 
> > 7.  rte_empty_poll_detection(void);
> > which is used to detect empty poll state changes.
> > 
> 
> s/rte/rte_power/ ?
> 
> I think this needs some better docs/doxygen and maybe name. It seems to
> not just detect state change but take some actions so that should be
> documented, even in a general way.
Agreed, I will update the description.
> 
> thanks,
> Kevin.
> 
> > ChangeLog:
> > v2: fix some coding style issues.
> > v3: rename the filename, API name.
> > v4: no change.
> > v5: no change.
> > v6: re-work the code layout, update API.
> > v7: fix minor typo and lift node num limit.
> > 
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > 
> > Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> > ---
> >  lib/librte_power/Makefile               |   6 +-
> >  lib/librte_power/meson.build            |   5 +-
> >  lib/librte_power/rte_power_empty_poll.c | 500 ++++++++++++++++++++++++++++++++
> >  lib/librte_power/rte_power_empty_poll.h | 205 +++++++++++++
> >  lib/librte_power/rte_power_version.map  |  13 +
> >  5 files changed, 725 insertions(+), 4 deletions(-)
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> > 
> > diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> > index 6f85e88..a8f1301 100644
> > --- a/lib/librte_power/Makefile
> > +++ b/lib/librte_power/Makefile
> > @@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
> >  # library name
> >  LIB = librte_power.a
> >  
> > +CFLAGS += -DALLOW_EXPERIMENTAL_API
> >  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> > -LDLIBS += -lrte_eal
> > +LDLIBS += -lrte_eal -lrte_timer
> >  
> >  EXPORT_MAP := rte_power_version.map
> >  
> > @@ -16,8 +17,9 @@ LIBABIVER := 1
> >  # all source are stored in SRCS-y
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> > +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
> >  
> >  # install this header file
> > -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> > +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
> >  
> >  include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> > index 253173f..63957eb 100644
> > --- a/lib/librte_power/meson.build
> > +++ b/lib/librte_power/meson.build
> > @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
> >  	build = false
> >  endif
> >  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> > -		'power_kvm_vm.c', 'guest_channel.c')
> > -headers = files('rte_power.h')
> > +		'power_kvm_vm.c', 'guest_channel.c',
> > +		'rte_power_empty_poll.c')
> > +headers = files('rte_power.h','rte_power_empty_poll.h')
> > diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
> > new file mode 100644
> > index 0000000..3dac654
> > --- /dev/null
> > +++ b/lib/librte_power/rte_power_empty_poll.c
> > @@ -0,0 +1,500 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2018 Intel Corporation
> > + */
> > +
> > +#include <string.h>
> > +
> > +#include <rte_lcore.h>
> > +#include <rte_cycles.h>
> > +#include <rte_atomic.h>
> > +#include <rte_malloc.h>
> > +
> > +#include "rte_power.h"
> > +#include "rte_power_empty_poll.h"
> > +
> > +#define INTERVALS_PER_SECOND 100     /* (10ms) */
> > +#define SECONDS_TO_TRAIN_FOR 2
> > +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> > +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> > +#define DEFAULT_CYCLES_PER_PACKET 800
> > +
> > +static struct ep_params *ep_params;
> > +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> > +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> > +
> > +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> > +
> > +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> > +
> > +static uint32_t freq_index[NUM_FREQ];
> > +
> > +static uint32_t
> > +get_freq_index(enum freq_val index)
> > +{
> > +	return freq_index[index];
> > +}
> > +
> > +
> > +static int
> > +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> > +{
> > +	int err = 0;
> > +	uint32_t power_freq_index;
> > +	if (!specific_freq)
> > +		power_freq_index = get_freq_index(freq);
> > +	else
> > +		power_freq_index = freq;
> > +
> > +	err = rte_power_set_freq(lcore_id, power_freq_index);
> > +
> > +	return err;
> > +}
> > +
> > +
> > +static inline void __attribute__((always_inline))
> > +exit_training_state(struct priority_worker *poll_stats)
> > +{
> > +	RTE_SET_USED(poll_stats);
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_training_state(struct priority_worker *poll_stats)
> > +{
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->cur_freq = LOW;
> > +	poll_stats->queue_state = TRAINING;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_normal_state(struct priority_worker *poll_stats)
> > +{
> > +	/* Clear the averages arrays and strs */
> > +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> > +	poll_stats->ec = 0;
> > +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> > +	poll_stats->pc = 0;
> > +
> > +	poll_stats->cur_freq = MED;
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->threshold_ctr = 0;
> > +	poll_stats->queue_state = MED_NORMAL;
> > +	set_power_freq(poll_stats->lcore_id, MED, false);
> > +
> > +	/* Set the MED<->HGH transition thresholds */
> > +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> > +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_busy_state(struct priority_worker *poll_stats)
> > +{
> > +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> > +	poll_stats->ec = 0;
> > +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> > +	poll_stats->pc = 0;
> > +
> > +	poll_stats->cur_freq = HGH;
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->threshold_ctr = 0;
> > +	poll_stats->queue_state = HGH_BUSY;
> > +	set_power_freq(poll_stats->lcore_id, HGH, false);
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +enter_purge_state(struct priority_worker *poll_stats)
> > +{
> > +	poll_stats->iter_counter = 0;
> > +	poll_stats->queue_state = LOW_PURGE;
> > +}
> > +
> > +static inline void __attribute__((always_inline))
> > +set_state(struct priority_worker *poll_stats,
> > +		enum queue_state new_state)
> > +{
> > +	enum queue_state old_state = poll_stats->queue_state;
> > +	if (old_state != new_state) {
> > +
> > +		/* Call any old state exit functions */
> > +		if (old_state == TRAINING)
> > +			exit_training_state(poll_stats);
> > +
> > +		/* Call any new state entry functions */
> > +		if (new_state == TRAINING)
> > +			enter_training_state(poll_stats);
> > +		if (new_state == MED_NORMAL)
> > +			enter_normal_state(poll_stats);
> > +		if (new_state == HGH_BUSY)
> > +			enter_busy_state(poll_stats);
> > +		if (new_state == LOW_PURGE)
> > +			enter_purge_state(poll_stats);
> > +	}
> > +}
> > +
> > +
> > +static void
> > +update_training_stats(struct priority_worker *poll_stats,
> > +		uint32_t freq,
> > +		bool specific_freq,
> > +		uint32_t max_train_iter)
> > +{
> > +	RTE_SET_USED(specific_freq);
> > +
> > +	char pfi_str[32];
> > +	uint64_t p0_empty_deq;
> > +
> > +	sprintf(pfi_str, "%02d", freq);
> > +
> > +	if (poll_stats->cur_freq == freq &&
> > +			poll_stats->thresh[freq].trained == false) {
> > +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> > +
> > +			set_power_freq(poll_stats->lcore_id,
> > +					freq, specific_freq);
> > +
> > +			poll_stats->empty_dequeues_prev =
> > +				poll_stats->empty_dequeues;
> > +
> > +			poll_stats->thresh[freq].cur_train_iter++;
> > +
> > +			return;
> > +		} else if (poll_stats->thresh[freq].cur_train_iter
> > +				<= max_train_iter) {
> > +
> > +			p0_empty_deq = poll_stats->empty_dequeues -
> > +				poll_stats->empty_dequeues_prev;
> > +
> > +			poll_stats->empty_dequeues_prev =
> > +				poll_stats->empty_dequeues;
> > +
> > +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> > +			poll_stats->thresh[freq].cur_train_iter++;
> > +
> > +		} else {
> > +			if (poll_stats->thresh[freq].trained == false) {
> > +				poll_stats->thresh[freq].base_edpi =
> > +					poll_stats->thresh[freq].base_edpi /
> > +					max_train_iter;
> > +
> > +				/* Add on a factor of 0.05%, this should remove any */
> > +				/* false negatives when the system is 0% busy */
> > +				poll_stats->thresh[freq].base_edpi +=
> > +					poll_stats->thresh[freq].base_edpi / 2000;
> > +
> > +				poll_stats->thresh[freq].trained = true;
> > +				poll_stats->cur_freq++;
> > +
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +static inline uint32_t __attribute__((always_inline))
> > +update_stats(struct priority_worker *poll_stats)
> > +{
> > +	uint64_t tot_edpi = 0, tot_ppi = 0;
> > +	uint32_t j, percent;
> > +
> > +	struct priority_worker *s = poll_stats;
> > +
> > +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> > +
> > +	s->empty_dequeues_prev = s->empty_dequeues;
> > +
> > +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> > +
> > +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> > +
> > +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> > +		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> > +				"cur edpi %ld "
> > +				"base edpi %ld\n",
> > +				cur_edpi,
> > +				s->thresh[s->cur_freq].base_edpi);
> > +		/* Return > 100 so the caller discards this sample */
> > +		return 1000UL;
> > +	}
> > +
> > +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> > +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> > +
> > +	for (j = 0; j < BINS_AV; j++) {
> > +		tot_edpi += s->edpi_av[j];
> > +		tot_ppi += s->ppi_av[j];
> > +	}
> > +
> > +	tot_edpi = tot_edpi / BINS_AV;
> > +
> > +	percent = 100 - (uint32_t)(((float)tot_edpi /
> > +			(float)s->thresh[s->cur_freq].base_edpi) * 100);
> > +
> > +	return (uint32_t)percent;
> > +}
> > +
> > +
> > +static inline void  __attribute__((always_inline))
> > +update_stats_normal(struct priority_worker *poll_stats)
> > +{
> > +	uint32_t percent;
> > +
> > +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0)
> > +		return;
> > +
> > +	percent = update_stats(poll_stats);
> > +
> > +	if (percent > 100)
> > +		return;
> > +
> > +	if (poll_stats->cur_freq == LOW)
> > +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> > +	else if (poll_stats->cur_freq == MED) {
> > +
> > +		if (percent >
> > +			poll_stats->thresh[MED].threshold_percent) {
> > +
> > +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> > +				poll_stats->threshold_ctr++;
> > +			else {
> > +				set_state(poll_stats, HGH_BUSY);
> > +				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> > +			}
> > +
> > +		} else {
> > +			/* reset */
> > +			/* need debug log */
> > +			poll_stats->threshold_ctr = 0;
> > +		}
> > +
> > +	} else if (poll_stats->cur_freq == HGH) {
> > +
> > +		if (percent <
> > +				poll_stats->thresh[HGH].threshold_percent) {
> > +
> > +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> > +				poll_stats->threshold_ctr++;
> > +			else
> > +				set_state(poll_stats, MED_NORMAL);
> > +		} else
> > +			/* reset */
> > +			/* need debug log */
> > +			poll_stats->threshold_ctr = 0;
> > +
> > +
> > +	}
> > +}
> > +
> > +static int
> > +empty_poll_training(struct priority_worker *poll_stats,
> > +		uint32_t max_train_iter)
> > +{
> > +
> > +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> > +		poll_stats->iter_counter++;
> > +		return 0;
> > +	}
> > +
> > +
> > +	update_training_stats(poll_stats,
> > +			LOW,
> > +			false,
> > +			max_train_iter);
> > +
> > +	update_training_stats(poll_stats,
> > +			MED,
> > +			false,
> > +			max_train_iter);
> > +
> > +	update_training_stats(poll_stats,
> > +			HGH,
> > +			false,
> > +			max_train_iter);
> > +
> > +
> > +	if (poll_stats->thresh[LOW].trained == true
> > +			&& poll_stats->thresh[MED].trained == true
> > +			&& poll_stats->thresh[HGH].trained == true) {
> > +
> > +		set_state(poll_stats, MED_NORMAL);
> > +
> > +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> > +				poll_stats->lcore_id);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +void
> > +rte_empty_poll_detection(struct rte_timer *tim,
> > +		void *arg)
> > +{
> > +
> > +	uint32_t i;
> > +
> > +	struct priority_worker *poll_stats;
> > +
> > +	RTE_SET_USED(tim);
> > +
> > +	RTE_SET_USED(arg);
> > +
> > +	for (i = 0; i < NUM_NODES; i++) {
> > +
> > +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> > +
> > +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> > +			continue;
> > +
> > +		switch (poll_stats->queue_state) {
> > +		case(TRAINING):
> > +			empty_poll_training(poll_stats,
> > +					ep_params->max_train_iter);
> > +			break;
> > +
> > +		case(HGH_BUSY):
> > +		case(MED_NORMAL):
> > +			update_stats_normal(poll_stats);
> > +
> > +			break;
> > +
> > +		case(LOW_PURGE):
> > +			break;
> > +		default:
> > +			break;
> > +
> > +		}
> > +
> > +	}
> > +
> > +}
> > +
> > +int __rte_experimental
> > +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb)
> > +{
> > +	uint32_t i;
> > +	/* Allocate the ep_params structure */
> > +	ep_params = rte_zmalloc_socket(NULL,
> > +			sizeof(struct ep_params),
> > +			0,
> > +			rte_socket_id());
> > +
> > +	if (!ep_params)
> > +		rte_panic("Cannot allocate heap memory for ep_params "
> > +				"for socket %d\n", rte_socket_id());
> > +
> > +	if (freq_tlb == NULL) {
> > +		freq_index[LOW] = 14;
> > +		freq_index[MED] = 9;
> > +		freq_index[HGH] = 1;
> > +	} else {
> > +		freq_index[LOW] = freq_tlb[LOW];
> > +		freq_index[MED] = freq_tlb[MED];
> > +		freq_index[HGH] = freq_tlb[HGH];
> > +	}
> > +
> > +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> > +
> > +	/* 2 seconds worth of training (SECONDS_TO_TRAIN_FOR) */
> > +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> > +
> > +	struct stats_data *w = &ep_params->wrk_data;
> > +
> > +	*eptr = ep_params;
> > +
> > +	/* initialize all wrk_stats state */
> > +	for (i = 0; i < NUM_NODES; i++) {
> > +
> > +		if (rte_lcore_is_enabled(i) == 0)
> > +			continue;
> > +
> > +		set_state(&w->wrk_stats[i], TRAINING);
> > +		/*init the freqs table */
> > +		total_avail_freqs[i] = rte_power_freqs(i,
> > +				avail_freqs[i],
> > +				NUM_FREQS);
> > +
> > +		if (get_freq_index(LOW) > total_avail_freqs[i])
> > +			return -1;
> > +
> > +	}
> > +
> > +
> > +	return 0;
> > +}
> > +
> > +void __rte_experimental
> > +rte_power_empty_poll_stat_free(void)
> > +{
> > +
> > +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> > +
> > +	if (ep_params != NULL)
> > +		rte_free(ep_params);
> > +}
> > +
> > +int __rte_experimental
> > +rte_power_empty_poll_stat_update(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	poll_stats->empty_dequeues++;
> > +
> > +	return 0;
> > +}
> > +
> > +int __rte_experimental
> > +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> > +{
> > +
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	poll_stats->num_dequeue_pkts += nb_pkt;
> > +
> > +	return 0;
> > +}
> > +
> > +
> > +uint64_t __rte_experimental
> > +rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	return poll_stats->empty_dequeues;
> > +}
> > +
> > +uint64_t __rte_experimental
> > +rte_power_poll_stat_fetch(unsigned int lcore_id)
> > +{
> > +	struct priority_worker *poll_stats;
> > +
> > +	if (lcore_id >= NUM_NODES)
> > +		return -1;
> > +
> > +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> > +
> > +	if (poll_stats->lcore_id == 0)
> > +		poll_stats->lcore_id = lcore_id;
> > +
> > +	return poll_stats->num_dequeue_pkts;
> > +}
> > diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
> > new file mode 100644
> > index 0000000..0aca1f0
> > --- /dev/null
> > +++ b/lib/librte_power/rte_power_empty_poll.h
> > @@ -0,0 +1,205 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2010-2018 Intel Corporation
> > + */
> > +
> > +#ifndef _RTE_EMPTY_POLL_H
> > +#define _RTE_EMPTY_POLL_H
> > +
> > +/**
> > + * @file
> > + * RTE Power Management
> > + */
> > +#include <stdint.h>
> > +#include <stdbool.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_byteorder.h>
> > +#include <rte_log.h>
> > +#include <rte_string_fns.h>
> > +#include <rte_power.h>
> > +#include <rte_timer.h>
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#define NUM_FREQS 20
> > +
> > +#define BINS_AV 4 /* Has to be a power of 2 */
> > +
> > +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> > +
> > +#define NUM_PRIORITIES          2
> > +
> > +#define NUM_NODES         256  /* Max core number*/
> > +
> > +/* Processor Power State */
> > +enum freq_val {
> > +	LOW,
> > +	MED,
> > +	HGH,
> > +	NUM_FREQ = NUM_FREQS
> > +};
> > +
> > +
> > +/* Queue Polling State */
> > +enum queue_state {
> > +	TRAINING, /* NO TRAFFIC */
> > +	MED_NORMAL,   /* MED */
> > +	HGH_BUSY,     /* HIGH */
> > +	LOW_PURGE,    /* LOW */
> > +};
> > +
> > +/* Queue Stats */
> > +struct freq_threshold {
> > +
> > +	uint64_t base_edpi;
> > +	bool trained;
> > +	uint32_t threshold_percent;
> > +	uint32_t cur_train_iter;
> > +};
> > +
> > +/* Each Worker Thread Empty Poll Stats */
> > +struct priority_worker {
> > +
> > +	/* Current dequeue and throughput counts */
> > +	/* These 2 are written to by the worker threads */
> > +	/* So keep them on their own cache line */
> > +	uint64_t empty_dequeues;
> > +	uint64_t num_dequeue_pkts;
> > +
> > +	enum queue_state queue_state;
> > +
> > +	uint64_t empty_dequeues_prev;
> > +	uint64_t num_dequeue_pkts_prev;
> > +
> > +	/* Used for training only */
> > +	struct freq_threshold thresh[NUM_FREQ];
> > +	enum freq_val cur_freq;
> > +
> > +	/* bucket arrays to calculate the averages */
> > +	uint64_t edpi_av[BINS_AV];
> > +	uint32_t  ec;
> > +	uint64_t ppi_av[BINS_AV];
> > +	uint32_t  pc;
> > +
> > +	uint32_t lcore_id;
> > +	uint32_t iter_counter;
> > +	uint32_t threshold_ctr;
> > +	uint32_t display_ctr;
> > +	uint8_t  dev_id;
> > +
> > +} __rte_cache_aligned;
> > +
> > +
> > +struct stats_data {
> > +
> > +	struct priority_worker wrk_stats[NUM_NODES];
> > +
> > +	/* flag to stop rx threads processing packets until training over */
> > +	bool start_rx;
> > +
> > +};
> > +
> > +/* Empty Poll Parameters */
> > +struct ep_params {
> > +
> > +	/* Timer related stuff */
> > +	uint64_t interval_ticks;
> > +	uint32_t max_train_iter;
> > +
> > +	struct rte_timer timer0;
> > +	struct stats_data wrk_data;
> > +};
> > +
> > +
> > +/**
> > + * Initialize the power management system.
> > + *
> > + * @param eptr
> > + *   the structure of empty poll configuration
> > + * @param freq_tlb
> > + *   the power state/frequency  mapping table
> > + *
> > + * @return
> > + *  - 0 on success.
> > + *  - Negative on error.
> > + */
> > +int __rte_experimental
> > +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb);
> > +
> > +/**
> > + * Free the resources held by the power management system.
> > + */
> > +void __rte_experimental
> > +rte_power_empty_poll_stat_free(void);
> > +
> > +/**
> > + * Update specific core empty poll counter
> > + * It's not thread safe.
> > + *
> > + * @param lcore_id
> > + *  lcore id
> > + *
> > + * @return
> > + *  - 0 on success.
> > + *  - Negative on error.
> > + */
> > +int __rte_experimental
> > +rte_power_empty_poll_stat_update(unsigned int lcore_id);
> > +
> > +/**
> > + * Update specific core valid poll counter, not thread safe.
> > + *
> > + * @param lcore_id
> > + *  lcore id.
> > + * @param nb_pkt
> > + *  The packet number of one valid poll.
> > + *
> > + * @return
> > + *  - 0 on success.
> > + *  - Negative on error.
> > + */
> > +int __rte_experimental
> > +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> > +
> > +/**
> > + * Fetch specific core empty poll counter.
> > + *
> > + * @param lcore_id
> > + *  lcore id
> > + *
> > + * @return
> > + *  Current lcore empty poll counter value.
> > + */
> > +uint64_t __rte_experimental
> > +rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> > +
> > +/**
> > + * Fetch specific core valid poll counter.
> > + *
> > + * @param lcore_id
> > + *  lcore id
> > + *
> > + * @return
> > + *  Current lcore valid poll counter value.
> > + */
> > +uint64_t __rte_experimental
> > +rte_power_poll_stat_fetch(unsigned int lcore_id);
> > +
> > +/**
> > + * Empty poll  state change detection function
> > + *
> > + * @param  tim
> > + *  The timer structure
> > + * @param  arg
> > + *  The customized parameter
> > + */
> > +void  __rte_experimental
> > +rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif
> > diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> > index dd587df..11ffdfb 100644
> > --- a/lib/librte_power/rte_power_version.map
> > +++ b/lib/librte_power/rte_power_version.map
> > @@ -33,3 +33,16 @@ DPDK_18.08 {
> >  	rte_power_get_capabilities;
> >  
> >  } DPDK_17.11;
> > +
> > +EXPERIMENTAL {
> > +        global:
> > +
> > +        rte_power_empty_poll_stat_init;
> > +        rte_power_empty_poll_stat_free;
> > +        rte_power_empty_poll_stat_update;
> > +        rte_power_empty_poll_stat_fetch;
> > +        rte_power_poll_stat_fetch;
> > +        rte_power_poll_stat_update;
> > +        rte_empty_poll_detection;
> > +
> > +};
> > 
> 

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-13 13:37                   ` Liang, Ma
@ 2018-09-13 14:05                     ` Hunt, David
  0 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-09-13 14:05 UTC (permalink / raw)
  To: Liang, Ma, Kevin Traynor
  Cc: dev, lei.a.yao, radu.nicolau, anatoly.burakov, john.geary



On 13/9/2018 2:37 PM, Liang, Ma wrote:
> Hi Kevin,
>     Many thanks for your comments.
>     I will send v8 patch soon.
>     Please check comments below.
>
> On 13 Sep 11:54, Kevin Traynor wrote:
>> On 09/04/2018 03:10 PM, Liang Ma wrote:

--snip--

>> HIGH: the frequency is used to process busy traffic workload.
>>> 2.2 There are two phases to establish the power management system:
>>>
>>> 	a.Initialization/Training phase. The training phase is necessary
>>> 	  in order to figure out the system polling baseline numbers from
>>> 	  idle to busy. The highest poll count will be during idle, where
>>> 	  all polls are empty. These poll counts will be different between
>>> 	  systems due to the many possible processor micro-arch, cache
>>> 	  and device configurations, hence the training phase.
>>>    	  In the training phase, traffic is blocked so the training
>> When you say 'traffic is blocked' is this something that the application
>> can do through DPDK API, or you mean no external packets are sent into
>> that port?
> Training is disabled by default. If the user enables it, that means no external
> packets are sent to the port.

I think the correct term would be 'traffic should be blocked'. The
application has no capability to block traffic itself, so the user needs
to ensure that no traffic is coming into the application during the
training period.

Rgds,
Dave.

--snip--

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
                                   ` (3 preceding siblings ...)
  2018-09-13 10:54                 ` [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control Kevin Traynor
@ 2018-09-17 13:30                 ` Liang Ma
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
                                     ` (5 more replies)
  4 siblings, 6 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-17 13:30 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, john.geary, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core is matters
for the following reasons:

   * Without it, there is no indication of overload conditions

   * Users do not know how much real load is on a system, resulting in
     wasted energy as no power management is utilized

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core
frequency. As a result, the application does not stop polling the device,
which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the
core frequency (including turbo) to do best effort for intensive traffic.
This gives us more flexible and balanced traffic awareness over the
standard l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many times empty polls are executed.
A low number of empty polls means the current core is busy processing
its workload, so a higher frequency is needed. A high number of empty
polls indicates the current core is not doing any real work, so we can
lower the frequency to save power.

In the current implementation, each core has 1 empty-poll counter, which
assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
future to support multiple queues per core.

2.1 Power state definition:

	LOW:  Not currently used, reserved for future use.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a.Initialization/Training phase. The training phase is necessary
	  in order to figure out the system polling baseline numbers from
	  idle to busy. The highest poll count will be during idle, where
	  all polls are empty. These poll counts will be different between
	  systems due to the many possible processor micro-arch, cache
	  and device configurations, hence the training phase.
	  In the training phase, traffic should be blocked so the training
	  algorithm can average the empty-poll numbers for the LOW, MED and
	  HIGH power states in order to create a baseline.
	  The cores' counters are collected every 10ms, and the training
	  phase will take 2 seconds.
	  Training is disabled in the default configuration, in which case
	  default parameters are applied. The l3fwd-power sample app can
	  still trigger training if needed.

	b.Normal phase. When the training phase is complete, traffic is
	  started. The run-time poll counts are compared with the
	  baseline and the decision will be taken to move to MED power
	  state or HIGH power state. The counters are calculated every 10ms.
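
The baseline calculation of the training phase can be sketched roughly as
follows. train_baseline() and its parameters are hypothetical names used
only for illustration; the 0.05% margin mirrors the one added in
update_training_stats() in this patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Average the per-interval empty-poll deltas collected while the core is
 * idle, then add a 0.05% margin so that an idle core is not later misread
 * as exceeding its own baseline.
 */
static uint64_t
train_baseline(const uint64_t *samples, size_t n)
{
	uint64_t mean = 0;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		mean += samples[i];
	mean /= n;			/* mean empty polls per interval */

	return mean + mean / 2000;	/* plus 0.05% */
}
```

With 10ms intervals and 2 seconds of training, n would be 200 samples per
power state.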

3. Proposed API

1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
		uint8_t *freq_tlb, struct ep_policy *policy);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
which is used to detect empty poll state changes and then take action.
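
The split between the update calls (worker side) and the detection
callback (monitor side) can be illustrated with a standalone sketch. It
does not call the DPDK functions listed above: struct poll_counters,
record_poll() and empty_polls_delta() are hypothetical stand-ins that only
mimic the counting pattern.

```c
#include <stdint.h>

struct poll_counters {
	uint64_t empty_polls;		/* cumulative, written by the worker */
	uint64_t pkts;			/* cumulative, written by the worker */
	uint64_t empty_polls_prev;	/* snapshot, written by the monitor */
};

/* Worker side: call once per RX burst with the number of packets read. */
static void
record_poll(struct poll_counters *c, uint16_t nb_rx)
{
	if (nb_rx == 0)
		c->empty_polls++;	/* empty poll */
	else
		c->pkts += nb_rx;	/* valid poll */
}

/* Monitor side: empty polls accumulated since the previous call. */
static uint64_t
empty_polls_delta(struct poll_counters *c)
{
	uint64_t delta = c->empty_polls - c->empty_polls_prev;

	c->empty_polls_prev = c->empty_polls;
	return delta;
}
```

Each counter has a single writer in this sketch, which matches the "not
thread safe" note on the update calls: only the owning worker should
update its own counters.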

ChangeLog:
v2: fix some coding style issues.
v3: rename the filename, API name.
v4: no change.
v5: no change.
v6: re-work the code layout, update API.
v7: fix minor typo and lift node num limit.
v8: disable training as default option.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 lib/librte_power/Makefile               |   6 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 539 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 219 +++++++++++++
 lib/librte_power/rte_power_version.map  |  13 +
 5 files changed, 778 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..a8f1301 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_power.a
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +17,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..587ce78
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,539 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR 2
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	RTE_LOG(INFO, POWER, "set the power freq to MED\n");
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+static inline void __attribute__((always_inline))
+set_policy(struct priority_worker *poll_stats,
+		struct ep_policy *policy)
+{
+	set_state(poll_stats, policy->state);
+
+	if (policy->state == TRAINING)
+		return;
+
+	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
+	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
+
+	poll_stats->thresh[MED_NORMAL].trained = true;
+	poll_stats->thresh[HGH_BUSY].trained = true;
+
+}
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%*/
+				/* this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
+				"cur edpi %ld "
+				"base epdi %ld\n",
+				cur_edpi,
+				s->thresh[s->cur_freq].base_edpi);
+		/* Value to make us fail need debug log*/
+		return 1000UL;
+	}
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)(((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
+
+		RTE_LOG(DEBUG, POWER, "cur freq is %d, edpi is %lu\n",
+				poll_stats->cur_freq,
+				poll_stats->thresh[poll_stats->cur_freq].base_edpi);
+		return;
+	}
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100) {
+		RTE_LOG(DEBUG, POWER, "Bigger than 100, abnormal\n");
+		return;
+	}
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[MED].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, HGH_BUSY);
+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
+			}
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+				poll_stats->thresh[HGH].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, MED_NORMAL);
+				RTE_LOG(INFO, POWER, "MOVE to MED\n");
+			}
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	}
+}
+
+static int
+empty_poll_training(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Low threshold is %lu\n",
+				poll_stats->thresh[LOW].base_edpi);
+
+		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
+				poll_stats->thresh[MED].base_edpi);
+
+
+		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
+				poll_stats->thresh[HGH].base_edpi);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+void
+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_training(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	if (freq_tlb == NULL) {
+		freq_index[LOW] = 14;
+		freq_index[MED] = 9;
+		freq_index[HGH] = 1;
+	} else {
+		freq_index[LOW] = freq_tlb[LOW];
+		freq_index[MED] = freq_tlb[MED];
+		freq_index[HGH] = freq_tlb[HGH];
+	}
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* SECONDS_TO_TRAIN_FOR (2) seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	*eptr = ep_params;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
+				total_avail_freqs[i],
+				i);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+		if (rte_get_master_lcore() != i) {
+			w->wrk_stats[i].lcore_id = i;
+			set_policy(&w->wrk_stats[i], policy);
+		}
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..ae27f7d
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,219 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* Has to be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         256  /* Max core number*/
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Each Worker Thread Empty Poll Stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/* Sample App Init information */
+struct ep_policy {
+
+	uint64_t med_base_edpi;
+	uint64_t hgh_base_edpi;
+
+	enum queue_state state;
+};
+
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param eptr
+ *   the structure of empty poll configuration
+ * @param freq_tlb
+ *   the power state/frequency mapping table
+ * @param policy
+ *   the initialization policy from sample app
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The packet number of one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Empty poll  state change detection function
+ *
+ * @param  tim
+ *  The timer structure
+ * @param  arg
+ *  The customized parameter
+ */
+void  __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index dd587df..11ffdfb 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -33,3 +33,16 @@ DPDK_18.08 {
 	rte_power_get_capabilities;
 
 } DPDK_17.11;
+
+EXPERIMENTAL {
+        global:
+
+        rte_power_empty_poll_stat_init;
+        rte_power_empty_poll_stat_free;
+        rte_power_empty_poll_stat_update;
+        rte_power_empty_poll_stat_fetch;
+        rte_power_poll_stat_fetch;
+        rte_power_poll_stat_update;
+        rte_empty_poll_detection;
+
+};
-- 
2.7.5

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
@ 2018-09-17 13:30                   ` Liang Ma
  2018-09-28 11:19                     ` Hunt, David
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API Liang Ma
                                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-09-17 13:30 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, john.geary, Liang Ma

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for all parameters except
empty-poll.

The options "l", "m" and "h" set the power index for the LOW, MED and
HIGH power states. They are only useful after empty-poll is enabled.

--empty-poll="training_flag, med_threshold, high_threshold"

The option training_flag is used to enable/disable training mode.

The option med_threshold sets the user-customized empty poll threshold
of the modest state.

The option high_threshold sets the user-customized empty poll threshold
of the busy state.

The default value of all three options is 0.
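
For illustration, the three comma-separated fields of --empty-poll could
be parsed as in the standalone sketch below. It uses only the C standard
library rather than rte_strsplit(); struct ep_cli and parse_empty_poll()
are hypothetical names, not part of this patch.

```c
#include <stdbool.h>
#include <stdlib.h>

struct ep_cli {
	bool train;		/* training_flag == 1 */
	unsigned long med;	/* med_threshold, 0 means "keep default" */
	unsigned long high;	/* high_threshold, 0 means "keep default" */
};

/* Parse "flag,med,high". Returns 0 on success, -1 on malformed input. */
static int
parse_empty_poll(const char *arg, struct ep_cli *out)
{
	unsigned long v[3];
	char *end;
	int i;

	for (i = 0; i < 3; i++) {
		v[i] = strtoul(arg, &end, 0);
		if (end == arg)
			return -1;	/* no digits in this field */
		if (i < 2) {
			if (*end != ',')
				return -1;	/* missing separator */
			arg = end + 1;
		} else if (*end != '\0') {
			return -1;	/* trailing garbage */
		}
	}

	out->train = (v[0] == 1);
	out->med = v[1];
	out->high = v[2];
	return 0;
}
```

A caller would treat a zero threshold as "use the built-in default", which
matches the all-zero default described above.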

Once empty-poll is enabled, the system applies the default parameters.
Training mode is disabled by default.

If training mode is triggered, there should not be any traffic passing
through during the training phase.
When the training phase completes, the system transitions to the normal
phase.

The system starts in the modest power state.
If the system busyness percentage rises above 70%, the system moves to
the HIGH power state. If the traffic becomes lower (e.g. the system
busyness percentage drops below 30%), the system falls back to the
modest power state.
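
The MED/HIGH switching described above behaves like a small hysteresis
state machine. The sketch below is a standalone illustration with
hypothetical names (struct governor, governor_step()). The 70%/30%
thresholds come from the text; the requirement that the condition holds
for 100 consecutive 10ms intervals before switching follows the
threshold_ctr check against INTERVALS_PER_SECOND in patch 1/4.

```c
#include <stdint.h>

enum pstate { PSTATE_MED, PSTATE_HIGH };

struct governor {
	enum pstate state;
	uint32_t streak;	/* consecutive intervals past the threshold */
};

#define MED_TO_HIGH_PCT	70
#define HIGH_TO_MED_PCT	30
#define HOLD_INTERVALS	100	/* 100 x 10ms = 1 second */

/* Feed one interval's busyness percentage; returns the resulting state. */
static enum pstate
governor_step(struct governor *g, uint32_t busy_pct)
{
	int past = (g->state == PSTATE_MED) ?
		(busy_pct > MED_TO_HIGH_PCT) : (busy_pct < HIGH_TO_MED_PCT);

	if (!past) {
		g->streak = 0;		/* condition broken, start over */
	} else if (g->streak < HOLD_INTERVALS) {
		g->streak++;		/* condition holds, keep counting */
	} else {
		g->state = (g->state == PSTATE_MED) ?
			PSTATE_HIGH : PSTATE_MED;
		g->streak = 0;
	}

	return g->state;
}
```

Because the two thresholds differ and the condition must be sustained, a
load hovering around one threshold does not cause the frequency to flap on
every interval.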

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.
v7 no change.
v8 disable training as default option.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 examples/l3fwd-power/Makefile    |   3 +
 examples/l3fwd-power/main.c      | 325 +++++++++++++++++++++++++++++++++++++--
 examples/l3fwd-power/meson.build |   1 +
 3 files changed, 312 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
index d7e39a3..772ec7b 100644
--- a/examples/l3fwd-power/Makefile
+++ b/examples/l3fwd-power/Makefile
@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
 LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
 LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
 	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
 
@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
 all:
 else
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 68527d2..1465608 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_power_empty_poll.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -55,6 +56,8 @@
 
 /* 100 ms interval */
 #define TIMER_NUMBER_PER_SECOND           10
+/* (10ms) */
+#define INTERVALS_PER_SECOND             100
 /* 100000 us */
 #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
 #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
@@ -117,6 +120,11 @@
  */
 #define RTE_TEST_RX_DESC_DEFAULT 1024
 #define RTE_TEST_TX_DESC_DEFAULT 1024
+#define EMPTY_POLL_MED_THRESHOLD 350000UL
+#define EMPTY_POLL_HGH_THRESHOLD 580000UL
+
+
+
 static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
 static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
 
@@ -132,6 +140,14 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+static bool empty_poll_train;
+volatile bool empty_poll_stop;
+static struct  ep_params *ep_params;
+static struct  ep_policy policy;
+static long  ep_med_edpi, ep_hgh_edpi;
+
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -330,6 +346,13 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static uint8_t  freq_tlb[] = {14, 9, 1};
+
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
+
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -338,7 +361,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -351,16 +382,19 @@ signal_exit_now(int sigtype)
 							"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -825,7 +859,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
 
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 /* main processing loop */
 static int
 main_loop(__attribute__((unused)) void *dummy)
@@ -1127,7 +1261,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1220,7 +1355,55 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+static int
+parse_ep_config(const char *q_arg)
+{
+	char s[256];
+	const char *p = q_arg;
+	char *end;
+	int  num_arg;
+
+	char *str_fld[3];
+
+	int training_flag;
+	int med_edpi;
+	int hgh_edpi;
+
+	ep_med_edpi = EMPTY_POLL_MED_THRESHOLD;
+	ep_hgh_edpi = EMPTY_POLL_HGH_THRESHOLD;
+
+	snprintf(s, sizeof(s), "%s", p);
+
+	num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ',');
+
+	empty_poll_train = false;
+
+	if (num_arg == 0)
+		return 0;
 
+	if (num_arg == 3) {
+
+		training_flag = strtoul(str_fld[0], &end, 0);
+		med_edpi = strtoul(str_fld[1], &end, 0);
+		hgh_edpi = strtoul(str_fld[2], &end, 0);
+
+		if (training_flag == 1)
+			empty_poll_train = true;
+
+		if (med_edpi > 0)
+			ep_med_edpi = med_edpi;
+
+		if (hgh_edpi > 0)
+			ep_hgh_edpi = hgh_edpi;
+
+	} else {
+
+		return -1;
+	}
+
+	return 0;
+
+}
 #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
 
 /* Parse the argument given in the command line of the application */
@@ -1230,6 +1413,7 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
@@ -1237,13 +1421,14 @@ parse_args(int argc, char **argv)
 		{"high-perf-cores", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
+		{"empty-poll", 1, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
 	argvopt = argv;
 
-	while ((opt = getopt_long(argc, argvopt, "p:P",
+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
 				lgopts, &option_index)) != EOF) {
 
 		switch (opt) {
@@ -1260,7 +1445,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[LOW] = limit;
+			break;
+		case 'm':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[MED] = limit;
+			break;
+		case 'h':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[HGH] = limit;
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1299,6 +1495,20 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+				ret = parse_ep_config(optarg);
+
+				if (ret) {
+					printf("invalid empty poll config\n");
+					print_usage(prgname);
+					return -1;
+				}
+
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1646,6 +1856,59 @@ init_power_library(void)
 	}
 	return ret;
 }
+static void
+empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			rte_empty_poll_detection,
+			(void *)ep_ptr);
+
+}
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 
 int
 main(int argc, char **argv)
@@ -1828,13 +2091,15 @@ main(int argc, char **argv)
 		if (rte_lcore_is_enabled(lcore_id) == 0)
 			continue;
 
-		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			/* init timer structures for each enabled lcore */
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND,
+					SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
@@ -1905,12 +2170,38 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true) {
+
+		if (empty_poll_train) {
+			policy.state = TRAINING;
+		} else {
+			policy.state = MED_NORMAL;
+			policy.med_base_edpi = ep_med_edpi;
+			policy.hgh_base_edpi = ep_hgh_edpi;
+		}
+
+		rte_power_empty_poll_stat_init(&ep_params, freq_tlb, &policy);
+	}
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
index 20c8054..a3c5c2f 100644
--- a/examples/l3fwd-power/meson.build
+++ b/examples/l3fwd-power/meson.build
@@ -9,6 +9,7 @@
 if host_machine.system() != 'linux'
 	build = false
 endif
+allow_experimental_apis = true
 deps += ['power', 'timer', 'lpm', 'hash']
 sources = files(
 	'main.c', 'perf_core.c'
-- 
2.7.5


* [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-09-17 13:30                   ` Liang Ma
  2018-09-25 12:31                     ` Kovacevic, Marko
                                       ` (2 more replies)
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
                                     ` (3 subsequent siblings)
  5 siblings, 3 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-17 13:30 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, john.geary, Liang Ma

Update the document for empty poll API.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 90 +++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index eba1cc6..056cb12 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -106,6 +106,96 @@ User Cases
 
 The power management mechanism is used to save power when performing L3 forwarding.
 
+
+Empty Poll API
+--------------
+
+Abstract
+~~~~~~~~
+
+For packet processing workloads such as DPDK, polling is continuous.
+This means CPU cores always show 100% busy independent of how much work
+those cores are doing. Accurately determining how busy a core really is
+matters, because without that information:
+
+        * There is no indication of overload conditions
+        * The user does not know how much real load is on a system,
+          resulting in wasted energy as no power management is utilized
+
+Compared to the original l3fwd-power design, instead of going to sleep
+after detecting an empty poll, the new mechanism just lowers the core frequency.
+As a result, the application does not stop polling the device, which leads
+to improved handling of bursts of traffic.
+
+When the system becomes busy, the empty poll mechanism can also increase the core
+frequency (including turbo) to do best effort for intensive traffic. This gives
+us more flexible and balanced traffic awareness than the standard l3fwd-power
+application.
+
+
+Proposed Solution
+~~~~~~~~~~~~~~~~~
+The proposed solution focuses on how many times empty polls are executed.
+A low number of empty polls means the current core is busy processing
+its workload, so a higher frequency is needed. A high number of empty polls
+indicates the current core is not doing any real work, so we can lower the
+frequency to save power.
+
+In the current implementation, each core has 1 empty-poll counter, which assumes
+1 core is dedicated to 1 queue. This will need to be expanded in the future to
+support multiple queues per core.
+
+Power state definition:
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* LOW:  Not currently used, reserved for future use.
+
+* MED:  the frequency is used to process modest traffic workload.
+
+* HIGH: the frequency is used to process busy traffic workload.
+
+There are two phases to establish the power management system:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* Initialization/Training phase. The training phase is necessary
+  in order to figure out the system polling baseline numbers from
+  idle to busy. The highest poll count will be during idle, where all
+  polls are empty. These poll counts will be different between
+  systems due to the many possible processor micro-arch, cache
+  and device configurations, hence the training phase.
+  In the training phase, traffic is blocked so the training algorithm
+  can average the empty-poll numbers for the LOW, MED and
+  HIGH  power states in order to create a baseline.
+  Each core's counters are collected every 10ms, and the training
+  phase takes 2 seconds.
+  Training is disabled in the default configuration, in which case
+  the default parameters are applied. The sample application can
+  still trigger training if needed.
+
+* Normal phase. When the training phase is complete, traffic is
+  started. The run-time poll counts are compared with the
+  baseline and the decision will be taken to move to MED power
+  state or HIGH power state. The counters are calculated every
+  10ms.
+
+
+API Overview for Empty Poll Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **State Init**: initialize the power management system.
+
+* **State Free**: free the resources held by the power management system.
+
+* **Update Empty Poll Counter**: update the empty poll counter.
+
+* **Update Valid Poll Counter**: update the valid poll counter.
+
+* **Set the Frequency Index**: update the power state/frequency mapping.
+
+* **Detect empty poll state change**: run the empty poll state change detection algorithm, then take action.
+
+User Cases
+----------
+The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
+
 References
 ----------
 
-- 
2.7.5


* [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API Liang Ma
@ 2018-09-17 13:30                   ` Liang Ma
  2018-09-25 13:20                     ` Kovacevic, Marko
  2018-09-28 10:47                   ` [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control Hunt, David
                                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-09-17 13:30 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, john.geary, Liang Ma

Add empty poll mode command line example

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/sample_app_ug/l3_forward_power_man.rst | 29 +++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 795a570..ca9c70a 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -362,3 +362,32 @@ The algorithm has the following sleeping behavior depending on the idle counter:
 If a thread polls multiple Rx queues and different queue returns different sleep duration values,
 the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
 in order to avoid a potential performance impact.
+
+Empty Poll Mode
+-------------------------
+Empty poll mode is a recently added mode. It can be enabled with the
+``--empty-poll`` command line option.
+
+See "Power Management" chapter in the DPDK Programmer's Guide for empty poll mode details.
+
+.. code-block:: console
+
+    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
+
+Where,
+
+--empty-poll: Enable the empty poll mode instead of original algorithm
+
+--empty-poll="training_flag, med_threshold, high_threshold"
+
+* training_flag : optional, enable/disable training mode. Default value is 0.
+
+* med_threshold : optional, user-defined empty poll threshold for the modest traffic (MED) state. Default value is 0.
+
+* high_threshold : optional, user-defined empty poll threshold for the busy traffic (HIGH) state. Default value is 0.
+
+* -l : optional, set up the LOW power state frequency index
+
+* -m : optional, set up the MED power state frequency index
+
+* -h : optional, set up the HIGH power state frequency index
-- 
2.7.5


* Re: [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API Liang Ma
@ 2018-09-25 12:31                     ` Kovacevic, Marko
  2018-09-25 12:44                     ` Kovacevic, Marko
  2018-09-28 12:30                     ` Hunt, David
  2 siblings, 0 replies; 79+ messages in thread
From: Kovacevic, Marko @ 2018-09-25 12:31 UTC (permalink / raw)
  To: Ma, Liang J, Hunt, David
  Cc: dev, Yao, Lei A, ktraynor, Geary, John, Ma, Liang J

> Update the document for empty poll API.
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  doc/guides/prog_guide/power_man.rst | 90
> +++++++++++++++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/doc/guides/prog_guide/power_man.rst
> b/doc/guides/prog_guide/power_man.rst
> index eba1cc6..056cb12 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -106,6 +106,96 @@ User Cases
> 
>  The power management mechanism is used to save power when
> performing L3 forwarding.
> 
> +
> +Empty Poll API
> +--------------
> +
> +Abstract
> +~~~~~~~~
> +
> +For packet processing workloads such as DPDK polling is continuous.
> +This means CPU cores always show 100% busy independent of how much
> work
> +those cores are doing. It is critical to accurately determine how busy
> +a core is hugely important for the following reasons:
> +
> +        * No indication of overload conditions
> +        * User do not know how much real load is on a system meaning
> +          resulted in wasted energy as no power management is utilized

User does not know how much real load is on a resulting in 
wasted energy as no power management is utilized


> +API Overview for Empty Poll Power Management
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +* **State Init**: initialize the power management system.
> +
> +* **State Free**: free the resource hold by power management system.
> +
> +* **Update Empty Poll Counter**: update the empty poll counter.
> +
> +* **Update Valid Poll Counter**: update the valid poll counter.
> +
> +* **Set the Fequence Index**: update the power state/frequency
> mapping.

Spelling above: Fequence/ Frequency 


* Re: [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API Liang Ma
  2018-09-25 12:31                     ` Kovacevic, Marko
@ 2018-09-25 12:44                     ` Kovacevic, Marko
  2018-09-28 12:30                     ` Hunt, David
  2 siblings, 0 replies; 79+ messages in thread
From: Kovacevic, Marko @ 2018-09-25 12:44 UTC (permalink / raw)
  To: Ma, Liang J, Hunt, David
  Cc: dev, Yao, Lei A, ktraynor, Geary, John, Ma, Liang J

> > Update the document for empty poll API.
> >
> > Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> > ---
> >  doc/guides/prog_guide/power_man.rst | 90
> > +++++++++++++++++++++++++++++++++++++
> >  1 file changed, 90 insertions(+)
> >
> > diff --git a/doc/guides/prog_guide/power_man.rst
> > b/doc/guides/prog_guide/power_man.rst
> > index eba1cc6..056cb12 100644
Sending this email to correct a mistake I made in my own fix.

> > +For packet processing workloads such as DPDK polling is continuous.
> > +This means CPU cores always show 100% busy independent of how much
> > work
> > +those cores are doing. It is critical to accurately determine how
> > +busy a core is hugely important for the following reasons:
> > +
> > +        * No indication of overload conditions
> > +        * User do not know how much real load is on a system meaning
> > +          resulted in wasted energy as no power management is
> > + utilized
> 
> User does not know how much real load is on a resulting in wasted energy as
> no power management is utilized

* User does not know how much real load is on a system,
    resulting in wasted energy as no power management is utilized

Thanks, 
Marko K


* Re: [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
@ 2018-09-25 13:20                     ` Kovacevic, Marko
  2018-09-28 12:43                       ` Hunt, David
  0 siblings, 1 reply; 79+ messages in thread
From: Kovacevic, Marko @ 2018-09-25 13:20 UTC (permalink / raw)
  To: Ma, Liang J, Hunt, David
  Cc: dev, Yao, Lei A, ktraynor, Geary, John, Ma, Liang J

> Add empty poll mode command line example
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  doc/guides/sample_app_ug/l3_forward_power_man.rst | 29
> +++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> +Empty Poll Mode
> +-------------------------
> +There is a new Mode which is added recently. Empty poll mode can be
> +enabled by command option --empty-poll.
> +
> +See "Power Management" chapter in the DPDK Programmer's Guide for
> empty poll mode details.

Can you embed the link to the Power Management chapter
:doc:`Power Management<../prog_guide/power_man>`


> +.. code-block:: console
> +
> +    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --
> config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
> +
> +Where,
> +
> +--empty-poll: Enable the empty poll mode instead of original algorithm
> +
> +--empty-poll="training_flag, med_threshold, high_threshold"
> +
> +* training_flag : optional, enable/disable training mode. Default value is 0.
> +
> +* med_threshold : optional, indicate the empty poll threshold of modest
> state which is customized by user. Default value is 0.
> +
> +* high_threshold : optional, indicate the empty poll threshold of busy state
> which is customized by user. Default value is 0.
> +
> +* -l : optional, set up the LOW power state frequency index
> +
> +* -m : optional, set up the MED power state frequency index
> +
> +* -h : optional, set up the HIGH power state frequency index

I think this section overall needs a lot more explanation, e.g. what the valid training flags are and how to get the thresholds, etc.

Also, you could highlight the commands, as it looks better:      ``training_flag``

Thanks,
Marko K


* Re: [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
                                     ` (2 preceding siblings ...)
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
@ 2018-09-28 10:47                   ` Hunt, David
  2018-10-02 10:13                     ` Liang, Ma
  2018-09-28 14:58                   ` [dpdk-dev] [PATCH v9 " Liang Ma
  2018-10-01 10:06                   ` [dpdk-dev] [PATCH v9 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
  5 siblings, 1 reply; 79+ messages in thread
From: Hunt, David @ 2018-09-28 10:47 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, lei.a.yao, ktraynor, john.geary

Hi Liang,


On 17/9/2018 2:30 PM, Liang Ma wrote:
> 1. Abstract
>
> For packet processing workloads such as DPDK polling is continuous.
> This means CPU cores always show 100% busy independent of how much work
> those cores are doing. It is critical to accurately determine how busy
> a core is hugely important for the following reasons:
>
>     * No indication of overload conditions
>
>     * User do not know how much real load is on a system meaning resulted in
>       wasted energy as no power management is utilized
>
> Compared to the original l3fwd-power design, instead of going to sleep
> after detecting an empty poll, the new mechanism just lowers the core
> frequency. As a result, the application does not stop polling the device,
> which leads to improved handling of bursts of traffic.
>
> When the system become busy, the empty poll mechanism can also increase the
> core frequency (including turbo) to do best effort for intensive traffic.
> This gives us more flexible and balanced traffic awareness over the
> standard l3fwd-power application.
>
> 2. Proposed solution
>
> The proposed solution focuses on how many times empty polls are executed.
> The less the number of empty polls, means current core is busy with
> processing workload, therefore, the higher frequency is needed. The high
> empty poll number indicates the current core not doing any real work
> therefore, we can lower the frequency to safe power.
>
> In the current implementation, each core has 1 empty-poll counter which
> assume 1 core is dedicated to 1 queue. This will need to be expanded in the
> future to support multiple queues per core.
>
> 2.1 Power state definition:
>
> 	LOW:  Not currently used, reserved for future use.
>
> 	MED:  the frequency is used to process modest traffic workload.
>
> 	HIGH: the frequency is used to process busy traffic workload.
>
> 2.2 There are two phases to establish the power management system:
>
> 	a.Initialization/Training phase. The training phase is necessary
> 	  in order to figure out the system polling baseline numbers from
> 	  idle to busy. The highest poll count will be during idle, where
> 	  all polls are empty. These poll counts will be different between
> 	  systems due to the many possible processor micro-arch, cache
> 	  and device configurations, hence the training phase.
>    	  In the training phase, traffic is blocked so the training
>    	  algorithm can average the empty-poll numbers for the LOW, MED and
>   	  HIGH  power states in order to create a baseline.
>    	  The core's counter are collected every 10ms, and the Training
>   	  phase will take 2 seconds.
>   	  Training is disabled as default configuration. the default
>   	  parameter is applied. Simple App still can trigger training

Typo: "Simple" should be "Sample"

Suggest adding: Once the training phase has been executed once on a
system, the application can then be started with the relevant thresholds
provided on the command line, allowing the application to start passing
traffic immediately.

>   	  if that's needed.
>
> 	b.Normal phase. When the training phase is complete, traffic is
>    	  started. The run-time poll counts are compared with the
> 	  baseline and the decision will be taken to move to MED power
>    	  state or HIGH power state. The counters are calculated every 10ms.

Propose changing the first sentence:  Traffic starts immediately based
on the default thresholds, or based on the user supplied thresholds via
the command line parameters.




> 3. Proposed  API
>
> 1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
> 		uint8_t *freq_tlb, struct ep_policy *policy);
> which is used to initialize the power management system.
>   
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resource hold by power management system.
>   
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>   
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>   
> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>   
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
>
> 7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> which is used to detect empty poll state changes then take action.
>
> ChangeLog:
> v2: fix some coding style issues.
> v3: rename the filename, API name.
> v4: no change.
> v5: no change.
> v6: re-work the code layout, update API.
> v7: fix minor typo and lift node num limit.
> v8: disable training as default option.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>
> Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> ---
>   lib/librte_power/Makefile               |   6 +-
>   lib/librte_power/meson.build            |   5 +-
>   lib/librte_power/rte_power_empty_poll.c | 539 ++++++++++++++++++++++++++++++++
>   lib/librte_power/rte_power_empty_poll.h | 219 +++++++++++++
>   lib/librte_power/rte_power_version.map  |  13 +
>   5 files changed, 778 insertions(+), 4 deletions(-)
>   create mode 100644 lib/librte_power/rte_power_empty_poll.c
>   create mode 100644 lib/librte_power/rte_power_empty_poll.h
>
> diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> index 6f85e88..a8f1301 100644
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
>   # library name
>   LIB = librte_power.a
>   
> +CFLAGS += -DALLOW_EXPERIMENTAL_API
>   CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> -LDLIBS += -lrte_eal
> +LDLIBS += -lrte_eal -lrte_timer
>   
>   EXPORT_MAP := rte_power_version.map
>   
> @@ -16,8 +17,9 @@ LIBABIVER := 1
>   # all source are stored in SRCS-y
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
>   SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> +SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
>   
>   # install this header file
> -SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> +SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
>   
>   include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> index 253173f..63957eb 100644
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
>   	build = false
>   endif
>   sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> -		'power_kvm_vm.c', 'guest_channel.c')
> -headers = files('rte_power.h')
> +		'power_kvm_vm.c', 'guest_channel.c',
> +		'rte_power_empty_poll.c')
> +headers = files('rte_power.h','rte_power_empty_poll.h')
> diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
> new file mode 100644
> index 0000000..587ce78
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.c
> @@ -0,0 +1,539 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#include <string.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +#include <rte_atomic.h>
> +#include <rte_malloc.h>
> +
> +#include "rte_power.h"
> +#include "rte_power_empty_poll.h"
> +
> +#define INTERVALS_PER_SECOND 100     /* (10ms) */
> +#define SECONDS_TO_TRAIN_FOR 2
> +#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> +#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> +#define DEFAULT_CYCLES_PER_PACKET 800
> +
> +static struct ep_params *ep_params;
> +static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> +static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> +
> +static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> +
> +static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> +
> +static uint32_t freq_index[NUM_FREQ];
> +
> +static uint32_t
> +get_freq_index(enum freq_val index)
> +{
> +	return freq_index[index];
> +}
> +
> +
> +static int
> +set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> +{
> +	int err = 0;
> +	uint32_t power_freq_index;
> +	if (!specific_freq)
> +		power_freq_index = get_freq_index(freq);
> +	else
> +		power_freq_index = freq;
> +
> +	err = rte_power_set_freq(lcore_id, power_freq_index);
> +
> +	return err;
> +}
> +
> +
> +static inline void __attribute__((always_inline))
> +exit_training_state(struct priority_worker *poll_stats)
> +{
> +	RTE_SET_USED(poll_stats);
> +}
> +

Is this really needed? It does nothing, and is just local to this file.


> +static inline void __attribute__((always_inline))
> +enter_training_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->cur_freq = LOW;
> +	poll_stats->queue_state = TRAINING;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_normal_state(struct priority_worker *poll_stats)
> +{
> +	/* Clear the averages arrays and strs */
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = MED;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = MED_NORMAL;
> +	RTE_LOG(INFO, POWER, "set the poewr freq to MED\n");

Typo, "poewr" should be "power", also suggest "Set" rather than "set"


> +	set_power_freq(poll_stats->lcore_id, MED, false);
> +
> +	/* Try here */

Not sure about this comment?

> +	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> +	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_busy_state(struct priority_worker *poll_stats)
> +{
> +	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> +	poll_stats->ec = 0;
> +	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> +	poll_stats->pc = 0;
> +
> +	poll_stats->cur_freq = HGH;
> +	poll_stats->iter_counter = 0;
> +	poll_stats->threshold_ctr = 0;
> +	poll_stats->queue_state = HGH_BUSY;
> +	set_power_freq(poll_stats->lcore_id, HGH, false);
> +}
> +
> +static inline void __attribute__((always_inline))
> +enter_purge_state(struct priority_worker *poll_stats)
> +{
> +	poll_stats->iter_counter = 0;
> +	poll_stats->queue_state = LOW_PURGE;
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_state(struct priority_worker *poll_stats,
> +		enum queue_state new_state)
> +{
> +	enum queue_state old_state = poll_stats->queue_state;
> +	if (old_state != new_state) {
> +
> +		/* Call any old state exit functions */
> +		if (old_state == TRAINING)
> +			exit_training_state(poll_stats);

Is this needed? exit_training_state() does nothing.

> +		/* Call any new state entry functions */
> +		if (new_state == TRAINING)
> +			enter_training_state(poll_stats);
> +		if (new_state == MED_NORMAL)
> +			enter_normal_state(poll_stats);
> +		if (new_state == HGH_BUSY)
> +			enter_busy_state(poll_stats);
> +		if (new_state == LOW_PURGE)
> +			enter_purge_state(poll_stats);
> +	}
> +}
> +
> +static inline void __attribute__((always_inline))
> +set_policy(struct priority_worker *poll_stats,
> +		struct ep_policy *policy)
> +{
> +	set_state(poll_stats, policy->state);
> +
> +	if (policy->state == TRAINING)
> +		return;
> +
> +	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
> +	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
> +
> +	poll_stats->thresh[MED_NORMAL].trained = true;
> +	poll_stats->thresh[HGH_BUSY].trained = true;
> +
> +}
> +
> +static void
> +update_training_stats(struct priority_worker *poll_stats,
> +		uint32_t freq,
> +		bool specific_freq,
> +		uint32_t max_train_iter)
> +{
> +	RTE_SET_USED(specific_freq);
> +
> +	char pfi_str[32];
> +	uint64_t p0_empty_deq;
> +
> +	sprintf(pfi_str, "%02d", freq);
> +
> +	if (poll_stats->cur_freq == freq &&
> +			poll_stats->thresh[freq].trained == false) {
> +		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> +
> +			set_power_freq(poll_stats->lcore_id,
> +					freq, specific_freq);
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +			return;
> +		} else if (poll_stats->thresh[freq].cur_train_iter
> +				<= max_train_iter) {
> +
> +			p0_empty_deq = poll_stats->empty_dequeues -
> +				poll_stats->empty_dequeues_prev;
> +
> +			poll_stats->empty_dequeues_prev =
> +				poll_stats->empty_dequeues;
> +
> +			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> +			poll_stats->thresh[freq].cur_train_iter++;
> +
> +		} else {
> +			if (poll_stats->thresh[freq].trained == false) {
> +				poll_stats->thresh[freq].base_edpi =
> +					poll_stats->thresh[freq].base_edpi /
> +					max_train_iter;
> +
> +				/* Add on a factor of 0.05%*/
> +				/* this should remove any */
> +				/* false negatives when the system is 0% busy */

Multi-line comments should follow the usual standard:
/* \n * text \n * text \n */
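
For illustration, the three single-line comments quoted above could be folded into one block in the usual style. This is only a sketch: add_training_margin() is a hypothetical helper standing in for the arithmetic that the patch does inline in update_training_stats().

```c
#include <stdint.h>

/*
 * Add a margin of 0.05% to the trained baseline. This should remove
 * false negatives when the system is 0% busy.
 *
 * Hypothetical helper, for illustration only: the patch does this
 * arithmetic inline in update_training_stats().
 */
static inline uint64_t
add_training_margin(uint64_t base_edpi)
{
	return base_edpi + base_edpi / 2000;
}
```

Dividing by 2000 is what gives the 0.05% (1/2000) margin, so the comment and the code agree.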

> +				poll_stats->thresh[freq].base_edpi +=
> +					poll_stats->thresh[freq].base_edpi / 2000;
> +
> +				poll_stats->thresh[freq].trained = true;
> +				poll_stats->cur_freq++;
> +
> +			}
> +		}
> +	}
> +}
> +
> +static inline uint32_t __attribute__((always_inline))
> +update_stats(struct priority_worker *poll_stats)
> +{
> +	uint64_t tot_edpi = 0, tot_ppi = 0;
> +	uint32_t j, percent;
> +
> +	struct priority_worker *s = poll_stats;
> +
> +	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> +
> +	s->empty_dequeues_prev = s->empty_dequeues;
> +
> +	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> +
> +	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> +
> +	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> +		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> +				"cur edpi %ld "
> +				"base epdi %ld\n",
> +				cur_edpi,
> +				s->thresh[s->cur_freq].base_edpi);

Suggest making this log message more meaningful to the user. I suspect 
that "cur_edpi" will not mean much to the user.
What does edpi mean?
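
As a starting point, something like the sketch below might read better. It assumes "edpi" stands for "empty dequeues per interval", which is my reading of how cur_edpi is computed and is not stated anywhere in the patch. format_edpi_msg() is a hypothetical helper, used only so the wording can be shown standalone; in the patch the format string would go straight to RTE_LOG(). PRIu64 is used because both values are uint64_t, which %ld does not match on every platform.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds the text that would be passed to RTE_LOG().
 * Assumption: "edpi" means "empty dequeues per interval".
 */
static int
format_edpi_msg(char *buf, size_t len, uint64_t cur_edpi, uint64_t base_edpi)
{
	return snprintf(buf, len,
			"empty polls this interval (%" PRIu64 ") exceed the "
			"trained baseline (%" PRIu64 ")\n",
			cur_edpi, base_edpi);
}
```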

> +		/* Value to make us fail need debug log*/
> +		return 1000UL;
> +	}
> +
> +	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> +	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> +
> +	for (j = 0; j < BINS_AV; j++) {
> +		tot_edpi += s->edpi_av[j];
> +		tot_ppi += s->ppi_av[j];
> +	}
> +
> +	tot_edpi = tot_edpi / BINS_AV;
> +
> +	percent = 100 - (uint32_t)(((float)tot_edpi /
> +			(float)s->thresh[s->cur_freq].base_edpi) * 100);
> +
> +	return (uint32_t)percent;
> +}
> +
> +
> +static inline void  __attribute__((always_inline))
> +update_stats_normal(struct priority_worker *poll_stats)
> +{
> +	uint32_t percent;
> +
> +	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
> +
> +		RTE_LOG(DEBUG, POWER, "cure freq is %d, edpi is %lu\n",
> +				poll_stats->cur_freq,
> +				poll_stats->thresh[poll_stats->cur_freq].base_edpi);

Again, a more meaningful explanation of edpi is needed here.

> +		return;
> +	}
> +
> +	percent = update_stats(poll_stats);
> +
> +	if (percent > 100) {
> +		RTE_LOG(DEBUG, POWER, "Big than 100 abnormal\n");

Please change to something meaningful to the user. What is the 
percentage returned from update_stats()?
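
From reading update_stats(), the value looks like a busyness percentage: 100 minus the ratio of the averaged empty-poll count to the trained baseline, with 1000 returned as an out-of-range marker. If that reading is right, the log text could say so. A minimal sketch of the calculation, using integer arithmetic where the patch uses float; busyness_percent() and BUSYNESS_INVALID are hypothetical names:

```c
#include <stdint.h>

#define BUSYNESS_INVALID 1000U /* same out-of-range value the patch returns */

/*
 * Hypothetical helper. avg_edpi is the empty-poll count averaged over the
 * last few intervals; base_edpi is the count trained on an idle system.
 */
static uint32_t
busyness_percent(uint64_t avg_edpi, uint64_t base_edpi)
{
	if (base_edpi == 0 || avg_edpi > base_edpi)
		return BUSYNESS_INVALID;

	return (uint32_t)(100 - (avg_edpi * 100) / base_edpi);
}
```

With that framing, a message along the lines of "busyness out of range, baseline may need retraining" would tell the user more than "Big than 100 abnormal".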

> +		return;
> +	}
> +
> +	if (poll_stats->cur_freq == LOW)
> +		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");

Suggest adding "currently" - "Purge Mode is not currently supported\n"

> +	else if (poll_stats->cur_freq == MED) {
> +
> +		if (percent >
> +			poll_stats->thresh[MED].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else {
> +				set_state(poll_stats, HGH_BUSY);
> +				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> +			}
> +
> +		} else {
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	} else if (poll_stats->cur_freq == HGH) {
> +
> +		if (percent <
> +				poll_stats->thresh[HGH].threshold_percent) {
> +
> +			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> +				poll_stats->threshold_ctr++;
> +			else {
> +				set_state(poll_stats, MED_NORMAL);
> +				RTE_LOG(INFO, POWER, "MOVE to MED\n");
> +			}
> +		} else {
> +			/* reset */
> +			poll_stats->threshold_ctr = 0;
> +		}
> +
> +	}
> +}
> +
> +static int
> +empty_poll_training(struct priority_worker *poll_stats,
> +		uint32_t max_train_iter)
> +{
> +
> +	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> +		poll_stats->iter_counter++;
> +		return 0;
> +	}
> +
> +
> +	update_training_stats(poll_stats,
> +			LOW,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			MED,
> +			false,
> +			max_train_iter);
> +
> +	update_training_stats(poll_stats,
> +			HGH,
> +			false,
> +			max_train_iter);
> +
> +
> +	if (poll_stats->thresh[LOW].trained == true
> +			&& poll_stats->thresh[MED].trained == true
> +			&& poll_stats->thresh[HGH].trained == true) {
> +
> +		set_state(poll_stats, MED_NORMAL);
> +
> +		RTE_LOG(INFO, POWER, "Low threshold is %lu\n",
> +				poll_stats->thresh[LOW].base_edpi);

Suggest changing "Low" to "LOW" for consistency with the other log messages below.

> +
> +		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
> +				poll_stats->thresh[MED].base_edpi);
> +
> +
> +		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
> +				poll_stats->thresh[HGH].base_edpi);
> +
> +		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> +				poll_stats->lcore_id);
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +rte_empty_poll_detection(struct rte_timer *tim, void *arg)
> +{
> +
> +	uint32_t i;
> +
> +	struct priority_worker *poll_stats;
> +
> +	RTE_SET_USED(tim);
> +
> +	RTE_SET_USED(arg);
> +
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> +
> +		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> +			continue;
> +
> +		switch (poll_stats->queue_state) {
> +		case(TRAINING):
> +			empty_poll_training(poll_stats,
> +					ep_params->max_train_iter);
> +			break;
> +
> +		case(HGH_BUSY):
> +		case(MED_NORMAL):
> +			update_stats_normal(poll_stats);
> +			break;
> +
> +		case(LOW_PURGE):
> +			break;
> +		default:
> +			break;
> +
> +		}
> +
> +	}
> +
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> +		struct ep_policy *policy)
> +{
> +	uint32_t i;
> +	/* Allocate the ep_params structure */
> +	ep_params = rte_zmalloc_socket(NULL,
> +			sizeof(struct ep_params),
> +			0,
> +			rte_socket_id());
> +
> +	if (!ep_params)
> +		rte_panic("Cannot allocate heap memory for ep_params "
> +				"for socket %d\n", rte_socket_id());
> +
> +	if (freq_tlb == NULL) {
> +		freq_index[LOW] = 14;
> +		freq_index[MED] = 9;
> +		freq_index[HGH] = 1;
> +	} else {
> +		freq_index[LOW] = freq_tlb[LOW];
> +		freq_index[MED] = freq_tlb[MED];
> +		freq_index[HGH] = freq_tlb[HGH];
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> +
> +	/* 5 seconds worth of training */

This now looks to be 2 seconds from the #define above. Maybe:
/* Train for pre-defined period */

> +	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> +
> +	struct stats_data *w = &ep_params->wrk_data;
> +
> +	*eptr = ep_params;
> +
> +	/* initialize all wrk_stats state */
> +	for (i = 0; i < NUM_NODES; i++) {
> +
> +		if (rte_lcore_is_enabled(i) == 0)
> +			continue;
> +		/*init the freqs table */
> +		total_avail_freqs[i] = rte_power_freqs(i,
> +				avail_freqs[i],
> +				NUM_FREQS);
> +
> +		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
> +				total_avail_freqs[i],
> +				i);
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +		if (rte_get_master_lcore() != i) {
> +			w->wrk_stats[i].lcore_id = i;
> +			set_policy(&w->wrk_stats[i], policy);
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void)
> +{
> +
> +	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> +
> +	if (ep_params != NULL)
> +		rte_free(ep_params);
> +}
> +
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->empty_dequeues++;
> +
> +	return 0;
> +}
> +
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> +{
> +
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	poll_stats->num_dequeue_pkts += nb_pkt;
> +
> +	return 0;
> +}
> +
> +
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->empty_dequeues;
> +}
> +
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id)
> +{
> +	struct priority_worker *poll_stats;
> +
> +	if (lcore_id >= NUM_NODES)
> +		return -1;
> +
> +	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> +
> +	if (poll_stats->lcore_id == 0)
> +		poll_stats->lcore_id = lcore_id;
> +
> +	return poll_stats->num_dequeue_pkts;
> +}
> diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
> new file mode 100644
> index 0000000..ae27f7d
> --- /dev/null
> +++ b/lib/librte_power/rte_power_empty_poll.h
> @@ -0,0 +1,219 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2018 Intel Corporation
> + */
> +
> +#ifndef _RTE_EMPTY_POLL_H
> +#define _RTE_EMPTY_POLL_H
> +
> +/**
> + * @file
> + * RTE Power Management
> + */
> +#include <stdint.h>
> +#include <stdbool.h>
> +
> +#include <rte_common.h>
> +#include <rte_byteorder.h>
> +#include <rte_log.h>
> +#include <rte_string_fns.h>
> +#include <rte_power.h>
> +#include <rte_timer.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#define NUM_FREQS 20

I don't think this is enough. Suggest using RTE_MAX_LCORE_FREQS

> +
> +#define BINS_AV 4 /* Has to be ^2 */
> +
> +#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> +
> +#define NUM_PRIORITIES          2
> +
> +#define NUM_NODES         256  /* Max core number*/
> +
> +/* Processor Power State */
> +enum freq_val {
> +	LOW,
> +	MED,
> +	HGH,
> +	NUM_FREQ = NUM_FREQS
> +};
> +

Why is NUM_FREQ in this enum? 0, 1, 2, 20 (or RTE_MAX_LCORE_FREQS) does not
seem right.
If you are using NUM_FREQ in the code, why not just use NUM_FREQS?
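
One possible shape, as a sketch only: keep the enum as a plain list of power states with a trailing count, and size the per-state array from that count. NUM_FREQ_STATES is a hypothetical name; the number of available frequencies per lcore would stay a separate constant.

```c
#include <stdbool.h>
#include <stdint.h>

/* Processor power state */
enum freq_val {
	LOW,
	MED,
	HGH,
	NUM_FREQ_STATES /* hypothetical: count of states, not of frequencies */
};

/* Same fields as in the patch */
struct freq_threshold {
	uint64_t base_edpi;
	bool trained;
	uint32_t threshold_percent;
	uint32_t cur_train_iter;
};

/* One threshold entry per power state */
struct freq_threshold thresh[NUM_FREQ_STATES];
```

That keeps thresh[] at three entries, which is all the code quoted above appears to index.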

> +
> +/* Queue Polling State */
> +enum queue_state {
> +	TRAINING, /* NO TRAFFIC */
> +	MED_NORMAL,   /* MED */
> +	HGH_BUSY,     /* HIGH */
> +	LOW_PURGE,    /* LOW */
> +};
> +
> +/* Queue Stats */
> +struct freq_threshold {
> +
> +	uint64_t base_edpi;
> +	bool trained;
> +	uint32_t threshold_percent;
> +	uint32_t cur_train_iter;
> +};
> +
> +/* Each Worder Thread Empty Poll Stats */
> +struct priority_worker {
> +
> +	/* Current dequeue and throughput counts */
> +	/* These 2 are written to by the worker threads */
> +	/* So keep them on their own cache line */
> +	uint64_t empty_dequeues;
> +	uint64_t num_dequeue_pkts;
> +
> +	enum queue_state queue_state;
> +
> +	uint64_t empty_dequeues_prev;
> +	uint64_t num_dequeue_pkts_prev;
> +
> +	/* Used for training only */
> +	struct freq_threshold thresh[NUM_FREQ];
> +	enum freq_val cur_freq;
> +
> +	/* bucket arrays to calculate the averages */
> +	uint64_t edpi_av[BINS_AV];
> +	uint32_t  ec;
> +	uint64_t ppi_av[BINS_AV];
> +	uint32_t  pc;
> +
> +	uint32_t lcore_id;
> +	uint32_t iter_counter;
> +	uint32_t threshold_ctr;
> +	uint32_t display_ctr;
> +	uint8_t  dev_id;
> +
> +} __rte_cache_aligned;
> +

Suggest adding a comment on each of the variables above explaining what 
the acronym means.
E.g. edpi, ec, pc, ppi.


> +
> +struct stats_data {
> +
> +	struct priority_worker wrk_stats[NUM_NODES];
> +
> +	/* flag to stop rx threads processing packets until training over */
> +	bool start_rx;
> +
> +};
> +
> +/* Empty Poll Parameters */
> +struct ep_params {
> +
> +	/* Timer related stuff */
> +	uint64_t interval_ticks;
> +	uint32_t max_train_iter;
> +
> +	struct rte_timer timer0;
> +	struct stats_data wrk_data;
> +};
> +
> +
> +/* Sample App Init information */
> +struct ep_policy {
> +
> +	uint64_t med_base_edpi;
> +	uint64_t hgh_base_edpi;
> +
> +	enum queue_state state;
> +};
> +
> +
> +
> +/**
> + * Initialize the power management system.
> + *
> + * @param eptr
> + *   the structure of empty poll configuration
> + * @freq_tlb
> + *   the power state/frequency  mapping table
> + * @policy
> + *   the initialization policy from sample app
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> +		struct ep_policy *policy);
> +
> +/**
> + * Free the resource hold by power management system.
> + */
> +void __rte_experimental
> +rte_power_empty_poll_stat_free(void);
> +
> +/**
> + * Update specific core empty poll counter
> + * It's not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_empty_poll_stat_update(unsigned int lcore_id);
> +
> +/**
> + * Update specific core valid poll counter, not thread safe.
> + *
> + * @param lcore_id
> + *  lcore id.
> + * @param nb_pkt
> + *  The packet number of one valid poll.
> + *
> + * @return
> + *  - 0 on success.
> + *  - Negative on error.
> + */
> +int __rte_experimental
> +rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> +
> +/**
> + * Fetch specific core empty poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore empty poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Fetch specific core valid poll counter.
> + *
> + * @param lcore_id
> + *  lcore id
> + *
> + * @return
> + *  Current lcore valid poll counter value.
> + */
> +uint64_t __rte_experimental
> +rte_power_poll_stat_fetch(unsigned int lcore_id);
> +
> +/**
> + * Empty poll  state change detection function
> + *
> + * @param  tim
> + *  The timer structure
> + * @param  arg
> + *  The customized parameter
> + */
> +void  __rte_experimental
> +rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif
> diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
> index dd587df..11ffdfb 100644
> --- a/lib/librte_power/rte_power_version.map
> +++ b/lib/librte_power/rte_power_version.map
> @@ -33,3 +33,16 @@ DPDK_18.08 {
>   	rte_power_get_capabilities;
>   
>   } DPDK_17.11;
> +
> +EXPERIMENTAL {
> +        global:
> +
> +        rte_power_empty_poll_stat_init;
> +        rte_power_empty_poll_stat_free;
> +        rte_power_empty_poll_stat_update;
> +        rte_power_empty_poll_stat_fetch;
> +        rte_power_poll_stat_fetch;
> +        rte_power_poll_stat_update;
> +        rte_empty_poll_detection;
> +
> +};

checkpatch has several warnings:



### lib/librte_power: traffic pattern aware power control

WARNING:LONG_LINE: line over 80 characters
#355: FILE: lib/librte_power/rte_power_empty_poll.c:199:
+ poll_stats->thresh[freq].base_edpi / 2000;

WARNING:LONG_LINE: line over 80 characters
#417: FILE: lib/librte_power/rte_power_empty_poll.c:261:
+ poll_stats->thresh[poll_stats->cur_freq].base_edpi);

total: 0 errors, 2 warnings, 802 lines checked
ERROR: symbol rte_empty_poll_detection is added in a section other than 
the EXPERIMENTAL section of the version map
ERROR: symbol rte_power_empty_poll_stat_fetch is added in a section 
other than the EXPERIMENTAL section of the version map
ERROR: symbol rte_power_empty_poll_stat_free is added in a section other 
than the EXPERIMENTAL section of the version map
ERROR: symbol rte_power_empty_poll_stat_init is added in a section other 
than the EXPERIMENTAL section of the version map
ERROR: symbol rte_power_empty_poll_stat_update is added in a section 
other than the EXPERIMENTAL section of the version map
ERROR: symbol rte_power_poll_stat_fetch is added in a section other than 
the EXPERIMENTAL section of the version map
ERROR: symbol rte_power_poll_stat_update is added in a section other 
than the EXPERIMENTAL section of the version map
Warning in /lib/librte_power/rte_power_empty_poll.c:
are you sure you want to add the following:
rte_panic\(


Rgds,
Dave.


* Re: [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-09-28 11:19                     ` Hunt, David
  2018-10-02 10:18                       ` Liang, Ma
  0 siblings, 1 reply; 79+ messages in thread
From: Hunt, David @ 2018-09-28 11:19 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, lei.a.yao, ktraynor, john.geary

Hi Liang,

A few tweaks below:


On 17/9/2018 2:30 PM, Liang Ma wrote:
> Add the support for new traffic pattern aware power control
> power management API.
>
> Example:
> ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
> -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
>
> Please Reference l3fwd-power document for all parameter except
> empty-poll.

The docs should probably include the empty-poll parameter. Suggest
re-wording to:

Please refer to the l3fwd-power document for full parameter usage.



> The option "l", "m", "h" are used to set the power index for
> LOW, MED, HIGH power state. only is useful after enable empty-poll
>
> --empty-poll="training_flag, med_threshold, high_threshold"
>
> The option training_flag is used to enable/disable training mode.
>
> The option med_threshold is used to indicate the empty poll threshold
> of modest state which is customized by user.
>
> The option high_threshold is used to indicate the empty poll threshold
> of busy state which is customized by user.
>
> Above three option default value is all 0.
>
> Once enable empty-poll. System will apply the default parameter.
> Training mode is disabled as default.

Suggest:

Once empty-poll is enabled, the system will apply the default parameters if no
other command line options are provided.



> If training mode is triggered, there should not has any traffic
> pass-through during training phase.

Suggest:
If training mode is enabled, the user should ensure that no traffic
is allowed to pass through the system.

> When training phase complete, system transfer to normal phase.

When the training phase is complete, the application transfers to normal operation.



>
> System will running with modest power stat at beginning.

The system will start running in the modest power mode.


> If the system busyness percentage above 70%, then system will adjust
> power state move to High power state. If the traffic become lower(eg. The
> system busyness percentage drop below 30%), system will fallback
> to the modest power state.

If the traffic goes above 70%, the system will move to the High power state.
If the traffic drops below 30%, the system will fall back to the modest
power state.
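
For reference, the behaviour described here, as I read update_stats_normal() in the library patch, can be sketched as below. The move only happens once the condition has held for a run of consecutive timer intervals. The 70 and 30 values are taken from the commit message, HOLD_INTERVALS stands in for INTERVALS_PER_SECOND, and the struct and function names are hypothetical.

```c
#include <stdint.h>

#define HOLD_INTERVALS 100 /* stands in for INTERVALS_PER_SECOND */
#define UP_PERCENT 70
#define DOWN_PERCENT 30

enum pstate { ST_MED, ST_HGH };

struct ctl {
	enum pstate state;
	uint32_t ctr; /* consecutive intervals past the threshold */
};

/* Called once per timer interval with the current busyness percentage. */
static void
ctl_step(struct ctl *c, uint32_t percent)
{
	int past = (c->state == ST_MED) ? (percent > UP_PERCENT)
					: (percent < DOWN_PERCENT);

	if (!past) {
		c->ctr = 0; /* condition broken: start counting again */
		return;
	}
	if (c->ctr < HOLD_INTERVALS) {
		c->ctr++;
		return;
	}
	c->state = (c->state == ST_MED) ? ST_HGH : ST_MED;
	c->ctr = 0;
}
```

If that matches the intent, it may be worth saying in the commit message that a short burst above 70% does not trigger a move on its own.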


> Example code use master thread to monitoring worker thread busyness.
> the default timer resolution is 10ms.
>
> ChangeLog:
> v2 fix some coding style issues
> v3 rename the API.
> v6 re-work the API.
> v7 no change.
> v8 disable training as default option.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>
> Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> ---
>   examples/l3fwd-power/Makefile    |   3 +
>   examples/l3fwd-power/main.c      | 325 +++++++++++++++++++++++++++++++++++++--
>   examples/l3fwd-power/meson.build |   1 +
>   3 files changed, 312 insertions(+), 17 deletions(-)
>
> diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
> index d7e39a3..772ec7b 100644
> --- a/examples/l3fwd-power/Makefile
> +++ b/examples/l3fwd-power/Makefile
> @@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
>   LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
>   LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
>   
> +CFLAGS += -DALLOW_EXPERIMENTAL_API
> +
>   build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
>   	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
>   
> @@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
>   all:
>   else
>   
> +CFLAGS += -DALLOW_EXPERIMENTAL_API
>   CFLAGS += -O3
>   CFLAGS += $(WERROR_FLAGS)
>   
> diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> index 68527d2..1465608 100644
> --- a/examples/l3fwd-power/main.c
> +++ b/examples/l3fwd-power/main.c
> @@ -43,6 +43,7 @@
>   #include <rte_timer.h>
>   #include <rte_power.h>
>   #include <rte_spinlock.h>
> +#include <rte_power_empty_poll.h>
>   
>   #include "perf_core.h"
>   #include "main.h"
> @@ -55,6 +56,8 @@
>   
>   /* 100 ms interval */
>   #define TIMER_NUMBER_PER_SECOND           10
> +/* (10ms) */
> +#define INTERVALS_PER_SECOND             100
>   /* 100000 us */
>   #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
>   #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
> @@ -117,6 +120,11 @@
>    */
>   #define RTE_TEST_RX_DESC_DEFAULT 1024
>   #define RTE_TEST_TX_DESC_DEFAULT 1024
> +#define EMPTY_POLL_MED_THRESHOLD 350000UL
> +#define EMPTY_POLL_HGH_THRESHOLD 580000UL

I'd suggest adding some explanation around these two numbers.
E.g.
/*
 * These two thresholds were decided on by running the training algorithm
 * on a 2.5GHz Xeon. These defaults can be overridden by supplying non-zero
 * values for the med_threshold and high_threshold parameters on the
 * command line.
 */


> +
> +
> +
>   static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
>   static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
>   
> @@ -132,6 +140,14 @@ static uint32_t enabled_port_mask = 0;
>   static int promiscuous_on = 0;
>   /* NUMA is enabled by default. */
>   static int numa_on = 1;
> +/* emptypoll is disabled by default. */
> +static bool empty_poll_on;
> +static bool empty_poll_train;
> +volatile bool empty_poll_stop;
> +static struct  ep_params *ep_params;
> +static struct  ep_policy policy;
> +static long  ep_med_edpi, ep_hgh_edpi;
> +
>   static int parse_ptype; /**< Parse packet type using rx callback, and */
>   			/**< disabled by default */
>   
> @@ -330,6 +346,13 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
>   static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
>   		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
>   
> +static uint8_t  freq_tlb[] = {14, 9, 1};
> +

Maybe an explanation on where these numbers came from. E.g.
/*
 * These defaults are using the max frequency index (1), a medium index (9)
 * and a typical low frequency index (14). These can be adjusted to use
 * different indexes using the relevant command line parameters.
 */


> +static int is_done(void)
> +{
> +	return empty_poll_stop;
> +}
> +
>   /* exit signal handler */
>   static void
>   signal_exit_now(int sigtype)
> @@ -338,7 +361,15 @@ signal_exit_now(int sigtype)
>   	unsigned int portid;
>   	int ret;
>   
> +	RTE_SET_USED(lcore_id);
> +	RTE_SET_USED(portid);
> +	RTE_SET_USED(ret);
> +
>   	if (sigtype == SIGINT) {
> +		if (empty_poll_on)
> +			empty_poll_stop = true;
> +
> +
>   		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
>   			if (rte_lcore_is_enabled(lcore_id) == 0)
>   				continue;
> @@ -351,16 +382,19 @@ signal_exit_now(int sigtype)
>   							"core%u\n", lcore_id);
>   		}
>   
> -		RTE_ETH_FOREACH_DEV(portid) {
> -			if ((enabled_port_mask & (1 << portid)) == 0)
> -				continue;
> +		if (!empty_poll_on) {
> +			RTE_ETH_FOREACH_DEV(portid) {
> +				if ((enabled_port_mask & (1 << portid)) == 0)
> +					continue;
>   
> -			rte_eth_dev_stop(portid);
> -			rte_eth_dev_close(portid);
> +				rte_eth_dev_stop(portid);
> +				rte_eth_dev_close(portid);
> +			}
>   		}
>   	}
>   
> -	rte_exit(EXIT_SUCCESS, "User forced exit\n");
> +	if (!empty_poll_on)
> +		rte_exit(EXIT_SUCCESS, "User forced exit\n");
>   }
>   
>   /*  Freqency scale down timer callback */
> @@ -825,7 +859,107 @@ static int event_register(struct lcore_conf *qconf)
>   
>   	return 0;
>   }
> +/* main processing loop */
> +static int
> +main_empty_poll_loop(__attribute__((unused)) void *dummy)
> +{
> +	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> +	unsigned int lcore_id;
> +	uint64_t prev_tsc, diff_tsc, cur_tsc;
> +	int i, j, nb_rx;
> +	uint8_t queueid;
> +	uint16_t portid;
> +	struct lcore_conf *qconf;
> +	struct lcore_rx_queue *rx_queue;
> +
> +	const uint64_t drain_tsc =
> +		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
> +
> +	prev_tsc = 0;
> +
> +	lcore_id = rte_lcore_id();
> +	qconf = &lcore_conf[lcore_id];
> +
> +	if (qconf->n_rx_queue == 0) {
> +		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
> +		return 0;
> +	}
> +
> +	for (i = 0; i < qconf->n_rx_queue; i++) {
> +		portid = qconf->rx_queue_list[i].port_id;
> +		queueid = qconf->rx_queue_list[i].queue_id;
> +		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
> +				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
> +	}
> +
> +	while (!is_done()) {
> +		stats[lcore_id].nb_iteration_looped++;
> +
> +		cur_tsc = rte_rdtsc();
> +		/*
> +		 * TX burst queue drain
> +		 */
> +		diff_tsc = cur_tsc - prev_tsc;
> +		if (unlikely(diff_tsc > drain_tsc)) {
> +			for (i = 0; i < qconf->n_tx_port; ++i) {
> +				portid = qconf->tx_port_id[i];
> +				rte_eth_tx_buffer_flush(portid,
> +						qconf->tx_queue_id[portid],
> +						qconf->tx_buffer[portid]);
> +			}
> +			prev_tsc = cur_tsc;
> +		}
> +
> +		/*
> +		 * Read packet from RX queues
> +		 */
> +		for (i = 0; i < qconf->n_rx_queue; ++i) {
> +			rx_queue = &(qconf->rx_queue_list[i]);
> +			rx_queue->idle_hint = 0;
> +			portid = rx_queue->port_id;
> +			queueid = rx_queue->queue_id;
> +
> +			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
> +					MAX_PKT_BURST);
> +
> +			stats[lcore_id].nb_rx_processed += nb_rx;
> +
> +			if (nb_rx == 0) {
> +
> +				rte_power_empty_poll_stat_update(lcore_id);
> +
> +				continue;
> +			} else {
> +				rte_power_poll_stat_update(lcore_id, nb_rx);
> +			}
> +
> +
> +			/* Prefetch first packets */
> +			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
> +				rte_prefetch0(rte_pktmbuf_mtod(
> +							pkts_burst[j], void *));
> +			}
> +
> +			/* Prefetch and forward already prefetched packets */
> +			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> +				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> +							j + PREFETCH_OFFSET], void *));
> +				l3fwd_simple_forward(pkts_burst[j], portid,
> +						qconf);
> +			}
>   
> +			/* Forward remaining prefetched packets */
> +			for (; j < nb_rx; j++) {
> +				l3fwd_simple_forward(pkts_burst[j], portid,
> +						qconf);
> +			}
> +
> +		}
> +
> +	}
> +
> +	return 0;
> +}
>   /* main processing loop */
>   static int
>   main_loop(__attribute__((unused)) void *dummy)
> @@ -1127,7 +1261,8 @@ print_usage(const char *prgname)
>   		"  --no-numa: optional, disable numa awareness\n"
>   		"  --enable-jumbo: enable jumbo frame"
>   		" which max packet len is PKTLEN in decimal (64-9600)\n"
> -		"  --parse-ptype: parse packet type by software\n",
> +		"  --parse-ptype: parse packet type by software\n"
> +		"  --empty=poll: enable empty poll detection\n",

typo: "empty=poll" should be "empty-poll"

I really think some info on what should be supplied with the empty-poll
parameter should be mentioned here, e.g.

--empty-poll="training_flag, med_threshold, high_threshold"



>   		prgname);
>   }
>   
> @@ -1220,7 +1355,55 @@ parse_config(const char *q_arg)
>   
>   	return 0;
>   }
> +static int
> +parse_ep_config(const char *q_arg)
> +{
> +	char s[256];
> +	const char *p = q_arg;
> +	char *end;
> +	int  num_arg;
> +
> +	char *str_fld[3];
> +
> +	int training_flag;
> +	int med_edpi;
> +	int hgh_edpi;
> +
> +	ep_med_edpi = EMPTY_POLL_MED_THRESHOLD;
> +	ep_hgh_edpi = EMPTY_POLL_MED_THRESHOLD;
> +
> +	snprintf(s, sizeof(s), "%s", p);
> +
> +	num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ',');
> +
> +	empty_poll_train = false;
> +
> +	if (num_arg == 0)
> +		return 0;
>   
> +	if (num_arg == 3) {
> +
> +		training_flag = strtoul(str_fld[0], &end, 0);
> +		med_edpi = strtoul(str_fld[1], &end, 0);
> +		hgh_edpi = strtoul(str_fld[2], &end, 0);
> +
> +		if (training_flag == 1)
> +			empty_poll_train = true;
> +
> +		if (med_edpi > 0)
> +			ep_med_edpi = med_edpi;
> +
> +		if (med_edpi > 0)
> +			ep_hgh_edpi = hgh_edpi;
> +
> +	} else {
> +
> +		return -1;
> +	}
> +
> +	return 0;
> +
> +}
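
Two things look off in parse_ep_config() as posted: ep_hgh_edpi is initialised from EMPTY_POLL_MED_THRESHOLD rather than EMPTY_POLL_HGH_THRESHOLD, and the override of ep_hgh_edpi is guarded by med_edpi > 0 rather than hgh_edpi > 0. The strtoul() results are also stored in int. A rough standalone sketch of what I would expect is below. It walks the string with strtoul() instead of rte_strsplit() so it compiles on its own, and struct ep_cfg and parse_ep_config_sketch() are hypothetical names, not a proposal for the final code.

```c
#include <stdbool.h>
#include <stdlib.h>

#define EMPTY_POLL_MED_THRESHOLD 350000UL
#define EMPTY_POLL_HGH_THRESHOLD 580000UL

/* Hypothetical container for the parsed values */
struct ep_cfg {
	bool train;
	unsigned long med_edpi;
	unsigned long hgh_edpi;
};

/* Parse "training_flag,med_threshold,high_threshold"; 0 keeps a default. */
static int
parse_ep_config_sketch(const char *arg, struct ep_cfg *cfg)
{
	unsigned long v[3];
	const char *p = arg;
	char *end;
	int i;

	for (i = 0; i < 3; i++) {
		v[i] = strtoul(p, &end, 0);
		if (end == p)
			return -1; /* no digits in this field */
		if (*end != (i < 2 ? ',' : '\0'))
			return -1; /* wrong separator or trailing text */
		p = end + 1;
	}

	cfg->train = (v[0] == 1);
	cfg->med_edpi = v[1] > 0 ? v[1] : EMPTY_POLL_MED_THRESHOLD;
	cfg->hgh_edpi = v[2] > 0 ? v[2] : EMPTY_POLL_HGH_THRESHOLD;
	return 0;
}
```

Unlike the posted code, this sketch rejects an empty argument instead of returning 0; whether that is wanted is a separate question.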
>   #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
>   
>   /* Parse the argument given in the command line of the application */
> @@ -1230,6 +1413,7 @@ parse_args(int argc, char **argv)
>   	int opt, ret;
>   	char **argvopt;
>   	int option_index;
> +	uint32_t limit;
>   	char *prgname = argv[0];
>   	static struct option lgopts[] = {
>   		{"config", 1, 0, 0},
> @@ -1237,13 +1421,14 @@ parse_args(int argc, char **argv)
>   		{"high-perf-cores", 1, 0, 0},
>   		{"no-numa", 0, 0, 0},
>   		{"enable-jumbo", 0, 0, 0},
> +		{"empty-poll", 1, 0, 0},
>   		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
>   		{NULL, 0, 0, 0}
>   	};
>   
>   	argvopt = argv;
>   
> -	while ((opt = getopt_long(argc, argvopt, "p:P",
> +	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
>   				lgopts, &option_index)) != EOF) {
>   
>   		switch (opt) {
> @@ -1260,7 +1445,18 @@ parse_args(int argc, char **argv)
>   			printf("Promiscuous mode selected\n");
>   			promiscuous_on = 1;
>   			break;
> -
> +		case 'l':
> +			limit = parse_max_pkt_len(optarg);
> +			freq_tlb[LOW] = limit;
> +			break;
> +		case 'm':
> +			limit = parse_max_pkt_len(optarg);
> +			freq_tlb[MED] = limit;
> +			break;
> +		case 'h':
> +			limit = parse_max_pkt_len(optarg);
> +			freq_tlb[HGH] = limit;
> +			break;
>   		/* long options */
>   		case 0:
>   			if (!strncmp(lgopts[option_index].name, "config", 6)) {
> @@ -1299,6 +1495,20 @@ parse_args(int argc, char **argv)
>   			}
>   
>   			if (!strncmp(lgopts[option_index].name,
> +						"empty-poll", 10)) {
> +				printf("empty-poll is enabled\n");
> +				empty_poll_on = true;
> +				ret = parse_ep_config(optarg);
> +
> +				if (ret) {
> +					printf("invalid empty poll config\n");
> +					print_usage(prgname);
> +					return -1;
> +				}
> +
> +			}
> +
> +			if (!strncmp(lgopts[option_index].name,
>   					"enable-jumbo", 12)) {
>   				struct option lenopts =
>   					{"max-pkt-len", required_argument, \
> @@ -1646,6 +1856,59 @@ init_power_library(void)
>   	}
>   	return ret;
>   }
> +static void
> +empty_poll_setup_timer(void)
> +{
> +	int lcore_id = rte_lcore_id();
> +	uint64_t hz = rte_get_timer_hz();
> +
> +	struct  ep_params *ep_ptr = ep_params;
> +
> +	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
> +
> +	rte_timer_reset_sync(&ep_ptr->timer0,
> +			ep_ptr->interval_ticks,
> +			PERIODICAL,
> +			lcore_id,
> +			rte_empty_poll_detection,
> +			(void *)ep_ptr);
> +
> +}
> +static int
> +launch_timer(unsigned int lcore_id)
> +{
> +	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
> +
> +	RTE_SET_USED(lcore_id);
> +
> +
> +	if (rte_get_master_lcore() != lcore_id) {
> +		rte_panic("timer on lcore:%d which is not master core:%d\n",
> +				lcore_id,
> +				rte_get_master_lcore());
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
> +
> +	empty_poll_setup_timer();
> +
> +	cycles_10ms = rte_get_timer_hz() / 100;
> +
> +	while (!is_done()) {
> +		cur_tsc = rte_rdtsc();
> +		diff_tsc = cur_tsc - prev_tsc;
> +		if (diff_tsc > cycles_10ms) {
> +			rte_timer_manage();
> +			prev_tsc = cur_tsc;
> +			cycles_10ms = rte_get_timer_hz() / 100;
> +		}
> +	}
> +
> +	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
> +
> +	return 0;
> +}
> +
>   
>   int
>   main(int argc, char **argv)
> @@ -1828,13 +2091,15 @@ main(int argc, char **argv)
>   		if (rte_lcore_is_enabled(lcore_id) == 0)
>   			continue;
>   
> -		/* init timer structures for each enabled lcore */
> -		rte_timer_init(&power_timers[lcore_id]);
> -		hz = rte_get_timer_hz();
> -		rte_timer_reset(&power_timers[lcore_id],
> -			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
> -						power_timer_cb, NULL);
> -
> +		if (empty_poll_on == false) {
> +			/* init timer structures for each enabled lcore */
> +			rte_timer_init(&power_timers[lcore_id]);
> +			hz = rte_get_timer_hz();
> +			rte_timer_reset(&power_timers[lcore_id],
> +					hz/TIMER_NUMBER_PER_SECOND,
> +					SINGLE, lcore_id,
> +					power_timer_cb, NULL);
> +		}
>   		qconf = &lcore_conf[lcore_id];
>   		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
>   		fflush(stdout);
> @@ -1905,12 +2170,38 @@ main(int argc, char **argv)
>   
>   	check_all_ports_link_status(enabled_port_mask);
>   
> +	if (empty_poll_on == true) {
> +
> +		if (empty_poll_train) {
> +			policy.state = TRAINING;
> +		} else {
> +			policy.state = MED_NORMAL;
> +			policy.med_base_edpi = ep_med_edpi;
> +			policy.hgh_base_edpi = ep_hgh_edpi;
> +		}
> +
> +		rte_power_empty_poll_stat_init(&ep_params, freq_tlb, &policy);
> +	}
> +
> +
>   	/* launch per-lcore init on every lcore */
> -	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> +	if (empty_poll_on == false) {
> +		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> +	} else {
> +		empty_poll_stop = false;
> +		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
> +	}
> +
> +	if (empty_poll_on == true)
> +		launch_timer(rte_lcore_id());
> +
>   	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
>   		if (rte_eal_wait_lcore(lcore_id) < 0)
>   			return -1;
>   	}
>   
> +	if (empty_poll_on)
> +		rte_power_empty_poll_stat_free();
> +
>   	return 0;
>   }
> diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
> index 20c8054..a3c5c2f 100644
> --- a/examples/l3fwd-power/meson.build
> +++ b/examples/l3fwd-power/meson.build
> @@ -9,6 +9,7 @@
>   if host_machine.system() != 'linux'
>   	build = false
>   endif
> +allow_experimental_apis = true
>   deps += ['power', 'timer', 'lpm', 'hash']
>   sources = files(
>   	'main.c', 'perf_core.c'


Checkpatch throws up some warnings:


### examples/l3fwd-power: simple app update for new API

WARNING:LONG_LINE: line over 80 characters
#201: FILE: examples/l3fwd-power/main.c:876:
+               (rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;

WARNING:LONG_LINE: line over 80 characters
#209: FILE: examples/l3fwd-power/main.c:884:
+               RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);

WARNING:LONG_LINE: line over 80 characters
#271: FILE: examples/l3fwd-power/main.c:946:
+                                                       j + PREFETCH_OFFSET], void *));

WARNING:LONG_LINE: line over 80 characters
#529: FILE: examples/l3fwd-power/main.c:2192:
+               rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);

total: 0 errors, 4 warnings, 467 lines checked


Rgds,
Dave.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API
  2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API Liang Ma
  2018-09-25 12:31                     ` Kovacevic, Marko
  2018-09-25 12:44                     ` Kovacevic, Marko
@ 2018-09-28 12:30                     ` Hunt, David
  2 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-09-28 12:30 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, lei.a.yao, ktraynor, john.geary

Hi Liang,


On 17/9/2018 2:30 PM, Liang Ma wrote:
> Update the document for empty poll API.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>   doc/guides/prog_guide/power_man.rst | 90 +++++++++++++++++++++++++++++++++++++
>   1 file changed, 90 insertions(+)
>
> diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
> index eba1cc6..056cb12 100644
> --- a/doc/guides/prog_guide/power_man.rst
> +++ b/doc/guides/prog_guide/power_man.rst
> @@ -106,6 +106,96 @@ User Cases
>   
>   The power management mechanism is used to save power when performing L3 forwarding.
>   
> +
> +Empty Poll API
> +--------------
> +
> +Abstract
> +~~~~~~~~
> +
> +For packet processing workloads such as DPDK, polling is continuous.
> +This means CPU cores always show 100% busy, independent of how much work
> +those cores are doing. Accurately determining how busy a core really is
> +matters, because without that information:
> +
> +        * There is no indication of overload conditions
> +        * The user does not know how much real load is on a system,
> +          resulting in wasted energy as no power management is utilized
> +
> +Compared to the original l3fwd-power design, instead of going to sleep
> +after detecting an empty poll, the new mechanism just lowers the core frequency.
> +As a result, the application does not stop polling the device, which leads
> +to improved handling of bursts of traffic.
> +
> +When the system becomes busy, the empty poll mechanism can also increase the core
> +frequency (including turbo) to do best effort for intensive traffic. This gives
> +us more flexible and balanced traffic awareness over the standard l3fwd-power
> +application.
> +
> +
> +Proposed Solution
> +~~~~~~~~~~~~~~~~~
> +The proposed solution focuses on how many times empty polls are executed.
> +The fewer empty polls there are, the busier the current core is with processing
> +workload, and therefore the higher the frequency that is needed. A high empty poll
> +count indicates that the current core is not doing any real work, so we can lower
> +the frequency to save power.
> +
> +In the current implementation, each core has 1 empty-poll counter which assumes
> +1 core is dedicated to 1 queue. This will need to be expanded in the future to
> +support multiple queues per core.
> +
> +Power state definition:
> +^^^^^^^^^^^^^^^^^^^^^^^
> +
> +* LOW:  Not currently used, reserved for future use.
> +
> +* MED:  the frequency is used to process modest traffic workload.
> +
> +* HIGH: the frequency is used to process busy traffic workload.
> +
> +There are two phases to establish the power management system:
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +* Initialization/Training phase. The training phase is necessary
> +  in order to figure out the system polling baseline numbers from
> +  idle to busy. The highest poll count will be during idle, where all
> +  polls are empty. These poll counts will be different between
> +  systems due to the many possible processor micro-arch, cache
> +  and device configurations, hence the training phase.
> +  In the training phase, traffic is blocked so the training algorithm
> +  can average the empty-poll numbers for the LOW, MED and
> +  HIGH  power states in order to create a baseline.
> +  The core's counter are collected every 10ms, and the Training
> +  phase will take 2 seconds.
> +  Training is disabled as default configuration.
> +  The default parameter is applied. Simple App still can trigger
> +  training if that's needed

Suggest:

* Training phase. This phase is used to measure the optimal frequency
   change thresholds for a given system. The thresholds will differ from
   system to system due to differences in processor micro-architecture,
   cache and device configurations.
   In this phase, the user must ensure that no traffic can enter the
   system so that counts can be measured for empty polls at low, medium
   and high frequencies. Each frequency is measured for two seconds.
   Once the training phase is complete, the threshold numbers are
   displayed, normal mode resumes, and traffic can be allowed into
   the system. These threshold numbers can be used on the command line
   when starting the application in normal mode to avoid re-training
   every time.
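
As a rough illustration of the measurement described above, the sketch
below averages per-interval empty-poll counts into a baseline and adds a
small 0.05% margin. The struct and function names are hypothetical and
are not taken from the library; only the averaging and the margin follow
what the patch appears to do.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-frequency training record (not the library's struct). */
struct train_rec {
	uint64_t sum;      /* empty polls accumulated over sampled intervals */
	uint32_t samples;  /* number of intervals sampled so far */
};

/* Add one interval's empty-poll delta to the running sum. */
static void
train_add_sample(struct train_rec *r, uint64_t cur_count, uint64_t prev_count)
{
	assert(cur_count >= prev_count);
	r->sum += cur_count - prev_count;
	r->samples++;
}

/* Average the samples, then add a 0.05% margin so that an idle core
 * does not read as slightly busy. */
static uint64_t
train_baseline(const struct train_rec *r)
{
	uint64_t base;

	assert(r->samples > 0);
	base = r->sum / r->samples;
	return base + base / 2000;
}
```

With 10ms intervals and two seconds per frequency, train_add_sample()
would be called about 200 times before train_baseline() is read.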


> +
> +* Normal phase. When the training phase is complete, traffic is
> +  started. The run-time poll counts are compared with the
> +  baseline and the decision will be taken to move to MED power
> +  state or HIGH power state. The counters are calculated every
> +  10ms.
> +

Suggest:

* Normal operation. Every 10ms the run-time counters are compared
   to the supplied threshold values, and the decision will be made
   whether to move to a different power state (by adjusting the
   frequency).



> +
> +API Overview for Empty Poll Power Management
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +* **State Init**: initialize the power management system.
> +
> +* **State Free**: free the resources held by the power management system.
> +
> +* **Update Empty Poll Counter**: update the empty poll counter.
> +
> +* **Update Valid Poll Counter**: update the valid poll counter.
> +
> +* **Set the Frequency Index**: update the power state/frequency mapping.
> +
> +* **Detect empty poll state change**: run the empty poll state change detection algorithm, then take action.
> +
> +User Cases
> +----------
> +The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
> +
>   References
>   ----------
>   

Rgds,
Dave.

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update
  2018-09-25 13:20                     ` Kovacevic, Marko
@ 2018-09-28 12:43                       ` Hunt, David
  2018-09-28 12:52                         ` Liang, Ma
  0 siblings, 1 reply; 79+ messages in thread
From: Hunt, David @ 2018-09-28 12:43 UTC (permalink / raw)
  To: Kovacevic, Marko, Ma, Liang J; +Cc: dev, Yao, Lei A, ktraynor, Geary, John

Hi Liang,


I think section 21.4 "Running the Application" needs to mention the empty
poll feature. Maybe just add a mention:

*   --empty-poll: Traffic Aware power management. See below for details.



On 25/9/2018 2:20 PM, Kovacevic, Marko wrote:
>> Add empty poll mode command line example
>>
>> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>> ---
>>   doc/guides/sample_app_ug/l3_forward_power_man.rst | 29
>> +++++++++++++++++++++++
>>   1 file changed, 29 insertions(+)
>>
>> +Empty Poll Mode
>> +-------------------------
>> +There is a new Mode which is added recently. Empty poll mode can be
>> +enabled by command option --empty-poll.

In a couple of years time, it won't be 'new' or 'recent' any more. :)

Suggest:

Additionally, there is a traffic aware mode of operation called "Empty
Poll" where the number of empty polls can be monitored to keep track
of how busy the application is. Empty poll mode can be enabled by the
command line option --empty-poll.

>> +
>> +See "Power Management" chapter in the DPDK Programmer's Guide for
>> empty poll mode details.
> Can you embed the link to the Power Management chapter
> :doc:`Power Management<../prog_guide/power_man>`
>
>
>> +.. code-block:: console
>> +
>> +    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --
>> config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
>> +
>> +Where,
>> +
>> +--empty-poll: Enable the empty poll mode instead of original algorithm
>> +
>> +--empty-poll="training_flag, med_threshold, high_threshold"
>> +
>> +* training_flag : optional, enable/disable training mode. Default value is 0.
>> +
>> +* med_threshold : optional, indicate the empty poll threshold of modest
>> state which is customized by user. Default value is 0.
>> +
>> +* high_threshold : optional, indicate the empty poll threshold of busy state
>> which is customized by user. Default value is 0.
>> +
>> +* -l : optional, set up the LOW power state frequency index
>> +
>> +* -m : optional, set up the MED power state frequency index
>> +
>> +* -h : optional, set up the HIGH power state frequency index
> I think in this over all section needs a lot more explanation like what are valid training flags and how to get thresholds ect.
>
> Also you could highlight the commands it looks better:      ``training_flag``

I'd suggest adding an "Empty Poll Mode Example Usage" section as a
sub-section to the "Empty Poll Mode" section, which could contain
something along the lines of the following:

Empty Poll Mode Example Usage

To initially obtain the ideal thresholds for the system, the training
mode should be run first. This is achieved by running the l3fwd-power
app with the training flag set to "1", and the other parameters set to 0.

./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f
--config="(0,0,2),(0,1,3)" --empty-poll "1,0,0" -P

This will run the training algorithm for x seconds on each core (cores 2 
and 3), and then print out the recommended threshold values for those 
cores. The thresholds should be very similar for each core.

POWER: Bring up the Timer
POWER: set the power freq to MED
POWER: Low threshold is 230277
POWER: MED threshold is 335071
POWER: HIGH threshold is 523769
POWER: Training is Complete for 2
POWER: set the power freq to MED
POWER: Low threshold is 236814
POWER: MED threshold is 344567
POWER: HIGH threshold is 538580
POWER: Training is Complete for 3

Once the values have been measured for a particular system, the app can 
then be started without the training mode so traffic can start immediately.

./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f
--config="(0,0,2),(0,1,3)" --empty-poll "0,340000,540000" -P
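
For readers wondering how the three comma-separated fields might be
validated, here is a minimal sketch. The app's real parser is
parse_ep_config(), whose body is not shown in this thread, so the
function and struct below are hypothetical stand-ins and may not match
its behaviour.

```c
#include <assert.h>
#include <stdio.h>
#include <stdint.h>

/* Hypothetical holder for the three --empty-poll fields. */
struct ep_cfg {
	int train;          /* 1 = run training first, 0 = use thresholds */
	uint32_t med_edpi;  /* MED threshold, e.g. as printed by training */
	uint32_t hgh_edpi;  /* HIGH threshold, e.g. as printed by training */
};

/* Parse "train,med,high"; returns 0 on success, -1 on malformed input. */
static int
parse_ep_triple(const char *arg, struct ep_cfg *cfg)
{
	unsigned int train, med, hgh;
	char extra;

	assert(arg && cfg);
	/* The trailing %c only converts if junk follows the third number. */
	if (sscanf(arg, "%u,%u,%u%c", &train, &med, &hgh, &extra) != 3)
		return -1;
	if (train > 1)
		return -1;
	cfg->train = (int)train;
	cfg->med_edpi = med;
	cfg->hgh_edpi = hgh;
	return 0;
}
```

Rejecting trailing characters and out-of-range training flags up front
would let the app print its usage text instead of starting with
thresholds the user did not intend.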

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update
  2018-09-28 12:43                       ` Hunt, David
@ 2018-09-28 12:52                         ` Liang, Ma
  0 siblings, 0 replies; 79+ messages in thread
From: Liang, Ma @ 2018-09-28 12:52 UTC (permalink / raw)
  To: Hunt, David; +Cc: Kovacevic, Marko, dev, Yao, Lei A, ktraynor, Geary, John

Hi Dave, 
   thanks for your feedback. I will update the document in v9.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v9 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
                                     ` (3 preceding siblings ...)
  2018-09-28 10:47                   ` [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control Hunt, David
@ 2018-09-28 14:58                   ` Liang Ma
  2018-09-28 14:58                     ` [dpdk-dev] [PATCH v9 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
                                       ` (2 more replies)
  2018-10-01 10:06                   ` [dpdk-dev] [PATCH v9 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
  5 siblings, 3 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-28 14:58 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. Accurately determining how busy a core really is
matters, because without that information:

   * There is no indication of overload conditions.

   * The user does not know how much real load is on a system, resulting
     in wasted energy as no power management is utilized.

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core
frequency. As a result, the application does not stop polling the device,
which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the
core frequency (including turbo) to do best effort for intensive traffic.
This gives us more flexible and balanced traffic awareness over the
standard l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many times empty polls are executed.
The fewer empty polls there are, the busier the current core is with
processing workload, and therefore the higher the frequency that is needed.
A high empty poll count indicates that the current core is not doing any
real work, so we can lower the frequency to save power.

In the current implementation, each core has 1 empty-poll counter, which
assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
future to support multiple queues per core.

2.1 Power state definition:

	LOW:  Not currently used, reserved for future use.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a. Initialization/Training phase. The training phase is necessary
	   in order to figure out the system polling baseline numbers from
	   idle to busy. The highest poll count will be during idle, where
	   all polls are empty. These poll counts will differ between
	   systems due to the many possible processor micro-architectures,
	   cache and device configurations, hence the training phase.
	   In the training phase, traffic is blocked so the training
	   algorithm can average the empty-poll numbers for the LOW, MED and
	   HIGH power states in order to create a baseline.
	   Each core's counters are collected every 10ms, and the training
	   phase will take 2 seconds.
	   Training is disabled by default, and the default parameters are
	   applied. The sample app can still trigger training if needed.
	   Once the training phase has been executed once on a system, the
	   application can then be started with the relevant thresholds
	   provided on the command line, allowing the application to start
	   passing traffic immediately.

	b. Normal phase. Traffic starts immediately based on the default
	   thresholds, or based on the user supplied thresholds via the
	   command line parameters. The run-time poll counts are compared with
	   the baseline and the decision will be taken to move to MED power
	   state or HIGH power state. The counters are calculated every 10ms.
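
The normal-phase decision can be sketched roughly as follows. This is a
simplified stand-alone model, not the library code: the names are
hypothetical, the percentage uses integer arithmetic, and the 70%/30%
thresholds and the 100-interval hold time are assumed from the default
macros in rte_power_empty_poll.c.

```c
#include <assert.h>
#include <stdint.h>

#define EP_INTERVALS_PER_SECOND 100  /* one sample every 10ms */
#define EP_MED_TO_HIGH_PERCENT  70
#define EP_HIGH_TO_MED_PERCENT  30

enum ep_state { EP_MED, EP_HIGH };   /* hypothetical names */

struct ep_core {
	enum ep_state state;
	uint32_t over_ctr;  /* consecutive intervals past the threshold */
};

/* Busyness in percent: 0 when every poll is empty, 100 when none are. */
static uint32_t
ep_busy_percent(uint64_t avg_empty, uint64_t base_empty)
{
	assert(base_empty > 0 && avg_empty <= base_empty);
	return (uint32_t)(100 - (avg_empty * 100) / base_empty);
}

/* Called once per interval; returns the state after this sample. */
static enum ep_state
ep_step(struct ep_core *c, uint32_t busy)
{
	int crossing = (c->state == EP_MED) ?
		(busy > EP_MED_TO_HIGH_PERCENT) :
		(busy < EP_HIGH_TO_MED_PERCENT);

	if (!crossing) {
		c->over_ctr = 0;       /* streak broken, start again */
	} else if (c->over_ctr < EP_INTERVALS_PER_SECOND) {
		c->over_ctr++;         /* still waiting out the hold time */
	} else {
		c->state = (c->state == EP_MED) ? EP_HIGH : EP_MED;
		c->over_ctr = 0;
	}
	return c->state;
}
```

Requiring roughly one second of consecutive samples past a threshold,
with separate up and down thresholds, keeps the frequency from flapping
on short bursts.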

3. Proposed API

1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
		uint8_t *freq_tlb, struct ep_policy *policy);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
which is used to detect empty poll state changes and then take action.
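
To show where calls 3 and 4 would sit in a polling loop, here is a
stand-alone model of the per-core accounting. The struct and function
are hypothetical stand-ins so the sketch compiles without DPDK; the
comments name the real API each branch is assumed to correspond to.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in counters for one lcore (hypothetical, not the library's). */
struct poll_counters {
	uint64_t empty_polls;  /* polls that returned no packets */
	uint64_t packets;      /* packets returned by non-empty polls */
};

/* Record the result of one RX burst on the polling core. */
static void
account_poll(struct poll_counters *c, uint16_t nb_rx)
{
	assert(c);
	if (nb_rx == 0)
		c->empty_polls++;     /* cf. rte_power_empty_poll_stat_update() */
	else
		c->packets += nb_rx;  /* cf. rte_power_poll_stat_update() */
}
```

Because the updates are not thread safe, each counter set is meant to be
written only by the core that owns it, with the timer callback reading
the values periodically.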

ChangeLog:
v2: fix some coding style issues.
v3: rename the filename, API name.
v4: no change.
v5: no change.
v6: re-work the code layout, update API.
v7: fix minor typo and lift node num limit.
v8: disable training as default option.
v9: minor git log update.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 lib/librte_power/Makefile               |   6 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 539 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 219 +++++++++++++
 lib/librte_power/rte_power_version.map  |  13 +
 5 files changed, 778 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..a8f1301 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_power.a
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +17,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..587ce78
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,539 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR 2
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	RTE_LOG(INFO, POWER, "set the power freq to MED\n");
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	/* Try here */
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+static inline void __attribute__((always_inline))
+set_policy(struct priority_worker *poll_stats,
+		struct ep_policy *policy)
+{
+	set_state(poll_stats, policy->state);
+
+	if (policy->state == TRAINING)
+		return;
+
+	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
+	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
+
+	poll_stats->thresh[MED_NORMAL].trained = true;
+	poll_stats->thresh[HGH_BUSY].trained = true;
+
+}
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%*/
+				/* this should remove any */
+				/* false negatives when the system is 0% busy */
+				poll_stats->thresh[freq].base_edpi +=
+					poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
+				"cur edpi %lu "
+				"base edpi %lu\n",
+				cur_edpi,
+				s->thresh[s->cur_freq].base_edpi);
+		/* Out-of-range value so that the caller rejects this sample */
+		return 1000UL;
+	}
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)(((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
+
+		RTE_LOG(DEBUG, POWER, "cur freq is %d, edpi is %lu\n",
+				poll_stats->cur_freq,
+				poll_stats->thresh[poll_stats->cur_freq].base_edpi);
+		return;
+	}
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100) {
+		RTE_LOG(DEBUG, POWER, "Percent larger than 100, abnormal\n");
+		return;
+	}
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[MED].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, HGH_BUSY);
+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
+			}
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+				poll_stats->thresh[HGH].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, MED_NORMAL);
+				RTE_LOG(INFO, POWER, "MOVE to MED\n");
+			}
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	}
+}
+
+static int
+empty_poll_training(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "Low threshold is %lu\n",
+				poll_stats->thresh[LOW].base_edpi);
+
+		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
+				poll_stats->thresh[MED].base_edpi);
+
+
+		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
+				poll_stats->thresh[HGH].base_edpi);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+void
+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_training(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	if (freq_tlb == NULL) {
+		freq_index[LOW] = 14;
+		freq_index[MED] = 9;
+		freq_index[HGH] = 1;
+	} else {
+		freq_index[LOW] = freq_tlb[LOW];
+		freq_index[MED] = freq_tlb[MED];
+		freq_index[HGH] = freq_tlb[HGH];
+	}
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* 5 seconds worth of training */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	*eptr = ep_params;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
+				total_avail_freqs[i],
+				i);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+		if (rte_get_master_lcore() != i) {
+			w->wrk_stats[i].lcore_id = i;
+			set_policy(&w->wrk_stats[i], policy);
+		}
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..ae27f7d
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,219 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS 20
+
+#define BINS_AV 4 /* Has to be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         256  /* Max core number*/
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Each Worker Thread Empty Poll Stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	uint64_t edpi_av[BINS_AV];
+	uint32_t  ec;
+	uint64_t ppi_av[BINS_AV];
+	uint32_t  pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/* Sample App Init information */
+struct ep_policy {
+
+	uint64_t med_base_edpi;
+	uint64_t hgh_base_edpi;
+
+	enum queue_state state;
+};
+
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param eptr
+ *   the structure of empty poll configuration
+ * @param freq_tlb
+ *   the power state/frequency mapping table
+ * @param policy
+ *   the initialization policy from sample app
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The number of packets received in one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Empty poll state change detection function
+ *
+ * @param  tim
+ *  The timer structure
+ * @param  arg
+ *  The customized parameter
+ */
+void  __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index dd587df..11ffdfb 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -33,3 +33,16 @@ DPDK_18.08 {
 	rte_power_get_capabilities;
 
 } DPDK_17.11;
+
+EXPERIMENTAL {
+        global:
+
+        rte_power_empty_poll_stat_init;
+        rte_power_empty_poll_stat_free;
+        rte_power_empty_poll_stat_update;
+        rte_power_empty_poll_stat_fetch;
+        rte_power_poll_stat_fetch;
+        rte_power_poll_stat_update;
+        rte_empty_poll_detection;
+
+};
-- 
2.7.5


* [dpdk-dev] [PATCH v9 2/4] examples/l3fwd-power: simple app update for new API
  2018-09-28 14:58                   ` [dpdk-dev] [PATCH v9 " Liang Ma
@ 2018-09-28 14:58                     ` Liang Ma
  2018-09-28 14:58                     ` [dpdk-dev] [PATCH v9 3/4] doc/guides/pro_guide/power-man: update the power API Liang Ma
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  2 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-28 14:58 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for full parameter usage.

The options "l", "m" and "h" set the power index for the
LOW, MED and HIGH power states. They only take effect when empty-poll is enabled.

--empty-poll="training_flag, med_threshold, high_threshold"

The option training_flag is used to enable/disable training mode.

The option med_threshold sets the user-defined empty poll threshold
of the modest state.

The option high_threshold sets the user-defined empty poll threshold
of the busy state.

The default value of all three options is 0.
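
As a rough illustration of the option format above, the three fields could
be parsed as in the sketch below. This is not the parser from the patch
(which uses rte_strsplit); the struct and function names are hypothetical,
and treating 0 as "keep the default" is an assumption based on the
description above.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical container for the three --empty-poll fields. */
struct sketch_ep_opts {
	bool training;
	unsigned long med_threshold;
	unsigned long hgh_threshold;
};

/*
 * Parse "training_flag,med_threshold,high_threshold".
 * A threshold of 0 keeps the caller-supplied default (assumption).
 * Returns 0 on success, -1 on a malformed string.
 */
static int
sketch_parse_empty_poll(const char *arg, struct sketch_ep_opts *out,
		unsigned long med_default, unsigned long hgh_default)
{
	unsigned int flag;
	unsigned long med, hgh;
	char trailing;

	/* Exactly three comma-separated numbers; trailing text is an error. */
	if (sscanf(arg, "%u ,%lu ,%lu %c", &flag, &med, &hgh, &trailing) != 3)
		return -1;

	out->training = (flag == 1);
	out->med_threshold = med > 0 ? med : med_default;
	out->hgh_threshold = hgh > 0 ? hgh : hgh_default;
	return 0;
}
```

With the defaults used by the example app (350000 and 580000), the string
"0,0,0" would leave training off and keep both default thresholds.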

Once empty-poll is enabled, the system applies the default parameters.
Training mode is disabled by default.

If training mode is triggered, no traffic should pass through the
system during the training phase.
When the training phase completes, the system moves to the normal phase.
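
The threshold produced by training can be pictured with the sketch below.
It mirrors the arithmetic in update_training_stats() in the library patch
(average empty polls per interval, plus a 0.05% margin), but the helper name
and its standalone form are hypothetical, and the zero-iteration guard is an
addition for safety.

```c
#include <stdint.h>

/*
 * Hypothetical helper: derive the empty-poll baseline for one frequency
 * from the total number of empty polls accumulated while the core was idle.
 */
static uint64_t
sketch_train_base_edpi(uint64_t empty_polls_total, uint32_t train_iters)
{
	uint64_t base;

	if (train_iters == 0)
		return 0;

	/* Average empty polls per timer interval on an idle core. */
	base = empty_polls_total / train_iters;

	/* Add 0.05% headroom so an idle core does not exceed its own baseline. */
	base += base / 2000;

	return base;
}
```

For example, 2000000 empty polls over 500 intervals give an average of 4000
and a baseline of 4002 once the margin is added.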

The system runs in the modest power state at the beginning.
If the system busyness percentage rises above 70%, the system moves
to the High power state. If the traffic drops (e.g. the
system busyness percentage falls below 30%), the system falls back
to the modest power state.
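
A minimal sketch of the transition rule described above follows. It is
modelled on update_stats() and update_stats_normal() in the library patch,
with simplifications: integer arithmetic instead of float, an over-range
sample reported as 0% busy rather than rejected, and the counter reset on a
state change assumed. All sketch_ and SKETCH_ names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_INTERVALS_PER_SECOND 100 /* 10 ms timer, as in the example app */
#define SKETCH_MED_TO_HGH_PERCENT    70 /* thresholds quoted above */
#define SKETCH_HGH_TO_MED_PERCENT    30

enum sketch_state { SKETCH_MED, SKETCH_HGH };

struct sketch_core {
	enum sketch_state state;
	uint32_t threshold_ctr; /* consecutive intervals past the threshold */
};

/* Busyness in percent: 0 when every poll is empty, 100 when none is. */
static uint32_t
sketch_busyness(uint64_t avg_empty_polls, uint64_t base_edpi)
{
	if (base_edpi == 0 || avg_empty_polls >= base_edpi)
		return 0; /* simplification: treat over-range samples as idle */

	return 100 - (uint32_t)((avg_empty_polls * 100) / base_edpi);
}

/* Called once per timer interval with the current busyness. */
static void
sketch_step(struct sketch_core *c, uint32_t percent)
{
	int past_threshold = (c->state == SKETCH_MED) ?
		(percent > SKETCH_MED_TO_HGH_PERCENT) :
		(percent < SKETCH_HGH_TO_MED_PERCENT);

	if (!past_threshold) {
		c->threshold_ctr = 0; /* condition broken, start over */
		return;
	}

	if (c->threshold_ctr < SKETCH_INTERVALS_PER_SECOND) {
		c->threshold_ctr++;
		return;
	}

	/* Condition held for about one second: change power state. */
	c->state = (c->state == SKETCH_MED) ? SKETCH_HGH : SKETCH_MED;
	c->threshold_ctr = 0;
}
```

Requiring the condition to hold for a full second of intervals keeps a short
burst or lull from flipping the frequency back and forth.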

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.
v7 no change.
v8 disable training as default option.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 examples/l3fwd-power/Makefile    |   3 +
 examples/l3fwd-power/main.c      | 325 +++++++++++++++++++++++++++++++++++++--
 examples/l3fwd-power/meson.build |   1 +
 3 files changed, 312 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
index d7e39a3..772ec7b 100644
--- a/examples/l3fwd-power/Makefile
+++ b/examples/l3fwd-power/Makefile
@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
 LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
 LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
 	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
 
@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
 all:
 else
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 68527d2..1465608 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_power_empty_poll.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -55,6 +56,8 @@
 
 /* 100 ms interval */
 #define TIMER_NUMBER_PER_SECOND           10
+/* (10ms) */
+#define INTERVALS_PER_SECOND             100
 /* 100000 us */
 #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
 #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
@@ -117,6 +120,11 @@
  */
 #define RTE_TEST_RX_DESC_DEFAULT 1024
 #define RTE_TEST_TX_DESC_DEFAULT 1024
+#define EMPTY_POLL_MED_THRESHOLD 350000UL
+#define EMPTY_POLL_HGH_THRESHOLD 580000UL
+
+
+
 static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
 static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
 
@@ -132,6 +140,14 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+static bool empty_poll_train;
+volatile bool empty_poll_stop;
+static struct  ep_params *ep_params;
+static struct  ep_policy policy;
+static long  ep_med_edpi, ep_hgh_edpi;
+
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -330,6 +346,13 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+static uint8_t  freq_tlb[] = {14, 9, 1};
+
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
+
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -338,7 +361,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -351,16 +382,19 @@ signal_exit_now(int sigtype)
 							"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -825,7 +859,107 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET], void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
 
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 /* main processing loop */
 static int
 main_loop(__attribute__((unused)) void *dummy)
@@ -1127,7 +1261,8 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection\n",
 		prgname);
 }
 
@@ -1220,7 +1355,55 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+static int
+parse_ep_config(const char *q_arg)
+{
+	char s[256];
+	const char *p = q_arg;
+	char *end;
+	int  num_arg;
+
+	char *str_fld[3];
+
+	int training_flag;
+	int med_edpi;
+	int hgh_edpi;
+
+	ep_med_edpi = EMPTY_POLL_MED_THRESHOLD;
+	ep_hgh_edpi = EMPTY_POLL_HGH_THRESHOLD;
+
+	snprintf(s, sizeof(s), "%s", p);
+
+	num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ',');
+
+	empty_poll_train = false;
+
+	if (num_arg == 0)
+		return 0;
 
+	if (num_arg == 3) {
+
+		training_flag = strtoul(str_fld[0], &end, 0);
+		med_edpi = strtoul(str_fld[1], &end, 0);
+		hgh_edpi = strtoul(str_fld[2], &end, 0);
+
+		if (training_flag == 1)
+			empty_poll_train = true;
+
+		if (med_edpi > 0)
+			ep_med_edpi = med_edpi;
+
+		if (hgh_edpi > 0)
+			ep_hgh_edpi = hgh_edpi;
+
+	} else {
+
+		return -1;
+	}
+
+	return 0;
+
+}
 #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
 
 /* Parse the argument given in the command line of the application */
@@ -1230,6 +1413,7 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
@@ -1237,13 +1421,14 @@ parse_args(int argc, char **argv)
 		{"high-perf-cores", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
+		{"empty-poll", 1, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
 	argvopt = argv;
 
-	while ((opt = getopt_long(argc, argvopt, "p:P",
+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
 				lgopts, &option_index)) != EOF) {
 
 		switch (opt) {
@@ -1260,7 +1445,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[LOW] = limit;
+			break;
+		case 'm':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[MED] = limit;
+			break;
+		case 'h':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[HGH] = limit;
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1299,6 +1495,20 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+				ret = parse_ep_config(optarg);
+
+				if (ret) {
+					printf("invalid empty poll config\n");
+					print_usage(prgname);
+					return -1;
+				}
+
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1646,6 +1856,59 @@ init_power_library(void)
 	}
 	return ret;
 }
+static void
+empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			rte_empty_poll_detection,
+			(void *)ep_ptr);
+
+}
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 
 int
 main(int argc, char **argv)
@@ -1828,13 +2091,15 @@ main(int argc, char **argv)
 		if (rte_lcore_is_enabled(lcore_id) == 0)
 			continue;
 
-		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			/* init timer structures for each enabled lcore */
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND,
+					SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
@@ -1905,12 +2170,38 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true) {
+
+		if (empty_poll_train) {
+			policy.state = TRAINING;
+		} else {
+			policy.state = MED_NORMAL;
+			policy.med_base_edpi = ep_med_edpi;
+			policy.hgh_base_edpi = ep_hgh_edpi;
+		}
+
+		rte_power_empty_poll_stat_init(&ep_params, freq_tlb, &policy);
+	}
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
index 20c8054..a3c5c2f 100644
--- a/examples/l3fwd-power/meson.build
+++ b/examples/l3fwd-power/meson.build
@@ -9,6 +9,7 @@
 if host_machine.system() != 'linux'
 	build = false
 endif
+allow_experimental_apis = true
 deps += ['power', 'timer', 'lpm', 'hash']
 sources = files(
 	'main.c', 'perf_core.c'
-- 
2.7.5


* [dpdk-dev] [PATCH v9 3/4] doc/guides/pro_guide/power-man: update the power API
  2018-09-28 14:58                   ` [dpdk-dev] [PATCH v9 " Liang Ma
  2018-09-28 14:58                     ` [dpdk-dev] [PATCH v9 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-09-28 14:58                     ` Liang Ma
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  2 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-09-28 14:58 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Update the document for empty poll API.

Change Logs:
v9: minor changes for syntax. Update document.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 86 +++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index eba1cc6..68b7e8b 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -106,6 +106,92 @@ User Cases
 
 The power management mechanism is used to save power when performing L3 forwarding.
 
+
+Empty Poll API
+--------------
+
+Abstract
+~~~~~~~~
+
+For packet processing workloads such as DPDK, polling is continuous.
+This means CPU cores always show 100% busy, independent of how much work
+those cores are doing. Accurately determining how busy a core is matters
+because, without that information:
+
+        * There is no indication of overload conditions
+        * The user does not know how much real load is on the system, resulting
+          in wasted energy as no power management is utilized
+
+Compared to the original l3fwd-power design, instead of going to sleep
+after detecting an empty poll, the new mechanism just lowers the core frequency.
+As a result, the application does not stop polling the device, which leads
+to improved handling of bursts of traffic.
+
+When the system becomes busy, the empty poll mechanism can also increase the core
+frequency (including turbo) to make a best effort under intensive traffic. This
+gives us more flexible and balanced traffic awareness than the standard
+l3fwd-power application.
+
+
+Proposed Solution
+~~~~~~~~~~~~~~~~~
+The proposed solution focuses on how many times empty polls are executed.
+A low number of empty polls means the current core is busy processing its
+workload, so a higher frequency is needed. A high number of empty polls
+indicates the current core is not doing any real work, so we can lower the
+frequency to save power.
+
+In the current implementation, each core has 1 empty-poll counter, which assumes
+1 core is dedicated to 1 queue. This will need to be expanded in the future to
+support multiple queues per core.
+
+Power state definition:
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* LOW:  Not currently used, reserved for future use.
+
+* MED:  the frequency is used to process modest traffic workload.
+
+* HIGH: the frequency is used to process busy traffic workload.
+
+There are two phases to establish the power management system:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* Training phase. This phase is used to measure the optimal frequency
+  change thresholds for a given system. The thresholds will differ from
+  system to system due to differences in processor micro-architecture,
+  cache and device configurations.
+  In this phase, the user must ensure that no traffic can enter the
+  system so that counts can be measured for empty polls at low, medium
+  and high frequencies. Each frequency is measured for two seconds.
+  Once the training phase is complete, the threshold numbers are
+  displayed, normal mode resumes, and traffic can be allowed into
+  the system. These threshold numbers can be used on the command line
+  when starting the application in normal mode to avoid re-training
+  every time.
+
+* Normal phase. Every 10ms the run-time counters are compared
+  to the supplied threshold values, and the decision will be made
+  whether to move to a different power state (by adjusting the
+  frequency).
+
+API Overview for Empty Poll Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **State Init**: initialize the power management system.
+
+* **State Free**: free the resources held by the power management system.
+
+* **Update Empty Poll Counter**: update the empty poll counter.
+
+* **Update Valid Poll Counter**: update the valid poll counter.
+
+* **Set the Frequency Index**: update the power state/frequency mapping.
+
+* **Detect empty poll state change**: run the empty poll state change detection algorithm and take action.
+
+User Cases
+----------
+The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
+
 References
 ----------
 
-- 
2.7.5


* [dpdk-dev] [PATCH v9 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update
  2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
                                     ` (4 preceding siblings ...)
  2018-09-28 14:58                   ` [dpdk-dev] [PATCH v9 " Liang Ma
@ 2018-10-01 10:06                   ` Liang Ma
  5 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-01 10:06 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Add empty poll mode command line example

ChangeLogs:
v9: update the document

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/sample_app_ug/l3_forward_power_man.rst | 69 +++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 795a570..e44a11b 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -105,6 +105,8 @@ where,
 
 *   --no-numa: optional, disables numa awareness
 
+*   --empty-poll: Traffic Aware power management. See below for details
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -362,3 +364,70 @@ The algorithm has the following sleeping behavior depending on the idle counter:
 If a thread polls multiple Rx queues and different queue returns different sleep duration values,
 the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
 in order to avoid a potential performance impact.
+
+Empty Poll Mode
+-------------------------
+Additionally, there is a traffic aware mode of operation called "Empty
+Poll" where the number of empty polls can be monitored to keep track
+of how busy the application is. Empty poll mode can be enabled by the
+command line option --empty-poll.
+
+See the :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK Programmer's Guide for empty poll mode details.
+
+.. code-block:: console
+
+    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
+
+Where,
+
+--empty-poll: Enable the empty poll mode instead of the original algorithm
+
+--empty-poll="training_flag, med_threshold, high_threshold"
+
+* ``training_flag`` : optional, enable/disable training mode. Default value is 0. If the training_flag is set to 1 (true), the application will start in training mode and print out the trained threshold values. If the training_flag is set to 0 (false), the application will start in normal mode and will use either the default thresholds or those supplied on the command line. The trained threshold values are specific to the user's system and may give a better power profile than the default threshold values.
+
+* ``med_threshold`` : optional, sets the empty poll threshold of a modestly busy system state. If this is not supplied, the application will apply the default value of 350000.
+
+* ``high_threshold`` : optional, sets the empty poll threshold of a busy system state. If this is not supplied, the application will apply the default value of 580000.
+
+* -l : optional, set up the LOW power state frequency index
+
+* -m : optional, set up the MED power state frequency index
+
+* -h : optional, set up the HIGH power state frequency index
+
+Empty Poll Mode Example Usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To initially obtain the ideal thresholds for the system, the training
+mode should be run first. This is achieved by running the l3fwd-power
+app with the training flag set to “1”, and the other parameters set to
+0.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "1,0,0" -P
+
+This will run the training algorithm for x seconds on each core (cores 2
+and 3), and then print out the recommended threshold values for those
+cores. The thresholds should be very similar for each core.
+
+.. code-block:: console
+
+        POWER: Bring up the Timer
+        POWER: set the power freq to MED
+        POWER: Low threshold is 230277
+        POWER: MED threshold is 335071
+        POWER: HIGH threshold is 523769
+        POWER: Training is Complete for 2
+        POWER: set the power freq to MED
+        POWER: Low threshold is 236814
+        POWER: MED threshold is 344567
+        POWER: HIGH threshold is 538580
+        POWER: Training is Complete for 3
+
+Once the values have been measured for a particular system, the app can
+then be started without the training mode so traffic can start immediately.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "0,340000,540000" -P
-- 
2.7.5

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-28 10:47                   ` [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control Hunt, David
@ 2018-10-02 10:13                     ` Liang, Ma
  0 siblings, 0 replies; 79+ messages in thread
From: Liang, Ma @ 2018-10-02 10:13 UTC (permalink / raw)
  To: Hunt, David; +Cc: dev, lei.a.yao, ktraynor, john.geary

Hi Dave,
    Please check the comments below.

On 28 Sep 11:47, Hunt, David wrote:
> Hi Liang,
> 
> 
> On 17/9/2018 2:30 PM, Liang Ma wrote:
> >1. Abstract
> >
> >For packet processing workloads such as DPDK, polling is continuous.
> >This means CPU cores always show 100% busy, independent of how much work
> >those cores are doing. It is critical to accurately determine how busy
> >a core is, for the following reasons:
> >
> >    * No indication of overload conditions
> >
> >    * Users do not know how much real load is on a system, resulting in
> >      wasted energy as no power management is utilized
> >
> >Compared to the original l3fwd-power design, instead of going to sleep
> >after detecting an empty poll, the new mechanism just lowers the core
> >frequency. As a result, the application does not stop polling the device,
> >which leads to improved handling of bursts of traffic.
> >
> >When the system becomes busy, the empty poll mechanism can also increase the
> >core frequency (including turbo) to do best effort for intensive traffic.
> >This gives us more flexible and balanced traffic awareness over the
> >standard l3fwd-power application.
> >
> >2. Proposed solution
> >
> >The proposed solution focuses on how many times empty polls are executed.
> >A low number of empty polls means the current core is busy processing
> >its workload, so a higher frequency is needed. A high number of empty
> >polls indicates the current core is not doing any real work, so we can
> >lower the frequency to save power.
> >
> >In the current implementation, each core has 1 empty-poll counter, which
> >assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
> >future to support multiple queues per core.
> >
> >2.1 Power state definition:
> >
> >	LOW:  Not currently used, reserved for future use.
> >
> >	MED:  the frequency is used to process modest traffic workload.
> >
> >	HIGH: the frequency is used to process busy traffic workload.
> >
> >2.2 There are two phases to establish the power management system:
> >
> >	a.Initialization/Training phase. The training phase is necessary
> >	  in order to figure out the system polling baseline numbers from
> >	  idle to busy. The highest poll count will be during idle, where
> >	  all polls are empty. These poll counts will be different between
> >	  systems due to the many possible processor micro-arch, cache
> >	  and device configurations, hence the training phase.
> >   	  In the training phase, traffic is blocked so the training
> >   	  algorithm can average the empty-poll numbers for the LOW, MED and
> >  	  HIGH  power states in order to create a baseline.
> >   	  The core's counter are collected every 10ms, and the Training
> >  	  phase will take 2 seconds.
> >  	  Training is disabled as default configuration. the default
> >  	  parameter is applied. Simple App still can trigger training
> 
> Typo: "Simple" should be "Sample"
> 
> Suggest adding: Once the training phase has been executed once on a 
> system, the application
> can then be started with the relevant thresholds provided on the command 
> line, allowing the
> application to start passing traffic immediately.
agree
> 
> >  	  if that's needed.
> >
> >	b.Normal phase. When the training phase is complete, traffic is
> >   	  started. The run-time poll counts are compared with the
> >	  baseline and the decision will be taken to move to MED power
> >   	  state or HIGH power state. The counters are calculated every 10ms.
> 
> Propose changing the first sentence:  Traffic starts immediately based 
> on the default
> thresholds, or based on the user supplied thresholds via the command 
> line parameters.
>
agree
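
To make the intended normal-phase behaviour concrete, here is a rough
standalone sketch. It is not the patch code: the sketch_ names and the
struct are made up for illustration. Only the 70%/30% thresholds and the
100-interval hold count are taken from the defines in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values mirror the defaults in this patch. */
#define SKETCH_HOLD_INTERVALS   100 /* INTERVALS_PER_SECOND */
#define SKETCH_MED_TO_HIGH_PCT  70
#define SKETCH_HIGH_TO_MED_PCT  30

enum sketch_state { SKETCH_MED, SKETCH_HGH };

struct sketch_core {
	enum sketch_state state;
	uint32_t threshold_ctr; /* consecutive intervals past the threshold */
};

/* Feed one 10ms busyness sample (0..100) and return the resulting state. */
static enum sketch_state
sketch_step(struct sketch_core *c, uint32_t busy_percent)
{
	int past;

	assert(busy_percent <= 100);

	/* MED looks for sustained load, HGH looks for sustained idleness. */
	past = (c->state == SKETCH_MED) ?
		(busy_percent > SKETCH_MED_TO_HIGH_PCT) :
		(busy_percent < SKETCH_HIGH_TO_MED_PCT);

	if (!past) {
		c->threshold_ctr = 0;           /* condition broken: start over */
	} else if (c->threshold_ctr < SKETCH_HOLD_INTERVALS) {
		c->threshold_ctr++;             /* still waiting it out */
	} else {
		c->state = (c->state == SKETCH_MED) ? SKETCH_HGH : SKETCH_MED;
		c->threshold_ctr = 0;
	}
	return c->state;
}
```

The hold counter means a single busy or idle interval never changes the
frequency on its own; the condition has to persist for about one second.
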
> 
> 
> 
> >3. Proposed  API
> >
> >1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
> >		uint8_t *freq_tlb, struct ep_policy *policy);
> >which is used to initialize the power management system.
> >  
> >2.  rte_power_empty_poll_stat_free(void);
> >which is used to free the resources held by the power management system.
> >  
> >3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> >which is used to update specific core empty poll counter, not thread safe
> >  
> >4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> >which is used to update specific core valid poll counter, not thread safe
> >  
> >5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> >which is used to get specific core empty poll counter.
> >  
> >6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> >which is used to get specific core valid poll counter.
> >
> >7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> >which is used to detect empty poll state changes then take action.
> >
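
As an illustration of how an application is expected to drive the two
update calls listed above, the sketch below models the per-lcore
accounting. The counter array and sketch_account_poll() are hypothetical
stand-ins so the example builds without DPDK; they are not part of the
library.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 256 /* same bound as NUM_NODES in this patch */

/* Hypothetical per-lcore counters, written only by the owning lcore. */
struct sketch_counters {
	uint64_t empty_polls; /* polls that returned no packets */
	uint64_t pkts;        /* packets returned by valid polls */
};

static struct sketch_counters sketch_stats[SKETCH_MAX_LCORE];

/* Account for the result of one RX poll made by `lcore_id`. */
static int
sketch_account_poll(unsigned int lcore_id, uint8_t nb_rx)
{
	if (lcore_id >= SKETCH_MAX_LCORE)
		return -1;

	if (nb_rx == 0)
		sketch_stats[lcore_id].empty_polls++;
	else
		sketch_stats[lcore_id].pkts += nb_rx;

	return 0;
}
```

In the sample app the two branches would map to
rte_power_empty_poll_stat_update(lcore_id) and
rte_power_poll_stat_update(lcore_id, nb_rx) respectively. As with those
calls, nothing here is thread safe, so each lcore should only touch its
own slot.
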
> >ChangeLog:
> >v2: fix some coding style issues.
> >v3: rename the filename, API name.
> >v4: no change.
> >v5: no change.
> >v6: re-work the code layout, update API.
> >v7: fix minor typo and lift node num limit.
> >v8: disable training as default option.
> >
> >Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >
> >Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> >---
> >  lib/librte_power/Makefile               |   6 +-
> >  lib/librte_power/meson.build            |   5 +-
> >  lib/librte_power/rte_power_empty_poll.c | 539 ++++++++++++++++++++++++++++++++
> >  lib/librte_power/rte_power_empty_poll.h | 219 +++++++++++++
> >  lib/librte_power/rte_power_version.map  |  13 +
> >  5 files changed, 778 insertions(+), 4 deletions(-)
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.c
> >  create mode 100644 lib/librte_power/rte_power_empty_poll.h
> >
> >diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
> >index 6f85e88..a8f1301 100644
> >--- a/lib/librte_power/Makefile
> >+++ b/lib/librte_power/Makefile
> >@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
> >  # library name
> >  LIB = librte_power.a
> >  
> >+CFLAGS += -DALLOW_EXPERIMENTAL_API
> >  CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
> >-LDLIBS += -lrte_eal
> >+LDLIBS += -lrte_eal -lrte_timer
> >  
> >  EXPORT_MAP := rte_power_version.map
> >  
> >@@ -16,8 +17,9 @@ LIBABIVER := 1
> >  # all source are stored in SRCS-y
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
> >  SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
> >+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
> >  
> >  # install this header file
> >-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
> >+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h rte_power_empty_poll.h
> >  
> >  include $(RTE_SDK)/mk/rte.lib.mk
> >diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
> >index 253173f..63957eb 100644
> >--- a/lib/librte_power/meson.build
> >+++ b/lib/librte_power/meson.build
> >@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
> >  	build = false
> >  endif
> >  sources = files('rte_power.c', 'power_acpi_cpufreq.c',
> >-		'power_kvm_vm.c', 'guest_channel.c')
> >-headers = files('rte_power.h')
> >+		'power_kvm_vm.c', 'guest_channel.c',
> >+		'rte_power_empty_poll.c')
> >+headers = files('rte_power.h','rte_power_empty_poll.h')
> >diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
> >new file mode 100644
> >index 0000000..587ce78
> >--- /dev/null
> >+++ b/lib/librte_power/rte_power_empty_poll.c
> >@@ -0,0 +1,539 @@
> >+/* SPDX-License-Identifier: BSD-3-Clause
> >+ * Copyright(c) 2010-2018 Intel Corporation
> >+ */
> >+
> >+#include <string.h>
> >+
> >+#include <rte_lcore.h>
> >+#include <rte_cycles.h>
> >+#include <rte_atomic.h>
> >+#include <rte_malloc.h>
> >+
> >+#include "rte_power.h"
> >+#include "rte_power_empty_poll.h"
> >+
> >+#define INTERVALS_PER_SECOND 100     /* (10ms) */
> >+#define SECONDS_TO_TRAIN_FOR 2
> >+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
> >+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
> >+#define DEFAULT_CYCLES_PER_PACKET 800
> >+
> >+static struct ep_params *ep_params;
> >+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
> >+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
> >+
> >+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
> >+
> >+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
> >+
> >+static uint32_t freq_index[NUM_FREQ];
> >+
> >+static uint32_t
> >+get_freq_index(enum freq_val index)
> >+{
> >+	return freq_index[index];
> >+}
> >+
> >+
> >+static int
> >+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
> >+{
> >+	int err = 0;
> >+	uint32_t power_freq_index;
> >+	if (!specific_freq)
> >+		power_freq_index = get_freq_index(freq);
> >+	else
> >+		power_freq_index = freq;
> >+
> >+	err = rte_power_set_freq(lcore_id, power_freq_index);
> >+
> >+	return err;
> >+}
> >+
> >+
> >+static inline void __attribute__((always_inline))
> >+exit_training_state(struct priority_worker *poll_stats)
> >+{
> >+	RTE_SET_USED(poll_stats);
> >+}
> >+
> 
> Is this really needed? It does nothing, and is just local to this file.
> 
This is needed for debugging purposes; I prefer to keep it.
> 
> >+static inline void __attribute__((always_inline))
> >+enter_training_state(struct priority_worker *poll_stats)
> >+{
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->cur_freq = LOW;
> >+	poll_stats->queue_state = TRAINING;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_normal_state(struct priority_worker *poll_stats)
> >+{
> >+	/* Clear the averages arrays and strs */
> >+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> >+	poll_stats->ec = 0;
> >+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> >+	poll_stats->pc = 0;
> >+
> >+	poll_stats->cur_freq = MED;
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->threshold_ctr = 0;
> >+	poll_stats->queue_state = MED_NORMAL;
> >+	RTE_LOG(INFO, POWER, "set the poewr freq to MED\n");
> 
> Typo, "poewr" should be "power", also suggest "Set" rather than "set"
>
agree
> 
> >+	set_power_freq(poll_stats->lcore_id, MED, false);
> >+
> >+	/* Try here */
> 
> Not sure about this comment?
will be removed
> 
> >+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
> >+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_busy_state(struct priority_worker *poll_stats)
> >+{
> >+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
> >+	poll_stats->ec = 0;
> >+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
> >+	poll_stats->pc = 0;
> >+
> >+	poll_stats->cur_freq = HGH;
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->threshold_ctr = 0;
> >+	poll_stats->queue_state = HGH_BUSY;
> >+	set_power_freq(poll_stats->lcore_id, HGH, false);
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+enter_purge_state(struct priority_worker *poll_stats)
> >+{
> >+	poll_stats->iter_counter = 0;
> >+	poll_stats->queue_state = LOW_PURGE;
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+set_state(struct priority_worker *poll_stats,
> >+		enum queue_state new_state)
> >+{
> >+	enum queue_state old_state = poll_stats->queue_state;
> >+	if (old_state != new_state) {
> >+
> >+		/* Call any old state exit functions */
> >+		if (old_state == TRAINING)
> >+			exit_training_state(poll_stats);
> 
> Is this needed? exit_training_state() does nothing.
> 
The original code is used for debugging purposes. We can leave it.
> >+		/* Call any new state entry functions */
> >+		if (new_state == TRAINING)
> >+			enter_training_state(poll_stats);
> >+		if (new_state == MED_NORMAL)
> >+			enter_normal_state(poll_stats);
> >+		if (new_state == HGH_BUSY)
> >+			enter_busy_state(poll_stats);
> >+		if (new_state == LOW_PURGE)
> >+			enter_purge_state(poll_stats);
> >+	}
> >+}
> >+
> >+static inline void __attribute__((always_inline))
> >+set_policy(struct priority_worker *poll_stats,
> >+		struct ep_policy *policy)
> >+{
> >+	set_state(poll_stats, policy->state);
> >+
> >+	if (policy->state == TRAINING)
> >+		return;
> >+
> >+	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
> >+	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
> >+
> >+	poll_stats->thresh[MED_NORMAL].trained = true;
> >+	poll_stats->thresh[HGH_BUSY].trained = true;
> >+
> >+}
> >+
> >+static void
> >+update_training_stats(struct priority_worker *poll_stats,
> >+		uint32_t freq,
> >+		bool specific_freq,
> >+		uint32_t max_train_iter)
> >+{
> >+	RTE_SET_USED(specific_freq);
> >+
> >+	char pfi_str[32];
> >+	uint64_t p0_empty_deq;
> >+
> >+	sprintf(pfi_str, "%02d", freq);
> >+
> >+	if (poll_stats->cur_freq == freq &&
> >+			poll_stats->thresh[freq].trained == false) {
> >+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
> >+
> >+			set_power_freq(poll_stats->lcore_id,
> >+					freq, specific_freq);
> >+
> >+			poll_stats->empty_dequeues_prev =
> >+				poll_stats->empty_dequeues;
> >+
> >+			poll_stats->thresh[freq].cur_train_iter++;
> >+
> >+			return;
> >+		} else if (poll_stats->thresh[freq].cur_train_iter
> >+				<= max_train_iter) {
> >+
> >+			p0_empty_deq = poll_stats->empty_dequeues -
> >+				poll_stats->empty_dequeues_prev;
> >+
> >+			poll_stats->empty_dequeues_prev =
> >+				poll_stats->empty_dequeues;
> >+
> >+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
> >+			poll_stats->thresh[freq].cur_train_iter++;
> >+
> >+		} else {
> >+			if (poll_stats->thresh[freq].trained == false) {
> >+				poll_stats->thresh[freq].base_edpi =
> >+					poll_stats->thresh[freq].base_edpi /
> >+					max_train_iter;
> >+
> >+				/* Add on a factor of 0.05%*/
> >+				/* this should remove any */
> >+				/* false negatives when the system is 0% busy */
> 
> Multi line comment should follow the usual standard /* \n * text \n text 
> \n */
> 
agree
> >+				poll_stats->thresh[freq].base_edpi +=
> >+					poll_stats->thresh[freq].base_edpi / 2000;
> >+
> >+				poll_stats->thresh[freq].trained = true;
> >+				poll_stats->cur_freq++;
> >+
> >+			}
> >+		}
> >+	}
> >+}
> >+
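
For clarity, the last step of training (mean of the collected samples
plus the 0.05% pad from the comment above) amounts to the following.
This is a standalone sketch with a hypothetical function name, not a
replacement for the code in the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Turn the empty-poll total accumulated over `iters` training
 * intervals into a baseline: take the mean, then pad it by 0.05%
 * (mean / 2000) so that a fully idle core does not read as slightly
 * busy because of measurement jitter.
 */
static uint64_t
sketch_finalize_baseline(uint64_t edpi_sum, uint32_t iters)
{
	uint64_t base;

	assert(iters > 0);

	base = edpi_sum / iters;
	base += base / 2000;

	return base;
}
```

With 200 intervals of 400000 empty polls each, the mean is 400000 and
the pad is 200, giving a baseline of 400200. For very small means the
integer division makes the pad zero.
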
> >+static inline uint32_t __attribute__((always_inline))
> >+update_stats(struct priority_worker *poll_stats)
> >+{
> >+	uint64_t tot_edpi = 0, tot_ppi = 0;
> >+	uint32_t j, percent;
> >+
> >+	struct priority_worker *s = poll_stats;
> >+
> >+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
> >+
> >+	s->empty_dequeues_prev = s->empty_dequeues;
> >+
> >+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
> >+
> >+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
> >+
> >+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
> >+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
> >+				"cur edpi %ld "
> >+				"base epdi %ld\n",
> >+				cur_edpi,
> >+				s->thresh[s->cur_freq].base_edpi);
> 
> Suggest making this log message more meaningful to the user. I suspect 
> that "cur_edpi" will not mean much to the user.
> What does edpi mean?
>
agree
> >+		/* Value to make us fail need debug log*/
> >+		return 1000UL;
> >+	}
> >+
> >+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
> >+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
> >+
> >+	for (j = 0; j < BINS_AV; j++) {
> >+		tot_edpi += s->edpi_av[j];
> >+		tot_ppi += s->ppi_av[j];
> >+	}
> >+
> >+	tot_edpi = tot_edpi / BINS_AV;
> >+
> >+	percent = 100 - (uint32_t)(((float)tot_edpi /
> >+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
> >+
> >+	return (uint32_t)percent;
> >+}
> >+
> >+
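
Since the percentage returned here came up in review, a small standalone
sketch of the calculation may help. The names are hypothetical, and it
uses integer arithmetic where the patch uses float, so results can
differ by one in rounding edge cases.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BINS 4 /* power of two, like BINS_AV */

struct sketch_avg {
	uint64_t bins[SKETCH_BINS]; /* last few per-interval empty-poll deltas */
	uint32_t next;              /* running write index */
};

/* Record one interval's empty-poll delta and return busyness in percent. */
static uint32_t
sketch_busy_percent(struct sketch_avg *a, uint64_t cur_edpi,
		uint64_t base_edpi)
{
	uint64_t sum = 0;
	uint32_t j;

	/* The patch bails out instead when the delta exceeds the baseline. */
	assert(base_edpi > 0 && cur_edpi <= base_edpi);

	a->bins[a->next++ % SKETCH_BINS] = cur_edpi;
	for (j = 0; j < SKETCH_BINS; j++)
		sum += a->bins[j];

	/* Integer form of 100 - (average / baseline) * 100. */
	return 100 - (uint32_t)((sum / SKETCH_BINS) * 100 / base_edpi);
}
```

So 0 means every poll in the averaging window was empty (idle) and 100
means none were. One thing the sketch makes visible: with the bins
zeroed on entry, the first three samples read busier than the core
really is, until the window has filled.
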
> >+static inline void  __attribute__((always_inline))
> >+update_stats_normal(struct priority_worker *poll_stats)
> >+{
> >+	uint32_t percent;
> >+
> >+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
> >+
> >+		RTE_LOG(DEBUG, POWER, "cure freq is %d, edpi is %lu\n",
> >+				poll_stats->cur_freq,
> >+			 poll_stats->thresh[poll_stats->cur_freq].base_edpi);
> 
> Again, a more meaningful explanation of edpi is needed here.
> 
agree
> >+		return;
> >+	}
> >+
> >+	percent = update_stats(poll_stats);
> >+
> >+	if (percent > 100) {
> >+		RTE_LOG(DEBUG, POWER, "Big than 100 abnormal\n");
> 
> Please change to something meaningful to the user. What is the 
> percentage returned from update_stats()?
> 
agree
> >+		return;
> >+	}
> >+
> >+	if (poll_stats->cur_freq == LOW)
> >+		RTE_LOG(INFO, POWER, "Purge Mode is not supported\n");
> 
> Suggest adding "currently" - "Purge Mode is not currently supported\n"
> 
agree
> >+	else if (poll_stats->cur_freq == MED) {
> >+
> >+		if (percent >
> >+			poll_stats->thresh[MED].threshold_percent) {
> >+
> >+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> >+				poll_stats->threshold_ctr++;
> >+			else {
> >+				set_state(poll_stats, HGH_BUSY);
> >+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
> >+			}
> >+
> >+		} else {
> >+			/* reset */
> >+			poll_stats->threshold_ctr = 0;
> >+		}
> >+
> >+	} else if (poll_stats->cur_freq == HGH) {
> >+
> >+		if (percent <
> >+				poll_stats->thresh[HGH].threshold_percent) {
> >+
> >+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
> >+				poll_stats->threshold_ctr++;
> >+			else {
> >+				set_state(poll_stats, MED_NORMAL);
> >+				RTE_LOG(INFO, POWER, "MOVE to MED\n");
> >+			}
> >+		} else {
> >+			/* reset */
> >+			poll_stats->threshold_ctr = 0;
> >+		}
> >+
> >+	}
> >+}
> >+
> >+static int
> >+empty_poll_training(struct priority_worker *poll_stats,
> >+		uint32_t max_train_iter)
> >+{
> >+
> >+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
> >+		poll_stats->iter_counter++;
> >+		return 0;
> >+	}
> >+
> >+
> >+	update_training_stats(poll_stats,
> >+			LOW,
> >+			false,
> >+			max_train_iter);
> >+
> >+	update_training_stats(poll_stats,
> >+			MED,
> >+			false,
> >+			max_train_iter);
> >+
> >+	update_training_stats(poll_stats,
> >+			HGH,
> >+			false,
> >+			max_train_iter);
> >+
> >+
> >+	if (poll_stats->thresh[LOW].trained == true
> >+			&& poll_stats->thresh[MED].trained == true
> >+			&& poll_stats->thresh[HGH].trained == true) {
> >+
> >+		set_state(poll_stats, MED_NORMAL);
> >+
> >+		RTE_LOG(INFO, POWER, "Low threshold is %lu\n",
> >+				poll_stats->thresh[LOW].base_edpi);
> 
> Suggest "Low" change to "LOW" for consistency with other log messages below.
> 
agree
> >+
> >+		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
> >+				poll_stats->thresh[MED].base_edpi);
> >+
> >+
> >+		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
> >+				poll_stats->thresh[HGH].base_edpi);
> >+
> >+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
> >+				poll_stats->lcore_id);
> >+	}
> >+
> >+	return 0;
> >+}
> >+
> >+void
> >+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
> >+{
> >+
> >+	uint32_t i;
> >+
> >+	struct priority_worker *poll_stats;
> >+
> >+	RTE_SET_USED(tim);
> >+
> >+	RTE_SET_USED(arg);
> >+
> >+	for (i = 0; i < NUM_NODES; i++) {
> >+
> >+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
> >+
> >+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
> >+			continue;
> >+
> >+		switch (poll_stats->queue_state) {
> >+		case(TRAINING):
> >+			empty_poll_training(poll_stats,
> >+					ep_params->max_train_iter);
> >+			break;
> >+
> >+		case(HGH_BUSY):
> >+		case(MED_NORMAL):
> >+			update_stats_normal(poll_stats);
> >+			break;
> >+
> >+		case(LOW_PURGE):
> >+			break;
> >+		default:
> >+			break;
> >+
> >+		}
> >+
> >+	}
> >+
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> >+		struct ep_policy *policy)
> >+{
> >+	uint32_t i;
> >+	/* Allocate the ep_params structure */
> >+	ep_params = rte_zmalloc_socket(NULL,
> >+			sizeof(struct ep_params),
> >+			0,
> >+			rte_socket_id());
> >+
> >+	if (!ep_params)
> >+		rte_panic("Cannot allocate heap memory for ep_params "
> >+				"for socket %d\n", rte_socket_id());
> >+
> >+	if (freq_tlb == NULL) {
> >+		freq_index[LOW] = 14;
> >+		freq_index[MED] = 9;
> >+		freq_index[HGH] = 1;
> >+	} else {
> >+		freq_index[LOW] = freq_tlb[LOW];
> >+		freq_index[MED] = freq_tlb[MED];
> >+		freq_index[HGH] = freq_tlb[HGH];
> >+	}
> >+
> >+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
> >+
> >+	/* 5 seconds worth of training */
> 
> This now looks to be 2 seconds from the #define above. Maybe: /* Train 
> for pre-defined period */
> 
agree
> >+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
> >+
> >+	struct stats_data *w = &ep_params->wrk_data;
> >+
> >+	*eptr = ep_params;
> >+
> >+	/* initialize all wrk_stats state */
> >+	for (i = 0; i < NUM_NODES; i++) {
> >+
> >+		if (rte_lcore_is_enabled(i) == 0)
> >+			continue;
> >+		/*init the freqs table */
> >+		total_avail_freqs[i] = rte_power_freqs(i,
> >+				avail_freqs[i],
> >+				NUM_FREQS);
> >+
> >+		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
> >+				total_avail_freqs[i],
> >+				i);
> >+
> >+		if (get_freq_index(LOW) > total_avail_freqs[i])
> >+			return -1;
> >+
> >+		if (rte_get_master_lcore() != i) {
> >+			w->wrk_stats[i].lcore_id = i;
> >+			set_policy(&w->wrk_stats[i], policy);
> >+		}
> >+	}
> >+
> >+	return 0;
> >+}
> >+
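
The frequency index selection in the init path can be sketched on its
own as below. The function name and the SK_ enum are hypothetical; the
defaults 14, 9 and 1 are the ones in this patch. Note that the range
check in the sketch is an assumption of mine: it checks all three
entries with >=, whereas the patch only compares the LOW index using >.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SK_LOW, SK_MED, SK_HGH, SK_NUM };

/*
 * Fill `out` from `user_tbl`, or from the patch defaults when
 * `user_tbl` is NULL. Returns -1 if any index is out of range for a
 * core exposing `nb_avail` frequencies (assumed stricter check).
 */
static int
sketch_pick_freq_table(const uint8_t *user_tbl, uint32_t nb_avail,
		uint32_t out[SK_NUM])
{
	static const uint8_t defaults[SK_NUM] = { 14, 9, 1 };
	const uint8_t *src = (user_tbl != NULL) ? user_tbl : defaults;
	int i;

	for (i = 0; i < SK_NUM; i++) {
		if (src[i] >= nb_avail)
			return -1;
		out[i] = src[i];
	}

	return 0;
}
```

A core that reports 16 available frequencies accepts the defaults, while
one that reports only 14 rejects them because index 14 does not exist
there.
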
> >+void __rte_experimental
> >+rte_power_empty_poll_stat_free(void)
> >+{
> >+
> >+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
> >+
> >+	if (ep_params != NULL)
> >+		rte_free(ep_params);
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_update(unsigned int lcore_id)
> >+{
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	poll_stats->empty_dequeues++;
> >+
> >+	return 0;
> >+}
> >+
> >+int __rte_experimental
> >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
> >+{
> >+
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	poll_stats->num_dequeue_pkts += nb_pkt;
> >+
> >+	return 0;
> >+}
> >+
> >+
> >+uint64_t __rte_experimental
> >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
> >+{
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	return poll_stats->empty_dequeues;
> >+}
> >+
> >+uint64_t __rte_experimental
> >+rte_power_poll_stat_fetch(unsigned int lcore_id)
> >+{
> >+	struct priority_worker *poll_stats;
> >+
> >+	if (lcore_id >= NUM_NODES)
> >+		return -1;
> >+
> >+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
> >+
> >+	if (poll_stats->lcore_id == 0)
> >+		poll_stats->lcore_id = lcore_id;
> >+
> >+	return poll_stats->num_dequeue_pkts;
> >+}
> >diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
> >new file mode 100644
> >index 0000000..ae27f7d
> >--- /dev/null
> >+++ b/lib/librte_power/rte_power_empty_poll.h
> >@@ -0,0 +1,219 @@
> >+/* SPDX-License-Identifier: BSD-3-Clause
> >+ * Copyright(c) 2010-2018 Intel Corporation
> >+ */
> >+
> >+#ifndef _RTE_EMPTY_POLL_H
> >+#define _RTE_EMPTY_POLL_H
> >+
> >+/**
> >+ * @file
> >+ * RTE Power Management
> >+ */
> >+#include <stdint.h>
> >+#include <stdbool.h>
> >+
> >+#include <rte_common.h>
> >+#include <rte_byteorder.h>
> >+#include <rte_log.h>
> >+#include <rte_string_fns.h>
> >+#include <rte_power.h>
> >+#include <rte_timer.h>
> >+
> >+#ifdef __cplusplus
> >+extern "C" {
> >+#endif
> >+
> >+#define NUM_FREQS 20
> 
> I don't think this is enough. Suggest using RTE_MAX_LCORE_FREQS
> 
agree
> >+
> >+#define BINS_AV 4 /* Has to be ^2 */
> >+
> >+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
> >+
> >+#define NUM_PRIORITIES          2
> >+
> >+#define NUM_NODES         256  /* Max core number*/
> >+
> >+/* Processor Power State */
> >+enum freq_val {
> >+	LOW,
> >+	MED,
> >+	HGH,
> >+	NUM_FREQ = NUM_FREQS
> >+};
> >+
> 
> Why is NUM_FREQ in this enum? 0,1,2,20 (or RTE_MAX_LCORE_FREQS) does not 
> seem right.
> If you are using NUM_FREQ in the code then why not just use NUM_FREQS.
> 
The reason is to leave some spare space in case we need to extend to more stages.
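
A minimal sketch of what I mean, using hypothetical sketch_ names: the
last enumerator is only an array size, so tables indexed by the enum
already have room for more states.

```c
#include <assert.h>
#include <stdint.h>

enum sketch_freq_val {
	SKETCH_LOW,            /* 0 */
	SKETCH_MED,            /* 1 */
	SKETCH_HGH,            /* 2 */
	SKETCH_NUM_FREQ = 20   /* array size, not a power state */
};

/* Power state to frequency index mapping, with spare slots. */
static uint32_t sketch_freq_index[SKETCH_NUM_FREQ];

/* Look up the frequency index mapped to one of the named states. */
static uint32_t
sketch_get_freq_index(enum sketch_freq_val v)
{
	assert(v <= SKETCH_HGH); /* slots 3..19 are spare for now */
	return sketch_freq_index[v];
}
```

Adding a fourth state later would only need a new enumerator before
SKETCH_NUM_FREQ; the arrays sized by it would not change.
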

> >+
> >+/* Queue Polling State */
> >+enum queue_state {
> >+	TRAINING, /* NO TRAFFIC */
> >+	MED_NORMAL,   /* MED */
> >+	HGH_BUSY,     /* HIGH */
> >+	LOW_PURGE,    /* LOW */
> >+};
> >+
> >+/* Queue Stats */
> >+struct freq_threshold {
> >+
> >+	uint64_t base_edpi;
> >+	bool trained;
> >+	uint32_t threshold_percent;
> >+	uint32_t cur_train_iter;
> >+};
> >+
> >+/* Each Worder Thread Empty Poll Stats */
> >+struct priority_worker {
> >+
> >+	/* Current dequeue and throughput counts */
> >+	/* These 2 are written to by the worker threads */
> >+	/* So keep them on their own cache line */
> >+	uint64_t empty_dequeues;
> >+	uint64_t num_dequeue_pkts;
> >+
> >+	enum queue_state queue_state;
> >+
> >+	uint64_t empty_dequeues_prev;
> >+	uint64_t num_dequeue_pkts_prev;
> >+
> >+	/* Used for training only */
> >+	struct freq_threshold thresh[NUM_FREQ];
> >+	enum freq_val cur_freq;
> >+
> >+	/* bucket arrays to calculate the averages */
> >+	uint64_t edpi_av[BINS_AV];
> >+	uint32_t  ec;
> >+	uint64_t ppi_av[BINS_AV];
> >+	uint32_t  pc;
> >+
> >+	uint32_t lcore_id;
> >+	uint32_t iter_counter;
> >+	uint32_t threshold_ctr;
> >+	uint32_t display_ctr;
> >+	uint8_t  dev_id;
> >+
> >+} __rte_cache_aligned;
> >+
> 
> Suggest adding a comment on each of the variables above explaining what 
> the acronym means.
> E.g. edpi, ec, pc, ppi.
> 
agree
> 
> >+
> >+struct stats_data {
> >+
> >+	struct priority_worker wrk_stats[NUM_NODES];
> >+
> >+	/* flag to stop rx threads processing packets until training over */
> >+	bool start_rx;
> >+
> >+};
> >+
> >+/* Empty Poll Parameters */
> >+struct ep_params {
> >+
> >+	/* Timer related stuff */
> >+	uint64_t interval_ticks;
> >+	uint32_t max_train_iter;
> >+
> >+	struct rte_timer timer0;
> >+	struct stats_data wrk_data;
> >+};
> >+
> >+
> >+/* Sample App Init information */
> >+struct ep_policy {
> >+
> >+	uint64_t med_base_edpi;
> >+	uint64_t hgh_base_edpi;
> >+
> >+	enum queue_state state;
> >+};
> >+
> >+
> >+
> >+/**
> >+ * Initialize the power management system.
> >+ *
> >+ * @param eptr
> >+ *   the structure of empty poll configuration
> >+ * @freq_tlb
> >+ *   the power state/frequency  mapping table
> >+ * @policy
> >+ *   the initialization policy from sample app
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
> >+		struct ep_policy *policy);
> >+
> >+/**
> >+ * Free the resources held by the power management system.
> >+ */
> >+void __rte_experimental
> >+rte_power_empty_poll_stat_free(void);
> >+
> >+/**
> >+ * Update specific core empty poll counter
> >+ * It's not thread safe.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_empty_poll_stat_update(unsigned int lcore_id);
> >+
> >+/**
> >+ * Update specific core valid poll counter, not thread safe.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id.
> >+ * @param nb_pkt
> >+ *  The packet number of one valid poll.
> >+ *
> >+ * @return
> >+ *  - 0 on success.
> >+ *  - Negative on error.
> >+ */
> >+int __rte_experimental
> >+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> >+
> >+/**
> >+ * Fetch specific core empty poll counter.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  Current lcore empty poll counter value.
> >+ */
> >+uint64_t __rte_experimental
> >+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> >+
> >+/**
> >+ * Fetch specific core valid poll counter.
> >+ *
> >+ * @param lcore_id
> >+ *  lcore id
> >+ *
> >+ * @return
> >+ *  Current lcore valid poll counter value.
> >+ */
> >+uint64_t __rte_experimental
> >+rte_power_poll_stat_fetch(unsigned int lcore_id);
> >+
> >+/**
> >+ * Empty poll  state change detection function
> >+ *
> >+ * @param  tim
> >+ *  The timer structure
> >+ * @param  arg
> >+ *  The customized parameter
> >+ */
> >+void  __rte_experimental
> >+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> >+
> >+#ifdef __cplusplus
> >+}
> >+#endif
> >+
> >+#endif
> >diff --git a/lib/librte_power/rte_power_version.map 
> >b/lib/librte_power/rte_power_version.map
> >index dd587df..11ffdfb 100644
> >--- a/lib/librte_power/rte_power_version.map
> >+++ b/lib/librte_power/rte_power_version.map
> >@@ -33,3 +33,16 @@ DPDK_18.08 {
> >  	rte_power_get_capabilities;
> >  
> >  } DPDK_17.11;
> >+
> >+EXPERIMENTAL {
> >+        global:
> >+
> >+        rte_power_empty_poll_stat_init;
> >+        rte_power_empty_poll_stat_free;
> >+        rte_power_empty_poll_stat_update;
> >+        rte_power_empty_poll_stat_fetch;
> >+        rte_power_poll_stat_fetch;
> >+        rte_power_poll_stat_update;
> >+        rte_empty_poll_detection;
> >+
> >+};
> 
> checkpatch has several warnings:
> 
> 
> 
> ### lib/librte_power: traffic pattern aware power control
> 
> WARNING:LONG_LINE: line over 80 characters
> #355: FILE: lib/librte_power/rte_power_empty_poll.c:199:
> + poll_stats->thresh[freq].base_edpi / 2000;
> 
> WARNING:LONG_LINE: line over 80 characters
> #417: FILE: lib/librte_power/rte_power_empty_poll.c:261:
> + poll_stats->thresh[poll_stats->cur_freq].base_edpi);
> 
> total: 0 errors, 2 warnings, 802 lines checked
> ERROR: symbol rte_empty_poll_detection is added in a section other than 
> the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_fetch is added in a section 
> other than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_free is added in a section other 
> than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_init is added in a section other 
> than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_empty_poll_stat_update is added in a section 
> other than the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_poll_stat_fetch is added in a section other than 
> the EXPERIMENTAL section of the version map
> ERROR: symbol rte_power_poll_stat_update is added in a section other 
> than the EXPERIMENTAL section of the version map
> Warning in /lib/librte_power/rte_power_empty_poll.c:
> are you sure you want to add the following:
> rte_panic\(
> 
The version map checking script (an awk script that parses the section name) has an issue here.
> 
> Rgds,
> Dave.
> 
> 
> 
> 
> 
> 


* Re: [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API
  2018-09-28 11:19                     ` Hunt, David
@ 2018-10-02 10:18                       ` Liang, Ma
  0 siblings, 0 replies; 79+ messages in thread
From: Liang, Ma @ 2018-10-02 10:18 UTC (permalink / raw)
  To: Hunt, David; +Cc: dev, lei.a.yao, ktraynor, john.geary

On 28 Sep 12:19, Hunt, David wrote:
> Hi Liang,
> 
> A few tweaks below:
> 
> 
> On 17/9/2018 2:30 PM, Liang Ma wrote:
> >Add the support for new traffic pattern aware power control
> >power management API.
> >
> >Example:
> >./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
> >-P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
> >
> >Please Reference l3fwd-power document for all parameter except
> >empty-poll.
> 
> The docs should probably include empty poll parameter. Suggest 
> re-wording to
> 
> Please Reference l3fwd-power document for full parameter usage
> 
agree
> 
> 
> >The option "l", "m", "h" are used to set the power index for
> >LOW, MED, HIGH power state. only is useful after enable empty-poll
> >
> >--empty-poll="training_flag, med_threshold, high_threshold"
> >
> >The option training_flag is used to enable/disable training mode.
> >
> >The option med_threshold is used to indicate the empty poll threshold
> >of modest state which is customized by user.
> >
> >The option high_threshold is used to indicate the empty poll threshold
> >of busy state which is customized by user.
> >
> >Above three option default value is all 0.
> >
> >Once enable empty-poll. System will apply the default parameter.
> >Training mode is disabled as default.
> 
> Suggest:
> 
> Once empty-poll is enabled, the system will apply the default parameters if
> no other command line options are provided.
> 
agree
> 
> 
> >If training mode is triggered, there should not has any traffic
> >pass-through during training phase.
> 
> Suggest:
> If training mode is enabled, the user should ensure that no traffic
> is allowed to pass through the system.
> 
> >When training phase complete, system transfer to normal phase.
> 
> When the training phase completes, the application transfers to normal operation.
> 
> 
agree
> 
> >
> >System will running with modest power stat at beginning.
> 
> System will start running with the modest power mode.
> 

> 
> >If the system busyness percentage above 70%, then system will adjust
> >power state move to High power state. If the traffic become lower(eg. The
> >system busyness percentage drop below 30%), system will fallback
> >to the modest power state.
> 
> If the traffic goes above 70%, then system will move to High power state.
> If the traffic drops below 30%, the system will fallback to the modest
> power state.
> 
> 
> >Example code use master thread to monitoring worker thread busyness.
> >the default timer resolution is 10ms.
> >
> >ChangeLog:
> >v2 fix some coding style issues
> >v3 rename the API.
> >v6 re-work the API.
> >v7 no change.
> >v8 disable training as default option.
> >
> >Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> >
> >Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> >---
> >  examples/l3fwd-power/Makefile    |   3 +
> >  examples/l3fwd-power/main.c      | 325 
> >  +++++++++++++++++++++++++++++++++++++--
> >  examples/l3fwd-power/meson.build |   1 +
> >  3 files changed, 312 insertions(+), 17 deletions(-)
> >
> >diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
> >index d7e39a3..772ec7b 100644
> >--- a/examples/l3fwd-power/Makefile
> >+++ b/examples/l3fwd-power/Makefile
> >@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
> >  LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
> >  LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
> >  
> >+CFLAGS += -DALLOW_EXPERIMENTAL_API
> >+
> >  build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
> >  	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
> >  
> >@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET 
> >environment variable)
> >  all:
> >  else
> >  
> >+CFLAGS += -DALLOW_EXPERIMENTAL_API
> >  CFLAGS += -O3
> >  CFLAGS += $(WERROR_FLAGS)
> >  
> >diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
> >index 68527d2..1465608 100644
> >--- a/examples/l3fwd-power/main.c
> >+++ b/examples/l3fwd-power/main.c
> >@@ -43,6 +43,7 @@
> >  #include <rte_timer.h>
> >  #include <rte_power.h>
> >  #include <rte_spinlock.h>
> >+#include <rte_power_empty_poll.h>
> >  
> >  #include "perf_core.h"
> >  #include "main.h"
> >@@ -55,6 +56,8 @@
> >  
> >  /* 100 ms interval */
> >  #define TIMER_NUMBER_PER_SECOND           10
> >+/* (10ms) */
> >+#define INTERVALS_PER_SECOND             100
> >  /* 100000 us */
> >  #define SCALING_PERIOD                    
> >  (1000000/TIMER_NUMBER_PER_SECOND)
> >  #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
> >@@ -117,6 +120,11 @@
> >   */
> >  #define RTE_TEST_RX_DESC_DEFAULT 1024
> >  #define RTE_TEST_TX_DESC_DEFAULT 1024
> >+#define EMPTY_POLL_MED_THRESHOLD 350000UL
> >+#define EMPTY_POLL_HGH_THRESHOLD 580000UL
> 
> I'd suggest adding some explanation around these two numbers.
> E.g.
> /*
>  * These two thresholds were decided on by running the training algorithm
>  * on a 2.5GHz Xeon. These defaults can be overridden by supplying non-zero
>  * values for the med_threshold and high_threshold parameters on the
>  * command line.
>  */
> 
> 
> >+
> >+
> >+
> >  static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
> >  static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
> >  
> >@@ -132,6 +140,14 @@ static uint32_t enabled_port_mask = 0;
> >  static int promiscuous_on = 0;
> >  /* NUMA is enabled by default. */
> >  static int numa_on = 1;
> >+/* emptypoll is disabled by default. */
> >+static bool empty_poll_on;
> >+static bool empty_poll_train;
> >+volatile bool empty_poll_stop;
> >+static struct  ep_params *ep_params;
> >+static struct  ep_policy policy;
> >+static long  ep_med_edpi, ep_hgh_edpi;
> >+
> >  static int parse_ptype; /**< Parse packet type using rx callback, and */
> >  			/**< disabled by default */
> >  
> >@@ -330,6 +346,13 @@ static inline uint32_t power_idle_heuristic(uint32_t 
> >zero_rx_packet_count);
> >  static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
> >  		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
> >  
> >+static uint8_t  freq_tlb[] = {14, 9, 1};
> >+
> 
> Maybe an explanation on where these numbers came from. E.g.
> /*
>  * These defaults are using the max frequency index (1), a medium index (9)
>  * and a typical low frequency index (14). These can be adjusted to use
>  * different indexes using the relevant command line parameters.
>  */
>
agree
> 
> >+static int is_done(void)
> >+{
> >+	return empty_poll_stop;
> >+}
> >+
> >  /* exit signal handler */
> >  static void
> >  signal_exit_now(int sigtype)
> >@@ -338,7 +361,15 @@ signal_exit_now(int sigtype)
> >  	unsigned int portid;
> >  	int ret;
> >  
> >+	RTE_SET_USED(lcore_id);
> >+	RTE_SET_USED(portid);
> >+	RTE_SET_USED(ret);
> >+
> >  	if (sigtype == SIGINT) {
> >+		if (empty_poll_on)
> >+			empty_poll_stop = true;
> >+
> >+
> >  		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> >  			if (rte_lcore_is_enabled(lcore_id) == 0)
> >  				continue;
> >@@ -351,16 +382,19 @@ signal_exit_now(int sigtype)
> >  							"core%u\n", 
> >  							lcore_id);
> >  		}
> >  
> >-		RTE_ETH_FOREACH_DEV(portid) {
> >-			if ((enabled_port_mask & (1 << portid)) == 0)
> >-				continue;
> >+		if (!empty_poll_on) {
> >+			RTE_ETH_FOREACH_DEV(portid) {
> >+				if ((enabled_port_mask & (1 << portid)) == 0)
> >+					continue;
> >  
> >-			rte_eth_dev_stop(portid);
> >-			rte_eth_dev_close(portid);
> >+				rte_eth_dev_stop(portid);
> >+				rte_eth_dev_close(portid);
> >+			}
> >  		}
> >  	}
> >  
> >-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
> >+	if (!empty_poll_on)
> >+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
> >  }
> >  
> >  /*  Freqency scale down timer callback */
> >@@ -825,7 +859,107 @@ static int event_register(struct lcore_conf *qconf)
> >  
> >  	return 0;
> >  }
> >+/* main processing loop */
> >+static int
> >+main_empty_poll_loop(__attribute__((unused)) void *dummy)
> >+{
> >+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
> >+	unsigned int lcore_id;
> >+	uint64_t prev_tsc, diff_tsc, cur_tsc;
> >+	int i, j, nb_rx;
> >+	uint8_t queueid;
> >+	uint16_t portid;
> >+	struct lcore_conf *qconf;
> >+	struct lcore_rx_queue *rx_queue;
> >+
> >+	const uint64_t drain_tsc =
> >+		(rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S * 
> >BURST_TX_DRAIN_US;
> >+
> >+	prev_tsc = 0;
> >+
> >+	lcore_id = rte_lcore_id();
> >+	qconf = &lcore_conf[lcore_id];
> >+
> >+	if (qconf->n_rx_queue == 0) {
> >+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n", 
> >lcore_id);
> >+		return 0;
> >+	}
> >+
> >+	for (i = 0; i < qconf->n_rx_queue; i++) {
> >+		portid = qconf->rx_queue_list[i].port_id;
> >+		queueid = qconf->rx_queue_list[i].queue_id;
> >+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
> >+				"rxqueueid=%hhu\n", lcore_id, portid, 
> >queueid);
> >+	}
> >+
> >+	while (!is_done()) {
> >+		stats[lcore_id].nb_iteration_looped++;
> >+
> >+		cur_tsc = rte_rdtsc();
> >+		/*
> >+		 * TX burst queue drain
> >+		 */
> >+		diff_tsc = cur_tsc - prev_tsc;
> >+		if (unlikely(diff_tsc > drain_tsc)) {
> >+			for (i = 0; i < qconf->n_tx_port; ++i) {
> >+				portid = qconf->tx_port_id[i];
> >+				rte_eth_tx_buffer_flush(portid,
> >+						qconf->tx_queue_id[portid],
> >+						qconf->tx_buffer[portid]);
> >+			}
> >+			prev_tsc = cur_tsc;
> >+		}
> >+
> >+		/*
> >+		 * Read packet from RX queues
> >+		 */
> >+		for (i = 0; i < qconf->n_rx_queue; ++i) {
> >+			rx_queue = &(qconf->rx_queue_list[i]);
> >+			rx_queue->idle_hint = 0;
> >+			portid = rx_queue->port_id;
> >+			queueid = rx_queue->queue_id;
> >+
> >+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
> >+					MAX_PKT_BURST);
> >+
> >+			stats[lcore_id].nb_rx_processed += nb_rx;
> >+
> >+			if (nb_rx == 0) {
> >+
> >+				rte_power_empty_poll_stat_update(lcore_id);
> >+
> >+				continue;
> >+			} else {
> >+				rte_power_poll_stat_update(lcore_id, nb_rx);
> >+			}
> >+
> >+
> >+			/* Prefetch first packets */
> >+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
> >+				rte_prefetch0(rte_pktmbuf_mtod(
> >+							pkts_burst[j], void 
> >*));
> >+			}
> >+
> >+			/* Prefetch and forward already prefetched packets */
> >+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
> >+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
> >+							j + 
> >PREFETCH_OFFSET], void *));
> >+				l3fwd_simple_forward(pkts_burst[j], portid,
> >+						qconf);
> >+			}
> >  
> >+			/* Forward remaining prefetched packets */
> >+			for (; j < nb_rx; j++) {
> >+				l3fwd_simple_forward(pkts_burst[j], portid,
> >+						qconf);
> >+			}
> >+
> >+		}
> >+
> >+	}
> >+
> >+	return 0;
> >+}
> >  /* main processing loop */
> >  static int
> >  main_loop(__attribute__((unused)) void *dummy)
> >@@ -1127,7 +1261,8 @@ print_usage(const char *prgname)
> >  		"  --no-numa: optional, disable numa awareness\n"
> >  		"  --enable-jumbo: enable jumbo frame"
> >  		" which max packet len is PKTLEN in decimal (64-9600)\n"
> >-		"  --parse-ptype: parse packet type by software\n",
> >+		"  --parse-ptype: parse packet type by software\n"
> >+		"  --empty=poll: enable empty poll detection\n",
> 
> typo: "empty=poll" should be "empty-poll"
> 
> I really think some info on what should be supplied with the empty-poll 
> parameter
> should be mentioned here
> e.g.
> 
> --empty-poll="training_flag, med_threshold, high_threshold"
> 
> 
agree
> 
> >  		prgname);
> >  }
> >  
> >@@ -1220,7 +1355,55 @@ parse_config(const char *q_arg)
> >  
> >  	return 0;
> >  }
> >+static int
> >+parse_ep_config(const char *q_arg)
> >+{
> >+	char s[256];
> >+	const char *p = q_arg;
> >+	char *end;
> >+	int  num_arg;
> >+
> >+	char *str_fld[3];
> >+
> >+	int training_flag;
> >+	int med_edpi;
> >+	int hgh_edpi;
> >+
> >+	ep_med_edpi = EMPTY_POLL_MED_THRESHOLD;
> >+	ep_hgh_edpi = EMPTY_POLL_HGH_THRESHOLD;
> >+
> >+	snprintf(s, sizeof(s), "%s", p);
> >+
> >+	num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ',');
> >+
> >+	empty_poll_train = false;
> >+
> >+	if (num_arg == 0)
> >+		return 0;
> >  
> >+	if (num_arg == 3) {
> >+
> >+		training_flag = strtoul(str_fld[0], &end, 0);
> >+		med_edpi = strtoul(str_fld[1], &end, 0);
> >+		hgh_edpi = strtoul(str_fld[2], &end, 0);
> >+
> >+		if (training_flag == 1)
> >+			empty_poll_train = true;
> >+
> >+		if (med_edpi > 0)
> >+			ep_med_edpi = med_edpi;
> >+
> >+		if (hgh_edpi > 0)
> >+			ep_hgh_edpi = hgh_edpi;
> >+
> >+	} else {
> >+
> >+		return -1;
> >+	}
> >+
> >+	return 0;
> >+
> >+}
> >  #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
> >  
> >  /* Parse the argument given in the command line of the application */
> >@@ -1230,6 +1413,7 @@ parse_args(int argc, char **argv)
> >  	int opt, ret;
> >  	char **argvopt;
> >  	int option_index;
> >+	uint32_t limit;
> >  	char *prgname = argv[0];
> >  	static struct option lgopts[] = {
> >  		{"config", 1, 0, 0},
> >@@ -1237,13 +1421,14 @@ parse_args(int argc, char **argv)
> >  		{"high-perf-cores", 1, 0, 0},
> >  		{"no-numa", 0, 0, 0},
> >  		{"enable-jumbo", 0, 0, 0},
> >+		{"empty-poll", 1, 0, 0},
> >  		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
> >  		{NULL, 0, 0, 0}
> >  	};
> >  
> >  	argvopt = argv;
> >  
> >-	while ((opt = getopt_long(argc, argvopt, "p:P",
> >+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
> >  				lgopts, &option_index)) != EOF) {
> >  
> >  		switch (opt) {
> >@@ -1260,7 +1445,18 @@ parse_args(int argc, char **argv)
> >  			printf("Promiscuous mode selected\n");
> >  			promiscuous_on = 1;
> >  			break;
> >-
> >+		case 'l':
> >+			limit = parse_max_pkt_len(optarg);
> >+			freq_tlb[LOW] = limit;
> >+			break;
> >+		case 'm':
> >+			limit = parse_max_pkt_len(optarg);
> >+			freq_tlb[MED] = limit;
> >+			break;
> >+		case 'h':
> >+			limit = parse_max_pkt_len(optarg);
> >+			freq_tlb[HGH] = limit;
> >+			break;
> >  		/* long options */
> >  		case 0:
> >  			if (!strncmp(lgopts[option_index].name, "config", 
> >  			6)) {
> >@@ -1299,6 +1495,20 @@ parse_args(int argc, char **argv)
> >  			}
> >  
> >  			if (!strncmp(lgopts[option_index].name,
> >+						"empty-poll", 10)) {
> >+				printf("empty-poll is enabled\n");
> >+				empty_poll_on = true;
> >+				ret = parse_ep_config(optarg);
> >+
> >+				if (ret) {
> >+					printf("invalid empty poll 
> >config\n");
> >+					print_usage(prgname);
> >+					return -1;
> >+				}
> >+
> >+			}
> >+
> >+			if (!strncmp(lgopts[option_index].name,
> >  					"enable-jumbo", 12)) {
> >  				struct option lenopts =
> >  					{"max-pkt-len", required_argument, \
> >@@ -1646,6 +1856,59 @@ init_power_library(void)
> >  	}
> >  	return ret;
> >  }
> >+static void
> >+empty_poll_setup_timer(void)
> >+{
> >+	int lcore_id = rte_lcore_id();
> >+	uint64_t hz = rte_get_timer_hz();
> >+
> >+	struct  ep_params *ep_ptr = ep_params;
> >+
> >+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
> >+
> >+	rte_timer_reset_sync(&ep_ptr->timer0,
> >+			ep_ptr->interval_ticks,
> >+			PERIODICAL,
> >+			lcore_id,
> >+			rte_empty_poll_detection,
> >+			(void *)ep_ptr);
> >+
> >+}
> >+static int
> >+launch_timer(unsigned int lcore_id)
> >+{
> >+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
> >+
> >+	RTE_SET_USED(lcore_id);
> >+
> >+
> >+	if (rte_get_master_lcore() != lcore_id) {
> >+		rte_panic("timer on lcore:%d which is not master core:%d\n",
> >+				lcore_id,
> >+				rte_get_master_lcore());
> >+	}
> >+
> >+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
> >+
> >+	empty_poll_setup_timer();
> >+
> >+	cycles_10ms = rte_get_timer_hz() / 100;
> >+
> >+	while (!is_done()) {
> >+		cur_tsc = rte_rdtsc();
> >+		diff_tsc = cur_tsc - prev_tsc;
> >+		if (diff_tsc > cycles_10ms) {
> >+			rte_timer_manage();
> >+			prev_tsc = cur_tsc;
> >+			cycles_10ms = rte_get_timer_hz() / 100;
> >+		}
> >+	}
> >+
> >+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
> >+
> >+	return 0;
> >+}
> >+
> >  
> >  int
> >  main(int argc, char **argv)
> >@@ -1828,13 +2091,15 @@ main(int argc, char **argv)
> >  		if (rte_lcore_is_enabled(lcore_id) == 0)
> >  			continue;
> >  
> >-		/* init timer structures for each enabled lcore */
> >-		rte_timer_init(&power_timers[lcore_id]);
> >-		hz = rte_get_timer_hz();
> >-		rte_timer_reset(&power_timers[lcore_id],
> >-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
> >-						power_timer_cb, NULL);
> >-
> >+		if (empty_poll_on == false) {
> >+			/* init timer structures for each enabled lcore */
> >+			rte_timer_init(&power_timers[lcore_id]);
> >+			hz = rte_get_timer_hz();
> >+			rte_timer_reset(&power_timers[lcore_id],
> >+					hz/TIMER_NUMBER_PER_SECOND,
> >+					SINGLE, lcore_id,
> >+					power_timer_cb, NULL);
> >+		}
> >  		qconf = &lcore_conf[lcore_id];
> >  		printf("\nInitializing rx queues on lcore %u ... ", lcore_id 
> >  		);
> >  		fflush(stdout);
> >@@ -1905,12 +2170,38 @@ main(int argc, char **argv)
> >  
> >  	check_all_ports_link_status(enabled_port_mask);
> >  
> >+	if (empty_poll_on == true) {
> >+
> >+		if (empty_poll_train) {
> >+			policy.state = TRAINING;
> >+		} else {
> >+			policy.state = MED_NORMAL;
> >+			policy.med_base_edpi = ep_med_edpi;
> >+			policy.hgh_base_edpi = ep_hgh_edpi;
> >+		}
> >+
> >+		rte_power_empty_poll_stat_init(&ep_params, freq_tlb, 
> >&policy);
> >+	}
> >+
> >+
> >  	/* launch per-lcore init on every lcore */
> >-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> >+	if (empty_poll_on == false) {
> >+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
> >+	} else {
> >+		empty_poll_stop = false;
> >+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, 
> >SKIP_MASTER);
> >+	}
> >+
> >+	if (empty_poll_on == true)
> >+		launch_timer(rte_lcore_id());
> >+
> >  	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
> >  		if (rte_eal_wait_lcore(lcore_id) < 0)
> >  			return -1;
> >  	}
> >  
> >+	if (empty_poll_on)
> >+		rte_power_empty_poll_stat_free();
> >+
> >  	return 0;
> >  }
> >diff --git a/examples/l3fwd-power/meson.build 
> >b/examples/l3fwd-power/meson.build
> >index 20c8054..a3c5c2f 100644
> >--- a/examples/l3fwd-power/meson.build
> >+++ b/examples/l3fwd-power/meson.build
> >@@ -9,6 +9,7 @@
> >  if host_machine.system() != 'linux'
> >  	build = false
> >  endif
> >+allow_experimental_apis = true
> >  deps += ['power', 'timer', 'lpm', 'hash']
> >  sources = files(
> >  	'main.c', 'perf_core.c'
> 
> 
> Checkpatch throws up some warnings:
> 
> 
> ### examples/l3fwd-power: simple app update for new API
> 
> WARNING:LONG_LINE: line over 80 characters
> #201: FILE: examples/l3fwd-power/main.c:876:
> +               (rte_get_tsc_hz() + US_PER_S - 1) / US_PER_S 
> * BURST_TX_DRAIN_US;
> 
> WARNING:LONG_LINE: line over 80 characters
> #209: FILE: examples/l3fwd-power/main.c:884:
> +               RTE_LOG(INFO, L3FWD_POWER, "lcore %u has 
> nothing to do\n", lcore_id);
> 
> WARNING:LONG_LINE: line over 80 characters
> #271: FILE: examples/l3fwd-power/main.c:946:
> +                                                       j + 
> PREFETCH_OFFSET], void *));
> 
> WARNING:LONG_LINE: line over 80 characters
> #529: FILE: examples/l3fwd-power/main.c:2192:
> +               
> rte_eal_mp_remote_launch(main_empty_poll_loop, NULL, SKIP_MASTER);
> 
> total: 0 errors, 4 warnings, 467 lines checked
> 
> 
> Rgds,
> Dave.
> 
> 


* [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control
  2018-09-28 14:58                   ` [dpdk-dev] [PATCH v9 " Liang Ma
  2018-09-28 14:58                     ` [dpdk-dev] [PATCH v9 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
  2018-09-28 14:58                     ` [dpdk-dev] [PATCH v9 3/4] doc/guides/pro_guide/power-man: update the power API Liang Ma
@ 2018-10-02 13:48                     ` Liang Ma
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
                                         ` (5 more replies)
  2 siblings, 6 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-02 13:48 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy independent of how much work
those cores are doing. It is critical to accurately determine how busy
a core is, because otherwise:

   * No indication of overload conditions.

   * User does not know how much real load is on a system, resulting
     in wasted energy as no power management is utilized.

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core
frequency. As a result, the application does not stop polling the device,
which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the
core frequency (including turbo) to make a best effort at handling intensive
traffic.
This gives us more flexible and balanced traffic awareness over the
standard l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many times empty polls are executed.
A low number of empty polls means the current core is busy processing its
workload, so a higher frequency is needed. A high number of empty polls
indicates the current core is not doing any real work, so we can lower
the frequency to save power.

In the current implementation, each core has 1 empty-poll counter, which
assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
future to support multiple queues per core.
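
As an illustration of the idea above, the following is a minimal sketch of
how a busyness figure could be derived from an empty-poll count. It is not
taken from the patch: the function name busyness_percent() and the exact
formula are assumptions made for this example, with the idle baseline being
the empty-poll count seen in one interval when no traffic is flowing.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: estimate how busy a core is
 * from the empty polls counted in one interval and the idle baseline
 * (the empty-poll count for the same interval length with no traffic).
 */
uint32_t
busyness_percent(uint64_t empty_polls, uint64_t idle_baseline)
{
	assert(idle_baseline > 0);

	/* At or above the baseline every poll was empty: the core is idle. */
	if (empty_polls >= idle_baseline)
		return 0;

	/* Share of the baseline that was not spent on empty polls. */
	return (uint32_t)(100 - (empty_polls * 100) / idle_baseline);
}
```

With a baseline of 350000 empty polls per interval, an interval that counts
175000 empty polls would be reported as 50% busy by this sketch.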

2.1 Power state definition:

	LOW:  Not currently used, reserved for future use.

	MED:  the frequency used to process a modest traffic workload.

	HIGH: the frequency used to process a busy traffic workload.

2.2 There are two phases to establish the power management system:

	a.Initialization/Training phase. The training phase is necessary
	  in order to figure out the system polling baseline numbers from
	  idle to busy. The highest poll count will be during idle, where
	  all polls are empty. These poll counts will be different between
	  systems due to the many possible processor micro-arch, cache
	  and device configurations, hence the training phase.
	  In the training phase, traffic is blocked so the training
	  algorithm can average the empty-poll numbers for the LOW, MED and
	  HIGH power states in order to create a baseline.
	  The cores' counters are collected every 10ms, and the training
	  phase takes 2 seconds.
	  Training is disabled by default, in which case the default
	  thresholds are applied. The sample app can still trigger training
	  if needed. Once the training phase has been executed once on
	  a system, the application can then be started with the relevant
	  thresholds provided on the command line, allowing the application
	  to start passing traffic immediately.

	b.Normal phase. Traffic starts immediately based on the default
	  thresholds, or based on the user supplied thresholds via the
	  command line parameters. The run-time poll counts are compared with
	  the baseline and the decision will be taken to move to MED power
	  state or HIGH power state. The counters are calculated every 10ms.
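
The MED/HIGH decision in the normal phase can be pictured as a two-threshold
(hysteresis) rule. The sketch below is an assumption-laden illustration, not
the patch's code: the enum and function names are invented here, and only
the 70% and 30% values come from the defaults in this patch. It also
switches state on a single sample, whereas a real implementation may want
the condition to hold for several consecutive intervals.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names only; the patch defines its own state enums. */
enum sketch_state { SKETCH_MED, SKETCH_HIGH };

#define SKETCH_MED_TO_HIGH_PERCENT 70	/* go up when busyness exceeds this */
#define SKETCH_HIGH_TO_MED_PERCENT 30	/* go down when busyness drops below */

/* Return the power state to use for the next interval. */
enum sketch_state
next_state(enum sketch_state cur, uint32_t busy_percent)
{
	assert(busy_percent <= 100);

	if (cur == SKETCH_MED && busy_percent > SKETCH_MED_TO_HIGH_PERCENT)
		return SKETCH_HIGH;
	if (cur == SKETCH_HIGH && busy_percent < SKETCH_HIGH_TO_MED_PERCENT)
		return SKETCH_MED;

	/* Between the two thresholds the current state is kept. */
	return cur;
}
```

Keeping the two thresholds apart avoids flapping between frequencies when
the load hovers around a single cut-off value.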

3. Proposed API

1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
		uint8_t *freq_tlb, struct ep_policy *policy);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter, not thread safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter, not thread safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
which is used to detect empty poll state changes and then take action.
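
To show how the update/fetch pairs above are meant to be driven from a poll
loop, here is a self-contained mock that does not link against DPDK. All
sketch_* names, SKETCH_MAX_LCORE and the counter layout are assumptions for
illustration; the real functions and their internal bookkeeping are defined
by the patch itself.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_LCORE 4	/* hypothetical; DPDK uses RTE_MAX_LCORE */

struct sketch_counters {
	uint64_t empty_polls;	/* polls that returned no packets */
	uint64_t packets;	/* packets returned by valid polls */
};

static struct sketch_counters sketch_stats[SKETCH_MAX_LCORE];

/* Stand-in for rte_power_empty_poll_stat_update(). */
int
sketch_empty_poll_stat_update(unsigned int lcore_id)
{
	if (lcore_id >= SKETCH_MAX_LCORE)
		return -1;
	sketch_stats[lcore_id].empty_polls++;
	return 0;
}

/* Stand-in for rte_power_poll_stat_update(). */
int
sketch_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
{
	if (lcore_id >= SKETCH_MAX_LCORE)
		return -1;
	sketch_stats[lcore_id].packets += nb_pkt;
	return 0;
}

/* Stand-ins for the two fetch calls. */
uint64_t
sketch_empty_poll_stat_fetch(unsigned int lcore_id)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	return sketch_stats[lcore_id].empty_polls;
}

uint64_t
sketch_poll_stat_fetch(unsigned int lcore_id)
{
	assert(lcore_id < SKETCH_MAX_LCORE);
	return sketch_stats[lcore_id].packets;
}

/* What a worker would do after each RX burst returning nb_rx packets. */
int
sketch_account_poll(unsigned int lcore_id, uint8_t nb_rx)
{
	if (nb_rx == 0)
		return sketch_empty_poll_stat_update(lcore_id);
	return sketch_poll_stat_update(lcore_id, nb_rx);
}
```

The counters are plain (non-atomic) fields, which mirrors the "not thread
safe" note above: each lcore is expected to update only its own entry.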

ChangeLog:
v2: fix some coding style issues.
v3: rename the filename, API name.
v4: no change.
v5: no change.
v6: re-work the code layout, update API.
v7: fix minor typo and lift node num limit.
v8: disable training as default option.
v9: minor git log update.
v10: update due to the code review comments.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 lib/librte_power/Makefile               |   6 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 545 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 223 +++++++++++++
 lib/librte_power/rte_power_version.map  |  13 +
 5 files changed, 788 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..a8f1301 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_power.a
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +17,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..c184647
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,545 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR 2
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	RTE_LOG(INFO, POWER, "Set the power freq to MED\n");
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+static inline void __attribute__((always_inline))
+set_policy(struct priority_worker *poll_stats,
+		struct ep_policy *policy)
+{
+	set_state(poll_stats, policy->state);
+
+	if (policy->state == TRAINING)
+		return;
+
+	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
+	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
+
+	poll_stats->thresh[MED_NORMAL].trained = true;
+	poll_stats->thresh[HGH_BUSY].trained = true;
+
+}
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%
+				 * this should remove any
+				 * false negatives when the system is 0% busy
+				 */
+				poll_stats->thresh[freq].base_edpi +=
+				poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
+
+		/* edpi mean empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
+				"cur edpi %ld "
+				"base edpi %ld\n",
+				cur_edpi,
+				s->thresh[s->cur_freq].base_edpi);
+		/* Value to make us fail need debug log*/
+		return 1000UL;
+	}
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)(((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
+
+		enum freq_val cur_freq = poll_stats->cur_freq;
+
+		/* edpi mean empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "cur freq is %d, edpi is %lu\n",
+				cur_freq,
+				poll_stats->thresh[cur_freq].base_edpi);
+		return;
+	}
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100) {
+		/* edpi mean empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "Edpi is bigger than threshold\n");
+		return;
+	}
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not currently supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[MED].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, HGH_BUSY);
+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
+			}
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+				poll_stats->thresh[HGH].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, MED_NORMAL);
+				RTE_LOG(INFO, POWER, "MOVE to MED\n");
+			}
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	}
+}
+
+static int
+empty_poll_training(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "LOW threshold is %lu\n",
+				poll_stats->thresh[LOW].base_edpi);
+
+		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
+				poll_stats->thresh[MED].base_edpi);
+
+
+		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
+				poll_stats->thresh[HGH].base_edpi);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_training(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	if (freq_tlb == NULL) {
+		freq_index[LOW] = 14;
+		freq_index[MED] = 9;
+		freq_index[HGH] = 1;
+	} else {
+		freq_index[LOW] = freq_tlb[LOW];
+		freq_index[MED] = freq_tlb[MED];
+		freq_index[HGH] = freq_tlb[HGH];
+	}
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* Train for pre-defined period */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	*eptr = ep_params;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
+				total_avail_freqs[i],
+				i);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+		if (rte_get_master_lcore() != i) {
+			w->wrk_stats[i].lcore_id = i;
+			set_policy(&w->wrk_stats[i], policy);
+		}
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..d8cbb17
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS  RTE_MAX_LCORE_FREQS
+
+#define BINS_AV 4 /* Must be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         256  /* Max core number*/
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Per-worker-thread empty poll stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	/* edpi mean empty poll counter difference per interval */
+	uint64_t edpi_av[BINS_AV];
+	/* empty poll counter */
+	uint32_t ec;
+	/* ppi mean valid poll counter per interval */
+	uint64_t ppi_av[BINS_AV];
+	/* valid poll counter */
+	uint32_t pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/* Sample App Init information */
+struct ep_policy {
+
+	uint64_t med_base_edpi;
+	uint64_t hgh_base_edpi;
+
+	enum queue_state state;
+};
+
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param eptr
+ *   the structure of empty poll configuration
+ * @param freq_tlb
+ *   the power state/frequency  mapping table
+ * @param policy
+ *   the initialization policy from sample app
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The packet number of one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Empty poll  state change detection function
+ *
+ * @param  tim
+ *  The timer structure
+ * @param  arg
+ *  The customized parameter
+ */
+void  __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index dd587df..17a083b 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -33,3 +33,16 @@ DPDK_18.08 {
 	rte_power_get_capabilities;
 
 } DPDK_17.11;
+
+EXPERIMENTAL {
+        global:
+
+        rte_empty_poll_detection;
+        rte_power_empty_poll_stat_fetch;
+        rte_power_empty_poll_stat_free;
+        rte_power_empty_poll_stat_init;
+        rte_power_empty_poll_stat_update;
+        rte_power_poll_stat_fetch;
+        rte_power_poll_stat_update;
+
+};
-- 
2.7.5


* [dpdk-dev] [PATCH v10 2/4] examples/l3fwd-power: simple app update for new API
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
@ 2018-10-02 13:48                       ` Liang Ma
  2018-10-02 14:23                         ` Hunt, David
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 3/4] doc/guides/pro_guide/power-man: update the power API Liang Ma
                                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-10-02 13:48 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for full parameter usage.

The options "-l", "-m" and "-h" set the frequency index for the
LOW, MED and HIGH power states. They only take effect when empty-poll
is enabled.

--empty-poll="training_flag, med_threshold, high_threshold"

The option training_flag is used to enable/disable training mode.

The option med_threshold sets a user-defined empty poll threshold for
the modest (MED) state.

The option high_threshold sets a user-defined empty poll threshold for
the busy (HIGH) state.

All three options default to 0.

Once empty-poll is enabled, the system applies the default parameters if
no other command line options are provided.
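
The option format above can be illustrated with a small stand-alone
parser. This is only a sketch, not the code in main.c: the names
parse_empty_poll and struct ep_opts are hypothetical, and it assumes that
a value of 0 for either threshold means "use the built-in default", which
the caller passes in.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical container for the three --empty-poll fields. */
struct ep_opts {
	bool train;
	unsigned long med_threshold;
	unsigned long high_threshold;
};

/*
 * Parse "training_flag,med_threshold,high_threshold".
 * Returns 0 on success, -1 if the string does not hold exactly
 * three comma-separated numbers.
 */
static int
parse_empty_poll(const char *arg, struct ep_opts *out,
		unsigned long med_default, unsigned long high_default)
{
	unsigned long v[3];
	char *end;
	int i;

	for (i = 0; i < 3; i++) {
		errno = 0;
		v[i] = strtoul(arg, &end, 0);
		if (end == arg || errno != 0)
			return -1;	/* no digits, or out of range */
		if (*end != (i < 2 ? ',' : '\0'))
			return -1;	/* missing field or trailing text */
		if (i < 2)
			arg = end + 1;	/* step over the comma */
	}

	out->train = (v[0] == 1);
	/* 0 means "not customized": fall back to the default. */
	out->med_threshold = v[1] ? v[1] : med_default;
	out->high_threshold = v[2] ? v[2] : high_default;
	return 0;
}
```

With --empty-poll="0,0,0" this leaves training disabled and keeps both
defaults. Checking each field separately avoids one threshold silently
depending on the other.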

If training mode is enabled, the user should ensure that no traffic
passes through the system during training. When the training phase
completes, the application switches to normal operation.
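
As a rough illustration of the training phase, the sketch below averages
the per-interval empty-poll counts collected while the system is idle and
adds a 0.05% margin, mirroring what update_training_stats() in
rte_power_empty_poll.c does. The function name train_baseline and the
flat sample array are assumptions made for the example; the library
accumulates the counts incrementally instead.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the idle baseline (empty polls per interval) from n samples
 * taken with no traffic. Returns 0 if there are no samples.
 */
static uint64_t
train_baseline(const uint64_t *samples, size_t n)
{
	uint64_t sum = 0;
	uint64_t base;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		sum += samples[i];

	base = sum / n;

	/* Add 0.05% so that a fully idle core is not reported as busy. */
	return base + base / 2000;
}
```

Training must run without traffic because any received packet lowers the
empty-poll count and therefore the baseline, which would make later load
estimates read too low.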

The system starts in the modest (MED) power state.
If the traffic load goes above 70%, the system moves to the HIGH power
state. If the traffic load drops below 30%, the system falls back to the
modest power state.
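
The transition rule above can be sketched in C as follows. This is an
illustrative sketch, not the library code: struct ep_sketch, busy_percent
and ep_step are hypothetical names. It assumes, as in
rte_power_empty_poll.c, that busyness is derived from the ratio of
observed empty polls per interval to the trained idle baseline, and that
a threshold must stay crossed for a run of consecutive intervals before
the state changes. Unlike the library, the sketch clamps readings above
the baseline instead of rejecting them, and it skips the averaging over
several intervals.

```c
#include <assert.h>
#include <stdint.h>

enum ep_state { EP_MED, EP_HIGH };

struct ep_sketch {
	enum ep_state state;
	uint32_t ctr;	/* consecutive intervals past the threshold */
};

#define MED_TO_HIGH_PERCENT 70
#define HIGH_TO_MED_PERCENT 30
#define HOLD_INTERVALS      100	/* 100 x 10 ms, about one second */

/* Busyness in percent: 0 when every poll is empty, 100 when none is. */
static uint32_t
busy_percent(uint64_t empty_polls, uint64_t base_empty_polls)
{
	assert(base_empty_polls != 0);
	if (empty_polls > base_empty_polls)
		empty_polls = base_empty_polls;	/* clamp noisy readings */
	return 100 - (uint32_t)(empty_polls * 100 / base_empty_polls);
}

/* Feed one interval's busyness; returns the state after this interval. */
static enum ep_state
ep_step(struct ep_sketch *s, uint32_t percent)
{
	int crossed = (s->state == EP_MED) ?
		percent > MED_TO_HIGH_PERCENT :
		percent < HIGH_TO_MED_PERCENT;

	if (!crossed) {
		s->ctr = 0;	/* condition broken: start counting again */
		return s->state;
	}
	if (s->ctr < HOLD_INTERVALS) {
		s->ctr++;
		return s->state;
	}
	s->state = (s->state == EP_MED) ? EP_HIGH : EP_MED;
	s->ctr = 0;
	return s->state;
}
```

The gap between the 70% and 30% thresholds, together with the hold
counter, keeps the frequency from flapping when the load hovers near a
single threshold.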

The example code uses the master thread to monitor worker thread
busyness. The default timer resolution is 10ms.
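
The 10ms resolution follows from dividing the timer frequency by the
number of intervals per second. A minimal sketch of that arithmetic is
shown below; interval_ticks and interval_elapsed are hypothetical helper
names, and the tick values would come from the platform timer in a real
application.

```c
#include <stdbool.h>
#include <stdint.h>

#define INTERVALS_PER_SECOND 100	/* 10 ms resolution */

/* Ticks per interval for a timer running at hz ticks per second. */
static uint64_t
interval_ticks(uint64_t hz)
{
	return hz / INTERVALS_PER_SECOND;
}

/*
 * Return true when more than one interval has passed since *prev,
 * and record the current time as the new reference point.
 */
static bool
interval_elapsed(uint64_t now, uint64_t *prev, uint64_t ticks)
{
	if (now - *prev <= ticks)
		return false;
	*prev = now;
	return true;
}
```

The monitoring thread would call interval_elapsed() in its loop and run
the busyness check only when it returns true.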

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.
v7 no change.
v8 disable training as default option.
v10 update due to review comments.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>
---
 examples/l3fwd-power/Makefile    |   3 +
 examples/l3fwd-power/main.c      | 342 +++++++++++++++++++++++++++++++++++++--
 examples/l3fwd-power/meson.build |   1 +
 3 files changed, 329 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
index d7e39a3..772ec7b 100644
--- a/examples/l3fwd-power/Makefile
+++ b/examples/l3fwd-power/Makefile
@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
 LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
 LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
 	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
 
@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
 all:
 else
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 68527d2..a2aa816 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_power_empty_poll.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -55,6 +56,8 @@
 
 /* 100 ms interval */
 #define TIMER_NUMBER_PER_SECOND           10
+/* (10ms) */
+#define INTERVALS_PER_SECOND             100
 /* 100000 us */
 #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
 #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
@@ -117,6 +120,17 @@
  */
 #define RTE_TEST_RX_DESC_DEFAULT 1024
 #define RTE_TEST_TX_DESC_DEFAULT 1024
+
+/*
+ * These two thresholds were decided on by running the training algorithm on
+ * a 2.5GHz Xeon. These defaults can be overridden by supplying non-zero values
+ * for the med_threshold and high_threshold parameters on the command line.
+ */
+#define EMPTY_POLL_MED_THRESHOLD 350000UL
+#define EMPTY_POLL_HGH_THRESHOLD 580000UL
+
+
+
 static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
 static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
 
@@ -132,6 +146,14 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+static bool empty_poll_train;
+volatile bool empty_poll_stop;
+static struct  ep_params *ep_params;
+static struct  ep_policy policy;
+static long  ep_med_edpi, ep_hgh_edpi;
+
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -330,6 +352,19 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+
+/*
+ * These defaults are using the max frequency index (1), a medium index (9)
+ * and a typical low frequency index (14). These can be adjusted to use
+ * different indexes using the relevant command line parameters.
+ */
+static uint8_t  freq_tlb[] = {14, 9, 1};
+
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
+
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -338,7 +373,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -351,16 +394,19 @@ signal_exit_now(int sigtype)
 							"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -825,7 +871,110 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) /
+		US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n",
+			lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
 
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET],
+							void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
+
+	}
+
+	return 0;
+}
 /* main processing loop */
 static int
 main_loop(__attribute__((unused)) void *dummy)
@@ -1127,7 +1276,9 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection"
+		" follow (training_flag, med_threshold, high_threshold)\n",
 		prgname);
 }
 
@@ -1220,7 +1371,55 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+static int
+parse_ep_config(const char *q_arg)
+{
+	char s[256];
+	const char *p = q_arg;
+	char *end;
+	int  num_arg;
+
+	char *str_fld[3];
+
+	int training_flag;
+	int med_edpi;
+	int hgh_edpi;
+
+	ep_med_edpi = EMPTY_POLL_MED_THRESHOLD;
+	ep_hgh_edpi = EMPTY_POLL_HGH_THRESHOLD;
+
+	snprintf(s, sizeof(s), "%s", p);
+
+	num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ',');
+
+	empty_poll_train = false;
 
+	if (num_arg == 0)
+		return 0;
+
+	if (num_arg == 3) {
+
+		training_flag = strtoul(str_fld[0], &end, 0);
+		med_edpi = strtoul(str_fld[1], &end, 0);
+		hgh_edpi = strtoul(str_fld[2], &end, 0);
+
+		if (training_flag == 1)
+			empty_poll_train = true;
+
+		if (med_edpi > 0)
+			ep_med_edpi = med_edpi;
+
+		if (hgh_edpi > 0)
+			ep_hgh_edpi = hgh_edpi;
+
+	} else {
+
+		return -1;
+	}
+
+	return 0;
+
+}
 #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
 
 /* Parse the argument given in the command line of the application */
@@ -1230,6 +1429,7 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
@@ -1237,13 +1437,14 @@ parse_args(int argc, char **argv)
 		{"high-perf-cores", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
+		{"empty-poll", 1, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
 	argvopt = argv;
 
-	while ((opt = getopt_long(argc, argvopt, "p:P",
+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
 				lgopts, &option_index)) != EOF) {
 
 		switch (opt) {
@@ -1260,7 +1461,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[LOW] = limit;
+			break;
+		case 'm':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[MED] = limit;
+			break;
+		case 'h':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[HGH] = limit;
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1299,6 +1511,20 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+				ret = parse_ep_config(optarg);
+
+				if (ret) {
+					printf("invalid empty poll config\n");
+					print_usage(prgname);
+					return -1;
+				}
+
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1646,6 +1872,59 @@ init_power_library(void)
 	}
 	return ret;
 }
+static void
+empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			rte_empty_poll_detection,
+			(void *)ep_ptr);
+
+}
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 
 int
 main(int argc, char **argv)
@@ -1828,13 +2107,15 @@ main(int argc, char **argv)
 		if (rte_lcore_is_enabled(lcore_id) == 0)
 			continue;
 
-		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			/* init timer structures for each enabled lcore */
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND,
+					SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
@@ -1905,12 +2186,39 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true) {
+
+		if (empty_poll_train) {
+			policy.state = TRAINING;
+		} else {
+			policy.state = MED_NORMAL;
+			policy.med_base_edpi = ep_med_edpi;
+			policy.hgh_base_edpi = ep_hgh_edpi;
+		}
+
+		rte_power_empty_poll_stat_init(&ep_params, freq_tlb, &policy);
+	}
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL,
+				SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
index 20c8054..a3c5c2f 100644
--- a/examples/l3fwd-power/meson.build
+++ b/examples/l3fwd-power/meson.build
@@ -9,6 +9,7 @@
 if host_machine.system() != 'linux'
 	build = false
 endif
+allow_experimental_apis = true
 deps += ['power', 'timer', 'lpm', 'hash']
 sources = files(
 	'main.c', 'perf_core.c'
-- 
2.7.5
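
For readers following the timer bring-up in the patch above, here is a
minimal standalone sketch of the two calculations it relies on: deriving a
tick interval from the timer frequency, and deciding when enough cycles have
elapsed to service the timers. The helper names ticks_per_interval() and
manage_due() are hypothetical and are not part of the patch; in the patch the
inputs come from rte_get_timer_hz() and rte_rdtsc().

```c
#include <stdint.h>

/* Hypothetical helper: ticks in one interval for a timer running at hz. */
static uint64_t
ticks_per_interval(uint64_t hz, uint64_t intervals_per_second)
{
	return hz / intervals_per_second;
}

/*
 * Hypothetical helper: returns 1 and records cur_tsc in *prev_tsc when
 * more than period cycles have passed since *prev_tsc, otherwise 0.
 * Unsigned arithmetic keeps the subtraction well defined.
 */
static int
manage_due(uint64_t cur_tsc, uint64_t *prev_tsc, uint64_t period)
{
	if (cur_tsc - *prev_tsc > period) {
		*prev_tsc = cur_tsc;
		return 1;
	}
	return 0;
}
```

In the patch, the equivalent of a true result from manage_due() is the point
where rte_timer_manage() is called.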

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v10 3/4] doc/guides/pro_guide/power-man: update the power API
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-10-02 13:48                       ` Liang Ma
  2018-10-02 14:24                         ` Hunt, David
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: update Liang Ma
                                         ` (3 subsequent siblings)
  5 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-10-02 13:48 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Update the document for empty poll API.

Change Logs:
v9: minor changes for syntax. Update document.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 86 +++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index eba1cc6..68b7e8b 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -106,6 +106,92 @@ User Cases
 
 The power management mechanism is used to save power when performing L3 forwarding.
 
+
+Empty Poll API
+--------------
+
+Abstract
+~~~~~~~~
+
+For packet processing workloads such as DPDK, polling is continuous.
+This means CPU cores always show 100% busy, independent of how much work
+those cores are doing. It is important to determine accurately how busy
+a core is, because without that information:
+
+        * There is no indication of overload conditions
+        * The user does not know how much real load is on a system, resulting
+          in wasted energy as no power management is utilized
+
+Compared to the original l3fwd-power design, instead of going to sleep
+after detecting an empty poll, the new mechanism just lowers the core frequency.
+As a result, the application does not stop polling the device, which leads
+to improved handling of bursts of traffic.
+
+When the system becomes busy, the empty poll mechanism can also increase the core
+frequency (including turbo) to make a best effort for intensive traffic. This gives
+more flexible and balanced traffic awareness than the standard l3fwd-power
+application.
+
+
+Proposed Solution
+~~~~~~~~~~~~~~~~~
+The proposed solution focuses on how many times empty polls are executed.
+A low number of empty polls means the current core is busy processing its
+workload, so a higher frequency is needed. A high number of empty polls
+indicates the current core is not doing any real work, so the frequency can
+be lowered to save power.
+
+In the current implementation, each core has 1 empty-poll counter, which assumes
+1 core is dedicated to 1 queue. This will need to be expanded in the future to
+support multiple queues per core.
+
+Power state definition:
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* LOW:  Not currently used, reserved for future use.
+
+* MED:  the frequency is used to process modest traffic workload.
+
+* HIGH: the frequency is used to process busy traffic workload.
+
+There are two phases to establish the power management system:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* Training phase. This phase is used to measure the optimal frequency
+  change thresholds for a given system. The thresholds will differ from
+  system to system due to differences in processor micro-architecture,
+  cache and device configurations.
+  In this phase, the user must ensure that no traffic can enter the
+  system so that counts can be measured for empty polls at low, medium
+  and high frequencies. Each frequency is measured for two seconds.
+  Once the training phase is complete, the threshold numbers are
+  displayed, normal mode resumes, and traffic can be allowed into
+  the system. These threshold numbers can be used on the command line
+  when starting the application in normal mode to avoid re-training
+  every time.
+
+* Normal phase. Every 10ms the run-time counters are compared
+  to the supplied threshold values, and the decision will be made
+  whether to move to a different power state (by adjusting the
+  frequency).
+
+API Overview for Empty Poll Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **State Init**: initialize the power management system.
+
+* **State Free**: free the resources held by the power management system.
+
+* **Update Empty Poll Counter**: update the empty poll counter.
+
+* **Update Valid Poll Counter**: update the valid poll counter.
+
+* **Set the Frequency Index**: update the power state/frequency mapping.
+
+* **Detect empty poll state change**: run the empty poll state change detection algorithm, then take action.
+
+User Cases
+----------
+The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
+
 References
 ----------
 
-- 
2.7.5
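
As a reading aid for the training phase described in the documentation
above, the sketch below averages a series of empty-poll samples into a
single baseline figure. It assumes one sample per 10ms, as the cover letter
of this series describes, which gives 200 samples for a two-second
measurement. The function name average_empty_polls() is hypothetical, and
the library's real training code may weight or filter samples differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: mean of n empty-poll samples, one per interval.
 * Returns 0 when no samples were collected.
 */
static uint64_t
average_empty_polls(const uint64_t *samples, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		sum += samples[i];

	return sum / n;
}
```

Running this once per frequency (low, medium, high) would yield the three
baseline numbers that the training output reports as thresholds.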

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v10 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: update
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 3/4] doc/guides/pro_guide/power-man: update the power API Liang Ma
@ 2018-10-02 13:48                       ` Liang Ma
  2018-10-02 14:25                         ` Hunt, David
  2018-10-02 14:22                       ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Hunt, David
                                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 79+ messages in thread
From: Liang Ma @ 2018-10-02 13:48 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Add empty poll mode command line example

ChangeLogs:
v9: update the document

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/sample_app_ug/l3_forward_power_man.rst | 69 +++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 795a570..e44a11b 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -105,6 +105,8 @@ where,
 
 *   --no-numa: optional, disables numa awareness
 
+*   --empty-poll: Traffic Aware power management. See below for details
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -362,3 +364,70 @@ The algorithm has the following sleeping behavior depending on the idle counter:
 If a thread polls multiple Rx queues and different queue returns different sleep duration values,
 the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
 in order to avoid a potential performance impact.
+
+Empty Poll Mode
+-------------------------
+Additionally, there is a traffic aware mode of operation called "Empty
+Poll" where the number of empty polls can be monitored to keep track
+of how busy the application is. Empty poll mode can be enabled by the
+command line option --empty-poll.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK Programmer's Guide for empty poll mode details.
+
+.. code-block:: console
+
+    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
+
+Where,
+
+--empty-poll: Enable the empty poll mode instead of original algorithm
+
+--empty-poll="training_flag, med_threshold, high_threshold"
+
+* ``training_flag`` : optional, enable/disable training mode. Default value is 0. If the training_flag is set to 1 (true), the application will start in training mode and print out the trained threshold values. If the training_flag is set to 0 (false), the application will start in normal mode, and will use either the default thresholds or those supplied on the command line. The trained threshold values are specific to the user’s system and may give a better power profile when compared to the default threshold values.
+
+* ``med_threshold`` : optional, sets the empty poll threshold of a modestly busy system state. If this is not supplied, the application will apply the default value of 350000.
+
+* ``high_threshold`` : optional, sets the empty poll threshold of a busy system state. If this is not supplied, the application will apply the default value of 580000.
+
+* -l : optional, set up the LOW power state frequency index
+
+* -m : optional, set up the MED power state frequency index
+
+* -h : optional, set up the HIGH power state frequency index
+
+Empty Poll Mode Example Usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To initially obtain the ideal thresholds for the system, the training
+mode should be run first. This is achieved by running the l3fwd-power
+app with the training flag set to “1”, and the other parameters set to
+0.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "1,0,0" -P
+
+This will run the training algorithm for x seconds on each core (cores 2
+and 3), and then print out the recommended threshold values for those
+cores. The thresholds should be very similar for each core.
+
+.. code-block:: console
+
+        POWER: Bring up the Timer
+        POWER: set the power freq to MED
+        POWER: Low threshold is 230277
+        POWER: MED threshold is 335071
+        POWER: HIGH threshold is 523769
+        POWER: Training is Complete for 2
+        POWER: set the power freq to MED
+        POWER: Low threshold is 236814
+        POWER: MED threshold is 344567
+        POWER: HIGH threshold is 538580
+        POWER: Training is Complete for 3
+
+Once the values have been measured for a particular system, the app can
+then be started without the training mode so traffic can start immediately.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "0,340000,540000" -P
-- 
2.7.5
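
To make the --empty-poll argument format documented above concrete, here is
a minimal sketch of parsing "training_flag,med_threshold,high_threshold".
It is not the sample application's parser: struct ep_cli and
parse_empty_poll() are hypothetical names, and treating a threshold of 0 as
"use the default" is an assumption based on the training example, which
passes 0 for both thresholds. The defaults 350000 and 580000 are the ones
stated in the documentation.

```c
#include <stdint.h>
#include <stdio.h>

#define EP_DEFAULT_MED  350000u	/* default med_threshold from the doc */
#define EP_DEFAULT_HIGH 580000u	/* default high_threshold from the doc */

/* Hypothetical container for the parsed option. */
struct ep_cli {
	int train;	/* 1 = start in training mode */
	uint32_t med;	/* empty-poll threshold, modestly busy state */
	uint32_t high;	/* empty-poll threshold, busy state */
};

/* Parse "flag,med,high"; returns 0 on success, -1 on malformed input. */
static int
parse_empty_poll(const char *arg, struct ep_cli *out)
{
	unsigned int train, med, high;

	if (sscanf(arg, "%u,%u,%u", &train, &med, &high) != 3)
		return -1;
	if (train > 1)
		return -1;

	out->train = (int)train;
	/* Assumption: 0 means "not supplied", so fall back to defaults. */
	out->med = med ? med : EP_DEFAULT_MED;
	out->high = high ? high : EP_DEFAULT_HIGH;
	return 0;
}
```

With this sketch, "1,0,0" selects training mode with default thresholds, and
"0,340000,540000" selects normal mode with the trained values from the
example above.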

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
                                         ` (2 preceding siblings ...)
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: update Liang Ma
@ 2018-10-02 14:22                       ` Hunt, David
  2018-10-12  1:59                       ` Yao, Lei A
  2018-10-19 10:23                       ` [dpdk-dev] [PATCH v11 1/5] " Liang Ma
  5 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-10-02 14:22 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic


On 2/10/2018 2:48 PM, Liang Ma wrote:
> 1. Abstract
>
> For packet processing workloads such as DPDK polling is continuous.
> This means CPU cores always show 100% busy independent of how much work
> those cores are doing. It is critical to accurately determine how busy
> a core is hugely important for the following reasons:
>
>     * No indication of overload conditions.
>
>     * User does not know how much real load is on a system, resulting
>       in wasted energy as no power management is utilized.
>
> Compared to the original l3fwd-power design, instead of going to sleep
> after detecting an empty poll, the new mechanism just lowers the core
> frequency. As a result, the application does not stop polling the device,
> which leads to improved handling of bursts of traffic.
>
> When the system become busy, the empty poll mechanism can also increase the
> core frequency (including turbo) to do best effort for intensive traffic.
> This gives us more flexible and balanced traffic awareness over the
> standard l3fwd-power application.
>
> 2. Proposed solution
>
> The proposed solution focuses on how many times empty polls are executed.
> The less the number of empty polls, means current core is busy with
> processing workload, therefore, the higher frequency is needed. The high
> empty poll number indicates the current core not doing any real work
> therefore, we can lower the frequency to safe power.
>
> In the current implementation, each core has 1 empty-poll counter which
> assume 1 core is dedicated to 1 queue. This will need to be expanded in the
> future to support multiple queues per core.
>
> 2.1 Power state definition:
>
> 	LOW:  Not currently used, reserved for future use.
>
> 	MED:  the frequency is used to process modest traffic workload.
>
> 	HIGH: the frequency is used to process busy traffic workload.
>
> 2.2 There are two phases to establish the power management system:
>
> 	a.Initialization/Training phase. The training phase is necessary
> 	  in order to figure out the system polling baseline numbers from
> 	  idle to busy. The highest poll count will be during idle, where
> 	  all polls are empty. These poll counts will be different between
> 	  systems due to the many possible processor micro-arch, cache
> 	  and device configurations, hence the training phase.
>    	  In the training phase, traffic is blocked so the training
>    	  algorithm can average the empty-poll numbers for the LOW, MED and
>   	  HIGH  power states in order to create a baseline.
>    	  The core's counter are collected every 10ms, and the Training
>   	  phase will take 2 seconds.
>   	  Training is disabled as default configuration. The default
>   	  parameter is applied. Sample App still can trigger training
>   	  if that's needed. Once the training phase has been executed once on
>   	  a system, the application can then be started with the relevant
>   	  thresholds provided on the command line, allowing the application
>   	  to start passing start traffic immediately
>
> 	b.Normal phase. Traffic starts immediately based on the default
> 	  thresholds, or based on the user supplied thresholds via the
> 	  command line parameters. The run-time poll counts are compared with
> 	  the baseline and the decision will be taken to move to MED power
>    	  state or HIGH power state. The counters are calculated every 10ms.
>
> 3. Proposed  API
>
> 1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
> 		uint8_t *freq_tlb, struct ep_policy *policy);
> which is used to initialize the power management system.
>   
> 2.  rte_power_empty_poll_stat_free(void);
> which is used to free the resource hold by power management system.
>   
> 3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
> which is used to update specific core empty poll counter, not thread safe
>   
> 4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
> which is used to update specific core valid poll counter, not thread safe
>   
> 5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core empty poll counter.
>   
> 6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
> which is used to get specific core valid poll counter.
>
> 7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
> which is used to detect empty poll state changes then take action.
>
> ChangeLog:
> v2: fix some coding style issues.
> v3: rename the filename, API name.
> v4: no change.
> v5: no change.
> v6: re-work the code layout, update API.
> v7: fix minor typo and lift node num limit.
> v8: disable training as default option.
> v9: minor git log update.
> v10: update due to the code review comments.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>
> Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> ---


Acked-by: David Hunt <david.hunt@intel.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v10 2/4] examples/l3fwd-power: simple app update for new API
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-10-02 14:23                         ` Hunt, David
  0 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-10-02 14:23 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic


On 2/10/2018 2:48 PM, Liang Ma wrote:
> Add the support for new traffic pattern aware power control
> power management API.
>
> Example:
> ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
> -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
>
> Please Reference l3fwd-power document for full parameter usage
>
> The option "l", "m", "h" are used to set the power index for
> LOW, MED, HIGH power state. Only is useful after enable empty-poll
>
> --empty-poll="training_flag, med_threshold, high_threshold"
>
> The option training_flag is used to enable/disable training mode.
>
> The option med_threshold is used to indicate the empty poll threshold
> of modest state which is customized by user.
>
> The option high_threshold is used to indicate the empty poll threshold
> of busy state which is customized by user.
>
> Above three option default value is all 0.
>
> Once enable empty-poll. System will apply the default parameter if no
> other command line options are provided.
>
> If training mode is enabled, the user should ensure that no traffic
> is allowed to pass through the system. When training phase complete,
> the application transfer to normal operation
>
> System will start running with the modest power mode.
> If the traffic goes above 70%, then system will move to High power state.
> If the traffic drops below 30%, the system will fallback to the modest
> power state.
>
> Example code use master thread to monitoring worker thread busyness.
> The default timer resolution is 10ms.
>
> ChangeLog:
> v2 fix some coding style issues
> v3 rename the API.
> v6 re-work the API.
> v7 no change.
> v8 disable training as default option.
> v10 update due to review comments.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
>
> Reviewed-by: Lei Yao <lei.a.yao@intel.com>
> ---

Acked-by: David Hunt <david.hunt@intel.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v10 3/4] doc/guides/pro_guide/power-man: update the power API
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 3/4] doc/guides/pro_guide/power-man: update the power API Liang Ma
@ 2018-10-02 14:24                         ` Hunt, David
  0 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-10-02 14:24 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic


On 2/10/2018 2:48 PM, Liang Ma wrote:
> Update the document for empty poll API.
>
> Change Logs:
> v9: minor changes for syntax. Update document.
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---

Acked-by: David Hunt <david.hunt@intel.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v10 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: update
  2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: update Liang Ma
@ 2018-10-02 14:25                         ` Hunt, David
  0 siblings, 0 replies; 79+ messages in thread
From: Hunt, David @ 2018-10-02 14:25 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic


On 2/10/2018 2:48 PM, Liang Ma wrote:
> Add empty poll mode command line example
>
> ChangeLogs:
> v9: update the document
>
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---

Acked-by: David Hunt <david.hunt@intel.com>

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
                                         ` (3 preceding siblings ...)
  2018-10-02 14:22                       ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Hunt, David
@ 2018-10-12  1:59                       ` Yao, Lei A
  2018-10-12 10:02                         ` Liang, Ma
  2018-10-19 10:23                       ` [dpdk-dev] [PATCH v11 1/5] " Liang Ma
  5 siblings, 1 reply; 79+ messages in thread
From: Yao, Lei A @ 2018-10-12  1:59 UTC (permalink / raw)
  To: Ma, Liang J, Hunt, David; +Cc: dev, ktraynor, Kovacevic, Marko



+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+		if (rte_get_master_lcore() != i) {
+			w->wrk_stats[i].lcore_id = i;
+			set_policy(&w->wrk_stats[i], policy);
+		}
+	}
+
+	return 0;
+}

Hi, Liang

There is one issue in this part.
When you find that one frequency level can't be supported on the server
in use, you return directly. This skips the set_policy step that follows.
If the set_policy step is skipped, the power lib always executes the
training steps, even if we set policy.state=MED_NORMAL in the sample.
This will confuse users: they won't know why they can't skip the training steps even
though the sample is already configured with --empty-poll=0,xxxxx,xxxxxx

BRs
Lei


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control
  2018-10-12  1:59                       ` Yao, Lei A
@ 2018-10-12 10:02                         ` Liang, Ma
  2018-10-12 13:22                           ` Yao, Lei A
  0 siblings, 1 reply; 79+ messages in thread
From: Liang, Ma @ 2018-10-12 10:02 UTC (permalink / raw)
  To: Yao, Lei A; +Cc: Hunt, David, dev, ktraynor, Kovacevic, Marko

On 11 Oct 18:59, Yao, Lei A wrote:
> 
> 
> +
> +		if (get_freq_index(LOW) > total_avail_freqs[i])
> +			return -1;
> +
> +		if (rte_get_master_lcore() != i) {
> +			w->wrk_stats[i].lcore_id = i;
> +			set_policy(&w->wrk_stats[i], policy);
> +		}
> +	}
> +
> +	return 0;
> +}
> 
> Hi, Liang
> 
> There is one issue in this part. 
> When you find one frequency level can't be support on the server
> we used,  you return directly. This will skip the set_policy step in the following. 
> If skip the set_policy step, the behavior will be the power lib always 
> execute the training steps, even we set the policy.state=MED_NORMAL in the sample.  
> This will confuse the user, they don’t know why they can't skip the training steps even
> the sample is already configured to --empty-poll=0,xxxxx,xxxxxx
> 
> BRs
> Lei
Hi Lei,
   I think the lib code logic is OK.
   If the LOW freq index is still bigger than the highest available freq index, something is wrong
   and the execution should stop.
   The sample app should check the rte_power_empty_poll_stat_init
   result; if rte_power_empty_poll_stat_init returns an error, the sample app should exit.
   I can update the sample app code to add the check.
Regards
Liang
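
The check agreed on in this exchange could look roughly like the standalone
sketch below. It does not call the library: check_freq_index() is a
hypothetical stand-in that reproduces only the index comparison quoted
above, and app_init_empty_poll() is a hypothetical wrapper showing the
"check the result and stop" pattern. In the sample app the checked call
would be rte_power_empty_poll_stat_init(&ep_params, freq_tlb, &policy), and
the failure path would normally end in rte_exit().

```c
#include <stdio.h>

/* Hypothetical stand-in for the library-side index check. */
static int
check_freq_index(unsigned int low_idx, unsigned int total_avail_freqs)
{
	if (low_idx > total_avail_freqs)
		return -1;
	return 0;
}

/*
 * Hypothetical app-side wrapper: report the failure and return -1 so
 * the caller can exit instead of continuing with an unset policy.
 */
static int
app_init_empty_poll(unsigned int low_idx, unsigned int total_avail_freqs)
{
	if (check_freq_index(low_idx, total_avail_freqs) < 0) {
		fprintf(stderr,
			"empty poll init failed: freq index %u out of range\n",
			low_idx);
		return -1;
	}
	return 0;
}
```

Exiting on failure avoids the situation Lei describes, where the library
silently falls back to training because set_policy was never reached.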

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control
  2018-10-12 10:02                         ` Liang, Ma
@ 2018-10-12 13:22                           ` Yao, Lei A
  0 siblings, 0 replies; 79+ messages in thread
From: Yao, Lei A @ 2018-10-12 13:22 UTC (permalink / raw)
  To: Ma, Liang J; +Cc: Hunt, David, dev, ktraynor, Kovacevic, Marko



> -----Original Message-----
> From: Ma, Liang J
> Sent: Friday, October 12, 2018 6:03 PM
> To: Yao, Lei A <lei.a.yao@intel.com>
> Cc: Hunt, David <david.hunt@intel.com>; dev@dpdk.org;
> ktraynor@redhat.com; Kovacevic, Marko <marko.kovacevic@intel.com>
> Subject: Re: [PATCH v10 1/4] lib/librte_power: traffic pattern aware power
> control
> 
> On 11 Oct 18:59, Yao, Lei A wrote:
> >
> >
> > +
> > +		if (get_freq_index(LOW) > total_avail_freqs[i])
> > +			return -1;
> > +
> > +		if (rte_get_master_lcore() != i) {
> > +			w->wrk_stats[i].lcore_id = i;
> > +			set_policy(&w->wrk_stats[i], policy);
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> >
> > Hi, Liang
> >
> > There is one issue in this part.
> > When you find one frequency level can't be support on the server
> > we used,  you return directly. This will skip the set_policy step in the
> following.
> > If skip the set_policy step, the behavior will be the power lib always
> > execute the training steps, even we set the policy.state=MED_NORMAL in
> the sample.
> > This will confuse the user, they don’t know why they can't skip the training
> steps even
> > the sample is already configured to --empty-poll=0,xxxxx,xxxxxx
> >
> > BRs
> > Lei
> Hi Lei,
>    I think the lib code logic is OK.
>    if the LOW freq index still is bigger than highest avaiable freq index, sth is
> wrong.
>    the execution should stop.
>    Simple app should check the rte_power_empty_poll_stat_init
>    result, if rte_power_empty_poll_stat_init return error. the sample app
> should exit.
>    I can update the sample app code add the checking.
> Regards
> Liang
Hi, Liang

If sample will exit in this situation, it's OK for me. Thanks.

BRs
Lei

^ permalink raw reply	[flat|nested] 79+ messages in thread

* [dpdk-dev] [PATCH v11 1/5] lib/librte_power: traffic pattern aware power control
  2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
                                         ` (4 preceding siblings ...)
  2018-10-12  1:59                       ` Yao, Lei A
@ 2018-10-19 10:23                       ` Liang Ma
  2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
                                           ` (3 more replies)
  5 siblings, 4 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 10:23 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. It is important to determine accurately how busy
a core is, because without that information:

   * There is no indication of overload conditions.

   * The user does not know how much real load is on a system, resulting
     in wasted energy as no power management is utilized.

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core
frequency. As a result, the application does not stop polling the device,
which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the
core frequency (including turbo) to make a best effort for intensive traffic.
This gives more flexible and balanced traffic awareness than the
standard l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many times empty polls are executed.
A low number of empty polls means the current core is busy processing its
workload, so a higher frequency is needed. A high number of empty polls
indicates the current core is not doing any real work, so the frequency
can be lowered to save power.

In the current implementation, each core has 1 empty-poll counter, which
assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
future to support multiple queues per core.

2.1 Power state definition:

	LOW:  Not currently used, reserved for future use.

	MED:  the frequency is used to process modest traffic workload.

	HIGH: the frequency is used to process busy traffic workload.

2.2 There are two phases to establish the power management system:

	a.Initialization/Training phase. The training phase is necessary
	  in order to figure out the system polling baseline numbers from
	  idle to busy. The highest poll count will be during idle, where
	  all polls are empty. These poll counts will be different between
	  systems due to the many possible processor micro-arch, cache
	  and device configurations, hence the training phase.
	  In the training phase, traffic is blocked so the training
	  algorithm can average the empty-poll numbers for the LOW, MED and
	  HIGH power states in order to create a baseline.
	  Each core's counters are collected every 10ms, and the training
	  phase takes 2 seconds.
	  Training is disabled in the default configuration, in which case
	  the default parameters are applied. The sample app can still
	  trigger training if needed. Once the training phase has been
	  executed once on a system, the application can then be started
	  with the relevant thresholds provided on the command line,
	  allowing the application to start passing traffic immediately.

	b.Normal phase. Traffic starts immediately based on the default
	  thresholds, or based on the user supplied thresholds via the
	  command line parameters. The run-time poll counts are compared with
	  the baseline and the decision will be taken to move to MED power
	  state or HIGH power state. The counters are calculated every 10ms.
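
The normal-phase decision described in 2.2 can be sketched as a small state
function. The 70% and 30% figures come from the sample-app commit message in
this series (move to HIGH above 70%, fall back to MED below 30%). How an
empty-poll count is turned into a busyness percentage is an assumption made
for this sketch, and the names ep_state, busy_percent() and next_state() are
hypothetical; the library's own calculation may differ.

```c
#include <stdint.h>

/* Hypothetical state names for this sketch only. */
enum ep_state { EP_MED, EP_HIGH };

/*
 * Assumed mapping: the baseline is the empty-poll count of an idle
 * core, so fewer empty polls than the baseline means a busier core.
 */
static unsigned int
busy_percent(uint64_t empty_polls, uint64_t baseline)
{
	if (baseline == 0 || empty_polls >= baseline)
		return 0;
	return (unsigned int)(100 - (empty_polls * 100) / baseline);
}

/* Hysteresis: up above 70% busy, back down below 30% busy. */
static enum ep_state
next_state(enum ep_state cur, unsigned int busy)
{
	if (cur == EP_MED && busy > 70)
		return EP_HIGH;
	if (cur == EP_HIGH && busy < 30)
		return EP_MED;
	return cur;
}
```

The gap between the two limits keeps a core from flapping between MED and
HIGH when the load hovers around a single threshold.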

3. Proposed  API

1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
		uint8_t *freq_tlb, struct ep_policy *policy);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
which is used to detect empty poll state changes and then take action.
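
To illustrate how the two update calls in the list above are meant to be
driven from a worker's receive loop, here is a standalone sketch that only
mimics the counting. struct poll_counters, account_burst() and
replay_bursts() are hypothetical stand-ins, not the library implementation;
a real worker would call rte_power_empty_poll_stat_update(lcore_id) when a
poll returns no packets and rte_power_poll_stat_update(lcore_id, nb_rx)
otherwise.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-core counters kept by this sketch. */
struct poll_counters {
	uint64_t empty_polls;	/* polls that returned no packets */
	uint64_t valid_polls;	/* polls that returned at least one packet */
};

/* Account for one receive burst of nb_rx packets. */
static void
account_burst(struct poll_counters *c, uint16_t nb_rx)
{
	if (nb_rx == 0)
		c->empty_polls++;
	else
		c->valid_polls++;
}

/* Replay a recorded sequence of burst sizes and return the totals. */
static struct poll_counters
replay_bursts(const uint16_t *bursts, size_t n)
{
	struct poll_counters c = {0, 0};
	size_t i;

	for (i = 0; i < n; i++)
		account_burst(&c, bursts[i]);

	return c;
}
```

Because each core owns its counters and only that core updates them, the
"not thread safe" note on the update calls is not a problem in this usage.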

ChangeLog:
v2: fix some coding style issues.
v3: rename the filename, API name.
v4: no change.
v5: no change.
v6: re-work the code layout, update API.
v7: fix minor typo and lift node num limit.
v8: disable training as default option.
v9: minor git log update.
v10: update due to the code review comments.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>

Acked-by: David Hunt <david.hunt@intel.com>
---
 lib/librte_power/Makefile               |   6 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 545 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 223 +++++++++++++
 lib/librte_power/rte_power_version.map  |  13 +
 5 files changed, 788 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..a8f1301 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_power.a
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +17,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..c184647
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,545 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR 2
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	RTE_LOG(INFO, POWER, "Set the power freq to MED\n");
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+static inline void __attribute__((always_inline))
+set_policy(struct priority_worker *poll_stats,
+		struct ep_policy *policy)
+{
+	set_state(poll_stats, policy->state);
+
+	if (policy->state == TRAINING)
+		return;
+
+	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
+	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
+
+	poll_stats->thresh[MED_NORMAL].trained = true;
+	poll_stats->thresh[HGH_BUSY].trained = true;
+
+}
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%
+				 * this should remove any
+				 * false negatives when the system is 0% busy
+				 */
+				poll_stats->thresh[freq].base_edpi +=
+				poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
+
+		/* edpi: empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
+				"cur edpi %ld "
+				"base edpi %ld\n",
+				cur_edpi,
+				s->thresh[s->cur_freq].base_edpi);
+		/* Out-of-range value: the caller discards this sample */
+		return 1000UL;
+	}
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)(((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
+
+		enum freq_val cur_freq = poll_stats->cur_freq;
+
+		/* edpi mean empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "cur freq is %d, edpi is %lu\n",
+				cur_freq,
+				poll_stats->thresh[cur_freq].base_edpi);
+		return;
+	}
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100) {
+		/* edpi mean empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "Edpi is bigger than threshold\n");
+		return;
+	}
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not currently supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[MED].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, HGH_BUSY);
+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
+			}
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+				poll_stats->thresh[HGH].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, MED_NORMAL);
+				RTE_LOG(INFO, POWER, "MOVE to MED\n");
+			}
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	}
+}
+
+static int
+empty_poll_training(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "LOW threshold is %lu\n",
+				poll_stats->thresh[LOW].base_edpi);
+
+		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
+				poll_stats->thresh[MED].base_edpi);
+
+
+		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
+				poll_stats->thresh[HGH].base_edpi);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_training(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		rte_panic("Cannot allocate heap memory for ep_params "
+				"for socket %d\n", rte_socket_id());
+
+	if (freq_tlb == NULL) {
+		freq_index[LOW] = 14;
+		freq_index[MED] = 9;
+		freq_index[HGH] = 1;
+	} else {
+		freq_index[LOW] = freq_tlb[LOW];
+		freq_index[MED] = freq_tlb[MED];
+		freq_index[HGH] = freq_tlb[HGH];
+	}
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* Train for pre-defined period */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	*eptr = ep_params;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
+				total_avail_freqs[i],
+				i);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+		if (rte_get_master_lcore() != i) {
+			w->wrk_stats[i].lcore_id = i;
+			set_policy(&w->wrk_stats[i], policy);
+		}
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..d8cbb17
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS  RTE_MAX_LCORE_FREQS
+
+#define BINS_AV 4 /* Must be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         256  /* Max core number*/
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Each Worker Thread Empty Poll Stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	/* edpi: empty poll counter difference per interval */
+	uint64_t edpi_av[BINS_AV];
+	/* running write index into edpi_av */
+	uint32_t ec;
+	/* ppi: valid poll (packet) counter difference per interval */
+	uint64_t ppi_av[BINS_AV];
+	/* running write index into ppi_av */
+	uint32_t pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/* Sample App Init information */
+struct ep_policy {
+
+	uint64_t med_base_edpi;
+	uint64_t hgh_base_edpi;
+
+	enum queue_state state;
+};
+
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param eptr
+ *   the structure of empty poll configuration
+ * @param freq_tlb
+ *   the power state/frequency mapping table
+ * @param policy
+ *   the initialization policy from sample app
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The packet number of one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Empty poll  state change detection function
+ *
+ * @param  tim
+ *  The timer structure
+ * @param  arg
+ *  The customized parameter
+ */
+void  __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index dd587df..17a083b 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -33,3 +33,16 @@ DPDK_18.08 {
 	rte_power_get_capabilities;
 
 } DPDK_17.11;
+
+EXPERIMENTAL {
+        global:
+
+        rte_empty_poll_detection;
+        rte_power_empty_poll_stat_fetch;
+        rte_power_empty_poll_stat_free;
+        rte_power_empty_poll_stat_init;
+        rte_power_empty_poll_stat_update;
+        rte_power_poll_stat_fetch;
+        rte_power_poll_stat_update;
+
+};
-- 
2.7.5


* [dpdk-dev] [PATCH v11 2/5] examples/l3fwd-power: simple app update for new API
  2018-10-19 10:23                       ` [dpdk-dev] [PATCH v11 1/5] " Liang Ma
@ 2018-10-19 10:23                         ` Liang Ma
  2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 3/5] doc/guides/pro_guide/power-man: update the power API Liang Ma
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 10:23 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Add support for the new traffic-pattern-aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for full parameter usage.

The options "l", "m" and "h" set the frequency index for the LOW, MED
and HIGH power states. They only take effect when empty-poll is enabled.
--empty-poll="training_flag, med_threshold, high_threshold"

training_flag enables (1) or disables (0) training mode.

med_threshold is the user-supplied empty poll threshold for the
modest (MED) state.

high_threshold is the user-supplied empty poll threshold for the
busy (HIGH) state.

All three options default to 0. A threshold of 0 selects the built-in
default.
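
A rough stand-alone sketch of how such a three-field option could be parsed is
shown below. It is not the parser in this patch (which uses rte_strsplit and
strtoul): the struct and function names are hypothetical, and only the two
default thresholds are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical holder for the parsed --empty-poll fields. */
struct ep_cli_cfg {
	bool train;             /* training_flag == 1 */
	unsigned long med_edpi; /* med_threshold, or the default */
	unsigned long hgh_edpi; /* high_threshold, or the default */
};

/* Parse "training_flag,med_threshold,high_threshold". */
static int
parse_empty_poll_arg(const char *arg, struct ep_cli_cfg *cfg)
{
	unsigned long flag, med, hgh;
	char trailing;

	assert(arg != NULL && cfg != NULL);

	/* Defaults from the patch (trained on a 2.5GHz Xeon). */
	cfg->train = false;
	cfg->med_edpi = 350000UL;
	cfg->hgh_edpi = 580000UL;

	/* Require exactly three numbers and nothing after them. */
	if (sscanf(arg, "%lu,%lu,%lu%c", &flag, &med, &hgh, &trailing) != 3)
		return -1;

	cfg->train = (flag == 1);
	if (med > 0)
		cfg->med_edpi = med; /* 0 keeps the default */
	if (hgh > 0)
		cfg->hgh_edpi = hgh; /* 0 keeps the default */

	return 0;
}
```

Passing "0,0,0" therefore leaves training disabled and keeps both default
thresholds, while a string with fewer than three fields is rejected.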

Once empty-poll is enabled, the system applies the default parameters if
no other command line options are provided.

If training mode is enabled, the user should ensure that no traffic
passes through the system. When the training phase completes,
the application transitions to normal operation.

The system starts in the modest (MED) power state.
If the traffic load goes above 70%, the system moves to the HIGH power state.
If the traffic load drops below 30%, the system falls back to the modest
power state.

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.
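
The 10ms resolution corresponds to dividing the timer frequency by 100. The
helpers below sketch that arithmetic on plain integers; they are hypothetical
and stand in for the rte_get_timer_hz() / rte_rdtsc() based loop in the
sample app.

```c
#include <assert.h>
#include <stdint.h>

#define INTERVALS_PER_SECOND 100u /* 10ms resolution */

/* Number of timer ticks in one 10ms interval. */
static uint64_t
ep_interval_ticks(uint64_t timer_hz)
{
	assert(timer_hz >= INTERVALS_PER_SECOND);
	return timer_hz / INTERVALS_PER_SECOND;
}

/* True when more than one interval has elapsed since prev. */
static int
ep_tick_due(uint64_t prev, uint64_t now, uint64_t interval)
{
	/* Unsigned subtraction stays correct across counter wrap. */
	return (now - prev) > interval;
}
```

On a 2.5GHz timer this gives 25000000 ticks per interval.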

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.
v7 no change.
v8 disable training as default option.
v10 update due to review comments.
v11 add checking for empty poll init function return value.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>

Acked-by: David Hunt <david.hunt@intel.com>
---
 examples/l3fwd-power/Makefile    |   3 +
 examples/l3fwd-power/main.c      | 346 +++++++++++++++++++++++++++++++++++++--
 examples/l3fwd-power/meson.build |   1 +
 3 files changed, 333 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
index d7e39a3..772ec7b 100644
--- a/examples/l3fwd-power/Makefile
+++ b/examples/l3fwd-power/Makefile
@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
 LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
 LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
 	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
 
@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
 all:
 else
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 68527d2..c07eeff 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_power_empty_poll.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -55,6 +56,8 @@
 
 /* 100 ms interval */
 #define TIMER_NUMBER_PER_SECOND           10
+/* (10ms) */
+#define INTERVALS_PER_SECOND             100
 /* 100000 us */
 #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
 #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
@@ -117,6 +120,17 @@
  */
 #define RTE_TEST_RX_DESC_DEFAULT 1024
 #define RTE_TEST_TX_DESC_DEFAULT 1024
+
+/*
+ * These two thresholds were decided on by running the training algorithm on
+ * a 2.5GHz Xeon. These defaults can be overridden by supplying non-zero values
+ * for the med_threshold and high_threshold parameters on the command line.
+ */
+#define EMPTY_POLL_MED_THRESHOLD 350000UL
+#define EMPTY_POLL_HGH_THRESHOLD 580000UL
+
+
+
 static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
 static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
 
@@ -132,6 +146,14 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* empty poll is disabled by default. */
+static bool empty_poll_on;
+static bool empty_poll_train;
+volatile bool empty_poll_stop;
+static struct  ep_params *ep_params;
+static struct  ep_policy policy;
+static long  ep_med_edpi, ep_hgh_edpi;
+
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -330,6 +352,19 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+
+/*
+ * These defaults are using the max frequency index (1), a medium index (9)
+ * and a typical low frequency index (14). These can be adjusted to use
+ * different indexes using the relevant command line parameters.
+ */
+static uint8_t  freq_tlb[] = {14, 9, 1};
+
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
+
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -338,7 +373,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -351,16 +394,19 @@ signal_exit_now(int sigtype)
 							"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -825,7 +871,110 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) /
+		US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n",
+			lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET],
+							void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
 
+	}
+
+	return 0;
+}
 /* main processing loop */
 static int
 main_loop(__attribute__((unused)) void *dummy)
@@ -1127,7 +1276,9 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection"
+		" followed by (training_flag, med_threshold, high_threshold)\n",
 		prgname);
 }
 
@@ -1220,7 +1371,55 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+static int
+parse_ep_config(const char *q_arg)
+{
+	char s[256];
+	const char *p = q_arg;
+	char *end;
+	int  num_arg;
+
+	char *str_fld[3];
+
+	int training_flag;
+	int med_edpi;
+	int hgh_edpi;
+
+	ep_med_edpi = EMPTY_POLL_MED_THRESHOLD;
+	ep_hgh_edpi = EMPTY_POLL_HGH_THRESHOLD;
+
+	snprintf(s, sizeof(s), "%s", p);
+
+	num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ',');
+
+	empty_poll_train = false;
 
+	if (num_arg == 0)
+		return 0;
+
+	if (num_arg == 3) {
+
+		training_flag = strtoul(str_fld[0], &end, 0);
+		med_edpi = strtoul(str_fld[1], &end, 0);
+		hgh_edpi = strtoul(str_fld[2], &end, 0);
+
+		if (training_flag == 1)
+			empty_poll_train = true;
+
+		if (med_edpi > 0)
+			ep_med_edpi = med_edpi;
+
+		if (hgh_edpi > 0)
+			ep_hgh_edpi = hgh_edpi;
+
+	} else {
+
+		return -1;
+	}
+
+	return 0;
+
+}
 #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
 
 /* Parse the argument given in the command line of the application */
@@ -1230,6 +1429,7 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
@@ -1237,13 +1437,14 @@ parse_args(int argc, char **argv)
 		{"high-perf-cores", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
+		{"empty-poll", 1, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
 	argvopt = argv;
 
-	while ((opt = getopt_long(argc, argvopt, "p:P",
+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
 				lgopts, &option_index)) != EOF) {
 
 		switch (opt) {
@@ -1260,7 +1461,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[LOW] = limit;
+			break;
+		case 'm':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[MED] = limit;
+			break;
+		case 'h':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[HGH] = limit;
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1299,6 +1511,20 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+				ret = parse_ep_config(optarg);
+
+				if (ret) {
+					printf("invalid empty poll config\n");
+					print_usage(prgname);
+					return -1;
+				}
+
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1646,6 +1872,59 @@ init_power_library(void)
 	}
 	return ret;
 }
+static void
+empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			rte_empty_poll_detection,
+			(void *)ep_ptr);
+
+}
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 
 int
 main(int argc, char **argv)
@@ -1828,13 +2107,15 @@ main(int argc, char **argv)
 		if (rte_lcore_is_enabled(lcore_id) == 0)
 			continue;
 
-		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			/* init timer structures for each enabled lcore */
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND,
+					SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
@@ -1905,12 +2186,43 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true) {
+
+		if (empty_poll_train) {
+			policy.state = TRAINING;
+		} else {
+			policy.state = MED_NORMAL;
+			policy.med_base_edpi = ep_med_edpi;
+			policy.hgh_base_edpi = ep_hgh_edpi;
+		}
+
+		ret = rte_power_empty_poll_stat_init(&ep_params,
+				freq_tlb,
+				&policy);
+		if (ret < 0)
+			rte_exit(EXIT_FAILURE, "empty poll init failed\n");
+	}
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL,
+				SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
index 20c8054..a3c5c2f 100644
--- a/examples/l3fwd-power/meson.build
+++ b/examples/l3fwd-power/meson.build
@@ -9,6 +9,7 @@
 if host_machine.system() != 'linux'
 	build = false
 endif
+allow_experimental_apis = true
 deps += ['power', 'timer', 'lpm', 'hash']
 sources = files(
 	'main.c', 'perf_core.c'
-- 
2.7.5


* [dpdk-dev] [PATCH v11 3/5] doc/guides/pro_guide/power-man: update the power API
  2018-10-19 10:23                       ` [dpdk-dev] [PATCH v11 1/5] " Liang Ma
  2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-10-19 10:23                         ` Liang Ma
  2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 5/5] doc: update release notes for empty poll library Liang Ma
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
  3 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 10:23 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Update the document for empty poll API.

Change Logs:
v9: minor changes for syntax. Update document.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Acked-by: David Hunt <david.hunt@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 86 +++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index eba1cc6..68b7e8b 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -106,6 +106,92 @@ User Cases
 
 The power management mechanism is used to save power when performing L3 forwarding.
 
+
+Empty Poll API
+--------------
+
+Abstract
+~~~~~~~~
+
+For packet processing workloads such as DPDK, polling is continuous.
+This means CPU cores always show 100% busy, independent of how much work
+those cores are doing. It is important to determine accurately how busy
+a core really is, because otherwise:
+
+        * There is no indication of overload conditions.
+        * The user does not know how much real load is on the system, so
+          energy is wasted because no power management is applied.
+
+Compared to the original l3fwd-power design, instead of going to sleep
+after detecting an empty poll, the new mechanism just lowers the core frequency.
+As a result, the application does not stop polling the device, which leads
+to improved handling of bursts of traffic.
+
+When the system becomes busy, the empty poll mechanism can also increase the
+core frequency (including turbo) to make a best effort at handling intensive
+traffic. This gives more flexible and balanced traffic awareness than the
+standard l3fwd-power application.
+
+
+Proposed Solution
+~~~~~~~~~~~~~~~~~
+The proposed solution focuses on how many times empty polls are executed.
+A low number of empty polls means the current core is busy processing its
+workload, so a higher frequency is needed. A high number of empty polls
+indicates the current core is not doing any real work, so the frequency can
+be lowered to save power.
+
+In the current implementation, each core has 1 empty-poll counter, which
+assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
+future to support multiple queues per core.
+
+Power state definition:
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* LOW:  Not currently used, reserved for future use.
+
+* MED:  the frequency used to process a modest traffic workload.
+
+* HIGH: the frequency used to process a busy traffic workload.
+
+There are two phases to establish the power management system:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* Training phase. This phase is used to measure the optimal frequency
+  change thresholds for a given system. The thresholds will differ from
+  system to system due to differences in processor micro-architecture,
+  cache and device configurations.
+  In this phase, the user must ensure that no traffic can enter the
+  system so that counts can be measured for empty polls at low, medium
+  and high frequencies. Each frequency is measured for two seconds.
+  Once the training phase is complete, the threshold numbers are
+  displayed, normal mode resumes, and traffic can be allowed into
+  the system. These threshold numbers can be used on the command line
+  when starting the application in normal mode to avoid re-training
+  every time.
+
+* Normal phase. Every 10ms the run-time counters are compared
+  to the supplied threshold values, and the decision will be made
+  whether to move to a different power state (by adjusting the
+  frequency).
+
+API Overview for Empty Poll Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **State Init**: initialize the power management system.
+
+* **State Free**: free the resources held by the power management system.
+
+* **Update Empty Poll Counter**: update the empty poll counter.
+
+* **Update Valid Poll Counter**: update the valid poll counter.
+
+* **Set the Frequency Index**: update the power state/frequency mapping.
+
+* **Detect empty poll state change**: detect empty poll state changes and take action.
+
+User Cases
+----------
+The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
+
 References
 ----------
 
-- 
2.7.5


* [dpdk-dev] [PATCH v11 5/5] doc: update release notes for empty poll library
  2018-10-19 10:23                       ` [dpdk-dev] [PATCH v11 1/5] " Liang Ma
  2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
  2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 3/5] doc/guides/pro_guide/power-man: update the power API Liang Ma
@ 2018-10-19 10:23                         ` Liang Ma
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
  3 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 10:23 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/rel_notes/release_18_11.rst | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
index a8327ea..bbfa8d6 100644
--- a/doc/guides/rel_notes/release_18_11.rst
+++ b/doc/guides/rel_notes/release_18_11.rst
@@ -97,6 +97,16 @@ New Features
   the SW eventdev PMD, sacrifices load balancing performance to
   gain better event scheduling throughput and scalability.
 
+* **Added Traffic Pattern Aware Power Control Library**
+
+  Added an experimental library. This extends the Power Library and provides
+  empty_poll APIs. This feature measures how many times empty polls are
+  executed per core and uses the number of empty polls as a hint for system
+  power management.
+
+  See the :doc:`../prog_guide/power_man` section of the DPDK Programmers
+  Guide document for more information.
+
 * **Added ability to switch queue deferred start flag on testpmd app.**
 
   Added a console command to testpmd app, giving ability to switch
@@ -104,7 +114,6 @@ New Features
   the specified port. The port must be stopped before the command call in order
   to reconfigure queues.
 
-
 API Changes
 -----------
 
@@ -118,6 +127,16 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* power: Traffic Pattern Aware Control APIs are marked as experimental:
+
+  - ``rte_power_empty_poll_stat_init``
+  - ``rte_power_empty_poll_stat_free``
+  - ``rte_power_empty_poll_stat_update``
+  - ``rte_power_empty_poll_stat_fetch``
+  - ``rte_power_poll_stat_update``
+  - ``rte_power_poll_stat_fetch``
+  - ``rte_empty_poll_detection``
+
 * mbuf: The ``__rte_mbuf_raw_free()`` and ``__rte_pktmbuf_prefree_seg()``
   functions were deprecated since 17.05 and are replaced by
   ``rte_mbuf_raw_free()`` and ``rte_pktmbuf_prefree_seg()``.
-- 
2.7.5


* [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control
  2018-10-19 10:23                       ` [dpdk-dev] [PATCH v11 1/5] " Liang Ma
                                           ` (2 preceding siblings ...)
  2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 5/5] doc: update release notes for empty poll library Liang Ma
@ 2018-10-19 11:07                         ` Liang Ma
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
                                             ` (6 more replies)
  3 siblings, 7 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 11:07 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

1. Abstract

For packet processing workloads such as DPDK, polling is continuous.
This means CPU cores always show 100% busy, independent of how much work
those cores are doing. It is important to determine accurately how busy
a core really is, because otherwise:

   * There is no indication of overload conditions.

   * The user does not know how much real load is on the system, so
     energy is wasted because no power management is applied.

Compared to the original l3fwd-power design, instead of going to sleep
after detecting an empty poll, the new mechanism just lowers the core
frequency. As a result, the application does not stop polling the device,
which leads to improved handling of bursts of traffic.

When the system becomes busy, the empty poll mechanism can also increase the
core frequency (including turbo) to make a best effort at handling intensive
traffic. This gives more flexible and balanced traffic awareness than the
standard l3fwd-power application.

2. Proposed solution

The proposed solution focuses on how many times empty polls are executed.
A low number of empty polls means the current core is busy processing its
workload, so a higher frequency is needed. A high number of empty polls
indicates the current core is not doing any real work, so the frequency can
be lowered to save power.

In the current implementation, each core has 1 empty-poll counter, which
assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
future to support multiple queues per core.
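
As a rough illustration of the idea above, the sketch below averages the
empty-poll count over the last four intervals and expresses it as a busy
percentage relative to an idle baseline. It is a simplified, self-contained
model and not the code from this patch: every `sketch_` name is hypothetical,
and the out-of-range check is applied to the averaged value here purely to
keep the example short.

```c
#include <stdint.h>

#define SKETCH_BINS 4 /* window length, same value as BINS_AV in the patch */

struct sketch_window {
	uint64_t bin[SKETCH_BINS]; /* empty polls seen in recent intervals */
	uint32_t next;             /* number of samples recorded so far */
};

/* Record one interval's empty-poll count and return the window mean.
 * Bins that have not been filled yet count as zero.
 */
static uint64_t
sketch_window_add(struct sketch_window *w, uint64_t empty_polls)
{
	uint64_t sum = 0;
	uint32_t i;

	w->bin[w->next++ % SKETCH_BINS] = empty_polls;
	for (i = 0; i < SKETCH_BINS; i++)
		sum += w->bin[i];

	return sum / SKETCH_BINS;
}

/* Convert a mean empty-poll count into a busy percentage (0..100).
 * Returns 1000 as an out-of-range marker when no baseline is available
 * or when the count exceeds the baseline.
 */
static uint32_t
sketch_busy_percent(uint64_t mean_empty, uint64_t baseline)
{
	if (baseline == 0 || mean_empty > baseline)
		return 1000;

	return 100 - (uint32_t)(mean_empty * 100 / baseline);
}
```

A core whose empty-poll count equals the baseline is 0% busy, and a core
that never sees an empty poll is 100% busy.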

2.1 Power state definition:

	LOW:  Not currently used, reserved for future use.

	MED:  the frequency used to process a modest traffic workload.

	HIGH: the frequency used to process a busy traffic workload.

2.2 There are two phases to establish the power management system:

	a.Initialization/Training phase. The training phase is necessary
	  in order to figure out the system polling baseline numbers from
	  idle to busy. The highest poll count will be during idle, where
	  all polls are empty. These poll counts will be different between
	  systems due to the many possible processor micro-arch, cache
	  and device configurations, hence the training phase.
	  In the training phase, traffic is blocked so the training
	  algorithm can average the empty-poll numbers for the LOW, MED and
	  HIGH power states in order to create a baseline.
	  The cores' counters are collected every 10ms, and the training
	  phase takes 2 seconds.
	  Training is disabled by default, in which case the default
	  thresholds are applied. The sample app can still trigger training
	  if needed. Once the training phase has been executed once on
	  a system, the application can then be started with the relevant
	  thresholds provided on the command line, allowing the application
	  to start passing traffic immediately.

	b.Normal phase. Traffic starts immediately based on the default
	  thresholds, or based on the user supplied thresholds via the
	  command line parameters. The run-time poll counts are compared with
	  the baseline and the decision will be taken to move to MED power
	  state or HIGH power state. The counters are calculated every 10ms.
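
The two phases can be modelled with a small, self-contained sketch. The
70%/30% thresholds, the 100 intervals per second and the 0.05% margin are
taken from the constants in this patch; the `sketch_` names, the struct
layout and the function boundaries are assumptions made for illustration,
and no frequency is actually changed here.

```c
#include <stdint.h>

#define SKETCH_INTERVALS_PER_SECOND 100 /* one sample every 10ms */
#define SKETCH_MED_TO_HIGH_PERCENT  70
#define SKETCH_HIGH_TO_MED_PERCENT  30

enum sketch_state { SKETCH_MED, SKETCH_HIGH };

struct sketch_core {
	enum sketch_state state;
	uint32_t streak; /* consecutive intervals past the threshold */
};

/* Training: average the empty polls summed over the training window,
 * then add 0.05% so a fully idle core does not look slightly busy.
 */
static uint64_t
sketch_baseline(uint64_t total_empty_polls, uint32_t intervals)
{
	uint64_t base;

	if (intervals == 0)
		return 0;

	base = total_empty_polls / intervals;
	return base + base / 2000;
}

/* Normal phase: feed one interval's busy percentage and return the
 * state afterwards. The state only changes after the threshold has been
 * crossed for more than a full second of consecutive intervals.
 */
static enum sketch_state
sketch_step(struct sketch_core *c, uint32_t busy_percent)
{
	int past;

	if (busy_percent > 100) /* out-of-range sample, ignore it */
		return c->state;

	if (c->state == SKETCH_MED)
		past = busy_percent > SKETCH_MED_TO_HIGH_PERCENT;
	else
		past = busy_percent < SKETCH_HIGH_TO_MED_PERCENT;

	if (!past) {
		c->streak = 0;
	} else if (c->streak < SKETCH_INTERVALS_PER_SECOND) {
		c->streak++;
	} else {
		c->state = (c->state == SKETCH_MED) ? SKETCH_HIGH : SKETCH_MED;
		c->streak = 0;
	}

	return c->state;
}
```

Requiring a sustained crossing, together with the gap between the 70% and
30% thresholds, keeps a core from bouncing between MED and HIGH on short
bursts.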

3. Proposed API

1.  rte_power_empty_poll_stat_init(struct ep_params **eptr,
		uint8_t *freq_tlb, struct ep_policy *policy);
which is used to initialize the power management system.

2.  rte_power_empty_poll_stat_free(void);
which is used to free the resources held by the power management system.

3.  rte_power_empty_poll_stat_update(unsigned int lcore_id);
which is used to update a specific core's empty poll counter; not thread safe.

4.  rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
which is used to update a specific core's valid poll counter; not thread safe.

5.  rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's empty poll counter.

6.  rte_power_poll_stat_fetch(unsigned int lcore_id);
which is used to get a specific core's valid poll counter.

7.  rte_empty_poll_detection(struct rte_timer *tim, void *arg);
which is used to detect empty poll state changes and then take action.

ChangeLog:
v2: fix some coding style issues.
v3: rename the filename, API name.
v4: no change.
v5: no change.
v6: re-work the code layout, update API.
v7: fix minor typo and lift node num limit.
v8: disable training as default option.
v9: minor git log update.
v10: update due to the code review comments.
v12: remove rte_panic

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>

Acked-by: David Hunt <david.hunt@intel.com>
---
 lib/librte_power/Makefile               |   6 +-
 lib/librte_power/meson.build            |   5 +-
 lib/librte_power/rte_power_empty_poll.c | 544 ++++++++++++++++++++++++++++++++
 lib/librte_power/rte_power_empty_poll.h | 223 +++++++++++++
 lib/librte_power/rte_power_version.map  |  13 +
 5 files changed, 787 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_power/rte_power_empty_poll.c
 create mode 100644 lib/librte_power/rte_power_empty_poll.h

diff --git a/lib/librte_power/Makefile b/lib/librte_power/Makefile
index 6f85e88..a8f1301 100644
--- a/lib/librte_power/Makefile
+++ b/lib/librte_power/Makefile
@@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_power.a
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -fno-strict-aliasing
-LDLIBS += -lrte_eal
+LDLIBS += -lrte_eal -lrte_timer
 
 EXPORT_MAP := rte_power_version.map
 
@@ -16,8 +17,9 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) := rte_power.c power_acpi_cpufreq.c
 SRCS-$(CONFIG_RTE_LIBRTE_POWER) += power_kvm_vm.c guest_channel.c
+SRCS-$(CONFIG_RTE_LIBRTE_POWER) += rte_power_empty_poll.c
 
 # install this header file
-SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_POWER)-include := rte_power.h  rte_power_empty_poll.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_power/meson.build b/lib/librte_power/meson.build
index 253173f..63957eb 100644
--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -5,5 +5,6 @@ if host_machine.system() != 'linux'
 	build = false
 endif
 sources = files('rte_power.c', 'power_acpi_cpufreq.c',
-		'power_kvm_vm.c', 'guest_channel.c')
-headers = files('rte_power.h')
+		'power_kvm_vm.c', 'guest_channel.c',
+		'rte_power_empty_poll.c')
+headers = files('rte_power.h','rte_power_empty_poll.h')
diff --git a/lib/librte_power/rte_power_empty_poll.c b/lib/librte_power/rte_power_empty_poll.c
new file mode 100644
index 0000000..c1e10e0
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.c
@@ -0,0 +1,544 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#include <string.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+#include <rte_atomic.h>
+#include <rte_malloc.h>
+
+#include "rte_power.h"
+#include "rte_power_empty_poll.h"
+
+#define INTERVALS_PER_SECOND 100     /* (10ms) */
+#define SECONDS_TO_TRAIN_FOR 2
+#define DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD 70
+#define DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD 30
+#define DEFAULT_CYCLES_PER_PACKET 800
+
+static struct ep_params *ep_params;
+static uint32_t med_to_high_threshold = DEFAULT_MED_TO_HIGH_PERCENT_THRESHOLD;
+static uint32_t high_to_med_threshold = DEFAULT_HIGH_TO_MED_PERCENT_THRESHOLD;
+
+static uint32_t avail_freqs[RTE_MAX_LCORE][NUM_FREQS];
+
+static uint32_t total_avail_freqs[RTE_MAX_LCORE];
+
+static uint32_t freq_index[NUM_FREQ];
+
+static uint32_t
+get_freq_index(enum freq_val index)
+{
+	return freq_index[index];
+}
+
+
+static int
+set_power_freq(int lcore_id, enum freq_val freq, bool specific_freq)
+{
+	int err = 0;
+	uint32_t power_freq_index;
+	if (!specific_freq)
+		power_freq_index = get_freq_index(freq);
+	else
+		power_freq_index = freq;
+
+	err = rte_power_set_freq(lcore_id, power_freq_index);
+
+	return err;
+}
+
+
+static inline void __attribute__((always_inline))
+exit_training_state(struct priority_worker *poll_stats)
+{
+	RTE_SET_USED(poll_stats);
+}
+
+static inline void __attribute__((always_inline))
+enter_training_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->cur_freq = LOW;
+	poll_stats->queue_state = TRAINING;
+}
+
+static inline void __attribute__((always_inline))
+enter_normal_state(struct priority_worker *poll_stats)
+{
+	/* Clear the averages arrays and strs */
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = MED;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = MED_NORMAL;
+	RTE_LOG(INFO, POWER, "Set the power freq to MED\n");
+	set_power_freq(poll_stats->lcore_id, MED, false);
+
+	poll_stats->thresh[MED].threshold_percent = med_to_high_threshold;
+	poll_stats->thresh[HGH].threshold_percent = high_to_med_threshold;
+}
+
+static inline void __attribute__((always_inline))
+enter_busy_state(struct priority_worker *poll_stats)
+{
+	memset(poll_stats->edpi_av, 0, sizeof(poll_stats->edpi_av));
+	poll_stats->ec = 0;
+	memset(poll_stats->ppi_av, 0, sizeof(poll_stats->ppi_av));
+	poll_stats->pc = 0;
+
+	poll_stats->cur_freq = HGH;
+	poll_stats->iter_counter = 0;
+	poll_stats->threshold_ctr = 0;
+	poll_stats->queue_state = HGH_BUSY;
+	set_power_freq(poll_stats->lcore_id, HGH, false);
+}
+
+static inline void __attribute__((always_inline))
+enter_purge_state(struct priority_worker *poll_stats)
+{
+	poll_stats->iter_counter = 0;
+	poll_stats->queue_state = LOW_PURGE;
+}
+
+static inline void __attribute__((always_inline))
+set_state(struct priority_worker *poll_stats,
+		enum queue_state new_state)
+{
+	enum queue_state old_state = poll_stats->queue_state;
+	if (old_state != new_state) {
+
+		/* Call any old state exit functions */
+		if (old_state == TRAINING)
+			exit_training_state(poll_stats);
+
+		/* Call any new state entry functions */
+		if (new_state == TRAINING)
+			enter_training_state(poll_stats);
+		if (new_state == MED_NORMAL)
+			enter_normal_state(poll_stats);
+		if (new_state == HGH_BUSY)
+			enter_busy_state(poll_stats);
+		if (new_state == LOW_PURGE)
+			enter_purge_state(poll_stats);
+	}
+}
+
+static inline void __attribute__((always_inline))
+set_policy(struct priority_worker *poll_stats,
+		struct ep_policy *policy)
+{
+	set_state(poll_stats, policy->state);
+
+	if (policy->state == TRAINING)
+		return;
+
+	poll_stats->thresh[MED_NORMAL].base_edpi = policy->med_base_edpi;
+	poll_stats->thresh[HGH_BUSY].base_edpi = policy->hgh_base_edpi;
+
+	poll_stats->thresh[MED_NORMAL].trained = true;
+	poll_stats->thresh[HGH_BUSY].trained = true;
+
+}
+
+static void
+update_training_stats(struct priority_worker *poll_stats,
+		uint32_t freq,
+		bool specific_freq,
+		uint32_t max_train_iter)
+{
+	RTE_SET_USED(specific_freq);
+
+	char pfi_str[32];
+	uint64_t p0_empty_deq;
+
+	sprintf(pfi_str, "%02d", freq);
+
+	if (poll_stats->cur_freq == freq &&
+			poll_stats->thresh[freq].trained == false) {
+		if (poll_stats->thresh[freq].cur_train_iter == 0) {
+
+			set_power_freq(poll_stats->lcore_id,
+					freq, specific_freq);
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].cur_train_iter++;
+
+			return;
+		} else if (poll_stats->thresh[freq].cur_train_iter
+				<= max_train_iter) {
+
+			p0_empty_deq = poll_stats->empty_dequeues -
+				poll_stats->empty_dequeues_prev;
+
+			poll_stats->empty_dequeues_prev =
+				poll_stats->empty_dequeues;
+
+			poll_stats->thresh[freq].base_edpi += p0_empty_deq;
+			poll_stats->thresh[freq].cur_train_iter++;
+
+		} else {
+			if (poll_stats->thresh[freq].trained == false) {
+				poll_stats->thresh[freq].base_edpi =
+					poll_stats->thresh[freq].base_edpi /
+					max_train_iter;
+
+				/* Add on a factor of 0.05%
+				 * this should remove any
+				 * false negatives when the system is 0% busy
+				 */
+				poll_stats->thresh[freq].base_edpi +=
+				poll_stats->thresh[freq].base_edpi / 2000;
+
+				poll_stats->thresh[freq].trained = true;
+				poll_stats->cur_freq++;
+
+			}
+		}
+	}
+}
+
+static inline uint32_t __attribute__((always_inline))
+update_stats(struct priority_worker *poll_stats)
+{
+	uint64_t tot_edpi = 0, tot_ppi = 0;
+	uint32_t j, percent;
+
+	struct priority_worker *s = poll_stats;
+
+	uint64_t cur_edpi = s->empty_dequeues - s->empty_dequeues_prev;
+
+	s->empty_dequeues_prev = s->empty_dequeues;
+
+	uint64_t ppi = s->num_dequeue_pkts - s->num_dequeue_pkts_prev;
+
+	s->num_dequeue_pkts_prev = s->num_dequeue_pkts;
+
+	if (s->thresh[s->cur_freq].base_edpi < cur_edpi) {
+
+		/* edpi mean empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "cur_edpi is too large "
+				"cur edpi %ld "
+				"base edpi %ld\n",
+				cur_edpi,
+				s->thresh[s->cur_freq].base_edpi);
+		/* Out-of-range value so the caller skips this sample */
+		return 1000UL;
+	}
+
+	s->edpi_av[s->ec++ % BINS_AV] = cur_edpi;
+	s->ppi_av[s->pc++ % BINS_AV] = ppi;
+
+	for (j = 0; j < BINS_AV; j++) {
+		tot_edpi += s->edpi_av[j];
+		tot_ppi += s->ppi_av[j];
+	}
+
+	tot_edpi = tot_edpi / BINS_AV;
+
+	percent = 100 - (uint32_t)(((float)tot_edpi /
+			(float)s->thresh[s->cur_freq].base_edpi) * 100);
+
+	return (uint32_t)percent;
+}
+
+
+static inline void  __attribute__((always_inline))
+update_stats_normal(struct priority_worker *poll_stats)
+{
+	uint32_t percent;
+
+	if (poll_stats->thresh[poll_stats->cur_freq].base_edpi == 0) {
+
+		enum freq_val cur_freq = poll_stats->cur_freq;
+
+		/* edpi means the empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "cur freq is %d, edpi is %lu\n",
+				cur_freq,
+				poll_stats->thresh[cur_freq].base_edpi);
+		return;
+	}
+
+	percent = update_stats(poll_stats);
+
+	if (percent > 100) {
+		/* edpi mean empty poll counter difference per interval */
+		RTE_LOG(DEBUG, POWER, "Edpi is bigger than threshold\n");
+		return;
+	}
+
+	if (poll_stats->cur_freq == LOW)
+		RTE_LOG(INFO, POWER, "Purge Mode is not currently supported\n");
+	else if (poll_stats->cur_freq == MED) {
+
+		if (percent >
+			poll_stats->thresh[MED].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, HGH_BUSY);
+				RTE_LOG(INFO, POWER, "MOVE to HGH\n");
+			}
+
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	} else if (poll_stats->cur_freq == HGH) {
+
+		if (percent <
+				poll_stats->thresh[HGH].threshold_percent) {
+
+			if (poll_stats->threshold_ctr < INTERVALS_PER_SECOND)
+				poll_stats->threshold_ctr++;
+			else {
+				set_state(poll_stats, MED_NORMAL);
+				RTE_LOG(INFO, POWER, "MOVE to MED\n");
+			}
+		} else {
+			/* reset */
+			poll_stats->threshold_ctr = 0;
+		}
+
+	}
+}
+
+static int
+empty_poll_training(struct priority_worker *poll_stats,
+		uint32_t max_train_iter)
+{
+
+	if (poll_stats->iter_counter < INTERVALS_PER_SECOND) {
+		poll_stats->iter_counter++;
+		return 0;
+	}
+
+
+	update_training_stats(poll_stats,
+			LOW,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			MED,
+			false,
+			max_train_iter);
+
+	update_training_stats(poll_stats,
+			HGH,
+			false,
+			max_train_iter);
+
+
+	if (poll_stats->thresh[LOW].trained == true
+			&& poll_stats->thresh[MED].trained == true
+			&& poll_stats->thresh[HGH].trained == true) {
+
+		set_state(poll_stats, MED_NORMAL);
+
+		RTE_LOG(INFO, POWER, "LOW threshold is %lu\n",
+				poll_stats->thresh[LOW].base_edpi);
+
+		RTE_LOG(INFO, POWER, "MED threshold is %lu\n",
+				poll_stats->thresh[MED].base_edpi);
+
+
+		RTE_LOG(INFO, POWER, "HIGH threshold is %lu\n",
+				poll_stats->thresh[HGH].base_edpi);
+
+		RTE_LOG(INFO, POWER, "Training is Complete for %d\n",
+				poll_stats->lcore_id);
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg)
+{
+
+	uint32_t i;
+
+	struct priority_worker *poll_stats;
+
+	RTE_SET_USED(tim);
+
+	RTE_SET_USED(arg);
+
+	for (i = 0; i < NUM_NODES; i++) {
+
+		poll_stats = &(ep_params->wrk_data.wrk_stats[i]);
+
+		if (rte_lcore_is_enabled(poll_stats->lcore_id) == 0)
+			continue;
+
+		switch (poll_stats->queue_state) {
+		case(TRAINING):
+			empty_poll_training(poll_stats,
+					ep_params->max_train_iter);
+			break;
+
+		case(HGH_BUSY):
+		case(MED_NORMAL):
+			update_stats_normal(poll_stats);
+			break;
+
+		case(LOW_PURGE):
+			break;
+		default:
+			break;
+
+		}
+
+	}
+
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy)
+{
+	uint32_t i;
+	/* Allocate the ep_params structure */
+	ep_params = rte_zmalloc_socket(NULL,
+			sizeof(struct ep_params),
+			0,
+			rte_socket_id());
+
+	if (!ep_params)
+		return -1;
+
+	if (freq_tlb == NULL) {
+		freq_index[LOW] = 14;
+		freq_index[MED] = 9;
+		freq_index[HGH] = 1;
+	} else {
+		freq_index[LOW] = freq_tlb[LOW];
+		freq_index[MED] = freq_tlb[MED];
+		freq_index[HGH] = freq_tlb[HGH];
+	}
+
+	RTE_LOG(INFO, POWER, "Initialize the Empty Poll\n");
+
+	/* Train for pre-defined period */
+	ep_params->max_train_iter = INTERVALS_PER_SECOND * SECONDS_TO_TRAIN_FOR;
+
+	struct stats_data *w = &ep_params->wrk_data;
+
+	*eptr = ep_params;
+
+	/* initialize all wrk_stats state */
+	for (i = 0; i < NUM_NODES; i++) {
+
+		if (rte_lcore_is_enabled(i) == 0)
+			continue;
+		/*init the freqs table */
+		total_avail_freqs[i] = rte_power_freqs(i,
+				avail_freqs[i],
+				NUM_FREQS);
+
+		RTE_LOG(INFO, POWER, "total avail freq is %d , lcoreid %d\n",
+				total_avail_freqs[i],
+				i);
+
+		if (get_freq_index(LOW) > total_avail_freqs[i])
+			return -1;
+
+		if (rte_get_master_lcore() != i) {
+			w->wrk_stats[i].lcore_id = i;
+			set_policy(&w->wrk_stats[i], policy);
+		}
+	}
+
+	return 0;
+}
+
+void __rte_experimental
+rte_power_empty_poll_stat_free(void)
+{
+
+	RTE_LOG(INFO, POWER, "Close the Empty Poll\n");
+
+	if (ep_params != NULL)
+		rte_free(ep_params);
+}
+
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->empty_dequeues++;
+
+	return 0;
+}
+
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt)
+{
+
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	poll_stats->num_dequeue_pkts += nb_pkt;
+
+	return 0;
+}
+
+
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->empty_dequeues;
+}
+
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id)
+{
+	struct priority_worker *poll_stats;
+
+	if (lcore_id >= NUM_NODES)
+		return -1;
+
+	poll_stats = &(ep_params->wrk_data.wrk_stats[lcore_id]);
+
+	if (poll_stats->lcore_id == 0)
+		poll_stats->lcore_id = lcore_id;
+
+	return poll_stats->num_dequeue_pkts;
+}
diff --git a/lib/librte_power/rte_power_empty_poll.h b/lib/librte_power/rte_power_empty_poll.h
new file mode 100644
index 0000000..d8cbb17
--- /dev/null
+++ b/lib/librte_power/rte_power_empty_poll.h
@@ -0,0 +1,223 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2018 Intel Corporation
+ */
+
+#ifndef _RTE_EMPTY_POLL_H
+#define _RTE_EMPTY_POLL_H
+
+/**
+ * @file
+ * RTE Power Management
+ */
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <rte_common.h>
+#include <rte_byteorder.h>
+#include <rte_log.h>
+#include <rte_string_fns.h>
+#include <rte_power.h>
+#include <rte_timer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#define NUM_FREQS  RTE_MAX_LCORE_FREQS
+
+#define BINS_AV 4 /* Must be a power of 2 */
+
+#define DROP (NUM_DIRECTIONS * NUM_DEVICES)
+
+#define NUM_PRIORITIES          2
+
+#define NUM_NODES         256  /* Max core number*/
+
+/* Processor Power State */
+enum freq_val {
+	LOW,
+	MED,
+	HGH,
+	NUM_FREQ = NUM_FREQS
+};
+
+
+/* Queue Polling State */
+enum queue_state {
+	TRAINING, /* NO TRAFFIC */
+	MED_NORMAL,   /* MED */
+	HGH_BUSY,     /* HIGH */
+	LOW_PURGE,    /* LOW */
+};
+
+/* Queue Stats */
+struct freq_threshold {
+
+	uint64_t base_edpi;
+	bool trained;
+	uint32_t threshold_percent;
+	uint32_t cur_train_iter;
+};
+
+/* Each Worker Thread Empty Poll Stats */
+struct priority_worker {
+
+	/* Current dequeue and throughput counts */
+	/* These 2 are written to by the worker threads */
+	/* So keep them on their own cache line */
+	uint64_t empty_dequeues;
+	uint64_t num_dequeue_pkts;
+
+	enum queue_state queue_state;
+
+	uint64_t empty_dequeues_prev;
+	uint64_t num_dequeue_pkts_prev;
+
+	/* Used for training only */
+	struct freq_threshold thresh[NUM_FREQ];
+	enum freq_val cur_freq;
+
+	/* bucket arrays to calculate the averages */
+	/* edpi means empty poll counter difference per interval */
+	uint64_t edpi_av[BINS_AV];
+	/* empty poll counter */
+	uint32_t ec;
+	/* ppi means valid poll counter per interval */
+	uint64_t ppi_av[BINS_AV];
+	/* valid poll counter */
+	uint32_t pc;
+
+	uint32_t lcore_id;
+	uint32_t iter_counter;
+	uint32_t threshold_ctr;
+	uint32_t display_ctr;
+	uint8_t  dev_id;
+
+} __rte_cache_aligned;
+
+
+struct stats_data {
+
+	struct priority_worker wrk_stats[NUM_NODES];
+
+	/* flag to stop rx threads processing packets until training over */
+	bool start_rx;
+
+};
+
+/* Empty Poll Parameters */
+struct ep_params {
+
+	/* Timer related stuff */
+	uint64_t interval_ticks;
+	uint32_t max_train_iter;
+
+	struct rte_timer timer0;
+	struct stats_data wrk_data;
+};
+
+
+/* Sample App Init information */
+struct ep_policy {
+
+	uint64_t med_base_edpi;
+	uint64_t hgh_base_edpi;
+
+	enum queue_state state;
+};
+
+
+
+/**
+ * Initialize the power management system.
+ *
+ * @param eptr
+ *   the structure of empty poll configuration
+ * @param freq_tlb
+ *   the power state/frequency mapping table
+ * @param policy
+ *   the initialization policy from sample app
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb,
+		struct ep_policy *policy);
+
+/**
+ * Free the resources held by the power management system.
+ */
+void __rte_experimental
+rte_power_empty_poll_stat_free(void);
+
+/**
+ * Update specific core empty poll counter
+ * It's not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_empty_poll_stat_update(unsigned int lcore_id);
+
+/**
+ * Update specific core valid poll counter, not thread safe.
+ *
+ * @param lcore_id
+ *  lcore id.
+ * @param nb_pkt
+ *  The number of packets received in one valid poll.
+ *
+ * @return
+ *  - 0 on success.
+ *  - Negative on error.
+ */
+int __rte_experimental
+rte_power_poll_stat_update(unsigned int lcore_id, uint8_t nb_pkt);
+
+/**
+ * Fetch specific core empty poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore empty poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_empty_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Fetch specific core valid poll counter.
+ *
+ * @param lcore_id
+ *  lcore id
+ *
+ * @return
+ *  Current lcore valid poll counter value.
+ */
+uint64_t __rte_experimental
+rte_power_poll_stat_fetch(unsigned int lcore_id);
+
+/**
+ * Empty poll  state change detection function
+ *
+ * @param  tim
+ *  The timer structure
+ * @param  arg
+ *  The customized parameter
+ */
+void  __rte_experimental
+rte_empty_poll_detection(struct rte_timer *tim, void *arg);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/lib/librte_power/rte_power_version.map b/lib/librte_power/rte_power_version.map
index dd587df..17a083b 100644
--- a/lib/librte_power/rte_power_version.map
+++ b/lib/librte_power/rte_power_version.map
@@ -33,3 +33,16 @@ DPDK_18.08 {
 	rte_power_get_capabilities;
 
 } DPDK_17.11;
+
+EXPERIMENTAL {
+        global:
+
+        rte_empty_poll_detection;
+        rte_power_empty_poll_stat_fetch;
+        rte_power_empty_poll_stat_free;
+        rte_power_empty_poll_stat_init;
+        rte_power_empty_poll_stat_update;
+        rte_power_poll_stat_fetch;
+        rte_power_poll_stat_update;
+
+};
-- 
2.7.5


* [dpdk-dev] [PATCH v12 2/5] examples/l3fwd-power: simple app update for new API
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
@ 2018-10-19 11:07                           ` Liang Ma
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 3/5] doc/guides/pro_guide/power-man: update the power API Liang Ma
                                             ` (5 subsequent siblings)
  6 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 11:07 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Add support for the new traffic pattern aware power management API.

Example:
./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3
-P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1

Please refer to the l3fwd-power documentation for full parameter usage.

The options "l", "m" and "h" set the frequency index for the
LOW, MED and HIGH power states. They only take effect when empty-poll
is enabled.

--empty-poll="training_flag, med_threshold, high_threshold"

training_flag enables or disables training mode.

med_threshold is the user-supplied empty poll threshold for the
modest (MED) state.

high_threshold is the user-supplied empty poll threshold for the
busy (HIGH) state.

All three options default to 0.

Once empty-poll is enabled, the system applies the default parameters
if no other command line options are provided.
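
The option handling described above can be sketched in isolation as
follows. This is a minimal illustration, not the code in this patch:
the names ep_cli_config and parse_empty_poll_arg are hypothetical, the
two defaults are the EMPTY_POLL_MED_THRESHOLD / EMPTY_POLL_HGH_THRESHOLD
values used by the sample app, and rejecting a training flag other than
0 or 1 is an assumption that is stricter than the sample app.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEFAULT_MED_THRESHOLD 350000UL
#define DEFAULT_HGH_THRESHOLD 580000UL

/* Hypothetical holder for the parsed option; not a DPDK structure. */
struct ep_cli_config {
	bool train;
	unsigned long med_threshold;
	unsigned long hgh_threshold;
};

/*
 * Parse "training_flag,med_threshold,high_threshold".
 * A threshold of 0 selects the built-in default.
 * Returns 0 on success, -1 on malformed input.
 */
static int
parse_empty_poll_arg(const char *arg, struct ep_cli_config *cfg)
{
	unsigned long v[3];
	const char *p = arg;
	char *end;
	int i;

	assert(arg != NULL && cfg != NULL);

	for (i = 0; i < 3; i++) {
		while (*p == ' ')
			p++;
		if (!isdigit((unsigned char)*p))
			return -1;	/* empty field or sign character */

		errno = 0;
		v[i] = strtoul(p, &end, 0);
		if (errno != 0)
			return -1;	/* value out of range */

		while (*end == ' ')
			end++;
		if (*end != (i < 2 ? ',' : '\0'))
			return -1;	/* too few or too many fields */
		if (i < 2)
			p = end + 1;
	}

	if (v[0] > 1)
		return -1;	/* assumption: the flag must be 0 or 1 */

	cfg->train = (v[0] == 1);
	cfg->med_threshold = v[1] ? v[1] : DEFAULT_MED_THRESHOLD;
	cfg->hgh_threshold = v[2] ? v[2] : DEFAULT_HGH_THRESHOLD;
	return 0;
}
```

With this sketch, "0,0,0" yields training off and both default
thresholds, while "0,340000,540000" keeps the supplied values.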

If training mode is enabled, the user should ensure that no traffic
is allowed to pass through the system. When the training phase completes,
the application transfers to normal operation.

The system starts running in the modest (MED) power state.
If the traffic goes above 70%, the system moves to the HIGH power state.
If the traffic drops below 30%, the system falls back to the modest
power state.
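
The two percentages form a hysteresis band: the state only changes when
the load crosses the far edge of the band, so a load hovering around a
single threshold does not cause constant frequency changes. A minimal
sketch of that transition rule is shown below. The names are
hypothetical, and how the library derives a load percentage from the
poll counters is not covered here, so load_percent is simply assumed to
be available.

```c
#include <assert.h>

/* Hypothetical two-state model of the MED/HIGH behaviour. */
enum power_state {
	STATE_MED,
	STATE_HGH
};

#define SCALE_UP_PERCENT   70	/* MED -> HIGH above this load */
#define SCALE_DOWN_PERCENT 30	/* HIGH -> MED below this load */

/*
 * Return the state to use for the next interval, given the current
 * state and an estimated load in the range 0..100.
 */
static enum power_state
next_power_state(enum power_state cur, unsigned int load_percent)
{
	assert(load_percent <= 100);

	if (cur == STATE_MED && load_percent > SCALE_UP_PERCENT)
		return STATE_HGH;
	if (cur == STATE_HGH && load_percent < SCALE_DOWN_PERCENT)
		return STATE_MED;

	/* Inside the 30..70 band the current state is kept. */
	return cur;
}
```

A load of 50% therefore leaves a core in whichever state it is
already in.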

The example code uses the master thread to monitor worker thread busyness.
The default timer resolution is 10ms.

ChangeLog:
v2 fix some coding style issues
v3 rename the API.
v6 re-work the API.
v7 no change.
v8 disable training as default option.
v10 update due to review comments.
v11 add checking for empty poll init function return value.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Reviewed-by: Lei Yao <lei.a.yao@intel.com>

Acked-by: David Hunt <david.hunt@intel.com>
---
 examples/l3fwd-power/Makefile    |   3 +
 examples/l3fwd-power/main.c      | 346 +++++++++++++++++++++++++++++++++++++--
 examples/l3fwd-power/meson.build |   1 +
 3 files changed, 333 insertions(+), 17 deletions(-)

diff --git a/examples/l3fwd-power/Makefile b/examples/l3fwd-power/Makefile
index d7e39a3..772ec7b 100644
--- a/examples/l3fwd-power/Makefile
+++ b/examples/l3fwd-power/Makefile
@@ -23,6 +23,8 @@ CFLAGS += -O3 $(shell pkg-config --cflags libdpdk)
 LDFLAGS_SHARED = $(shell pkg-config --libs libdpdk)
 LDFLAGS_STATIC = -Wl,-Bstatic $(shell pkg-config --static --libs libdpdk)
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
+
 build/$(APP)-shared: $(SRCS-y) Makefile $(PC_FILE) | build
 	$(CC) $(CFLAGS) $(SRCS-y) -o $@ $(LDFLAGS) $(LDFLAGS_SHARED)
 
@@ -54,6 +56,7 @@ please change the definition of the RTE_TARGET environment variable)
 all:
 else
 
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
 
diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 68527d2..c07eeff 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -43,6 +43,7 @@
 #include <rte_timer.h>
 #include <rte_power.h>
 #include <rte_spinlock.h>
+#include <rte_power_empty_poll.h>
 
 #include "perf_core.h"
 #include "main.h"
@@ -55,6 +56,8 @@
 
 /* 100 ms interval */
 #define TIMER_NUMBER_PER_SECOND           10
+/* (10ms) */
+#define INTERVALS_PER_SECOND             100
 /* 100000 us */
 #define SCALING_PERIOD                    (1000000/TIMER_NUMBER_PER_SECOND)
 #define SCALING_DOWN_TIME_RATIO_THRESHOLD 0.25
@@ -117,6 +120,17 @@
  */
 #define RTE_TEST_RX_DESC_DEFAULT 1024
 #define RTE_TEST_TX_DESC_DEFAULT 1024
+
+/*
+ * These two thresholds were decided on by running the training algorithm on
+ * a 2.5GHz Xeon. These defaults can be overridden by supplying non-zero values
+ * for the med_threshold and high_threshold parameters on the command line.
+ */
+#define EMPTY_POLL_MED_THRESHOLD 350000UL
+#define EMPTY_POLL_HGH_THRESHOLD 580000UL
+
+
+
 static uint16_t nb_rxd = RTE_TEST_RX_DESC_DEFAULT;
 static uint16_t nb_txd = RTE_TEST_TX_DESC_DEFAULT;
 
@@ -132,6 +146,14 @@ static uint32_t enabled_port_mask = 0;
 static int promiscuous_on = 0;
 /* NUMA is enabled by default. */
 static int numa_on = 1;
+/* emptypoll is disabled by default. */
+static bool empty_poll_on;
+static bool empty_poll_train;
+volatile bool empty_poll_stop;
+static struct  ep_params *ep_params;
+static struct  ep_policy policy;
+static long  ep_med_edpi, ep_hgh_edpi;
+
 static int parse_ptype; /**< Parse packet type using rx callback, and */
 			/**< disabled by default */
 
@@ -330,6 +352,19 @@ static inline uint32_t power_idle_heuristic(uint32_t zero_rx_packet_count);
 static inline enum freq_scale_hint_t power_freq_scaleup_heuristic( \
 		unsigned int lcore_id, uint16_t port_id, uint16_t queue_id);
 
+
+/*
+ * These defaults are using the max frequency index (1), a medium index (9)
+ * and a typical low frequency index (14). These can be adjusted to use
+ * different indexes using the relevant command line parameters.
+ */
+static uint8_t  freq_tlb[] = {14, 9, 1};
+
+static int is_done(void)
+{
+	return empty_poll_stop;
+}
+
 /* exit signal handler */
 static void
 signal_exit_now(int sigtype)
@@ -338,7 +373,15 @@ signal_exit_now(int sigtype)
 	unsigned int portid;
 	int ret;
 
+	RTE_SET_USED(lcore_id);
+	RTE_SET_USED(portid);
+	RTE_SET_USED(ret);
+
 	if (sigtype == SIGINT) {
+		if (empty_poll_on)
+			empty_poll_stop = true;
+
+
 		for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
 			if (rte_lcore_is_enabled(lcore_id) == 0)
 				continue;
@@ -351,16 +394,19 @@ signal_exit_now(int sigtype)
 							"core%u\n", lcore_id);
 		}
 
-		RTE_ETH_FOREACH_DEV(portid) {
-			if ((enabled_port_mask & (1 << portid)) == 0)
-				continue;
+		if (!empty_poll_on) {
+			RTE_ETH_FOREACH_DEV(portid) {
+				if ((enabled_port_mask & (1 << portid)) == 0)
+					continue;
 
-			rte_eth_dev_stop(portid);
-			rte_eth_dev_close(portid);
+				rte_eth_dev_stop(portid);
+				rte_eth_dev_close(portid);
+			}
 		}
 	}
 
-	rte_exit(EXIT_SUCCESS, "User forced exit\n");
+	if (!empty_poll_on)
+		rte_exit(EXIT_SUCCESS, "User forced exit\n");
 }
 
 /*  Freqency scale down timer callback */
@@ -825,7 +871,110 @@ static int event_register(struct lcore_conf *qconf)
 
 	return 0;
 }
+/* main processing loop */
+static int
+main_empty_poll_loop(__attribute__((unused)) void *dummy)
+{
+	struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+	unsigned int lcore_id;
+	uint64_t prev_tsc, diff_tsc, cur_tsc;
+	int i, j, nb_rx;
+	uint8_t queueid;
+	uint16_t portid;
+	struct lcore_conf *qconf;
+	struct lcore_rx_queue *rx_queue;
+
+	const uint64_t drain_tsc =
+		(rte_get_tsc_hz() + US_PER_S - 1) /
+		US_PER_S * BURST_TX_DRAIN_US;
+
+	prev_tsc = 0;
+
+	lcore_id = rte_lcore_id();
+	qconf = &lcore_conf[lcore_id];
+
+	if (qconf->n_rx_queue == 0) {
+		RTE_LOG(INFO, L3FWD_POWER, "lcore %u has nothing to do\n",
+			lcore_id);
+		return 0;
+	}
+
+	for (i = 0; i < qconf->n_rx_queue; i++) {
+		portid = qconf->rx_queue_list[i].port_id;
+		queueid = qconf->rx_queue_list[i].queue_id;
+		RTE_LOG(INFO, L3FWD_POWER, " -- lcoreid=%u portid=%u "
+				"rxqueueid=%hhu\n", lcore_id, portid, queueid);
+	}
+
+	while (!is_done()) {
+		stats[lcore_id].nb_iteration_looped++;
+
+		cur_tsc = rte_rdtsc();
+		/*
+		 * TX burst queue drain
+		 */
+		diff_tsc = cur_tsc - prev_tsc;
+		if (unlikely(diff_tsc > drain_tsc)) {
+			for (i = 0; i < qconf->n_tx_port; ++i) {
+				portid = qconf->tx_port_id[i];
+				rte_eth_tx_buffer_flush(portid,
+						qconf->tx_queue_id[portid],
+						qconf->tx_buffer[portid]);
+			}
+			prev_tsc = cur_tsc;
+		}
+
+		/*
+		 * Read packet from RX queues
+		 */
+		for (i = 0; i < qconf->n_rx_queue; ++i) {
+			rx_queue = &(qconf->rx_queue_list[i]);
+			rx_queue->idle_hint = 0;
+			portid = rx_queue->port_id;
+			queueid = rx_queue->queue_id;
+
+			nb_rx = rte_eth_rx_burst(portid, queueid, pkts_burst,
+					MAX_PKT_BURST);
+
+			stats[lcore_id].nb_rx_processed += nb_rx;
+
+			if (nb_rx == 0) {
+
+				rte_power_empty_poll_stat_update(lcore_id);
+
+				continue;
+			} else {
+				rte_power_poll_stat_update(lcore_id, nb_rx);
+			}
+
+
+			/* Prefetch first packets */
+			for (j = 0; j < PREFETCH_OFFSET && j < nb_rx; j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(
+							pkts_burst[j], void *));
+			}
+
+			/* Prefetch and forward already prefetched packets */
+			for (j = 0; j < (nb_rx - PREFETCH_OFFSET); j++) {
+				rte_prefetch0(rte_pktmbuf_mtod(pkts_burst[
+							j + PREFETCH_OFFSET],
+							void *));
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+			/* Forward remaining prefetched packets */
+			for (; j < nb_rx; j++) {
+				l3fwd_simple_forward(pkts_burst[j], portid,
+						qconf);
+			}
+
+		}
 
+	}
+
+	return 0;
+}
 /* main processing loop */
 static int
 main_loop(__attribute__((unused)) void *dummy)
@@ -1127,7 +1276,9 @@ print_usage(const char *prgname)
 		"  --no-numa: optional, disable numa awareness\n"
 		"  --enable-jumbo: enable jumbo frame"
 		" which max packet len is PKTLEN in decimal (64-9600)\n"
-		"  --parse-ptype: parse packet type by software\n",
+		"  --parse-ptype: parse packet type by software\n"
+		"  --empty-poll: enable empty poll detection"
+		" follow (training_flag, med_threshold, high_threshold)\n",
 		prgname);
 }
 
@@ -1220,7 +1371,55 @@ parse_config(const char *q_arg)
 
 	return 0;
 }
+static int
+parse_ep_config(const char *q_arg)
+{
+	char s[256];
+	const char *p = q_arg;
+	char *end;
+	int  num_arg;
+
+	char *str_fld[3];
+
+	int training_flag;
+	int med_edpi;
+	int hgh_edpi;
+
+	ep_med_edpi = EMPTY_POLL_MED_THRESHOLD;
+	ep_hgh_edpi = EMPTY_POLL_HGH_THRESHOLD;
+
+	snprintf(s, sizeof(s), "%s", p);
+
+	num_arg = rte_strsplit(s, sizeof(s), str_fld, 3, ',');
+
+	empty_poll_train = false;
 
+	if (num_arg == 0)
+		return 0;
+
+	if (num_arg == 3) {
+
+		training_flag = strtoul(str_fld[0], &end, 0);
+		med_edpi = strtoul(str_fld[1], &end, 0);
+		hgh_edpi = strtoul(str_fld[2], &end, 0);
+
+		if (training_flag == 1)
+			empty_poll_train = true;
+
+		if (med_edpi > 0)
+			ep_med_edpi = med_edpi;
+
+		if (hgh_edpi > 0)
+			ep_hgh_edpi = hgh_edpi;
+
+	} else {
+
+		return -1;
+	}
+
+	return 0;
+
+}
 #define CMD_LINE_OPT_PARSE_PTYPE "parse-ptype"
 
 /* Parse the argument given in the command line of the application */
@@ -1230,6 +1429,7 @@ parse_args(int argc, char **argv)
 	int opt, ret;
 	char **argvopt;
 	int option_index;
+	uint32_t limit;
 	char *prgname = argv[0];
 	static struct option lgopts[] = {
 		{"config", 1, 0, 0},
@@ -1237,13 +1437,14 @@ parse_args(int argc, char **argv)
 		{"high-perf-cores", 1, 0, 0},
 		{"no-numa", 0, 0, 0},
 		{"enable-jumbo", 0, 0, 0},
+		{"empty-poll", 1, 0, 0},
 		{CMD_LINE_OPT_PARSE_PTYPE, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
 
 	argvopt = argv;
 
-	while ((opt = getopt_long(argc, argvopt, "p:P",
+	while ((opt = getopt_long(argc, argvopt, "p:l:m:h:P",
 				lgopts, &option_index)) != EOF) {
 
 		switch (opt) {
@@ -1260,7 +1461,18 @@ parse_args(int argc, char **argv)
 			printf("Promiscuous mode selected\n");
 			promiscuous_on = 1;
 			break;
-
+		case 'l':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[LOW] = limit;
+			break;
+		case 'm':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[MED] = limit;
+			break;
+		case 'h':
+			limit = parse_max_pkt_len(optarg);
+			freq_tlb[HGH] = limit;
+			break;
 		/* long options */
 		case 0:
 			if (!strncmp(lgopts[option_index].name, "config", 6)) {
@@ -1299,6 +1511,20 @@ parse_args(int argc, char **argv)
 			}
 
 			if (!strncmp(lgopts[option_index].name,
+						"empty-poll", 10)) {
+				printf("empty-poll is enabled\n");
+				empty_poll_on = true;
+				ret = parse_ep_config(optarg);
+
+				if (ret) {
+					printf("invalid empty poll config\n");
+					print_usage(prgname);
+					return -1;
+				}
+
+			}
+
+			if (!strncmp(lgopts[option_index].name,
 					"enable-jumbo", 12)) {
 				struct option lenopts =
 					{"max-pkt-len", required_argument, \
@@ -1646,6 +1872,59 @@ init_power_library(void)
 	}
 	return ret;
 }
+static void
+empty_poll_setup_timer(void)
+{
+	int lcore_id = rte_lcore_id();
+	uint64_t hz = rte_get_timer_hz();
+
+	struct  ep_params *ep_ptr = ep_params;
+
+	ep_ptr->interval_ticks = hz / INTERVALS_PER_SECOND;
+
+	rte_timer_reset_sync(&ep_ptr->timer0,
+			ep_ptr->interval_ticks,
+			PERIODICAL,
+			lcore_id,
+			rte_empty_poll_detection,
+			(void *)ep_ptr);
+
+}
+static int
+launch_timer(unsigned int lcore_id)
+{
+	int64_t prev_tsc = 0, cur_tsc, diff_tsc, cycles_10ms;
+
+	RTE_SET_USED(lcore_id);
+
+
+	if (rte_get_master_lcore() != lcore_id) {
+		rte_panic("timer on lcore:%d which is not master core:%d\n",
+				lcore_id,
+				rte_get_master_lcore());
+	}
+
+	RTE_LOG(INFO, POWER, "Bring up the Timer\n");
+
+	empty_poll_setup_timer();
+
+	cycles_10ms = rte_get_timer_hz() / 100;
+
+	while (!is_done()) {
+		cur_tsc = rte_rdtsc();
+		diff_tsc = cur_tsc - prev_tsc;
+		if (diff_tsc > cycles_10ms) {
+			rte_timer_manage();
+			prev_tsc = cur_tsc;
+			cycles_10ms = rte_get_timer_hz() / 100;
+		}
+	}
+
+	RTE_LOG(INFO, POWER, "Timer_subsystem is done\n");
+
+	return 0;
+}
+
 
 int
 main(int argc, char **argv)
@@ -1828,13 +2107,15 @@ main(int argc, char **argv)
 		if (rte_lcore_is_enabled(lcore_id) == 0)
 			continue;
 
-		/* init timer structures for each enabled lcore */
-		rte_timer_init(&power_timers[lcore_id]);
-		hz = rte_get_timer_hz();
-		rte_timer_reset(&power_timers[lcore_id],
-			hz/TIMER_NUMBER_PER_SECOND, SINGLE, lcore_id,
-						power_timer_cb, NULL);
-
+		if (empty_poll_on == false) {
+			/* init timer structures for each enabled lcore */
+			rte_timer_init(&power_timers[lcore_id]);
+			hz = rte_get_timer_hz();
+			rte_timer_reset(&power_timers[lcore_id],
+					hz/TIMER_NUMBER_PER_SECOND,
+					SINGLE, lcore_id,
+					power_timer_cb, NULL);
+		}
 		qconf = &lcore_conf[lcore_id];
 		printf("\nInitializing rx queues on lcore %u ... ", lcore_id );
 		fflush(stdout);
@@ -1905,12 +2186,43 @@ main(int argc, char **argv)
 
 	check_all_ports_link_status(enabled_port_mask);
 
+	if (empty_poll_on == true) {
+
+		if (empty_poll_train) {
+			policy.state = TRAINING;
+		} else {
+			policy.state = MED_NORMAL;
+			policy.med_base_edpi = ep_med_edpi;
+			policy.hgh_base_edpi = ep_hgh_edpi;
+		}
+
+		ret = rte_power_empty_poll_stat_init(&ep_params,
+				freq_tlb,
+				&policy);
+		if (ret < 0)
+			rte_exit(EXIT_FAILURE, "empty poll init failed");
+	}
+
+
 	/* launch per-lcore init on every lcore */
-	rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	if (empty_poll_on == false) {
+		rte_eal_mp_remote_launch(main_loop, NULL, CALL_MASTER);
+	} else {
+		empty_poll_stop = false;
+		rte_eal_mp_remote_launch(main_empty_poll_loop, NULL,
+				SKIP_MASTER);
+	}
+
+	if (empty_poll_on == true)
+		launch_timer(rte_lcore_id());
+
 	RTE_LCORE_FOREACH_SLAVE(lcore_id) {
 		if (rte_eal_wait_lcore(lcore_id) < 0)
 			return -1;
 	}
 
+	if (empty_poll_on)
+		rte_power_empty_poll_stat_free();
+
 	return 0;
 }
diff --git a/examples/l3fwd-power/meson.build b/examples/l3fwd-power/meson.build
index 20c8054..a3c5c2f 100644
--- a/examples/l3fwd-power/meson.build
+++ b/examples/l3fwd-power/meson.build
@@ -9,6 +9,7 @@
 if host_machine.system() != 'linux'
 	build = false
 endif
+allow_experimental_apis = true
 deps += ['power', 'timer', 'lpm', 'hash']
 sources = files(
 	'main.c', 'perf_core.c'
-- 
2.7.5


* [dpdk-dev] [PATCH v12 3/5] doc/guides/pro_guide/power-man: update the power API
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
@ 2018-10-19 11:07                           ` Liang Ma
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 4/5] doc/guides/sample_app_ug/l3_forward_power_man.rst: update Liang Ma
                                             ` (4 subsequent siblings)
  6 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 11:07 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Update the document for empty poll API.

Change Logs:
v9: minor changes for syntax. Update document.

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Acked-by: David Hunt <david.hunt@intel.com>
---
 doc/guides/prog_guide/power_man.rst | 86 +++++++++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/doc/guides/prog_guide/power_man.rst b/doc/guides/prog_guide/power_man.rst
index eba1cc6..68b7e8b 100644
--- a/doc/guides/prog_guide/power_man.rst
+++ b/doc/guides/prog_guide/power_man.rst
@@ -106,6 +106,92 @@ User Cases
 
 The power management mechanism is used to save power when performing L3 forwarding.
 
+
+Empty Poll API
+--------------
+
+Abstract
+~~~~~~~~
+
+For packet processing workloads such as DPDK, polling is continuous.
+This means CPU cores always show 100% busy independent of how much work
+those cores are doing. Accurately determining how busy a core really is
+matters for the following reasons:
+
+        * Without it, there is no indication of overload conditions
+        * The user does not know how much real load is on a system, resulting
+          in wasted energy as no power management is utilized
+
+Compared to the original l3fwd-power design, instead of going to sleep
+after detecting an empty poll, the new mechanism just lowers the core frequency.
+As a result, the application does not stop polling the device, which leads
+to improved handling of bursts of traffic.
+
+When the system becomes busy, the empty poll mechanism can also increase the
+core frequency (including turbo) to make a best effort for intensive traffic.
+This gives more flexible and balanced traffic awareness than the standard
+l3fwd-power application.
+
+
+Proposed Solution
+~~~~~~~~~~~~~~~~~
+The proposed solution focuses on how many times empty polls are executed.
+A low number of empty polls means the current core is busy processing its
+workload, and therefore a higher frequency is needed. A high number of empty
+polls indicates the current core is not doing any real work, and therefore
+the frequency can be lowered to save power.
+
+In the current implementation, each core has 1 empty-poll counter which assumes
+1 core is dedicated to 1 queue. This will need to be expanded in the future to
+support multiple queues per core.
+
+Power state definition:
+^^^^^^^^^^^^^^^^^^^^^^^
+
+* LOW:  Not currently used, reserved for future use.
+
+* MED:  the frequency is used to process modest traffic workload.
+
+* HIGH: the frequency is used to process busy traffic workload.
+
+There are two phases to establish the power management system:
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+* Training phase. This phase is used to measure the optimal frequency
+  change thresholds for a given system. The thresholds will differ from
+  system to system due to differences in processor micro-architecture,
+  cache and device configurations.
+  In this phase, the user must ensure that no traffic can enter the
+  system so that counts can be measured for empty polls at low, medium
+  and high frequencies. Each frequency is measured for two seconds.
+  Once the training phase is complete, the threshold numbers are
+  displayed, normal mode resumes, and traffic can be allowed into
+  the system. These threshold numbers can be used on the command line
+  when starting the application in normal mode to avoid re-training
+  every time.
+
+* Normal phase. Every 10ms the run-time counters are compared
+  to the supplied threshold values, and the decision will be made
+  whether to move to a different power state (by adjusting the
+  frequency).
+
+API Overview for Empty Poll Power Management
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+* **State Init**: initialize the power management system.
+
+* **State Free**: free the resources held by the power management system.
+
+* **Update Empty Poll Counter**: update the empty poll counter.
+
+* **Update Valid Poll Counter**: update the valid poll counter.
+
+* **Set the Frequency Index**: update the power state/frequency mapping.
+
+* **Detect empty poll state change**: run the empty poll state change detection algorithm, then take action.
+
+User Cases
+----------
+The mechanism can be applied to any device which is based on polling, e.g. NIC, FPGA.
+
 References
 ----------
 
-- 
2.7.5


* [dpdk-dev] [PATCH v12 4/5] doc/guides/sample_app_ug/l3_forward_power_man.rst: update
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 3/5] doc/guides/pro_guide/power-man: update the power API Liang Ma
@ 2018-10-19 11:07                           ` Liang Ma
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library Liang Ma
                                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 11:07 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Add empty poll mode command line example

ChangeLogs:
v9: update the document

Signed-off-by: Liang Ma <liang.j.ma@intel.com>

Acked-by: David Hunt <david.hunt@intel.com>
---
 doc/guides/sample_app_ug/l3_forward_power_man.rst | 69 +++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/doc/guides/sample_app_ug/l3_forward_power_man.rst b/doc/guides/sample_app_ug/l3_forward_power_man.rst
index 795a570..e44a11b 100644
--- a/doc/guides/sample_app_ug/l3_forward_power_man.rst
+++ b/doc/guides/sample_app_ug/l3_forward_power_man.rst
@@ -105,6 +105,8 @@ where,
 
 *   --no-numa: optional, disables numa awareness
 
+*   --empty-poll: Traffic Aware power management. See below for details
+
 See :doc:`l3_forward` for details.
 The L3fwd-power example reuses the L3fwd command line options.
 
@@ -362,3 +364,70 @@ The algorithm has the following sleeping behavior depending on the idle counter:
 If a thread polls multiple Rx queues and different queue returns different sleep duration values,
 the algorithm controls the sleep time in a conservative manner by sleeping for the least possible time
 in order to avoid a potential performance impact.
+
+Empty Poll Mode
+-------------------------
+Additionally, there is a traffic aware mode of operation called "Empty
+Poll" where the number of empty polls can be monitored to keep track
+of how busy the application is. Empty poll mode can be enabled by the
+command line option --empty-poll.
+
+See :doc:`Power Management<../prog_guide/power_man>` chapter in the DPDK Programmer's Guide for empty poll mode details.
+
+.. code-block:: console
+
+    ./l3fwd-power -l xxx   -n 4   -w 0000:xx:00.0 -w 0000:xx:00.1 -- -p 0x3 -P --config="(0,0,xx),(1,0,xx)" --empty-poll="0,0,0" -l 14 -m 9 -h 1
+
+Where,
+
+--empty-poll: Enable the empty poll mode instead of original algorithm
+
+--empty-poll="training_flag, med_threshold, high_threshold"
+
+* ``training_flag`` : optional, enable/disable training mode. Default value is 0. If the training_flag is set as 1(true), then the application will start in training mode and print out the trained threshold values. If the training_flag is set as 0(false), the application will start in normal mode, and will use either the default thresholds or those supplied on the command line. The trained threshold values are specific to the user’s system and may give a better power profile when compared to the default threshold values.
+
+* ``med_threshold`` : optional, sets the empty poll threshold of a modestly busy system state. If this is not supplied, the application will apply the default value of 350000.
+
+* ``high_threshold`` : optional, sets the empty poll threshold of a busy system state. If this is not supplied, the application will apply the default value of 580000.
+
+* -l : optional, set up the LOW power state frequency index
+
+* -m : optional, set up the MED power state frequency index
+
+* -h : optional, set up the HIGH power state frequency index
+
+Empty Poll Mode Example Usage
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+To initially obtain the ideal thresholds for the system, the training
+mode should be run first. This is achieved by running the l3fwd-power
+app with the training flag set to “1”, and the other parameters set to
+0.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "1,0,0" -P
+
+This will run the training algorithm for x seconds on each core (cores 2
+and 3), and then print out the recommended threshold values for those
+cores. The thresholds should be very similar for each core.
+
+.. code-block:: console
+
+        POWER: Bring up the Timer
+        POWER: set the power freq to MED
+        POWER: Low threshold is 230277
+        POWER: MED threshold is 335071
+        POWER: HIGH threshold is 523769
+        POWER: Training is Complete for 2
+        POWER: set the power freq to MED
+        POWER: Low threshold is 236814
+        POWER: MED threshold is 344567
+        POWER: HIGH threshold is 538580
+        POWER: Training is Complete for 3
+
+Once the values have been measured for a particular system, the app can
+then be started without the training mode so traffic can start immediately.
+
+.. code-block:: console
+
+        ./examples/l3fwd-power/build/l3fwd-power -l 1-3 -- -p 0x0f --config="(0,0,2),(0,1,3)" --empty-poll "0,340000,540000" -P
-- 
2.7.5
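
Editor's note: the ``--empty-poll`` value described above is a comma-separated
triple of training_flag, med_threshold and high_threshold. The sketch below
shows one way such a string could be parsed in C. It is not the code from
l3fwd-power: the struct, the function name and the rule that a missing or zero
threshold falls back to the documented default (350000 / 580000) are all
assumptions made for illustration.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Defaults quoted in the option descriptions in the patch. */
#define EP_DEFAULT_MED_THRESHOLD  350000u
#define EP_DEFAULT_HIGH_THRESHOLD 580000u

/* Hypothetical container for the parsed option; not a DPDK type. */
struct ep_cli_config {
	bool training;
	uint32_t med_threshold;
	uint32_t high_threshold;
};

/*
 * Parse "training_flag[,med_threshold[,high_threshold]]".
 * Assumption: a missing or zero threshold means "use the default".
 * Returns 0 on success, -1 on malformed input.
 */
int
parse_empty_poll_arg(const char *arg, struct ep_cli_config *cfg)
{
	unsigned long vals[3] = {0, 0, 0};
	const char *p = arg;
	int n;

	if (arg == NULL || cfg == NULL)
		return -1;

	for (n = 0; n < 3; n++) {
		char *end;

		/* Reject signs, whitespace and empty fields up front. */
		if (!isdigit((unsigned char)*p))
			return -1;
		errno = 0;
		vals[n] = strtoul(p, &end, 10);
		if (errno != 0 || vals[n] > UINT32_MAX)
			return -1;
		if (*end == '\0')
			break;		/* remaining fields keep value 0 */
		if (*end != ',' || n == 2)
			return -1;	/* bad separator or a fourth field */
		p = end + 1;
	}

	if (vals[0] > 1)
		return -1;		/* training_flag must be 0 or 1 */

	cfg->training = (vals[0] == 1);
	cfg->med_threshold = vals[1] ? (uint32_t)vals[1]
				     : EP_DEFAULT_MED_THRESHOLD;
	cfg->high_threshold = vals[2] ? (uint32_t)vals[2]
				      : EP_DEFAULT_HIGH_THRESHOLD;
	return 0;
}
```

With this sketch, "1,0,0" selects training mode and leaves both thresholds at
their defaults, while "0,340000,540000" selects normal mode with the supplied
thresholds.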

* [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
                                             ` (2 preceding siblings ...)
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 4/5] doc/guides/sample_app_ug/l3_forward_power_man.rst: update Liang Ma
@ 2018-10-19 11:07                           ` Liang Ma
  2018-10-22 12:41                             ` Kovacevic, Marko
  2018-10-25 23:39                             ` Thomas Monjalon
  2018-10-25 23:22                           ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Thomas Monjalon
                                             ` (2 subsequent siblings)
  6 siblings, 2 replies; 79+ messages in thread
From: Liang Ma @ 2018-10-19 11:07 UTC (permalink / raw)
  To: david.hunt; +Cc: dev, lei.a.yao, ktraynor, marko.kovacevic, Liang Ma

Update the release notes for the Traffic Pattern Aware Control
Library (empty poll).

Signed-off-by: Liang Ma <liang.j.ma@intel.com>
---
 doc/guides/rel_notes/release_18_11.rst | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
index a8327ea..5efc74e 100644
--- a/doc/guides/rel_notes/release_18_11.rst
+++ b/doc/guides/rel_notes/release_18_11.rst
@@ -97,6 +97,16 @@ New Features
   the SW eventdev PMD, sacrifices load balancing performance to
   gain better event scheduling throughput and scalability.
 
+* **Added Traffic Pattern Aware Power Control Library**
+
+  Added an experimental library that extends the Power Library and provides
+  empty_poll APIs. The feature measures how many empty polls are
+  executed per core and uses that count as a hint for system
+  power management.
+
+  See the :doc:`../prog_guide/power_man` section of the DPDK Programmers
+  Guide document for more information.
+
 * **Added ability to switch queue deferred start flag on testpmd app.**
 
   Added a console command to testpmd app, giving ability to switch
@@ -104,7 +114,6 @@ New Features
   the specified port. The port must be stopped before the command call in order
   to reconfigure queues.
 
-
 API Changes
 -----------
 
@@ -118,6 +127,16 @@ API Changes
    Also, make sure to start the actual text at the margin.
    =========================================================
 
+* power: Traffic Pattern Aware Control APIs (marked as experimental):
+
+  - ``rte_power_empty_poll_stat_init``
+  - ``rte_power_empty_poll_stat_free``
+  - ``rte_power_empty_poll_stat_update``
+  - ``rte_power_empty_poll_stat_fetch``
+  - ``rte_power_poll_stat_update``
+  - ``rte_power_poll_stat_fetch``
+  - ``rte_empty_poll_detection``
+
 * mbuf: The ``__rte_mbuf_raw_free()`` and ``__rte_pktmbuf_prefree_seg()``
   functions were deprecated since 17.05 and are replaced by
   ``rte_mbuf_raw_free()`` and ``rte_pktmbuf_prefree_seg()``.
-- 
2.7.5

* Re: [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library Liang Ma
@ 2018-10-22 12:41                             ` Kovacevic, Marko
  2018-10-25 23:39                             ` Thomas Monjalon
  1 sibling, 0 replies; 79+ messages in thread
From: Kovacevic, Marko @ 2018-10-22 12:41 UTC (permalink / raw)
  To: Ma, Liang J, Hunt, David; +Cc: dev, Yao, Lei A, ktraynor

Acked-by: Marko Kovacevic <marko.kovacevic@intel.com>

* Re: [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
                                             ` (3 preceding siblings ...)
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library Liang Ma
@ 2018-10-25 23:22                           ` Thomas Monjalon
  2018-10-25 23:32                             ` Thomas Monjalon
  2018-10-25 23:54                           ` Thomas Monjalon
  2018-10-25 23:55                           ` Thomas Monjalon
  6 siblings, 1 reply; 79+ messages in thread
From: Thomas Monjalon @ 2018-10-25 23:22 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, lei.a.yao, ktraynor, marko.kovacevic

Hi,

It fails to compile (tried with meson build-gcc-static):

lib/librte_power/rte_power_empty_poll.h:20:10: fatal error:
	rte_timer.h: No such file or directory

It looks to be fixed with this one-line change:

--- a/lib/librte_power/meson.build
+++ b/lib/librte_power/meson.build
@@ -8,3 +8,4 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
        'power_kvm_vm.c', 'guest_channel.c',
        'rte_power_empty_poll.c')
 headers = files('rte_power.h','rte_power_empty_poll.h')
+deps += ['timer']

* Re: [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control
  2018-10-25 23:22                           ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Thomas Monjalon
@ 2018-10-25 23:32                             ` Thomas Monjalon
  0 siblings, 0 replies; 79+ messages in thread
From: Thomas Monjalon @ 2018-10-25 23:32 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, lei.a.yao, ktraynor, marko.kovacevic

26/10/2018 01:22, Thomas Monjalon:
> Hi,
> 
> It fails to compile (tried with meson build-gcc-static):
> 
> lib/librte_power/rte_power_empty_poll.h:20:10: fatal error:
> 	rte_timer.h: No such file or directory
> 
> It looks to be fixed with this one-line change:
> 
> --- a/lib/librte_power/meson.build
> +++ b/lib/librte_power/meson.build
> @@ -8,3 +8,4 @@ sources = files('rte_power.c', 'power_acpi_cpufreq.c',
>         'power_kvm_vm.c', 'guest_channel.c',
>         'rte_power_empty_poll.c')
>  headers = files('rte_power.h','rte_power_empty_poll.h')
> +deps += ['timer']


There are also some doxygen issues:

lib/librte_power/rte_power_empty_poll.h:136: warning: Found unknown command `\freq_tlb'
lib/librte_power/rte_power_empty_poll.h:138: warning: Found unknown command `\policy'
lib/librte_power/rte_power_empty_poll.h:146: warning: The following parameters of rte_power_empty_poll_stat_init(struct ep_params **eptr, uint8_t *freq_tlb, struct ep_policy *policy) are not documented:
  parameter 'freq_tlb'
  parameter 'policy'

* Re: [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library
  2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library Liang Ma
  2018-10-22 12:41                             ` Kovacevic, Marko
@ 2018-10-25 23:39                             ` Thomas Monjalon
  1 sibling, 0 replies; 79+ messages in thread
From: Thomas Monjalon @ 2018-10-25 23:39 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, lei.a.yao, ktraynor, marko.kovacevic

Hi,

This patch, like other doc patches, should be merged with the related code patches.


19/10/2018 13:07, Liang Ma:
> Update the release notes for the Traffic Pattern Aware Control
> Library (empty poll).
> 
> Signed-off-by: Liang Ma <liang.j.ma@intel.com>
> ---
>  doc/guides/rel_notes/release_18_11.rst | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/doc/guides/rel_notes/release_18_11.rst b/doc/guides/rel_notes/release_18_11.rst
> index a8327ea..5efc74e 100644
> --- a/doc/guides/rel_notes/release_18_11.rst
> +++ b/doc/guides/rel_notes/release_18_11.rst
> @@ -97,6 +97,16 @@ New Features
>    the SW eventdev PMD, sacrifices load balancing performance to
>    gain better event scheduling throughput and scalability.
>  
> +* **Added Traffic Pattern Aware Power Control Library**
> +
> +  Added an experimental library that extends the Power Library and provides
> +  empty_poll APIs. The feature measures how many empty polls are
> +  executed per core and uses that count as a hint for system
> +  power management.
> +
> +  See the :doc:`../prog_guide/power_man` section of the DPDK Programmers
> +  Guide document for more information.
> +
>  * **Added ability to switch queue deferred start flag on testpmd app.**
>  
>    Added a console command to testpmd app, giving ability to switch
> @@ -104,7 +114,6 @@ New Features
>    the specified port. The port must be stopped before the command call in order
>    to reconfigure queues.
>  
> -
>  API Changes
>  -----------
>  
> @@ -118,6 +127,16 @@ API Changes
>     Also, make sure to start the actual text at the margin.
>     =========================================================
>  
> +* power: Traffic Pattern Aware Control APIs (marked as experimental):
> +
> +  - ``rte_power_empty_poll_stat_init``
> +  - ``rte_power_empty_poll_stat_free``
> +  - ``rte_power_empty_poll_stat_update``
> +  - ``rte_power_empty_poll_stat_fetch``
> +  - ``rte_power_poll_stat_update``
> +  - ``rte_power_poll_stat_fetch``
> +  - ``rte_empty_poll_detection``

New APIs do not need to be listed here.

* Re: [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
                                             ` (4 preceding siblings ...)
  2018-10-25 23:22                           ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Thomas Monjalon
@ 2018-10-25 23:54                           ` Thomas Monjalon
  2018-10-25 23:55                           ` Thomas Monjalon
  6 siblings, 0 replies; 79+ messages in thread
From: Thomas Monjalon @ 2018-10-25 23:54 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, lei.a.yao, ktraynor, marko.kovacevic

19/10/2018 13:07, Liang Ma:
> --- a/lib/librte_power/Makefile
> +++ b/lib/librte_power/Makefile
> @@ -6,8 +6,9 @@ include $(RTE_SDK)/mk/rte.vars.mk
>  # library name
>  LIB = librte_power.a
>  
> +CFLAGS += -DALLOW_EXPERIMENTAL_API

We don't need this flag if we don't use experimental API from other libs.

* Re: [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control
  2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
                                             ` (5 preceding siblings ...)
  2018-10-25 23:54                           ` Thomas Monjalon
@ 2018-10-25 23:55                           ` Thomas Monjalon
  2018-10-26 13:34                             ` Liang, Ma
  6 siblings, 1 reply; 79+ messages in thread
From: Thomas Monjalon @ 2018-10-25 23:55 UTC (permalink / raw)
  To: Liang Ma; +Cc: dev, david.hunt, lei.a.yao, ktraynor, marko.kovacevic

19/10/2018 13:07, Liang Ma:
> The proposed solution focuses on how many times empty polls are executed.
> A low number of empty polls means the current core is busy
> processing its workload, so a higher frequency is needed. A high
> empty poll number indicates the current core is not doing any real work,
> so we can lower the frequency to save power.
> 
> In the current implementation, each core has 1 empty-poll counter, which
> assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
> future to support multiple queues per core.

Applied with fixes for meson compilation, doxygen and doc.
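
Editor's note: the principle quoted above (few empty polls means a busy core
and a higher frequency, many empty polls means an idle core and a lower
frequency) can be illustrated with a minimal C sketch. This is a deliberately
simplified illustration, not the algorithm in librte_power: the enum, the
struct, the function name and the two cut-off values are hypothetical, and the
cut-offs are not the trained med/high thresholds discussed in the thread.

```c
#include <stdint.h>

/* The three power states named in the patch series. */
enum ep_freq_state { EP_FREQ_LOW, EP_FREQ_MED, EP_FREQ_HIGH };

/*
 * Hypothetical cut-offs, in empty polls per sampling interval.
 * Illustration only; these are not the library's trained thresholds.
 */
struct ep_cutoffs {
	uint64_t busy_below;	/* fewer empty polls than this: core is busy */
	uint64_t idle_above;	/* more empty polls than this: core is mostly idle */
};

/* Map one interval's empty-poll count to a frequency state. */
enum ep_freq_state
ep_pick_state(uint64_t empty_polls, const struct ep_cutoffs *c)
{
	if (empty_polls < c->busy_below)
		return EP_FREQ_HIGH;	/* little spare time: raise frequency */
	if (empty_polls > c->idle_above)
		return EP_FREQ_LOW;	/* lots of spare time: lower frequency */
	return EP_FREQ_MED;
}
```

A caller would sample the per-core empty-poll counter once per interval and
pass the difference since the previous sample as ``empty_polls``.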

* Re: [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control
  2018-10-25 23:55                           ` Thomas Monjalon
@ 2018-10-26 13:34                             ` Liang, Ma
  0 siblings, 0 replies; 79+ messages in thread
From: Liang, Ma @ 2018-10-26 13:34 UTC (permalink / raw)
  To: Thomas Monjalon; +Cc: dev, david.hunt, lei.a.yao, ktraynor, marko.kovacevic


Hi Thomas,
   Many thanks, I will check the meson build and docs carefully next time.

Regards
Liang

On 26 Oct 01:55, Thomas Monjalon wrote:
> 19/10/2018 13:07, Liang Ma:
> > The proposed solution focuses on how many times empty polls are executed.
> > A low number of empty polls means the current core is busy
> > processing its workload, so a higher frequency is needed. A high
> > empty poll number indicates the current core is not doing any real work,
> > so we can lower the frequency to save power.
> > 
> > In the current implementation, each core has 1 empty-poll counter, which
> > assumes 1 core is dedicated to 1 queue. This will need to be expanded in the
> > future to support multiple queues per core.
> 
> Applied with fixes for meson compilation, doxygen and doc.
> 
> 

Thread overview: 79+ messages
2018-06-08  9:57 [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
2018-06-08  9:57 ` [dpdk-dev] [PATCH v1 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
2018-06-19 10:31   ` Hunt, David
2018-06-08 15:26 ` [dpdk-dev] [PATCH v2 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
2018-06-08 15:26   ` [dpdk-dev] [PATCH v2 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
2018-06-20 14:44     ` [dpdk-dev] [PATCH v3 1/2] lib/librte_power: traffic pattern aware power control Liang Ma
2018-06-20 14:44       ` [dpdk-dev] [PATCH v3 2/2] examples/l3fwd-power: simple app update to support new API Liang Ma
2018-06-26 11:40         ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Radu Nicolau
2018-06-26 11:40           ` [dpdk-dev] [PATCH v4 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
2018-06-26 13:03             ` Hunt, David
2018-06-26 13:03           ` [dpdk-dev] [PATCH v4 1/2] lib/librte_power: traffic pattern aware power control Hunt, David
2018-06-27 17:33           ` Kevin Traynor
2018-07-05 14:45             ` Liang, Ma
2018-07-12 17:30               ` Thomas Monjalon
2018-09-11  9:19             ` Hunt, David
2018-09-13  9:46               ` Kevin Traynor
2018-09-13 13:30                 ` Liang, Ma
2018-07-10 16:04           ` [dpdk-dev] [PATCH v5 " Radu Nicolau
2018-07-10 16:04             ` [dpdk-dev] [PATCH v5 2/2] examples/l3fwd-power: simple app update to support new API Radu Nicolau
2018-08-31 15:04             ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 3/4] doc/guides/proguides/power-man: update the power API Liang Ma
2018-08-31 15:04               ` [dpdk-dev] [PATCH v6 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
2018-09-04  1:11               ` [dpdk-dev] [PATCH v6 1/4] lib/librte_power: traffic pattern aware power control Yao, Lei A
2018-09-04  2:09               ` Yao, Lei A
2018-09-04 14:10               ` [dpdk-dev] [PATCH v7 " Liang Ma
2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 3/4] doc/guides/proguides/power-man: update the power API Liang Ma
2018-09-04 14:10                 ` [dpdk-dev] [PATCH v7 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
2018-09-13 10:54                 ` [dpdk-dev] [PATCH v7 1/4] lib/librte_power: traffic pattern aware power control Kevin Traynor
2018-09-13 13:37                   ` Liang, Ma
2018-09-13 14:05                     ` Hunt, David
2018-09-17 13:30                 ` [dpdk-dev] [PATCH v8 " Liang Ma
2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
2018-09-28 11:19                     ` Hunt, David
2018-10-02 10:18                       ` Liang, Ma
2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 3/4] doc/guides/proguide/power-man: update the power API Liang Ma
2018-09-25 12:31                     ` Kovacevic, Marko
2018-09-25 12:44                     ` Kovacevic, Marko
2018-09-28 12:30                     ` Hunt, David
2018-09-17 13:30                   ` [dpdk-dev] [PATCH v8 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
2018-09-25 13:20                     ` Kovacevic, Marko
2018-09-28 12:43                       ` Hunt, David
2018-09-28 12:52                         ` Liang, Ma
2018-09-28 10:47                   ` [dpdk-dev] [PATCH v8 1/4] lib/librte_power: traffic pattern aware power control Hunt, David
2018-10-02 10:13                     ` Liang, Ma
2018-09-28 14:58                   ` [dpdk-dev] [PATCH v9 " Liang Ma
2018-09-28 14:58                     ` [dpdk-dev] [PATCH v9 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
2018-09-28 14:58                     ` [dpdk-dev] [PATCH v9 3/4] doc/guides/pro_guide/power-man: update the power API Liang Ma
2018-10-02 13:48                     ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Liang Ma
2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 2/4] examples/l3fwd-power: simple app update for new API Liang Ma
2018-10-02 14:23                         ` Hunt, David
2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 3/4] doc/guides/pro_guide/power-man: update the power API Liang Ma
2018-10-02 14:24                         ` Hunt, David
2018-10-02 13:48                       ` [dpdk-dev] [PATCH v10 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: update Liang Ma
2018-10-02 14:25                         ` Hunt, David
2018-10-02 14:22                       ` [dpdk-dev] [PATCH v10 1/4] lib/librte_power: traffic pattern aware power control Hunt, David
2018-10-12  1:59                       ` Yao, Lei A
2018-10-12 10:02                         ` Liang, Ma
2018-10-12 13:22                           ` Yao, Lei A
2018-10-19 10:23                       ` [dpdk-dev] [PATCH v11 1/5] " Liang Ma
2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 3/5] doc/guides/pro_guide/power-man: update the power API Liang Ma
2018-10-19 10:23                         ` [dpdk-dev] [PATCH v11 5/5] doc: update release notes for empty poll library Liang Ma
2018-10-19 11:07                         ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Liang Ma
2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 2/5] examples/l3fwd-power: simple app update for new API Liang Ma
2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 3/5] doc/guides/pro_guide/power-man: update the power API Liang Ma
2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 4/5] doc/guides/sample_app_ug/l3_forward_power_man.rst: update Liang Ma
2018-10-19 11:07                           ` [dpdk-dev] [PATCH v12 5/5] doc: update release notes for empty poll library Liang Ma
2018-10-22 12:41                             ` Kovacevic, Marko
2018-10-25 23:39                             ` Thomas Monjalon
2018-10-25 23:22                           ` [dpdk-dev] [PATCH v12 1/5] lib/librte_power: traffic pattern aware power control Thomas Monjalon
2018-10-25 23:32                             ` Thomas Monjalon
2018-10-25 23:54                           ` Thomas Monjalon
2018-10-25 23:55                           ` Thomas Monjalon
2018-10-26 13:34                             ` Liang, Ma
2018-10-01 10:06                   ` [dpdk-dev] [PATCH v9 4/4] doc/guides/sample_app_ug/l3_forward_power_man.rst: empty poll update Liang Ma
2018-06-14 10:59 ` [dpdk-dev] [PATCH v1 1/2] lib/librte_power: traffic pattern aware power control Hunt, David
2018-06-18 16:11   ` Liang, Ma
