[PATCH v1 1/2] eal: add lcore busyness telemetry

DPDK patches and discussions
 help / color / mirror / Atom feed

* [PATCH v1 1/2] eal: add lcore busyness telemetry
@ 2022-07-15 13:12 Anatoly Burakov
  2022-07-15 13:12 ` [PATCH v1 2/2] eal: add cpuset lcore telemetry entries Anatoly Burakov
                   ` (9 more replies)
  0 siblings, 10 replies; 87+ messages in thread
From: Anatoly Burakov @ 2022-07-15 13:12 UTC (permalink / raw)
  To: dev, Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, David Hunt, Chengwen Feng, Kevin Laatz,
	Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev
  Cc: Conor Walsh

Currently, there is no way to measure lcore busyness in a passive way,
without any modifications to the application. This patch adds a new EAL
API that will be able to passively track core busyness.

The busyness is calculated by relying on the fact that most DPDK API's
will poll for packets. Empty polls can be counted as "idle", while
non-empty polls can be counted as busy. To measure lcore busyness, we
simply call the telemetry timestamping function with the number of polls
a particular code section has processed, and count the number of cycles
we've spent processing empty bursts. The more empty bursts we encounter,
the less cycles we spend in "busy" state, and the less core busyness
will be reported.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to
the lcore telemetry busyness timestamping function. The following parts
of DPDK are instrumented with lcore telemetry calls:

- All major driver API's:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a
global variable is exported by EAL, and a call to timestamping function
is wrapped into a macro, so that whenever telemetry is disabled, it only
takes one additional branch and no function calls are performed. It is
also possible to disable it at compile time by commenting out
RTE_LCORE_BUSYNESS from build config.

This patch also adds a telemetry endpoint to report lcore busyness, as
well as telemetry endpoints to enable/disable lcore telemetry.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---

Notes:
    We did a couple of quick smoke tests to see if this patch causes any performance
    degradation, and it seemed to have none that we could measure. Telemetry can be
    disabled at compile time via a config option, while at runtime it can be
    disabled, seemingly at a cost of one additional branch.
    
    That said, our benchmarking efforts were admittedly not very rigorous, so
    comments welcome!

 config/rte_config.h                         |   2 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 274 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  80 ++++++
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   5 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 17 files changed, 437 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..583cb6f7a5 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,8 @@
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_BUSYNESS 1
+#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6ed176cce 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@ extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..912cee9a16 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..072874020d 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..35b0d8d36b 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_TELEMETRY_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
+	RTE_LCORE_TELEMETRY_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..dc58791bf4 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while (!(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) == 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..98176a6a7a 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..5e4ea15ff5
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,274 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+int __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_BUSYNESS
+
+struct lcore_telemetry {
+	int busyness;
+	/**< Calculated busyness (gets set/returned by the API) */
+	int raw_busyness;
+	/**< Calculated busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
+
+#define LCORE_BUSYNESS_MAX 100
+#define LCORE_BUSYNESS_NOT_SET -1
+#define LCORE_BUSYNESS_MIN 0
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->busyness = LCORE_BUSYNESS_NOT_SET;
+		td->raw_busyness = 0;
+	}
+}
+
+int rte_lcore_busyness(unsigned int lcore_id)
+{
+	const uint64_t active_thresh = RTE_LCORE_BUSYNESS_PERIOD * 1000;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its busyness */
+	return telemetry_data[lcore_id].busyness;
+}
+
+int rte_lcore_busyness_enabled(void)
+{
+	return __rte_lcore_telemetry_enabled;
+}
+
+void rte_lcore_busyness_enabled_set(int enable)
+{
+	__rte_lcore_telemetry_enabled = !!enable;
+
+	if (!enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * we don't want to use floating point math here, but we want for our
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows busyness
+	 * as measured just-in-time. therefore, we will calculate the average
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. this allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_BUSYNESS_MAX * 100;
+
+	/*
+	 * at upper end of the busyness scale, going up from 90->100 will take
+	 * longer than going from 10->20 because of the averaging. to address
+	 * this, we invert the scale when doing calculations: that is, we
+	 * effectively calculate average *idle* cycle percentage, not average
+	 * *busy* cycle percentage. this means that the scale is naturally
+	 * biased towards fast scaling up, and slow scaling down.
+	 */
+	const int prev_raw_idle = max_raw_idle - tdata->raw_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle = (cur_raw_idle + prev_raw_idle * 4) / 5;
+
+	/* convert idleness back to busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->busyness = LCORE_BUSYNESS_MIN;
+		tdata->raw_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
+		int raw_busyness;
+
+		/* get updated busyness value */
+		raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_busyness = raw_busyness;
+		/* bring busyness back to 0..100 range, biased to round up */
+		tdata->busyness = (raw_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_busyness_enabled_set(1);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_busyness_enabled_set(0);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "busyness_enabled", 0);
+
+	return 0;
+}
+
+static int
+lcore_handle_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_busyness(i));
+	}
+
+	return 0;
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	__rte_lcore_telemetry_enabled = true;
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness", lcore_handle_busyness,
+				   "return percentage busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness_enable", lcore_busyness_enable,
+				   "enable lcore busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness_disable", lcore_busyness_disable,
+				   "disable lcore busyness measurement");
+}
+
+#else
+
+int rte_lcore_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_busyness_enabled_set(int enable __rte_unused)
+{
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@ sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..ab7a8e1e26 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -415,6 +415,86 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore busyness telemetry is enabled.
+ *
+ * @return
+ *   - 1 if lcore telemetry is enabled
+ *   - 0 if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_busyness_enabled_set(int enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore telemetry timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_telemetry_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern int __rte_lcore_telemetry_enabled;
+
+/**
+ * Call lcore telemetry timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_BUSYNESS
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
+	do {                                                    \
+		if (__rte_lcore_telemetry_enabled)              \
+			__rte_lcore_telemetry_timestamp(nb_rx); \
+	} while (0)
+#else
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
+	while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..7199aa03c2 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@ subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index c2a2cebf69..52061b30f0 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@ EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_telemetry_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_busyness;
+	rte_lcore_busyness_enabled;
+	rte_lcore_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..1caecd5a11 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a1d42d9214 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..27163e87cb 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -226,12 +226,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..781055b4eb 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..6db09d4291 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_TELEMETRY_TIMESTAMP(n);
 	return n;
 }
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v1 2/2] eal: add cpuset lcore telemetry entries
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
@ 2022-07-15 13:12 ` Anatoly Burakov
  2022-07-15 13:35 ` [PATCH v1 1/2] eal: add lcore busyness telemetry Burakov, Anatoly
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 87+ messages in thread
From: Anatoly Burakov @ 2022-07-15 13:12 UTC (permalink / raw)
  To: dev

Expose per-lcore cpuset information to telemetry.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/common/eal_common_lcore_telemetry.c | 47 +++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
index 5e4ea15ff5..39fffe2b93 100644
--- a/lib/eal/common/eal_common_lcore_telemetry.c
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -19,6 +19,8 @@ int __rte_lcore_telemetry_enabled;
 
 #ifdef RTE_LCORE_BUSYNESS
 
+#include "eal_private.h"
+
 struct lcore_telemetry {
 	int busyness;
 	/**< Calculated busyness (gets set/returned by the API) */
@@ -235,6 +237,48 @@ lcore_handle_busyness(const char *cmd __rte_unused,
 	return 0;
 }
 
+static int
+lcore_handle_cpuset(const char *cmd __rte_unused,
+		    const char *params __rte_unused,
+		    struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		const struct lcore_config *cfg = &lcore_config[i];
+		const rte_cpuset_t *cpuset = &cfg->cpuset;
+		struct rte_tel_data *ld;
+		unsigned int cpu;
+
+		if (!rte_lcore_is_enabled(i))
+			continue;
+
+		/* create an array of integers */
+		ld = rte_tel_data_alloc();
+		if (ld == NULL)
+			return -ENOMEM;
+		rte_tel_data_start_array(ld, RTE_TEL_INT_VAL);
+
+		/* add cpu ID's from cpuset to the array */
+		for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+			if (!CPU_ISSET(cpu, cpuset))
+				continue;
+			rte_tel_data_add_array_int(ld, cpu);
+		}
+
+		/* add array to the per-lcore container */
+		snprintf(corenum, sizeof(corenum), "%d", i);
+
+		/* tell telemetry library to free this array automatically */
+		rte_tel_data_add_dict_container(d, corenum, ld, 0);
+	}
+
+	return 0;
+}
+
 RTE_INIT(lcore_init_telemetry)
 {
 	__rte_lcore_telemetry_enabled = true;
@@ -249,6 +293,9 @@ RTE_INIT(lcore_init_telemetry)
 
 	rte_telemetry_register_cmd("/eal/lcore/busyness_disable", lcore_busyness_disable,
 				   "disable lcore busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/cpuset", lcore_handle_cpuset,
+				   "list physical core affinity for each lcore");
 }
 
 #else
-- 
2.25.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
  2022-07-15 13:12 ` [PATCH v1 2/2] eal: add cpuset lcore telemetry entries Anatoly Burakov
@ 2022-07-15 13:35 ` Burakov, Anatoly
  2022-07-15 13:46 ` Jerin Jacob
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 87+ messages in thread
From: Burakov, Anatoly @ 2022-07-15 13:35 UTC (permalink / raw)
  To: dev, Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, David Hunt, Chengwen Feng, Kevin Laatz,
	Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev
  Cc: Conor Walsh

On 15-Jul-22 2:12 PM, Anatoly Burakov wrote:
> Currently, there is no way to measure lcore busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL
> API that will be able to passively track core busyness.
> 
> The busyness is calculated by relying on the fact that most DPDK API's
> will poll for packets. Empty polls can be counted as "idle", while
> non-empty polls can be counted as busy. To measure lcore busyness, we
> simply call the telemetry timestamping function with the number of polls
> a particular code section has processed, and count the number of cycles
> we've spent processing empty bursts. The more empty bursts we encounter,
> the less cycles we spend in "busy" state, and the less core busyness
> will be reported.
> 
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to
> the lcore telemetry busyness timestamping function. The following parts
> of DPDK are instrumented with lcore telemetry calls:
> 
> - All major driver API's:
>    - ethdev
>    - cryptodev
>    - compressdev
>    - regexdev
>    - bbdev
>    - rawdev
>    - eventdev
>    - dmadev
> - Some additional libraries:
>    - ring
>    - distributor
> 
> To avoid performance impact from having lcore telemetry support, a
> global variable is exported by EAL, and a call to timestamping function
> is wrapped into a macro, so that whenever telemetry is disabled, it only
> takes one additional branch and no function calls are performed. It is
> also possible to disable it at compile time by commenting out
> RTE_LCORE_BUSYNESS from build config.
> 
> This patch also adds a telemetry endpoint to report lcore busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>      We did a couple of quick smoke tests to see if this patch causes any performance
>      degradation, and it seemed to have none that we could measure. Telemetry can be
>      disabled at compile time via a config option, while at runtime it can be
>      disabled, seemingly at a cost of one additional branch.
>      
>      That said, our benchmarking efforts were admittedly not very rigorous, so
>      comments welcome!
> 
>   config/rte_config.h                         |   2 +
>   lib/bbdev/rte_bbdev.h                       |  17 +-
>   lib/compressdev/rte_compressdev.c           |   2 +
>   lib/cryptodev/rte_cryptodev.h               |   2 +
>   lib/distributor/rte_distributor.c           |  21 +-
>   lib/distributor/rte_distributor_single.c    |  14 +-
>   lib/dmadev/rte_dmadev.h                     |  15 +-
>   lib/eal/common/eal_common_lcore_telemetry.c | 274 ++++++++++++++++++++
>   lib/eal/common/meson.build                  |   1 +
>   lib/eal/include/rte_lcore.h                 |  80 ++++++
>   lib/eal/meson.build                         |   3 +
>   lib/eal/version.map                         |   7 +
>   lib/ethdev/rte_ethdev.h                     |   2 +
>   lib/eventdev/rte_eventdev.h                 |  10 +-
>   lib/rawdev/rte_rawdev.c                     |   5 +-
>   lib/regexdev/rte_regexdev.h                 |   5 +-
>   lib/ring/rte_ring_elem_pvt.h                |   1 +
>   17 files changed, 437 insertions(+), 24 deletions(-)
>   create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
> 
> diff --git a/config/rte_config.h b/config/rte_config.h
> index 46549cb062..583cb6f7a5 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,8 @@
>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>   #define RTE_BACKTRACE 1
>   #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_BUSYNESS 1
> +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL

One possible improvement here would be to specify period in 
microseconds, and use rte_get_tsc_hz() to adjust the telemetry period. 
This would require adding code to EAL init, because we can't use that 
API until EAL has called `rte_eal_timer_init()`, but it would make the 
telemetry period CPU frequency independent.


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
  2022-07-15 13:12 ` [PATCH v1 2/2] eal: add cpuset lcore telemetry entries Anatoly Burakov
  2022-07-15 13:35 ` [PATCH v1 1/2] eal: add lcore busyness telemetry Burakov, Anatoly
@ 2022-07-15 13:46 ` Jerin Jacob
  2022-07-15 14:11   ` Bruce Richardson
  2022-07-15 14:18   ` Burakov, Anatoly
  2022-07-15 22:13 ` Morten Brørup
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 87+ messages in thread
From: Jerin Jacob @ 2022-07-15 13:46 UTC (permalink / raw)
  To: Anatoly Burakov
  Cc: dpdk-dev, Bruce Richardson, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, David Hunt, Chengwen Feng,
	Kevin Laatz, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev, Conor Walsh

On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
<anatoly.burakov@intel.com> wrote:
>
> Currently, there is no way to measure lcore busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL
> API that will be able to passively track core busyness.
>
> The busyness is calculated by relying on the fact that most DPDK API's
> will poll for packets. Empty polls can be counted as "idle", while
> non-empty polls can be counted as busy. To measure lcore busyness, we
> simply call the telemetry timestamping function with the number of polls
> a particular code section has processed, and count the number of cycles
> we've spent processing empty bursts. The more empty bursts we encounter,
> the less cycles we spend in "busy" state, and the less core busyness
> will be reported.
>
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to
> the lcore telemetry busyness timestamping function. The following parts
> of DPDK are instrumented with lcore telemetry calls:
>
> - All major driver API's:
>   - ethdev
>   - cryptodev
>   - compressdev
>   - regexdev
>   - bbdev
>   - rawdev
>   - eventdev
>   - dmadev
> - Some additional libraries:
>   - ring
>   - distributor
>
> To avoid performance impact from having lcore telemetry support, a
> global variable is exported by EAL, and a call to timestamping function
> is wrapped into a macro, so that whenever telemetry is disabled, it only
> takes one additional branch and no function calls are performed. It is
> also possible to disable it at compile time by commenting out
> RTE_LCORE_BUSYNESS from build config.
>
> This patch also adds a telemetry endpoint to report lcore busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>


It is a good feature. Thanks for this work.


> ---
>
> Notes:
>     We did a couple of quick smoke tests to see if this patch causes any performance
>     degradation, and it seemed to have none that we could measure. Telemetry can be
>     disabled at compile time via a config option, while at runtime it can be
>     disabled, seemingly at a cost of one additional branch.
>
>     That said, our benchmarking efforts were admittedly not very rigorous, so
>     comments welcome!

>
> diff --git a/config/rte_config.h b/config/rte_config.h
> index 46549cb062..583cb6f7a5 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,8 @@
>  #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>  #define RTE_BACKTRACE 1
>  #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_BUSYNESS 1

Please don't enable debug features in fastpath as default.

> +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
>
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_BUSYNESS

This clutter may not be required. Let it compile all cases.

> +#include <rte_telemetry.h>
> +#endif
> +
> +int __rte_lcore_telemetry_enabled;
> +
> +#ifdef RTE_LCORE_BUSYNESS
> +
> +struct lcore_telemetry {
> +       int busyness;
> +       /**< Calculated busyness (gets set/returned by the API) */
> +       int raw_busyness;
> +       /**< Calculated busyness times 100. */
> +       uint64_t interval_ts;
> +       /**< when previous telemetry interval started */
> +       uint64_t empty_cycles;
> +       /**< empty cycle count since last interval */
> +       uint64_t last_poll_ts;
> +       /**< last poll timestamp */
> +       bool last_empty;
> +       /**< if last poll was empty */
> +       unsigned int contig_poll_cnt;
> +       /**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];

Allocate this from hugepage.

> +
> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> +{
> +       const unsigned int lcore_id = rte_lcore_id();
> +       uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +       struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
> +       const bool empty = nb_rx == 0;
> +       uint64_t diff_int, diff_last;
> +       bool last_empty;
> +
> +       last_empty = tdata->last_empty;
> +
> +       /* optimization: don't do anything if status hasn't changed */
> +       if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
> +               return;
> +       /* status changed or we're waiting for too long, reset counter */
> +       tdata->contig_poll_cnt = 0;
> +
> +       cur_tsc = rte_rdtsc();
> +
> +       interval_ts = tdata->interval_ts;
> +       empty_cycles = tdata->empty_cycles;
> +       last_poll_ts = tdata->last_poll_ts;
> +
> +       diff_int = cur_tsc - interval_ts;
> +       diff_last = cur_tsc - last_poll_ts;
> +
> +       /* is this the first time we're here? */
> +       if (interval_ts == 0) {
> +               tdata->busyness = LCORE_BUSYNESS_MIN;
> +               tdata->raw_busyness = 0;
> +               tdata->interval_ts = cur_tsc;
> +               tdata->empty_cycles = 0;
> +               tdata->contig_poll_cnt = 0;
> +               goto end;
> +       }
> +
> +       /* update the empty counter if we got an empty poll earlier */
> +       if (last_empty)
> +               empty_cycles += diff_last;
> +
> +       /* have we passed the interval? */
> +       if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {


I think, this function logic can be limited to just updating the
timestamp in the ring buffer,
and another control function of telemetry which runs in control core to do
heavy lifting to reduce the performance impact on fast path,

> +               int raw_busyness;
> +
> +               /* get updated busyness value */
> +               raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
> +
> +               /* set a new interval, reset empty counter */
> +               tdata->interval_ts = cur_tsc;
> +               tdata->empty_cycles = 0;
> +               tdata->raw_busyness = raw_busyness;
> +               /* bring busyness back to 0..100 range, biased to round up */
> +               tdata->busyness = (raw_busyness + 50) / 100;
> +       } else
> +               /* we may have updated empty counter */
> +               tdata->empty_cycles = empty_cycles;
> +
> +end:
> +       /* update status for next poll */
> +       tdata->last_poll_ts = cur_tsc;
> +       tdata->last_empty = empty;
> +}
> +
> +#ifdef RTE_LCORE_BUSYNESS
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
> +       do {                                                    \
> +               if (__rte_lcore_telemetry_enabled)              \

I think, rather than reading memory, Like Linux perf infrastructure,
we can patch up the instruction steam as NOP vs Timestamp capture
function.


Also instead of changing all libraries, Maybe we can use
"-finstrument-functions".
and just mark the function with attribute.

Just 2c.

> +                       __rte_lcore_telemetry_timestamp(nb_rx); \
> +       } while (0)
> +#else
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> +       while (0)
> +#endif
> +

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-15 13:46 ` Jerin Jacob
@ 2022-07-15 14:11   ` Bruce Richardson
  2022-07-15 14:18   ` Burakov, Anatoly
  1 sibling, 0 replies; 87+ messages in thread
From: Bruce Richardson @ 2022-07-15 14:11 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Anatoly Burakov, dpdk-dev, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, David Hunt, Chengwen Feng,
	Kevin Laatz, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev, Conor Walsh

On Fri, Jul 15, 2022 at 07:16:17PM +0530, Jerin Jacob wrote:
> On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
> >
> > Currently, there is no way to measure lcore busyness in a passive way,
> > without any modifications to the application. This patch adds a new EAL
> > API that will be able to passively track core busyness.
> >
> > The busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets. Empty polls can be counted as "idle", while
> > non-empty polls can be counted as busy. To measure lcore busyness, we
> > simply call the telemetry timestamping function with the number of polls
> > a particular code section has processed, and count the number of cycles
> > we've spent processing empty bursts. The more empty bursts we encounter,
> > the less cycles we spend in "busy" state, and the less core busyness
> > will be reported.
> >
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to
> > the lcore telemetry busyness timestamping function. The following parts
> > of DPDK are instrumented with lcore telemetry calls:
> >
> > - All major driver API's:
> >   - ethdev
> >   - cryptodev
> >   - compressdev
> >   - regexdev
> >   - bbdev
> >   - rawdev
> >   - eventdev
> >   - dmadev
> > - Some additional libraries:
> >   - ring
> >   - distributor
> >
> > To avoid performance impact from having lcore telemetry support, a
> > global variable is exported by EAL, and a call to timestamping function
> > is wrapped into a macro, so that whenever telemetry is disabled, it only
> > takes one additional branch and no function calls are performed. It is
> > also possible to disable it at compile time by commenting out
> > RTE_LCORE_BUSYNESS from build config.
> >
> > This patch also adds a telemetry endpoint to report lcore busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry.
> >
> > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > Signed-off-by: David Hunt <david.hunt@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> 
> It is a good feature. Thanks for this work.
> 

+1
Some follow-up comments inline below.

/Bruce

> 
> > ---
> >
> > Notes:
> >     We did a couple of quick smoke tests to see if this patch causes any performance
> >     degradation, and it seemed to have none that we could measure. Telemetry can be
> >     disabled at compile time via a config option, while at runtime it can be
> >     disabled, seemingly at a cost of one additional branch.
> >
> >     That said, our benchmarking efforts were admittedly not very rigorous, so
> >     comments welcome!
> 
> >
> > diff --git a/config/rte_config.h b/config/rte_config.h
> > index 46549cb062..583cb6f7a5 100644
> > --- a/config/rte_config.h
> > +++ b/config/rte_config.h
> > @@ -39,6 +39,8 @@
> >  #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
> >  #define RTE_BACKTRACE 1
> >  #define RTE_MAX_VFIO_CONTAINERS 64
> > +#define RTE_LCORE_BUSYNESS 1
> 
> Please don't enable debug features in fastpath as default.
> 

I would disagree that this is a debug feature. The number of times I have
heard from DPDK users that they wish to get more visibility into what the
app is doing rather than seeing 100% cpu busy. Therefore, I'd see this as
enabled by default rather than disabled - unless it's shown to have a
performance regression.

That said, since this impacts multiple components and it's something that
end users might want to disable at build-time, I'd suggest moving it to
meson_options.txt file and have it as an official DPDK build-time option.

> > +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
> >
> > +
> > +#include <unistd.h>
> > +#include <limits.h>
> > +#include <string.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_cycles.h>
> > +#include <rte_errno.h>
> > +#include <rte_lcore.h>
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> 
> This clutter may not be required. Let it compile all cases.
> 
> > +#include <rte_telemetry.h>
> > +#endif
> > +
> > +int __rte_lcore_telemetry_enabled;
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +
> > +struct lcore_telemetry {
> > +       int busyness;
> > +       /**< Calculated busyness (gets set/returned by the API) */
> > +       int raw_busyness;
> > +       /**< Calculated busyness times 100. */
> > +       uint64_t interval_ts;
> > +       /**< when previous telemetry interval started */
> > +       uint64_t empty_cycles;
> > +       /**< empty cycle count since last interval */
> > +       uint64_t last_poll_ts;
> > +       /**< last poll timestamp */
> > +       bool last_empty;
> > +       /**< if last poll was empty */
> > +       unsigned int contig_poll_cnt;
> > +       /**< contiguous (always empty/non empty) poll counter */
> > +} __rte_cache_aligned;
> > +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
> 
> Allocate this from hugepage.
> 

Yes, whether or not it's allocated from hugepages, dynamic allocation would
be better than having static vars.

> > +
> > +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> > +{
> > +       const unsigned int lcore_id = rte_lcore_id();
> > +       uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> > +       struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
> > +       const bool empty = nb_rx == 0;
> > +       uint64_t diff_int, diff_last;
> > +       bool last_empty;
> > +
> > +       last_empty = tdata->last_empty;
> > +
> > +       /* optimization: don't do anything if status hasn't changed */
> > +       if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
> > +               return;
> > +       /* status changed or we're waiting for too long, reset counter */
> > +       tdata->contig_poll_cnt = 0;
> > +
> > +       cur_tsc = rte_rdtsc();
> > +
> > +       interval_ts = tdata->interval_ts;
> > +       empty_cycles = tdata->empty_cycles;
> > +       last_poll_ts = tdata->last_poll_ts;
> > +
> > +       diff_int = cur_tsc - interval_ts;
> > +       diff_last = cur_tsc - last_poll_ts;
> > +
> > +       /* is this the first time we're here? */
> > +       if (interval_ts == 0) {
> > +               tdata->busyness = LCORE_BUSYNESS_MIN;
> > +               tdata->raw_busyness = 0;
> > +               tdata->interval_ts = cur_tsc;
> > +               tdata->empty_cycles = 0;
> > +               tdata->contig_poll_cnt = 0;
> > +               goto end;
> > +       }
> > +
> > +       /* update the empty counter if we got an empty poll earlier */
> > +       if (last_empty)
> > +               empty_cycles += diff_last;
> > +
> > +       /* have we passed the interval? */
> > +       if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
> 
> 
> I think, this function logic can be limited to just updating the
> timestamp in the ring buffer,
> and another control function of telemetry which runs in control core to do
> heavy lifting to reduce the performance impact on fast path,
> 
> > +               int raw_busyness;
> > +
> > +               /* get updated busyness value */
> > +               raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
> > +
> > +               /* set a new interval, reset empty counter */
> > +               tdata->interval_ts = cur_tsc;
> > +               tdata->empty_cycles = 0;
> > +               tdata->raw_busyness = raw_busyness;
> > +               /* bring busyness back to 0..100 range, biased to round up */
> > +               tdata->busyness = (raw_busyness + 50) / 100;
> > +       } else
> > +               /* we may have updated empty counter */
> > +               tdata->empty_cycles = empty_cycles;
> > +
> > +end:
> > +       /* update status for next poll */
> > +       tdata->last_poll_ts = cur_tsc;
> > +       tdata->last_empty = empty;
> > +}
> > +
> > +#ifdef RTE_LCORE_BUSYNESS
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
> > +       do {                                                    \
> > +               if (__rte_lcore_telemetry_enabled)              \
> 
> I think, rather than reading memory, Like Linux perf infrastructure,
> we can patch up the instruction steam as NOP vs Timestamp capture
> function.
> 
Surely that requires much more complicated tooling? How would that work in
this situation?

> 
> Also instead of changing all libraries, Maybe we can use
> "-finstrument-functions".
> and just mark the function with attribute.
> 
> Just 2c.
> 
> > +                       __rte_lcore_telemetry_timestamp(nb_rx); \
> > +       } while (0)
> > +#else
> > +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> > +       while (0)
> > +#endif
> > +

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-15 13:46 ` Jerin Jacob
  2022-07-15 14:11   ` Bruce Richardson
@ 2022-07-15 14:18   ` Burakov, Anatoly
  1 sibling, 0 replies; 87+ messages in thread
From: Burakov, Anatoly @ 2022-07-15 14:18 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Bruce Richardson, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, David Hunt, Chengwen Feng,
	Kevin Laatz, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev, Conor Walsh

On 15-Jul-22 2:46 PM, Jerin Jacob wrote:
> On Fri, Jul 15, 2022 at 6:42 PM Anatoly Burakov
> <anatoly.burakov@intel.com> wrote:
>>
>> Currently, there is no way to measure lcore busyness in a passive way,
>> without any modifications to the application. This patch adds a new EAL
>> API that will be able to passively track core busyness.
>>
>> The busyness is calculated by relying on the fact that most DPDK API's
>> will poll for packets. Empty polls can be counted as "idle", while
>> non-empty polls can be counted as busy. To measure lcore busyness, we
>> simply call the telemetry timestamping function with the number of polls
>> a particular code section has processed, and count the number of cycles
>> we've spent processing empty bursts. The more empty bursts we encounter,
>> the less cycles we spend in "busy" state, and the less core busyness
>> will be reported.
>>
>> In order for all of the above to work without modifications to the
>> application, the library code needs to be instrumented with calls to
>> the lcore telemetry busyness timestamping function. The following parts
>> of DPDK are instrumented with lcore telemetry calls:
>>
>> - All major driver API's:
>>    - ethdev
>>    - cryptodev
>>    - compressdev
>>    - regexdev
>>    - bbdev
>>    - rawdev
>>    - eventdev
>>    - dmadev
>> - Some additional libraries:
>>    - ring
>>    - distributor
>>
>> To avoid performance impact from having lcore telemetry support, a
>> global variable is exported by EAL, and a call to timestamping function
>> is wrapped into a macro, so that whenever telemetry is disabled, it only
>> takes one additional branch and no function calls are performed. It is
>> also possible to disable it at compile time by commenting out
>> RTE_LCORE_BUSYNESS from build config.
>>
>> This patch also adds a telemetry endpoint to report lcore busyness, as
>> well as telemetry endpoints to enable/disable lcore telemetry.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
>> Signed-off-by: David Hunt <david.hunt@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> 
> It is a good feature. Thanks for this work.

Hi Jerin,

Thanks for your review! Comments below.

> 
> 
>> ---
>>
>> Notes:
>>      We did a couple of quick smoke tests to see if this patch causes any performance
>>      degradation, and it seemed to have none that we could measure. Telemetry can be
>>      disabled at compile time via a config option, while at runtime it can be
>>      disabled, seemingly at a cost of one additional branch.
>>
>>      That said, our benchmarking efforts were admittedly not very rigorous, so
>>      comments welcome!
> 
>>
>> diff --git a/config/rte_config.h b/config/rte_config.h
>> index 46549cb062..583cb6f7a5 100644
>> --- a/config/rte_config.h
>> +++ b/config/rte_config.h
>> @@ -39,6 +39,8 @@
>>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>>   #define RTE_BACKTRACE 1
>>   #define RTE_MAX_VFIO_CONTAINERS 64
>> +#define RTE_LCORE_BUSYNESS 1
> 
> Please don't enable debug features in fastpath as default.

It is not meant to be a debug feature. The ability to measure CPU usage 
in DPDK applications consistently comes up as one of the top asks from 
all kinds of people working on DPDK, and this is an attempt to address 
that at a fundamental level. This is more of a quality of life 
improvement than a debug feature.

> 
>> +#define RTE_LCORE_BUSYNESS_PERIOD 4000000ULL
>>
>> +
>> +#include <unistd.h>
>> +#include <limits.h>
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_cycles.h>
>> +#include <rte_errno.h>
>> +#include <rte_lcore.h>
>> +
>> +#ifdef RTE_LCORE_BUSYNESS
> 
> This clutter may not be required. Let it compile all cases.

Windows does not have telemetry library (so this define never exists on 
Windows), and we want to be able to compile this without lcore telemetry 
enabled as well (by commenting out the config option). We would have to 
have #ifdef-ery here in any case. Am I missing an obvious way to have 
this without #ifdef's?

> 
>> +#include <rte_telemetry.h>
>> +#endif
>> +
>> +int __rte_lcore_telemetry_enabled;
>> +
>> +#ifdef RTE_LCORE_BUSYNESS
>> +
>> +struct lcore_telemetry {
>> +       int busyness;
>> +       /**< Calculated busyness (gets set/returned by the API) */
>> +       int raw_busyness;
>> +       /**< Calculated busyness times 100. */
>> +       uint64_t interval_ts;
>> +       /**< when previous telemetry interval started */
>> +       uint64_t empty_cycles;
>> +       /**< empty cycle count since last interval */
>> +       uint64_t last_poll_ts;
>> +       /**< last poll timestamp */
>> +       bool last_empty;
>> +       /**< if last poll was empty */
>> +       unsigned int contig_poll_cnt;
>> +       /**< contiguous (always empty/non empty) poll counter */
>> +} __rte_cache_aligned;
>> +static struct lcore_telemetry telemetry_data[RTE_MAX_LCORE];
> 
> Allocate this from hugepage.

Good suggestion, probably needs per-socket structures as well to avoid 
cross-socket accesses.

> 
>> +
>> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
>> +{
>> +       const unsigned int lcore_id = rte_lcore_id();
>> +       uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
>> +       struct lcore_telemetry *tdata = &telemetry_data[lcore_id];
>> +       const bool empty = nb_rx == 0;
>> +       uint64_t diff_int, diff_last;
>> +       bool last_empty;
>> +
>> +       last_empty = tdata->last_empty;
>> +
>> +       /* optimization: don't do anything if status hasn't changed */
>> +       if (last_empty == empty && tdata->contig_poll_cnt++ < 32)
>> +               return;
>> +       /* status changed or we're waiting for too long, reset counter */
>> +       tdata->contig_poll_cnt = 0;
>> +
>> +       cur_tsc = rte_rdtsc();
>> +
>> +       interval_ts = tdata->interval_ts;
>> +       empty_cycles = tdata->empty_cycles;
>> +       last_poll_ts = tdata->last_poll_ts;
>> +
>> +       diff_int = cur_tsc - interval_ts;
>> +       diff_last = cur_tsc - last_poll_ts;
>> +
>> +       /* is this the first time we're here? */
>> +       if (interval_ts == 0) {
>> +               tdata->busyness = LCORE_BUSYNESS_MIN;
>> +               tdata->raw_busyness = 0;
>> +               tdata->interval_ts = cur_tsc;
>> +               tdata->empty_cycles = 0;
>> +               tdata->contig_poll_cnt = 0;
>> +               goto end;
>> +       }
>> +
>> +       /* update the empty counter if we got an empty poll earlier */
>> +       if (last_empty)
>> +               empty_cycles += diff_last;
>> +
>> +       /* have we passed the interval? */
>> +       if (diff_int > RTE_LCORE_BUSYNESS_PERIOD) {
> 
> 
> I think, this function logic can be limited to just updating the
> timestamp in the ring buffer,
> and another control function of telemetry which runs in control core to do
> heavy lifting to reduce the performance impact on fast path,

That ring buffer would have to be rather large, because telemetry calls 
can come in as often as once every couple of hundred cycles (when polls 
are empty) while the telemetry measuring period is in the millions of 
cycles. That said, this is an interesting area to explore in further 
revisions, so I'll think of something along those lines, thanks!

> 
>> +               int raw_busyness;
>> +
>> +               /* get updated busyness value */
>> +               raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
>> +
>> +               /* set a new interval, reset empty counter */
>> +               tdata->interval_ts = cur_tsc;
>> +               tdata->empty_cycles = 0;
>> +               tdata->raw_busyness = raw_busyness;
>> +               /* bring busyness back to 0..100 range, biased to round up */
>> +               tdata->busyness = (raw_busyness + 50) / 100;
>> +       } else
>> +               /* we may have updated empty counter */
>> +               tdata->empty_cycles = empty_cycles;
>> +
>> +end:
>> +       /* update status for next poll */
>> +       tdata->last_poll_ts = cur_tsc;
>> +       tdata->last_empty = empty;
>> +}
>> +
>> +#ifdef RTE_LCORE_BUSYNESS
>> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
>> +       do {                                                    \
>> +               if (__rte_lcore_telemetry_enabled)              \
> 
> I think, rather than reading memory, Like Linux perf infrastructure,
> we can patch up the instruction steam as NOP vs Timestamp capture
> function.

Would you be so kind as to provide a link to examples of how this can be 
implemented?

> 
> 
> Also instead of changing all libraries, Maybe we can use
> "-finstrument-functions".
> and just mark the function with attribute.

This is not meant to be a profiling solution, it is rather a lightweight 
CPU usage measuring tool, to expose (albeit limited) CPU usage data. I'm 
by all means not an expert on `-finstrument-functions` flag, but from 
cursory reading of description in GCC manual it seems to be rather 
heavyweight as it's captuing function enter/exit and other stuff, which 
we're not really interested in. This *would* make this feature a debug 
feature, but this was not the intention here :)

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
                   ` (2 preceding siblings ...)
  2022-07-15 13:46 ` Jerin Jacob
@ 2022-07-15 22:13 ` Morten Brørup
  2022-07-16 14:38   ` Thomas Monjalon
  2022-07-17  3:10   ` Honnappa Nagarahalli
  2022-08-24 16:24 ` [PATCH v2 0/3] Add lcore poll " Kevin Laatz
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 87+ messages in thread
From: Morten Brørup @ 2022-07-15 22:13 UTC (permalink / raw)
  To: Anatoly Burakov, dev, Bruce Richardson, Nicolas Chautru,
	Fan Zhang, Ashish Gupta, Akhil Goyal, David Hunt, Chengwen Feng,
	Kevin Laatz, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev
  Cc: Conor Walsh

> From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> Sent: Friday, 15 July 2022 15.13
> 
> Currently, there is no way to measure lcore busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL
> API that will be able to passively track core busyness.
> 
> The busyness is calculated by relying on the fact that most DPDK API's
> will poll for packets.

This is an "alternative fact"! Only run-to-completion applications polls for RX. Pipelined applications do not poll for packets in every pipeline stage.

> Empty polls can be counted as "idle", while
> non-empty polls can be counted as busy. To measure lcore busyness, we
> simply call the telemetry timestamping function with the number of
> polls
> a particular code section has processed, and count the number of cycles
> we've spent processing empty bursts. The more empty bursts we
> encounter,
> the less cycles we spend in "busy" state, and the less core busyness
> will be reported.
> 
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to
> the lcore telemetry busyness timestamping function. The following parts
> of DPDK are instrumented with lcore telemetry calls:
> 
> - All major driver API's:
>   - ethdev
>   - cryptodev
>   - compressdev
>   - regexdev
>   - bbdev
>   - rawdev
>   - eventdev
>   - dmadev
> - Some additional libraries:
>   - ring
>   - distributor
> 
> To avoid performance impact from having lcore telemetry support, a
> global variable is exported by EAL, and a call to timestamping function
> is wrapped into a macro, so that whenever telemetry is disabled, it
> only
> takes one additional branch and no function calls are performed. It is
> also possible to disable it at compile time by commenting out
> RTE_LCORE_BUSYNESS from build config.

Since all of this can be completely disabled at build time, and thus has exactly zero performance impact, I will not object to this patch.

> 
> This patch also adds a telemetry endpoint to report lcore busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> ---
> 
> Notes:
>     We did a couple of quick smoke tests to see if this patch causes
> any performance
>     degradation, and it seemed to have none that we could measure.
> Telemetry can be
>     disabled at compile time via a config option, while at runtime it
> can be
>     disabled, seemingly at a cost of one additional branch.
> 
>     That said, our benchmarking efforts were admittedly not very
> rigorous, so
>     comments welcome!

This patch does not reflect lcore business, it reflects some sort of ingress activity level.

All the considerations regarding non-intrusiveness and low overhead are good, but everything in this patch needs to be renamed to reflect what it truly does, so it is clear that pipelined applications cannot use this telemetry for measuring lcore business (except on the ingress pipeline stage).

It's a shame that so much effort clearly has gone into this patch, and no one stopped to consider pipelined applications. :-(


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-15 22:13 ` Morten Brørup
@ 2022-07-16 14:38   ` Thomas Monjalon
  2022-07-17  3:10   ` Honnappa Nagarahalli
  1 sibling, 0 replies; 87+ messages in thread
From: Thomas Monjalon @ 2022-07-16 14:38 UTC (permalink / raw)
  To: Anatoly Burakov, Morten Brørup
  Cc: dev, Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, David Hunt, Chengwen Feng, Kevin Laatz,
	Ray Kinsella, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev, Conor Walsh

16/07/2022 00:13, Morten Brørup:
> > From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> > Sent: Friday, 15 July 2022 15.13
> > 
> > Currently, there is no way to measure lcore busyness in a passive way,
> > without any modifications to the application. This patch adds a new EAL
> > API that will be able to passively track core busyness.
> > 
> > The busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets.
> 
> This is an "alternative fact"! Only run-to-completion applications polls for RX. Pipelined applications do not poll for packets in every pipeline stage.
> 
> > Empty polls can be counted as "idle", while
> > non-empty polls can be counted as busy. To measure lcore busyness, we
> > simply call the telemetry timestamping function with the number of
> > polls
> > a particular code section has processed, and count the number of cycles
> > we've spent processing empty bursts. The more empty bursts we
> > encounter,
> > the less cycles we spend in "busy" state, and the less core busyness
> > will be reported.
> > 
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to
> > the lcore telemetry busyness timestamping function. The following parts
> > of DPDK are instrumented with lcore telemetry calls:
> > 
> > - All major driver API's:
> >   - ethdev
> >   - cryptodev
> >   - compressdev
> >   - regexdev
> >   - bbdev
> >   - rawdev
> >   - eventdev
> >   - dmadev
> > - Some additional libraries:
> >   - ring
> >   - distributor
> > 
> > To avoid performance impact from having lcore telemetry support, a
> > global variable is exported by EAL, and a call to timestamping function
> > is wrapped into a macro, so that whenever telemetry is disabled, it
> > only
> > takes one additional branch and no function calls are performed. It is
> > also possible to disable it at compile time by commenting out
> > RTE_LCORE_BUSYNESS from build config.
> 
> Since all of this can be completely disabled at build time, and thus has exactly zero performance impact, I will not object to this patch.
> 
> > 
> > This patch also adds a telemetry endpoint to report lcore busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry.
> > 
> > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > Signed-off-by: David Hunt <david.hunt@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > ---
> > 
> > Notes:
> >     We did a couple of quick smoke tests to see if this patch causes
> > any performance
> >     degradation, and it seemed to have none that we could measure.
> > Telemetry can be
> >     disabled at compile time via a config option, while at runtime it
> > can be
> >     disabled, seemingly at a cost of one additional branch.
> > 
> >     That said, our benchmarking efforts were admittedly not very
> > rigorous, so
> >     comments welcome!
> 
> This patch does not reflect lcore business, it reflects some sort of ingress activity level.
> 
> All the considerations regarding non-intrusiveness and low overhead are good, but everything in this patch needs to be renamed to reflect what it truly does, so it is clear that pipelined applications cannot use this telemetry for measuring lcore business (except on the ingress pipeline stage).

+1
Anatoly, please reflect polling activity in naming.

> It's a shame that so much effort clearly has gone into this patch, and no one stopped to consider pipelined applications. :-(

That's because no RFC was sent I think.



^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-15 22:13 ` Morten Brørup
  2022-07-16 14:38   ` Thomas Monjalon
@ 2022-07-17  3:10   ` Honnappa Nagarahalli
  2022-07-17  9:56     ` Morten Brørup
  1 sibling, 1 reply; 87+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-17  3:10 UTC (permalink / raw)
  To: Morten Brørup, Anatoly Burakov, dev, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	David Hunt, Chengwen Feng, Kevin Laatz, Ray Kinsella, thomas,
	Ferruh Yigit, Andrew Rybchenko, jerinj, Sachin Saxena,
	hemant.agrawal, Ori Kam, Konstantin Ananyev
  Cc: Conor Walsh, nd, nd


<snip>

> Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
> 
> > From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> > Sent: Friday, 15 July 2022 15.13
> >
> > Currently, there is no way to measure lcore busyness in a passive way,
> > without any modifications to the application. This patch adds a new
> > EAL API that will be able to passively track core busyness.
> >
> > The busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets.
> 
> This is an "alternative fact"! Only run-to-completion applications polls for RX.
> Pipelined applications do not poll for packets in every pipeline stage.
I guess you meant, poll for packets from NIC. They still need to receive packets from queues. We could do a similar thing for rte_ring APIs.

> 
> > Empty polls can be counted as "idle", while non-empty polls can be
> > counted as busy. To measure lcore busyness, we simply call the
> > telemetry timestamping function with the number of polls a particular
> > code section has processed, and count the number of cycles we've spent
> > processing empty bursts. The more empty bursts we encounter, the less
> > cycles we spend in "busy" state, and the less core busyness will be
> > reported.
> >
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to
> > the lcore telemetry busyness timestamping function. The following
> > parts of DPDK are instrumented with lcore telemetry calls:
> >
> > - All major driver API's:
> >   - ethdev
> >   - cryptodev
> >   - compressdev
> >   - regexdev
> >   - bbdev
> >   - rawdev
> >   - eventdev
> >   - dmadev
> > - Some additional libraries:
> >   - ring
> >   - distributor
> >
> > To avoid performance impact from having lcore telemetry support, a
> > global variable is exported by EAL, and a call to timestamping
> > function is wrapped into a macro, so that whenever telemetry is
> > disabled, it only takes one additional branch and no function calls
> > are performed. It is also possible to disable it at compile time by
> > commenting out RTE_LCORE_BUSYNESS from build config.
> 
> Since all of this can be completely disabled at build time, and thus has exactly
> zero performance impact, I will not object to this patch.
> 
> >
> > This patch also adds a telemetry endpoint to report lcore busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry.
> >
> > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > Signed-off-by: David Hunt <david.hunt@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > ---
> >
> > Notes:
> >     We did a couple of quick smoke tests to see if this patch causes
> > any performance
> >     degradation, and it seemed to have none that we could measure.
> > Telemetry can be
> >     disabled at compile time via a config option, while at runtime it
> > can be
> >     disabled, seemingly at a cost of one additional branch.
> >
> >     That said, our benchmarking efforts were admittedly not very
> > rigorous, so
> >     comments welcome!
> 
> This patch does not reflect lcore business, it reflects some sort of ingress
> activity level.
> 
> All the considerations regarding non-intrusiveness and low overhead are
> good, but everything in this patch needs to be renamed to reflect what it truly
> does, so it is clear that pipelined applications cannot use this telemetry for
> measuring lcore business (except on the ingress pipeline stage).
> 
> It's a shame that so much effort clearly has gone into this patch, and no one
> stopped to consider pipelined applications. :-(


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-17  3:10   ` Honnappa Nagarahalli
@ 2022-07-17  9:56     ` Morten Brørup
  2022-07-18  9:43       ` Burakov, Anatoly
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-07-17  9:56 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Anatoly Burakov, dev, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	David Hunt, Chengwen Feng, Kevin Laatz, Ray Kinsella, thomas,
	Ferruh Yigit, Andrew Rybchenko, jerinj, Sachin Saxena,
	hemant.agrawal, Ori Kam, Konstantin Ananyev, Thomas Monjalon
  Cc: Conor Walsh, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Sunday, 17 July 2022 05.10
> 
> <snip>
> 
> > Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
> >
> > > From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> > > Sent: Friday, 15 July 2022 15.13
> > >
> > > Currently, there is no way to measure lcore busyness in a passive
> way,
> > > without any modifications to the application. This patch adds a new
> > > EAL API that will be able to passively track core busyness.
> > >
> > > The busyness is calculated by relying on the fact that most DPDK
> API's
> > > will poll for packets.
> >
> > This is an "alternative fact"! Only run-to-completion applications
> polls for RX.
> > Pipelined applications do not poll for packets in every pipeline
> stage.
> I guess you meant, poll for packets from NIC. They still need to
> receive packets from queues. We could do a similar thing for rte_ring
> APIs.

But it would mix apples, pears and bananas.

Let's say you have a pipeline with three ingress preprocessing threads, two advanced packet processing threads in the next pipeline stage and one egress thread as the third pipeline stage.

Now, the metrics reflects busyness for six threads, but three of them are apples, two of them are pears, and one is bananas.

I just realized another example, where this patch might give misleading results on a run-to-completion application:

One thread handles a specific type of packets received on an Ethdev ingress queue set up by the rte_flow APIs, and another thread handles ingress packets from another Ethdev ingress queue. E.g. the first queue may contain packets for well known flows, where packets can be processed quickly, and the other queue for other packets requiring more scrutiny. Both threads are run-to-completion and handle Ethdev ingress packets.

*So: Only applications where the threads perform the exact same task can use this patch.*

Also, rings may be used for other purposes than queueing packets between pipeline stages. E.g. our application uses rings for fast bulk allocation and freeing of other resources.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-17  9:56     ` Morten Brørup
@ 2022-07-18  9:43       ` Burakov, Anatoly
  2022-07-18 10:59         ` Morten Brørup
  2022-07-18 15:46         ` Stephen Hemminger
  0 siblings, 2 replies; 87+ messages in thread
From: Burakov, Anatoly @ 2022-07-18  9:43 UTC (permalink / raw)
  To: Morten Brørup, Honnappa Nagarahalli, dev, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	David Hunt, Chengwen Feng, Kevin Laatz, Ray Kinsella, thomas,
	Ferruh Yigit, Andrew Rybchenko, jerinj, Sachin Saxena,
	hemant.agrawal, Ori Kam, Konstantin Ananyev
  Cc: Conor Walsh, nd

On 17-Jul-22 10:56 AM, Morten Brørup wrote:
>> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
>> Sent: Sunday, 17 July 2022 05.10
>>
>> <snip>
>>
>>> Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
>>>
>>>> From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
>>>> Sent: Friday, 15 July 2022 15.13
>>>>
>>>> Currently, there is no way to measure lcore busyness in a passive
>> way,
>>>> without any modifications to the application. This patch adds a new
>>>> EAL API that will be able to passively track core busyness.
>>>>
>>>> The busyness is calculated by relying on the fact that most DPDK
>> API's
>>>> will poll for packets.
>>>
>>> This is an "alternative fact"! Only run-to-completion applications
>> polls for RX.
>>> Pipelined applications do not poll for packets in every pipeline
>> stage.
>> I guess you meant, poll for packets from NIC. They still need to
>> receive packets from queues. We could do a similar thing for rte_ring
>> APIs.

Ring API is already instrumented to report telemetry in the same way, so 
any rte_ring-based pipeline will be able to track it. Obviously, 
non-DPDK API's will have to be instrumented too, we really can't do 
anything about that from inside DPDK.

> 
> But it would mix apples, pears and bananas.
> 
> Let's say you have a pipeline with three ingress preprocessing threads, two advanced packet processing threads in the next pipeline stage and one egress thread as the third pipeline stage.
> 
> Now, the metrics reflects busyness for six threads, but three of them are apples, two of them are pears, and one is bananas.
> 
> I just realized another example, where this patch might give misleading results on a run-to-completion application:
> 
> One thread handles a specific type of packets received on an Ethdev ingress queue set up by the rte_flow APIs, and another thread handles ingress packets from another Ethdev ingress queue. E.g. the first queue may contain packets for well known flows, where packets can be processed quickly, and the other queue for other packets requiring more scrutiny. Both threads are run-to-completion and handle Ethdev ingress packets.
> 
> *So: Only applications where the threads perform the exact same task can use this patch.*

I do not see how that follows. I think you're falling for a "it's not 
100% useful, therefore it's 0% useful" fallacy here. Some use cases 
would obviously make telemetry more informative than others, that's 
true, however I do not see how it's a mandatory requirement for lcore 
busyness to report the same thing. We can document the limitations and 
assumptions made, can we not?

It is true that this patchset is mostly written from the standpoint of a 
run-to-completion application, but can we improve it? What would be your 
suggestions to make it better suit use cases you are familiar with?

> 
> Also, rings may be used for other purposes than queueing packets between pipeline stages. E.g. our application uses rings for fast bulk allocation and freeing of other resources.
> 

Well, this is the tradeoff for simplicity. Of course we could add all 
sorts of stuff like dynamic enable/disable of this and that and the 
other... but the end goal was something easy and automatic and that 
doesn't require any work to implement, not something that suits 100% of 
the cases 100% of the time. Having such flexibility as you described 
comes at a cost that this patch was not meant to pay!

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-18  9:43       ` Burakov, Anatoly
@ 2022-07-18 10:59         ` Morten Brørup
  2022-07-19 12:20           ` Thomas Monjalon
  2022-07-18 15:46         ` Stephen Hemminger
  1 sibling, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-07-18 10:59 UTC (permalink / raw)
  To: Burakov, Anatoly, Honnappa Nagarahalli, dev, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	David Hunt, Chengwen Feng, Kevin Laatz, Ray Kinsella, thomas,
	Ferruh Yigit, Andrew Rybchenko, jerinj, Sachin Saxena,
	hemant.agrawal, Ori Kam, Konstantin Ananyev
  Cc: Conor Walsh, nd

> From: Burakov, Anatoly [mailto:anatoly.burakov@intel.com]
> Sent: Monday, 18 July 2022 11.44
> 
> On 17-Jul-22 10:56 AM, Morten Brørup wrote:
> >> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> >> Sent: Sunday, 17 July 2022 05.10
> >>
> >> <snip>
> >>
> >>> Subject: RE: [PATCH v1 1/2] eal: add lcore busyness telemetry
> >>>
> >>>> From: Anatoly Burakov [mailto:anatoly.burakov@intel.com]
> >>>> Sent: Friday, 15 July 2022 15.13
> >>>>
> >>>> Currently, there is no way to measure lcore busyness in a passive
> >> way,
> >>>> without any modifications to the application. This patch adds a
> new
> >>>> EAL API that will be able to passively track core busyness.
> >>>>
> >>>> The busyness is calculated by relying on the fact that most DPDK
> >> API's
> >>>> will poll for packets.
> >>>
> >>> This is an "alternative fact"! Only run-to-completion applications
> >> polls for RX.
> >>> Pipelined applications do not poll for packets in every pipeline
> >> stage.
> >> I guess you meant, poll for packets from NIC. They still need to
> >> receive packets from queues. We could do a similar thing for
> rte_ring
> >> APIs.
> 
> Ring API is already instrumented to report telemetry in the same way,
> so
> any rte_ring-based pipeline will be able to track it. Obviously,
> non-DPDK API's will have to be instrumented too, we really can't do
> anything about that from inside DPDK.
> 
> >
> > But it would mix apples, pears and bananas.
> >
> > Let's say you have a pipeline with three ingress preprocessing
> threads, two advanced packet processing threads in the next pipeline
> stage and one egress thread as the third pipeline stage.
> >
> > Now, the metrics reflects busyness for six threads, but three of them
> are apples, two of them are pears, and one is bananas.
> >
> > I just realized another example, where this patch might give
> misleading results on a run-to-completion application:
> >
> > One thread handles a specific type of packets received on an Ethdev
> ingress queue set up by the rte_flow APIs, and another thread handles
> ingress packets from another Ethdev ingress queue. E.g. the first queue
> may contain packets for well known flows, where packets can be
> processed quickly, and the other queue for other packets requiring more
> scrutiny. Both threads are run-to-completion and handle Ethdev ingress
> packets.
> >
> > *So: Only applications where the threads perform the exact same task
> can use this patch.*
> 
> I do not see how that follows. I think you're falling for a "it's not
> 100% useful, therefore it's 0% useful" fallacy here. Some use cases
> would obviously make telemetry more informative than others, that's
> true, however I do not see how it's a mandatory requirement for lcore
> busyness to report the same thing. We can document the limitations and
> assumptions made, can we not?

I did use strong wording in my email to get my message through. However, I do consider the scope "applications where the threads perform the exact same task" more than 0 % of all deployed applications, and thus the patch is more than 0 % useful. But I certainly don't consider the scope for this patch 100 %  of all deployed applications, and perhaps not even 80 %.

I didn't reject the patch or oppose to it, but requested it to be updated, so the names reflect the information provided by it. I strongly oppose to using "CPU Busyness" as the telemetry name for something that only reflects ingress activity, and is zero for a thread that only performs egress or other non-ingress tasks. That would be strongly misleading.

If you by "document the limitations and assumptions" also mean rename telemetry names and variables/functions in the patch to reflect what it actually does, then yes, documenting the limitations and assumptions suffices. However, adding a notice in some documentation that "CPU Business" telemetry only is correct/relevant for specific applications doesn't suffice.

> 
> It is true that this patchset is mostly written from the standpoint of
> a
> run-to-completion application, but can we improve it? What would be
> your
> suggestions to make it better suit use cases you are familiar with?

Our application uses our own run-time profiler library to measure time spent in the application's various threads and pipeline stages. And the application needs to call the profiler library functions to feed it the information it needs. We still haven't found a good way to transform the profiler data to a generic summary CPU Utilization percentage, which should reflect how much of the system's CPU capacity is being used (preferably on a linear scale). (Our profiler library is designed specifically for our own purposes, and would require a complete rewrite to meet even basic DPDK library standards, so I won't even try to contribute it.)

I don't think it is possible to measure and report detailed CPU Busyness without involving the application. Only the application has knowledge about what the individual lcores are doing. Even for my example above (with two run-to-completion threads serving rte_flow configured Ethdev ingress queues), this patch would not provide information about which of the two types of traffic is causing the higher busyness. The telemetry might expose which specific thread is busy, but it doesn't tell which of the two tasks is being performed by that thread, and thus which kind of traffic is causing the busyness.

> 
> >
> > Also, rings may be used for other purposes than queueing packets
> between pipeline stages. E.g. our application uses rings for fast bulk
> allocation and freeing of other resources.
> >
> 
> Well, this is the tradeoff for simplicity. Of course we could add all
> sorts of stuff like dynamic enable/disable of this and that and the
> other... but the end goal was something easy and automatic and that
> doesn't require any work to implement, not something that suits 100% of
> the cases 100% of the time. Having such flexibility as you described
> comes at a cost that this patch was not meant to pay!

I do see the benefit of adding instrumentation like this to the DPDK libraries, so information becomes available at zero application development effort. The alternative would be a profiler/busyness library requiring application modifications.

I only request that:
1. The patch clearly reflects what is does, and
2. The instrumentation can be omitted at build time, so it has zero performance impact on applications where it is useless.

> 
> --
> Thanks,
> Anatoly

PS: The busyness counters in the DPDK Service Cores library are also being updated [1].

[1] http://inbox.dpdk.org/dev/20220711131825.3373195-2-harry.van.haaren@intel.com/T/#u

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-18  9:43       ` Burakov, Anatoly
  2022-07-18 10:59         ` Morten Brørup
@ 2022-07-18 15:46         ` Stephen Hemminger
  1 sibling, 0 replies; 87+ messages in thread
From: Stephen Hemminger @ 2022-07-18 15:46 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Morten Brørup, Honnappa Nagarahalli, dev, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	David Hunt, Chengwen Feng, Kevin Laatz, Ray Kinsella, thomas,
	Ferruh Yigit, Andrew Rybchenko, jerinj, Sachin Saxena,
	hemant.agrawal, Ori Kam, Konstantin Ananyev, Conor Walsh, nd

On Mon, 18 Jul 2022 10:43:52 +0100
"Burakov, Anatoly" <anatoly.burakov@intel.com> wrote:

> >>> This is an "alternative fact"! Only run-to-completion applications  
> >> polls for RX.  
> >>> Pipelined applications do not poll for packets in every pipeline  
> >> stage.
> >> I guess you meant, poll for packets from NIC. They still need to
> >> receive packets from queues. We could do a similar thing for rte_ring
> >> APIs.  
> 
> Ring API is already instrumented to report telemetry in the same way, so 
> any rte_ring-based pipeline will be able to track it. Obviously, 
> non-DPDK API's will have to be instrumented too, we really can't do 
> anything about that from inside DPDK.

The eventdev API is used to build pipeline based app's and it supports
telemetry.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v1 1/2] eal: add lcore busyness telemetry
  2022-07-18 10:59         ` Morten Brørup
@ 2022-07-19 12:20           ` Thomas Monjalon
  0 siblings, 0 replies; 87+ messages in thread
From: Thomas Monjalon @ 2022-07-19 12:20 UTC (permalink / raw)
  To: Burakov, Anatoly, Morten Brørup
  Cc: Honnappa Nagarahalli, dev, Bruce Richardson, Nicolas Chautru,
	Fan Zhang, Ashish Gupta, Akhil Goyal, David Hunt, Chengwen Feng,
	Kevin Laatz, Ray Kinsella, Ferruh Yigit, Andrew Rybchenko,
	jerinj, Sachin Saxena, hemant.agrawal, Ori Kam,
	Konstantin Ananyev, Conor Walsh, nd

18/07/2022 12:59, Morten Brørup:
> I only request that:
> 1. The patch clearly reflects what is does, and

+1

> 2. The instrumentation can be omitted at build time, so it has zero performance impact on applications where it is useless.

+1

(+2 then)




^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v2 0/3] Add lcore poll busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
                   ` (3 preceding siblings ...)
  2022-07-15 22:13 ` Morten Brørup
@ 2022-08-24 16:24 ` Kevin Laatz
  2022-08-24 16:24   ` [PATCH v2 1/3] eal: add " Kevin Laatz
                     ` (3 more replies)
  2022-08-25 15:28 ` [PATCH v3 " Kevin Laatz
                   ` (4 subsequent siblings)
  9 siblings, 4 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-24 16:24 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Currently, there is no way to measure lcore busyness in a passive way,
without any modifications to the application. This patchset adds a new EAL
API that will be able to passively track core busyness. As part of the set,
new telemetry endpoints are added to read the generate metrics.

Anatoly Burakov (2):
  eal: add lcore poll busyness telemetry
  eal: add cpuset lcore telemetry entries

Kevin Laatz (1):
  doc: add howto guide for lcore poll busyness

 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 doc/guides/howto/lcore_busyness.rst         |  79 +++++
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 340 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  80 +++++
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   5 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 20 files changed, 584 insertions(+), 24 deletions(-)
 create mode 100644 doc/guides/howto/lcore_busyness.rst
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v2 1/3] eal: add lcore poll busyness telemetry
  2022-08-24 16:24 ` [PATCH v2 0/3] Add lcore poll " Kevin Laatz
@ 2022-08-24 16:24   ` Kevin Laatz
  2022-08-24 16:24   ` [PATCH v2 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-24 16:24 UTC (permalink / raw)
  To: dev, Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, David Hunt, Chengwen Feng, Kevin Laatz,
	Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev
  Cc: anatoly.burakov, Conor Walsh

From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore busyness in a passive way,
without any modifications to the application. This patch adds a new EAL
API that will be able to passively track core busyness.

The busyness is calculated by relying on the fact that most DPDK API's
will poll for packets. Empty polls can be counted as "idle", while
non-empty polls can be counted as busy. To measure lcore busyness, we
simply call the telemetry timestamping function with the number of polls
a particular code section has processed, and count the number of cycles
we've spent processing empty bursts. The more empty bursts we encounter,
the less cycles we spend in "busy" state, and the less core busyness
will be reported.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to
the lcore telemetry busyness timestamping function. The following parts
of DPDK are instrumented with lcore telemetry calls:

- All major driver API's:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a
global variable is exported by EAL, and a call to timestamping function
is wrapped into a macro, so that whenever telemetry is disabled, it only
takes one additional branch and no function calls are performed. It is
also possible to disable it at compile time by commenting out
RTE_LCORE_BUSYNESS from build config.

This patch also adds a telemetry endpoint to report lcore busyness, as
well as telemetry endpoints to enable/disable lcore telemetry.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  80 ++++++
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   5 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 19 files changed, 458 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@ endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..498702c9c7 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6ed176cce 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@ extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..912cee9a16 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..072874020d 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..35b0d8d36b 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_TELEMETRY_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
+	RTE_LCORE_TELEMETRY_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..dc58791bf4 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while (!(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) == 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..98176a6a7a 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..e6b58788c8
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,293 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+int __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+	int busyness;
+	/**< Calculated busyness (gets set/returned by the API) */
+	int raw_busyness;
+	/**< Calculated busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t active_thresh = rte_get_tsc_hz() * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its busyness */
+	return telemetry_data[lcore_id].busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return __rte_lcore_telemetry_enabled;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable)
+{
+	__rte_lcore_telemetry_enabled = !!enable;
+
+	if (!enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * we don't want to use floating point math here, but we want for our
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows busyness
+	 * as measured just-in-time. therefore, we will calculate the average
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. this allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	/*
+	 * at upper end of the busyness scale, going up from 90->100 will take
+	 * longer than going from 10->20 because of the averaging. to address
+	 * this, we invert the scale when doing calculations: that is, we
+	 * effectively calculate average *idle* cycle percentage, not average
+	 * *busy* cycle percentage. this means that the scale is naturally
+	 * biased towards fast scaling up, and slow scaling down.
+	 */
+	const int prev_raw_idle = max_raw_idle - tdata->raw_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness back to busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_busyness;
+
+		/* get updated busyness value */
+		raw_busyness = calc_raw_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_busyness = raw_busyness;
+		/* bring busyness back to 0..100 range, biased to round up */
+		tdata->busyness = (raw_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(1);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(0);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	if (telemetry_data != NULL)
+		free(telemetry_data);
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	__rte_lcore_telemetry_enabled = true;
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable __rte_unused)
+{
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@ sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..75c1f874cb 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -415,6 +415,86 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - 1 if lcore telemetry is enabled
+ *   - 0 if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(int enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore telemetry timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_telemetry_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern int __rte_lcore_telemetry_enabled;
+
+/**
+ * Call lcore telemetry timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
+	do {                                                    \
+		if (__rte_lcore_telemetry_enabled)              \
+			__rte_lcore_telemetry_timestamp(nb_rx); \
+	} while (0)
+#else
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
+	while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..7199aa03c2 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@ subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index c2a2cebf69..8a4af7e937 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@ EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_telemetry_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..1caecd5a11 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a1d42d9214 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..27163e87cb 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -226,12 +226,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..781055b4eb 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..6db09d4291 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_TELEMETRY_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..725b851f69 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description:
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v2 2/3] eal: add cpuset lcore telemetry entries
  2022-08-24 16:24 ` [PATCH v2 0/3] Add lcore poll " Kevin Laatz
  2022-08-24 16:24   ` [PATCH v2 1/3] eal: add " Kevin Laatz
@ 2022-08-24 16:24   ` Kevin Laatz
  2022-08-24 16:24   ` [PATCH v2 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
  2022-08-25  7:47   ` [PATCH v2 0/3] Add lcore poll busyness telemetry Morten Brørup
  3 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-24 16:24 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

From: Anatoly Burakov <anatoly.burakov@intel.com>

Expose per-lcore cpuset information to telemetry.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/common/eal_common_lcore_telemetry.c | 47 +++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
index e6b58788c8..646b0fbb55 100644
--- a/lib/eal/common/eal_common_lcore_telemetry.c
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -19,6 +19,8 @@ int __rte_lcore_telemetry_enabled;
 
 #ifdef RTE_LCORE_POLL_BUSYNESS
 
+#include "eal_private.h"
+
 struct lcore_telemetry {
 	int busyness;
 	/**< Calculated busyness (gets set/returned by the API) */
@@ -254,6 +256,48 @@ lcore_handle_poll_busyness(const char *cmd __rte_unused,
 	return 0;
 }
 
+static int
+lcore_handle_cpuset(const char *cmd __rte_unused,
+		    const char *params __rte_unused,
+		    struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		const struct lcore_config *cfg = &lcore_config[i];
+		const rte_cpuset_t *cpuset = &cfg->cpuset;
+		struct rte_tel_data *ld;
+		unsigned int cpu;
+
+		if (!rte_lcore_is_enabled(i))
+			continue;
+
+		/* create an array of integers */
+		ld = rte_tel_data_alloc();
+		if (ld == NULL)
+			return -ENOMEM;
+		rte_tel_data_start_array(ld, RTE_TEL_INT_VAL);
+
+		/* add cpu ID's from cpuset to the array */
+		for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+			if (!CPU_ISSET(cpu, cpuset))
+				continue;
+			rte_tel_data_add_array_int(ld, cpu);
+		}
+
+		/* add array to the per-lcore container */
+		snprintf(corenum, sizeof(corenum), "%d", i);
+
+		/* tell telemetry library to free this array automatically */
+		rte_tel_data_add_dict_container(d, corenum, ld, 0);
+	}
+
+	return 0;
+}
+
 RTE_INIT(lcore_init_telemetry)
 {
 	__rte_lcore_telemetry_enabled = true;
@@ -268,6 +312,9 @@ RTE_INIT(lcore_init_telemetry)
 
 	rte_telemetry_register_cmd("/eal/lcore/busyness_disable", lcore_poll_busyness_disable,
 				   "disable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/cpuset", lcore_handle_cpuset,
+				   "list physical core affinity for each lcore");
 }
 
 #else
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v2 3/3] doc: add howto guide for lcore poll busyness
  2022-08-24 16:24 ` [PATCH v2 0/3] Add lcore poll " Kevin Laatz
  2022-08-24 16:24   ` [PATCH v2 1/3] eal: add " Kevin Laatz
  2022-08-24 16:24   ` [PATCH v2 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
@ 2022-08-24 16:24   ` Kevin Laatz
  2022-08-25  7:47   ` [PATCH v2 0/3] Add lcore poll busyness telemetry Morten Brørup
  3 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-24 16:24 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add a new section to the howto guides for using the new lcore poll
busyness telemetry endpoints and describe general usage.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
 doc/guides/howto/lcore_busyness.rst | 79 +++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 doc/guides/howto/lcore_busyness.rst

diff --git a/doc/guides/howto/lcore_busyness.rst b/doc/guides/howto/lcore_busyness.rst
new file mode 100644
index 0000000000..c8ccd3f513
--- /dev/null
+++ b/doc/guides/howto/lcore_busyness.rst
@@ -0,0 +1,79 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2022 Intel Corporation.
+
+Lcore Poll Busyness Telemetry
+========================
+
+The lcore poll busyness telemetry provides a built-in, generic method of gathering
+lcore utilization metrics for running applications. These metrics are exposed
+via a new telemetry endpoint.
+
+Since most DPDK APIs poll for packets, the poll busyness is calculated based on
+APIs receiving packets. Empty polls are considered as idle, while non-empty polls
+are considered busy. Using the amount of cycles spent processing empty polls, the
+busyness can be calculated and recorded.
+
+Application Specified Busyness
+------------------------------
+
+Improved accuracy of the reported busyness may need more contextual awareness
+from the application. For example, a pipelined application may make a number of
+calls to rx_burst before processing packets. Any processing done on this 'bulk'
+would need to be marked as "busy" cycles, not just the last received burst. This
+type of awareness is only available within the application.
+
+Applications can be modified to incorporate the extra contextual awareness in
+order to improve the reported busyness by marking areas of code as "busy" or
+"idle" appropriately. This can be done by inserting the timestamping macro::
+
+    RTE_LCORE_TELEMETRY_TIMESTAMP(0)    /* to mark section as idle */
+    RTE_LCORE_TELEMETRY_TIMESTAMP(32)   /* where 32 is nb_pkts to mark section as busy (non-zero is busy) */
+
+All cycles since the last state change will be counted towards the current state's
+counter.
+
+Consuming the Telemetry
+-----------------------
+
+The telemetry gathered for lcore poll busyness can be read from the `telemetry.py`
+script via the new `/eal/lcore/poll_busyness` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness
+    {"/eal/lcore/poll_busyness": {"12": -1, "13": 85, "14": 84}}
+
+* Cores not collecting busyness will report "-1". E.g. control cores or inactive cores.
+* All enabled cores will report their busyness in the range 0-100.
+
+Disabling Lcore Poll Busyness Telemetry
+----------------------------------
+
+Some applications may not want lcore poll busyness telemetry to be tracked, for
+example performance critical applications or applications that are already being
+monitored by other tools gathering similar or more application specific information.
+
+For those applications, there are two ways in which this telemetry can be disabled.
+
+At compile time
+^^^^^^^^^^^^^^^
+
+Support can be disabled at compile time via the meson option. It is enabled by
+default.::
+
+    $ meson configure -Denable_lcore_poll_busyness=false
+
+At run time
+^^^^^^^^^^^
+
+Support can also be disabled during runtime. This comes at the cost of an
+additional branch, however no additional function calls are performed.
+
+To disable support at runtime, a call can be made to the
+`/eal/lcore/poll_busyness_disable` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_disable
+    {"/eal/lcore/poll_busyness_disable": {"busyness_enabled": 0}}
+
+It can be re-enabled at run time with the `/eal/lcore/poll_busyness_enable`
+endpoint.
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v2 0/3] Add lcore poll busyness telemetry
  2022-08-24 16:24 ` [PATCH v2 0/3] Add lcore poll " Kevin Laatz
                     ` (2 preceding siblings ...)
  2022-08-24 16:24   ` [PATCH v2 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
@ 2022-08-25  7:47   ` Morten Brørup
  2022-08-25 10:53     ` Kevin Laatz
  3 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-08-25  7:47 UTC (permalink / raw)
  To: Kevin Laatz, dev; +Cc: anatoly.burakov

> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Wednesday, 24 August 2022 18.25
> 
> Currently, there is no way to measure lcore busyness in a passive way,
> without any modifications to the application. This patchset adds a new
> EAL
> API that will be able to passively track core busyness. As part of the
> set,
> new telemetry endpoints are added to read the generate metrics.
> 
> Anatoly Burakov (2):
>   eal: add lcore poll busyness telemetry
>   eal: add cpuset lcore telemetry entries
> 
> Kevin Laatz (1):
>   doc: add howto guide for lcore poll busyness
> 

At first glance, reading the commit messages, I thought you had ignored the feedback.
However, I see that V2 adds "poll" in front of "busyness".

It's just missing in many locations in the code, e.g. in eal_common_lcore_telemetry.c, the telemetry command is still registered as "/eal/lcore/busyness".

I recommend that you search for "business" in the files, and add "poll" where it is missing.

Perhaps one of your testers can do it for you... I know from experience that it can be difficult spotting details like this in your own code, whereas many test people are good at this stuff. There's even a company here in Denmark offering autistics as testers - they are absolutely brilliant at tasks like this!

-Morten

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v2 0/3] Add lcore poll busyness telemetry
  2022-08-25  7:47   ` [PATCH v2 0/3] Add lcore poll busyness telemetry Morten Brørup
@ 2022-08-25 10:53     ` Kevin Laatz
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-25 10:53 UTC (permalink / raw)
  To: Morten Brørup, dev; +Cc: anatoly.burakov

On 25/08/2022 08:47, Morten Brørup wrote:
>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>> Sent: Wednesday, 24 August 2022 18.25
>>
>> Currently, there is no way to measure lcore busyness in a passive way,
>> without any modifications to the application. This patchset adds a new
>> EAL
>> API that will be able to passively track core busyness. As part of the
>> set,
>> new telemetry endpoints are added to read the generate metrics.
>>
>> Anatoly Burakov (2):
>>    eal: add lcore poll busyness telemetry
>>    eal: add cpuset lcore telemetry entries
>>
>> Kevin Laatz (1):
>>    doc: add howto guide for lcore poll busyness
>>
> At first glance, reading the commit messages, I thought you had ignored the feedback.
> However, I see that V2 adds "poll" in front of "busyness".

I'll make changes to the message to clarify this a bit better.

> It's just missing in many locations in the code, e.g. in eal_common_lcore_telemetry.c, the telemetry command is still registered as "/eal/lcore/busyness".
>
Seems some builds are failing in the CI so I'll send a v3 with fixes 
shortly - I'll do another sweep before sending to make sure there are no 
omissions.

-Kevin


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 0/3] Add lcore poll busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
                   ` (4 preceding siblings ...)
  2022-08-24 16:24 ` [PATCH v2 0/3] Add lcore poll " Kevin Laatz
@ 2022-08-25 15:28 ` Kevin Laatz
  2022-08-25 15:28   ` [PATCH v3 1/3] eal: add " Kevin Laatz
                     ` (2 more replies)
  2022-09-01 14:39 ` [PATCH v4 0/3] Add lcore poll busyness telemetry Kevin Laatz
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-25 15:28 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Currently, there is no way to measure lcore polling busyness in a passive
way, without any modifications to the application. This patchset adds a new
EAL API that will be able to passively track core polling busyness. As part
of the set, new telemetry endpoints are added to read the generate metrics.

---
v3:
  * Fix missing renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup

Anatoly Burakov (2):
  eal: add lcore poll busyness telemetry
  eal: add cpuset lcore telemetry entries

Kevin Laatz (1):
  doc: add howto guide for lcore poll busyness

 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 doc/guides/howto/lcore_busyness.rst         |  79 +++++
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 340 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  80 +++++
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 20 files changed, 585 insertions(+), 24 deletions(-)
 create mode 100644 doc/guides/howto/lcore_busyness.rst
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-25 15:28 ` [PATCH v3 " Kevin Laatz
@ 2022-08-25 15:28   ` Kevin Laatz
  2022-08-26  7:05     ` Jerin Jacob
  2022-08-26 22:06     ` Mattias Rönnblom
  2022-08-25 15:28   ` [PATCH v3 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
  2022-08-25 15:28   ` [PATCH v3 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
  2 siblings, 2 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-25 15:28 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, Kevin Laatz, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore poll busyness in a passive way,
without any modifications to the application. This patch adds a new EAL API
that will be able to passively track core polling busyness.

The poll busyness is calculated by relying on the fact that most DPDK API's
will poll for packets. Empty polls can be counted as "idle", while
non-empty polls can be counted as busy. To measure lcore poll busyness, we
simply call the telemetry timestamping function with the number of polls a
particular code section has processed, and count the number of cycles we've
spent processing empty bursts. The more empty bursts we encounter, the less
cycles we spend in "busy" state, and the less core poll busyness will be
reported.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore telemetry calls:

- All major driver API's:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a global
variable is exported by EAL, and a call to timestamping function is wrapped
into a macro, so that whenever telemetry is disabled, it only takes one
additional branch and no function calls are performed. It is also possible
to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
build config.

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v3:
  * Fix missed renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  80 ++++++
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 19 files changed, 459 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@ endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..498702c9c7 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6ed176cce 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@ extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..912cee9a16 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..072874020d 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..35b0d8d36b 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_TELEMETRY_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
 	}
+	RTE_LCORE_TELEMETRY_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..63cc9aab69 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while (!((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) == 0)) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..98176a6a7a 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..bba0afc26d
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,293 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+int __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+	int poll_busyness;
+	/**< Calculated poll busyness (gets set/returned by the API) */
+	int raw_poll_busyness;
+	/**< Calculated poll busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_poll_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t active_thresh = rte_get_tsc_hz() * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its poll busyness */
+	return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return __rte_lcore_telemetry_enabled;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable)
+{
+	__rte_lcore_telemetry_enabled = !!enable;
+
+	if (!enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * we don't want to use floating point math here, but we want for our poll
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows poll busyness
+	 * as measured just-in-time. therefore, we will calculate the average poll
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. this allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	/*
+	 * at upper end of the poll busyness scale, going up from 90->100 will take
+	 * longer than going from 10->20 because of the averaging. to address
+	 * this, we invert the scale when doing calculations: that is, we
+	 * effectively calculate average *idle* cycle percentage, not average
+	 * *busy* cycle percentage. this means that the scale is naturally
+	 * biased towards fast scaling up, and slow scaling down.
+	 */
+	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness back to poll busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return;
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_poll_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_poll_busyness;
+
+		/* get updated poll_busyness value */
+		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_poll_busyness = raw_poll_busyness;
+		/* bring poll busyness back to 0..100 range, biased to round up */
+		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(1);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(0);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	if (telemetry_data != NULL)
+		free(telemetry_data);
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	__rte_lcore_telemetry_enabled = true;
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(int enable __rte_unused)
+{
+}
+
+void __rte_lcore_telemetry_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@ sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..75c1f874cb 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -415,6 +415,86 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - 1 if lcore telemetry is enabled
+ *   - 0 if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(int enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore telemetry timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_telemetry_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern int __rte_lcore_telemetry_enabled;
+
+/**
+ * Call lcore telemetry timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
+	do {                                                    \
+		if (__rte_lcore_telemetry_enabled)              \
+			__rte_lcore_telemetry_timestamp(nb_rx); \
+	} while (0)
+#else
+#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
+	while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..2fb90d446b 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@ subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core poll busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1f293e768b..f84d2dc319 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@ EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_telemetry_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..1caecd5a11 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a1d42d9214 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..f6c0ed196f 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -16,6 +16,7 @@
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_telemetry.h>
+#include <rte_lcore.h>
 
 #include "rte_rawdev.h"
 #include "rte_rawdev_pmd.h"
@@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..781055b4eb 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_TELEMETRY_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..6db09d4291 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_TELEMETRY_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..725b851f69 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description:
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 2/3] eal: add cpuset lcore telemetry entries
  2022-08-25 15:28 ` [PATCH v3 " Kevin Laatz
  2022-08-25 15:28   ` [PATCH v3 1/3] eal: add " Kevin Laatz
@ 2022-08-25 15:28   ` Kevin Laatz
  2022-08-25 15:28   ` [PATCH v3 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-25 15:28 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

From: Anatoly Burakov <anatoly.burakov@intel.com>

Expose per-lcore cpuset information to telemetry.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/common/eal_common_lcore_telemetry.c | 47 +++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
index bba0afc26d..09c3beb0f7 100644
--- a/lib/eal/common/eal_common_lcore_telemetry.c
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -19,6 +19,8 @@ int __rte_lcore_telemetry_enabled;
 
 #ifdef RTE_LCORE_POLL_BUSYNESS
 
+#include "eal_private.h"
+
 struct lcore_telemetry {
 	int poll_busyness;
 	/**< Calculated poll busyness (gets set/returned by the API) */
@@ -254,6 +256,48 @@ lcore_handle_poll_busyness(const char *cmd __rte_unused,
 	return 0;
 }
 
+static int
+lcore_handle_cpuset(const char *cmd __rte_unused,
+		    const char *params __rte_unused,
+		    struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		const struct lcore_config *cfg = &lcore_config[i];
+		const rte_cpuset_t *cpuset = &cfg->cpuset;
+		struct rte_tel_data *ld;
+		unsigned int cpu;
+
+		if (!rte_lcore_is_enabled(i))
+			continue;
+
+		/* create an array of integers */
+		ld = rte_tel_data_alloc();
+		if (ld == NULL)
+			return -ENOMEM;
+		rte_tel_data_start_array(ld, RTE_TEL_INT_VAL);
+
+		/* add cpu ID's from cpuset to the array */
+		for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+			if (!CPU_ISSET(cpu, cpuset))
+				continue;
+			rte_tel_data_add_array_int(ld, cpu);
+		}
+
+		/* add array to the per-lcore container */
+		snprintf(corenum, sizeof(corenum), "%d", i);
+
+		/* tell telemetry library to free this array automatically */
+		rte_tel_data_add_dict_container(d, corenum, ld, 0);
+	}
+
+	return 0;
+}
+
 RTE_INIT(lcore_init_telemetry)
 {
 	__rte_lcore_telemetry_enabled = true;
@@ -268,6 +312,9 @@ RTE_INIT(lcore_init_telemetry)
 
 	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
 				   "disable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/cpuset", lcore_handle_cpuset,
+				   "list physical core affinity for each lcore");
 }
 
 #else
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v3 3/3] doc: add howto guide for lcore poll busyness
  2022-08-25 15:28 ` [PATCH v3 " Kevin Laatz
  2022-08-25 15:28   ` [PATCH v3 1/3] eal: add " Kevin Laatz
  2022-08-25 15:28   ` [PATCH v3 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
@ 2022-08-25 15:28   ` Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-25 15:28 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add a new section to the howto guides for using the new lcore poll
busyness telemetry endpoints and describe general usage.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v3: update naming to poll busyness
---
 doc/guides/howto/lcore_busyness.rst | 79 +++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 doc/guides/howto/lcore_busyness.rst

diff --git a/doc/guides/howto/lcore_busyness.rst b/doc/guides/howto/lcore_busyness.rst
new file mode 100644
index 0000000000..ab8dd631c7
--- /dev/null
+++ b/doc/guides/howto/lcore_busyness.rst
@@ -0,0 +1,79 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2022 Intel Corporation.
+
+Lcore Poll Busyness Telemetry
+========================
+
+The lcore poll busyness telemetry provides a built-in, generic method of gathering
+lcore utilization metrics for running applications. These metrics are exposed
+via a new telemetry endpoint.
+
+Since most DPDK APIs poll for packets, the poll busyness is calculated based on
+APIs receiving packets. Empty polls are considered as idle, while non-empty polls
+are considered busy. Using the amount of cycles spent processing empty polls, the
+busyness can be calculated and recorded.
+
+Application Specified Busyness
+------------------------------
+
+Improved accuracy of the reported busyness may need more contextual awareness
+from the application. For example, a pipelined application may make a number of
+calls to rx_burst before processing packets. Any processing done on this 'bulk'
+would need to be marked as "busy" cycles, not just the last received burst. This
+type of awareness is only available within the application.
+
+Applications can be modified to incorporate the extra contextual awareness in
+order to improve the reported busyness by marking areas of code as "busy" or
+"idle" appropriately. This can be done by inserting the timestamping macro::
+
+    RTE_LCORE_TELEMETRY_TIMESTAMP(0)    /* to mark section as idle */
+    RTE_LCORE_TELEMETRY_TIMESTAMP(32)   /* where 32 is nb_pkts to mark section as busy (non-zero is busy) */
+
+All cycles since the last state change will be counted towards the current state's
+counter.
+
+Consuming the Telemetry
+-----------------------
+
+The telemetry gathered for lcore poll busyness can be read from the `telemetry.py`
+script via the new `/eal/lcore/poll_busyness` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness
+    {"/eal/lcore/poll_busyness": {"12": -1, "13": 85, "14": 84}}
+
+* Cores not collecting poll busyness will report "-1". E.g. control cores or inactive cores.
+* All enabled cores will report their poll busyness in the range 0-100.
+
+Disabling Lcore Poll Busyness Telemetry
+----------------------------------
+
+Some applications may not want lcore poll busyness telemetry to be tracked, for
+example performance critical applications or applications that are already being
+monitored by other tools gathering similar or more application specific information.
+
+For those applications, there are two ways in which this telemetry can be disabled.
+
+At compile time
+^^^^^^^^^^^^^^^
+
+Support can be disabled at compile time via the meson option. It is enabled by
+default.::
+
+    $ meson configure -Denable_lcore_poll_busyness=false
+
+At run time
+^^^^^^^^^^^
+
+Support can also be disabled during runtime. This comes at the cost of an
+additional branch, however no additional function calls are performed.
+
+To disable support at runtime, a call can be made to the
+`/eal/lcore/poll_busyness_disable` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_disable
+    {"/eal/lcore/poll_busyness_disable": {"poll_busyness_enabled": 0}}
+
+It can be re-enabled at run time with the `/eal/lcore/poll_busyness_enable`
+endpoint.
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-25 15:28   ` [PATCH v3 1/3] eal: add " Kevin Laatz
@ 2022-08-26  7:05     ` Jerin Jacob
  2022-08-26  8:07       ` Bruce Richardson
  2022-08-26 22:06     ` Mattias Rönnblom
  1 sibling, 1 reply; 87+ messages in thread
From: Jerin Jacob @ 2022-08-26  7:05 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: dpdk-dev, Anatoly Burakov, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev, Morten Brørup

On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
>
> From: Anatoly Burakov <anatoly.burakov@intel.com>
>
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.
>
> The poll busyness is calculated by relying on the fact that most DPDK API's
> will poll for packets. Empty polls can be counted as "idle", while
> non-empty polls can be counted as busy. To measure lcore poll busyness, we
> simply call the telemetry timestamping function with the number of polls a
> particular code section has processed, and count the number of cycles we've
> spent processing empty bursts. The more empty bursts we encounter, the less
> cycles we spend in "busy" state, and the less core poll busyness will be
> reported.
>
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore telemetry calls:
>
> - All major driver API's:
>   - ethdev
>   - cryptodev
>   - compressdev
>   - regexdev
>   - bbdev
>   - rawdev
>   - eventdev
>   - dmadev
> - Some additional libraries:
>   - ring
>   - distributor
>
> To avoid performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and a call to timestamping function is wrapped
> into a macro, so that whenever telemetry is disabled, it only takes one
> additional branch and no function calls are performed. It is also possible
> to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
> build config.
>
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>
> ---
> v3:
>   * Fix missed renaming to poll busyness
>   * Fix clang compilation
>   * Fix arm compilation
>
> v2:
>   * Use rte_get_tsc_hz() to adjust the telemetry period
>   * Rename to reflect polling busyness vs general busyness
>   * Fix segfault when calling telemetry timestamp from an unregistered
>     non-EAL thread.
>   * Minor cleanup
> ---

> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..725b851f69 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>         'Install headers to build drivers.')
>  option('enable_kmods', type: 'boolean', value: false, description:
>         'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> +       'enable collection of lcore poll busyness telemetry')

IMO, All fastpath features should be opt-in. i.e default should be false.
For the trace fastpath related changes, We have done the similar thing
even though it cost additional one cycle for disabled trace points



>  option('examples', type: 'string', value: '', description:
>         'Comma-separated list of examples to build by default')
>  option('flexran_sdk', type: 'string', value: '', description:
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26  7:05     ` Jerin Jacob
@ 2022-08-26  8:07       ` Bruce Richardson
  2022-08-26  8:16         ` Jerin Jacob
  0 siblings, 1 reply; 87+ messages in thread
From: Bruce Richardson @ 2022-08-26  8:07 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Kevin Laatz, dpdk-dev, Anatoly Burakov, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev,
	Morten Brørup

On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
> >
> > From: Anatoly Burakov <anatoly.burakov@intel.com>
> >
> > Currently, there is no way to measure lcore poll busyness in a passive way,
> > without any modifications to the application. This patch adds a new EAL API
> > that will be able to passively track core polling busyness.
> >
> > The poll busyness is calculated by relying on the fact that most DPDK API's
> > will poll for packets. Empty polls can be counted as "idle", while
> > non-empty polls can be counted as busy. To measure lcore poll busyness, we
> > simply call the telemetry timestamping function with the number of polls a
> > particular code section has processed, and count the number of cycles we've
> > spent processing empty bursts. The more empty bursts we encounter, the less
> > cycles we spend in "busy" state, and the less core poll busyness will be
> > reported.
> >
> > In order for all of the above to work without modifications to the
> > application, the library code needs to be instrumented with calls to the
> > lcore telemetry busyness timestamping function. The following parts of DPDK
> > are instrumented with lcore telemetry calls:
> >
> > - All major driver API's:
> >   - ethdev
> >   - cryptodev
> >   - compressdev
> >   - regexdev
> >   - bbdev
> >   - rawdev
> >   - eventdev
> >   - dmadev
> > - Some additional libraries:
> >   - ring
> >   - distributor
> >
> > To avoid performance impact from having lcore telemetry support, a global
> > variable is exported by EAL, and a call to timestamping function is wrapped
> > into a macro, so that whenever telemetry is disabled, it only takes one
> > additional branch and no function calls are performed. It is also possible
> > to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
> > build config.
> >
> > This patch also adds a telemetry endpoint to report lcore poll busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry. A
> > documentation entry has been added to the howto guides to explain the usage
> > of the new telemetry endpoints and API.
> >
> > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > Signed-off-by: David Hunt <david.hunt@intel.com>
> > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >
> > ---
> > v3:
> >   * Fix missed renaming to poll busyness
> >   * Fix clang compilation
> >   * Fix arm compilation
> >
> > v2:
> >   * Use rte_get_tsc_hz() to adjust the telemetry period
> >   * Rename to reflect polling busyness vs general busyness
> >   * Fix segfault when calling telemetry timestamp from an unregistered
> >     non-EAL thread.
> >   * Minor cleanup
> > ---
> 
> > diff --git a/meson_options.txt b/meson_options.txt
> > index 7c220ad68d..725b851f69 100644
> > --- a/meson_options.txt
> > +++ b/meson_options.txt
> > @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
> >         'Install headers to build drivers.')
> >  option('enable_kmods', type: 'boolean', value: false, description:
> >         'build kernel modules')
> > +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> > +       'enable collection of lcore poll busyness telemetry')
> 
> IMO, All fastpath features should be opt-in. i.e default should be false.
> For the trace fastpath related changes, We have done the similar thing
> even though it cost additional one cycle for disabled trace points
> 

We do need to consider runtime and build defaults differently, though.
Since this has also runtime enabling, I think having build-time enabling
true as default is ok, so long as the runtime enabling is false (assuming
no noticable overhead when the feature is disabled.)

/Bruce

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26  8:07       ` Bruce Richardson
@ 2022-08-26  8:16         ` Jerin Jacob
  2022-08-26  8:29           ` Morten Brørup
  0 siblings, 1 reply; 87+ messages in thread
From: Jerin Jacob @ 2022-08-26  8:16 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Kevin Laatz, dpdk-dev, Anatoly Burakov, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev,
	Morten Brørup

On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
<bruce.richardson@intel.com> wrote:
>
> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
> > >
> > > From: Anatoly Burakov <anatoly.burakov@intel.com>
> > >
> > > Currently, there is no way to measure lcore poll busyness in a passive way,
> > > without any modifications to the application. This patch adds a new EAL API
> > > that will be able to passively track core polling busyness.
> > >
> > > The poll busyness is calculated by relying on the fact that most DPDK API's
> > > will poll for packets. Empty polls can be counted as "idle", while
> > > non-empty polls can be counted as busy. To measure lcore poll busyness, we
> > > simply call the telemetry timestamping function with the number of polls a
> > > particular code section has processed, and count the number of cycles we've
> > > spent processing empty bursts. The more empty bursts we encounter, the less
> > > cycles we spend in "busy" state, and the less core poll busyness will be
> > > reported.
> > >
> > > In order for all of the above to work without modifications to the
> > > application, the library code needs to be instrumented with calls to the
> > > lcore telemetry busyness timestamping function. The following parts of DPDK
> > > are instrumented with lcore telemetry calls:
> > >
> > > - All major driver API's:
> > >   - ethdev
> > >   - cryptodev
> > >   - compressdev
> > >   - regexdev
> > >   - bbdev
> > >   - rawdev
> > >   - eventdev
> > >   - dmadev
> > > - Some additional libraries:
> > >   - ring
> > >   - distributor
> > >
> > > To avoid performance impact from having lcore telemetry support, a global
> > > variable is exported by EAL, and a call to timestamping function is wrapped
> > > into a macro, so that whenever telemetry is disabled, it only takes one
> > > additional branch and no function calls are performed. It is also possible
> > > to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
> > > build config.
> > >
> > > This patch also adds a telemetry endpoint to report lcore poll busyness, as
> > > well as telemetry endpoints to enable/disable lcore telemetry. A
> > > documentation entry has been added to the howto guides to explain the usage
> > > of the new telemetry endpoints and API.
> > >
> > > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > > Signed-off-by: David Hunt <david.hunt@intel.com>
> > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > >
> > > ---
> > > v3:
> > >   * Fix missed renaming to poll busyness
> > >   * Fix clang compilation
> > >   * Fix arm compilation
> > >
> > > v2:
> > >   * Use rte_get_tsc_hz() to adjust the telemetry period
> > >   * Rename to reflect polling busyness vs general busyness
> > >   * Fix segfault when calling telemetry timestamp from an unregistered
> > >     non-EAL thread.
> > >   * Minor cleanup
> > > ---
> >
> > > diff --git a/meson_options.txt b/meson_options.txt
> > > index 7c220ad68d..725b851f69 100644
> > > --- a/meson_options.txt
> > > +++ b/meson_options.txt
> > > @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
> > >         'Install headers to build drivers.')
> > >  option('enable_kmods', type: 'boolean', value: false, description:
> > >         'build kernel modules')
> > > +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> > > +       'enable collection of lcore poll busyness telemetry')
> >
> > IMO, All fastpath features should be opt-in. i.e default should be false.
> > For the trace fastpath related changes, We have done the similar thing
> > even though it cost additional one cycle for disabled trace points
> >
>
> We do need to consider runtime and build defaults differently, though.
> Since this has also runtime enabling, I think having build-time enabling
> true as default is ok, so long as the runtime enabling is false (assuming
> no noticable overhead when the feature is disabled.)

I was talking about buildtime only. "enable_trace_fp" meson option selected as
false as default.

If the concern is enabling on generic distros then distro generic
config can opt in this

>
> /Bruce

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26  8:16         ` Jerin Jacob
@ 2022-08-26  8:29           ` Morten Brørup
  2022-08-26 15:27             ` Kevin Laatz
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-08-26  8:29 UTC (permalink / raw)
  To: Jerin Jacob, Bruce Richardson, Kevin Laatz, Anatoly Burakov
  Cc: dpdk-dev, Conor Walsh, David Hunt, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, Chengwen Feng, Ray Kinsella,
	Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> Sent: Friday, 26 August 2022 10.16
> 
> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> <bruce.richardson@intel.com> wrote:
> >
> > On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > > On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com>
> wrote:
> > > >
> > > > From: Anatoly Burakov <anatoly.burakov@intel.com>
> > > >
> > > > Currently, there is no way to measure lcore poll busyness in a
> passive way,
> > > > without any modifications to the application. This patch adds a
> new EAL API
> > > > that will be able to passively track core polling busyness.
> > > >
> > > > The poll busyness is calculated by relying on the fact that most
> DPDK API's
> > > > will poll for packets. Empty polls can be counted as "idle",
> while
> > > > non-empty polls can be counted as busy. To measure lcore poll
> busyness, we
> > > > simply call the telemetry timestamping function with the number
> of polls a
> > > > particular code section has processed, and count the number of
> cycles we've
> > > > spent processing empty bursts. The more empty bursts we
> encounter, the less
> > > > cycles we spend in "busy" state, and the less core poll busyness
> will be
> > > > reported.
> > > >
> > > > In order for all of the above to work without modifications to
> the
> > > > application, the library code needs to be instrumented with calls
> to the
> > > > lcore telemetry busyness timestamping function. The following
> parts of DPDK
> > > > are instrumented with lcore telemetry calls:
> > > >
> > > > - All major driver API's:
> > > >   - ethdev
> > > >   - cryptodev
> > > >   - compressdev
> > > >   - regexdev
> > > >   - bbdev
> > > >   - rawdev
> > > >   - eventdev
> > > >   - dmadev
> > > > - Some additional libraries:
> > > >   - ring
> > > >   - distributor
> > > >
> > > > To avoid performance impact from having lcore telemetry support,
> a global
> > > > variable is exported by EAL, and a call to timestamping function
> is wrapped
> > > > into a macro, so that whenever telemetry is disabled, it only
> takes one
> > > > additional branch and no function calls are performed. It is also
> possible
> > > > to disable it at compile time by commenting out
> RTE_LCORE_BUSYNESS from
> > > > build config.
> > > >
> > > > This patch also adds a telemetry endpoint to report lcore poll
> busyness, as
> > > > well as telemetry endpoints to enable/disable lcore telemetry. A
> > > > documentation entry has been added to the howto guides to explain
> the usage
> > > > of the new telemetry endpoints and API.
> > > >
> > > > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > > > Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > > > Signed-off-by: David Hunt <david.hunt@intel.com>
> > > > Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > > >
> > > > ---
> > > > v3:
> > > >   * Fix missed renaming to poll busyness
> > > >   * Fix clang compilation
> > > >   * Fix arm compilation
> > > >
> > > > v2:
> > > >   * Use rte_get_tsc_hz() to adjust the telemetry period
> > > >   * Rename to reflect polling busyness vs general busyness
> > > >   * Fix segfault when calling telemetry timestamp from an
> unregistered
> > > >     non-EAL thread.
> > > >   * Minor cleanup
> > > > ---
> > >
> > > > diff --git a/meson_options.txt b/meson_options.txt
> > > > index 7c220ad68d..725b851f69 100644
> > > > --- a/meson_options.txt
> > > > +++ b/meson_options.txt
> > > > @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
> value: false, description:
> > > >         'Install headers to build drivers.')
> > > >  option('enable_kmods', type: 'boolean', value: false,
> description:
> > > >         'build kernel modules')
> > > > +option('enable_lcore_poll_busyness', type: 'boolean', value:
> true, description:
> > > > +       'enable collection of lcore poll busyness telemetry')
> > >
> > > IMO, All fastpath features should be opt-in. i.e default should be
> false.
> > > For the trace fastpath related changes, We have done the similar
> thing
> > > even though it cost additional one cycle for disabled trace points
> > >
> >
> > We do need to consider runtime and build defaults differently,
> though.
> > Since this has also runtime enabling, I think having build-time
> enabling
> > true as default is ok, so long as the runtime enabling is false
> (assuming
> > no noticable overhead when the feature is disabled.)
> 
> I was talking about buildtime only. "enable_trace_fp" meson option
> selected as
> false as default.

Agree. "enable_lcore_poll_busyness" is in the fast path, so it should follow the design pattern of "enable_trace_fp".

> 
> If the concern is enabling on generic distros then distro generic
> config can opt in this
> 
> >
> > /Bruce

@Kevin, are you considering a roadmap for using RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it should also be renamed to indicate that it is part of the "poll busyness" telemetry.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26  8:29           ` Morten Brørup
@ 2022-08-26 15:27             ` Kevin Laatz
  2022-08-26 15:46               ` Morten Brørup
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-08-26 15:27 UTC (permalink / raw)
  To: Morten Brørup, Jerin Jacob, Bruce Richardson, Anatoly Burakov
  Cc: dpdk-dev, Conor Walsh, David Hunt, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, Chengwen Feng, Ray Kinsella,
	Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

On 26/08/2022 09:29, Morten Brørup wrote:
>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>> Sent: Friday, 26 August 2022 10.16
>>
>> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
>> <bruce.richardson@intel.com> wrote:
>>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
>>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz <kevin.laatz@intel.com>
>> wrote:
>>>>> From: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>
>>>>> Currently, there is no way to measure lcore poll busyness in a
>> passive way,
>>>>> without any modifications to the application. This patch adds a
>> new EAL API
>>>>> that will be able to passively track core polling busyness.
>>>>>
>>>>> The poll busyness is calculated by relying on the fact that most
>> DPDK API's
>>>>> will poll for packets. Empty polls can be counted as "idle",
>> while
>>>>> non-empty polls can be counted as busy. To measure lcore poll
>> busyness, we
>>>>> simply call the telemetry timestamping function with the number
>> of polls a
>>>>> particular code section has processed, and count the number of
>> cycles we've
>>>>> spent processing empty bursts. The more empty bursts we
>> encounter, the less
>>>>> cycles we spend in "busy" state, and the less core poll busyness
>> will be
>>>>> reported.
>>>>>
>>>>> In order for all of the above to work without modifications to
>> the
>>>>> application, the library code needs to be instrumented with calls
>> to the
>>>>> lcore telemetry busyness timestamping function. The following
>> parts of DPDK
>>>>> are instrumented with lcore telemetry calls:
>>>>>
>>>>> - All major driver API's:
>>>>>    - ethdev
>>>>>    - cryptodev
>>>>>    - compressdev
>>>>>    - regexdev
>>>>>    - bbdev
>>>>>    - rawdev
>>>>>    - eventdev
>>>>>    - dmadev
>>>>> - Some additional libraries:
>>>>>    - ring
>>>>>    - distributor
>>>>>
>>>>> To avoid performance impact from having lcore telemetry support,
>> a global
>>>>> variable is exported by EAL, and a call to timestamping function
>> is wrapped
>>>>> into a macro, so that whenever telemetry is disabled, it only
>> takes one
>>>>> additional branch and no function calls are performed. It is also
>> possible
>>>>> to disable it at compile time by commenting out
>> RTE_LCORE_BUSYNESS from
>>>>> build config.
>>>>>
>>>>> This patch also adds a telemetry endpoint to report lcore poll
>> busyness, as
>>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
>>>>> documentation entry has been added to the howto guides to explain
>> the usage
>>>>> of the new telemetry endpoints and API.
>>>>>
>>>>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>>>>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
>>>>> Signed-off-by: David Hunt <david.hunt@intel.com>
>>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>
>>>>> ---
>>>>> v3:
>>>>>    * Fix missed renaming to poll busyness
>>>>>    * Fix clang compilation
>>>>>    * Fix arm compilation
>>>>>
>>>>> v2:
>>>>>    * Use rte_get_tsc_hz() to adjust the telemetry period
>>>>>    * Rename to reflect polling busyness vs general busyness
>>>>>    * Fix segfault when calling telemetry timestamp from an
>> unregistered
>>>>>      non-EAL thread.
>>>>>    * Minor cleanup
>>>>> ---
>>>>> diff --git a/meson_options.txt b/meson_options.txt
>>>>> index 7c220ad68d..725b851f69 100644
>>>>> --- a/meson_options.txt
>>>>> +++ b/meson_options.txt
>>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
>> value: false, description:
>>>>>          'Install headers to build drivers.')
>>>>>   option('enable_kmods', type: 'boolean', value: false,
>> description:
>>>>>          'build kernel modules')
>>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
>> true, description:
>>>>> +       'enable collection of lcore poll busyness telemetry')
>>>> IMO, All fastpath features should be opt-in. i.e default should be
>> false.
>>>> For the trace fastpath related changes, We have done the similar
>> thing
>>>> even though it cost additional one cycle for disabled trace points
>>>>
>>> We do need to consider runtime and build defaults differently,
>> though.
>>> Since this has also runtime enabling, I think having build-time
>> enabling
>>> true as default is ok, so long as the runtime enabling is false
>> (assuming
>>> no noticable overhead when the feature is disabled.)
>> I was talking about buildtime only. "enable_trace_fp" meson option
>> selected as
>> false as default.
> Agree. "enable_lcore_poll_busyness" is in the fast path, so it should follow the design pattern of "enable_trace_fp".

+1 to making this opt-in. However, I'd lean more towards having the 
buildtime option enabled and the runtime option disabled by default. 
There is no measurable impact cause by the extra branch (the check for 
enabled/disabled in the macro) by disabling at runtime, and we gain the 
benefit of avoiding a recompile to enable it later.

>
>> If the concern is enabling on generic distros then distro generic
>> config can opt in this
>>
>>> /Bruce
> @Kevin, are you considering a roadmap for using RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it should also be renamed to indicate that it is part of the "poll busyness" telemetry.

No further purposes are planned for this macro, I'll rename it in the 
next revision.

-Kevin



^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26 15:27             ` Kevin Laatz
@ 2022-08-26 15:46               ` Morten Brørup
  2022-08-29 10:41                 ` Bruce Richardson
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-08-26 15:46 UTC (permalink / raw)
  To: Kevin Laatz, Jerin Jacob, Bruce Richardson, Anatoly Burakov
  Cc: dpdk-dev, Conor Walsh, David Hunt, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, Chengwen Feng, Ray Kinsella,
	Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Friday, 26 August 2022 17.27
> 
> On 26/08/2022 09:29, Morten Brørup wrote:
> >> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> >> Sent: Friday, 26 August 2022 10.16
> >>
> >> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> >> <bruce.richardson@intel.com> wrote:
> >>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> >>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
> <kevin.laatz@intel.com>
> >> wrote:
> >>>>> From: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>
> >>>>> Currently, there is no way to measure lcore poll busyness in a
> >> passive way,
> >>>>> without any modifications to the application. This patch adds a
> >> new EAL API
> >>>>> that will be able to passively track core polling busyness.
> >>>>>
> >>>>> The poll busyness is calculated by relying on the fact that most
> >> DPDK API's
> >>>>> will poll for packets. Empty polls can be counted as "idle",
> >> while
> >>>>> non-empty polls can be counted as busy. To measure lcore poll
> >> busyness, we
> >>>>> simply call the telemetry timestamping function with the number
> >> of polls a
> >>>>> particular code section has processed, and count the number of
> >> cycles we've
> >>>>> spent processing empty bursts. The more empty bursts we
> >> encounter, the less
> >>>>> cycles we spend in "busy" state, and the less core poll busyness
> >> will be
> >>>>> reported.
> >>>>>
> >>>>> In order for all of the above to work without modifications to
> >> the
> >>>>> application, the library code needs to be instrumented with calls
> >> to the
> >>>>> lcore telemetry busyness timestamping function. The following
> >> parts of DPDK
> >>>>> are instrumented with lcore telemetry calls:
> >>>>>
> >>>>> - All major driver API's:
> >>>>>    - ethdev
> >>>>>    - cryptodev
> >>>>>    - compressdev
> >>>>>    - regexdev
> >>>>>    - bbdev
> >>>>>    - rawdev
> >>>>>    - eventdev
> >>>>>    - dmadev
> >>>>> - Some additional libraries:
> >>>>>    - ring
> >>>>>    - distributor
> >>>>>
> >>>>> To avoid performance impact from having lcore telemetry support,
> >> a global
> >>>>> variable is exported by EAL, and a call to timestamping function
> >> is wrapped
> >>>>> into a macro, so that whenever telemetry is disabled, it only
> >> takes one
> >>>>> additional branch and no function calls are performed. It is also
> >> possible
> >>>>> to disable it at compile time by commenting out
> >> RTE_LCORE_BUSYNESS from
> >>>>> build config.
> >>>>>
> >>>>> This patch also adds a telemetry endpoint to report lcore poll
> >> busyness, as
> >>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
> >>>>> documentation entry has been added to the howto guides to explain
> >> the usage
> >>>>> of the new telemetry endpoints and API.
> >>>>>
> >>>>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> >>>>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> >>>>> Signed-off-by: David Hunt <david.hunt@intel.com>
> >>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> >>>>>
> >>>>> ---
> >>>>> v3:
> >>>>>    * Fix missed renaming to poll busyness
> >>>>>    * Fix clang compilation
> >>>>>    * Fix arm compilation
> >>>>>
> >>>>> v2:
> >>>>>    * Use rte_get_tsc_hz() to adjust the telemetry period
> >>>>>    * Rename to reflect polling busyness vs general busyness
> >>>>>    * Fix segfault when calling telemetry timestamp from an
> >> unregistered
> >>>>>      non-EAL thread.
> >>>>>    * Minor cleanup
> >>>>> ---
> >>>>> diff --git a/meson_options.txt b/meson_options.txt
> >>>>> index 7c220ad68d..725b851f69 100644
> >>>>> --- a/meson_options.txt
> >>>>> +++ b/meson_options.txt
> >>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
> >> value: false, description:
> >>>>>          'Install headers to build drivers.')
> >>>>>   option('enable_kmods', type: 'boolean', value: false,
> >> description:
> >>>>>          'build kernel modules')
> >>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
> >> true, description:
> >>>>> +       'enable collection of lcore poll busyness telemetry')
> >>>> IMO, All fastpath features should be opt-in. i.e default should be
> >> false.
> >>>> For the trace fastpath related changes, We have done the similar
> >> thing
> >>>> even though it cost additional one cycle for disabled trace points
> >>>>
> >>> We do need to consider runtime and build defaults differently,
> >> though.
> >>> Since this has also runtime enabling, I think having build-time
> >> enabling
> >>> true as default is ok, so long as the runtime enabling is false
> >> (assuming
> >>> no noticable overhead when the feature is disabled.)
> >> I was talking about buildtime only. "enable_trace_fp" meson option
> >> selected as
> >> false as default.
> > Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
> follow the design pattern of "enable_trace_fp".
> 
> +1 to making this opt-in. However, I'd lean more towards having the
> buildtime option enabled and the runtime option disabled by default.
> There is no measurable impact cause by the extra branch (the check for
> enabled/disabled in the macro) by disabling at runtime, and we gain the
> benefit of avoiding a recompile to enable it later.

The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.

Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.

> 
> >
> >> If the concern is enabling on generic distros then distro generic
> >> config can opt in this
> >>
> >>> /Bruce
> > @Kevin, are you considering a roadmap for using
> RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
> should also be renamed to indicate that it is part of the "poll
> busyness" telemetry.
> 
> No further purposes are planned for this macro, I'll rename it in the
> next revision.

OK. Thank you.

Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)

[1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-25 15:28   ` [PATCH v3 1/3] eal: add " Kevin Laatz
  2022-08-26  7:05     ` Jerin Jacob
@ 2022-08-26 22:06     ` Mattias Rönnblom
  2022-08-29  8:23       ` Bruce Richardson
                         ` (2 more replies)
  1 sibling, 3 replies; 87+ messages in thread
From: Mattias Rönnblom @ 2022-08-26 22:06 UTC (permalink / raw)
  To: Kevin Laatz, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On 2022-08-25 17:28, Kevin Laatz wrote:
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.

There's no generic way, but the DSW event device keep tracks of lcore 
utilization (i.e., the fraction of cycles used to perform actual work, 
as opposed to just polling empty queues), and it does some with the same 
basic principles as, from what it seems after a quick look, used in this 
patch.

> 
> The poll busyness is calculated by relying on the fact that most DPDK API's
> will poll for packets. Empty polls can be counted as "idle", while

Lcore worker threads poll for work. Packets, timeouts, completions, 
event device events, etc.

> non-empty polls can be counted as busy. To measure lcore poll busyness, we

I guess what is meant here is that cycles spent after non-empty polls 
can be counted as busy (useful) cycles? Potentially including the cycles 
spent for the actual poll operation. ("Poll busyness" is a very vague 
term, in my opionion.)

Similiarly, cycles spent after an empty poll would not be counted.

> simply call the telemetry timestamping function with the number of polls a
> particular code section has processed, and count the number of cycles we've
> spent processing empty bursts. The more empty bursts we encounter, the less
> cycles we spend in "busy" state, and the less core poll busyness will be
> reported.
> 

Is this the same scheme as DSW? When a non-zero burst in idle state 
means a transition from the idle to busy? And a zero burst poll in busy 
state means a transition from idle to busy?

The issue with this scheme, is that you might potentially end up with a 
state transition for every iteration of the application's main loop, if 
packets (or other items of work) only comes in on one of the lcore's 
potentially many RX queues (or other input queues, such as eventdev 
ports). That means a rdtsc for every loop, which isn't too bad, but 
still might be noticable.

An application that gather items of work from multiple source before 
actually doing anything breaks this model. For example, consider a lcore 
worker owning two RX queues, performing rte_eth_rx_burst() on both, 
before attempt to process any of the received packets. If the last poll 
is empty, the cycles spent will considered idle, even though they were busy.

A lcore worker might also decide to poll the same RX queue multiple 
times (until it hits an empty poll, or reaching some high upper bound), 
before initating processing of the packets.

I didn't read your code in detail, so I might be jumping to conclusions.

> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore telemetry calls:
> 
> - All major driver API's:
>    - ethdev
>    - cryptodev
>    - compressdev
>    - regexdev
>    - bbdev
>    - rawdev
>    - eventdev
>    - dmadev
> - Some additional libraries:
>    - ring
>    - distributor

In the past, I've suggested this kind of functionality should go into 
the service framework instead, with the service function explicitly 
signaling wheter or not the cycles spent on something useful or not.

That seems to me like a more straight-forward and more accurate 
solution, but does require the application to deploy everything as 
services, and also requires a change of the service function signature.

> 
> To avoid performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and a call to timestamping function is wrapped
> into a macro, so that whenever telemetry is disabled, it only takes one

Use an static inline function if you don't need the additional 
expressive power of a macro.

I suggest you also mention the performance implications, when this 
function is enabled.

> additional branch and no function calls are performed. It is also possible
> to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
> build config.
> 
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.
> 

Should there really be a dependency from the EAL to the telemetry 
library? A cycle. Maybe some dependency inversion would be in order? The 
telemetry library could instead register an interest in getting 
busy/idle cycles reports from lcores.

> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> ---
> v3:
>    * Fix missed renaming to poll busyness
>    * Fix clang compilation
>    * Fix arm compilation
> 
> v2:
>    * Use rte_get_tsc_hz() to adjust the telemetry period
>    * Rename to reflect polling busyness vs general busyness
>    * Fix segfault when calling telemetry timestamp from an unregistered
>      non-EAL thread.
>    * Minor cleanup
> ---
>   config/meson.build                          |   1 +
>   config/rte_config.h                         |   1 +
>   lib/bbdev/rte_bbdev.h                       |  17 +-
>   lib/compressdev/rte_compressdev.c           |   2 +
>   lib/cryptodev/rte_cryptodev.h               |   2 +
>   lib/distributor/rte_distributor.c           |  21 +-
>   lib/distributor/rte_distributor_single.c    |  14 +-
>   lib/dmadev/rte_dmadev.h                     |  15 +-
>   lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
>   lib/eal/common/meson.build                  |   1 +
>   lib/eal/include/rte_lcore.h                 |  80 ++++++
>   lib/eal/meson.build                         |   3 +
>   lib/eal/version.map                         |   7 +
>   lib/ethdev/rte_ethdev.h                     |   2 +
>   lib/eventdev/rte_eventdev.h                 |  10 +-
>   lib/rawdev/rte_rawdev.c                     |   6 +-
>   lib/regexdev/rte_regexdev.h                 |   5 +-
>   lib/ring/rte_ring_elem_pvt.h                |   1 +
>   meson_options.txt                           |   2 +
>   19 files changed, 459 insertions(+), 24 deletions(-)
>   create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
> 
> diff --git a/config/meson.build b/config/meson.build
> index 7f7b6c92fd..d5954a059c 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -297,6 +297,7 @@ endforeach
>   dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
>   dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
>   dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
> +dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
>   # values which have defaults which may be overridden
>   dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
>   dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
> diff --git a/config/rte_config.h b/config/rte_config.h
> index 46549cb062..498702c9c7 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,7 @@
>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>   #define RTE_BACKTRACE 1
>   #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
>   
>   /* bsd module defines */
>   #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
> diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
> index b88c88167e..d6ed176cce 100644
> --- a/lib/bbdev/rte_bbdev.h
> +++ b/lib/bbdev/rte_bbdev.h
> @@ -28,6 +28,7 @@ extern "C" {
>   #include <stdbool.h>
>   
>   #include <rte_cpuflags.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_bbdev_op.h"
>   
> @@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   
> @@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /** Definitions of device event types */
> diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
> index 22c438f2dd..912cee9a16 100644
> --- a/lib/compressdev/rte_compressdev.c
> +++ b/lib/compressdev/rte_compressdev.c
> @@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   	nb_ops = (*dev->dequeue_burst)
>   			(dev->data->queue_pairs[qp_id], ops, nb_ops);
>   
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +
>   	return nb_ops;
>   }
>   
> diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
> index 56f459c6a0..072874020d 100644
> --- a/lib/cryptodev/rte_cryptodev.h
> +++ b/lib/cryptodev/rte_cryptodev.h
> @@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
>   	}
>   #endif
> +
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
>   	return nb_ops;
>   }
>   
> diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
> index 3035b7a999..35b0d8d36b 100644
> --- a/lib/distributor/rte_distributor.c
> +++ b/lib/distributor/rte_distributor.c
> @@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
>   	}
>   
>   	/*
> @@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
>   
>   	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
>   		if (return_count <= 1) {
> +			uint16_t cnt;
>   			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
> -				worker_id, return_count ? oldpkt[0] : NULL);
> -			return (pkts[0]) ? 1 : 0;
> -		} else
> -			return -EINVAL;
> +								 worker_id,
> +								 return_count ? oldpkt[0] : NULL);
> +			cnt = (pkts[0] != NULL) ? 1 : 0;
> +			RTE_LCORE_TELEMETRY_TIMESTAMP(cnt);
> +			return cnt;
> +		}
> +		return -EINVAL;
>   	}
>   
>   	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
>   
> -	count = rte_distributor_poll_pkt(d, worker_id, pkts);
> -	while (count == -1) {
> +	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
>   		uint64_t t = rte_rdtsc() + 100;
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
>   
> -		count = rte_distributor_poll_pkt(d, worker_id, pkts);
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
>   	}
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(count);
>   	return count;
>   }
>   
> diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
> index 2c77ac454a..63cc9aab69 100644
> --- a/lib/distributor/rte_distributor_single.c
> +++ b/lib/distributor/rte_distributor_single.c
> @@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
>   	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
>   	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
>   			| RTE_DISTRIB_GET_BUF;
> -	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
> -		==, 0, __ATOMIC_RELAXED);
> +
> +	while (!((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
> +			& RTE_DISTRIB_FLAGS_MASK) == 0)) {
> +		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
> +	}
>   
>   	/* Sync with distributor on GET_BUF flag. */
>   	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
> @@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
>   {
>   	struct rte_mbuf *ret;
>   	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
> -	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
> +	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
>   		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_TELEMETRY_TIMESTAMP(0);
> +	}
>   	return ret;
>   }
>   
> diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
> index e7f992b734..98176a6a7a 100644
> --- a/lib/dmadev/rte_dmadev.h
> +++ b/lib/dmadev/rte_dmadev.h
> @@ -149,6 +149,7 @@
>   #include <rte_bitops.h>
>   #include <rte_common.h>
>   #include <rte_compat.h>
> +#include <rte_lcore.h>
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		  uint16_t *last_idx, bool *has_error)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   	bool err;
>   
>   #ifdef RTE_DMADEV_DEBUG
> @@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		has_error = &err;
>   
>   	*has_error = false;
> -	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> -				 has_error);
> +	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> +				   has_error);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   			 enum rte_dma_status_code *status)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   
>   #ifdef RTE_DMADEV_DEBUG
>   	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
> @@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   	if (last_idx == NULL)
>   		last_idx = &idx;
>   
> -	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
> +	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
>   					last_idx, status);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
> new file mode 100644
> index 0000000000..bba0afc26d
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_telemetry.c
> @@ -0,0 +1,293 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2010-2014 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#include <rte_telemetry.h>
> +#endif
> +
> +int __rte_lcore_telemetry_enabled;

Is "telemetry" really the term to use here? Isn't this just another 
piece of statistics? It can be used for telemetry, or in some other fashion.

(Use bool not int.)

> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +
> +struct lcore_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +
> +static struct lcore_telemetry *telemetry_data;
> +
> +#define LCORE_POLL_BUSYNESS_MAX 100
> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
> +#define LCORE_POLL_BUSYNESS_MIN 0
> +
> +#define SMOOTH_COEFF 5
> +#define STATE_CHANGE_OPT 32
> +
> +static void lcore_config_init(void)
> +{
> +	int lcore_id;
> +
> +	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
> +	if (telemetry_data == NULL)
> +		rte_panic("Could not init lcore telemetry data: Out of memory\n");
> +
> +	RTE_LCORE_FOREACH(lcore_id) {
> +		struct lcore_telemetry *td = &telemetry_data[lcore_id];
> +
> +		td->interval_ts = 0;
> +		td->last_poll_ts = 0;
> +		td->empty_cycles = 0;
> +		td->last_empty = true;
> +		td->contig_poll_cnt = 0;
> +		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
> +		td->raw_poll_busyness = 0;
> +	}
> +}
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id)
> +{
> +	const uint64_t active_thresh = rte_get_tsc_hz() * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
> +	struct lcore_telemetry *tdata;
> +
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return -EINVAL;
> +	tdata = &telemetry_data[lcore_id];
> +
> +	/* if the lcore is not active */
> +	if (tdata->interval_ts == 0)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +	/* if the core hasn't been active in a while */
> +	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +
> +	/* this core is active, report its poll busyness */
> +	return telemetry_data[lcore_id].poll_busyness;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return __rte_lcore_telemetry_enabled;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(int enable)

Use bool.

> +{
> +	__rte_lcore_telemetry_enabled = !!enable;

!!Another reason to use bool!! :)

Are you allowed to call this function during operation, you'll need a 
atomic store here (and an atomic load on the read side).

> +
> +	if (!enable)
> +		lcore_config_init();
> +}
> +
> +static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
> +				    const uint64_t empty, const uint64_t total)
> +{
> +	/*
> +	 * we don't want to use floating point math here, but we want for our poll
> +	 * busyness to react smoothly to sudden changes, while still keeping the
> +	 * accuracy and making sure that over time the average follows poll busyness
> +	 * as measured just-in-time. therefore, we will calculate the average poll
> +	 * busyness using integer math, but shift the decimal point two places
> +	 * to the right, so that 100.0 becomes 10000. this allows us to report
> +	 * integer values (0..100) while still allowing ourselves to follow the
> +	 * just-in-time measurements when we calculate our averages.
> +	 */
> +	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
> +

Why not just store/manage the number of busy (or idle, or both) cycles? 
Then the user can decide what time period to average over, to what 
extent the lcore utilization from previous periods should be factored 
in, etc.

In DSW, I initially presented only a load statistic (which averaged over 
250 us, with some contribution from previous period). I later came to 
realize that just exposing the number of busy cycles left the calling 
application much more options. For example, to present the average load 
during 1 s, you needed to have some control thread sampling the load 
statistic during that time period, as opposed to when the busy cycles 
statistic was introduced, it just had to read that value twice (at the 
beginning of the period, and at the end), and compared it will the 
amount of wallclock time passed.

> +	/*
> +	 * at upper end of the poll busyness scale, going up from 90->100 will take
> +	 * longer than going from 10->20 because of the averaging. to address
> +	 * this, we invert the scale when doing calculations: that is, we
> +	 * effectively calculate average *idle* cycle percentage, not average
> +	 * *busy* cycle percentage. this means that the scale is naturally
> +	 * biased towards fast scaling up, and slow scaling down.
> +	 */
> +	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
> +
> +	/* calculate rate of idle cycles, times 100 */
> +	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
> +
> +	/* smoothen the idleness */
> +	const int smoothened_idle =
> +			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
> +
> +	/* convert idleness back to poll busyness */
> +	return max_raw_idle - smoothened_idle;
> +}
> +
> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +	struct lcore_telemetry *tdata;
> +	const bool empty = nb_rx == 0;
> +	uint64_t diff_int, diff_last;
> +	bool last_empty;
> +
> +	/* This telemetry is not supported for unregistered non-EAL threads */
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return;
> +
> +	tdata = &telemetry_data[lcore_id];
> +	last_empty = tdata->last_empty;
> +
> +	/* optimization: don't do anything if status hasn't changed */
> +	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
> +		return;
> +	/* status changed or we're waiting for too long, reset counter */
> +	tdata->contig_poll_cnt = 0;
> +
> +	cur_tsc = rte_rdtsc();
> +
> +	interval_ts = tdata->interval_ts;
> +	empty_cycles = tdata->empty_cycles;
> +	last_poll_ts = tdata->last_poll_ts;
> +
> +	diff_int = cur_tsc - interval_ts;
> +	diff_last = cur_tsc - last_poll_ts;
> +
> +	/* is this the first time we're here? */
> +	if (interval_ts == 0) {
> +		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
> +		tdata->raw_poll_busyness = 0;
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->contig_poll_cnt = 0;
> +		goto end;
> +	}
> +
> +	/* update the empty counter if we got an empty poll earlier */
> +	if (last_empty)
> +		empty_cycles += diff_last;
> +
> +	/* have we passed the interval? */
> +	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
> +	if (diff_int > interval) {
> +		int raw_poll_busyness;
> +
> +		/* get updated poll_busyness value */
> +		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
> +
> +		/* set a new interval, reset empty counter */
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->raw_poll_busyness = raw_poll_busyness;
> +		/* bring poll busyness back to 0..100 range, biased to round up */
> +		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
> +	} else
> +		/* we may have updated empty counter */
> +		tdata->empty_cycles = empty_cycles;
> +
> +end:
> +	/* update status for next poll */
> +	tdata->last_poll_ts = cur_tsc;
> +	tdata->last_empty = empty;
> +}
> +
> +static int
> +lcore_poll_busyness_enable(const char *cmd __rte_unused,
> +		      const char *params __rte_unused,
> +		      struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(1);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_poll_busyness_disable(const char *cmd __rte_unused,
> +		       const char *params __rte_unused,
> +		       struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(0);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
> +
> +	if (telemetry_data != NULL)
> +		free(telemetry_data);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_handle_poll_busyness(const char *cmd __rte_unused,
> +		      const char *params __rte_unused, struct rte_tel_data *d)
> +{
> +	char corenum[64];
> +	int i;
> +
> +	rte_tel_data_start_dict(d);
> +
> +	RTE_LCORE_FOREACH(i) {
> +		if (!rte_lcore_is_enabled(i))
> +			continue;
> +		snprintf(corenum, sizeof(corenum), "%d", i);
> +		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
> +	}
> +
> +	return 0;
> +}
> +
> +RTE_INIT(lcore_init_telemetry)
> +{
> +	__rte_lcore_telemetry_enabled = true;
> +
> +	lcore_config_init();
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
> +				   "return percentage poll busyness of cores");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
> +				   "enable lcore poll busyness measurement");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
> +				   "disable lcore poll busyness measurement");
> +}
> +
> +#else
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
> +{
> +	return -ENOTSUP;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return -ENOTSUP;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(int enable __rte_unused)
> +{
> +}
> +
> +void __rte_lcore_telemetry_timestamp(uint16_t nb_rx __rte_unused)
> +{
> +}
> +
> +#endif
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..a743e66a7d 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -17,6 +17,7 @@ sources += files(
>           'eal_common_hexdump.c',
>           'eal_common_interrupts.c',
>           'eal_common_launch.c',
> +        'eal_common_lcore_telemetry.c',
>           'eal_common_lcore.c',
>           'eal_common_log.c',
>           'eal_common_mcfg.c',
> diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
> index b598e1b9ec..75c1f874cb 100644
> --- a/lib/eal/include/rte_lcore.h
> +++ b/lib/eal/include/rte_lcore.h
> @@ -415,6 +415,86 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
>   		const pthread_attr_t *attr,
>   		void *(*start_routine)(void *), void *arg);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read poll busyness value corresponding to an lcore.
> + *
> + * @param lcore_id
> + *   Lcore to read poll busyness value for.
> + * @return
> + *   - value between 0 and 100 on success
> + *   - -1 if lcore is not active
> + *   - -EINVAL if lcore is invalid
> + *   - -ENOMEM if not enough memory available
> + *   - -ENOTSUP if not supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness(unsigned int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Check if lcore poll busyness telemetry is enabled.
> + *
> + * @return
> + *   - 1 if lcore telemetry is enabled
> + *   - 0 if lcore telemetry is disabled
> + *   - -ENOTSUP if not lcore telemetry supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness_enabled(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Enable or disable poll busyness telemetry.
> + *
> + * @param enable
> + *   1 to enable, 0 to disable
> + */
> +__rte_experimental
> +void
> +rte_lcore_poll_busyness_enabled_set(int enable);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Lcore telemetry timestamping function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +__rte_experimental
> +void
> +__rte_lcore_telemetry_timestamp(uint16_t nb_rx);
> +
> +/** @internal lcore telemetry enabled status */
> +extern int __rte_lcore_telemetry_enabled;
> +
> +/**
> + * Call lcore telemetry timestamp function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx)                    \
> +	do {                                                    \
> +		if (__rte_lcore_telemetry_enabled)              \
> +			__rte_lcore_telemetry_timestamp(nb_rx); \
> +	} while (0)
> +#else
> +#define RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx) \
> +	while (0)
> +#endif
> +
>   #ifdef __cplusplus
>   }
>   #endif
> diff --git a/lib/eal/meson.build b/lib/eal/meson.build
> index 056beb9461..2fb90d446b 100644
> --- a/lib/eal/meson.build
> +++ b/lib/eal/meson.build
> @@ -25,6 +25,9 @@ subdir(arch_subdir)
>   deps += ['kvargs']
>   if not is_windows
>       deps += ['telemetry']
> +else
> +	# core poll busyness telemetry depends on telemetry library
> +	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
>   endif
>   if dpdk_conf.has('RTE_USE_LIBBSD')
>       ext_deps += libbsd
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 1f293e768b..f84d2dc319 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -424,6 +424,13 @@ EXPERIMENTAL {
>   	rte_thread_self;
>   	rte_thread_set_affinity_by_id;
>   	rte_thread_set_priority;
> +
> +	# added in 22.11
> +	__rte_lcore_telemetry_timestamp;
> +	__rte_lcore_telemetry_enabled;
> +	rte_lcore_poll_busyness;
> +	rte_lcore_poll_busyness_enabled;
> +	rte_lcore_poll_busyness_enabled_set;
>   };
>   
>   INTERNAL {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index de9e970d4d..1caecd5a11 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>   #endif
>   
>   	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> +
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_rx);
>   	return nb_rx;
>   }
>   
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 6a6f6ea4c1..a1d42d9214 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   			uint16_t nb_events, uint64_t timeout_ticks)
>   {
>   	const struct rte_event_fp_ops *fp_ops;
> +	uint16_t nb_evts;
>   	void *port;
>   
>   	fp_ops = &rte_event_fp_ops[dev_id];
> @@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   	 * requests nb_events as const one
>   	 */
>   	if (nb_events == 1)
> -		return (fp_ops->dequeue)(port, ev, timeout_ticks);
> +		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
>   	else
> -		return (fp_ops->dequeue_burst)(port, ev, nb_events,
> -					       timeout_ticks);
> +		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
> +					timeout_ticks);
> +
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_evts);
> +	return nb_evts;
>   }
>   
>   #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
> diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
> index 2f0a4f132e..f6c0ed196f 100644
> --- a/lib/rawdev/rte_rawdev.c
> +++ b/lib/rawdev/rte_rawdev.c
> @@ -16,6 +16,7 @@
>   #include <rte_common.h>
>   #include <rte_malloc.h>
>   #include <rte_telemetry.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_rawdev.h"
>   #include "rte_rawdev_pmd.h"
> @@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
>   			   rte_rawdev_obj_t context)
>   {
>   	struct rte_rawdev *dev;
> +	int nb_ops;
>   
>   	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
>   	dev = &rte_rawdevs[dev_id];
>   
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
> -	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   int
> diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
> index 3bce8090f6..781055b4eb 100644
> --- a/lib/regexdev/rte_regexdev.h
> +++ b/lib/regexdev/rte_regexdev.h
> @@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   			   struct rte_regex_ops **ops, uint16_t nb_ops)
>   {
>   	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
> +	uint16_t deq_ops;
>   #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
>   	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
> @@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		return -EINVAL;
>   	}
>   #endif
> -	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(deq_ops);
> +	return deq_ops;
>   }
>   
>   #ifdef __cplusplus
> diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
> index 83788c56e6..6db09d4291 100644
> --- a/lib/ring/rte_ring_elem_pvt.h
> +++ b/lib/ring/rte_ring_elem_pvt.h
> @@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
>   end:
>   	if (available != NULL)
>   		*available = entries - n;
> +	RTE_LCORE_TELEMETRY_TIMESTAMP(n);
>   	return n;
>   }
>   
> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..725b851f69 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>          'Install headers to build drivers.')
>   option('enable_kmods', type: 'boolean', value: false, description:
>          'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: true, description:
> +       'enable collection of lcore poll busyness telemetry')
>   option('examples', type: 'string', value: '', description:
>          'Comma-separated list of examples to build by default')
>   option('flexran_sdk', type: 'string', value: '', description:

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26 22:06     ` Mattias Rönnblom
@ 2022-08-29  8:23       ` Bruce Richardson
  2022-08-29 13:16       ` Kevin Laatz
  2022-08-30 10:26       ` Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Bruce Richardson @ 2022-08-29  8:23 UTC (permalink / raw)
  To: Mattias Rönnblom
  Cc: Kevin Laatz, dev, anatoly.burakov, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On Sat, Aug 27, 2022 at 12:06:19AM +0200, Mattias Rönnblom wrote:
> On 2022-08-25 17:28, Kevin Laatz wrote:
> > From: Anatoly Burakov <anatoly.burakov@intel.com>
> > 
<snip>
> > This patch also adds a telemetry endpoint to report lcore poll busyness, as
> > well as telemetry endpoints to enable/disable lcore telemetry. A
> > documentation entry has been added to the howto guides to explain the usage
> > of the new telemetry endpoints and API.
> > 
> 
> Should there really be a dependency from the EAL to the telemetry library? A
> cycle. Maybe some dependency inversion would be in order? The telemetry
> library could instead register an interest in getting busy/idle cycles
> reports from lcores.
> 
Just on this point, EAL already exposes telemetry and already depends upon
the telemetry library, so there would be no new dependency introduced here.

With the existing code, we avoid a cycle by having telemetry avoid using
EAL functions - and for the couple it does need, e.g. the log function, we
inject in the function pointer on init.

/Bruce

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26 15:46               ` Morten Brørup
@ 2022-08-29 10:41                 ` Bruce Richardson
  2022-08-29 10:53                   ` Thomas Monjalon
  2022-08-29 11:22                   ` Morten Brørup
  0 siblings, 2 replies; 87+ messages in thread
From: Bruce Richardson @ 2022-08-29 10:41 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Kevin Laatz, Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh,
	David Hunt, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
> > From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> > Sent: Friday, 26 August 2022 17.27
> > 
> > On 26/08/2022 09:29, Morten Brørup wrote:
> > >> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > >> Sent: Friday, 26 August 2022 10.16
> > >>
> > >> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> > >> <bruce.richardson@intel.com> wrote:
> > >>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > >>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
> > <kevin.laatz@intel.com>
> > >> wrote:
> > >>>>> From: Anatoly Burakov <anatoly.burakov@intel.com>
> > >>>>>
> > >>>>> Currently, there is no way to measure lcore poll busyness in a
> > >> passive way,
> > >>>>> without any modifications to the application. This patch adds a
> > >> new EAL API
> > >>>>> that will be able to passively track core polling busyness.
> > >>>>>
> > >>>>> The poll busyness is calculated by relying on the fact that most
> > >> DPDK API's
> > >>>>> will poll for packets. Empty polls can be counted as "idle",
> > >> while
> > >>>>> non-empty polls can be counted as busy. To measure lcore poll
> > >> busyness, we
> > >>>>> simply call the telemetry timestamping function with the number
> > >> of polls a
> > >>>>> particular code section has processed, and count the number of
> > >> cycles we've
> > >>>>> spent processing empty bursts. The more empty bursts we
> > >> encounter, the less
> > >>>>> cycles we spend in "busy" state, and the less core poll busyness
> > >> will be
> > >>>>> reported.
> > >>>>>
> > >>>>> In order for all of the above to work without modifications to
> > >> the
> > >>>>> application, the library code needs to be instrumented with calls
> > >> to the
> > >>>>> lcore telemetry busyness timestamping function. The following
> > >> parts of DPDK
> > >>>>> are instrumented with lcore telemetry calls:
> > >>>>>
> > >>>>> - All major driver API's:
> > >>>>>    - ethdev
> > >>>>>    - cryptodev
> > >>>>>    - compressdev
> > >>>>>    - regexdev
> > >>>>>    - bbdev
> > >>>>>    - rawdev
> > >>>>>    - eventdev
> > >>>>>    - dmadev
> > >>>>> - Some additional libraries:
> > >>>>>    - ring
> > >>>>>    - distributor
> > >>>>>
> > >>>>> To avoid performance impact from having lcore telemetry support,
> > >> a global
> > >>>>> variable is exported by EAL, and a call to timestamping function
> > >> is wrapped
> > >>>>> into a macro, so that whenever telemetry is disabled, it only
> > >> takes one
> > >>>>> additional branch and no function calls are performed. It is also
> > >> possible
> > >>>>> to disable it at compile time by commenting out
> > >> RTE_LCORE_BUSYNESS from
> > >>>>> build config.
> > >>>>>
> > >>>>> This patch also adds a telemetry endpoint to report lcore poll
> > >> busyness, as
> > >>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
> > >>>>> documentation entry has been added to the howto guides to explain
> > >> the usage
> > >>>>> of the new telemetry endpoints and API.
> > >>>>>
> > >>>>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > >>>>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > >>>>> Signed-off-by: David Hunt <david.hunt@intel.com>
> > >>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > >>>>>
> > >>>>> ---
> > >>>>> v3:
> > >>>>>    * Fix missed renaming to poll busyness
> > >>>>>    * Fix clang compilation
> > >>>>>    * Fix arm compilation
> > >>>>>
> > >>>>> v2:
> > >>>>>    * Use rte_get_tsc_hz() to adjust the telemetry period
> > >>>>>    * Rename to reflect polling busyness vs general busyness
> > >>>>>    * Fix segfault when calling telemetry timestamp from an
> > >> unregistered
> > >>>>>      non-EAL thread.
> > >>>>>    * Minor cleanup
> > >>>>> ---
> > >>>>> diff --git a/meson_options.txt b/meson_options.txt
> > >>>>> index 7c220ad68d..725b851f69 100644
> > >>>>> --- a/meson_options.txt
> > >>>>> +++ b/meson_options.txt
> > >>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
> > >> value: false, description:
> > >>>>>          'Install headers to build drivers.')
> > >>>>>   option('enable_kmods', type: 'boolean', value: false,
> > >> description:
> > >>>>>          'build kernel modules')
> > >>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
> > >> true, description:
> > >>>>> +       'enable collection of lcore poll busyness telemetry')
> > >>>> IMO, All fastpath features should be opt-in. i.e default should be
> > >> false.
> > >>>> For the trace fastpath related changes, We have done the similar
> > >> thing
> > >>>> even though it cost additional one cycle for disabled trace points
> > >>>>
> > >>> We do need to consider runtime and build defaults differently,
> > >> though.
> > >>> Since this has also runtime enabling, I think having build-time
> > >> enabling
> > >>> true as default is ok, so long as the runtime enabling is false
> > >> (assuming
> > >>> no noticable overhead when the feature is disabled.)
> > >> I was talking about buildtime only. "enable_trace_fp" meson option
> > >> selected as
> > >> false as default.
> > > Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
> > follow the design pattern of "enable_trace_fp".
> > 
> > +1 to making this opt-in. However, I'd lean more towards having the
> > buildtime option enabled and the runtime option disabled by default.
> > There is no measurable impact cause by the extra branch (the check for
> > enabled/disabled in the macro) by disabling at runtime, and we gain the
> > benefit of avoiding a recompile to enable it later.
> 
> The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.
> 
> Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.
> 
> > 
> > >
> > >> If the concern is enabling on generic distros then distro generic
> > >> config can opt in this
> > >>
> > >>> /Bruce
> > > @Kevin, are you considering a roadmap for using
> > RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
> > should also be renamed to indicate that it is part of the "poll
> > busyness" telemetry.
> > 
> > No further purposes are planned for this macro, I'll rename it in the
> > next revision.
> 
> OK. Thank you.
> 
> Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)
> 
> [1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
> 

I was actually discussing this with Kevin and Dave H. on Friay, and trying
to make this a separate library is indeed a big stretch. :-)

From that discussion, the key point/gap is that we are really missing a
clean way of providing undefs or macro fallbacks for when a library is just
not present. For example, if this was a separate library we would gain a
number of advantages e.g. no need for separate enable/disable flag, but the
big disadvantage is that every header include for it, and every reference
to the macros used in that header need to be surrounded by big ugly ifdefs.

For now, adding this into EAL is the far more practical approach, since it
means that even if support for the feature is disabled at build time the
header is still available to provide the appropriate no-op macros.

/Bruce

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-29 10:41                 ` Bruce Richardson
@ 2022-08-29 10:53                   ` Thomas Monjalon
  2022-08-29 12:36                     ` Kevin Laatz
  2022-08-29 11:22                   ` Morten Brørup
  1 sibling, 1 reply; 87+ messages in thread
From: Thomas Monjalon @ 2022-08-29 10:53 UTC (permalink / raw)
  To: Morten Brørup, Bruce Richardson
  Cc: Kevin Laatz, Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh,
	David Hunt, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

29/08/2022 12:41, Bruce Richardson:
> On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
> > > From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> > > Sent: Friday, 26 August 2022 17.27
> > > 
> > > On 26/08/2022 09:29, Morten Brørup wrote:
> > > >> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
> > > >> Sent: Friday, 26 August 2022 10.16
> > > >>
> > > >> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
> > > >> <bruce.richardson@intel.com> wrote:
> > > >>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
> > > >>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
> > > <kevin.laatz@intel.com>
> > > >> wrote:
> > > >>>>> From: Anatoly Burakov <anatoly.burakov@intel.com>
> > > >>>>>
> > > >>>>> Currently, there is no way to measure lcore poll busyness in a
> > > >> passive way,
> > > >>>>> without any modifications to the application. This patch adds a
> > > >> new EAL API
> > > >>>>> that will be able to passively track core polling busyness.
> > > >>>>>
> > > >>>>> The poll busyness is calculated by relying on the fact that most
> > > >> DPDK API's
> > > >>>>> will poll for packets. Empty polls can be counted as "idle",
> > > >> while
> > > >>>>> non-empty polls can be counted as busy. To measure lcore poll
> > > >> busyness, we
> > > >>>>> simply call the telemetry timestamping function with the number
> > > >> of polls a
> > > >>>>> particular code section has processed, and count the number of
> > > >> cycles we've
> > > >>>>> spent processing empty bursts. The more empty bursts we
> > > >> encounter, the less
> > > >>>>> cycles we spend in "busy" state, and the less core poll busyness
> > > >> will be
> > > >>>>> reported.
> > > >>>>>
> > > >>>>> In order for all of the above to work without modifications to
> > > >> the
> > > >>>>> application, the library code needs to be instrumented with calls
> > > >> to the
> > > >>>>> lcore telemetry busyness timestamping function. The following
> > > >> parts of DPDK
> > > >>>>> are instrumented with lcore telemetry calls:
> > > >>>>>
> > > >>>>> - All major driver API's:
> > > >>>>>    - ethdev
> > > >>>>>    - cryptodev
> > > >>>>>    - compressdev
> > > >>>>>    - regexdev
> > > >>>>>    - bbdev
> > > >>>>>    - rawdev
> > > >>>>>    - eventdev
> > > >>>>>    - dmadev
> > > >>>>> - Some additional libraries:
> > > >>>>>    - ring
> > > >>>>>    - distributor
> > > >>>>>
> > > >>>>> To avoid performance impact from having lcore telemetry support,
> > > >> a global
> > > >>>>> variable is exported by EAL, and a call to timestamping function
> > > >> is wrapped
> > > >>>>> into a macro, so that whenever telemetry is disabled, it only
> > > >> takes one
> > > >>>>> additional branch and no function calls are performed. It is also
> > > >> possible
> > > >>>>> to disable it at compile time by commenting out
> > > >> RTE_LCORE_BUSYNESS from
> > > >>>>> build config.
> > > >>>>>
> > > >>>>> This patch also adds a telemetry endpoint to report lcore poll
> > > >> busyness, as
> > > >>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
> > > >>>>> documentation entry has been added to the howto guides to explain
> > > >> the usage
> > > >>>>> of the new telemetry endpoints and API.
> > > >>>>>
> > > >>>>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > > >>>>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> > > >>>>> Signed-off-by: David Hunt <david.hunt@intel.com>
> > > >>>>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> > > >>>>>
> > > >>>>> ---
> > > >>>>> v3:
> > > >>>>>    * Fix missed renaming to poll busyness
> > > >>>>>    * Fix clang compilation
> > > >>>>>    * Fix arm compilation
> > > >>>>>
> > > >>>>> v2:
> > > >>>>>    * Use rte_get_tsc_hz() to adjust the telemetry period
> > > >>>>>    * Rename to reflect polling busyness vs general busyness
> > > >>>>>    * Fix segfault when calling telemetry timestamp from an
> > > >> unregistered
> > > >>>>>      non-EAL thread.
> > > >>>>>    * Minor cleanup
> > > >>>>> ---
> > > >>>>> diff --git a/meson_options.txt b/meson_options.txt
> > > >>>>> index 7c220ad68d..725b851f69 100644
> > > >>>>> --- a/meson_options.txt
> > > >>>>> +++ b/meson_options.txt
> > > >>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
> > > >> value: false, description:
> > > >>>>>          'Install headers to build drivers.')
> > > >>>>>   option('enable_kmods', type: 'boolean', value: false,
> > > >> description:
> > > >>>>>          'build kernel modules')
> > > >>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
> > > >> true, description:
> > > >>>>> +       'enable collection of lcore poll busyness telemetry')
> > > >>>> IMO, All fastpath features should be opt-in. i.e default should be
> > > >> false.
> > > >>>> For the trace fastpath related changes, We have done the similar
> > > >> thing
> > > >>>> even though it cost additional one cycle for disabled trace points
> > > >>>>
> > > >>> We do need to consider runtime and build defaults differently,
> > > >> though.
> > > >>> Since this has also runtime enabling, I think having build-time
> > > >> enabling
> > > >>> true as default is ok, so long as the runtime enabling is false
> > > >> (assuming
> > > >>> no noticable overhead when the feature is disabled.)
> > > >> I was talking about buildtime only. "enable_trace_fp" meson option
> > > >> selected as
> > > >> false as default.
> > > > Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
> > > follow the design pattern of "enable_trace_fp".
> > > 
> > > +1 to making this opt-in. However, I'd lean more towards having the
> > > buildtime option enabled and the runtime option disabled by default.
> > > There is no measurable impact cause by the extra branch (the check for
> > > enabled/disabled in the macro) by disabling at runtime, and we gain the
> > > benefit of avoiding a recompile to enable it later.
> > 
> > The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.
> > 
> > Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.
> > 
> > > 
> > > >
> > > >> If the concern is enabling on generic distros then distro generic
> > > >> config can opt in this
> > > >>
> > > >>> /Bruce
> > > > @Kevin, are you considering a roadmap for using
> > > RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
> > > should also be renamed to indicate that it is part of the "poll
> > > busyness" telemetry.
> > > 
> > > No further purposes are planned for this macro, I'll rename it in the
> > > next revision.
> > 
> > OK. Thank you.
> > 
> > Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)
> > 
> > [1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
> > 
> 
> I was actually discussing this with Kevin and Dave H. on Friay, and trying
> to make this a separate library is indeed a big stretch. :-)
> 
> From that discussion, the key point/gap is that we are really missing a
> clean way of providing undefs or macro fallbacks for when a library is just
> not present. For example, if this was a separate library we would gain a
> number of advantages e.g. no need for separate enable/disable flag, but the
> big disadvantage is that every header include for it, and every reference
> to the macros used in that header need to be surrounded by big ugly ifdefs.
> 
> For now, adding this into EAL is the far more practical approach, since it
> means that even if support for the feature is disabled at build time the
> header is still available to provide the appropriate no-op macros.

We can make the library always available with different implementations
based on a build option.
But is it a good idea to have a different behaviour based on build option?
Why not making it a runtime option?
Can the performance hit be cached in some way?



^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-29 10:41                 ` Bruce Richardson
  2022-08-29 10:53                   ` Thomas Monjalon
@ 2022-08-29 11:22                   ` Morten Brørup
  1 sibling, 0 replies; 87+ messages in thread
From: Morten Brørup @ 2022-08-29 11:22 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Kevin Laatz, Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh,
	David Hunt, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Monday, 29 August 2022 12.41
> 
> On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
> >
> > Also, there's a new discussion about EAL bloat [1]. Perhaps I'm
> stretching it here, but it would be nice if your library was made a
> separate library, instead of part of the EAL library. (Since this kind
> of feature is not new to the EAL, I will categorize this suggestion as
> "nice to have", not "must have".)
> >
> > [1] http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
> >
> 
> I was actually discussing this with Kevin and Dave H. on Friay, and
> trying
> to make this a separate library is indeed a big stretch. :-)
> 
> From that discussion, the key point/gap is that we are really missing a
> clean way of providing undefs or macro fallbacks for when a library is
> just
> not present. For example, if this was a separate library we would gain
> a
> number of advantages e.g. no need for separate enable/disable flag, but
> the
> big disadvantage is that every header include for it, and every
> reference
> to the macros used in that header need to be surrounded by big ugly
> ifdefs.

I agree that we don't want everything surrounded by big ugly ifdefs.

I think solving this should be the responsibility of a library itself, not of the application or other libraries.

E.g. like the mbuf library's use of RTE_LIBRTE_MBUF_DEBUG.

Additionally, a no-op variant of the library could be required, to be compiled in when the library is disabled at build time.

> 
> For now, adding this into EAL is the far more practical approach, since
> it
> means that even if support for the feature is disabled at build time
> the
> header is still available to provide the appropriate no-op macros.

I think you already discovered the key to solving this problem, but perhaps didn't even notice it yourselves:

It must be possible to omit the *feature* at build time - not necessarily the *library*.



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-29 10:53                   ` Thomas Monjalon
@ 2022-08-29 12:36                     ` Kevin Laatz
  2022-08-29 12:49                       ` Morten Brørup
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-08-29 12:36 UTC (permalink / raw)
  To: Thomas Monjalon, Morten Brørup, Bruce Richardson
  Cc: Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev

[-- Attachment #1: Type: text/plain, Size: 8608 bytes --]

On 29/08/2022 11:53, Thomas Monjalon wrote:
> 29/08/2022 12:41, Bruce Richardson:
>> On Fri, Aug 26, 2022 at 05:46:48PM +0200, Morten Brørup wrote:
>>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>>>> Sent: Friday, 26 August 2022 17.27
>>>>
>>>> On 26/08/2022 09:29, Morten Brørup wrote:
>>>>>> From: Jerin Jacob [mailto:jerinjacobk@gmail.com]
>>>>>> Sent: Friday, 26 August 2022 10.16
>>>>>>
>>>>>> On Fri, Aug 26, 2022 at 1:37 PM Bruce Richardson
>>>>>> <bruce.richardson@intel.com>  wrote:
>>>>>>> On Fri, Aug 26, 2022 at 12:35:16PM +0530, Jerin Jacob wrote:
>>>>>>>> On Thu, Aug 25, 2022 at 8:56 PM Kevin Laatz
>>>> <kevin.laatz@intel.com>
>>>>>> wrote:
>>>>>>>>> From: Anatoly Burakov<anatoly.burakov@intel.com>
>>>>>>>>>
>>>>>>>>> Currently, there is no way to measure lcore poll busyness in a
>>>>>> passive way,
>>>>>>>>> without any modifications to the application. This patch adds a
>>>>>> new EAL API
>>>>>>>>> that will be able to passively track core polling busyness.
>>>>>>>>>
>>>>>>>>> The poll busyness is calculated by relying on the fact that most
>>>>>> DPDK API's
>>>>>>>>> will poll for packets. Empty polls can be counted as "idle",
>>>>>> while
>>>>>>>>> non-empty polls can be counted as busy. To measure lcore poll
>>>>>> busyness, we
>>>>>>>>> simply call the telemetry timestamping function with the number
>>>>>> of polls a
>>>>>>>>> particular code section has processed, and count the number of
>>>>>> cycles we've
>>>>>>>>> spent processing empty bursts. The more empty bursts we
>>>>>> encounter, the less
>>>>>>>>> cycles we spend in "busy" state, and the less core poll busyness
>>>>>> will be
>>>>>>>>> reported.
>>>>>>>>>
>>>>>>>>> In order for all of the above to work without modifications to
>>>>>> the
>>>>>>>>> application, the library code needs to be instrumented with calls
>>>>>> to the
>>>>>>>>> lcore telemetry busyness timestamping function. The following
>>>>>> parts of DPDK
>>>>>>>>> are instrumented with lcore telemetry calls:
>>>>>>>>>
>>>>>>>>> - All major driver API's:
>>>>>>>>>     - ethdev
>>>>>>>>>     - cryptodev
>>>>>>>>>     - compressdev
>>>>>>>>>     - regexdev
>>>>>>>>>     - bbdev
>>>>>>>>>     - rawdev
>>>>>>>>>     - eventdev
>>>>>>>>>     - dmadev
>>>>>>>>> - Some additional libraries:
>>>>>>>>>     - ring
>>>>>>>>>     - distributor
>>>>>>>>>
>>>>>>>>> To avoid performance impact from having lcore telemetry support,
>>>>>> a global
>>>>>>>>> variable is exported by EAL, and a call to timestamping function
>>>>>> is wrapped
>>>>>>>>> into a macro, so that whenever telemetry is disabled, it only
>>>>>> takes one
>>>>>>>>> additional branch and no function calls are performed. It is also
>>>>>> possible
>>>>>>>>> to disable it at compile time by commenting out
>>>>>> RTE_LCORE_BUSYNESS from
>>>>>>>>> build config.
>>>>>>>>>
>>>>>>>>> This patch also adds a telemetry endpoint to report lcore poll
>>>>>> busyness, as
>>>>>>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
>>>>>>>>> documentation entry has been added to the howto guides to explain
>>>>>> the usage
>>>>>>>>> of the new telemetry endpoints and API.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Kevin Laatz<kevin.laatz@intel.com>
>>>>>>>>> Signed-off-by: Conor Walsh<conor.walsh@intel.com>
>>>>>>>>> Signed-off-by: David Hunt<david.hunt@intel.com>
>>>>>>>>> Signed-off-by: Anatoly Burakov<anatoly.burakov@intel.com>
>>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> v3:
>>>>>>>>>     * Fix missed renaming to poll busyness
>>>>>>>>>     * Fix clang compilation
>>>>>>>>>     * Fix arm compilation
>>>>>>>>>
>>>>>>>>> v2:
>>>>>>>>>     * Use rte_get_tsc_hz() to adjust the telemetry period
>>>>>>>>>     * Rename to reflect polling busyness vs general busyness
>>>>>>>>>     * Fix segfault when calling telemetry timestamp from an
>>>>>> unregistered
>>>>>>>>>       non-EAL thread.
>>>>>>>>>     * Minor cleanup
>>>>>>>>> ---
>>>>>>>>> diff --git a/meson_options.txt b/meson_options.txt
>>>>>>>>> index 7c220ad68d..725b851f69 100644
>>>>>>>>> --- a/meson_options.txt
>>>>>>>>> +++ b/meson_options.txt
>>>>>>>>> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean',
>>>>>> value: false, description:
>>>>>>>>>           'Install headers to build drivers.')
>>>>>>>>>    option('enable_kmods', type: 'boolean', value: false,
>>>>>> description:
>>>>>>>>>           'build kernel modules')
>>>>>>>>> +option('enable_lcore_poll_busyness', type: 'boolean', value:
>>>>>> true, description:
>>>>>>>>> +       'enable collection of lcore poll busyness telemetry')
>>>>>>>> IMO, All fastpath features should be opt-in. i.e default should be
>>>>>> false.
>>>>>>>> For the trace fastpath related changes, We have done the similar
>>>>>> thing
>>>>>>>> even though it cost additional one cycle for disabled trace points
>>>>>>>>
>>>>>>> We do need to consider runtime and build defaults differently,
>>>>>> though.
>>>>>>> Since this has also runtime enabling, I think having build-time
>>>>>> enabling
>>>>>>> true as default is ok, so long as the runtime enabling is false
>>>>>> (assuming
>>>>>>> no noticable overhead when the feature is disabled.)
>>>>>> I was talking about buildtime only. "enable_trace_fp" meson option
>>>>>> selected as
>>>>>> false as default.
>>>>> Agree. "enable_lcore_poll_busyness" is in the fast path, so it should
>>>> follow the design pattern of "enable_trace_fp".
>>>>
>>>> +1 to making this opt-in. However, I'd lean more towards having the
>>>> buildtime option enabled and the runtime option disabled by default.
>>>> There is no measurable impact cause by the extra branch (the check for
>>>> enabled/disabled in the macro) by disabling at runtime, and we gain the
>>>> benefit of avoiding a recompile to enable it later.
>>> The exact same thing could be said about "enable_trace_fp"; however, the development effort was put into separating it from "enable_trace", so it could be disabled by default.
>>>
>>> Your patch is unlikely to get approved if you don't follow the "enable_trace_fp" design pattern as suggested.
>>>
>>>>>> If the concern is enabling on generic distros then distro generic
>>>>>> config can opt in this
>>>>>>
>>>>>>> /Bruce
>>>>> @Kevin, are you considering a roadmap for using
>>>> RTE_LCORE_TELEMETRY_TIMESTAMP() for other purposes? Otherwise, it
>>>> should also be renamed to indicate that it is part of the "poll
>>>> busyness" telemetry.
>>>>
>>>> No further purposes are planned for this macro, I'll rename it in the
>>>> next revision.
>>> OK. Thank you.
>>>
>>> Also, there's a new discussion about EAL bloat [1]. Perhaps I'm stretching it here, but it would be nice if your library was made a separate library, instead of part of the EAL library. (Since this kind of feature is not new to the EAL, I will categorize this suggestion as "nice to have", not "must have".)
>>>
>>> [1]http://inbox.dpdk.org/dev/2594603.Isy0gbHreE@thomas/T/
>>>
>> I was actually discussing this with Kevin and Dave H. on Friay, and trying
>> to make this a separate library is indeed a big stretch. :-)
>>
>>  From that discussion, the key point/gap is that we are really missing a
>> clean way of providing undefs or macro fallbacks for when a library is just
>> not present. For example, if this was a separate library we would gain a
>> number of advantages e.g. no need for separate enable/disable flag, but the
>> big disadvantage is that every header include for it, and every reference
>> to the macros used in that header need to be surrounded by big ugly ifdefs.
>>
>> For now, adding this into EAL is the far more practical approach, since it
>> means that even if support for the feature is disabled at build time the
>> header is still available to provide the appropriate no-op macros.
> We can make the library always available with different implementations
> based on a build option.
> But is it a good idea to have a different behaviour based on build option?
> Why not making it a runtime option?
> Can the performance hit be cached in some way?
>
The patches currently include runtime options to enable/disable the 
feature via API and via telemetry endpoints. We have run performance 
tests and have failed to measure any performance impact with the feature 
_runtime_ disabled.

We added the buildtime option following previous feedback where an 
absolute guarantee is required that performance would not be affected by 
the addition of this feature. To avoid clutter in the meson options, we 
could either a) remove the buildtime option, or b) allow a CFLAG to 
disable the feature rather than an explicit buildtime option. Either 
way, they would only serve a reassurance purpose.

[-- Attachment #2: Type: text/html, Size: 20073 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-29 12:36                     ` Kevin Laatz
@ 2022-08-29 12:49                       ` Morten Brørup
  2022-08-29 13:37                         ` Kevin Laatz
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-08-29 12:49 UTC (permalink / raw)
  To: Kevin Laatz, Thomas Monjalon, Bruce Richardson
  Cc: Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev

From: Kevin Laatz [mailto:kevin.laatz@intel.com] 
Sent: Monday, 29 August 2022 14.37
>
> The patches currently include runtime options to enable/disable the feature via API and via telemetry endpoints. We have run performance tests and have failed to measure any performance impact with the feature runtime disabled.

Lots of features are added to DPDK all the time, and they all use the same "insignificant performance impact" argument. But the fact is, each added test-and-branch has some small performance impact (and consume some entries in the branch prediction table, which may impact performance elsewhere). If you add a million features using this argument, there will be a significant and measurable performance impact.

Which is why I keep insisting on the ability to omit non-core features from DPDK at build time.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26 22:06     ` Mattias Rönnblom
  2022-08-29  8:23       ` Bruce Richardson
@ 2022-08-29 13:16       ` Kevin Laatz
  2022-08-30 10:26       ` Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-29 13:16 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On 26/08/2022 23:06, Mattias Rönnblom wrote:
> On 2022-08-25 17:28, Kevin Laatz wrote:
>> From: Anatoly Burakov <anatoly.burakov@intel.com>
>>
>> Currently, there is no way to measure lcore poll busyness in a 
>> passive way,
>> without any modifications to the application. This patch adds a new 
>> EAL API
>> that will be able to passively track core polling busyness.
>
> There's no generic way, but the DSW event device keep tracks of lcore 
> utilization (i.e., the fraction of cycles used to perform actual work, 
> as opposed to just polling empty queues), and it does some with the 
> same basic principles as, from what it seems after a quick look, used 
> in this patch.
>
>>
>> The poll busyness is calculated by relying on the fact that most DPDK 
>> API's
>> will poll for packets. Empty polls can be counted as "idle", while
>
> Lcore worker threads poll for work. Packets, timeouts, completions, 
> event device events, etc.

Yes, the wording was too restrictive here - the patch includes changes 
to drivers and libraries such as dmadev, eventdev, ring etc that poll 
for work and would want to mark it as "idle" or "busy".


>
>> non-empty polls can be counted as busy. To measure lcore poll 
>> busyness, we
>
> I guess what is meant here is that cycles spent after non-empty polls 
> can be counted as busy (useful) cycles? Potentially including the 
> cycles spent for the actual poll operation. ("Poll busyness" is a very 
> vague term, in my opionion.)
>
> Similiarly, cycles spent after an empty poll would not be counted.

Correct, the generic functionality works this way. Any cycles between an 
"busy poll" and the next "idle poll" will be counted as busy/useful work 
(and vice versa).


>
>> simply call the telemetry timestamping function with the number of 
>> polls a
>> particular code section has processed, and count the number of cycles 
>> we've
>> spent processing empty bursts. The more empty bursts we encounter, 
>> the less
>> cycles we spend in "busy" state, and the less core poll busyness will be
>> reported.
>>
>
> Is this the same scheme as DSW? When a non-zero burst in idle state 
> means a transition from the idle to busy? And a zero burst poll in 
> busy state means a transition from idle to busy?
>
> The issue with this scheme, is that you might potentially end up with 
> a state transition for every iteration of the application's main loop, 
> if packets (or other items of work) only comes in on one of the 
> lcore's potentially many RX queues (or other input queues, such as 
> eventdev ports). That means a rdtsc for every loop, which isn't too 
> bad, but still might be noticable.
>
> An application that gather items of work from multiple source before 
> actually doing anything breaks this model. For example, consider a 
> lcore worker owning two RX queues, performing rte_eth_rx_burst() on 
> both, before attempt to process any of the received packets. If the 
> last poll is empty, the cycles spent will considered idle, even though 
> they were busy.
>
> A lcore worker might also decide to poll the same RX queue multiple 
> times (until it hits an empty poll, or reaching some high upper 
> bound), before initating processing of the packets.

Yes, more complex applications will need to be modified to gain a more 
fine-grained busyness metric. In order to achieve this level of 
accuracy, application context is required. The 
'RTE_LCORE_POLL_BUSYNESS_TIMESTAMP()' macro can be used within the 
application to mark sections as "busy" or "not busy" to do so. Using 
your example above, the application could keep track of multiple bursts 
(whether they have work or not) and call the macro before initiating the 
processing to signal that there is, in fact, work to be done.

There's a section in the documentation update in this patchset that 
describes it. It might need more work if its not clear :-)


>
> I didn't read your code in detail, so I might be jumping to conclusions.
>
>> In order for all of the above to work without modifications to the
>> application, the library code needs to be instrumented with calls to the
>> lcore telemetry busyness timestamping function. The following parts 
>> of DPDK
>> are instrumented with lcore telemetry calls:
>>
>> - All major driver API's:
>>    - ethdev
>>    - cryptodev
>>    - compressdev
>>    - regexdev
>>    - bbdev
>>    - rawdev
>>    - eventdev
>>    - dmadev
>> - Some additional libraries:
>>    - ring
>>    - distributor
>
> In the past, I've suggested this kind of functionality should go into 
> the service framework instead, with the service function explicitly 
> signaling wheter or not the cycles spent on something useful or not.
>
> That seems to me like a more straight-forward and more accurate 
> solution, but does require the application to deploy everything as 
> services, and also requires a change of the service function signature.
>
>>
>> To avoid performance impact from having lcore telemetry support, a 
>> global
>> variable is exported by EAL, and a call to timestamping function is 
>> wrapped
>> into a macro, so that whenever telemetry is disabled, it only takes one
>
> Use an static inline function if you don't need the additional 
> expressive power of a macro.
>
> I suggest you also mention the performance implications, when this 
> function is enabled.

Sure, I can add a note in the next revision.


>
>> additional branch and no function calls are performed. It is also 
>> possible
>> to disable it at compile time by commenting out RTE_LCORE_BUSYNESS from
>> build config.
>>
>> This patch also adds a telemetry endpoint to report lcore poll 
>> busyness, as
>> well as telemetry endpoints to enable/disable lcore telemetry. A
>> documentation entry has been added to the howto guides to explain the 
>> usage
>> of the new telemetry endpoints and API.
>>
>
> Should there really be a dependency from the EAL to the telemetry 
> library? A cycle. Maybe some dependency inversion would be in order? 
> The telemetry library could instead register an interest in getting 
> busy/idle cycles reports from lcores.
>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
>> Signed-off-by: David Hunt <david.hunt@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
>>
>> ---
>> v3:
>>    * Fix missed renaming to poll busyness
>>    * Fix clang compilation
>>    * Fix arm compilation
>>
>> v2:
>>    * Use rte_get_tsc_hz() to adjust the telemetry period
>>    * Rename to reflect polling busyness vs general busyness
>>    * Fix segfault when calling telemetry timestamp from an unregistered
>>      non-EAL thread.
>>    * Minor cleanup
>> ---
>>   config/meson.build                          |   1 +
>>   config/rte_config.h                         |   1 +
>>   lib/bbdev/rte_bbdev.h                       |  17 +-
>>   lib/compressdev/rte_compressdev.c           |   2 +
>>   lib/cryptodev/rte_cryptodev.h               |   2 +
>>   lib/distributor/rte_distributor.c           |  21 +-
>>   lib/distributor/rte_distributor_single.c    |  14 +-
>>   lib/dmadev/rte_dmadev.h                     |  15 +-
>>   lib/eal/common/eal_common_lcore_telemetry.c | 293 ++++++++++++++++++++
>>   lib/eal/common/meson.build                  |   1 +
>>   lib/eal/include/rte_lcore.h                 |  80 ++++++
>>   lib/eal/meson.build                         |   3 +
>>   lib/eal/version.map                         |   7 +
>>   lib/ethdev/rte_ethdev.h                     |   2 +
>>   lib/eventdev/rte_eventdev.h                 |  10 +-
>>   lib/rawdev/rte_rawdev.c                     |   6 +-
>>   lib/regexdev/rte_regexdev.h                 |   5 +-
>>   lib/ring/rte_ring_elem_pvt.h                |   1 +
>>   meson_options.txt                           |   2 +
>>   19 files changed, 459 insertions(+), 24 deletions(-)
>>   create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c
>>
<snip>
>> diff --git a/lib/eal/common/eal_common_lcore_telemetry.c 
>> b/lib/eal/common/eal_common_lcore_telemetry.c
>> new file mode 100644
>> index 0000000000..bba0afc26d
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_telemetry.c
>> @@ -0,0 +1,293 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2014 Intel Corporation
>> + */
>> +
>> +#include <unistd.h>
>> +#include <limits.h>
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_cycles.h>
>> +#include <rte_errno.h>
>> +#include <rte_lcore.h>
>> +
>> +#ifdef RTE_LCORE_POLL_BUSYNESS
>> +#include <rte_telemetry.h>
>> +#endif
>> +
>> +int __rte_lcore_telemetry_enabled;
>
> Is "telemetry" really the term to use here? Isn't this just another 
> piece of statistics? It can be used for telemetry, or in some other 
> fashion.
>
> (Use bool not int.)

Can rename to '__rte_lcore_stats_enabled' in next revision.


>
>> +
>> +#ifdef RTE_LCORE_POLL_BUSYNESS
>> +
>> +struct lcore_telemetry {
>> +    int poll_busyness;
>> +    /**< Calculated poll busyness (gets set/returned by the API) */
>> +    int raw_poll_busyness;
>> +    /**< Calculated poll busyness times 100. */
>> +    uint64_t interval_ts;
>> +    /**< when previous telemetry interval started */
>> +    uint64_t empty_cycles;
>> +    /**< empty cycle count since last interval */
>> +    uint64_t last_poll_ts;
>> +    /**< last poll timestamp */
>> +    bool last_empty;
>> +    /**< if last poll was empty */
>> +    unsigned int contig_poll_cnt;
>> +    /**< contiguous (always empty/non empty) poll counter */
>> +} __rte_cache_aligned;
>> +
>> +static struct lcore_telemetry *telemetry_data;
>> +
>> +#define LCORE_POLL_BUSYNESS_MAX 100
>> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
>> +#define LCORE_POLL_BUSYNESS_MIN 0
>> +
>> +#define SMOOTH_COEFF 5
>> +#define STATE_CHANGE_OPT 32
>> +
>> +static void lcore_config_init(void)
>> +{
>> +    int lcore_id;
>> +
>> +    telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
>> +    if (telemetry_data == NULL)
>> +        rte_panic("Could not init lcore telemetry data: Out of 
>> memory\n");
>> +
>> +    RTE_LCORE_FOREACH(lcore_id) {
>> +        struct lcore_telemetry *td = &telemetry_data[lcore_id];
>> +
>> +        td->interval_ts = 0;
>> +        td->last_poll_ts = 0;
>> +        td->empty_cycles = 0;
>> +        td->last_empty = true;
>> +        td->contig_poll_cnt = 0;
>> +        td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
>> +        td->raw_poll_busyness = 0;
>> +    }
>> +}
>> +
>> +int rte_lcore_poll_busyness(unsigned int lcore_id)
>> +{
>> +    const uint64_t active_thresh = rte_get_tsc_hz() * 
>> RTE_LCORE_POLL_BUSYNESS_PERIOD_MS;
>> +    struct lcore_telemetry *tdata;
>> +
>> +    if (lcore_id >= RTE_MAX_LCORE)
>> +        return -EINVAL;
>> +    tdata = &telemetry_data[lcore_id];
>> +
>> +    /* if the lcore is not active */
>> +    if (tdata->interval_ts == 0)
>> +        return LCORE_POLL_BUSYNESS_NOT_SET;
>> +    /* if the core hasn't been active in a while */
>> +    else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
>> +        return LCORE_POLL_BUSYNESS_NOT_SET;
>> +
>> +    /* this core is active, report its poll busyness */
>> +    return telemetry_data[lcore_id].poll_busyness;
>> +}
>> +
>> +int rte_lcore_poll_busyness_enabled(void)
>> +{
>> +    return __rte_lcore_telemetry_enabled;
>> +}
>> +
>> +void rte_lcore_poll_busyness_enabled_set(int enable)
>
> Use bool.
>
>> +{
>> +    __rte_lcore_telemetry_enabled = !!enable;
>
> !!Another reason to use bool!! :)
>
> Are you allowed to call this function during operation, you'll need a 
> atomic store here (and an atomic load on the read side).

Ack


>
>> +
>> +    if (!enable)
>> +        lcore_config_init();
>> +}
>> +
>> +static inline int calc_raw_poll_busyness(const struct 
>> lcore_telemetry *tdata,
>> +                    const uint64_t empty, const uint64_t total)
>> +{
>> +    /*
>> +     * we don't want to use floating point math here, but we want 
>> for our poll
>> +     * busyness to react smoothly to sudden changes, while still 
>> keeping the
>> +     * accuracy and making sure that over time the average follows 
>> poll busyness
>> +     * as measured just-in-time. therefore, we will calculate the 
>> average poll
>> +     * busyness using integer math, but shift the decimal point two 
>> places
>> +     * to the right, so that 100.0 becomes 10000. this allows us to 
>> report
>> +     * integer values (0..100) while still allowing ourselves to 
>> follow the
>> +     * just-in-time measurements when we calculate our averages.
>> +     */
>> +    const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
>> +
>
> Why not just store/manage the number of busy (or idle, or both) 
> cycles? Then the user can decide what time period to average over, to 
> what extent the lcore utilization from previous periods should be 
> factored in, etc.

There's an option 'RTE_LCORE_POLL_BUSYNESS_PERIOD_MS' added to 
rte_config.h which would allow the user to define the time period over 
which the utilization should be reported. We only do this calculation if 
that time interval has elapsed.


>
> In DSW, I initially presented only a load statistic (which averaged 
> over 250 us, with some contribution from previous period). I later 
> came to realize that just exposing the number of busy cycles left the 
> calling application much more options. For example, to present the 
> average load during 1 s, you needed to have some control thread 
> sampling the load statistic during that time period, as opposed to 
> when the busy cycles statistic was introduced, it just had to read 
> that value twice (at the beginning of the period, and at the end), and 
> compared it will the amount of wallclock time passed.
>
<snip>

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-29 12:49                       ` Morten Brørup
@ 2022-08-29 13:37                         ` Kevin Laatz
  2022-08-29 13:44                           ` Morten Brørup
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-08-29 13:37 UTC (permalink / raw)
  To: Morten Brørup, Thomas Monjalon, Bruce Richardson
  Cc: Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev

On 29/08/2022 13:49, Morten Brørup wrote:
> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Monday, 29 August 2022 14.37
>> The patches currently include runtime options to enable/disable the feature via API and via telemetry endpoints. We have run performance tests and have failed to measure any performance impact with the feature runtime disabled.
> Lots of features are added to DPDK all the time, and they all use the same "insignificant performance impact" argument. But the fact is, each added test-and-branch has some small performance impact (and consume some entries in the branch prediction table, which may impact performance elsewhere). If you add a million features using this argument, there will be a significant and measurable performance impact.
>
> Which is why I keep insisting on the ability to omit non-core features from DPDK at build time.

I think there's general consensus in having a buildtime option to 
disable it.

Do we agree that it should be buildtime enabled, and runtime disabled by 
default (so just the single additional branch by default), with the 
meson option available to disable it completely at buildtime?


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-29 13:37                         ` Kevin Laatz
@ 2022-08-29 13:44                           ` Morten Brørup
  2022-08-29 14:21                             ` Kevin Laatz
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-08-29 13:44 UTC (permalink / raw)
  To: Kevin Laatz, Thomas Monjalon, Bruce Richardson
  Cc: Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev

> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Monday, 29 August 2022 15.37
> 
> On 29/08/2022 13:49, Morten Brørup wrote:
> > From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> > Sent: Monday, 29 August 2022 14.37
> >> The patches currently include runtime options to enable/disable the
> feature via API and via telemetry endpoints. We have run performance
> tests and have failed to measure any performance impact with the
> feature runtime disabled.
> > Lots of features are added to DPDK all the time, and they all use the
> same "insignificant performance impact" argument. But the fact is, each
> added test-and-branch has some small performance impact (and consume
> some entries in the branch prediction table, which may impact
> performance elsewhere). If you add a million features using this
> argument, there will be a significant and measurable performance
> impact.
> >
> > Which is why I keep insisting on the ability to omit non-core
> features from DPDK at build time.
> 
> I think there's general consensus in having a buildtime option to
> disable it.
> 
> Do we agree that it should be buildtime enabled, and runtime disabled
> by
> default (so just the single additional branch by default), with the
> meson option available to disable it completely at buildtime?

No. This feature is in the fast path, so please follow the "enable_trace_fp" design pattern, which also has fast path trace disabled at build time.

-Morten


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-29 13:44                           ` Morten Brørup
@ 2022-08-29 14:21                             ` Kevin Laatz
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-29 14:21 UTC (permalink / raw)
  To: Morten Brørup, Thomas Monjalon, Bruce Richardson
  Cc: Jerin Jacob, Anatoly Burakov, dpdk-dev, Conor Walsh, David Hunt,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, Konstantin Ananyev

On 29/08/2022 14:44, Morten Brørup wrote:
>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>> Sent: Monday, 29 August 2022 15.37
>>
>> On 29/08/2022 13:49, Morten Brørup wrote:
>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>>> Sent: Monday, 29 August 2022 14.37
>>>> The patches currently include runtime options to enable/disable the
>> feature via API and via telemetry endpoints. We have run performance
>> tests and have failed to measure any performance impact with the
>> feature runtime disabled.
>>> Lots of features are added to DPDK all the time, and they all use the
>> same "insignificant performance impact" argument. But the fact is, each
>> added test-and-branch has some small performance impact (and consume
>> some entries in the branch prediction table, which may impact
>> performance elsewhere). If you add a million features using this
>> argument, there will be a significant and measurable performance
>> impact.
>>> Which is why I keep insisting on the ability to omit non-core
>> features from DPDK at build time.
>>
>> I think there's general consensus in having a buildtime option to
>> disable it.
>>
>> Do we agree that it should be buildtime enabled, and runtime disabled
>> by
>> default (so just the single additional branch by default), with the
>> meson option available to disable it completely at buildtime?
> No. This feature is in the fast path, so please follow the "enable_trace_fp" design pattern, which also has fast path trace disabled at build time.
>
Ok, will make this change for the v4. Thanks!



^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v3 1/3] eal: add lcore poll busyness telemetry
  2022-08-26 22:06     ` Mattias Rönnblom
  2022-08-29  8:23       ` Bruce Richardson
  2022-08-29 13:16       ` Kevin Laatz
@ 2022-08-30 10:26       ` Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-08-30 10:26 UTC (permalink / raw)
  To: Mattias Rönnblom, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On 26/08/2022 23:06, Mattias Rönnblom wrote:
> On 2022-08-25 17:28, Kevin Laatz wrote:
>> From: Anatoly Burakov <anatoly.burakov@intel.com>
<snip>
>>
>> To avoid performance impact from having lcore telemetry support, a 
>> global
>> variable is exported by EAL, and a call to timestamping function is 
>> wrapped
>> into a macro, so that whenever telemetry is disabled, it only takes one
>
> Use an static inline function if you don't need the additional 
> expressive power of a macro.
>
> I suggest you also mention the performance implications, when this 
> function is enabled.

Keeping the performance implications of having the feature enabled in 
mind, I think the expressive power of the macro is beneficial here.


<snip>

>> diff --git a/lib/eal/common/eal_common_lcore_telemetry.c 
>> b/lib/eal/common/eal_common_lcore_telemetry.c
>> new file mode 100644
>> index 0000000000..bba0afc26d
>> --- /dev/null
>> +++ b/lib/eal/common/eal_common_lcore_telemetry.c
>> @@ -0,0 +1,293 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2010-2014 Intel Corporation
>> + */
>> +
>> +#include <unistd.h>
>> +#include <limits.h>
>> +#include <string.h>
>> +
>> +#include <rte_common.h>
>> +#include <rte_cycles.h>
>> +#include <rte_errno.h>
>> +#include <rte_lcore.h>
>> +
>> +#ifdef RTE_LCORE_POLL_BUSYNESS
>> +#include <rte_telemetry.h>
>> +#endif
>> +
>> +int __rte_lcore_telemetry_enabled;
>
> Is "telemetry" really the term to use here? Isn't this just another 
> piece of statistics? It can be used for telemetry, or in some other 
> fashion.
>
> (Use bool not int.)

Will change to bool.

Looking at this again, the telemetry naming is more accurate here since 
'__rte_lcore_telemetry_enabled' is used to enable/disable the telemetry 
endpoints.

-Kevin


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v4 0/3] Add lcore poll busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
                   ` (5 preceding siblings ...)
  2022-08-25 15:28 ` [PATCH v3 " Kevin Laatz
@ 2022-09-01 14:39 ` Kevin Laatz
  2022-09-01 14:39   ` [PATCH v4 1/3] eal: add " Kevin Laatz
                     ` (2 more replies)
  2022-09-02 15:58 ` [PATCH v5 0/3] Add lcore poll busyness telemetry Kevin Laatz
                   ` (2 subsequent siblings)
  9 siblings, 3 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-01 14:39 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Currently, there is no way to measure lcore polling busyness in a passive
way, without any modifications to the application. This patchset adds a new
EAL API that will be able to passively track core polling busyness. As part
of the set, new telemetry endpoints are added to read the generate metrics.

---
v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_TELEMETRY_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missing renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup

Anatoly Burakov (2):
  eal: add lcore poll busyness telemetry
  eal: add cpuset lcore telemetry entries

Kevin Laatz (1):
  doc: add howto guide for lcore poll busyness

 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 doc/guides/howto/index.rst                  |   1 +
 doc/guides/howto/lcore_poll_busyness.rst    |  92 ++++++
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 346 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  90 +++++
 lib/eal/linux/eal.c                         |   3 +
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   8 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 22 files changed, 619 insertions(+), 24 deletions(-)
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v4 1/3] eal: add lcore poll busyness telemetry
  2022-09-01 14:39 ` [PATCH v4 0/3] Add lcore poll busyness telemetry Kevin Laatz
@ 2022-09-01 14:39   ` Kevin Laatz
  2022-09-01 14:39   ` [PATCH v4 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
  2022-09-01 14:39   ` [PATCH v4 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-01 14:39 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, Kevin Laatz, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore poll busyness in a passive way,
without any modifications to the application. This patch adds a new EAL API
that will be able to passively track core polling busyness.

The poll busyness is calculated by relying on the fact that most DPDK API's
will poll for work (packets, completions, eventdev events, etc). Empty
polls can be counted as "idle", while non-empty polls can be counted as
busy. To measure lcore poll busyness, we simply call the telemetry
timestamping function with the number of polls a particular code section
has processed, and count the number of cycles we've spent processing empty
bursts. The more empty bursts we encounter, the less cycles we spend in
"busy" state, and the less core poll busyness will be reported.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore poll busyness timestamping calls:

- All major driver API's:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a global
variable is exported by EAL, and a call to timestamping function is wrapped
into a macro, so that whenever telemetry is disabled, it only takes one
additional branch and no function calls are performed. It is disabled at
compile time by default.

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_TELEMETRY_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missed renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 299 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  90 ++++++
 lib/eal/linux/eal.c                         |   3 +
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   8 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 20 files changed, 479 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@ endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..498702c9c7 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6a98d3f11 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@ extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..fabc495a8e 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..a5b1d7c594 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..428157ec64 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..4c916c0fd2 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) != 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..3e27e0fd2b 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..540ac5eba0
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,299 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+rte_atomic32_t __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+	int poll_busyness;
+	/**< Calculated poll busyness (gets set/returned by the API) */
+	int raw_poll_busyness;
+	/**< Calculated poll busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_poll_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
+	/* if more than 1000 busyness periods have passed, this core is considered inactive */
+	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its poll busyness */
+	return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return (int)rte_atomic32_read(&__rte_lcore_telemetry_enabled);
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable)
+{
+	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_telemetry_enabled,
+			(int)!enable, (int)enable);
+
+	/* Reset counters on successful disable */
+	if (set && !enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * We don't want to use floating point math here, but we want for our poll
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows poll busyness
+	 * as measured just-in-time. Therefore, we will calculate the average poll
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. This allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness to poll busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE) {
+		RTE_LOG(DEBUG, EAL,
+				"Lcore telemetry not supported on unregistered non-EAL thread %d",
+				lcore_id);
+		return;
+	}
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_poll_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_poll_busyness;
+
+		/* get updated poll_busyness value */
+		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_poll_busyness = raw_poll_busyness;
+		/* bring poll busyness back to 0..100 range, biased to round up */
+		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(true);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(false);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+void
+rte_lcore_telemetry_free(void)
+{
+	if (telemetry_data != NULL) {
+		free(telemetry_data);
+		telemetry_data = NULL;
+	}
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	rte_atomic32_set(&__rte_lcore_telemetry_enabled, true);
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
+{
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@ sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..c8e369e9d2 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -16,6 +16,7 @@
 #include <rte_eal.h>
 #include <rte_launch.h>
 #include <rte_thread.h>
+#include <rte_atomic.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -415,6 +416,95 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - true if lcore telemetry is enabled
+ *   - false if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(bool enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore poll busyness timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern rte_atomic32_t __rte_lcore_telemetry_enabled;
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Free memory allocated for lcore telemetry data at init.
+ */
+__rte_experimental
+void
+rte_lcore_telemetry_free(void);
+
+/**
+ * Call lcore poll busyness timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
+	int enabled = (int)rte_atomic32_read(&__rte_lcore_telemetry_enabled);	\
+	if (enabled)								\
+		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
+} while (0)
+#else
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 37d29643a5..4ae4a925b4 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1364,6 +1364,9 @@ rte_eal_cleanup(void)
 	rte_mp_channel_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+#ifdef RTE_LCORE_POLL_BUSYNESS
+	rte_lcore_telemetry_free();
+#endif
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..2fb90d446b 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@ subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core poll busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1f293e768b..454599b6b4 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,14 @@ EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_poll_busyness_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
+	rte_lcore_telemetry_free;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..4c8113f31f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a65b3c7c85 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..1cba53270a 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -16,6 +16,7 @@
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_telemetry.h>
+#include <rte_lcore.h>
 
 #include "rte_rawdev.h"
 #include "rte_rawdev_pmd.h"
@@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..8caaed502f 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..cf2370c238 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..9b20a36fdb 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description:
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v4 2/3] eal: add cpuset lcore telemetry entries
  2022-09-01 14:39 ` [PATCH v4 0/3] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-01 14:39   ` [PATCH v4 1/3] eal: add " Kevin Laatz
@ 2022-09-01 14:39   ` Kevin Laatz
  2022-09-01 14:39   ` [PATCH v4 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-01 14:39 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

From: Anatoly Burakov <anatoly.burakov@intel.com>

Expose per-lcore cpuset information to telemetry.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/common/eal_common_lcore_telemetry.c | 47 +++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
index 540ac5eba0..976a615d4e 100644
--- a/lib/eal/common/eal_common_lcore_telemetry.c
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -19,6 +19,8 @@ rte_atomic32_t __rte_lcore_telemetry_enabled;
 
 #ifdef RTE_LCORE_POLL_BUSYNESS
 
+#include "eal_private.h"
+
 struct lcore_telemetry {
 	int poll_busyness;
 	/**< Calculated poll busyness (gets set/returned by the API) */
@@ -247,6 +249,48 @@ lcore_handle_poll_busyness(const char *cmd __rte_unused,
 	return 0;
 }
 
+static int
+lcore_handle_cpuset(const char *cmd __rte_unused,
+		    const char *params __rte_unused,
+		    struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		const struct lcore_config *cfg = &lcore_config[i];
+		const rte_cpuset_t *cpuset = &cfg->cpuset;
+		struct rte_tel_data *ld;
+		unsigned int cpu;
+
+		if (!rte_lcore_is_enabled(i))
+			continue;
+
+		/* create an array of integers */
+		ld = rte_tel_data_alloc();
+		if (ld == NULL)
+			return -ENOMEM;
+		rte_tel_data_start_array(ld, RTE_TEL_INT_VAL);
+
+		/* add cpu ID's from cpuset to the array */
+		for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+			if (!CPU_ISSET(cpu, cpuset))
+				continue;
+			rte_tel_data_add_array_int(ld, cpu);
+		}
+
+		/* add array to the per-lcore container */
+		snprintf(corenum, sizeof(corenum), "%d", i);
+
+		/* tell telemetry library to free this array automatically */
+		rte_tel_data_add_dict_container(d, corenum, ld, 0);
+	}
+
+	return 0;
+}
+
 void
 rte_lcore_telemetry_free(void)
 {
@@ -274,6 +318,9 @@ RTE_INIT(lcore_init_telemetry)
 
 	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
 				   "disable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/cpuset", lcore_handle_cpuset,
+				   "list physical core affinity for each lcore");
 }
 
 #else
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v4 3/3] doc: add howto guide for lcore poll busyness
  2022-09-01 14:39 ` [PATCH v4 0/3] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-01 14:39   ` [PATCH v4 1/3] eal: add " Kevin Laatz
  2022-09-01 14:39   ` [PATCH v4 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
@ 2022-09-01 14:39   ` Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-01 14:39 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add a new section to the howto guides for using the new lcore poll
busyness telemetry endpoints and describe general usage.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v4:
  * Include note on perf impact when the feature is enabled
  * Add doc to toctree
  * Updates to incorporate changes made earlier in the patchset

v3:
  * Update naming to poll busyness
---
 doc/guides/howto/index.rst               |  1 +
 doc/guides/howto/lcore_poll_busyness.rst | 92 ++++++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst

diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index bf6337d021..0a9060c1d3 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -21,3 +21,4 @@ HowTo Guides
     debug_troubleshoot
     openwrt
     avx512
+    lcore_poll_busyness
diff --git a/doc/guides/howto/lcore_poll_busyness.rst b/doc/guides/howto/lcore_poll_busyness.rst
new file mode 100644
index 0000000000..ebbbd4c44e
--- /dev/null
+++ b/doc/guides/howto/lcore_poll_busyness.rst
@@ -0,0 +1,92 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2022 Intel Corporation.
+
+Lcore Poll Busyness Telemetry
+=============================
+
+The lcore poll busyness telemetry provides a built-in, generic method of gathering
+lcore utilization metrics for running applications. These metrics are exposed
+via a new telemetry endpoint.
+
+Since most DPDK APIs polling based, the poll busyness is calculated based on
+APIs receiving 'work' (packets, completions, events, etc). Empty polls are
+considered as idle, while non-empty polls are considered busy. Using the amount
+of cycles spent processing empty polls, the busyness can be calculated and recorded.
+
+Application Specified Busyness
+------------------------------
+
+Improved accuracy of the reported busyness may need more contextual awareness
+from the application. For example, an application may make a number of calls to
+rx_burst before processing packets. If the last burst was an "empty poll", then
+the processing time of the packets would be falsely considered as "idle", since
+the last burst was empty. The application should track if any of the polls
+contained "work" to do and should mark the 'bulk' as "busy" cycles before
+proceeding to the processesing. This type of awareness is only available within
+the application.
+
+Applications can be modified to incorporate the extra contextual awareness in
+order to improve the reported busyness by marking areas of code as "busy" or
+"idle" appropriately. This can be done by inserting the timestamping macro::
+
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0)    /* to mark section as idle */
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(32)   /* where 32 is nb_pkts to mark section as busy (non-zero is busy) */
+
+All cycles since the last state change (idle to busy, or vice versa) will be
+counted towards the current state's counter.
+
+Consuming the Telemetry
+-----------------------
+
+The telemetry gathered for lcore poll busyness can be read from the `telemetry.py`
+script via the new `/eal/lcore/poll_busyness` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness
+    {"/eal/lcore/poll_busyness": {"12": -1, "13": 85, "14": 84}}
+
+* Cores not collecting poll busyness will report "-1". E.g. control cores or inactive cores.
+* All enabled cores will report their poll busyness in the range 0-100.
+
+Enabling and Disabling Lcore Poll Busyness Telemetry
+----------------------------------------------------
+
+By default, the lcore poll busyness telemetry is disabled at compile time. In
+order to allow DPDK to gather this metric, the ``enable_lcore_poll_busyness``
+meson option must be set to ``true``.
+
+.. note::
+    Enabling lcore poll busyness telemetry may impact performance due to the
+    additional timestamping, potentially per poll depending on the application.
+
+At compile time
+^^^^^^^^^^^^^^^
+
+Support can be enabled/disabled at compile time via the meson option.
+It is disabled by default.::
+
+    $ meson configure -Denable_lcore_poll_busyness=true     #enable
+
+    $ meson configure -Denable_lcore_poll_busyness=false    #disable
+
+At run time
+^^^^^^^^^^^
+
+Support can also be enabled/disabled during runtime (if the meson option is
+enabled at compile time). Disabling at runtime comes at the cost of an additional
+branch, however no additional function calls are performed.
+
+To enable/disable support at runtime, a call can be made to the appropriately
+telemetry endpoint.
+
+Disable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_disable
+    {"/eal/lcore/poll_busyness_disable": {"poll_busyness_enabled": 0}}
+
+Enable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_enable
+    {"/eal/lcore/poll_busyness_enable": {"poll_busyness_enabled": 1}}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v5 0/3] Add lcore poll busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
                   ` (6 preceding siblings ...)
  2022-09-01 14:39 ` [PATCH v4 0/3] Add lcore poll busyness telemetry Kevin Laatz
@ 2022-09-02 15:58 ` Kevin Laatz
  2022-09-02 15:58   ` [PATCH v5 1/3] eal: add " Kevin Laatz
                     ` (2 more replies)
  2022-09-13 13:19 ` [PATCH v6 0/4] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
  9 siblings, 3 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-02 15:58 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Currently, there is no way to measure lcore polling busyness in a passive
way, without any modifications to the application. This patchset adds a new
EAL API that will be able to passively track core polling busyness. As part
of the set, new telemetry endpoints are added to read the generate metrics.

---
v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missing renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup

Anatoly Burakov (2):
  eal: add lcore poll busyness telemetry
  eal: add cpuset lcore telemetry entries

Kevin Laatz (1):
  doc: add howto guide for lcore poll busyness

 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 doc/guides/howto/index.rst                  |   1 +
 doc/guides/howto/lcore_poll_busyness.rst    |  92 +++++
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 350 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  84 +++++
 lib/eal/linux/eal.c                         |   1 +
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 22 files changed, 614 insertions(+), 24 deletions(-)
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v5 1/3] eal: add lcore poll busyness telemetry
  2022-09-02 15:58 ` [PATCH v5 0/3] Add lcore poll busyness telemetry Kevin Laatz
@ 2022-09-02 15:58   ` Kevin Laatz
  2022-09-03 13:33     ` Jerin Jacob
  2022-09-02 15:58   ` [PATCH v5 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
  2022-09-02 15:58   ` [PATCH v5 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
  2 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-09-02 15:58 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, Kevin Laatz, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore poll busyness in a passive way,
without any modifications to the application. This patch adds a new EAL API
that will be able to passively track core polling busyness.

The poll busyness is calculated by relying on the fact that most DPDK API's
will poll for work (packets, completions, eventdev events, etc). Empty
polls can be counted as "idle", while non-empty polls can be counted as
busy. To measure lcore poll busyness, we simply call the telemetry
timestamping function with the number of polls a particular code section
has processed, and count the number of cycles we've spent processing empty
bursts. The more empty bursts we encounter, the less cycles we spend in
"busy" state, and the less core poll busyness will be reported.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore poll busyness timestamping calls:

- All major driver API's:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a global
variable is exported by EAL, and a call to timestamping function is wrapped
into a macro, so that whenever telemetry is disabled, it only takes one
additional branch and no function calls are performed. It is disabled at
compile time by default.

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missed renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 303 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  84 ++++++
 lib/eal/linux/eal.c                         |   1 +
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 20 files changed, 474 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@ endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..498702c9c7 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6a98d3f11 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@ extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..fabc495a8e 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..a5b1d7c594 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..428157ec64 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..4c916c0fd2 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) != 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..3e27e0fd2b 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..abef1ff86d
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+rte_atomic32_t __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+	int poll_busyness;
+	/**< Calculated poll busyness (gets set/returned by the API) */
+	int raw_poll_busyness;
+	/**< Calculated poll busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_poll_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
+	/* if more than 1000 busyness periods have passed, this core is considered inactive */
+	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its poll busyness */
+	return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return rte_atomic32_read(&__rte_lcore_telemetry_enabled);
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable)
+{
+	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_telemetry_enabled,
+			(int)!enable, (int)enable);
+
+	/* Reset counters on successful disable */
+	if (set && !enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * We don't want to use floating point math here, but we want for our poll
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows poll busyness
+	 * as measured just-in-time. Therefore, we will calculate the average poll
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. This allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness to poll busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE) {
+		RTE_LOG(DEBUG, EAL,
+				"Lcore telemetry not supported on unregistered non-EAL thread %d",
+				lcore_id);
+		return;
+	}
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_poll_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_poll_busyness;
+
+		/* get updated poll_busyness value */
+		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_poll_busyness = raw_poll_busyness;
+		/* bring poll busyness back to 0..100 range, biased to round up */
+		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(true);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(false);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+void
+lcore_telemetry_free(void)
+{
+	if (telemetry_data != NULL) {
+		free(telemetry_data);
+		telemetry_data = NULL;
+	}
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+
+	rte_atomic32_set(&__rte_lcore_telemetry_enabled, true);
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
+{
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+void lcore_telemetry_free(void)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@ sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..f77022184e 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -16,6 +16,7 @@
 #include <rte_eal.h>
 #include <rte_launch.h>
 #include <rte_thread.h>
+#include <rte_atomic.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -415,6 +416,89 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - true if lcore telemetry is enabled
+ *   - false if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(bool enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore poll busyness timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern rte_atomic32_t __rte_lcore_telemetry_enabled;
+
+/** @internal free memory allocated for lcore telemetry */
+void
+lcore_telemetry_free(void);
+
+/**
+ * Call lcore poll busyness timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
+	int enabled = (int)rte_atomic32_read(&__rte_lcore_telemetry_enabled);	\
+	if (enabled)								\
+		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
+} while (0)
+#else
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 37d29643a5..cea01d31a9 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1364,6 +1364,7 @@ rte_eal_cleanup(void)
 	rte_mp_channel_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	lcore_telemetry_free();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..2fb90d446b 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@ subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core poll busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1f293e768b..b943ee7d5d 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@ EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_poll_busyness_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..4c8113f31f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a65b3c7c85 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..1cba53270a 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -16,6 +16,7 @@
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_telemetry.h>
+#include <rte_lcore.h>
 
 #include "rte_rawdev.h"
 #include "rte_rawdev_pmd.h"
@@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..8caaed502f 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..cf2370c238 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..9b20a36fdb 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description:
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v5 2/3] eal: add cpuset lcore telemetry entries
  2022-09-02 15:58 ` [PATCH v5 0/3] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-02 15:58   ` [PATCH v5 1/3] eal: add " Kevin Laatz
@ 2022-09-02 15:58   ` Kevin Laatz
  2022-09-02 15:58   ` [PATCH v5 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-02 15:58 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

From: Anatoly Burakov <anatoly.burakov@intel.com>

Expose per-lcore cpuset information to telemetry.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/common/eal_common_lcore_telemetry.c | 47 +++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
index abef1ff86d..796a4a6a73 100644
--- a/lib/eal/common/eal_common_lcore_telemetry.c
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -19,6 +19,8 @@ rte_atomic32_t __rte_lcore_telemetry_enabled;
 
 #ifdef RTE_LCORE_POLL_BUSYNESS
 
+#include "eal_private.h"
+
 struct lcore_telemetry {
 	int poll_busyness;
 	/**< Calculated poll busyness (gets set/returned by the API) */
@@ -247,6 +249,48 @@ lcore_handle_poll_busyness(const char *cmd __rte_unused,
 	return 0;
 }
 
+static int
+lcore_handle_cpuset(const char *cmd __rte_unused,
+		    const char *params __rte_unused,
+		    struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		const struct lcore_config *cfg = &lcore_config[i];
+		const rte_cpuset_t *cpuset = &cfg->cpuset;
+		struct rte_tel_data *ld;
+		unsigned int cpu;
+
+		if (!rte_lcore_is_enabled(i))
+			continue;
+
+		/* create an array of integers */
+		ld = rte_tel_data_alloc();
+		if (ld == NULL)
+			return -ENOMEM;
+		rte_tel_data_start_array(ld, RTE_TEL_INT_VAL);
+
+		/* add cpu ID's from cpuset to the array */
+		for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+			if (!CPU_ISSET(cpu, cpuset))
+				continue;
+			rte_tel_data_add_array_int(ld, cpu);
+		}
+
+		/* add array to the per-lcore container */
+		snprintf(corenum, sizeof(corenum), "%d", i);
+
+		/* tell telemetry library to free this array automatically */
+		rte_tel_data_add_dict_container(d, corenum, ld, 0);
+	}
+
+	return 0;
+}
+
 void
 lcore_telemetry_free(void)
 {
@@ -273,6 +317,9 @@ RTE_INIT(lcore_init_telemetry)
 	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
 				   "disable lcore poll busyness measurement");
 
+	rte_telemetry_register_cmd("/eal/lcore/cpuset", lcore_handle_cpuset,
+				   "list physical core affinity for each lcore");
+
 	rte_atomic32_set(&__rte_lcore_telemetry_enabled, true);
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v5 3/3] doc: add howto guide for lcore poll busyness
  2022-09-02 15:58 ` [PATCH v5 0/3] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-02 15:58   ` [PATCH v5 1/3] eal: add " Kevin Laatz
  2022-09-02 15:58   ` [PATCH v5 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
@ 2022-09-02 15:58   ` Kevin Laatz
  2 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-02 15:58 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add a new section to the howto guides for using the new lcore poll
busyness telemetry endpoints and describe general usage.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v4:
  * Include note on perf impact when the feature is enabled
  * Add doc to toctree
  * Updates to incorporate changes made earlier in the patchset

v3:
  * Update naming to poll busyness
---
 doc/guides/howto/index.rst               |  1 +
 doc/guides/howto/lcore_poll_busyness.rst | 92 ++++++++++++++++++++++++
 2 files changed, 93 insertions(+)
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst

diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index bf6337d021..0a9060c1d3 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -21,3 +21,4 @@ HowTo Guides
     debug_troubleshoot
     openwrt
     avx512
+    lcore_poll_busyness
diff --git a/doc/guides/howto/lcore_poll_busyness.rst b/doc/guides/howto/lcore_poll_busyness.rst
new file mode 100644
index 0000000000..ebbbd4c44e
--- /dev/null
+++ b/doc/guides/howto/lcore_poll_busyness.rst
@@ -0,0 +1,92 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2022 Intel Corporation.
+
+Lcore Poll Busyness Telemetry
+=============================
+
+The lcore poll busyness telemetry provides a built-in, generic method of gathering
+lcore utilization metrics for running applications. These metrics are exposed
+via a new telemetry endpoint.
+
+Since most DPDK APIs polling based, the poll busyness is calculated based on
+APIs receiving 'work' (packets, completions, events, etc). Empty polls are
+considered as idle, while non-empty polls are considered busy. Using the amount
+of cycles spent processing empty polls, the busyness can be calculated and recorded.
+
+Application Specified Busyness
+------------------------------
+
+Improved accuracy of the reported busyness may need more contextual awareness
+from the application. For example, an application may make a number of calls to
+rx_burst before processing packets. If the last burst was an "empty poll", then
+the processing time of the packets would be falsely considered as "idle", since
+the last burst was empty. The application should track if any of the polls
+contained "work" to do and should mark the 'bulk' as "busy" cycles before
+proceeding to the processesing. This type of awareness is only available within
+the application.
+
+Applications can be modified to incorporate the extra contextual awareness in
+order to improve the reported busyness by marking areas of code as "busy" or
+"idle" appropriately. This can be done by inserting the timestamping macro::
+
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0)    /* to mark section as idle */
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(32)   /* where 32 is nb_pkts to mark section as busy (non-zero is busy) */
+
+All cycles since the last state change (idle to busy, or vice versa) will be
+counted towards the current state's counter.
+
+Consuming the Telemetry
+-----------------------
+
+The telemetry gathered for lcore poll busyness can be read from the `telemetry.py`
+script via the new `/eal/lcore/poll_busyness` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness
+    {"/eal/lcore/poll_busyness": {"12": -1, "13": 85, "14": 84}}
+
+* Cores not collecting poll busyness will report "-1". E.g. control cores or inactive cores.
+* All enabled cores will report their poll busyness in the range 0-100.
+
+Enabling and Disabling Lcore Poll Busyness Telemetry
+----------------------------------------------------
+
+By default, the lcore poll busyness telemetry is disabled at compile time. In
+order to allow DPDK to gather this metric, the ``enable_lcore_poll_busyness``
+meson option must be set to ``true``.
+
+.. note::
+    Enabling lcore poll busyness telemetry may impact performance due to the
+    additional timestamping, potentially per poll depending on the application.
+
+At compile time
+^^^^^^^^^^^^^^^
+
+Support can be enabled/disabled at compile time via the meson option.
+It is disabled by default.::
+
+    $ meson configure -Denable_lcore_poll_busyness=true     #enable
+
+    $ meson configure -Denable_lcore_poll_busyness=false    #disable
+
+At run time
+^^^^^^^^^^^
+
+Support can also be enabled/disabled during runtime (if the meson option is
+enabled at compile time). Disabling at runtime comes at the cost of an additional
+branch, however no additional function calls are performed.
+
+To enable/disable support at runtime, a call can be made to the appropriately
+telemetry endpoint.
+
+Disable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_disable
+    {"/eal/lcore/poll_busyness_disable": {"poll_busyness_enabled": 0}}
+
+Enable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_enable
+    {"/eal/lcore/poll_busyness_enable": {"poll_busyness_enabled": 1}}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v5 1/3] eal: add lcore poll busyness telemetry
  2022-09-02 15:58   ` [PATCH v5 1/3] eal: add " Kevin Laatz
@ 2022-09-03 13:33     ` Jerin Jacob
  2022-09-06  9:37       ` Kevin Laatz
  0 siblings, 1 reply; 87+ messages in thread
From: Jerin Jacob @ 2022-09-03 13:33 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: dpdk-dev, Anatoly Burakov, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

On Fri, Sep 2, 2022 at 9:26 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
>
> From: Anatoly Burakov <anatoly.burakov@intel.com>
>
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.
>
> The poll busyness is calculated by relying on the fact that most DPDK API's
> will poll for work (packets, completions, eventdev events, etc). Empty
> polls can be counted as "idle", while non-empty polls can be counted as
> busy. To measure lcore poll busyness, we simply call the telemetry
> timestamping function with the number of polls a particular code section
> has processed, and count the number of cycles we've spent processing empty
> bursts. The more empty bursts we encounter, the less cycles we spend in
> "busy" state, and the less core poll busyness will be reported.
>
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore poll busyness timestamping calls:
>
> - All major driver API's:
>   - ethdev
>   - cryptodev
>   - compressdev
>   - regexdev
>   - bbdev
>   - rawdev
>   - eventdev
>   - dmadev
> - Some additional libraries:
>   - ring
>   - distributor
>
> To avoid performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and a call to timestamping function is wrapped
> into a macro, so that whenever telemetry is disabled, it only takes one
> additional branch and no function calls are performed. It is disabled at
> compile time by default.
>
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.
>
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

This version looks good to me. Thanks for this new feature.

I think, we need to add a UT for this new rte_lcore_poll_* APIs also
add a performance test case to measure the cycles for
RTE_LCORE_POLL_BUSYNESS_TIMESTAMP [1]

[1]
Reference performance test application for trace: app/test/test_trace_perf.c

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v5 1/3] eal: add lcore poll busyness telemetry
  2022-09-03 13:33     ` Jerin Jacob
@ 2022-09-06  9:37       ` Kevin Laatz
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-06  9:37 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dpdk-dev, Anatoly Burakov, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

On 03/09/2022 14:33, Jerin Jacob wrote:
> On Fri, Sep 2, 2022 at 9:26 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
>> From: Anatoly Burakov <anatoly.burakov@intel.com>
>>
>> Currently, there is no way to measure lcore poll busyness in a passive way,
>> without any modifications to the application. This patch adds a new EAL API
>> that will be able to passively track core polling busyness.
>>
>> The poll busyness is calculated by relying on the fact that most DPDK API's
>> will poll for work (packets, completions, eventdev events, etc). Empty
>> polls can be counted as "idle", while non-empty polls can be counted as
>> busy. To measure lcore poll busyness, we simply call the telemetry
>> timestamping function with the number of polls a particular code section
>> has processed, and count the number of cycles we've spent processing empty
>> bursts. The more empty bursts we encounter, the less cycles we spend in
>> "busy" state, and the less core poll busyness will be reported.
>>
>> In order for all of the above to work without modifications to the
>> application, the library code needs to be instrumented with calls to the
>> lcore telemetry busyness timestamping function. The following parts of DPDK
>> are instrumented with lcore poll busyness timestamping calls:
>>
>> - All major driver API's:
>>    - ethdev
>>    - cryptodev
>>    - compressdev
>>    - regexdev
>>    - bbdev
>>    - rawdev
>>    - eventdev
>>    - dmadev
>> - Some additional libraries:
>>    - ring
>>    - distributor
>>
>> To avoid performance impact from having lcore telemetry support, a global
>> variable is exported by EAL, and a call to timestamping function is wrapped
>> into a macro, so that whenever telemetry is disabled, it only takes one
>> additional branch and no function calls are performed. It is disabled at
>> compile time by default.
>>
>> This patch also adds a telemetry endpoint to report lcore poll busyness, as
>> well as telemetry endpoints to enable/disable lcore telemetry. A
>> documentation entry has been added to the howto guides to explain the usage
>> of the new telemetry endpoints and API.
>>
>> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
>> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
>> Signed-off-by: David Hunt <david.hunt@intel.com>
>> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> This version looks good to me. Thanks for this new feature.
>
> I think, we need to add a UT for this new rte_lcore_poll_* APIs also
> add a performance test case to measure the cycles for
> RTE_LCORE_POLL_BUSYNESS_TIMESTAMP [1]
>
> [1]
> Reference performance test application for trace: app/test/test_trace_perf.c

Thanks for reviewing, Jerin.

I'll look into adding a UT, thanks!


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v6 0/4] Add lcore poll busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
                   ` (7 preceding siblings ...)
  2022-09-02 15:58 ` [PATCH v5 0/3] Add lcore poll busyness telemetry Kevin Laatz
@ 2022-09-13 13:19 ` Kevin Laatz
  2022-09-13 13:19   ` [PATCH v6 1/4] eal: add " Kevin Laatz
                     ` (3 more replies)
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
  9 siblings, 4 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-13 13:19 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Currently, there is no way to measure lcore polling busyness in a passive
way, without any modifications to the application. This patchset adds a new
EAL API that will be able to passively track core polling busyness. As part
of the set, new telemetry endpoints are added to read the generate metrics.

---
v6:
  * Add API and perf unit tests

v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missing renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup

Anatoly Burakov (2):
  eal: add lcore poll busyness telemetry
  eal: add cpuset lcore telemetry entries

Kevin Laatz (2):
  app/test: add unit tests for lcore poll busyness
  doc: add howto guide for lcore poll busyness

 app/test/meson.build                        |   4 +
 app/test/test_lcore_poll_busyness_api.c     | 134 ++++++++
 app/test/test_lcore_poll_busyness_perf.c    |  72 ++++
 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 doc/guides/howto/index.rst                  |   1 +
 doc/guides/howto/lcore_poll_busyness.rst    |  93 ++++++
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 350 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  84 +++++
 lib/eal/linux/eal.c                         |   1 +
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 25 files changed, 825 insertions(+), 24 deletions(-)
 create mode 100644 app/test/test_lcore_poll_busyness_api.c
 create mode 100644 app/test/test_lcore_poll_busyness_perf.c
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v6 1/4] eal: add lcore poll busyness telemetry
  2022-09-13 13:19 ` [PATCH v6 0/4] Add lcore poll busyness telemetry Kevin Laatz
@ 2022-09-13 13:19   ` Kevin Laatz
  2022-09-13 13:48     ` Morten Brørup
  2022-09-13 13:19   ` [PATCH v6 2/4] eal: add cpuset lcore telemetry entries Kevin Laatz
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-09-13 13:19 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, Kevin Laatz, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore poll busyness in a passive way,
without any modifications to the application. This patch adds a new EAL API
that will be able to passively track core polling busyness.

The poll busyness is calculated by relying on the fact that most DPDK API's
will poll for work (packets, completions, eventdev events, etc). Empty
polls can be counted as "idle", while non-empty polls can be counted as
busy. To measure lcore poll busyness, we simply call the telemetry
timestamping function with the number of polls a particular code section
has processed, and count the number of cycles we've spent processing empty
bursts. The more empty bursts we encounter, the less cycles we spend in
"busy" state, and the less core poll busyness will be reported.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore poll busyness timestamping calls:

- All major driver API's:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a global
variable is exported by EAL, and a call to timestamping function is wrapped
into a macro, so that whenever telemetry is disabled, it only takes one
additional branch and no function calls are performed. It is disabled at
compile time by default.

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missed renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                          |   1 +
 config/rte_config.h                         |   1 +
 lib/bbdev/rte_bbdev.h                       |  17 +-
 lib/compressdev/rte_compressdev.c           |   2 +
 lib/cryptodev/rte_cryptodev.h               |   2 +
 lib/distributor/rte_distributor.c           |  21 +-
 lib/distributor/rte_distributor_single.c    |  14 +-
 lib/dmadev/rte_dmadev.h                     |  15 +-
 lib/eal/common/eal_common_lcore_telemetry.c | 303 ++++++++++++++++++++
 lib/eal/common/meson.build                  |   1 +
 lib/eal/include/rte_lcore.h                 |  84 ++++++
 lib/eal/linux/eal.c                         |   1 +
 lib/eal/meson.build                         |   3 +
 lib/eal/version.map                         |   7 +
 lib/ethdev/rte_ethdev.h                     |   2 +
 lib/eventdev/rte_eventdev.h                 |  10 +-
 lib/rawdev/rte_rawdev.c                     |   6 +-
 lib/regexdev/rte_regexdev.h                 |   5 +-
 lib/ring/rte_ring_elem_pvt.h                |   1 +
 meson_options.txt                           |   2 +
 20 files changed, 474 insertions(+), 24 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_telemetry.c

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@ endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index 46549cb062..498702c9c7 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6a98d3f11 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@ extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..fabc495a8e 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..a5b1d7c594 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..428157ec64 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..4c916c0fd2 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) != 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..3e27e0fd2b 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
new file mode 100644
index 0000000000..abef1ff86d
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+rte_atomic32_t __rte_lcore_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_telemetry {
+	int poll_busyness;
+	/**< Calculated poll busyness (gets set/returned by the API) */
+	int raw_poll_busyness;
+	/**< Calculated poll busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_poll_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
+	/* if more than 1000 busyness periods have passed, this core is considered inactive */
+	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
+	struct lcore_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its poll busyness */
+	return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return rte_atomic32_read(&__rte_lcore_telemetry_enabled);
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable)
+{
+	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_telemetry_enabled,
+			(int)!enable, (int)enable);
+
+	/* Reset counters on successful disable */
+	if (set && !enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * We don't want to use floating point math here, but we want for our poll
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows poll busyness
+	 * as measured just-in-time. Therefore, we will calculate the average poll
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. This allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness to poll busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE) {
+		RTE_LOG(DEBUG, EAL,
+				"Lcore telemetry not supported on unregistered non-EAL thread %d",
+				lcore_id);
+		return;
+	}
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_poll_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_poll_busyness;
+
+		/* get updated poll_busyness value */
+		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_poll_busyness = raw_poll_busyness;
+		/* bring poll busyness back to 0..100 range, biased to round up */
+		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(true);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(false);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+void
+lcore_telemetry_free(void)
+{
+	if (telemetry_data != NULL) {
+		free(telemetry_data);
+		telemetry_data = NULL;
+	}
+}
+
+RTE_INIT(lcore_init_telemetry)
+{
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+
+	rte_atomic32_set(&__rte_lcore_telemetry_enabled, true);
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
+{
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+void lcore_telemetry_free(void)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..a743e66a7d 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@ sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..f77022184e 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -16,6 +16,7 @@
 #include <rte_eal.h>
 #include <rte_launch.h>
 #include <rte_thread.h>
+#include <rte_atomic.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -415,6 +416,89 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - true if lcore telemetry is enabled
+ *   - false if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(bool enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore poll busyness timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern rte_atomic32_t __rte_lcore_telemetry_enabled;
+
+/** @internal free memory allocated for lcore telemetry */
+void
+lcore_telemetry_free(void);
+
+/**
+ * Call lcore poll busyness timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
+	int enabled = (int)rte_atomic32_read(&__rte_lcore_telemetry_enabled);	\
+	if (enabled)								\
+		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
+} while (0)
+#else
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 37d29643a5..cea01d31a9 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1364,6 +1364,7 @@ rte_eal_cleanup(void)
 	rte_mp_channel_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	lcore_telemetry_free();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..2fb90d446b 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@ subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core poll busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1f293e768b..b943ee7d5d 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@ EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_poll_busyness_timestamp;
+	__rte_lcore_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..4c8113f31f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a65b3c7c85 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..1cba53270a 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -16,6 +16,7 @@
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_telemetry.h>
+#include <rte_lcore.h>
 
 #include "rte_rawdev.h"
 #include "rte_rawdev_pmd.h"
@@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..8caaed502f 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..cf2370c238 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..9b20a36fdb 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description:
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v6 2/4] eal: add cpuset lcore telemetry entries
  2022-09-13 13:19 ` [PATCH v6 0/4] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-13 13:19   ` [PATCH v6 1/4] eal: add " Kevin Laatz
@ 2022-09-13 13:19   ` Kevin Laatz
  2022-09-13 13:19   ` [PATCH v6 3/4] app/test: add unit tests for lcore poll busyness Kevin Laatz
  2022-09-13 13:19   ` [PATCH v6 4/4] doc: add howto guide " Kevin Laatz
  3 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-13 13:19 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

From: Anatoly Burakov <anatoly.burakov@intel.com>

Expose per-lcore cpuset information to telemetry.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 lib/eal/common/eal_common_lcore_telemetry.c | 47 +++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/lib/eal/common/eal_common_lcore_telemetry.c b/lib/eal/common/eal_common_lcore_telemetry.c
index abef1ff86d..796a4a6a73 100644
--- a/lib/eal/common/eal_common_lcore_telemetry.c
+++ b/lib/eal/common/eal_common_lcore_telemetry.c
@@ -19,6 +19,8 @@ rte_atomic32_t __rte_lcore_telemetry_enabled;
 
 #ifdef RTE_LCORE_POLL_BUSYNESS
 
+#include "eal_private.h"
+
 struct lcore_telemetry {
 	int poll_busyness;
 	/**< Calculated poll busyness (gets set/returned by the API) */
@@ -247,6 +249,48 @@ lcore_handle_poll_busyness(const char *cmd __rte_unused,
 	return 0;
 }
 
+static int
+lcore_handle_cpuset(const char *cmd __rte_unused,
+		    const char *params __rte_unused,
+		    struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		const struct lcore_config *cfg = &lcore_config[i];
+		const rte_cpuset_t *cpuset = &cfg->cpuset;
+		struct rte_tel_data *ld;
+		unsigned int cpu;
+
+		if (!rte_lcore_is_enabled(i))
+			continue;
+
+		/* create an array of integers */
+		ld = rte_tel_data_alloc();
+		if (ld == NULL)
+			return -ENOMEM;
+		rte_tel_data_start_array(ld, RTE_TEL_INT_VAL);
+
+		/* add cpu ID's from cpuset to the array */
+		for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+			if (!CPU_ISSET(cpu, cpuset))
+				continue;
+			rte_tel_data_add_array_int(ld, cpu);
+		}
+
+		/* add array to the per-lcore container */
+		snprintf(corenum, sizeof(corenum), "%d", i);
+
+		/* tell telemetry library to free this array automatically */
+		rte_tel_data_add_dict_container(d, corenum, ld, 0);
+	}
+
+	return 0;
+}
+
 void
 lcore_telemetry_free(void)
 {
@@ -273,6 +317,9 @@ RTE_INIT(lcore_init_telemetry)
 	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
 				   "disable lcore poll busyness measurement");
 
+	rte_telemetry_register_cmd("/eal/lcore/cpuset", lcore_handle_cpuset,
+				   "list physical core affinity for each lcore");
+
 	rte_atomic32_set(&__rte_lcore_telemetry_enabled, true);
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v6 3/4] app/test: add unit tests for lcore poll busyness
  2022-09-13 13:19 ` [PATCH v6 0/4] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-13 13:19   ` [PATCH v6 1/4] eal: add " Kevin Laatz
  2022-09-13 13:19   ` [PATCH v6 2/4] eal: add cpuset lcore telemetry entries Kevin Laatz
@ 2022-09-13 13:19   ` Kevin Laatz
  2022-09-13 13:19   ` [PATCH v6 4/4] doc: add howto guide " Kevin Laatz
  3 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-13 13:19 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add API unit tests and perf unit tests for the newly added lcore poll
busyness feature.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
 app/test/meson.build                     |   4 +
 app/test/test_lcore_poll_busyness_api.c  | 134 +++++++++++++++++++++++
 app/test/test_lcore_poll_busyness_perf.c |  72 ++++++++++++
 3 files changed, 210 insertions(+)
 create mode 100644 app/test/test_lcore_poll_busyness_api.c
 create mode 100644 app/test/test_lcore_poll_busyness_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 431c5bd318..4a56ca7b5a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -74,6 +74,8 @@ test_sources = files(
         'test_ipsec_perf.c',
         'test_kni.c',
         'test_kvargs.c',
+        'test_lcore_poll_busyness_api.c',
+        'test_lcore_poll_busyness_perf.c',
         'test_lcores.c',
         'test_logs.c',
         'test_lpm.c',
@@ -192,6 +194,7 @@ fast_tests = [
         ['interrupt_autotest', true, true],
         ['ipfrag_autotest', false, true],
         ['lcores_autotest', true, true],
+        ['lcore_poll_busyness_autotest', true, true],
         ['logs_autotest', true, true],
         ['lpm_autotest', true, true],
         ['lpm6_autotest', true, true],
@@ -292,6 +295,7 @@ perf_test_names = [
         'trace_perf_autotest',
         'ipsec_perf_autotest',
         'thash_perf_autotest',
+        'lcore_poll_busyness_perf_autotest'
 ]
 
 driver_test_names = [
diff --git a/app/test/test_lcore_poll_busyness_api.c b/app/test/test_lcore_poll_busyness_api.c
new file mode 100644
index 0000000000..db76322994
--- /dev/null
+++ b/app/test/test_lcore_poll_busyness_api.c
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+
+#include "test.h"
+
+/* Arbitrary amount of "work" to simulate busyness with */
+#define WORK		32
+#define TIMESTAMP_ITERS	1000000
+
+#define LCORE_POLL_BUSYNESS_NOT_SET	-1
+
+static int
+test_lcore_poll_busyness_enable_disable(void)
+{
+	int initial_state, curr_state;
+	bool req_state;
+
+	/* Get the initial state */
+	initial_state = rte_lcore_poll_busyness_enabled();
+	if (initial_state == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	/* Set state to the inverse of the initial state and check for the change */
+	req_state = !initial_state;
+	rte_lcore_poll_busyness_enabled_set(req_state);
+	curr_state = rte_lcore_poll_busyness_enabled();
+	if (curr_state != req_state)
+		return TEST_FAILED;
+
+	/* Now change the state back to the original state. By changing it back, both
+	 * enable and disable will have been tested.
+	 */
+	req_state = !curr_state;
+	rte_lcore_poll_busyness_enabled_set(req_state);
+	curr_state = rte_lcore_poll_busyness_enabled();
+	if (curr_state != req_state)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_lcore_poll_busyness_invalid_lcore(void)
+{
+	int ret;
+
+	/* Check if lcore poll busyness is enabled */
+	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	/* Only lcore_id <= RTE_MAX_LCORE are valid */
+	ret = rte_lcore_poll_busyness(RTE_MAX_LCORE);
+	if (ret != -EINVAL)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_lcore_poll_busyness_inactive_lcore(void)
+{
+	int ret;
+
+	/* Check if lcore poll busyness is enabled */
+	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	/* Use the test thread lcore_id for this test. Since it is not a polling
+	 * application, the busyness is expected to return -1.
+	 *
+	 * Note: this will not work with affinitized cores
+	 */
+	ret = rte_lcore_poll_busyness(rte_lcore_id());
+	if (ret != LCORE_POLL_BUSYNESS_NOT_SET)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static void
+simulate_lcore_poll_busyness(int iters)
+{
+	int i;
+
+	for (i = 0; i < iters; i++)
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
+}
+
+/* The test cannot know of an application running to test for valid lcore poll
+ * busyness data. For this test, we simulate lcore poll busyness for the
+ * lcore_id of the test thread for testing purposes.
+ */
+static int
+test_lcore_poll_busyness_active_lcore(void)
+{
+	int ret;
+
+	/* Check if lcore poll busyness is enabled */
+	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	simulate_lcore_poll_busyness(TIMESTAMP_ITERS);
+
+	/* After timestamping with "work" many times, lcore poll busyness should be > 0 */
+	ret = rte_lcore_poll_busyness(rte_lcore_id());
+	if (ret <= 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_poll_busyness_tests = {
+	.suite_name = "lcore poll busyness autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_poll_busyness_enable_disable),
+		TEST_CASE(test_lcore_poll_busyness_invalid_lcore),
+		TEST_CASE(test_lcore_poll_busyness_inactive_lcore),
+		TEST_CASE(test_lcore_poll_busyness_active_lcore),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_lcore_poll_busyness_api(void)
+{
+	return unit_test_suite_runner(&lcore_poll_busyness_tests);
+}
+
+REGISTER_TEST_COMMAND(lcore_poll_busyness_autotest, test_lcore_poll_busyness_api);
diff --git a/app/test/test_lcore_poll_busyness_perf.c b/app/test/test_lcore_poll_busyness_perf.c
new file mode 100644
index 0000000000..5c27d21b00
--- /dev/null
+++ b/app/test/test_lcore_poll_busyness_perf.c
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+
+#include "test.h"
+
+/* Arbitrary amount of "work" to simulate busyness with */
+#define WORK		32
+#define TIMESTAMP_ITERS	1000000
+#define TEST_ITERS	10000
+
+static void
+simulate_lcore_poll_busyness(int iters)
+{
+	int i;
+
+	for (i = 0; i < iters; i++)
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
+}
+
+static void
+test_timestamp_perf(void)
+{
+	uint64_t start, end, diff;
+	uint64_t min = UINT64_MAX;
+	uint64_t max = 0;
+	uint64_t total = 0;
+	int i;
+
+	for (i = 0; i < TEST_ITERS; i++) {
+		start = rte_rdtsc();
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
+		end = rte_rdtsc();
+
+		diff = end - start;
+		min = RTE_MIN(diff, min);
+		max = RTE_MAX(diff, max);
+		total += diff;
+	}
+
+	printf("### Timestamp perf ###\n");
+	printf("Min cycles: %"PRIu64"\n", min);
+	printf("Avg cycles: %"PRIu64"\n", total / TEST_ITERS);
+	printf("Max cycles: %"PRIu64"\n", max);
+	printf("\n");
+}
+
+
+static int
+test_lcore_poll_busyness_perf(void)
+{
+	if (rte_lcore_poll_busyness_enabled()  == -ENOTSUP) {
+		printf("Lcore poll busyness may be disabled...\n");
+		return TEST_SKIPPED;
+	}
+
+	/* Initialize and prime the timestamp struct with simulated "work" for this lcore */
+	simulate_lcore_poll_busyness(10000);
+
+	/* Run perf tests */
+	test_timestamp_perf();
+
+	return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(lcore_poll_busyness_perf_autotest, test_lcore_poll_busyness_perf);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v6 4/4] doc: add howto guide for lcore poll busyness
  2022-09-13 13:19 ` [PATCH v6 0/4] Add lcore poll busyness telemetry Kevin Laatz
                     ` (2 preceding siblings ...)
  2022-09-13 13:19   ` [PATCH v6 3/4] app/test: add unit tests for lcore poll busyness Kevin Laatz
@ 2022-09-13 13:19   ` Kevin Laatz
  3 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-13 13:19 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add a new section to the howto guides for using the new lcore poll
busyness telemetry endpoints and describe general usage.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v6:
  * Add mention of perf autotest in note mentioning perf impact.

v4:
  * Include note on perf impact when the feature is enabled
  * Add doc to toctree
  * Updates to incorporate changes made earlier in the patchset

v3:
  * Update naming to poll busyness
---
 doc/guides/howto/index.rst               |  1 +
 doc/guides/howto/lcore_poll_busyness.rst | 93 ++++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst

diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index bf6337d021..0a9060c1d3 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -21,3 +21,4 @@ HowTo Guides
     debug_troubleshoot
     openwrt
     avx512
+    lcore_poll_busyness
diff --git a/doc/guides/howto/lcore_poll_busyness.rst b/doc/guides/howto/lcore_poll_busyness.rst
new file mode 100644
index 0000000000..be5ea2a85d
--- /dev/null
+++ b/doc/guides/howto/lcore_poll_busyness.rst
@@ -0,0 +1,93 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2022 Intel Corporation.
+
+Lcore Poll Busyness Telemetry
+=============================
+
+The lcore poll busyness telemetry provides a built-in, generic method of gathering
+lcore utilization metrics for running applications. These metrics are exposed
+via a new telemetry endpoint.
+
+Since most DPDK APIs polling based, the poll busyness is calculated based on
+APIs receiving 'work' (packets, completions, events, etc). Empty polls are
+considered as idle, while non-empty polls are considered busy. Using the amount
+of cycles spent processing empty polls, the busyness can be calculated and recorded.
+
+Application Specified Busyness
+------------------------------
+
+Improved accuracy of the reported busyness may need more contextual awareness
+from the application. For example, an application may make a number of calls to
+rx_burst before processing packets. If the last burst was an "empty poll", then
+the processing time of the packets would be falsely considered as "idle", since
+the last burst was empty. The application should track if any of the polls
+contained "work" to do and should mark the 'bulk' as "busy" cycles before
+proceeding to the processesing. This type of awareness is only available within
+the application.
+
+Applications can be modified to incorporate the extra contextual awareness in
+order to improve the reported busyness by marking areas of code as "busy" or
+"idle" appropriately. This can be done by inserting the timestamping macro::
+
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0)    /* to mark section as idle */
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(32)   /* where 32 is nb_pkts to mark section as busy (non-zero is busy) */
+
+All cycles since the last state change (idle to busy, or vice versa) will be
+counted towards the current state's counter.
+
+Consuming the Telemetry
+-----------------------
+
+The telemetry gathered for lcore poll busyness can be read from the `telemetry.py`
+script via the new `/eal/lcore/poll_busyness` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness
+    {"/eal/lcore/poll_busyness": {"12": -1, "13": 85, "14": 84}}
+
+* Cores not collecting poll busyness will report "-1". E.g. control cores or inactive cores.
+* All enabled cores will report their poll busyness in the range 0-100.
+
+Enabling and Disabling Lcore Poll Busyness Telemetry
+----------------------------------------------------
+
+By default, the lcore poll busyness telemetry is disabled at compile time. In
+order to allow DPDK to gather this metric, the ``enable_lcore_poll_busyness``
+meson option must be set to ``true``.
+
+.. note::
+    Enabling lcore poll busyness telemetry may impact performance due to the
+    additional timestamping, potentially per poll depending on the application.
+    This can be measured with the `lcore_poll_busyness_perf_autotest`.
+
+At compile time
+^^^^^^^^^^^^^^^
+
+Support can be enabled/disabled at compile time via the meson option.
+It is disabled by default.::
+
+    $ meson configure -Denable_lcore_poll_busyness=true     #enable
+
+    $ meson configure -Denable_lcore_poll_busyness=false    #disable
+
+At run time
+^^^^^^^^^^^
+
+Support can also be enabled/disabled during runtime (if the meson option is
+enabled at compile time). Disabling at runtime comes at the cost of an additional
+branch, however no additional function calls are performed.
+
+To enable/disable support at runtime, a call can be made to the appropriately
+telemetry endpoint.
+
+Disable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_disable
+    {"/eal/lcore/poll_busyness_disable": {"poll_busyness_enabled": 0}}
+
+Enable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_enable
+    {"/eal/lcore/poll_busyness_enable": {"poll_busyness_enabled": 1}}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v6 1/4] eal: add lcore poll busyness telemetry
  2022-09-13 13:19   ` [PATCH v6 1/4] eal: add " Kevin Laatz
@ 2022-09-13 13:48     ` Morten Brørup
  0 siblings, 0 replies; 87+ messages in thread
From: Morten Brørup @ 2022-09-13 13:48 UTC (permalink / raw)
  To: Kevin Laatz, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Tuesday, 13 September 2022 15.20
> 
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> 

[...]

Still a few missing renames...


> diff --git a/lib/eal/common/eal_common_lcore_telemetry.c
> b/lib/eal/common/eal_common_lcore_telemetry.c
> new file mode 100644
> index 0000000000..abef1ff86d
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_telemetry.c

eal_common_lcore_poll_telemetry.c

> @@ -0,0 +1,303 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#include <rte_telemetry.h>
> +#endif
> +
> +rte_atomic32_t __rte_lcore_telemetry_enabled;

__rte_lcore_poll_telemetry_enabled

> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +
> +struct lcore_telemetry {

This one is private, so suggestion only:

struct lcore_poll_telemetry {

> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +
> +static struct lcore_telemetry *telemetry_data;
> +
> +#define LCORE_POLL_BUSYNESS_MAX 100
> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
> +#define LCORE_POLL_BUSYNESS_MIN 0
> +
> +#define SMOOTH_COEFF 5
> +#define STATE_CHANGE_OPT 32
> +
> +static void lcore_config_init(void)
> +{
> +	int lcore_id;
> +
> +	RTE_LCORE_FOREACH(lcore_id) {
> +		struct lcore_telemetry *td = &telemetry_data[lcore_id];
> +
> +		td->interval_ts = 0;
> +		td->last_poll_ts = 0;
> +		td->empty_cycles = 0;
> +		td->last_empty = true;
> +		td->contig_poll_cnt = 0;
> +		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
> +		td->raw_poll_busyness = 0;
> +	}
> +}
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id)
> +{
> +	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
> +	/* if more than 1000 busyness periods have passed, this core is
> considered inactive */
> +	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS
> * tsc_ms * 1000;
> +	struct lcore_telemetry *tdata;
> +
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return -EINVAL;
> +	tdata = &telemetry_data[lcore_id];
> +
> +	/* if the lcore is not active */
> +	if (tdata->interval_ts == 0)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +	/* if the core hasn't been active in a while */
> +	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +
> +	/* this core is active, report its poll busyness */
> +	return telemetry_data[lcore_id].poll_busyness;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return rte_atomic32_read(&__rte_lcore_telemetry_enabled);
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable)
> +{
> +	int set = rte_atomic32_cmpset((volatile uint32_t
> *)&__rte_lcore_telemetry_enabled,
> +			(int)!enable, (int)enable);
> +
> +	/* Reset counters on successful disable */
> +	if (set && !enable)
> +		lcore_config_init();
> +}
> +
> +static inline int calc_raw_poll_busyness(const struct lcore_telemetry
> *tdata,
> +				    const uint64_t empty, const uint64_t total)
> +{
> +	/*
> +	 * We don't want to use floating point math here, but we want for
> our poll
> +	 * busyness to react smoothly to sudden changes, while still
> keeping the
> +	 * accuracy and making sure that over time the average follows
> poll busyness
> +	 * as measured just-in-time. Therefore, we will calculate the
> average poll
> +	 * busyness using integer math, but shift the decimal point two
> places
> +	 * to the right, so that 100.0 becomes 10000. This allows us to
> report
> +	 * integer values (0..100) while still allowing ourselves to
> follow the
> +	 * just-in-time measurements when we calculate our averages.
> +	 */
> +	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
> +
> +	const int prev_raw_idle = max_raw_idle - tdata-
> >raw_poll_busyness;
> +
> +	/* calculate rate of idle cycles, times 100 */
> +	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
> +
> +	/* smoothen the idleness */
> +	const int smoothened_idle =
> +			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) /
> SMOOTH_COEFF;
> +
> +	/* convert idleness to poll busyness */
> +	return max_raw_idle - smoothened_idle;
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +	struct lcore_telemetry *tdata;
> +	const bool empty = nb_rx == 0;
> +	uint64_t diff_int, diff_last;
> +	bool last_empty;
> +
> +	/* This telemetry is not supported for unregistered non-EAL
> threads */
> +	if (lcore_id >= RTE_MAX_LCORE) {
> +		RTE_LOG(DEBUG, EAL,
> +				"Lcore telemetry not supported on unregistered
> non-EAL thread %d",
> +				lcore_id);
> +		return;
> +	}
> +
> +	tdata = &telemetry_data[lcore_id];
> +	last_empty = tdata->last_empty;
> +
> +	/* optimization: don't do anything if status hasn't changed */
> +	if (last_empty == empty && tdata->contig_poll_cnt++ <
> STATE_CHANGE_OPT)
> +		return;
> +	/* status changed or we're waiting for too long, reset counter */
> +	tdata->contig_poll_cnt = 0;
> +
> +	cur_tsc = rte_rdtsc();
> +
> +	interval_ts = tdata->interval_ts;
> +	empty_cycles = tdata->empty_cycles;
> +	last_poll_ts = tdata->last_poll_ts;
> +
> +	diff_int = cur_tsc - interval_ts;
> +	diff_last = cur_tsc - last_poll_ts;
> +
> +	/* is this the first time we're here? */
> +	if (interval_ts == 0) {
> +		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
> +		tdata->raw_poll_busyness = 0;
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->contig_poll_cnt = 0;
> +		goto end;
> +	}
> +
> +	/* update the empty counter if we got an empty poll earlier */
> +	if (last_empty)
> +		empty_cycles += diff_last;
> +
> +	/* have we passed the interval? */
> +	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) *
> RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
> +	if (diff_int > interval) {
> +		int raw_poll_busyness;
> +
> +		/* get updated poll_busyness value */
> +		raw_poll_busyness = calc_raw_poll_busyness(tdata,
> empty_cycles, diff_int);
> +
> +		/* set a new interval, reset empty counter */
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->raw_poll_busyness = raw_poll_busyness;
> +		/* bring poll busyness back to 0..100 range, biased to
> round up */
> +		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
> +	} else
> +		/* we may have updated empty counter */
> +		tdata->empty_cycles = empty_cycles;
> +
> +end:
> +	/* update status for next poll */
> +	tdata->last_poll_ts = cur_tsc;
> +	tdata->last_empty = empty;
> +}
> +
> +static int
> +lcore_poll_busyness_enable(const char *cmd __rte_unused,
> +		      const char *params __rte_unused,
> +		      struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(true);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_poll_busyness_disable(const char *cmd __rte_unused,
> +		       const char *params __rte_unused,
> +		       struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(false);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_handle_poll_busyness(const char *cmd __rte_unused,
> +		      const char *params __rte_unused, struct rte_tel_data
> *d)
> +{
> +	char corenum[64];
> +	int i;
> +
> +	rte_tel_data_start_dict(d);
> +
> +	RTE_LCORE_FOREACH(i) {
> +		if (!rte_lcore_is_enabled(i))
> +			continue;
> +		snprintf(corenum, sizeof(corenum), "%d", i);
> +		rte_tel_data_add_dict_int(d, corenum,
> rte_lcore_poll_busyness(i));
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +lcore_telemetry_free(void)

Not sure, but either:
lcore_poll_telemetry_free or
rte_lcore_poll_telemetry_free

> +{
> +	if (telemetry_data != NULL) {
> +		free(telemetry_data);
> +		telemetry_data = NULL;
> +	}
> +}
> +
> +RTE_INIT(lcore_init_telemetry)

Not sure, but either:
RTE_INIT(lcore_poll_init_telemetry) or
RTE_INIT(rte_lcore_poll_init_telemetry)

> +{
> +	telemetry_data = calloc(RTE_MAX_LCORE,
> sizeof(telemetry_data[0]));
> +	if (telemetry_data == NULL)
> +		rte_panic("Could not init lcore telemetry data: Out of
> memory\n");
> +
> +	lcore_config_init();
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness",
> lcore_handle_poll_busyness,
> +				   "return percentage poll busyness of cores");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable",
> lcore_poll_busyness_enable,
> +				   "enable lcore poll busyness measurement");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable",
> lcore_poll_busyness_disable,
> +				   "disable lcore poll busyness measurement");
> +
> +	rte_atomic32_set(&__rte_lcore_telemetry_enabled, true);
> +}
> +
> +#else
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
> +{
> +	return -ENOTSUP;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return -ENOTSUP;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
> +{
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
> +{
> +}
> +
> +void lcore_telemetry_free(void)
> +{
> +}
> +
> +#endif


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
                   ` (8 preceding siblings ...)
  2022-09-13 13:19 ` [PATCH v6 0/4] Add lcore poll busyness telemetry Kevin Laatz
@ 2022-09-14  9:29 ` Kevin Laatz
  2022-09-14  9:29   ` [PATCH v7 1/4] eal: add " Kevin Laatz
                     ` (5 more replies)
  9 siblings, 6 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-14  9:29 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Currently, there is no way to measure lcore polling busyness in a passive
way, without any modifications to the application. This patchset adds a new
EAL API that will be able to passively track core polling busyness. As part
of the set, new telemetry endpoints are added to read the generate metrics.

---
v7:
  * Rename funcs, vars, files to include "poll" where missing.

v6:
  * Add API and perf unit tests

v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missing renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup

Anatoly Burakov (2):
  eal: add lcore poll busyness telemetry
  eal: add cpuset lcore telemetry entries

Kevin Laatz (2):
  app/test: add unit tests for lcore poll busyness
  doc: add howto guide for lcore poll busyness

 app/test/meson.build                          |   4 +
 app/test/test_lcore_poll_busyness_api.c       | 134 +++++++
 app/test/test_lcore_poll_busyness_perf.c      |  72 ++++
 config/meson.build                            |   1 +
 config/rte_config.h                           |   1 +
 doc/guides/howto/index.rst                    |   1 +
 doc/guides/howto/lcore_poll_busyness.rst      |  93 +++++
 lib/bbdev/rte_bbdev.h                         |  17 +-
 lib/compressdev/rte_compressdev.c             |   2 +
 lib/cryptodev/rte_cryptodev.h                 |   2 +
 lib/distributor/rte_distributor.c             |  21 +-
 lib/distributor/rte_distributor_single.c      |  14 +-
 lib/dmadev/rte_dmadev.h                       |  15 +-
 .../common/eal_common_lcore_poll_telemetry.c  | 350 ++++++++++++++++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   1 +
 lib/eal/include/rte_lcore.h                   |  85 ++++-
 lib/eal/linux/eal.c                           |   1 +
 lib/eal/meson.build                           |   3 +
 lib/eal/version.map                           |   7 +
 lib/ethdev/rte_ethdev.h                       |   2 +
 lib/eventdev/rte_eventdev.h                   |  10 +-
 lib/rawdev/rte_rawdev.c                       |   6 +-
 lib/regexdev/rte_regexdev.h                   |   5 +-
 lib/ring/rte_ring_elem_pvt.h                  |   1 +
 meson_options.txt                             |   2 +
 26 files changed, 826 insertions(+), 25 deletions(-)
 create mode 100644 app/test/test_lcore_poll_busyness_api.c
 create mode 100644 app/test/test_lcore_poll_busyness_perf.c
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst
 create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c

-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
@ 2022-09-14  9:29   ` Kevin Laatz
  2022-09-14 14:30     ` Stephen Hemminger
                       ` (2 more replies)
  2022-09-14  9:29   ` [PATCH v7 2/4] eal: add cpuset lcore telemetry entries Kevin Laatz
                     ` (4 subsequent siblings)
  5 siblings, 3 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-14  9:29 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, Kevin Laatz, Conor Walsh, David Hunt,
	Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

From: Anatoly Burakov <anatoly.burakov@intel.com>

Currently, there is no way to measure lcore poll busyness in a passive way,
without any modifications to the application. This patch adds a new EAL API
that will be able to passively track core polling busyness.

The poll busyness is calculated by relying on the fact that most DPDK API's
will poll for work (packets, completions, eventdev events, etc). Empty
polls can be counted as "idle", while non-empty polls can be counted as
busy. To measure lcore poll busyness, we simply call the telemetry
timestamping function with the number of polls a particular code section
has processed, and count the number of cycles we've spent processing empty
bursts. The more empty bursts we encounter, the less cycles we spend in
"busy" state, and the less core poll busyness will be reported.

In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore poll busyness timestamping calls:

- All major driver API's:
  - ethdev
  - cryptodev
  - compressdev
  - regexdev
  - bbdev
  - rawdev
  - eventdev
  - dmadev
- Some additional libraries:
  - ring
  - distributor

To avoid performance impact from having lcore telemetry support, a global
variable is exported by EAL, and a call to timestamping function is wrapped
into a macro, so that whenever telemetry is disabled, it only takes one
additional branch and no function calls are performed. It is disabled at
compile time by default.

This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Conor Walsh <conor.walsh@intel.com>
Signed-off-by: David Hunt <david.hunt@intel.com>
Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>

---
v7:
  * Rename funcs, vars, files to include "poll" where missing.

v5:
  * Fix Windows build
  * Make lcore_telemetry_free() an internal interface
  * Minor cleanup

v4:
  * Fix doc build
  * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
  * Make enable/disable read and write atomic
  * Change rte_lcore_poll_busyness_enabled_set() param to bool
  * Move mem alloc from enable/disable to init/cleanup
  * Other minor fixes

v3:
  * Fix missed renaming to poll busyness
  * Fix clang compilation
  * Fix arm compilation

v2:
  * Use rte_get_tsc_hz() to adjust the telemetry period
  * Rename to reflect polling busyness vs general busyness
  * Fix segfault when calling telemetry timestamp from an unregistered
    non-EAL thread.
  * Minor cleanup
---
 config/meson.build                            |   1 +
 config/rte_config.h                           |   1 +
 lib/bbdev/rte_bbdev.h                         |  17 +-
 lib/compressdev/rte_compressdev.c             |   2 +
 lib/cryptodev/rte_cryptodev.h                 |   2 +
 lib/distributor/rte_distributor.c             |  21 +-
 lib/distributor/rte_distributor_single.c      |  14 +-
 lib/dmadev/rte_dmadev.h                       |  15 +-
 .../common/eal_common_lcore_poll_telemetry.c  | 303 ++++++++++++++++++
 lib/eal/common/meson.build                    |   1 +
 lib/eal/freebsd/eal.c                         |   1 +
 lib/eal/include/rte_lcore.h                   |  85 ++++-
 lib/eal/linux/eal.c                           |   1 +
 lib/eal/meson.build                           |   3 +
 lib/eal/version.map                           |   7 +
 lib/ethdev/rte_ethdev.h                       |   2 +
 lib/eventdev/rte_eventdev.h                   |  10 +-
 lib/rawdev/rte_rawdev.c                       |   6 +-
 lib/regexdev/rte_regexdev.h                   |   5 +-
 lib/ring/rte_ring_elem_pvt.h                  |   1 +
 meson_options.txt                             |   2 +
 21 files changed, 475 insertions(+), 25 deletions(-)
 create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c

diff --git a/config/meson.build b/config/meson.build
index 7f7b6c92fd..d5954a059c 100644
--- a/config/meson.build
+++ b/config/meson.build
@@ -297,6 +297,7 @@ endforeach
 dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
 dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
 dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
+dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
 # values which have defaults which may be overridden
 dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
 dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
diff --git a/config/rte_config.h b/config/rte_config.h
index ae56a86394..86ac3b8a6e 100644
--- a/config/rte_config.h
+++ b/config/rte_config.h
@@ -39,6 +39,7 @@
 #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
 #define RTE_BACKTRACE 1
 #define RTE_MAX_VFIO_CONTAINERS 64
+#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
 
 /* bsd module defines */
 #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
index b88c88167e..d6a98d3f11 100644
--- a/lib/bbdev/rte_bbdev.h
+++ b/lib/bbdev/rte_bbdev.h
@@ -28,6 +28,7 @@ extern "C" {
 #include <stdbool.h>
 
 #include <rte_cpuflags.h>
+#include <rte_lcore.h>
 
 #include "rte_bbdev_op.h"
 
@@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 
@@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
 {
 	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
 	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
-	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /** Definitions of device event types */
diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
index 22c438f2dd..fabc495a8e 100644
--- a/lib/compressdev/rte_compressdev.c
+++ b/lib/compressdev/rte_compressdev.c
@@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 	nb_ops = (*dev->dequeue_burst)
 			(dev->data->queue_pairs[qp_id], ops, nb_ops);
 
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+
 	return nb_ops;
 }
 
diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
index 56f459c6a0..a5b1d7c594 100644
--- a/lib/cryptodev/rte_cryptodev.h
+++ b/lib/cryptodev/rte_cryptodev.h
@@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
 	}
 #endif
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
 	return nb_ops;
 }
 
diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
index 3035b7a999..428157ec64 100644
--- a/lib/distributor/rte_distributor.c
+++ b/lib/distributor/rte_distributor.c
@@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
 
 		while (rte_rdtsc() < t)
 			rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
 
 	/*
@@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
 
 	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
 		if (return_count <= 1) {
+			uint16_t cnt;
 			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
-				worker_id, return_count ? oldpkt[0] : NULL);
-			return (pkts[0]) ? 1 : 0;
-		} else
-			return -EINVAL;
+								 worker_id,
+								 return_count ? oldpkt[0] : NULL);
+			cnt = (pkts[0] != NULL) ? 1 : 0;
+			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
+			return cnt;
+		}
+		return -EINVAL;
 	}
 
 	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
 
-	count = rte_distributor_poll_pkt(d, worker_id, pkts);
-	while (count == -1) {
+	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
 		uint64_t t = rte_rdtsc() + 100;
 
 		while (rte_rdtsc() < t)
 			rte_pause();
 
-		count = rte_distributor_poll_pkt(d, worker_id, pkts);
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
 	}
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
 	return count;
 }
 
diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
index 2c77ac454a..4c916c0fd2 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
 	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
 	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
 			| RTE_DISTRIB_GET_BUF;
-	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
-		==, 0, __ATOMIC_RELAXED);
+
+	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
+			& RTE_DISTRIB_FLAGS_MASK) != 0) {
+		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 
 	/* Sync with distributor on GET_BUF flag. */
 	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
 {
 	struct rte_mbuf *ret;
 	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
-	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
+	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
 		rte_pause();
+		/* this was an empty poll */
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
+	}
 	return ret;
 }
 
diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
index e7f992b734..3e27e0fd2b 100644
--- a/lib/dmadev/rte_dmadev.h
+++ b/lib/dmadev/rte_dmadev.h
@@ -149,6 +149,7 @@
 #include <rte_bitops.h>
 #include <rte_common.h>
 #include <rte_compat.h>
+#include <rte_lcore.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		  uint16_t *last_idx, bool *has_error)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 	bool err;
 
 #ifdef RTE_DMADEV_DEBUG
@@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
 		has_error = &err;
 
 	*has_error = false;
-	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
-				 has_error);
+	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
+				   has_error);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
@@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 			 enum rte_dma_status_code *status)
 {
 	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
-	uint16_t idx;
+	uint16_t idx, nb_ops;
 
 #ifdef RTE_DMADEV_DEBUG
 	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
@@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
 	if (last_idx == NULL)
 		last_idx = &idx;
 
-	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
+	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
 					last_idx, status);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 /**
diff --git a/lib/eal/common/eal_common_lcore_poll_telemetry.c b/lib/eal/common/eal_common_lcore_poll_telemetry.c
new file mode 100644
index 0000000000..d97996e85f
--- /dev/null
+++ b/lib/eal/common/eal_common_lcore_poll_telemetry.c
@@ -0,0 +1,303 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <limits.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_lcore.h>
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#include <rte_telemetry.h>
+#endif
+
+rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
+
+#ifdef RTE_LCORE_POLL_BUSYNESS
+
+struct lcore_poll_telemetry {
+	int poll_busyness;
+	/**< Calculated poll busyness (gets set/returned by the API) */
+	int raw_poll_busyness;
+	/**< Calculated poll busyness times 100. */
+	uint64_t interval_ts;
+	/**< when previous telemetry interval started */
+	uint64_t empty_cycles;
+	/**< empty cycle count since last interval */
+	uint64_t last_poll_ts;
+	/**< last poll timestamp */
+	bool last_empty;
+	/**< if last poll was empty */
+	unsigned int contig_poll_cnt;
+	/**< contiguous (always empty/non empty) poll counter */
+} __rte_cache_aligned;
+
+static struct lcore_poll_telemetry *telemetry_data;
+
+#define LCORE_POLL_BUSYNESS_MAX 100
+#define LCORE_POLL_BUSYNESS_NOT_SET -1
+#define LCORE_POLL_BUSYNESS_MIN 0
+
+#define SMOOTH_COEFF 5
+#define STATE_CHANGE_OPT 32
+
+static void lcore_config_init(void)
+{
+	int lcore_id;
+
+	RTE_LCORE_FOREACH(lcore_id) {
+		struct lcore_poll_telemetry *td = &telemetry_data[lcore_id];
+
+		td->interval_ts = 0;
+		td->last_poll_ts = 0;
+		td->empty_cycles = 0;
+		td->last_empty = true;
+		td->contig_poll_cnt = 0;
+		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
+		td->raw_poll_busyness = 0;
+	}
+}
+
+int rte_lcore_poll_busyness(unsigned int lcore_id)
+{
+	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
+	/* if more than 1000 busyness periods have passed, this core is considered inactive */
+	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
+	struct lcore_poll_telemetry *tdata;
+
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+	tdata = &telemetry_data[lcore_id];
+
+	/* if the lcore is not active */
+	if (tdata->interval_ts == 0)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+	/* if the core hasn't been active in a while */
+	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
+		return LCORE_POLL_BUSYNESS_NOT_SET;
+
+	/* this core is active, report its poll busyness */
+	return telemetry_data[lcore_id].poll_busyness;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable)
+{
+	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_poll_telemetry_enabled,
+			(int)!enable, (int)enable);
+
+	/* Reset counters on successful disable */
+	if (set && !enable)
+		lcore_config_init();
+}
+
+static inline int calc_raw_poll_busyness(const struct lcore_poll_telemetry *tdata,
+				    const uint64_t empty, const uint64_t total)
+{
+	/*
+	 * We don't want to use floating point math here, but we want for our poll
+	 * busyness to react smoothly to sudden changes, while still keeping the
+	 * accuracy and making sure that over time the average follows poll busyness
+	 * as measured just-in-time. Therefore, we will calculate the average poll
+	 * busyness using integer math, but shift the decimal point two places
+	 * to the right, so that 100.0 becomes 10000. This allows us to report
+	 * integer values (0..100) while still allowing ourselves to follow the
+	 * just-in-time measurements when we calculate our averages.
+	 */
+	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
+
+	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
+
+	/* calculate rate of idle cycles, times 100 */
+	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
+
+	/* smoothen the idleness */
+	const int smoothened_idle =
+			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
+
+	/* convert idleness to poll busyness */
+	return max_raw_idle - smoothened_idle;
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
+{
+	const unsigned int lcore_id = rte_lcore_id();
+	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
+	struct lcore_poll_telemetry *tdata;
+	const bool empty = nb_rx == 0;
+	uint64_t diff_int, diff_last;
+	bool last_empty;
+
+	/* This telemetry is not supported for unregistered non-EAL threads */
+	if (lcore_id >= RTE_MAX_LCORE) {
+		RTE_LOG(DEBUG, EAL,
+				"Lcore telemetry not supported on unregistered non-EAL thread %d",
+				lcore_id);
+		return;
+	}
+
+	tdata = &telemetry_data[lcore_id];
+	last_empty = tdata->last_empty;
+
+	/* optimization: don't do anything if status hasn't changed */
+	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
+		return;
+	/* status changed or we're waiting for too long, reset counter */
+	tdata->contig_poll_cnt = 0;
+
+	cur_tsc = rte_rdtsc();
+
+	interval_ts = tdata->interval_ts;
+	empty_cycles = tdata->empty_cycles;
+	last_poll_ts = tdata->last_poll_ts;
+
+	diff_int = cur_tsc - interval_ts;
+	diff_last = cur_tsc - last_poll_ts;
+
+	/* is this the first time we're here? */
+	if (interval_ts == 0) {
+		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
+		tdata->raw_poll_busyness = 0;
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->contig_poll_cnt = 0;
+		goto end;
+	}
+
+	/* update the empty counter if we got an empty poll earlier */
+	if (last_empty)
+		empty_cycles += diff_last;
+
+	/* have we passed the interval? */
+	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
+	if (diff_int > interval) {
+		int raw_poll_busyness;
+
+		/* get updated poll_busyness value */
+		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
+
+		/* set a new interval, reset empty counter */
+		tdata->interval_ts = cur_tsc;
+		tdata->empty_cycles = 0;
+		tdata->raw_poll_busyness = raw_poll_busyness;
+		/* bring poll busyness back to 0..100 range, biased to round up */
+		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
+	} else
+		/* we may have updated empty counter */
+		tdata->empty_cycles = empty_cycles;
+
+end:
+	/* update status for next poll */
+	tdata->last_poll_ts = cur_tsc;
+	tdata->last_empty = empty;
+}
+
+static int
+lcore_poll_busyness_enable(const char *cmd __rte_unused,
+		      const char *params __rte_unused,
+		      struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(true);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
+
+	return 0;
+}
+
+static int
+lcore_poll_busyness_disable(const char *cmd __rte_unused,
+		       const char *params __rte_unused,
+		       struct rte_tel_data *d)
+{
+	rte_lcore_poll_busyness_enabled_set(false);
+
+	rte_tel_data_start_dict(d);
+
+	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
+
+	return 0;
+}
+
+static int
+lcore_handle_poll_busyness(const char *cmd __rte_unused,
+		      const char *params __rte_unused, struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		if (!rte_lcore_is_enabled(i))
+			continue;
+		snprintf(corenum, sizeof(corenum), "%d", i);
+		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
+	}
+
+	return 0;
+}
+
+void
+eal_lcore_poll_telemetry_free(void)
+{
+	if (telemetry_data != NULL) {
+		free(telemetry_data);
+		telemetry_data = NULL;
+	}
+}
+
+RTE_INIT(lcore_init_poll_telemetry)
+{
+	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
+	if (telemetry_data == NULL)
+		rte_panic("Could not init lcore telemetry data: Out of memory\n");
+
+	lcore_config_init();
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
+				   "return percentage poll busyness of cores");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
+				   "enable lcore poll busyness measurement");
+
+	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
+				   "disable lcore poll busyness measurement");
+
+	rte_atomic32_set(&__rte_lcore_poll_telemetry_enabled, true);
+}
+
+#else
+
+int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
+{
+	return -ENOTSUP;
+}
+
+int rte_lcore_poll_busyness_enabled(void)
+{
+	return -ENOTSUP;
+}
+
+void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
+{
+}
+
+void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
+{
+}
+
+void eal_lcore_poll_telemetry_free(void)
+{
+}
+
+#endif
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..e5741ce9f9 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -17,6 +17,7 @@ sources += files(
         'eal_common_hexdump.c',
         'eal_common_interrupts.c',
         'eal_common_launch.c',
+        'eal_common_lcore_poll_telemetry.c',
         'eal_common_lcore.c',
         'eal_common_log.c',
         'eal_common_mcfg.c',
diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
index 26fbc91b26..92c4af9c28 100644
--- a/lib/eal/freebsd/eal.c
+++ b/lib/eal/freebsd/eal.c
@@ -895,6 +895,7 @@ rte_eal_cleanup(void)
 	rte_mp_channel_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_lcore_poll_telemetry_free();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	rte_eal_alarm_cleanup();
diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
index b598e1b9ec..2191c2473a 100644
--- a/lib/eal/include/rte_lcore.h
+++ b/lib/eal/include/rte_lcore.h
@@ -16,6 +16,7 @@
 #include <rte_eal.h>
 #include <rte_launch.h>
 #include <rte_thread.h>
+#include <rte_atomic.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -415,9 +416,91 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
 		const pthread_attr_t *attr,
 		void *(*start_routine)(void *), void *arg);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Read poll busyness value corresponding to an lcore.
+ *
+ * @param lcore_id
+ *   Lcore to read poll busyness value for.
+ * @return
+ *   - value between 0 and 100 on success
+ *   - -1 if lcore is not active
+ *   - -EINVAL if lcore is invalid
+ *   - -ENOMEM if not enough memory available
+ *   - -ENOTSUP if not supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Check if lcore poll busyness telemetry is enabled.
+ *
+ * @return
+ *   - true if lcore telemetry is enabled
+ *   - false if lcore telemetry is disabled
+ *   - -ENOTSUP if not lcore telemetry supported
+ */
+__rte_experimental
+int
+rte_lcore_poll_busyness_enabled(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Enable or disable poll busyness telemetry.
+ *
+ * @param enable
+ *   1 to enable, 0 to disable
+ */
+__rte_experimental
+void
+rte_lcore_poll_busyness_enabled_set(bool enable);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Lcore poll busyness timestamping function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+__rte_experimental
+void
+__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
+
+/** @internal lcore telemetry enabled status */
+extern rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
+
+/** @internal free memory allocated for lcore telemetry */
+void
+eal_lcore_poll_telemetry_free(void);
+
+/**
+ * Call lcore poll busyness timestamp function.
+ *
+ * @param nb_rx
+ *   Number of buffers processed by lcore.
+ */
+#ifdef RTE_LCORE_POLL_BUSYNESS
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
+	int enabled = (int)rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);	\
+	if (enabled)								\
+		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
+} while (0)
+#else
+#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
 
-
 #endif /* _RTE_LCORE_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 37d29643a5..5e81352a81 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -1364,6 +1364,7 @@ rte_eal_cleanup(void)
 	rte_mp_channel_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_lcore_poll_telemetry_free();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..2fb90d446b 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -25,6 +25,9 @@ subdir(arch_subdir)
 deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
+else
+	# core poll busyness telemetry depends on telemetry library
+	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
 endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1f293e768b..3275d1fac4 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -424,6 +424,13 @@ EXPERIMENTAL {
 	rte_thread_self;
 	rte_thread_set_affinity_by_id;
 	rte_thread_set_priority;
+
+	# added in 22.11
+	__rte_lcore_poll_busyness_timestamp;
+	__rte_lcore_poll_telemetry_enabled;
+	rte_lcore_poll_busyness;
+	rte_lcore_poll_busyness_enabled;
+	rte_lcore_poll_busyness_enabled_set;
 };
 
 INTERNAL {
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index de9e970d4d..4c8113f31f 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
 #endif
 
 	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
 	return nb_rx;
 }
 
diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
index 6a6f6ea4c1..a65b3c7c85 100644
--- a/lib/eventdev/rte_eventdev.h
+++ b/lib/eventdev/rte_eventdev.h
@@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 			uint16_t nb_events, uint64_t timeout_ticks)
 {
 	const struct rte_event_fp_ops *fp_ops;
+	uint16_t nb_evts;
 	void *port;
 
 	fp_ops = &rte_event_fp_ops[dev_id];
@@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
 	 * requests nb_events as const one
 	 */
 	if (nb_events == 1)
-		return (fp_ops->dequeue)(port, ev, timeout_ticks);
+		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
 	else
-		return (fp_ops->dequeue_burst)(port, ev, nb_events,
-					       timeout_ticks);
+		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
+					timeout_ticks);
+
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
+	return nb_evts;
 }
 
 #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
index 2f0a4f132e..1cba53270a 100644
--- a/lib/rawdev/rte_rawdev.c
+++ b/lib/rawdev/rte_rawdev.c
@@ -16,6 +16,7 @@
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_telemetry.h>
+#include <rte_lcore.h>
 
 #include "rte_rawdev.h"
 #include "rte_rawdev_pmd.h"
@@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
 			   rte_rawdev_obj_t context)
 {
 	struct rte_rawdev *dev;
+	int nb_ops;
 
 	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
 	dev = &rte_rawdevs[dev_id];
 
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
-	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
+	return nb_ops;
 }
 
 int
diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
index 3bce8090f6..8caaed502f 100644
--- a/lib/regexdev/rte_regexdev.h
+++ b/lib/regexdev/rte_regexdev.h
@@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 			   struct rte_regex_ops **ops, uint16_t nb_ops)
 {
 	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
+	uint16_t deq_ops;
 #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
 	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
 	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
@@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
 		return -EINVAL;
 	}
 #endif
-	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
+	return deq_ops;
 }
 
 #ifdef __cplusplus
diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 83788c56e6..cf2370c238 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 end:
 	if (available != NULL)
 		*available = entries - n;
+	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
 	return n;
 }
 
diff --git a/meson_options.txt b/meson_options.txt
index 7c220ad68d..9b20a36fdb 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
        'Install headers to build drivers.')
 option('enable_kmods', type: 'boolean', value: false, description:
        'build kernel modules')
+option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
+       'enable collection of lcore poll busyness telemetry')
 option('examples', type: 'string', value: '', description:
        'Comma-separated list of examples to build by default')
 option('flexran_sdk', type: 'string', value: '', description:
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v7 2/4] eal: add cpuset lcore telemetry entries
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-14  9:29   ` [PATCH v7 1/4] eal: add " Kevin Laatz
@ 2022-09-14  9:29   ` Kevin Laatz
  2022-09-14  9:29   ` [PATCH v7 3/4] app/test: add unit tests for lcore poll busyness Kevin Laatz
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-14  9:29 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov

From: Anatoly Burakov <anatoly.burakov@intel.com>

Expose per-lcore cpuset information to telemetry.

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
---
 .../common/eal_common_lcore_poll_telemetry.c  | 47 +++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/lib/eal/common/eal_common_lcore_poll_telemetry.c b/lib/eal/common/eal_common_lcore_poll_telemetry.c
index d97996e85f..a19d6ccb95 100644
--- a/lib/eal/common/eal_common_lcore_poll_telemetry.c
+++ b/lib/eal/common/eal_common_lcore_poll_telemetry.c
@@ -19,6 +19,8 @@ rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
 
 #ifdef RTE_LCORE_POLL_BUSYNESS
 
+#include "eal_private.h"
+
 struct lcore_poll_telemetry {
 	int poll_busyness;
 	/**< Calculated poll busyness (gets set/returned by the API) */
@@ -247,6 +249,48 @@ lcore_handle_poll_busyness(const char *cmd __rte_unused,
 	return 0;
 }
 
+static int
+lcore_handle_cpuset(const char *cmd __rte_unused,
+		    const char *params __rte_unused,
+		    struct rte_tel_data *d)
+{
+	char corenum[64];
+	int i;
+
+	rte_tel_data_start_dict(d);
+
+	RTE_LCORE_FOREACH(i) {
+		const struct lcore_config *cfg = &lcore_config[i];
+		const rte_cpuset_t *cpuset = &cfg->cpuset;
+		struct rte_tel_data *ld;
+		unsigned int cpu;
+
+		if (!rte_lcore_is_enabled(i))
+			continue;
+
+		/* create an array of integers */
+		ld = rte_tel_data_alloc();
+		if (ld == NULL)
+			return -ENOMEM;
+		rte_tel_data_start_array(ld, RTE_TEL_INT_VAL);
+
+		/* add cpu ID's from cpuset to the array */
+		for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
+			if (!CPU_ISSET(cpu, cpuset))
+				continue;
+			rte_tel_data_add_array_int(ld, cpu);
+		}
+
+		/* add array to the per-lcore container */
+		snprintf(corenum, sizeof(corenum), "%d", i);
+
+		/* tell telemetry library to free this array automatically */
+		rte_tel_data_add_dict_container(d, corenum, ld, 0);
+	}
+
+	return 0;
+}
+
 void
 eal_lcore_poll_telemetry_free(void)
 {
@@ -273,6 +317,9 @@ RTE_INIT(lcore_init_poll_telemetry)
 	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
 				   "disable lcore poll busyness measurement");
 
+	rte_telemetry_register_cmd("/eal/lcore/cpuset", lcore_handle_cpuset,
+				   "list physical core affinity for each lcore");
+
 	rte_atomic32_set(&__rte_lcore_poll_telemetry_enabled, true);
 }
 
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v7 3/4] app/test: add unit tests for lcore poll busyness
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
  2022-09-14  9:29   ` [PATCH v7 1/4] eal: add " Kevin Laatz
  2022-09-14  9:29   ` [PATCH v7 2/4] eal: add cpuset lcore telemetry entries Kevin Laatz
@ 2022-09-14  9:29   ` Kevin Laatz
  2022-09-30 22:20     ` Mattias Rönnblom
  2022-09-14  9:29   ` [PATCH v7 4/4] doc: add howto guide " Kevin Laatz
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-09-14  9:29 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add API unit tests and perf unit tests for the newly added lcore poll
busyness feature.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
 app/test/meson.build                     |   4 +
 app/test/test_lcore_poll_busyness_api.c  | 134 +++++++++++++++++++++++
 app/test/test_lcore_poll_busyness_perf.c |  72 ++++++++++++
 3 files changed, 210 insertions(+)
 create mode 100644 app/test/test_lcore_poll_busyness_api.c
 create mode 100644 app/test/test_lcore_poll_busyness_perf.c

diff --git a/app/test/meson.build b/app/test/meson.build
index bf1d81f84a..d543e730a2 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -74,6 +74,8 @@ test_sources = files(
         'test_ipsec_perf.c',
         'test_kni.c',
         'test_kvargs.c',
+        'test_lcore_poll_busyness_api.c',
+        'test_lcore_poll_busyness_perf.c',
         'test_lcores.c',
         'test_logs.c',
         'test_lpm.c',
@@ -192,6 +194,7 @@ fast_tests = [
         ['interrupt_autotest', true, true],
         ['ipfrag_autotest', false, true],
         ['lcores_autotest', true, true],
+        ['lcore_poll_busyness_autotest', true, true],
         ['logs_autotest', true, true],
         ['lpm_autotest', true, true],
         ['lpm6_autotest', true, true],
@@ -292,6 +295,7 @@ perf_test_names = [
         'trace_perf_autotest',
         'ipsec_perf_autotest',
         'thash_perf_autotest',
+        'lcore_poll_busyness_perf_autotest'
 ]
 
 driver_test_names = [
diff --git a/app/test/test_lcore_poll_busyness_api.c b/app/test/test_lcore_poll_busyness_api.c
new file mode 100644
index 0000000000..db76322994
--- /dev/null
+++ b/app/test/test_lcore_poll_busyness_api.c
@@ -0,0 +1,134 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <rte_lcore.h>
+
+#include "test.h"
+
+/* Arbitrary amount of "work" to simulate busyness with */
+#define WORK		32
+#define TIMESTAMP_ITERS	1000000
+
+#define LCORE_POLL_BUSYNESS_NOT_SET	-1
+
+static int
+test_lcore_poll_busyness_enable_disable(void)
+{
+	int initial_state, curr_state;
+	bool req_state;
+
+	/* Get the initial state */
+	initial_state = rte_lcore_poll_busyness_enabled();
+	if (initial_state == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	/* Set state to the inverse of the initial state and check for the change */
+	req_state = !initial_state;
+	rte_lcore_poll_busyness_enabled_set(req_state);
+	curr_state = rte_lcore_poll_busyness_enabled();
+	if (curr_state != req_state)
+		return TEST_FAILED;
+
+	/* Now change the state back to the original state. By changing it back, both
+	 * enable and disable will have been tested.
+	 */
+	req_state = !curr_state;
+	rte_lcore_poll_busyness_enabled_set(req_state);
+	curr_state = rte_lcore_poll_busyness_enabled();
+	if (curr_state != req_state)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_lcore_poll_busyness_invalid_lcore(void)
+{
+	int ret;
+
+	/* Check if lcore poll busyness is enabled */
+	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	/* Only lcore_id <= RTE_MAX_LCORE are valid */
+	ret = rte_lcore_poll_busyness(RTE_MAX_LCORE);
+	if (ret != -EINVAL)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_lcore_poll_busyness_inactive_lcore(void)
+{
+	int ret;
+
+	/* Check if lcore poll busyness is enabled */
+	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	/* Use the test thread lcore_id for this test. Since it is not a polling
+	 * application, the busyness is expected to return -1.
+	 *
+	 * Note: this will not work with affinitized cores
+	 */
+	ret = rte_lcore_poll_busyness(rte_lcore_id());
+	if (ret != LCORE_POLL_BUSYNESS_NOT_SET)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static void
+simulate_lcore_poll_busyness(int iters)
+{
+	int i;
+
+	for (i = 0; i < iters; i++)
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
+}
+
+/* The test cannot know of an application running to test for valid lcore poll
+ * busyness data. For this test, we simulate lcore poll busyness for the
+ * lcore_id of the test thread for testing purposes.
+ */
+static int
+test_lcore_poll_busyness_active_lcore(void)
+{
+	int ret;
+
+	/* Check if lcore poll busyness is enabled */
+	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
+		return TEST_SKIPPED;
+
+	simulate_lcore_poll_busyness(TIMESTAMP_ITERS);
+
+	/* After timestamping with "work" many times, lcore poll busyness should be > 0 */
+	ret = rte_lcore_poll_busyness(rte_lcore_id());
+	if (ret <= 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite lcore_poll_busyness_tests = {
+	.suite_name = "lcore poll busyness autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_lcore_poll_busyness_enable_disable),
+		TEST_CASE(test_lcore_poll_busyness_invalid_lcore),
+		TEST_CASE(test_lcore_poll_busyness_inactive_lcore),
+		TEST_CASE(test_lcore_poll_busyness_active_lcore),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_lcore_poll_busyness_api(void)
+{
+	return unit_test_suite_runner(&lcore_poll_busyness_tests);
+}
+
+REGISTER_TEST_COMMAND(lcore_poll_busyness_autotest, test_lcore_poll_busyness_api);
diff --git a/app/test/test_lcore_poll_busyness_perf.c b/app/test/test_lcore_poll_busyness_perf.c
new file mode 100644
index 0000000000..5c27d21b00
--- /dev/null
+++ b/app/test/test_lcore_poll_busyness_perf.c
@@ -0,0 +1,72 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Intel Corporation
+ */
+
+#include <unistd.h>
+#include <inttypes.h>
+
+#include <rte_lcore.h>
+#include <rte_cycles.h>
+
+#include "test.h"
+
+/* Arbitrary amount of "work" to simulate busyness with */
+#define WORK		32
+#define TIMESTAMP_ITERS	1000000
+#define TEST_ITERS	10000
+
+static void
+simulate_lcore_poll_busyness(int iters)
+{
+	int i;
+
+	for (i = 0; i < iters; i++)
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
+}
+
+static void
+test_timestamp_perf(void)
+{
+	uint64_t start, end, diff;
+	uint64_t min = UINT64_MAX;
+	uint64_t max = 0;
+	uint64_t total = 0;
+	int i;
+
+	for (i = 0; i < TEST_ITERS; i++) {
+		start = rte_rdtsc();
+		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
+		end = rte_rdtsc();
+
+		diff = end - start;
+		min = RTE_MIN(diff, min);
+		max = RTE_MAX(diff, max);
+		total += diff;
+	}
+
+	printf("### Timestamp perf ###\n");
+	printf("Min cycles: %"PRIu64"\n", min);
+	printf("Avg cycles: %"PRIu64"\n", total / TEST_ITERS);
+	printf("Max cycles: %"PRIu64"\n", max);
+	printf("\n");
+}
+
+
+static int
+test_lcore_poll_busyness_perf(void)
+{
+	if (rte_lcore_poll_busyness_enabled()  == -ENOTSUP) {
+		printf("Lcore poll busyness may be disabled...\n");
+		return TEST_SKIPPED;
+	}
+
+	/* Initialize and prime the timestamp struct with simulated "work" for this lcore */
+	simulate_lcore_poll_busyness(10000);
+
+	/* Run perf tests */
+	test_timestamp_perf();
+
+	return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(lcore_poll_busyness_perf_autotest, test_lcore_poll_busyness_perf);
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* [PATCH v7 4/4] doc: add howto guide for lcore poll busyness
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
                     ` (2 preceding siblings ...)
  2022-09-14  9:29   ` [PATCH v7 3/4] app/test: add unit tests for lcore poll busyness Kevin Laatz
@ 2022-09-14  9:29   ` Kevin Laatz
  2022-09-14 14:33   ` [PATCH v7 0/4] Add lcore poll busyness telemetry Stephen Hemminger
  2022-10-05 13:44   ` Kevin Laatz
  5 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-14  9:29 UTC (permalink / raw)
  To: dev; +Cc: anatoly.burakov, Kevin Laatz

Add a new section to the howto guides for using the new lcore poll
busyness telemetry endpoints and describe general usage.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v6:
  * Add mention of perf autotest in note mentioning perf impact.

v4:
  * Include note on perf impact when the feature is enabled
  * Add doc to toctree
  * Updates to incorporate changes made earlier in the patchset

v3:
  * Update naming to poll busyness
---
 doc/guides/howto/index.rst               |  1 +
 doc/guides/howto/lcore_poll_busyness.rst | 93 ++++++++++++++++++++++++
 2 files changed, 94 insertions(+)
 create mode 100644 doc/guides/howto/lcore_poll_busyness.rst

diff --git a/doc/guides/howto/index.rst b/doc/guides/howto/index.rst
index bf6337d021..0a9060c1d3 100644
--- a/doc/guides/howto/index.rst
+++ b/doc/guides/howto/index.rst
@@ -21,3 +21,4 @@ HowTo Guides
     debug_troubleshoot
     openwrt
     avx512
+    lcore_poll_busyness
diff --git a/doc/guides/howto/lcore_poll_busyness.rst b/doc/guides/howto/lcore_poll_busyness.rst
new file mode 100644
index 0000000000..be5ea2a85d
--- /dev/null
+++ b/doc/guides/howto/lcore_poll_busyness.rst
@@ -0,0 +1,93 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+    Copyright(c) 2022 Intel Corporation.
+
+Lcore Poll Busyness Telemetry
+=============================
+
+The lcore poll busyness telemetry provides a built-in, generic method of gathering
+lcore utilization metrics for running applications. These metrics are exposed
+via a new telemetry endpoint.
+
+Since most DPDK APIs polling based, the poll busyness is calculated based on
+APIs receiving 'work' (packets, completions, events, etc). Empty polls are
+considered as idle, while non-empty polls are considered busy. Using the amount
+of cycles spent processing empty polls, the busyness can be calculated and recorded.
+
+Application Specified Busyness
+------------------------------
+
+Improved accuracy of the reported busyness may need more contextual awareness
+from the application. For example, an application may make a number of calls to
+rx_burst before processing packets. If the last burst was an "empty poll", then
+the processing time of the packets would be falsely considered as "idle", since
+the last burst was empty. The application should track if any of the polls
+contained "work" to do and should mark the 'bulk' as "busy" cycles before
+proceeding to the processesing. This type of awareness is only available within
+the application.
+
+Applications can be modified to incorporate the extra contextual awareness in
+order to improve the reported busyness by marking areas of code as "busy" or
+"idle" appropriately. This can be done by inserting the timestamping macro::
+
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0)    /* to mark section as idle */
+    RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(32)   /* where 32 is nb_pkts to mark section as busy (non-zero is busy) */
+
+All cycles since the last state change (idle to busy, or vice versa) will be
+counted towards the current state's counter.
+
+Consuming the Telemetry
+-----------------------
+
+The telemetry gathered for lcore poll busyness can be read from the `telemetry.py`
+script via the new `/eal/lcore/poll_busyness` endpoint::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness
+    {"/eal/lcore/poll_busyness": {"12": -1, "13": 85, "14": 84}}
+
+* Cores not collecting poll busyness will report "-1". E.g. control cores or inactive cores.
+* All enabled cores will report their poll busyness in the range 0-100.
+
+Enabling and Disabling Lcore Poll Busyness Telemetry
+----------------------------------------------------
+
+By default, the lcore poll busyness telemetry is disabled at compile time. In
+order to allow DPDK to gather this metric, the ``enable_lcore_poll_busyness``
+meson option must be set to ``true``.
+
+.. note::
+    Enabling lcore poll busyness telemetry may impact performance due to the
+    additional timestamping, potentially per poll depending on the application.
+    This can be measured with the `lcore_poll_busyness_perf_autotest`.
+
+At compile time
+^^^^^^^^^^^^^^^
+
+Support can be enabled/disabled at compile time via the meson option.
+It is disabled by default.::
+
+    $ meson configure -Denable_lcore_poll_busyness=true     #enable
+
+    $ meson configure -Denable_lcore_poll_busyness=false    #disable
+
+At run time
+^^^^^^^^^^^
+
+Support can also be enabled/disabled during runtime (if the meson option is
+enabled at compile time). Disabling at runtime comes at the cost of an additional
+branch, however no additional function calls are performed.
+
+To enable/disable support at runtime, a call can be made to the appropriately
+telemetry endpoint.
+
+Disable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_disable
+    {"/eal/lcore/poll_busyness_disable": {"poll_busyness_enabled": 0}}
+
+Enable::
+
+    $ ./usertools/dpdk-telemetry.py
+    --> /eal/lcore/poll_busyness_enable
+    {"/eal/lcore/poll_busyness_enable": {"poll_busyness_enabled": 1}}
-- 
2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-14  9:29   ` [PATCH v7 1/4] eal: add " Kevin Laatz
@ 2022-09-14 14:30     ` Stephen Hemminger
  2022-09-16 12:35       ` Kevin Laatz
  2022-09-19 10:19     ` Konstantin Ananyev
  2022-09-30 22:13     ` Mattias Rönnblom
  2 siblings, 1 reply; 87+ messages in thread
From: Stephen Hemminger @ 2022-09-14 14:30 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: dev, anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On Wed, 14 Sep 2022 10:29:26 +0100
Kevin Laatz <kevin.laatz@intel.com> wrote:

> +struct lcore_poll_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +

For api's always prefer to use fix size types.
Is there any reason the poll_busyness values could be negative.
If not please use unsigned types.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
                     ` (3 preceding siblings ...)
  2022-09-14  9:29   ` [PATCH v7 4/4] doc: add howto guide " Kevin Laatz
@ 2022-09-14 14:33   ` Stephen Hemminger
  2022-09-16 12:35     ` Kevin Laatz
  2022-10-05 13:44   ` Kevin Laatz
  5 siblings, 1 reply; 87+ messages in thread
From: Stephen Hemminger @ 2022-09-14 14:33 UTC (permalink / raw)
  To: Kevin Laatz; +Cc: dev, anatoly.burakov

On Wed, 14 Sep 2022 10:29:25 +0100
Kevin Laatz <kevin.laatz@intel.com> wrote:

> Currently, there is no way to measure lcore polling busyness in a passive
> way, without any modifications to the application. This patchset adds a new
> EAL API that will be able to passively track core polling busyness. As part
> of the set, new telemetry endpoints are added to read the generate metrics.

How much does measuring busyness impact performance??

In the past, calling rte_rdsc() would slow down packet rate
because it is stops CPU pipeline. Maybe better on more modern
processors, haven't measured it lately.

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-14 14:30     ` Stephen Hemminger
@ 2022-09-16 12:35       ` Kevin Laatz
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-16 12:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On 14/09/2022 15:30, Stephen Hemminger wrote:
> On Wed, 14 Sep 2022 10:29:26 +0100
> Kevin Laatz <kevin.laatz@intel.com> wrote:
>
>> +struct lcore_poll_telemetry {
>> +	int poll_busyness;
>> +	/**< Calculated poll busyness (gets set/returned by the API) */
>> +	int raw_poll_busyness;
>> +	/**< Calculated poll busyness times 100. */
>> +	uint64_t interval_ts;
>> +	/**< when previous telemetry interval started */
>> +	uint64_t empty_cycles;
>> +	/**< empty cycle count since last interval */
>> +	uint64_t last_poll_ts;
>> +	/**< last poll timestamp */
>> +	bool last_empty;
>> +	/**< if last poll was empty */
>> +	unsigned int contig_poll_cnt;
>> +	/**< contiguous (always empty/non empty) poll counter */
>> +} __rte_cache_aligned;
>> +
> For api's always prefer to use fix size types.
> Is there any reason the poll_busyness values could be negative.
> If not please use unsigned types.

We use -1 to indicate the core is "inactive" or a "non-polling" core.

These are cores that have either a) never called the timestamp macro, or 
b) haven't called the timestamp macro for some time and have therefore 
been marked as "inactive" until they next call the timestamp macro.


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-09-14 14:33   ` [PATCH v7 0/4] Add lcore poll busyness telemetry Stephen Hemminger
@ 2022-09-16 12:35     ` Kevin Laatz
  2022-09-16 14:10       ` Kevin Laatz
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-09-16 12:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, anatoly.burakov

On 14/09/2022 15:33, Stephen Hemminger wrote:
> On Wed, 14 Sep 2022 10:29:25 +0100
> Kevin Laatz <kevin.laatz@intel.com> wrote:
>
>> Currently, there is no way to measure lcore polling busyness in a passive
>> way, without any modifications to the application. This patchset adds a new
>> EAL API that will be able to passively track core polling busyness. As part
>> of the set, new telemetry endpoints are added to read the generate metrics.
> How much does measuring busyness impact performance??
>
> In the past, calling rte_rdsc() would slow down packet rate
> because it is stops CPU pipeline. Maybe better on more modern
> processors, haven't measured it lately.

Hi Stephen,

I've run some 0.001% loss tests using 2x 100G ports, with 64B packets 
using testpmd for forwarding. Those tests show a ~2.7% performance 
impact when the lcore poll busyness feature is enabled vs compile-time 
disabled.
Applications with more compute intensive workloads should see less 
performance impact since the proportion of time spent time-stamping will 
be smaller.

In addition, a performance autotest has been added in this patchset 
which measures the cycles cost of calling the timestamp macro. Please 
feel free to test it on your system (lcore_poll_busyness_perf_autotest).

-Kevin

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-09-16 12:35     ` Kevin Laatz
@ 2022-09-16 14:10       ` Kevin Laatz
  0 siblings, 0 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-16 14:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev, anatoly.burakov

[-- Attachment #1: Type: text/plain, Size: 1520 bytes --]

On 16/09/2022 13:35, Kevin Laatz wrote:
> On 14/09/2022 15:33, Stephen Hemminger wrote:
>> On Wed, 14 Sep 2022 10:29:25 +0100
>> Kevin Laatz <kevin.laatz@intel.com> wrote:
>>
>>> Currently, there is no way to measure lcore polling busyness in a 
>>> passive
>>> way, without any modifications to the application. This patchset 
>>> adds a new
>>> EAL API that will be able to passively track core polling busyness. 
>>> As part
>>> of the set, new telemetry endpoints are added to read the generate 
>>> metrics.
>> How much does measuring busyness impact performance??
>>
>> In the past, calling rte_rdsc() would slow down packet rate
>> because it is stops CPU pipeline. Maybe better on more modern
>> processors, haven't measured it lately.
>
> Hi Stephen,
>
> I've run some 0.001% loss tests using 2x 100G ports, with 64B packets 
> using testpmd for forwarding. Those tests show a ~2.7% performance 
> impact when the lcore poll busyness feature is enabled vs compile-time 
> disabled.
> Applications with more compute intensive workloads should see less 
> performance impact since the proportion of time spent time-stamping 
> will be smaller.
>
> In addition, a performance autotest has been added in this patchset 
> which measures the cycles cost of calling the timestamp macro. Please 
> feel free to test it on your system (lcore_poll_busyness_perf_autotest).
>
Worth mentioning as well, is that when lcore poll busyness is enabled at 
compile-time and disabled at run-time, we see *zero *performance impact.

[-- Attachment #2: Type: text/html, Size: 2346 bytes --]

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-14  9:29   ` [PATCH v7 1/4] eal: add " Kevin Laatz
  2022-09-14 14:30     ` Stephen Hemminger
@ 2022-09-19 10:19     ` Konstantin Ananyev
  2022-09-22 17:14       ` Kevin Laatz
  2022-09-30 22:13     ` Mattias Rönnblom
  2 siblings, 1 reply; 87+ messages in thread
From: Konstantin Ananyev @ 2022-09-19 10:19 UTC (permalink / raw)
  To: Kevin Laatz, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

Hi everyone,

> 
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.
> 
> The poll busyness is calculated by relying on the fact that most DPDK API's
> will poll for work (packets, completions, eventdev events, etc). Empty
> polls can be counted as "idle", while non-empty polls can be counted as
> busy. To measure lcore poll busyness, we simply call the telemetry
> timestamping function with the number of polls a particular code section
> has processed, and count the number of cycles we've spent processing empty
> bursts. The more empty bursts we encounter, the less cycles we spend in
> "busy" state, and the less core poll busyness will be reported.
> 
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore poll busyness timestamping calls:
> 
> - All major driver API's:
>   - ethdev
>   - cryptodev
>   - compressdev
>   - regexdev
>   - bbdev
>   - rawdev
>   - eventdev
>   - dmadev
> - Some additional libraries:
>   - ring
>   - distributor
> 
> To avoid performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and a call to timestamping function is wrapped
> into a macro, so that whenever telemetry is disabled, it only takes one
> additional branch and no function calls are performed. It is disabled at
> compile time by default.
> 
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.

As was already mentioned  by other reviewers, it would be much better
to let application itself decide when it is idle and when it is busy.
With current approach even for constant polling run-to-completion model there
are plenty of opportunities to get things wrong and provide misleading statistics.
My special concern - inserting it into ring dequeue code.
Ring is used for various different things, not only pass packets between threads (mempool, etc.).
Blindly assuming that ring dequeue returns empty means idle cycles seams wrong to me.
Which make me wonder should we really hard-code these calls into DPDK core functions?
If you still like to introduce such stats, might be better to implement it via callback mechanism.
As I remember nearly all our drivers (net, crypto, etc.) do support it.
That way our generic code   will remain unaffected, plus user will have ability to enable/disable
it on a per device basis.
 
Konstantin

> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> ---
> v7:
>   * Rename funcs, vars, files to include "poll" where missing.
> 
> v5:
>   * Fix Windows build
>   * Make lcore_telemetry_free() an internal interface
>   * Minor cleanup
> 
> v4:
>   * Fix doc build
>   * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
>   * Make enable/disable read and write atomic
>   * Change rte_lcore_poll_busyness_enabled_set() param to bool
>   * Move mem alloc from enable/disable to init/cleanup
>   * Other minor fixes
> 
> v3:
>   * Fix missed renaming to poll busyness
>   * Fix clang compilation
>   * Fix arm compilation
> 
> v2:
>   * Use rte_get_tsc_hz() to adjust the telemetry period
>   * Rename to reflect polling busyness vs general busyness
>   * Fix segfault when calling telemetry timestamp from an unregistered
>     non-EAL thread.
>   * Minor cleanup
> ---
>  config/meson.build                            |   1 +
>  config/rte_config.h                           |   1 +
>  lib/bbdev/rte_bbdev.h                         |  17 +-
>  lib/compressdev/rte_compressdev.c             |   2 +
>  lib/cryptodev/rte_cryptodev.h                 |   2 +
>  lib/distributor/rte_distributor.c             |  21 +-
>  lib/distributor/rte_distributor_single.c      |  14 +-
>  lib/dmadev/rte_dmadev.h                       |  15 +-
>  .../common/eal_common_lcore_poll_telemetry.c  | 303 ++++++++++++++++++
>  lib/eal/common/meson.build                    |   1 +
>  lib/eal/freebsd/eal.c                         |   1 +
>  lib/eal/include/rte_lcore.h                   |  85 ++++-
>  lib/eal/linux/eal.c                           |   1 +
>  lib/eal/meson.build                           |   3 +
>  lib/eal/version.map                           |   7 +
>  lib/ethdev/rte_ethdev.h                       |   2 +
>  lib/eventdev/rte_eventdev.h                   |  10 +-
>  lib/rawdev/rte_rawdev.c                       |   6 +-
>  lib/regexdev/rte_regexdev.h                   |   5 +-
>  lib/ring/rte_ring_elem_pvt.h                  |   1 +
>  meson_options.txt                             |   2 +
>  21 files changed, 475 insertions(+), 25 deletions(-)
>  create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c
> 
> diff --git a/config/meson.build b/config/meson.build
> index 7f7b6c92fd..d5954a059c 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -297,6 +297,7 @@ endforeach
>  dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
>  dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
>  dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
> +dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
>  # values which have defaults which may be overridden
>  dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
>  dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
> diff --git a/config/rte_config.h b/config/rte_config.h
> index ae56a86394..86ac3b8a6e 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,7 @@
>  #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>  #define RTE_BACKTRACE 1
>  #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
> 
>  /* bsd module defines */
>  #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
> diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
> index b88c88167e..d6a98d3f11 100644
> --- a/lib/bbdev/rte_bbdev.h
> +++ b/lib/bbdev/rte_bbdev.h
> @@ -28,6 +28,7 @@ extern "C" {
>  #include <stdbool.h>
> 
>  #include <rte_cpuflags.h>
> +#include <rte_lcore.h>
> 
>  #include "rte_bbdev_op.h"
> 
> @@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> @@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
> 
> @@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> @@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
>  {
>  	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>  	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /** Definitions of device event types */
> diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
> index 22c438f2dd..fabc495a8e 100644
> --- a/lib/compressdev/rte_compressdev.c
> +++ b/lib/compressdev/rte_compressdev.c
> @@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  	nb_ops = (*dev->dequeue_burst)
>  			(dev->data->queue_pairs[qp_id], ops, nb_ops);
> 
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +
>  	return nb_ops;
>  }
> 
> diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
> index 56f459c6a0..a5b1d7c594 100644
> --- a/lib/cryptodev/rte_cryptodev.h
> +++ b/lib/cryptodev/rte_cryptodev.h
> @@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
>  	}
>  #endif
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
>  	return nb_ops;
>  }
> 
> diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
> index 3035b7a999..428157ec64 100644
> --- a/lib/distributor/rte_distributor.c
> +++ b/lib/distributor/rte_distributor.c
> @@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
> 
>  		while (rte_rdtsc() < t)
>  			rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>  	}
> 
>  	/*
> @@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
> 
>  	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
>  		if (return_count <= 1) {
> +			uint16_t cnt;
>  			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
> -				worker_id, return_count ? oldpkt[0] : NULL);
> -			return (pkts[0]) ? 1 : 0;
> -		} else
> -			return -EINVAL;
> +								 worker_id,
> +								 return_count ? oldpkt[0] : NULL);
> +			cnt = (pkts[0] != NULL) ? 1 : 0;
> +			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
> +			return cnt;
> +		}
> +		return -EINVAL;
>  	}
> 
>  	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
> 
> -	count = rte_distributor_poll_pkt(d, worker_id, pkts);
> -	while (count == -1) {
> +	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
>  		uint64_t t = rte_rdtsc() + 100;
> 
>  		while (rte_rdtsc() < t)
>  			rte_pause();
> 
> -		count = rte_distributor_poll_pkt(d, worker_id, pkts);
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>  	}
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
>  	return count;
>  }
> 
> diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
> index 2c77ac454a..4c916c0fd2 100644
> --- a/lib/distributor/rte_distributor_single.c
> +++ b/lib/distributor/rte_distributor_single.c
> @@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
>  	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
>  	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
>  			| RTE_DISTRIB_GET_BUF;
> -	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
> -		==, 0, __ATOMIC_RELAXED);
> +
> +	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
> +			& RTE_DISTRIB_FLAGS_MASK) != 0) {
> +		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
> 
>  	/* Sync with distributor on GET_BUF flag. */
>  	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
> @@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
>  {
>  	struct rte_mbuf *ret;
>  	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
> -	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
> +	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
>  		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
>  	return ret;
>  }
> 
> diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
> index e7f992b734..3e27e0fd2b 100644
> --- a/lib/dmadev/rte_dmadev.h
> +++ b/lib/dmadev/rte_dmadev.h
> @@ -149,6 +149,7 @@
>  #include <rte_bitops.h>
>  #include <rte_common.h>
>  #include <rte_compat.h>
> +#include <rte_lcore.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>  		  uint16_t *last_idx, bool *has_error)
>  {
>  	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>  	bool err;
> 
>  #ifdef RTE_DMADEV_DEBUG
> @@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>  		has_error = &err;
> 
>  	*has_error = false;
> -	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> -				 has_error);
> +	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> +				   has_error);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> @@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>  			 enum rte_dma_status_code *status)
>  {
>  	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
> 
>  #ifdef RTE_DMADEV_DEBUG
>  	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
> @@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>  	if (last_idx == NULL)
>  		last_idx = &idx;
> 
> -	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
> +	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
>  					last_idx, status);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  /**
> diff --git a/lib/eal/common/eal_common_lcore_poll_telemetry.c b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> new file mode 100644
> index 0000000000..d97996e85f
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> @@ -0,0 +1,303 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#include <rte_telemetry.h>
> +#endif
> +
> +rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +
> +struct lcore_poll_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +
> +static struct lcore_poll_telemetry *telemetry_data;
> +
> +#define LCORE_POLL_BUSYNESS_MAX 100
> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
> +#define LCORE_POLL_BUSYNESS_MIN 0
> +
> +#define SMOOTH_COEFF 5
> +#define STATE_CHANGE_OPT 32
> +
> +static void lcore_config_init(void)
> +{
> +	int lcore_id;
> +
> +	RTE_LCORE_FOREACH(lcore_id) {
> +		struct lcore_poll_telemetry *td = &telemetry_data[lcore_id];
> +
> +		td->interval_ts = 0;
> +		td->last_poll_ts = 0;
> +		td->empty_cycles = 0;
> +		td->last_empty = true;
> +		td->contig_poll_cnt = 0;
> +		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
> +		td->raw_poll_busyness = 0;
> +	}
> +}
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id)
> +{
> +	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
> +	/* if more than 1000 busyness periods have passed, this core is considered inactive */
> +	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
> +	struct lcore_poll_telemetry *tdata;
> +
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return -EINVAL;
> +	tdata = &telemetry_data[lcore_id];
> +
> +	/* if the lcore is not active */
> +	if (tdata->interval_ts == 0)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +	/* if the core hasn't been active in a while */
> +	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +
> +	/* this core is active, report its poll busyness */
> +	return telemetry_data[lcore_id].poll_busyness;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable)
> +{
> +	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_poll_telemetry_enabled,
> +			(int)!enable, (int)enable);
> +
> +	/* Reset counters on successful disable */
> +	if (set && !enable)
> +		lcore_config_init();
> +}
> +
> +static inline int calc_raw_poll_busyness(const struct lcore_poll_telemetry *tdata,
> +				    const uint64_t empty, const uint64_t total)
> +{
> +	/*
> +	 * We don't want to use floating point math here, but we want for our poll
> +	 * busyness to react smoothly to sudden changes, while still keeping the
> +	 * accuracy and making sure that over time the average follows poll busyness
> +	 * as measured just-in-time. Therefore, we will calculate the average poll
> +	 * busyness using integer math, but shift the decimal point two places
> +	 * to the right, so that 100.0 becomes 10000. This allows us to report
> +	 * integer values (0..100) while still allowing ourselves to follow the
> +	 * just-in-time measurements when we calculate our averages.
> +	 */
> +	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
> +
> +	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
> +
> +	/* calculate rate of idle cycles, times 100 */
> +	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
> +
> +	/* smoothen the idleness */
> +	const int smoothened_idle =
> +			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
> +
> +	/* convert idleness to poll busyness */
> +	return max_raw_idle - smoothened_idle;
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +	struct lcore_poll_telemetry *tdata;
> +	const bool empty = nb_rx == 0;
> +	uint64_t diff_int, diff_last;
> +	bool last_empty;
> +
> +	/* This telemetry is not supported for unregistered non-EAL threads */
> +	if (lcore_id >= RTE_MAX_LCORE) {
> +		RTE_LOG(DEBUG, EAL,
> +				"Lcore telemetry not supported on unregistered non-EAL thread %d",
> +				lcore_id);
> +		return;
> +	}
> +
> +	tdata = &telemetry_data[lcore_id];
> +	last_empty = tdata->last_empty;
> +
> +	/* optimization: don't do anything if status hasn't changed */
> +	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
> +		return;
> +	/* status changed or we're waiting for too long, reset counter */
> +	tdata->contig_poll_cnt = 0;
> +
> +	cur_tsc = rte_rdtsc();
> +
> +	interval_ts = tdata->interval_ts;
> +	empty_cycles = tdata->empty_cycles;
> +	last_poll_ts = tdata->last_poll_ts;
> +
> +	diff_int = cur_tsc - interval_ts;
> +	diff_last = cur_tsc - last_poll_ts;
> +
> +	/* is this the first time we're here? */
> +	if (interval_ts == 0) {
> +		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
> +		tdata->raw_poll_busyness = 0;
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->contig_poll_cnt = 0;
> +		goto end;
> +	}
> +
> +	/* update the empty counter if we got an empty poll earlier */
> +	if (last_empty)
> +		empty_cycles += diff_last;
> +
> +	/* have we passed the interval? */
> +	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
> +	if (diff_int > interval) {
> +		int raw_poll_busyness;
> +
> +		/* get updated poll_busyness value */
> +		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
> +
> +		/* set a new interval, reset empty counter */
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->raw_poll_busyness = raw_poll_busyness;
> +		/* bring poll busyness back to 0..100 range, biased to round up */
> +		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;
> +	} else
> +		/* we may have updated empty counter */
> +		tdata->empty_cycles = empty_cycles;
> +
> +end:
> +	/* update status for next poll */
> +	tdata->last_poll_ts = cur_tsc;
> +	tdata->last_empty = empty;
> +}
> +
> +static int
> +lcore_poll_busyness_enable(const char *cmd __rte_unused,
> +		      const char *params __rte_unused,
> +		      struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(true);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_poll_busyness_disable(const char *cmd __rte_unused,
> +		       const char *params __rte_unused,
> +		       struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(false);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_handle_poll_busyness(const char *cmd __rte_unused,
> +		      const char *params __rte_unused, struct rte_tel_data *d)
> +{
> +	char corenum[64];
> +	int i;
> +
> +	rte_tel_data_start_dict(d);
> +
> +	RTE_LCORE_FOREACH(i) {
> +		if (!rte_lcore_is_enabled(i))
> +			continue;
> +		snprintf(corenum, sizeof(corenum), "%d", i);
> +		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +eal_lcore_poll_telemetry_free(void)
> +{
> +	if (telemetry_data != NULL) {
> +		free(telemetry_data);
> +		telemetry_data = NULL;
> +	}
> +}
> +
> +RTE_INIT(lcore_init_poll_telemetry)
> +{
> +	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
> +	if (telemetry_data == NULL)
> +		rte_panic("Could not init lcore telemetry data: Out of memory\n");
> +
> +	lcore_config_init();
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
> +				   "return percentage poll busyness of cores");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
> +				   "enable lcore poll busyness measurement");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
> +				   "disable lcore poll busyness measurement");
> +
> +	rte_atomic32_set(&__rte_lcore_poll_telemetry_enabled, true);
> +}
> +
> +#else
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
> +{
> +	return -ENOTSUP;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return -ENOTSUP;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
> +{
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
> +{
> +}
> +
> +void eal_lcore_poll_telemetry_free(void)
> +{
> +}
> +
> +#endif
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..e5741ce9f9 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -17,6 +17,7 @@ sources += files(
>          'eal_common_hexdump.c',
>          'eal_common_interrupts.c',
>          'eal_common_launch.c',
> +        'eal_common_lcore_poll_telemetry.c',
>          'eal_common_lcore.c',
>          'eal_common_log.c',
>          'eal_common_mcfg.c',
> diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
> index 26fbc91b26..92c4af9c28 100644
> --- a/lib/eal/freebsd/eal.c
> +++ b/lib/eal/freebsd/eal.c
> @@ -895,6 +895,7 @@ rte_eal_cleanup(void)
>  	rte_mp_channel_cleanup();
>  	rte_trace_save();
>  	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>  	/* after this point, any DPDK pointers will become dangling */
>  	rte_eal_memory_detach();
>  	rte_eal_alarm_cleanup();
> diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
> index b598e1b9ec..2191c2473a 100644
> --- a/lib/eal/include/rte_lcore.h
> +++ b/lib/eal/include/rte_lcore.h
> @@ -16,6 +16,7 @@
>  #include <rte_eal.h>
>  #include <rte_launch.h>
>  #include <rte_thread.h>
> +#include <rte_atomic.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -415,9 +416,91 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
>  		const pthread_attr_t *attr,
>  		void *(*start_routine)(void *), void *arg);
> 
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read poll busyness value corresponding to an lcore.
> + *
> + * @param lcore_id
> + *   Lcore to read poll busyness value for.
> + * @return
> + *   - value between 0 and 100 on success
> + *   - -1 if lcore is not active
> + *   - -EINVAL if lcore is invalid
> + *   - -ENOMEM if not enough memory available
> + *   - -ENOTSUP if not supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness(unsigned int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Check if lcore poll busyness telemetry is enabled.
> + *
> + * @return
> + *   - true if lcore telemetry is enabled
> + *   - false if lcore telemetry is disabled
> + *   - -ENOTSUP if not lcore telemetry supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness_enabled(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Enable or disable poll busyness telemetry.
> + *
> + * @param enable
> + *   1 to enable, 0 to disable
> + */
> +__rte_experimental
> +void
> +rte_lcore_poll_busyness_enabled_set(bool enable);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Lcore poll busyness timestamping function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +__rte_experimental
> +void
> +__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
> +
> +/** @internal lcore telemetry enabled status */
> +extern rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +/** @internal free memory allocated for lcore telemetry */
> +void
> +eal_lcore_poll_telemetry_free(void);
> +
> +/**
> + * Call lcore poll busyness timestamp function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
> +	int enabled = (int)rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);	\
> +	if (enabled)								\
> +		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
> +} while (0)
> +#else
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> 
> -
>  #endif /* _RTE_LCORE_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 37d29643a5..5e81352a81 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -1364,6 +1364,7 @@ rte_eal_cleanup(void)
>  	rte_mp_channel_cleanup();
>  	rte_trace_save();
>  	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>  	/* after this point, any DPDK pointers will become dangling */
>  	rte_eal_memory_detach();
>  	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/meson.build b/lib/eal/meson.build
> index 056beb9461..2fb90d446b 100644
> --- a/lib/eal/meson.build
> +++ b/lib/eal/meson.build
> @@ -25,6 +25,9 @@ subdir(arch_subdir)
>  deps += ['kvargs']
>  if not is_windows
>      deps += ['telemetry']
> +else
> +	# core poll busyness telemetry depends on telemetry library
> +	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
>  endif
>  if dpdk_conf.has('RTE_USE_LIBBSD')
>      ext_deps += libbsd
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 1f293e768b..3275d1fac4 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -424,6 +424,13 @@ EXPERIMENTAL {
>  	rte_thread_self;
>  	rte_thread_set_affinity_by_id;
>  	rte_thread_set_priority;
> +
> +	# added in 22.11
> +	__rte_lcore_poll_busyness_timestamp;
> +	__rte_lcore_poll_telemetry_enabled;
> +	rte_lcore_poll_busyness;
> +	rte_lcore_poll_busyness_enabled;
> +	rte_lcore_poll_busyness_enabled_set;
>  };
> 
>  INTERNAL {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index de9e970d4d..4c8113f31f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>  #endif
> 
>  	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
>  	return nb_rx;
>  }
> 
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 6a6f6ea4c1..a65b3c7c85 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>  			uint16_t nb_events, uint64_t timeout_ticks)
>  {
>  	const struct rte_event_fp_ops *fp_ops;
> +	uint16_t nb_evts;
>  	void *port;
> 
>  	fp_ops = &rte_event_fp_ops[dev_id];
> @@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>  	 * requests nb_events as const one
>  	 */
>  	if (nb_events == 1)
> -		return (fp_ops->dequeue)(port, ev, timeout_ticks);
> +		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
>  	else
> -		return (fp_ops->dequeue_burst)(port, ev, nb_events,
> -					       timeout_ticks);
> +		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
> +					timeout_ticks);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
> +	return nb_evts;
>  }
> 
>  #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
> diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
> index 2f0a4f132e..1cba53270a 100644
> --- a/lib/rawdev/rte_rawdev.c
> +++ b/lib/rawdev/rte_rawdev.c
> @@ -16,6 +16,7 @@
>  #include <rte_common.h>
>  #include <rte_malloc.h>
>  #include <rte_telemetry.h>
> +#include <rte_lcore.h>
> 
>  #include "rte_rawdev.h"
>  #include "rte_rawdev_pmd.h"
> @@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
>  			   rte_rawdev_obj_t context)
>  {
>  	struct rte_rawdev *dev;
> +	int nb_ops;
> 
>  	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
>  	dev = &rte_rawdevs[dev_id];
> 
>  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
> -	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>  }
> 
>  int
> diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
> index 3bce8090f6..8caaed502f 100644
> --- a/lib/regexdev/rte_regexdev.h
> +++ b/lib/regexdev/rte_regexdev.h
> @@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  			   struct rte_regex_ops **ops, uint16_t nb_ops)
>  {
>  	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
> +	uint16_t deq_ops;
>  #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
>  	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
>  	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
> @@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>  		return -EINVAL;
>  	}
>  #endif
> -	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
> +	return deq_ops;
>  }
> 
>  #ifdef __cplusplus
> diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
> index 83788c56e6..cf2370c238 100644
> --- a/lib/ring/rte_ring_elem_pvt.h
> +++ b/lib/ring/rte_ring_elem_pvt.h
> @@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
>  end:
>  	if (available != NULL)
>  		*available = entries - n;
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
>  	return n;
>  }
> 
> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..9b20a36fdb 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>         'Install headers to build drivers.')
>  option('enable_kmods', type: 'boolean', value: false, description:
>         'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
> +       'enable collection of lcore poll busyness telemetry')
>  option('examples', type: 'string', value: '', description:
>         'Comma-separated list of examples to build by default')
>  option('flexran_sdk', type: 'string', value: '', description:
> --
> 2.31.1


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-19 10:19     ` Konstantin Ananyev
@ 2022-09-22 17:14       ` Kevin Laatz
  2022-09-26  9:37         ` Konstantin Ananyev
  0 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-09-22 17:14 UTC (permalink / raw)
  To: Konstantin Ananyev, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On 19/09/2022 11:19, Konstantin Ananyev wrote:
> Hi everyone,
>
>> From: Anatoly Burakov <anatoly.burakov@intel.com>
>>
>> Currently, there is no way to measure lcore poll busyness in a passive way,
>> without any modifications to the application. This patch adds a new EAL API
>> that will be able to passively track core polling busyness.
>>
>> The poll busyness is calculated by relying on the fact that most DPDK API's
>> will poll for work (packets, completions, eventdev events, etc). Empty
>> polls can be counted as "idle", while non-empty polls can be counted as
>> busy. To measure lcore poll busyness, we simply call the telemetry
>> timestamping function with the number of polls a particular code section
>> has processed, and count the number of cycles we've spent processing empty
>> bursts. The more empty bursts we encounter, the less cycles we spend in
>> "busy" state, and the less core poll busyness will be reported.
>>
>> In order for all of the above to work without modifications to the
>> application, the library code needs to be instrumented with calls to the
>> lcore telemetry busyness timestamping function. The following parts of DPDK
>> are instrumented with lcore poll busyness timestamping calls:
>>
>> - All major driver API's:
>>    - ethdev
>>    - cryptodev
>>    - compressdev
>>    - regexdev
>>    - bbdev
>>    - rawdev
>>    - eventdev
>>    - dmadev
>> - Some additional libraries:
>>    - ring
>>    - distributor
>>
>> To avoid performance impact from having lcore telemetry support, a global
>> variable is exported by EAL, and a call to timestamping function is wrapped
>> into a macro, so that whenever telemetry is disabled, it only takes one
>> additional branch and no function calls are performed. It is disabled at
>> compile time by default.
>>
>> This patch also adds a telemetry endpoint to report lcore poll busyness, as
>> well as telemetry endpoints to enable/disable lcore telemetry. A
>> documentation entry has been added to the howto guides to explain the usage
>> of the new telemetry endpoints and API.
> As was already mentioned  by other reviewers, it would be much better
> to let application itself decide when it is idle and when it is busy.
> With current approach even for constant polling run-to-completion model there
> are plenty of opportunities to get things wrong and provide misleading statistics.
> My special concern - inserting it into ring dequeue code.
> Ring is used for various different things, not only pass packets between threads (mempool, etc.).
> Blindly assuming that ring dequeue returns empty means idle cycles seams wrong to me.
> Which make me wonder should we really hard-code these calls into DPDK core functions?
> If you still like to introduce such stats, might be better to implement it via callback mechanism.
> As I remember nearly all our drivers (net, crypto, etc.) do support it.
> That way our generic code   will remain unaffected, plus user will have ability to enable/disable
> it on a per device basis.

Thanks for your feedback, Konstantin.

You are right in saying that this approach won't be 100% suitable for 
all use-cases, but should be suitable for the majority of applications. 
It's worth keeping in mind that this feature is compile-time disabled by 
default, so there is no impact to any application/user that does not 
wish to use this, for example applications where this type of busyness 
is not useful, or for applications that already use other mechanisms to 
report similar telemetry. However, the upside for applications that do 
wish to use this is that there are no code changes required (for the 
most part), the feature simply needs to be enabled at compile-time via 
the meson option.

In scenarios where contextual awareness of the application is needed in 
order to report more accurate "busyness", the 
"RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark 
sections of code as "busy" or "idle". This way, the application can 
assume control of determining the poll busyness of its lcores while 
leveraging the telemetry hooks adding in this patchset.

We did initially consider implementing this via callbacks, however we 
found this approach to have 2 main drawbacks:
1. Application changes are required for all applications wanting to 
report this telemetry - rather than the majority getting it for free.
2. Ring does not have callback support, meaning pipelined applications 
could not report lcore poll busyness telemetry with this approach. 
Eventdev is another driver which would be completely missed with this 
approach.

BR,
Kevin




^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-22 17:14       ` Kevin Laatz
@ 2022-09-26  9:37         ` Konstantin Ananyev
  2022-09-29 12:41           ` Kevin Laatz
  0 siblings, 1 reply; 87+ messages in thread
From: Konstantin Ananyev @ 2022-09-26  9:37 UTC (permalink / raw)
  To: Kevin Laatz, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev


Hi Kevin,

> >> Currently, there is no way to measure lcore poll busyness in a passive way,
> >> without any modifications to the application. This patch adds a new EAL API
> >> that will be able to passively track core polling busyness.
> >>
> >> The poll busyness is calculated by relying on the fact that most DPDK API's
> >> will poll for work (packets, completions, eventdev events, etc). Empty
> >> polls can be counted as "idle", while non-empty polls can be counted as
> >> busy. To measure lcore poll busyness, we simply call the telemetry
> >> timestamping function with the number of polls a particular code section
> >> has processed, and count the number of cycles we've spent processing empty
> >> bursts. The more empty bursts we encounter, the less cycles we spend in
> >> "busy" state, and the less core poll busyness will be reported.
> >>
> >> In order for all of the above to work without modifications to the
> >> application, the library code needs to be instrumented with calls to the
> >> lcore telemetry busyness timestamping function. The following parts of DPDK
> >> are instrumented with lcore poll busyness timestamping calls:
> >>
> >> - All major driver API's:
> >>    - ethdev
> >>    - cryptodev
> >>    - compressdev
> >>    - regexdev
> >>    - bbdev
> >>    - rawdev
> >>    - eventdev
> >>    - dmadev
> >> - Some additional libraries:
> >>    - ring
> >>    - distributor
> >>
> >> To avoid performance impact from having lcore telemetry support, a global
> >> variable is exported by EAL, and a call to timestamping function is wrapped
> >> into a macro, so that whenever telemetry is disabled, it only takes one
> >> additional branch and no function calls are performed. It is disabled at
> >> compile time by default.
> >>
> >> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> >> well as telemetry endpoints to enable/disable lcore telemetry. A
> >> documentation entry has been added to the howto guides to explain the usage
> >> of the new telemetry endpoints and API.
> > As was already mentioned  by other reviewers, it would be much better
> > to let application itself decide when it is idle and when it is busy.
> > With current approach even for constant polling run-to-completion model there
> > are plenty of opportunities to get things wrong and provide misleading statistics.
> > My special concern - inserting it into ring dequeue code.
> > Ring is used for various different things, not only pass packets between threads (mempool, etc.).
> > Blindly assuming that ring dequeue returns empty means idle cycles seams wrong to me.
> > Which make me wonder should we really hard-code these calls into DPDK core functions?
> > If you still like to introduce such stats, might be better to implement it via callback mechanism.
> > As I remember nearly all our drivers (net, crypto, etc.) do support it.
> > That way our generic code   will remain unaffected, plus user will have ability to enable/disable
> > it on a per device basis.
> 
> Thanks for your feedback, Konstantin.
> 
> You are right in saying that this approach won't be 100% suitable for
> all use-cases, but should be suitable for the majority of applications.

First of all - could you explain how did you measure what is the 'majority' of DPDK applications?
And how did you conclude that it definitely work for all the apps in that 'majority'?
Second what bother me with that approach - I don't see s clear and deterministic way
for the user to understand would that stats work properly for his app or not.
(except manually ananlyzing his app code).

> It's worth keeping in mind that this feature is compile-time disabled by
> default, so there is no impact to any application/user that does not
> wish to use this, for example applications where this type of busyness
> is not useful, or for applications that already use other mechanisms to
> report similar telemetry.

Not sure that adding in new compile-time option disabled by default is a good thing...
For me it would be much more preferable if we'll go through a more 'standard' way here:
a) define clear API to enable/disable/collect/report such type of stats.
b) use some of our sample apps to demonstrate how to use it properly with user-specific code.
c) if needed implement some 'silent' stats collection for limited scope of apps via callbacks -
let say for run-to-completion apps that do use ether and crypto devs only.

 However, the upside for applications that do
> wish to use this is that there are no code changes required (for the
> most part), the feature simply needs to be enabled at compile-time via
> the meson option.
> 
> In scenarios where contextual awareness of the application is needed in
> order to report more accurate "busyness", the
> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
> sections of code as "busy" or "idle". This way, the application can
> assume control of determining the poll busyness of its lcores while
> leveraging the telemetry hooks adding in this patchset.
> 
> We did initially consider implementing this via callbacks, however we
> found this approach to have 2 main drawbacks:
> 1. Application changes are required for all applications wanting to
> report this telemetry - rather than the majority getting it for free.

Didn't get it - why callbacks approach would require user-app changes?
In other situations - rte_power callbacks, pdump, etc. it works transparent to
user-leve code.  
Why it can't be done here in a similar way?

> 2. Ring does not have callback support, meaning pipelined applications
> could not report lcore poll busyness telemetry with this approach.

That's another big concern that I have:
Why you consider that all rings will be used for a pipilines between threads and should
always be accounted by your stats?
They could be used for dozens different purposes.
What if that ring is used for mempool, and ring_dequeue() just means we try to allocate
an object from the pool? In such case, why failing to allocate an object should mean
start of new 'idle cycle'?  

> Eventdev is another driver which would be completely missed with this
> approach.

Ok, I see two ways here:
- implement CB support for eventdev.
-meanwhile clearly document that this stats are not supported for eventdev  scenarios (yet).

> 
> 


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-26  9:37         ` Konstantin Ananyev
@ 2022-09-29 12:41           ` Kevin Laatz
  2022-09-30 12:32             ` Jerin Jacob
  2022-10-01 14:17             ` Konstantin Ananyev
  0 siblings, 2 replies; 87+ messages in thread
From: Kevin Laatz @ 2022-09-29 12:41 UTC (permalink / raw)
  To: Konstantin Ananyev, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev

On 26/09/2022 10:37, Konstantin Ananyev wrote:
> Hi Kevin,
>
>>>> Currently, there is no way to measure lcore poll busyness in a passive way,
>>>> without any modifications to the application. This patch adds a new EAL API
>>>> that will be able to passively track core polling busyness.
>>>>
>>>> The poll busyness is calculated by relying on the fact that most DPDK API's
>>>> will poll for work (packets, completions, eventdev events, etc). Empty
>>>> polls can be counted as "idle", while non-empty polls can be counted as
>>>> busy. To measure lcore poll busyness, we simply call the telemetry
>>>> timestamping function with the number of polls a particular code section
>>>> has processed, and count the number of cycles we've spent processing empty
>>>> bursts. The more empty bursts we encounter, the less cycles we spend in
>>>> "busy" state, and the less core poll busyness will be reported.
>>>>
>>>> In order for all of the above to work without modifications to the
>>>> application, the library code needs to be instrumented with calls to the
>>>> lcore telemetry busyness timestamping function. The following parts of DPDK
>>>> are instrumented with lcore poll busyness timestamping calls:
>>>>
>>>> - All major driver API's:
>>>>     - ethdev
>>>>     - cryptodev
>>>>     - compressdev
>>>>     - regexdev
>>>>     - bbdev
>>>>     - rawdev
>>>>     - eventdev
>>>>     - dmadev
>>>> - Some additional libraries:
>>>>     - ring
>>>>     - distributor
>>>>
>>>> To avoid performance impact from having lcore telemetry support, a global
>>>> variable is exported by EAL, and a call to timestamping function is wrapped
>>>> into a macro, so that whenever telemetry is disabled, it only takes one
>>>> additional branch and no function calls are performed. It is disabled at
>>>> compile time by default.
>>>>
>>>> This patch also adds a telemetry endpoint to report lcore poll busyness, as
>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
>>>> documentation entry has been added to the howto guides to explain the usage
>>>> of the new telemetry endpoints and API.
>>> As was already mentioned  by other reviewers, it would be much better
>>> to let application itself decide when it is idle and when it is busy.
>>> With current approach even for constant polling run-to-completion model there
>>> are plenty of opportunities to get things wrong and provide misleading statistics.
>>> My special concern - inserting it into ring dequeue code.
>>> Ring is used for various different things, not only pass packets between threads (mempool, etc.).
>>> Blindly assuming that ring dequeue returns empty means idle cycles seams wrong to me.
>>> Which make me wonder should we really hard-code these calls into DPDK core functions?
>>> If you still like to introduce such stats, might be better to implement it via callback mechanism.
>>> As I remember nearly all our drivers (net, crypto, etc.) do support it.
>>> That way our generic code   will remain unaffected, plus user will have ability to enable/disable
>>> it on a per device basis.
>> Thanks for your feedback, Konstantin.
>>
>> You are right in saying that this approach won't be 100% suitable for
>> all use-cases, but should be suitable for the majority of applications.
> First of all - could you explain how did you measure what is the 'majority' of DPDK applications?
> And how did you conclude that it definitely work for all the apps in that 'majority'?
> Second what bother me with that approach - I don't see s clear and deterministic way
> for the user to understand would that stats work properly for his app or not.
> (except manually ananlyzing his app code).

All of the DPDK example applications we've tested with (l2fwd, l3fwd + friends, testpmd, distributor, dmafwd) report lcore poll busyness and respond to changing traffic rates etc. We've also compared the reported busyness to similar metrics reported by other projects such as VPP and OvS, and found the reported busyness matches with a difference of +/- 1%. In addition to the DPDK example applications, we've have shared our plans with end customers and they have confirmed that the design should work with their applications.

>> It's worth keeping in mind that this feature is compile-time disabled by
>> default, so there is no impact to any application/user that does not
>> wish to use this, for example applications where this type of busyness
>> is not useful, or for applications that already use other mechanisms to
>> report similar telemetry.
> Not sure that adding in new compile-time option disabled by default is a good thing...
> For me it would be much more preferable if we'll go through a more 'standard' way here:
> a) define clear API to enable/disable/collect/report such type of stats.
> b) use some of our sample apps to demonstrate how to use it properly with user-specific code.
> c) if needed implement some 'silent' stats collection for limited scope of apps via callbacks -
> let say for run-to-completion apps that do use ether and crypto devs only.

With the compile-time option, its just one build flag for lots of applications to silently benefit from this.

>   However, the upside for applications that do
>> wish to use this is that there are no code changes required (for the
>> most part), the feature simply needs to be enabled at compile-time via
>> the meson option.
>>
>> In scenarios where contextual awareness of the application is needed in
>> order to report more accurate "busyness", the
>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>> sections of code as "busy" or "idle". This way, the application can
>> assume control of determining the poll busyness of its lcores while
>> leveraging the telemetry hooks adding in this patchset.
>>
>> We did initially consider implementing this via callbacks, however we
>> found this approach to have 2 main drawbacks:
>> 1. Application changes are required for all applications wanting to
>> report this telemetry - rather than the majority getting it for free.
> Didn't get it - why callbacks approach would require user-app changes?
> In other situations - rte_power callbacks, pdump, etc. it works transparent to
> user-leve code.
> Why it can't be done here in a similar way?

 From my understanding, the callbacks would need to be registered by the application at the very least (and the callback would have to be registered per device/pmd/lib).

>
>> 2. Ring does not have callback support, meaning pipelined applications
>> could not report lcore poll busyness telemetry with this approach.
> That's another big concern that I have:
> Why you consider that all rings will be used for a pipilines between threads and should
> always be accounted by your stats?
> They could be used for dozens different purposes.
> What if that ring is used for mempool, and ring_dequeue() just means we try to allocate
> an object from the pool? In such case, why failing to allocate an object should mean
> start of new 'idle cycle'?

Another approach could be taken here if the mempool interactions are of concern.

 From our understanding, mempool operations use the "_bulk" APIs, whereas polling operations use the "_burst" APIs. Would only timestamping on the "_burst" APIs be better here? That way the mempool interactions won't be counted towards the busyness.

Including support for pipelined applications using rings is key for a number of usecases, this was highlighted as part of the customer feedback when we shared the design.

>
>> Eventdev is another driver which would be completely missed with this
>> approach.
> Ok, I see two ways here:
> - implement CB support for eventdev.
> -meanwhile clearly document that this stats are not supported for eventdev  scenarios (yet).

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-29 12:41           ` Kevin Laatz
@ 2022-09-30 12:32             ` Jerin Jacob
  2022-10-01 14:17             ` Konstantin Ananyev
  1 sibling, 0 replies; 87+ messages in thread
From: Jerin Jacob @ 2022-09-30 12:32 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: Konstantin Ananyev, dev, anatoly.burakov, Conor Walsh,
	David Hunt, Bruce Richardson, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, Fengchengwen, Ray Kinsella,
	Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Konstantin Ananyev

On Thu, Sep 29, 2022 at 6:11 PM Kevin Laatz <kevin.laatz@intel.com> wrote:
>
> >
> >> 2. Ring does not have callback support, meaning pipelined applications
> >> could not report lcore poll busyness telemetry with this approach.
> > That's another big concern that I have:
> > Why you consider that all rings will be used for a pipilines between threads and should
> > always be accounted by your stats?
> > They could be used for dozens different purposes.
> > What if that ring is used for mempool, and ring_dequeue() just means we try to allocate
> > an object from the pool? In such case, why failing to allocate an object should mean
> > start of new 'idle cycle'?
>
> Another approach could be taken here if the mempool interactions are of concern.


Another method to solve the problem will be leveraging an existing
trace framework and leverage existing fastpath tracepoints.
Where existing lcore poll busyness could be monitored by another
application by looking at the timestamp where traces are emitted.
This also gives flexibility to add customer or application specific
tracepoint as needed. Also control enable/disable aspects of trace
points.

l2reflect is a similar problem to see latency.

The use case like above(other application need to observe the code
flow of an DPDK application) and analyse, can be implemented
as

Similar suggesiton provied for l2reflect at
https://mails.dpdk.org/archives/dev/2022-September/250583.html

I would suggest to take this path to accommodate more use case in future like
- finding CPU idle time
-latency for crypto/dmadev/eventdev enqueue to dequeue
-histogram of occupancy for different queues
etc

This would translate to
1)Adding app/proc-info style app to pull the live trace from primary process
2)Add plugin framework to operate on live trace
3)Add a plugin for this specific use case
4)If needed, a communication from secondary to primary to take action
based on live analysis
like in this case if stop the primary when latency exceeds certain limit

On the plus side,
If we move all analysis and presentation to new generic application,
your packet forwarding
logic can simply move as new fwd_engine in testpmd(see
app/test-pmd/noisy_vnf.c as a example for fwdengine)

Ideally "eal: add lcore poll busyness telemetry"[1] could converge to
this model.

[1]
https://patches.dpdk.org/project/dpdk/patch/20220914092929.1159773-2-kevin.laatz@intel.com/






>
>  From our understanding, mempool operations use the "_bulk" APIs, whereas polling operations use the "_burst" APIs. Would only timestamping on the "_burst" APIs be better here? That way the mempool interactions won't be counted towards the busyness.
>
> Including support for pipelined applications using rings is key for a number of usecases, this was highlighted as part of the customer feedback when we shared the design.
>
> >
> >> Eventdev is another driver which would be completely missed with this
> >> approach.
> > Ok, I see two ways here:
> > - implement CB support for eventdev.
> > -meanwhile clearly document that this stats are not supported for eventdev  scenarios (yet).

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-14  9:29   ` [PATCH v7 1/4] eal: add " Kevin Laatz
  2022-09-14 14:30     ` Stephen Hemminger
  2022-09-19 10:19     ` Konstantin Ananyev
@ 2022-09-30 22:13     ` Mattias Rönnblom
  2 siblings, 0 replies; 87+ messages in thread
From: Mattias Rönnblom @ 2022-09-30 22:13 UTC (permalink / raw)
  To: Kevin Laatz, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Chengwen Feng, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Konstantin Ananyev,
	Mattias Rönnblom

On 2022-09-14 11:29, Kevin Laatz wrote:
> From: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> Currently, there is no way to measure lcore poll busyness in a passive way,
> without any modifications to the application. This patch adds a new EAL API
> that will be able to passively track core polling busyness.

I think it's more fair to say it "/../ attempts to track /../".

> 
> The poll busyness is calculated by relying on the fact that most DPDK API's
> will poll for work (packets, completions, eventdev events, etc). Empty
> polls can be counted as "idle", while non-empty polls can be counted as
> busy.

I think it would be more clear it said something like "After an empty 
poll, the calling EAL thread is considered idle /../". It's not the poll 
operation itself that we care about, but the resulting processing.

To measure lcore poll busyness, we simply call the telemetry
> timestamping function with the number of polls a particular code section
> has processed, and count the number of cycles we've spent processing empty
> bursts. The more empty bursts we encounter, the less cycles we spend in
> "busy" state, and the less core poll busyness will be reported.
> 
> In order for all of the above to work without modifications to the
> application, the library code needs to be instrumented with calls to the
> lcore telemetry busyness timestamping function. The following parts of DPDK
> are instrumented with lcore poll busyness timestamping calls:
> 
> - All major driver API's:
>    - ethdev
>    - cryptodev
>    - compressdev
>    - regexdev
>    - bbdev
>    - rawdev
>    - eventdev
>    - dmadev
> - Some additional libraries:
>    - ring
>    - distributor

Shouldn't the timer library also be in the list? It's a source of work.

> 
> To avoid performance impact from having lcore telemetry support, a global
> variable is exported by EAL, and a call to timestamping function is wrapped
> into a macro, so that whenever telemetry is disabled, it only takes one
> additional branch and no function calls are performed. It is disabled at
> compile time by default.
> 
> This patch also adds a telemetry endpoint to report lcore poll busyness, as
> well as telemetry endpoints to enable/disable lcore telemetry. A
> documentation entry has been added to the howto guides to explain the usage
> of the new telemetry endpoints and API.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> Signed-off-by: Conor Walsh <conor.walsh@intel.com>
> Signed-off-by: David Hunt <david.hunt@intel.com>
> Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
> 
> ---
> v7:
>    * Rename funcs, vars, files to include "poll" where missing.
> 
> v5:
>    * Fix Windows build
>    * Make lcore_telemetry_free() an internal interface
>    * Minor cleanup
> 
> v4:
>    * Fix doc build
>    * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
>    * Make enable/disable read and write atomic
>    * Change rte_lcore_poll_busyness_enabled_set() param to bool
>    * Move mem alloc from enable/disable to init/cleanup
>    * Other minor fixes
> 
> v3:
>    * Fix missed renaming to poll busyness
>    * Fix clang compilation
>    * Fix arm compilation
> 
> v2:
>    * Use rte_get_tsc_hz() to adjust the telemetry period
>    * Rename to reflect polling busyness vs general busyness
>    * Fix segfault when calling telemetry timestamp from an unregistered
>      non-EAL thread.
>    * Minor cleanup
> ---
>   config/meson.build                            |   1 +
>   config/rte_config.h                           |   1 +
>   lib/bbdev/rte_bbdev.h                         |  17 +-
>   lib/compressdev/rte_compressdev.c             |   2 +
>   lib/cryptodev/rte_cryptodev.h                 |   2 +
>   lib/distributor/rte_distributor.c             |  21 +-
>   lib/distributor/rte_distributor_single.c      |  14 +-
>   lib/dmadev/rte_dmadev.h                       |  15 +-
>   .../common/eal_common_lcore_poll_telemetry.c  | 303 ++++++++++++++++++
>   lib/eal/common/meson.build                    |   1 +
>   lib/eal/freebsd/eal.c                         |   1 +
>   lib/eal/include/rte_lcore.h                   |  85 ++++-
>   lib/eal/linux/eal.c                           |   1 +
>   lib/eal/meson.build                           |   3 +
>   lib/eal/version.map                           |   7 +
>   lib/ethdev/rte_ethdev.h                       |   2 +
>   lib/eventdev/rte_eventdev.h                   |  10 +-
>   lib/rawdev/rte_rawdev.c                       |   6 +-
>   lib/regexdev/rte_regexdev.h                   |   5 +-
>   lib/ring/rte_ring_elem_pvt.h                  |   1 +
>   meson_options.txt                             |   2 +
>   21 files changed, 475 insertions(+), 25 deletions(-)
>   create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c
> 
> diff --git a/config/meson.build b/config/meson.build
> index 7f7b6c92fd..d5954a059c 100644
> --- a/config/meson.build
> +++ b/config/meson.build
> @@ -297,6 +297,7 @@ endforeach
>   dpdk_conf.set('RTE_MAX_ETHPORTS', get_option('max_ethports'))
>   dpdk_conf.set('RTE_LIBEAL_USE_HPET', get_option('use_hpet'))
>   dpdk_conf.set('RTE_ENABLE_TRACE_FP', get_option('enable_trace_fp'))
> +dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', get_option('enable_lcore_poll_busyness'))
>   # values which have defaults which may be overridden
>   dpdk_conf.set('RTE_MAX_VFIO_GROUPS', 64)
>   dpdk_conf.set('RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB', 64)
> diff --git a/config/rte_config.h b/config/rte_config.h
> index ae56a86394..86ac3b8a6e 100644
> --- a/config/rte_config.h
> +++ b/config/rte_config.h
> @@ -39,6 +39,7 @@
>   #define RTE_LOG_DP_LEVEL RTE_LOG_INFO
>   #define RTE_BACKTRACE 1
>   #define RTE_MAX_VFIO_CONTAINERS 64
> +#define RTE_LCORE_POLL_BUSYNESS_PERIOD_MS 2
>   
>   /* bsd module defines */
>   #define RTE_CONTIGMEM_MAX_NUM_BUFS 64
> diff --git a/lib/bbdev/rte_bbdev.h b/lib/bbdev/rte_bbdev.h
> index b88c88167e..d6a98d3f11 100644
> --- a/lib/bbdev/rte_bbdev.h
> +++ b/lib/bbdev/rte_bbdev.h
> @@ -28,6 +28,7 @@ extern "C" {
>   #include <stdbool.h>
>   
>   #include <rte_cpuflags.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_bbdev_op.h"
>   
> @@ -599,7 +600,9 @@ rte_bbdev_dequeue_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -631,7 +634,9 @@ rte_bbdev_dequeue_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   
> @@ -662,7 +667,9 @@ rte_bbdev_dequeue_ldpc_enc_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_enc_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -692,7 +699,9 @@ rte_bbdev_dequeue_ldpc_dec_ops(uint16_t dev_id, uint16_t queue_id,
>   {
>   	struct rte_bbdev *dev = &rte_bbdev_devices[dev_id];
>   	struct rte_bbdev_queue_data *q_data = &dev->data->queues[queue_id];
> -	return dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	const uint16_t nb_ops = dev->dequeue_ldpc_dec_ops(q_data, ops, num_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /** Definitions of device event types */
> diff --git a/lib/compressdev/rte_compressdev.c b/lib/compressdev/rte_compressdev.c
> index 22c438f2dd..fabc495a8e 100644
> --- a/lib/compressdev/rte_compressdev.c
> +++ b/lib/compressdev/rte_compressdev.c
> @@ -580,6 +580,8 @@ rte_compressdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   	nb_ops = (*dev->dequeue_burst)
>   			(dev->data->queue_pairs[qp_id], ops, nb_ops);
>   
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +
>   	return nb_ops;
>   }
>   
> diff --git a/lib/cryptodev/rte_cryptodev.h b/lib/cryptodev/rte_cryptodev.h
> index 56f459c6a0..a5b1d7c594 100644
> --- a/lib/cryptodev/rte_cryptodev.h
> +++ b/lib/cryptodev/rte_cryptodev.h
> @@ -1915,6 +1915,8 @@ rte_cryptodev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		rte_rcu_qsbr_thread_offline(list->qsbr, 0);
>   	}
>   #endif
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
>   	return nb_ops;
>   }
>   
> diff --git a/lib/distributor/rte_distributor.c b/lib/distributor/rte_distributor.c
> index 3035b7a999..428157ec64 100644
> --- a/lib/distributor/rte_distributor.c
> +++ b/lib/distributor/rte_distributor.c
> @@ -56,6 +56,8 @@ rte_distributor_request_pkt(struct rte_distributor *d,
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>   	}
>   
>   	/*
> @@ -134,24 +136,29 @@ rte_distributor_get_pkt(struct rte_distributor *d,
>   
>   	if (unlikely(d->alg_type == RTE_DIST_ALG_SINGLE)) {
>   		if (return_count <= 1) {
> +			uint16_t cnt;
>   			pkts[0] = rte_distributor_get_pkt_single(d->d_single,
> -				worker_id, return_count ? oldpkt[0] : NULL);
> -			return (pkts[0]) ? 1 : 0;
> -		} else
> -			return -EINVAL;
> +								 worker_id,
> +								 return_count ? oldpkt[0] : NULL);
> +			cnt = (pkts[0] != NULL) ? 1 : 0;
> +			RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(cnt);
> +			return cnt;
> +		}
> +		return -EINVAL;
>   	}
>   
>   	rte_distributor_request_pkt(d, worker_id, oldpkt, return_count);
>   
> -	count = rte_distributor_poll_pkt(d, worker_id, pkts);
> -	while (count == -1) {
> +	while ((count = rte_distributor_poll_pkt(d, worker_id, pkts)) == -1) {
>   		uint64_t t = rte_rdtsc() + 100;
>   
>   		while (rte_rdtsc() < t)
>   			rte_pause();
>   
> -		count = rte_distributor_poll_pkt(d, worker_id, pkts);
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
>   	}
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(count);
>   	return count;
>   }
>   
> diff --git a/lib/distributor/rte_distributor_single.c b/lib/distributor/rte_distributor_single.c
> index 2c77ac454a..4c916c0fd2 100644
> --- a/lib/distributor/rte_distributor_single.c
> +++ b/lib/distributor/rte_distributor_single.c
> @@ -31,8 +31,13 @@ rte_distributor_request_pkt_single(struct rte_distributor_single *d,
>   	union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
>   	int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
>   			| RTE_DISTRIB_GET_BUF;
> -	RTE_WAIT_UNTIL_MASKED(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
> -		==, 0, __ATOMIC_RELAXED);
> +
> +	while ((__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
> +			& RTE_DISTRIB_FLAGS_MASK) != 0) {
> +		rte_pause();
> +		/* this was an empty poll */

The idle period started before the pause.

> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
>   
>   	/* Sync with distributor on GET_BUF flag. */
>   	__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
> @@ -59,8 +64,11 @@ rte_distributor_get_pkt_single(struct rte_distributor_single *d,
>   {
>   	struct rte_mbuf *ret;
>   	rte_distributor_request_pkt_single(d, worker_id, oldpkt);
> -	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL)
> +	while ((ret = rte_distributor_poll_pkt_single(d, worker_id)) == NULL) {
>   		rte_pause();
> +		/* this was an empty poll */
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(0);
> +	}
>   	return ret;
>   }
>   
> diff --git a/lib/dmadev/rte_dmadev.h b/lib/dmadev/rte_dmadev.h
> index e7f992b734..3e27e0fd2b 100644
> --- a/lib/dmadev/rte_dmadev.h
> +++ b/lib/dmadev/rte_dmadev.h
> @@ -149,6 +149,7 @@
>   #include <rte_bitops.h>
>   #include <rte_common.h>
>   #include <rte_compat.h>
> +#include <rte_lcore.h>
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -1027,7 +1028,7 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		  uint16_t *last_idx, bool *has_error)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   	bool err;
>   
>   #ifdef RTE_DMADEV_DEBUG
> @@ -1050,8 +1051,10 @@ rte_dma_completed(int16_t dev_id, uint16_t vchan, const uint16_t nb_cpls,
>   		has_error = &err;
>   
>   	*has_error = false;
> -	return (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> -				 has_error);
> +	nb_ops = (*obj->completed)(obj->dev_private, vchan, nb_cpls, last_idx,
> +				   has_error);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> @@ -1090,7 +1093,7 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   			 enum rte_dma_status_code *status)
>   {
>   	struct rte_dma_fp_object *obj = &rte_dma_fp_objs[dev_id];
> -	uint16_t idx;
> +	uint16_t idx, nb_ops;
>   
>   #ifdef RTE_DMADEV_DEBUG
>   	if (!rte_dma_is_valid(dev_id) || nb_cpls == 0 || status == NULL)
> @@ -1101,8 +1104,10 @@ rte_dma_completed_status(int16_t dev_id, uint16_t vchan,
>   	if (last_idx == NULL)
>   		last_idx = &idx;
>   
> -	return (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
> +	nb_ops = (*obj->completed_status)(obj->dev_private, vchan, nb_cpls,
>   					last_idx, status);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   /**
> diff --git a/lib/eal/common/eal_common_lcore_poll_telemetry.c b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> new file mode 100644
> index 0000000000..d97996e85f
> --- /dev/null
> +++ b/lib/eal/common/eal_common_lcore_poll_telemetry.c
> @@ -0,0 +1,303 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <limits.h>
> +#include <string.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_lcore.h>
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#include <rte_telemetry.h>
> +#endif
> +
> +rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +
> +struct lcore_poll_telemetry {
> +	int poll_busyness;
> +	/**< Calculated poll busyness (gets set/returned by the API) */
> +	int raw_poll_busyness;
> +	/**< Calculated poll busyness times 100. */
> +	uint64_t interval_ts;
> +	/**< when previous telemetry interval started */
> +	uint64_t empty_cycles;
> +	/**< empty cycle count since last interval */
> +	uint64_t last_poll_ts;
> +	/**< last poll timestamp */
> +	bool last_empty;
> +	/**< if last poll was empty */
> +	unsigned int contig_poll_cnt;
> +	/**< contiguous (always empty/non empty) poll counter */
> +} __rte_cache_aligned;
> +
> +static struct lcore_poll_telemetry *telemetry_data;
> +
> +#define LCORE_POLL_BUSYNESS_MAX 100
> +#define LCORE_POLL_BUSYNESS_NOT_SET -1
> +#define LCORE_POLL_BUSYNESS_MIN 0
> +
> +#define SMOOTH_COEFF 5
> +#define STATE_CHANGE_OPT 32
> +
> +static void lcore_config_init(void)
> +{
> +	int lcore_id;
> +
> +	RTE_LCORE_FOREACH(lcore_id) {
> +		struct lcore_poll_telemetry *td = &telemetry_data[lcore_id];
> +
> +		td->interval_ts = 0;
> +		td->last_poll_ts = 0;
> +		td->empty_cycles = 0;
> +		td->last_empty = true;
> +		td->contig_poll_cnt = 0;
> +		td->poll_busyness = LCORE_POLL_BUSYNESS_NOT_SET;
> +		td->raw_poll_busyness = 0;
> +	}
> +}
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id)
> +{
> +	const uint64_t tsc_ms = rte_get_timer_hz() / MS_PER_S;
> +	/* if more than 1000 busyness periods have passed, this core is considered inactive */
> +	const uint64_t active_thresh = RTE_LCORE_POLL_BUSYNESS_PERIOD_MS * tsc_ms * 1000;
> +	struct lcore_poll_telemetry *tdata;
> +
> +	if (lcore_id >= RTE_MAX_LCORE)
> +		return -EINVAL;
> +	tdata = &telemetry_data[lcore_id];
> +
> +	/* if the lcore is not active */
> +	if (tdata->interval_ts == 0)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +	/* if the core hasn't been active in a while */
> +	else if ((rte_rdtsc() - tdata->interval_ts) > active_thresh)
> +		return LCORE_POLL_BUSYNESS_NOT_SET;
> +
> +	/* this core is active, report its poll busyness */
> +	return telemetry_data[lcore_id].poll_busyness;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable)
> +{
> +	int set = rte_atomic32_cmpset((volatile uint32_t *)&__rte_lcore_poll_telemetry_enabled,
> +			(int)!enable, (int)enable);

Use GCC C11 atomics?

> +
> +	/* Reset counters on successful disable */
> +	if (set && !enable)
> +		lcore_config_init();
> +}
> +
> +static inline int calc_raw_poll_busyness(const struct lcore_poll_telemetry *tdata,
> +				    const uint64_t empty, const uint64_t total)
> +{
> +	/*
> +	 * We don't want to use floating point math here, but we want for our poll
> +	 * busyness to react smoothly to sudden changes, while still keeping the
> +	 * accuracy and making sure that over time the average follows poll busyness
> +	 * as measured just-in-time. Therefore, we will calculate the average poll
> +	 * busyness using integer math, but shift the decimal point two places
> +	 * to the right, so that 100.0 becomes 10000. This allows us to report
> +	 * integer values (0..100) while still allowing ourselves to follow the
> +	 * just-in-time measurements when we calculate our averages.
> +	 */
> +	const int max_raw_idle = LCORE_POLL_BUSYNESS_MAX * 100;
> +
> +	const int prev_raw_idle = max_raw_idle - tdata->raw_poll_busyness;
> +
> +	/* calculate rate of idle cycles, times 100 */
> +	const int cur_raw_idle = (int)((empty * max_raw_idle) / total);
> +
> +	/* smoothen the idleness */
> +	const int smoothened_idle =
> +			(cur_raw_idle + prev_raw_idle * (SMOOTH_COEFF - 1)) / SMOOTH_COEFF;
> +
> +	/* convert idleness to poll busyness */
> +	return max_raw_idle - smoothened_idle;
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx)
> +{
> +	const unsigned int lcore_id = rte_lcore_id();
> +	uint64_t interval_ts, empty_cycles, cur_tsc, last_poll_ts;
> +	struct lcore_poll_telemetry *tdata;
> +	const bool empty = nb_rx == 0;
> +	uint64_t diff_int, diff_last;
> +	bool last_empty;
> +
> +	/* This telemetry is not supported for unregistered non-EAL threads */
> +	if (lcore_id >= RTE_MAX_LCORE) {
> +		RTE_LOG(DEBUG, EAL,
> +				"Lcore telemetry not supported on unregistered non-EAL thread %d",
> +				lcore_id);
> +		return;
> +	}
> +
> +	tdata = &telemetry_data[lcore_id];
> +	last_empty = tdata->last_empty;
> +
> +	/* optimization: don't do anything if status hasn't changed */
> +	if (last_empty == empty && tdata->contig_poll_cnt++ < STATE_CHANGE_OPT)
> +		return;
> +	/* status changed or we're waiting for too long, reset counter */
> +	tdata->contig_poll_cnt = 0;
> +
> +	cur_tsc = rte_rdtsc();
> +
> +	interval_ts = tdata->interval_ts;
> +	empty_cycles = tdata->empty_cycles;
> +	last_poll_ts = tdata->last_poll_ts;
> +
> +	diff_int = cur_tsc - interval_ts;
> +	diff_last = cur_tsc - last_poll_ts;
> +
> +	/* is this the first time we're here? */
> +	if (interval_ts == 0) {
> +		tdata->poll_busyness = LCORE_POLL_BUSYNESS_MIN;
> +		tdata->raw_poll_busyness = 0;
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->contig_poll_cnt = 0;
> +		goto end;
> +	}
> +
> +	/* update the empty counter if we got an empty poll earlier */
> +	if (last_empty)
> +		empty_cycles += diff_last;
> +
> +	/* have we passed the interval? */
> +	uint64_t interval = ((rte_get_tsc_hz() / MS_PER_S) * RTE_LCORE_POLL_BUSYNESS_PERIOD_MS);
> +	if (diff_int > interval) {
> +		int raw_poll_busyness;
> +
> +		/* get updated poll_busyness value */
> +		raw_poll_busyness = calc_raw_poll_busyness(tdata, empty_cycles, diff_int);
> +
> +		/* set a new interval, reset empty counter */
> +		tdata->interval_ts = cur_tsc;
> +		tdata->empty_cycles = 0;
> +		tdata->raw_poll_busyness = raw_poll_busyness;
> +		/* bring poll busyness back to 0..100 range, biased to round up */
> +		tdata->poll_busyness = (raw_poll_busyness + 50) / 100;

You probably want to report the number of busy cycles as well, so the 
user can do her own averaging, without being forced to resorting to 
sampling.

> +	} else
> +		/* we may have updated empty counter */
> +		tdata->empty_cycles = empty_cycles;
> +
> +end:
> +	/* update status for next poll */
> +	tdata->last_poll_ts = cur_tsc;
> +	tdata->last_empty = empty;
> +}
> +
> +static int
> +lcore_poll_busyness_enable(const char *cmd __rte_unused,
> +		      const char *params __rte_unused,
> +		      struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(true);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 1);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_poll_busyness_disable(const char *cmd __rte_unused,
> +		       const char *params __rte_unused,
> +		       struct rte_tel_data *d)
> +{
> +	rte_lcore_poll_busyness_enabled_set(false);
> +
> +	rte_tel_data_start_dict(d);
> +
> +	rte_tel_data_add_dict_int(d, "poll_busyness_enabled", 0);
> +
> +	return 0;
> +}
> +
> +static int
> +lcore_handle_poll_busyness(const char *cmd __rte_unused,
> +		      const char *params __rte_unused, struct rte_tel_data *d)
> +{
> +	char corenum[64];
> +	int i;
> +
> +	rte_tel_data_start_dict(d);
> +
> +	RTE_LCORE_FOREACH(i) {
> +		if (!rte_lcore_is_enabled(i))
> +			continue;
> +		snprintf(corenum, sizeof(corenum), "%d", i);
> +		rte_tel_data_add_dict_int(d, corenum, rte_lcore_poll_busyness(i));
> +	}
> +
> +	return 0;
> +}
> +
> +void
> +eal_lcore_poll_telemetry_free(void)
> +{
> +	if (telemetry_data != NULL) {
> +		free(telemetry_data);
> +		telemetry_data = NULL;
> +	}
> +}
> +
> +RTE_INIT(lcore_init_poll_telemetry)
> +{
> +	telemetry_data = calloc(RTE_MAX_LCORE, sizeof(telemetry_data[0]));
> +	if (telemetry_data == NULL)
> +		rte_panic("Could not init lcore telemetry data: Out of memory\n");
> +
> +	lcore_config_init();
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness", lcore_handle_poll_busyness,
> +				   "return percentage poll busyness of cores");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_enable", lcore_poll_busyness_enable,
> +				   "enable lcore poll busyness measurement");
> +
> +	rte_telemetry_register_cmd("/eal/lcore/poll_busyness_disable", lcore_poll_busyness_disable,
> +				   "disable lcore poll busyness measurement");
> +
> +	rte_atomic32_set(&__rte_lcore_poll_telemetry_enabled, true);
> +}
> +
> +#else
> +
> +int rte_lcore_poll_busyness(unsigned int lcore_id __rte_unused)
> +{
> +	return -ENOTSUP;
> +}
> +
> +int rte_lcore_poll_busyness_enabled(void)
> +{
> +	return -ENOTSUP;
> +}
> +
> +void rte_lcore_poll_busyness_enabled_set(bool enable __rte_unused)
> +{
> +}
> +
> +void __rte_lcore_poll_busyness_timestamp(uint16_t nb_rx __rte_unused)
> +{
> +}
> +
> +void eal_lcore_poll_telemetry_free(void)
> +{
> +}
> +
> +#endif
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..e5741ce9f9 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -17,6 +17,7 @@ sources += files(
>           'eal_common_hexdump.c',
>           'eal_common_interrupts.c',
>           'eal_common_launch.c',
> +        'eal_common_lcore_poll_telemetry.c',
>           'eal_common_lcore.c',
>           'eal_common_log.c',
>           'eal_common_mcfg.c',
> diff --git a/lib/eal/freebsd/eal.c b/lib/eal/freebsd/eal.c
> index 26fbc91b26..92c4af9c28 100644
> --- a/lib/eal/freebsd/eal.c
> +++ b/lib/eal/freebsd/eal.c
> @@ -895,6 +895,7 @@ rte_eal_cleanup(void)
>   	rte_mp_channel_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	rte_eal_alarm_cleanup();
> diff --git a/lib/eal/include/rte_lcore.h b/lib/eal/include/rte_lcore.h
> index b598e1b9ec..2191c2473a 100644
> --- a/lib/eal/include/rte_lcore.h
> +++ b/lib/eal/include/rte_lcore.h
> @@ -16,6 +16,7 @@
>   #include <rte_eal.h>
>   #include <rte_launch.h>
>   #include <rte_thread.h>
> +#include <rte_atomic.h>
>   
>   #ifdef __cplusplus
>   extern "C" {
> @@ -415,9 +416,91 @@ rte_ctrl_thread_create(pthread_t *thread, const char *name,
>   		const pthread_attr_t *attr,
>   		void *(*start_routine)(void *), void *arg);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Read poll busyness value corresponding to an lcore.
> + *
> + * @param lcore_id
> + *   Lcore to read poll busyness value for.
> + * @return
> + *   - value between 0 and 100 on success
> + *   - -1 if lcore is not active
> + *   - -EINVAL if lcore is invalid
> + *   - -ENOMEM if not enough memory available
> + *   - -ENOTSUP if not supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness(unsigned int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Check if lcore poll busyness telemetry is enabled.
> + *
> + * @return
> + *   - true if lcore telemetry is enabled
> + *   - false if lcore telemetry is disabled
> + *   - -ENOTSUP if not lcore telemetry supported
> + */
> +__rte_experimental
> +int
> +rte_lcore_poll_busyness_enabled(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Enable or disable poll busyness telemetry.
> + *
> + * @param enable
> + *   1 to enable, 0 to disable
> + */
> +__rte_experimental
> +void
> +rte_lcore_poll_busyness_enabled_set(bool enable);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Lcore poll busyness timestamping function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +__rte_experimental
> +void
> +__rte_lcore_poll_busyness_timestamp(uint16_t nb_rx);
> +
> +/** @internal lcore telemetry enabled status */
> +extern rte_atomic32_t __rte_lcore_poll_telemetry_enabled;
> +
> +/** @internal free memory allocated for lcore telemetry */
> +void
> +eal_lcore_poll_telemetry_free(void);
> +
> +/**
> + * Call lcore poll busyness timestamp function.
> + *
> + * @param nb_rx
> + *   Number of buffers processed by lcore.
> + */
> +#ifdef RTE_LCORE_POLL_BUSYNESS
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do {				\
> +	int enabled = (int)rte_atomic32_read(&__rte_lcore_poll_telemetry_enabled);	\
> +	if (enabled)								\
> +		__rte_lcore_poll_busyness_timestamp(nb_rx);			\
> +} while (0)
> +#else
> +#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx) do { } while (0)
> +#endif
> +
>   #ifdef __cplusplus
>   }
>   #endif
>   
> -
>   #endif /* _RTE_LCORE_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 37d29643a5..5e81352a81 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -1364,6 +1364,7 @@ rte_eal_cleanup(void)
>   	rte_mp_channel_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_lcore_poll_telemetry_free();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/meson.build b/lib/eal/meson.build
> index 056beb9461..2fb90d446b 100644
> --- a/lib/eal/meson.build
> +++ b/lib/eal/meson.build
> @@ -25,6 +25,9 @@ subdir(arch_subdir)
>   deps += ['kvargs']
>   if not is_windows
>       deps += ['telemetry']
> +else
> +	# core poll busyness telemetry depends on telemetry library
> +	dpdk_conf.set('RTE_LCORE_POLL_BUSYNESS', false)
>   endif
>   if dpdk_conf.has('RTE_USE_LIBBSD')
>       ext_deps += libbsd
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 1f293e768b..3275d1fac4 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -424,6 +424,13 @@ EXPERIMENTAL {
>   	rte_thread_self;
>   	rte_thread_set_affinity_by_id;
>   	rte_thread_set_priority;
> +
> +	# added in 22.11
> +	__rte_lcore_poll_busyness_timestamp;
> +	__rte_lcore_poll_telemetry_enabled;
> +	rte_lcore_poll_busyness;
> +	rte_lcore_poll_busyness_enabled;
> +	rte_lcore_poll_busyness_enabled_set;
>   };
>   
>   INTERNAL {
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index de9e970d4d..4c8113f31f 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -5675,6 +5675,8 @@ rte_eth_rx_burst(uint16_t port_id, uint16_t queue_id,
>   #endif
>   
>   	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
>   	return nb_rx;
>   }
>   
> diff --git a/lib/eventdev/rte_eventdev.h b/lib/eventdev/rte_eventdev.h
> index 6a6f6ea4c1..a65b3c7c85 100644
> --- a/lib/eventdev/rte_eventdev.h
> +++ b/lib/eventdev/rte_eventdev.h
> @@ -2153,6 +2153,7 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   			uint16_t nb_events, uint64_t timeout_ticks)
>   {
>   	const struct rte_event_fp_ops *fp_ops;
> +	uint16_t nb_evts;
>   	void *port;
>   
>   	fp_ops = &rte_event_fp_ops[dev_id];
> @@ -2175,10 +2176,13 @@ rte_event_dequeue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
>   	 * requests nb_events as const one
>   	 */
>   	if (nb_events == 1)
> -		return (fp_ops->dequeue)(port, ev, timeout_ticks);
> +		nb_evts = (fp_ops->dequeue)(port, ev, timeout_ticks);
>   	else
> -		return (fp_ops->dequeue_burst)(port, ev, nb_events,
> -					       timeout_ticks);
> +		nb_evts = (fp_ops->dequeue_burst)(port, ev, nb_events,
> +					timeout_ticks);
> +
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_evts);
> +	return nb_evts;
>   }
>   

For software event devices, like SW or DSW, the timestamp macro will be 
invoked at least twice per dequeue call. This is a slight slow-down. Not 
a huge issue.

However, if the event device would use more than one ring per eventdev 
port, which it may well do (e.g., to handle events different priority, 
or for events from different sources), the first ring poll may well be 
empty, and always so, thus causing a state transition from busy to idle, 
and back again, *for every dequeue call*. That would definately show up 
as a noticable performance degradation, since the rdtsc and the other 
instructions, including a mul, in that code path. Taken twice.

>   #define RTE_EVENT_DEV_MAINT_OP_FLUSH          (1 << 0)
> diff --git a/lib/rawdev/rte_rawdev.c b/lib/rawdev/rte_rawdev.c
> index 2f0a4f132e..1cba53270a 100644
> --- a/lib/rawdev/rte_rawdev.c
> +++ b/lib/rawdev/rte_rawdev.c
> @@ -16,6 +16,7 @@
>   #include <rte_common.h>
>   #include <rte_malloc.h>
>   #include <rte_telemetry.h>
> +#include <rte_lcore.h>
>   
>   #include "rte_rawdev.h"
>   #include "rte_rawdev_pmd.h"
> @@ -226,12 +227,15 @@ rte_rawdev_dequeue_buffers(uint16_t dev_id,
>   			   rte_rawdev_obj_t context)
>   {
>   	struct rte_rawdev *dev;
> +	int nb_ops;
>   
>   	RTE_RAWDEV_VALID_DEVID_OR_ERR_RET(dev_id, -EINVAL);
>   	dev = &rte_rawdevs[dev_id];
>   
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dequeue_bufs, -ENOTSUP);
> -	return (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	nb_ops = (*dev->dev_ops->dequeue_bufs)(dev, buffers, count, context);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_ops);
> +	return nb_ops;
>   }
>   
>   int
> diff --git a/lib/regexdev/rte_regexdev.h b/lib/regexdev/rte_regexdev.h
> index 3bce8090f6..8caaed502f 100644
> --- a/lib/regexdev/rte_regexdev.h
> +++ b/lib/regexdev/rte_regexdev.h
> @@ -1530,6 +1530,7 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   			   struct rte_regex_ops **ops, uint16_t nb_ops)
>   {
>   	struct rte_regexdev *dev = &rte_regex_devices[dev_id];
> +	uint16_t deq_ops;
>   #ifdef RTE_LIBRTE_REGEXDEV_DEBUG
>   	RTE_REGEXDEV_VALID_DEV_ID_OR_ERR_RET(dev_id, -EINVAL);
>   	RTE_FUNC_PTR_OR_ERR_RET(*dev->dequeue, -ENOTSUP);
> @@ -1538,7 +1539,9 @@ rte_regexdev_dequeue_burst(uint8_t dev_id, uint16_t qp_id,
>   		return -EINVAL;
>   	}
>   #endif
> -	return (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	deq_ops = (*dev->dequeue)(dev, qp_id, ops, nb_ops);
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(deq_ops);
> +	return deq_ops;
>   }
>   
>   #ifdef __cplusplus
> diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
> index 83788c56e6..cf2370c238 100644
> --- a/lib/ring/rte_ring_elem_pvt.h
> +++ b/lib/ring/rte_ring_elem_pvt.h
> @@ -379,6 +379,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
>   end:
>   	if (available != NULL)
>   		*available = entries - n;
> +	RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);
>   	return n;
>   }
>   
> diff --git a/meson_options.txt b/meson_options.txt
> index 7c220ad68d..9b20a36fdb 100644
> --- a/meson_options.txt
> +++ b/meson_options.txt
> @@ -20,6 +20,8 @@ option('enable_driver_sdk', type: 'boolean', value: false, description:
>          'Install headers to build drivers.')
>   option('enable_kmods', type: 'boolean', value: false, description:
>          'build kernel modules')
> +option('enable_lcore_poll_busyness', type: 'boolean', value: false, description:
> +       'enable collection of lcore poll busyness telemetry')
>   option('examples', type: 'string', value: '', description:
>          'Comma-separated list of examples to build by default')
>   option('flexran_sdk', type: 'string', value: '', description:

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 3/4] app/test: add unit tests for lcore poll busyness
  2022-09-14  9:29   ` [PATCH v7 3/4] app/test: add unit tests for lcore poll busyness Kevin Laatz
@ 2022-09-30 22:20     ` Mattias Rönnblom
  0 siblings, 0 replies; 87+ messages in thread
From: Mattias Rönnblom @ 2022-09-30 22:20 UTC (permalink / raw)
  To: Kevin Laatz, dev; +Cc: anatoly.burakov, Mattias Rönnblom



On 2022-09-14 11:29, Kevin Laatz wrote:
> Add API unit tests and perf unit tests for the newly added lcore poll
> busyness feature.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> ---
>   app/test/meson.build                     |   4 +
>   app/test/test_lcore_poll_busyness_api.c  | 134 +++++++++++++++++++++++
>   app/test/test_lcore_poll_busyness_perf.c |  72 ++++++++++++
>   3 files changed, 210 insertions(+)
>   create mode 100644 app/test/test_lcore_poll_busyness_api.c
>   create mode 100644 app/test/test_lcore_poll_busyness_perf.c
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index bf1d81f84a..d543e730a2 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -74,6 +74,8 @@ test_sources = files(
>           'test_ipsec_perf.c',
>           'test_kni.c',
>           'test_kvargs.c',
> +        'test_lcore_poll_busyness_api.c',
> +        'test_lcore_poll_busyness_perf.c',
>           'test_lcores.c',
>           'test_logs.c',
>           'test_lpm.c',
> @@ -192,6 +194,7 @@ fast_tests = [
>           ['interrupt_autotest', true, true],
>           ['ipfrag_autotest', false, true],
>           ['lcores_autotest', true, true],
> +        ['lcore_poll_busyness_autotest', true, true],
>           ['logs_autotest', true, true],
>           ['lpm_autotest', true, true],
>           ['lpm6_autotest', true, true],
> @@ -292,6 +295,7 @@ perf_test_names = [
>           'trace_perf_autotest',
>           'ipsec_perf_autotest',
>           'thash_perf_autotest',
> +        'lcore_poll_busyness_perf_autotest'
>   ]
>   
>   driver_test_names = [
> diff --git a/app/test/test_lcore_poll_busyness_api.c b/app/test/test_lcore_poll_busyness_api.c
> new file mode 100644
> index 0000000000..db76322994
> --- /dev/null
> +++ b/app/test/test_lcore_poll_busyness_api.c
> @@ -0,0 +1,134 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include <rte_lcore.h>
> +
> +#include "test.h"
> +
> +/* Arbitrary amount of "work" to simulate busyness with */
> +#define WORK		32
> +#define TIMESTAMP_ITERS	1000000
> +
> +#define LCORE_POLL_BUSYNESS_NOT_SET	-1
> +
> +static int
> +test_lcore_poll_busyness_enable_disable(void)
> +{
> +	int initial_state, curr_state;
> +	bool req_state;
> +
> +	/* Get the initial state */
> +	initial_state = rte_lcore_poll_busyness_enabled();
> +	if (initial_state == -ENOTSUP)
> +		return TEST_SKIPPED;
> +
> +	/* Set state to the inverse of the initial state and check for the change */
> +	req_state = !initial_state;
> +	rte_lcore_poll_busyness_enabled_set(req_state);
> +	curr_state = rte_lcore_poll_busyness_enabled();
> +	if (curr_state != req_state)
> +		return TEST_FAILED;
> +
> +	/* Now change the state back to the original state. By changing it back, both
> +	 * enable and disable will have been tested.
> +	 */
> +	req_state = !curr_state;
> +	rte_lcore_poll_busyness_enabled_set(req_state);
> +	curr_state = rte_lcore_poll_busyness_enabled();
> +	if (curr_state != req_state)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static int
> +test_lcore_poll_busyness_invalid_lcore(void)
> +{
> +	int ret;
> +
> +	/* Check if lcore poll busyness is enabled */
> +	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
> +		return TEST_SKIPPED;
> +
> +	/* Only lcore_id <= RTE_MAX_LCORE are valid */
> +	ret = rte_lcore_poll_busyness(RTE_MAX_LCORE);
> +	if (ret != -EINVAL)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static int
> +test_lcore_poll_busyness_inactive_lcore(void)
> +{
> +	int ret;
> +
> +	/* Check if lcore poll busyness is enabled */
> +	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
> +		return TEST_SKIPPED;
> +
> +	/* Use the test thread lcore_id for this test. Since it is not a polling
> +	 * application, the busyness is expected to return -1.
> +	 *
> +	 * Note: this will not work with affinitized cores
> +	 */
> +	ret = rte_lcore_poll_busyness(rte_lcore_id());
> +	if (ret != LCORE_POLL_BUSYNESS_NOT_SET)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static void
> +simulate_lcore_poll_busyness(int iters)
> +{
> +	int i;
> +
> +	for (i = 0; i < iters; i++)
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
> +}
> +
> +/* The test cannot know of an application running to test for valid lcore poll
> + * busyness data. For this test, we simulate lcore poll busyness for the
> + * lcore_id of the test thread for testing purposes.
> + */
> +static int
> +test_lcore_poll_busyness_active_lcore(void)
> +{
> +	int ret;
> +
> +	/* Check if lcore poll busyness is enabled */
> +	if (rte_lcore_poll_busyness_enabled() == -ENOTSUP)
> +		return TEST_SKIPPED;
> +
> +	simulate_lcore_poll_busyness(TIMESTAMP_ITERS);
> +
> +	/* After timestamping with "work" many times, lcore poll busyness should be > 0 */
> +	ret = rte_lcore_poll_busyness(rte_lcore_id());
> +	if (ret <= 0)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite lcore_poll_busyness_tests = {
> +	.suite_name = "lcore poll busyness autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_lcore_poll_busyness_enable_disable),
> +		TEST_CASE(test_lcore_poll_busyness_invalid_lcore),
> +		TEST_CASE(test_lcore_poll_busyness_inactive_lcore),
> +		TEST_CASE(test_lcore_poll_busyness_active_lcore),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_lcore_poll_busyness_api(void)
> +{
> +	return unit_test_suite_runner(&lcore_poll_busyness_tests);
> +}
> +
> +REGISTER_TEST_COMMAND(lcore_poll_busyness_autotest, test_lcore_poll_busyness_api);
> diff --git a/app/test/test_lcore_poll_busyness_perf.c b/app/test/test_lcore_poll_busyness_perf.c
> new file mode 100644
> index 0000000000..5c27d21b00
> --- /dev/null
> +++ b/app/test/test_lcore_poll_busyness_perf.c
> @@ -0,0 +1,72 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Intel Corporation
> + */
> +
> +#include <unistd.h>
> +#include <inttypes.h>
> +
> +#include <rte_lcore.h>
> +#include <rte_cycles.h>
> +
> +#include "test.h"
> +
> +/* Arbitrary amount of "work" to simulate busyness with */
> +#define WORK		32
> +#define TIMESTAMP_ITERS	1000000
> +#define TEST_ITERS	10000
> +
> +static void
> +simulate_lcore_poll_busyness(int iters)
> +{
> +	int i;
> +
> +	for (i = 0; i < iters; i++)
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);
> +}
> +
> +static void
> +test_timestamp_perf(void)
> +{
> +	uint64_t start, end, diff;
> +	uint64_t min = UINT64_MAX;
> +	uint64_t max = 0;
> +	uint64_t total = 0;
> +	int i;
> +
> +	for (i = 0; i < TEST_ITERS; i++) {
> +		start = rte_rdtsc();
> +		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(WORK);

This is how it will look for a thread which is always busy. That's a 
relevant case, but not the only such.

Do one with WORK replaced by (i & 1), and you get the other extreme. 
Applications that poll multiple sources of work before performing any 
actual processing, and which are following your advice about the use of 
the macro, will see these kind of latencies.

> +		end = rte_rdtsc();
> +
> +		diff = end - start;
> +		min = RTE_MIN(diff, min);
> +		max = RTE_MAX(diff, max);
> +		total += diff;
> +	}
> +
> +	printf("### Timestamp perf ###\n");
> +	printf("Min cycles: %"PRIu64"\n", min);
> +	printf("Avg cycles: %"PRIu64"\n", total / TEST_ITERS);
> +	printf("Max cycles: %"PRIu64"\n", max);
> +	printf("\n");
> +}
> +
> +
> +static int
> +test_lcore_poll_busyness_perf(void)
> +{
> +	if (rte_lcore_poll_busyness_enabled()  == -ENOTSUP) {
> +		printf("Lcore poll busyness may be disabled...\n");
> +		return TEST_SKIPPED;
> +	}
> +
> +	/* Initialize and prime the timestamp struct with simulated "work" for this lcore */
> +	simulate_lcore_poll_busyness(10000);
> +
> +	/* Run perf tests */
> +	test_timestamp_perf();
> +
> +	return TEST_SUCCESS;
> +}
> +
> +REGISTER_TEST_COMMAND(lcore_poll_busyness_perf_autotest, test_lcore_poll_busyness_perf);

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-09-29 12:41           ` Kevin Laatz
  2022-09-30 12:32             ` Jerin Jacob
@ 2022-10-01 14:17             ` Konstantin Ananyev
  2022-10-03 20:02               ` Mattias Rönnblom
  1 sibling, 1 reply; 87+ messages in thread
From: Konstantin Ananyev @ 2022-10-01 14:17 UTC (permalink / raw)
  To: Kevin Laatz, Konstantin Ananyev, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli


>> Hi Kevin,
>>
>>>>> Currently, there is no way to measure lcore poll busyness in a 
>>>>> passive way,
>>>>> without any modifications to the application. This patch adds a new 
>>>>> EAL API
>>>>> that will be able to passively track core polling busyness.
>>>>>
>>>>> The poll busyness is calculated by relying on the fact that most 
>>>>> DPDK API's
>>>>> will poll for work (packets, completions, eventdev events, etc). Empty
>>>>> polls can be counted as "idle", while non-empty polls can be 
>>>>> counted as
>>>>> busy. To measure lcore poll busyness, we simply call the telemetry
>>>>> timestamping function with the number of polls a particular code 
>>>>> section
>>>>> has processed, and count the number of cycles we've spent 
>>>>> processing empty
>>>>> bursts. The more empty bursts we encounter, the less cycles we 
>>>>> spend in
>>>>> "busy" state, and the less core poll busyness will be reported.
>>>>>
>>>>> In order for all of the above to work without modifications to the
>>>>> application, the library code needs to be instrumented with calls 
>>>>> to the
>>>>> lcore telemetry busyness timestamping function. The following parts 
>>>>> of DPDK
>>>>> are instrumented with lcore poll busyness timestamping calls:
>>>>>
>>>>> - All major driver API's:
>>>>>     - ethdev
>>>>>     - cryptodev
>>>>>     - compressdev
>>>>>     - regexdev
>>>>>     - bbdev
>>>>>     - rawdev
>>>>>     - eventdev
>>>>>     - dmadev
>>>>> - Some additional libraries:
>>>>>     - ring
>>>>>     - distributor
>>>>>
>>>>> To avoid performance impact from having lcore telemetry support, a 
>>>>> global
>>>>> variable is exported by EAL, and a call to timestamping function is 
>>>>> wrapped
>>>>> into a macro, so that whenever telemetry is disabled, it only takes 
>>>>> one
>>>>> additional branch and no function calls are performed. It is 
>>>>> disabled at
>>>>> compile time by default.
>>>>>
>>>>> This patch also adds a telemetry endpoint to report lcore poll 
>>>>> busyness, as
>>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
>>>>> documentation entry has been added to the howto guides to explain 
>>>>> the usage
>>>>> of the new telemetry endpoints and API.
>>>> As was already mentioned  by other reviewers, it would be much better
>>>> to let application itself decide when it is idle and when it is busy.
>>>> With current approach even for constant polling run-to-completion 
>>>> model there
>>>> are plenty of opportunities to get things wrong and provide 
>>>> misleading statistics.
>>>> My special concern - inserting it into ring dequeue code.
>>>> Ring is used for various different things, not only pass packets 
>>>> between threads (mempool, etc.).
>>>> Blindly assuming that ring dequeue returns empty means idle cycles 
>>>> seams wrong to me.
>>>> Which make me wonder should we really hard-code these calls into 
>>>> DPDK core functions?
>>>> If you still like to introduce such stats, might be better to 
>>>> implement it via callback mechanism.
>>>> As I remember nearly all our drivers (net, crypto, etc.) do support it.
>>>> That way our generic code   will remain unaffected, plus user will 
>>>> have ability to enable/disable
>>>> it on a per device basis.
>>> Thanks for your feedback, Konstantin.
>>>
>>> You are right in saying that this approach won't be 100% suitable for
>>> all use-cases, but should be suitable for the majority of applications.
>> First of all - could you explain how did you measure what is the 
>> 'majority' of DPDK applications?
>> And how did you conclude that it definitely work for all the apps in 
>> that 'majority'?
>> Second what bother me with that approach - I don't see s clear and 
>> deterministic way
>> for the user to understand would that stats work properly for his app 
>> or not.
>> (except manually ananlyzing his app code).
> 
> All of the DPDK example applications we've tested with (l2fwd, l3fwd + 
> friends, testpmd, distributor, dmafwd) report lcore poll busyness and 
> respond to changing traffic rates etc. We've also compared the reported 
> busyness to similar metrics reported by other projects such as VPP and 
> OvS, and found the reported busyness matches with a difference of +/- 
> 1%. In addition to the DPDK example applications, we've have shared our 
> plans with end customers and they have confirmed that the design should 
> work with their applications.

I am sure l3fwd and testpmd should be ok, I am talking about
something more complicated/unusual.
Below are few examples on top of my head when I think your approach
will generate invalid stats, feel free to correct me, if I am wrong.

1) App doing some sort of bonding itslef, i.e:

struct rte_mbuf pkts[N*2];
k = rte_eth_rx_burst(p0, q0, pkts, N);
n = rte_eth_rx_burst(p1, q1, pkts + k, N);

/*process all packets from both ports at once */
if (n + k != 0)
    process_pkts(pkts, n + k);

Now, as I understand, if n==0, then all cycles spent
in process_pkts() will be accounted as idle.

2) App doing something similar to what pdump library does
(creates a copy of a packet and sends it somewhere).

n =rte_eth_rx_burst(p0, q0, &pkt, 1);
if (n != 0) {
   dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
   if (dup_pkt != NULL)
      process_dup_pkt(dup_pkt);
   process_pkt(pkt);
}

that relates to ring discussion below:
if there are no mbufs in dup_mp, then ring_deque() will fail
and process_pkt() will be accounted as idle.

3) App dequeues from ring in a bit of unusual way:

/* idle spin loop */
while ((n = rte_ring_count(ring)) == 0)
   ret_pause();

n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
if (n != 0)
   process_pkts(pkts, n);

here, we can end-up accounting cycles spent in
idle spin loop as busy.


4) Any thread that generates TX traffic on it's own
(something like testpmd tx_only fwd mode)

5) Any thread that depends on both dequeue and enqueue:

n = rte_ring_dequeue_burst(in_ring, pkts, n, ..);
...

/* loop till all packets are sent out successfully */
while(rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
    rte_pause();

Now, if n > 0, all cycles spent in enqueue() will be accounted
as 'busy', though from my perspective they probably should
be considered as 'idle'.


Also I expect some problems when packet processing is done inside
rx callbacks, but that probably can be easily fixed.


> 
>>> It's worth keeping in mind that this feature is compile-time disabled by
>>> default, so there is no impact to any application/user that does not
>>> wish to use this, for example applications where this type of busyness
>>> is not useful, or for applications that already use other mechanisms to
>>> report similar telemetry.
>> Not sure that adding in new compile-time option disabled by default is 
>> a good thing...
>> For me it would be much more preferable if we'll go through a more 
>> 'standard' way here:
>> a) define clear API to enable/disable/collect/report such type of stats.
>> b) use some of our sample apps to demonstrate how to use it properly 
>> with user-specific code.
>> c) if needed implement some 'silent' stats collection for limited 
>> scope of apps via callbacks -
>> let say for run-to-completion apps that do use ether and crypto devs 
>> only.
> 
> With the compile-time option, its just one build flag for lots of 
> applications to silently benefit from this.

There could be a lot of useful and helpfull stats
that user would like to collect (idle/busy, processing latency, etc.).
But, if for each such case we will hard-code new stats collection
into our fast data-path code, then very soon it will become
completely bloated and unmaintainable.
I think we need some generic approach for such extra stats collection.
Callbacks could be one way, Jerin in another mail suggested using 
existing trace-point hooks, might be it worth to explore it further.

> 
>>   However, the upside for applications that do
>>> wish to use this is that there are no code changes required (for the
>>> most part), the feature simply needs to be enabled at compile-time via
>>> the meson option.
>>>
>>> In scenarios where contextual awareness of the application is needed in
>>> order to report more accurate "busyness", the
>>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>>> sections of code as "busy" or "idle". This way, the application can
>>> assume control of determining the poll busyness of its lcores while
>>> leveraging the telemetry hooks adding in this patchset.
>>>
>>> We did initially consider implementing this via callbacks, however we
>>> found this approach to have 2 main drawbacks:
>>> 1. Application changes are required for all applications wanting to
>>> report this telemetry - rather than the majority getting it for free.
>> Didn't get it - why callbacks approach would require user-app changes?
>> In other situations - rte_power callbacks, pdump, etc. it works 
>> transparent to
>> user-leve code.
>> Why it can't be done here in a similar way?
> 
>  From my understanding, the callbacks would need to be registered by the 
> application at the very least (and the callback would have to be 
> registered per device/pmd/lib).

Callbacks can be registered by library itself.
AFAIK, latenc-ystats, power and pdump libraries - all use similar 
approach. user calls something like xxx_stats_enable() and then library 
can iterate over all available devices and setup necessary callbacks.
same for xxx_stats_disable().

>>
>>> 2. Ring does not have callback support, meaning pipelined applications
>>> could not report lcore poll busyness telemetry with this approach.
>> That's another big concern that I have:
>> Why you consider that all rings will be used for a pipilines between 
>> threads and should
>> always be accounted by your stats?
>> They could be used for dozens different purposes.
>> What if that ring is used for mempool, and ring_dequeue() just means 
>> we try to allocate
>> an object from the pool? In such case, why failing to allocate an 
>> object should mean
>> start of new 'idle cycle'?
> 
> Another approach could be taken here if the mempool interactions are of 
> concern.
> 
>  From our understanding, mempool operations use the "_bulk" APIs, 
> whereas polling operations use the "_burst" APIs. Would only 
> timestamping on the "_burst" APIs be better here? That way the mempool 
> interactions won't be counted towards the busyness.

Well, it would help to solve one particular case,
but in general I still think it is incomplete and error-prone.
What if pipeline app will use ring_count/ring_dequeue_bulk(),
or even ZC ring API?
What if app will use something different from rte_ring to pass
packets between threads/processes?
As I said before, without some clues from the app, it is probably
not possible to collect such stats in a proper way.


> Including support for pipelined applications using rings is key for a 
> number of usecases, this was highlighted as part of the customer 
> feedback when we shared the design.
> 
>>
>>> Eventdev is another driver which would be completely missed with this
>>> approach.
>> Ok, I see two ways here:
>> - implement CB support for eventdev.
>> -meanwhile clearly document that this stats are not supported for 
>> eventdev  scenarios (yet).


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-10-01 14:17             ` Konstantin Ananyev
@ 2022-10-03 20:02               ` Mattias Rönnblom
  2022-10-04  9:15                 ` Morten Brørup
  0 siblings, 1 reply; 87+ messages in thread
From: Mattias Rönnblom @ 2022-10-03 20:02 UTC (permalink / raw)
  To: Konstantin Ananyev, Kevin Laatz, Konstantin Ananyev, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Mattias Rönnblom

On 2022-10-01 16:17, Konstantin Ananyev wrote:
> 
>>> Hi Kevin,
>>>
>>>>>> Currently, there is no way to measure lcore poll busyness in a 
>>>>>> passive way,
>>>>>> without any modifications to the application. This patch adds a 
>>>>>> new EAL API
>>>>>> that will be able to passively track core polling busyness.
>>>>>>
>>>>>> The poll busyness is calculated by relying on the fact that most 
>>>>>> DPDK API's
>>>>>> will poll for work (packets, completions, eventdev events, etc). 
>>>>>> Empty
>>>>>> polls can be counted as "idle", while non-empty polls can be 
>>>>>> counted as
>>>>>> busy. To measure lcore poll busyness, we simply call the telemetry
>>>>>> timestamping function with the number of polls a particular code 
>>>>>> section
>>>>>> has processed, and count the number of cycles we've spent 
>>>>>> processing empty
>>>>>> bursts. The more empty bursts we encounter, the less cycles we 
>>>>>> spend in
>>>>>> "busy" state, and the less core poll busyness will be reported.
>>>>>>
>>>>>> In order for all of the above to work without modifications to the
>>>>>> application, the library code needs to be instrumented with calls 
>>>>>> to the
>>>>>> lcore telemetry busyness timestamping function. The following 
>>>>>> parts of DPDK
>>>>>> are instrumented with lcore poll busyness timestamping calls:
>>>>>>
>>>>>> - All major driver API's:
>>>>>>     - ethdev
>>>>>>     - cryptodev
>>>>>>     - compressdev
>>>>>>     - regexdev
>>>>>>     - bbdev
>>>>>>     - rawdev
>>>>>>     - eventdev
>>>>>>     - dmadev
>>>>>> - Some additional libraries:
>>>>>>     - ring
>>>>>>     - distributor
>>>>>>
>>>>>> To avoid performance impact from having lcore telemetry support, a 
>>>>>> global
>>>>>> variable is exported by EAL, and a call to timestamping function 
>>>>>> is wrapped
>>>>>> into a macro, so that whenever telemetry is disabled, it only 
>>>>>> takes one
>>>>>> additional branch and no function calls are performed. It is 
>>>>>> disabled at
>>>>>> compile time by default.
>>>>>>
>>>>>> This patch also adds a telemetry endpoint to report lcore poll 
>>>>>> busyness, as
>>>>>> well as telemetry endpoints to enable/disable lcore telemetry. A
>>>>>> documentation entry has been added to the howto guides to explain 
>>>>>> the usage
>>>>>> of the new telemetry endpoints and API.
>>>>> As was already mentioned  by other reviewers, it would be much better
>>>>> to let application itself decide when it is idle and when it is busy.
>>>>> With current approach even for constant polling run-to-completion 
>>>>> model there
>>>>> are plenty of opportunities to get things wrong and provide 
>>>>> misleading statistics.
>>>>> My special concern - inserting it into ring dequeue code.
>>>>> Ring is used for various different things, not only pass packets 
>>>>> between threads (mempool, etc.).
>>>>> Blindly assuming that ring dequeue returns empty means idle cycles 
>>>>> seams wrong to me.
>>>>> Which make me wonder should we really hard-code these calls into 
>>>>> DPDK core functions?
>>>>> If you still like to introduce such stats, might be better to 
>>>>> implement it via callback mechanism.
>>>>> As I remember nearly all our drivers (net, crypto, etc.) do support 
>>>>> it.
>>>>> That way our generic code   will remain unaffected, plus user will 
>>>>> have ability to enable/disable
>>>>> it on a per device basis.
>>>> Thanks for your feedback, Konstantin.
>>>>
>>>> You are right in saying that this approach won't be 100% suitable for
>>>> all use-cases, but should be suitable for the majority of applications.
>>> First of all - could you explain how did you measure what is the 
>>> 'majority' of DPDK applications?
>>> And how did you conclude that it definitely work for all the apps in 
>>> that 'majority'?
>>> Second what bother me with that approach - I don't see s clear and 
>>> deterministic way
>>> for the user to understand would that stats work properly for his app 
>>> or not.
>>> (except manually ananlyzing his app code).
>>
>> All of the DPDK example applications we've tested with (l2fwd, l3fwd + 
>> friends, testpmd, distributor, dmafwd) report lcore poll busyness and 
>> respond to changing traffic rates etc. We've also compared the 
>> reported busyness to similar metrics reported by other projects such 
>> as VPP and OvS, and found the reported busyness matches with a 
>> difference of +/- 1%. In addition to the DPDK example applications, 
>> we've have shared our plans with end customers and they have confirmed 
>> that the design should work with their applications.
> 
> I am sure l3fwd and testpmd should be ok, I am talking about
> something more complicated/unusual.
> Below are few examples on top of my head when I think your approach
> will generate invalid stats, feel free to correct me, if I am wrong.
> 
> 1) App doing some sort of bonding itslef, i.e:
> 
> struct rte_mbuf pkts[N*2];
> k = rte_eth_rx_burst(p0, q0, pkts, N);
> n = rte_eth_rx_burst(p1, q1, pkts + k, N);
> 
> /*process all packets from both ports at once */
> if (n + k != 0)
>     process_pkts(pkts, n + k);
> 
> Now, as I understand, if n==0, then all cycles spent
> in process_pkts() will be accounted as idle.
> 
> 2) App doing something similar to what pdump library does
> (creates a copy of a packet and sends it somewhere).
> 
> n =rte_eth_rx_burst(p0, q0, &pkt, 1);
> if (n != 0) {
>    dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
>    if (dup_pkt != NULL)
>       process_dup_pkt(dup_pkt);
>    process_pkt(pkt);
> }
> 
> that relates to ring discussion below:
> if there are no mbufs in dup_mp, then ring_deque() will fail
> and process_pkt() will be accounted as idle.
> 
> 3) App dequeues from ring in a bit of unusual way:
> 
> /* idle spin loop */
> while ((n = rte_ring_count(ring)) == 0)
>    ret_pause();
> 
> n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
> if (n != 0)
>    process_pkts(pkts, n);
> 
> here, we can end-up accounting cycles spent in
> idle spin loop as busy.
> 
> 
> 4) Any thread that generates TX traffic on it's own
> (something like testpmd tx_only fwd mode)
> 
> 5) Any thread that depends on both dequeue and enqueue:
> 
> n = rte_ring_dequeue_burst(in_ring, pkts, n, ..);
> ...
> 
> /* loop till all packets are sent out successfully */
> while(rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
>     rte_pause();
> 
> Now, if n > 0, all cycles spent in enqueue() will be accounted
> as 'busy', though from my perspective they probably should
> be considered as 'idle'.
> 
> 
> Also I expect some problems when packet processing is done inside
> rx callbacks, but that probably can be easily fixed.
> 
> 
>>
>>>> It's worth keeping in mind that this feature is compile-time 
>>>> disabled by
>>>> default, so there is no impact to any application/user that does not
>>>> wish to use this, for example applications where this type of busyness
>>>> is not useful, or for applications that already use other mechanisms to
>>>> report similar telemetry.
>>> Not sure that adding in new compile-time option disabled by default 
>>> is a good thing...
>>> For me it would be much more preferable if we'll go through a more 
>>> 'standard' way here:
>>> a) define clear API to enable/disable/collect/report such type of stats.
>>> b) use some of our sample apps to demonstrate how to use it properly 
>>> with user-specific code.
>>> c) if needed implement some 'silent' stats collection for limited 
>>> scope of apps via callbacks -
>>> let say for run-to-completion apps that do use ether and crypto devs 
>>> only.
>>
>> With the compile-time option, its just one build flag for lots of 
>> applications to silently benefit from this.
> 
> There could be a lot of useful and helpfull stats
> that user would like to collect (idle/busy, processing latency, etc.).
> But, if for each such case we will hard-code new stats collection
> into our fast data-path code, then very soon it will become
> completely bloated and unmaintainable.
> I think we need some generic approach for such extra stats collection.
> Callbacks could be one way, Jerin in another mail suggested using 
> existing trace-point hooks, might be it worth to explore it further.
> 
>>
>>>   However, the upside for applications that do
>>>> wish to use this is that there are no code changes required (for the
>>>> most part), the feature simply needs to be enabled at compile-time via
>>>> the meson option.
>>>>
>>>> In scenarios where contextual awareness of the application is needed in
>>>> order to report more accurate "busyness", the
>>>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>>>> sections of code as "busy" or "idle". This way, the application can
>>>> assume control of determining the poll busyness of its lcores while
>>>> leveraging the telemetry hooks adding in this patchset.
>>>>
>>>> We did initially consider implementing this via callbacks, however we
>>>> found this approach to have 2 main drawbacks:
>>>> 1. Application changes are required for all applications wanting to
>>>> report this telemetry - rather than the majority getting it for free.
>>> Didn't get it - why callbacks approach would require user-app changes?
>>> In other situations - rte_power callbacks, pdump, etc. it works 
>>> transparent to
>>> user-leve code.
>>> Why it can't be done here in a similar way?
>>
>>  From my understanding, the callbacks would need to be registered by 
>> the application at the very least (and the callback would have to be 
>> registered per device/pmd/lib).
> 
> Callbacks can be registered by library itself.
> AFAIK, latenc-ystats, power and pdump libraries - all use similar 
> approach. user calls something like xxx_stats_enable() and then library 
> can iterate over all available devices and setup necessary callbacks.
> same for xxx_stats_disable().
> 
>>>
>>>> 2. Ring does not have callback support, meaning pipelined applications
>>>> could not report lcore poll busyness telemetry with this approach.
>>> That's another big concern that I have:
>>> Why you consider that all rings will be used for a pipilines between 
>>> threads and should
>>> always be accounted by your stats?
>>> They could be used for dozens different purposes.
>>> What if that ring is used for mempool, and ring_dequeue() just means 
>>> we try to allocate
>>> an object from the pool? In such case, why failing to allocate an 
>>> object should mean
>>> start of new 'idle cycle'?
>>
>> Another approach could be taken here if the mempool interactions are 
>> of concern.
>>
>>  From our understanding, mempool operations use the "_bulk" APIs, 
>> whereas polling operations use the "_burst" APIs. Would only 
>> timestamping on the "_burst" APIs be better here? That way the mempool 
>> interactions won't be counted towards the busyness.
> 
> Well, it would help to solve one particular case,
> but in general I still think it is incomplete and error-prone.

I agree.

The functionality provided is very useful, and the implementation is 
clever in the way it doesn't require any application modifications. But, 
a clever, useful brittle hack is still a brittle hack.

What if there was instead a busyness module, where the application would 
explicitly report what it was up to. The new library would hook up to 
telemetry just like this patchset does, plus provide an explicit API to 
retrieve lcore thread load.

The service cores framework (fancy name for rte_service.c) could also 
call the lcore load tracking module, provided all services properly 
reported back on whether or not they were doing anything useful with the 
cycles they just spent.

The metrics of such a load tracking module could potentially be used by 
other modules in DPDK, or by the application. It could potentially be 
used for dynamic load balancing of service core services, or for power 
management (e.g, DVFS), or for a potential future deferred-work type 
mechanism more sophisticated than current rte_service, or some green 
threads/coroutines/fiber thingy. The DSW event device could also use it 
to replace its current internal load estimation scheme.

I may be repeating myself here, from past threads.

> What if pipeline app will use ring_count/ring_dequeue_bulk(),
> or even ZC ring API?
> What if app will use something different from rte_ring to pass
> packets between threads/processes?
> As I said before, without some clues from the app, it is probably
> not possible to collect such stats in a proper way.
> 
> 
>> Including support for pipelined applications using rings is key for a 
>> number of usecases, this was highlighted as part of the customer 
>> feedback when we shared the design.
>>
>>>
>>>> Eventdev is another driver which would be completely missed with this
>>>> approach.
>>> Ok, I see two ways here:
>>> - implement CB support for eventdev.
>>> -meanwhile clearly document that this stats are not supported for 
>>> eventdev  scenarios (yet).
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-10-03 20:02               ` Mattias Rönnblom
@ 2022-10-04  9:15                 ` Morten Brørup
  2022-10-04 11:57                   ` Bruce Richardson
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-10-04  9:15 UTC (permalink / raw)
  To: Mattias Rönnblom, Konstantin Ananyev, Kevin Laatz,
	Konstantin Ananyev, dev
  Cc: anatoly.burakov, Conor Walsh, David Hunt, Bruce Richardson,
	Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal,
	Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit,
	Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal,
	Ori Kam, Honnappa Nagarahalli, Mattias Rönnblom

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 3 October 2022 22.02

[...]

> The functionality provided is very useful, and the implementation is
> clever in the way it doesn't require any application modifications.
> But,
> a clever, useful brittle hack is still a brittle hack.
> 
> What if there was instead a busyness module, where the application
> would
> explicitly report what it was up to. The new library would hook up to
> telemetry just like this patchset does, plus provide an explicit API to
> retrieve lcore thread load.
> 
> The service cores framework (fancy name for rte_service.c) could also
> call the lcore load tracking module, provided all services properly
> reported back on whether or not they were doing anything useful with
> the
> cycles they just spent.
> 
> The metrics of such a load tracking module could potentially be used by
> other modules in DPDK, or by the application. It could potentially be
> used for dynamic load balancing of service core services, or for power
> management (e.g, DVFS), or for a potential future deferred-work type
> mechanism more sophisticated than current rte_service, or some green
> threads/coroutines/fiber thingy. The DSW event device could also use it
> to replace its current internal load estimation scheme.

[...]

I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.

This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't require any application modifications, and thus can be deployed with zero effort.

I my opinion, it would be much better with a well designed generic load tracking library, to be called from the application, so it gets correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry. ("Actionable statistics": Statistics that is directly usable for decision making.)

There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into applications before it provides benefits (to the users of the DPDK applications that use the new library).

So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data with this patch. You might also label a generic library "application specific", because it requires that the application uses the library - however that is a common requirement of all DPDK libraries.)

Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be added to OVS within a relatively short time frame (although not as quick as this patch).

I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than expected. I'm sorry to say it, but I think that is the case for this patch. :-(

-Morten

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-10-04  9:15                 ` Morten Brørup
@ 2022-10-04 11:57                   ` Bruce Richardson
  2022-10-04 14:26                     ` Mattias Rönnblom
  2022-10-04 23:30                     ` Konstantin Ananyev
  0 siblings, 2 replies; 87+ messages in thread
From: Bruce Richardson @ 2022-10-04 11:57 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Mattias Rönnblom, Konstantin Ananyev, Kevin Laatz,
	Konstantin Ananyev, dev, anatoly.burakov, Conor Walsh,
	David Hunt, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Fengchengwen, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Mattias Rönnblom

On Tue, Oct 04, 2022 at 11:15:19AM +0200, Morten Brørup wrote:
> > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > Sent: Monday, 3 October 2022 22.02
> 
> [...]
> 
> > The functionality provided is very useful, and the implementation is
> > clever in the way it doesn't require any application modifications.
> > But,
> > a clever, useful brittle hack is still a brittle hack.
> > 

I think that may be a little harsh here. After all, this is a feature which
is build-time disabled and runtime disabled by default, so like many other
components it's designed for use when it makes sense to do so.

Furthermore, I'd just like to point out that the authors, when doing the
patches, have left in the hooks so that even apps, for which the "for-free"
scheme doesn't work, can still leverage the infrastructure to have the app
itself report the busy/free metrics.

> > What if there was instead a busyness module, where the application
> > would
> > explicitly report what it was up to. The new library would hook up to
> > telemetry just like this patchset does, plus provide an explicit API to
> > retrieve lcore thread load.
> > 
> > The service cores framework (fancy name for rte_service.c) could also
> > call the lcore load tracking module, provided all services properly
> > reported back on whether or not they were doing anything useful with
> > the
> > cycles they just spent.
> > 
> > The metrics of such a load tracking module could potentially be used by
> > other modules in DPDK, or by the application. It could potentially be
> > used for dynamic load balancing of service core services, or for power
> > management (e.g, DVFS), or for a potential future deferred-work type
> > mechanism more sophisticated than current rte_service, or some green
> > threads/coroutines/fiber thingy. The DSW event device could also use it
> > to replace its current internal load estimation scheme.
> 
> [...]
> 
> I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.
> 
> This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't require any application modifications, and thus can be deployed with zero effort.
> 
> I my opinion, it would be much better with a well designed generic load tracking library, to be called from the application, so it gets correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry. ("Actionable statistics": Statistics that is directly usable for decision making.)
> 
> There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into applications before it provides benefits (to the users of the DPDK applications that use the new library).
> 
> So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data with this patch. You might also label a generic library "application specific", because it requires that the application uses the library - however that is a common requirement of all DPDK libraries.)
> 
> Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be added to OVS within a relatively short time frame (although not as quick as this patch).
> 
> I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than expected. I'm sorry to say it, but I think that is the case for this patch. :-(
> 

I'd actually like to question this last statement a little.

I think we in the DPDK community are very good at coming up with
theoretical examples where things don't work, but are they really cases
that occur commonly in the real-world? 

I accept, for example, that the "for free" approach would not be suitable
for something like VPP which does multiple polls to gather packets before
processing, but for some of the other cases I'd question their commonality.
For example, a number of objections have focused on the case where
allocation of buffers fails and so the busyness gets counted wrongly.  Are
there really (many) apps out there where running out of buffers is not a
much more serious problem than incorrectly reported busyness stats?

I'd also say that, in my experience, the non-open-source end-user apps tend
very much to use DPDK based on the style of operation given in our DPDK
examples, rather than trying out new or different ways of working. (Maybe
others have different experiences, though, and can comment). I also tend to
believe that open-source software using DPDK probably shows more variety in
how things are done, which is not representative of a lot of non-OSS users
of DPDK.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-10-04 11:57                   ` Bruce Richardson
@ 2022-10-04 14:26                     ` Mattias Rönnblom
  2022-10-04 23:30                     ` Konstantin Ananyev
  1 sibling, 0 replies; 87+ messages in thread
From: Mattias Rönnblom @ 2022-10-04 14:26 UTC (permalink / raw)
  To: Bruce Richardson, Morten Brørup
  Cc: Mattias Rönnblom, Konstantin Ananyev, Kevin Laatz,
	Konstantin Ananyev, dev, anatoly.burakov, Conor Walsh,
	David Hunt, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Fengchengwen, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli

On 2022-10-04 13:57, Bruce Richardson wrote:
> On Tue, Oct 04, 2022 at 11:15:19AM +0200, Morten Brørup wrote:
>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>> Sent: Monday, 3 October 2022 22.02
>>
>> [...]
>>
>>> The functionality provided is very useful, and the implementation is
>>> clever in the way it doesn't require any application modifications.
>>> But,
>>> a clever, useful brittle hack is still a brittle hack.
>>>
> 
> I think that may be a little harsh here. After all, this is a feature which
> is build-time disabled and runtime disabled by default, so like many other
> components it's designed for use when it makes sense to do so.
> 

So you don't think it's a hack? The driver level and the level of basic 
data structures (e.g., the ring) is the appropriate level to classify 
cycles into useful and not useful? And you don't think all the shaky 
assumptions makes it brittle?

Runtime configurable or not doesn't make a difference in this regard, in 
my opinion. On the source code level, this code is there, and making it 
compile-time conditional just makes matters worse.

Had this feature been limited to a small library, it would made a 
difference, but it's smeared across a wide range of APIs, and this list 
is not yet complete. Anything than can produce items of work need to be 
adapted.

That said, it's not obvious how this should be done. The higher-layer 
constructs where this should really be done aren't there in DPDK, at 
least not yet.

Have you considered the option to instrument rte_pause()? It's the 
closes DPDK has to the (now largely extinct) idle loop in an OS kernel. 
It too would be a hack, but maybe a less intrusive one.

> Furthermore, I'd just like to point out that the authors, when doing the
> patches, have left in the hooks so that even apps, for which the "for-free"
> scheme doesn't work, can still leverage the infrastructure to have the app
> itself report the busy/free metrics.
> 

If this is done properly, in a way that the data can reasonably be 
trusted and it can be enabled in runtime without much of a performance 
implication, tracking lcore load could be much more useful, than just 
best effort-telemetry.

Why is it so important not to require changes to the application? The 
changes are likely trivial, not unlike those I've submitted for the 
equivalent bookkeeping for DPDK services.

>>> What if there was instead a busyness module, where the application
>>> would
>>> explicitly report what it was up to. The new library would hook up to
>>> telemetry just like this patchset does, plus provide an explicit API to
>>> retrieve lcore thread load.
>>>
>>> The service cores framework (fancy name for rte_service.c) could also
>>> call the lcore load tracking module, provided all services properly
>>> reported back on whether or not they were doing anything useful with
>>> the
>>> cycles they just spent.
>>>
>>> The metrics of such a load tracking module could potentially be used by
>>> other modules in DPDK, or by the application. It could potentially be
>>> used for dynamic load balancing of service core services, or for power
>>> management (e.g, DVFS), or for a potential future deferred-work type
>>> mechanism more sophisticated than current rte_service, or some green
>>> threads/coroutines/fiber thingy. The DSW event device could also use it
>>> to replace its current internal load estimation scheme.
>>
>> [...]
>>
>> I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.
>>
>> This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't require any application modifications, and thus can be deployed with zero effort.
>>
>> I my opinion, it would be much better with a well designed generic load tracking library, to be called from the application, so it gets correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry. ("Actionable statistics": Statistics that is directly usable for decision making.)
>>
>> There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into applications before it provides benefits (to the users of the DPDK applications that use the new library).
>>
>> So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data with this patch. You might also label a generic library "application specific", because it requires that the application uses the library - however that is a common requirement of all DPDK libraries.)
>>
>> Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be added to OVS within a relatively short time frame (although not as quick as this patch).
>>
>> I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than expected. I'm sorry to say it, but I think that is the case for this patch. :-(
>>
> 
> I'd actually like to question this last statement a little.
> 
> I think we in the DPDK community are very good at coming up with
> theoretical examples where things don't work, but are they really cases
> that occur commonly in the real-world?
> 
> I accept, for example, that the "for free" approach would not be suitable
> for something like VPP which does multiple polls to gather packets before
> processing, but for some of the other cases I'd question their commonality.
> For example, a number of objections have focused on the case where
> allocation of buffers fails and so the busyness gets counted wrongly.  Are
> there really (many) apps out there where running out of buffers is not a
> much more serious problem than incorrectly reported busyness stats?
> 

Many, if not all, non-trivial DPDK applications will poll multiple 
sources of work, some of which almost always will fail to produce any 
items. In such cases, they will transit between the busy and idle state, 
potentially several times, for every iteration in their lcore thread 
poll loop. That will cause a performance degradation if this features is 
used, and there's nothing they can do to fix it from the application 
level, assuming they find this telemetry statistic useful and don't want 
it disabled. So, not "for free", although may be you can still argue 
it's a bargain. :)

> I'd also say that, in my experience, the non-open-source end-user apps tend
> very much to use DPDK based on the style of operation given in our DPDK
> examples, rather than trying out new or different ways of working. (Maybe
> others have different experiences, though, and can comment). I also tend to
> believe that open-source software using DPDK probably shows more variety in
> how things are done, which is not representative of a lot of non-OSS users
> of DPDK.
> 
> Regards,
> /Bruce


^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
  2022-10-04 11:57                   ` Bruce Richardson
  2022-10-04 14:26                     ` Mattias Rönnblom
@ 2022-10-04 23:30                     ` Konstantin Ananyev
  1 sibling, 0 replies; 87+ messages in thread
From: Konstantin Ananyev @ 2022-10-04 23:30 UTC (permalink / raw)
  To: Bruce Richardson, Morten Brørup
  Cc: Mattias Rönnblom, Konstantin Ananyev, Kevin Laatz, dev,
	anatoly.burakov, Conor Walsh, David Hunt, Nicolas Chautru,
	Fan Zhang, Ashish Gupta, Akhil Goyal, Fengchengwen, Ray Kinsella,
	Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	Mattias Rönnblom


Hi Bruce,

> On Tue, Oct 04, 2022 at 11:15:19AM +0200, Morten Brørup wrote:
> > > From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> > > Sent: Monday, 3 October 2022 22.02
> >
> > [...]
> >
> > > The functionality provided is very useful, and the implementation is
> > > clever in the way it doesn't require any application modifications.
> > > But,
> > > a clever, useful brittle hack is still a brittle hack.
> > >
> 
> I think that may be a little harsh here. After all, this is a feature which
> is build-time disabled and runtime disabled by default, so like many other
> components it's designed for use when it makes sense to do so.

Honestly, I don't understand why both you and Kevin think that conditional
compilation provides some sort of indulgence here...
Putting #ifdef around problematic code wouldn't make it any better.
In fact, I think it only makes things worse - adds more confusion,
makes it harder to follow the code, etc. 

> 
> Furthermore, I'd just like to point out that the authors, when doing the
> patches, have left in the hooks so that even apps, for which the "for-free"
> scheme doesn't work, can still leverage the infrastructure to have the app
> itself report the busy/free metrics.

Ok, then it is probably good opportunity not to push for problematic solution,
but try to exploit these hook-points?
Take one of existing DPDK examples, add code to expose these hook points.
That will also demonstrate for the user how to use these hooks properly,
and how difficult It would be to adopt such approach.   
 
> > > What if there was instead a busyness module, where the application
> > > would
> > > explicitly report what it was up to. The new library would hook up to
> > > telemetry just like this patchset does, plus provide an explicit API to
> > > retrieve lcore thread load.
> > >
> > > The service cores framework (fancy name for rte_service.c) could also
> > > call the lcore load tracking module, provided all services properly
> > > reported back on whether or not they were doing anything useful with
> > > the
> > > cycles they just spent.
> > >
> > > The metrics of such a load tracking module could potentially be used by
> > > other modules in DPDK, or by the application. It could potentially be
> > > used for dynamic load balancing of service core services, or for power
> > > management (e.g, DVFS), or for a potential future deferred-work type
> > > mechanism more sophisticated than current rte_service, or some green
> > > threads/coroutines/fiber thingy. The DSW event device could also use it
> > > to replace its current internal load estimation scheme.
> >
> > [...]
> >
> > I agree 100 % with everything Mattias wrote above, and I would like to voice my opinion too.
> >
> > This patch is full of preconditions and assumptions. Its only true advantage (vs. a generic load tracking library) is that it doesn't
> require any application modifications, and thus can be deployed with zero effort.
> >
> > I my opinion, it would be much better with a well designed generic load tracking library, to be called from the application, so it gets
> correct information about what the lcores spend their cycles doing. And as Mattias mentions: With the appropriate API for
> consumption of the collected data, it could also provide actionable statistics for use by the application itself, not just telemetry.
> ("Actionable statistics": Statistics that is directly usable for decision making.)
> >
> > There is also the aspect of time-to-benefit: This patch immediately provides benefits (to the users of the DPDK applications that
> meet the preconditions/assumptions of the patch), while a generic load tracking library will take years to get integrated into
> applications before it provides benefits (to the users of the DPDK applications that use the new library).
> >
> > So, we should ask ourselves: Do we want an application-specific solution with a short time-to-benefit, or a generic solution with a
> long time-to-benefit? (I use the term "application specific" because not all applications can be tweaked to provide meaningful data
> with this patch. You might also label a generic library "application specific", because it requires that the application uses the library -
> however that is a common requirement of all DPDK libraries.)
> >
> > Furthermore, if the proposed patch is primarily for the benefit of OVS, I suppose that calls to a generic load tracking library could be
> added to OVS within a relatively short time frame (although not as quick as this patch).
> >
> > I guess that the developers of this patch initially thought that it was generic and usable for the majority of applications, and it came
> as somewhat a surprise that it wasn't as generic as expected. The DPDK community has a good review process with open discussions
> and sharing of thoughts and ideas. Sometimes, an idea doesn't fly, because the corner cases turn out to be more common than
> expected. I'm sorry to say it, but I think that is the case for this patch. :-(
> >
> 
> I'd actually like to question this last statement a little.
> 
> I think we in the DPDK community are very good at coming up with
> theoretical examples where things don't work, but are they really cases
> that occur commonly in the real-world?
> 
> I accept, for example, that the "for free" approach would not be suitable
> for something like VPP which does multiple polls to gather packets before
> processing, but for some of the other cases I'd question their commonality.
> For example, a number of objections have focused on the case where
> allocation of buffers fails and so the busyness gets counted wrongly.  Are
> there really (many) apps out there where running out of buffers is not a
> much more serious problem than incorrectly reported busyness stats?

Obviously, inability to dynamically allocate a memory could flag a serious problem. 
Though I don't see why it should be treated as an excuse to provide a misleading statistics.
There are many real-world network appliances that supposed
to keep working properly even under severe memory pressure.
As an example: suppose your app is doing some sort of TCP connection tracking.
So, for  every new flow you need to allocate some socket-like structure.
Also suppose that for performance reasons you use DPDK mempool to manage
these structures.
Now, it could be situations (SYN flood attack)  when you run out of your sockets.
In that situation it is probably ok to start dropping such packets,
but traffic belonging to already existing connections, plus non-TCP traffic    
still expected to be handled properly.

> 
> I'd also say that, in my experience, the non-open-source end-user apps tend
> very much to use DPDK based on the style of operation given in our DPDK
> examples, rather than trying out new or different ways of working. (Maybe
> others have different experiences, though, and can comment). I also tend to
> believe that open-source software using DPDK probably shows more variety in
> how things are done, which is not representative of a lot of non-OSS users
> of DPDK.
> 
> Regards,
> /Bruce


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
                     ` (4 preceding siblings ...)
  2022-09-14 14:33   ` [PATCH v7 0/4] Add lcore poll busyness telemetry Stephen Hemminger
@ 2022-10-05 13:44   ` Kevin Laatz
  2022-10-06 13:25     ` Morten Brørup
  5 siblings, 1 reply; 87+ messages in thread
From: Kevin Laatz @ 2022-10-05 13:44 UTC (permalink / raw)
  To: dev
  Cc: anatoly.burakov, Bruce Richardson, Morten Brørup,
	Mattias Rönnblom, Konstantin Ananyev, Conor Walsh,
	David Hunt, Nicolas Chautru, Fan Zhang, Ashish Gupta,
	Akhil Goyal, Chengwen Feng, Ray Kinsella, Thomas Monjalon,
	Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena,
	Hemant Agrawal, Ori Kam, Honnappa Nagarahalli, mattias.ronnblom,
	David Marchand

On 14/09/2022 10:29, Kevin Laatz wrote:
> Currently, there is no way to measure lcore polling busyness in a passive
> way, without any modifications to the application. This patchset adds a new
> EAL API that will be able to passively track core polling busyness. As part
> of the set, new telemetry endpoints are added to read the generate metrics.
>
> ---
> v7:
>    * Rename funcs, vars, files to include "poll" where missing.
>
> v6:
>    * Add API and perf unit tests
>
> v5:
>    * Fix Windows build
>    * Make lcore_telemetry_free() an internal interface
>    * Minor cleanup
>
> v4:
>    * Fix doc build
>    * Rename timestamp macro to RTE_LCORE_POLL_BUSYNESS_TIMESTAMP
>    * Make enable/disable read and write atomic
>    * Change rte_lcore_poll_busyness_enabled_set() param to bool
>    * Move mem alloc from enable/disable to init/cleanup
>    * Other minor fixes
>
> v3:
>    * Fix missing renaming to poll busyness
>    * Fix clang compilation
>    * Fix arm compilation
>
> v2:
>    * Use rte_get_tsc_hz() to adjust the telemetry period
>    * Rename to reflect polling busyness vs general busyness
>    * Fix segfault when calling telemetry timestamp from an unregistered
>      non-EAL thread.
>    * Minor cleanup
>
> Anatoly Burakov (2):
>    eal: add lcore poll busyness telemetry
>    eal: add cpuset lcore telemetry entries
>
> Kevin Laatz (2):
>    app/test: add unit tests for lcore poll busyness
>    doc: add howto guide for lcore poll busyness
>
>   app/test/meson.build                          |   4 +
>   app/test/test_lcore_poll_busyness_api.c       | 134 +++++++
>   app/test/test_lcore_poll_busyness_perf.c      |  72 ++++
>   config/meson.build                            |   1 +
>   config/rte_config.h                           |   1 +
>   doc/guides/howto/index.rst                    |   1 +
>   doc/guides/howto/lcore_poll_busyness.rst      |  93 +++++
>   lib/bbdev/rte_bbdev.h                         |  17 +-
>   lib/compressdev/rte_compressdev.c             |   2 +
>   lib/cryptodev/rte_cryptodev.h                 |   2 +
>   lib/distributor/rte_distributor.c             |  21 +-
>   lib/distributor/rte_distributor_single.c      |  14 +-
>   lib/dmadev/rte_dmadev.h                       |  15 +-
>   .../common/eal_common_lcore_poll_telemetry.c  | 350 ++++++++++++++++++
>   lib/eal/common/meson.build                    |   1 +
>   lib/eal/freebsd/eal.c                         |   1 +
>   lib/eal/include/rte_lcore.h                   |  85 ++++-
>   lib/eal/linux/eal.c                           |   1 +
>   lib/eal/meson.build                           |   3 +
>   lib/eal/version.map                           |   7 +
>   lib/ethdev/rte_ethdev.h                       |   2 +
>   lib/eventdev/rte_eventdev.h                   |  10 +-
>   lib/rawdev/rte_rawdev.c                       |   6 +-
>   lib/regexdev/rte_regexdev.h                   |   5 +-
>   lib/ring/rte_ring_elem_pvt.h                  |   1 +
>   meson_options.txt                             |   2 +
>   26 files changed, 826 insertions(+), 25 deletions(-)
>   create mode 100644 app/test/test_lcore_poll_busyness_api.c
>   create mode 100644 app/test/test_lcore_poll_busyness_perf.c
>   create mode 100644 doc/guides/howto/lcore_poll_busyness.rst
>   create mode 100644 lib/eal/common/eal_common_lcore_poll_telemetry.c

Based on the feedback in the discussions on this patchset, we have 
decided to revoke the submission of this patchset for the 22.11 release.

We will re-evaluate the design with the aim to provide a more acceptable 
solution in a future release.

---
Kevin



^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-10-05 13:44   ` Kevin Laatz
@ 2022-10-06 13:25     ` Morten Brørup
  2022-10-06 15:26       ` Mattias Rönnblom
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-10-06 13:25 UTC (permalink / raw)
  To: Kevin Laatz, dev
  Cc: anatoly.burakov, Bruce Richardson, Mattias Rönnblom,
	Konstantin Ananyev, Conor Walsh, David Hunt, Nicolas Chautru,
	Fan Zhang, Ashish Gupta, Akhil Goyal, Chengwen Feng,
	Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, mattias.ronnblom, David Marchand

> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> Sent: Wednesday, 5 October 2022 15.45
> 
> On 14/09/2022 10:29, Kevin Laatz wrote:
> > Currently, there is no way to measure lcore polling busyness in a
> passive
> > way, without any modifications to the application. This patchset adds
> a new
> > EAL API that will be able to passively track core polling busyness.
> As part
> > of the set, new telemetry endpoints are added to read the generate
> metrics.
> >
> > ---
> 
> Based on the feedback in the discussions on this patchset, we have
> decided to revoke the submission of this patchset for the 22.11
> release.
> 
> We will re-evaluate the design with the aim to provide a more
> acceptable
> solution in a future release.

Good call. Thank you!

I suggest having an open discussion about requirements/expectations for such a solution, before you implement any code.

We haven't found the golden solution for our application, but we have discussed it quite a lot internally. Here are some of our thoughts:

The application must feed the library with information about how much work it is doing.

E.g. A pipeline stage that polls the NIC for N ingress packets could feed the busyness library with values such as:
 - "no work": zero packets received,
 - "25 % utilization": less than N packets received (in this example: 8 of max 32 packets = 25 %), or
 - "100% utilization, possibly more work to do": all N packets received (more packets could be ready in the queue, but we don't know).

A pipeline stage that services a QoS scheduler could additionally feed the library with values such as:
 - "100% utilization, definitely more work to do": stopped processing due to some "max work per call" limitation.
 - "waiting, no work until [DELAY] ns": current timeslot has been filled, waiting for the next timeslot to start.

It is important to note that any pipeline stage processing packets (or some other objects!) might process a different maximum number of objects than the ingress pipeline stage. What I mean is: The number N might not be the same for all pipeline stages.

The information should be collected per lcore or thread, also to prevent cache trashing.

Additionally, it could be collected per pipeline stage too, making the collection two-dimensional. This would essentially make it a profiling library, where you - in addition to seeing how much time is spent working - also can see which work the time is spent on.

As mentioned during the previous discussions, APIs should be provided to make the collected information machine readable, so the application can use it for power management and other purposes.

One of the simple things I would like to be able to extract from such a library is CPU Utilization (percentage) per lcore.

And since I want the CPU Utilization to be shown for multiple the time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second or 1 millisecond) the output data should be exposed as a counter type, so my "loadavg application" can calculate the rate by subtracting the previously obtained value from the current value and divide the difference by the time interval.

-Morten

^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-10-06 13:25     ` Morten Brørup
@ 2022-10-06 15:26       ` Mattias Rönnblom
  2022-10-10 15:22         ` Morten Brørup
  0 siblings, 1 reply; 87+ messages in thread
From: Mattias Rönnblom @ 2022-10-06 15:26 UTC (permalink / raw)
  To: Morten Brørup, Kevin Laatz, dev
  Cc: anatoly.burakov, Bruce Richardson, Mattias Rönnblom,
	Konstantin Ananyev, Conor Walsh, David Hunt, Nicolas Chautru,
	Fan Zhang, Ashish Gupta, Akhil Goyal, Chengwen Feng,
	Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, David Marchand

On 2022-10-06 15:25, Morten Brørup wrote:
>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>> Sent: Wednesday, 5 October 2022 15.45
>>
>> On 14/09/2022 10:29, Kevin Laatz wrote:
>>> Currently, there is no way to measure lcore polling busyness in a
>> passive
>>> way, without any modifications to the application. This patchset adds
>> a new
>>> EAL API that will be able to passively track core polling busyness.
>> As part
>>> of the set, new telemetry endpoints are added to read the generate
>> metrics.
>>>
>>> ---
>>
>> Based on the feedback in the discussions on this patchset, we have
>> decided to revoke the submission of this patchset for the 22.11
>> release.
>>
>> We will re-evaluate the design with the aim to provide a more
>> acceptable
>> solution in a future release.
> 
> Good call. Thank you!
> 
> I suggest having an open discussion about requirements/expectations for such a solution, before you implement any code.
> 
> We haven't found the golden solution for our application, but we have discussed it quite a lot internally. Here are some of our thoughts:
> 
> The application must feed the library with information about how much work it is doing.
> 
> E.g. A pipeline stage that polls the NIC for N ingress packets could feed the busyness library with values such as:
>   - "no work": zero packets received,
>   - "25 % utilization": less than N packets received (in this example: 8 of max 32 packets = 25 %), or
>   - "100% utilization, possibly more work to do": all N packets received (more packets could be ready in the queue, but we don't know).
> 

If some lcore's NIC RX queue always, for every poll operation, produces 
8 packets out of a max burst of 32, I would argue that lcore is 100% 
busy. With always something to do, it doesn't have a single cycle to spare.

It seems to me that you basically have two options, if you do 
application-level "busyness" reporting.

Either the application
a) reports when a section of useful work begins, and when it ends, as 
two separate function calls.
b) after having taken a time stamp, and having completed a section of 
code which turned out to be something useful, it reports back to the 
busyness module with one function call, containing the busy cycles spent.

In a), the two calls could be to the same function, with a boolean 
argument informing the busyness module if this is the beginning of a 
busy or an idle period. In such case, just pass "num_pkts_dequeued > 0" 
to the call.

What you would like is a solution which avoid ping-pong between idle and 
busy states (with the resulting time stamping and computations) in 
scenarios where a lcore thread mix sources of work which often have 
items available, with sources that do not (e.g., packets in a RX queue 
versus reassembly timeouts in a core-local timer wheel). It would be 
better in that situation, to attribute the timer wheel poll cycles as 
busy cycles.

Another crucial aspect is that you want the API to be simple, and code 
changes to be minimal.

It's unclear to me if you need to account for both idle and busy cycles, 
or only busy cycles, and assume all other cycles are idle. The will be 
for a traditional 1:1 EAL thread <-> CPU core mapping, but not if the 
"--lcores" parameter is used to create floating EAL threads, and EAL 
threads which share the same core, and thus may not be able to use 100% 
of the TSC cycles.

> A pipeline stage that services a QoS scheduler could additionally feed the library with values such as:
>   - "100% utilization, definitely more work to do": stopped processing due to some "max work per call" limitation.
>   - "waiting, no work until [DELAY] ns": current timeslot has been filled, waiting for the next timeslot to start.
> 
> It is important to note that any pipeline stage processing packets (or some other objects!) might process a different maximum number of objects than the ingress pipeline stage. What I mean is: The number N might not be the same for all pipeline stages.
> 
> 
> The information should be collected per lcore or thread, also to prevent cache trashing.
> 
> Additionally, it could be collected per pipeline stage too, making the collection two-dimensional. This would essentially make it a profiling library, where you - in addition to seeing how much time is spent working - also can see which work the time is spent on.
> 

If you introduce subcategories of "busy", like "busy-with-X", and 
"busy-with-Y", the book keeping will be more expensive, since you will 
transit between states even for 100% busy lcores (which in principle you 
never, or at least very rarely, need to do if you have only busy and 
idle as states).

If your application is organized as DPDK services, you will get this 
already today, on a DPDK service level.

If you have your application organized as a pipeline, and you use an 
event device as a scheduler between the stages, that event device has a 
good opportunity to do this kind of bookkeeping. DSW, for example, keeps 
track of the average processing latency for events, and how many events 
of various types have been processed.

> As mentioned during the previous discussions, APIs should be provided to make the collected information machine readable, so the application can use it for power management and other purposes.
> 
> One of the simple things I would like to be able to extract from such a library is CPU Utilization (percentage) per lcore. >
> And since I want the CPU Utilization to be shown for multiple the time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second or 1 millisecond) the output data should be exposed as a counter type, so my "loadavg application" can calculate the rate by subtracting the previously obtained value from the current value and divide the difference by the time interval.
> 

I agree. In addition, you also want the "raw data" (lcore busy cycles) 
so you can do you own sampling, at your own favorite-length intervals.

> -Morten
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-10-06 15:26       ` Mattias Rönnblom
@ 2022-10-10 15:22         ` Morten Brørup
  2022-10-10 17:38           ` Mattias Rönnblom
  0 siblings, 1 reply; 87+ messages in thread
From: Morten Brørup @ 2022-10-10 15:22 UTC (permalink / raw)
  To: Mattias Rönnblom, Kevin Laatz, dev
  Cc: anatoly.burakov, Bruce Richardson, Mattias Rönnblom,
	Konstantin Ananyev, Conor Walsh, David Hunt, Nicolas Chautru,
	Fan Zhang, Ashish Gupta, Akhil Goyal, Chengwen Feng,
	Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
	Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
	Honnappa Nagarahalli, David Marchand

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Thursday, 6 October 2022 17.27
> 
> On 2022-10-06 15:25, Morten Brørup wrote:
> >> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> >> Sent: Wednesday, 5 October 2022 15.45
> >>
> >> On 14/09/2022 10:29, Kevin Laatz wrote:
> >>> Currently, there is no way to measure lcore polling busyness in a
> >> passive
> >>> way, without any modifications to the application. This patchset
> adds
> >> a new
> >>> EAL API that will be able to passively track core polling busyness.
> >> As part
> >>> of the set, new telemetry endpoints are added to read the generate
> >> metrics.
> >>>
> >>> ---
> >>
> >> Based on the feedback in the discussions on this patchset, we have
> >> decided to revoke the submission of this patchset for the 22.11
> >> release.
> >>
> >> We will re-evaluate the design with the aim to provide a more
> >> acceptable
> >> solution in a future release.
> >
> > Good call. Thank you!
> >
> > I suggest having an open discussion about requirements/expectations
> for such a solution, before you implement any code.
> >
> > We haven't found the golden solution for our application, but we have
> discussed it quite a lot internally. Here are some of our thoughts:
> >
> > The application must feed the library with information about how much
> work it is doing.
> >
> > E.g. A pipeline stage that polls the NIC for N ingress packets could
> feed the busyness library with values such as:
> >   - "no work": zero packets received,
> >   - "25 % utilization": less than N packets received (in this
> example: 8 of max 32 packets = 25 %), or
> >   - "100% utilization, possibly more work to do": all N packets
> received (more packets could be ready in the queue, but we don't know).
> >
> 
> If some lcore's NIC RX queue always, for every poll operation, produces
> 8 packets out of a max burst of 32, I would argue that lcore is 100%
> busy. With always something to do, it doesn't have a single cycle to
> spare.

I would argue that if I have four cores each only processing 25 % of the packets, one core would suffice instead. Or, the application could schedule the function at 1/4 of the frequency it does now (e.g. call the function once every 40 microseconds instead of once every 10 microseconds).

However, the business does not scale linearly with the number of packets processed - which an intended benefit of bursting.

Here are some real life numbers from our in-house profiler library in a production environment, which says that polling the NIC for packets takes on average:

104 cycles when the NIC returns 0 packets,
529 cycles when the NIC returns 1 packet,
679 cycles when the NIC returns 8 packets, and
1275 cycles when the NIC returns a full burst of 32 packets.

(This includes some overhead from our application, so you will see other numbers in your application.)

> 
> It seems to me that you basically have two options, if you do
> application-level "busyness" reporting.
> 
> Either the application
> a) reports when a section of useful work begins, and when it ends, as
> two separate function calls.
> b) after having taken a time stamp, and having completed a section of
> code which turned out to be something useful, it reports back to the
> busyness module with one function call, containing the busy cycles
> spent.
> 
> In a), the two calls could be to the same function, with a boolean
> argument informing the busyness module if this is the beginning of a
> busy or an idle period. In such case, just pass "num_pkts_dequeued > 0"
> to the call.

Our profiler library has a start()and an end() function, and an end_and_start() function for when a section directly follows the preceding section (to only take one timestamp instead of two).

> 
> What you would like is a solution which avoid ping-pong between idle
> and
> busy states (with the resulting time stamping and computations) in
> scenarios where a lcore thread mix sources of work which often have
> items available, with sources that do not (e.g., packets in a RX queue
> versus reassembly timeouts in a core-local timer wheel). It would be
> better in that situation, to attribute the timer wheel poll cycles as
> busy cycles.
> 
> Another crucial aspect is that you want the API to be simple, and code
> changes to be minimal.
> 
> It's unclear to me if you need to account for both idle and busy
> cycles,
> or only busy cycles, and assume all other cycles are idle. The will be
> for a traditional 1:1 EAL thread <-> CPU core mapping, but not if the
> "--lcores" parameter is used to create floating EAL threads, and EAL
> threads which share the same core, and thus may not be able to use 100%
> of the TSC cycles.
> 
> > A pipeline stage that services a QoS scheduler could additionally
> feed the library with values such as:
> >   - "100% utilization, definitely more work to do": stopped
> processing due to some "max work per call" limitation.
> >   - "waiting, no work until [DELAY] ns": current timeslot has been
> filled, waiting for the next timeslot to start.
> >
> > It is important to note that any pipeline stage processing packets
> (or some other objects!) might process a different maximum number of
> objects than the ingress pipeline stage. What I mean is: The number N
> might not be the same for all pipeline stages.
> >
> >
> > The information should be collected per lcore or thread, also to
> prevent cache trashing.
> >
> > Additionally, it could be collected per pipeline stage too, making
> the collection two-dimensional. This would essentially make it a
> profiling library, where you - in addition to seeing how much time is
> spent working - also can see which work the time is spent on.
> >
> 
> If you introduce subcategories of "busy", like "busy-with-X", and
> "busy-with-Y", the book keeping will be more expensive, since you will
> transit between states even for 100% busy lcores (which in principle
> you
> never, or at least very rarely, need to do if you have only busy and
> idle as states).
> 
> If your application is organized as DPDK services, you will get this
> already today, on a DPDK service level.
> 
> If you have your application organized as a pipeline, and you use an
> event device as a scheduler between the stages, that event device has a
> good opportunity to do this kind of bookkeeping. DSW, for example,
> keeps
> track of the average processing latency for events, and how many events
> of various types have been processed.
> 

Lots of good input, Mattias. Let's see what others suggest. :-)

> > As mentioned during the previous discussions, APIs should be provided
> to make the collected information machine readable, so the application
> can use it for power management and other purposes.
> >
> > One of the simple things I would like to be able to extract from such
> a library is CPU Utilization (percentage) per lcore. >
> > And since I want the CPU Utilization to be shown for multiple the
> time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second
> or 1 millisecond) the output data should be exposed as a counter type,
> so my "loadavg application" can calculate the rate by subtracting the
> previously obtained value from the current value and divide the
> difference by the time interval.
> >
> 
> I agree. In addition, you also want the "raw data" (lcore busy cycles)
> so you can do you own sampling, at your own favorite-length intervals.
> 
> > -Morten
> >


^ permalink raw reply	[flat|nested] 87+ messages in thread

* Re: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-10-10 15:22         ` Morten Brørup
@ 2022-10-10 17:38           ` Mattias Rönnblom
  2022-10-12 12:25             ` Morten Brørup
  0 siblings, 1 reply; 87+ messages in thread
From: Mattias Rönnblom @ 2022-10-10 17:38 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, Kevin Laatz, dev
  Cc: anatoly.burakov, Bruce Richardson, Konstantin Ananyev,
	Conor Walsh, David Hunt, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, Chengwen Feng, Ray Kinsella,
	Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	David Marchand

On 2022-10-10 17:22, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Thursday, 6 October 2022 17.27
>>
>> On 2022-10-06 15:25, Morten Brørup wrote:
>>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
>>>> Sent: Wednesday, 5 October 2022 15.45
>>>>
>>>> On 14/09/2022 10:29, Kevin Laatz wrote:
>>>>> Currently, there is no way to measure lcore polling busyness in a
>>>> passive
>>>>> way, without any modifications to the application. This patchset
>> adds
>>>> a new
>>>>> EAL API that will be able to passively track core polling busyness.
>>>> As part
>>>>> of the set, new telemetry endpoints are added to read the generate
>>>> metrics.
>>>>>
>>>>> ---
>>>>
>>>> Based on the feedback in the discussions on this patchset, we have
>>>> decided to revoke the submission of this patchset for the 22.11
>>>> release.
>>>>
>>>> We will re-evaluate the design with the aim to provide a more
>>>> acceptable
>>>> solution in a future release.
>>>
>>> Good call. Thank you!
>>>
>>> I suggest having an open discussion about requirements/expectations
>> for such a solution, before you implement any code.
>>>
>>> We haven't found the golden solution for our application, but we have
>> discussed it quite a lot internally. Here are some of our thoughts:
>>>
>>> The application must feed the library with information about how much
>> work it is doing.
>>>
>>> E.g. A pipeline stage that polls the NIC for N ingress packets could
>> feed the busyness library with values such as:
>>>    - "no work": zero packets received,
>>>    - "25 % utilization": less than N packets received (in this
>> example: 8 of max 32 packets = 25 %), or
>>>    - "100% utilization, possibly more work to do": all N packets
>> received (more packets could be ready in the queue, but we don't know).
>>>
>>
>> If some lcore's NIC RX queue always, for every poll operation, produces
>> 8 packets out of a max burst of 32, I would argue that lcore is 100%
>> busy. With always something to do, it doesn't have a single cycle to
>> spare.
> 
> I would argue that if I have four cores each only processing 25 % of the packets, one core would suffice instead. Or, the application could schedule the function at 1/4 of the frequency it does now (e.g. call the function once every 40 microseconds instead of once every 10 microseconds).
> 

Do you mean "only processing packets 25% of the time"? If yes, being 
able to replace four core @ 25% utilization with one core @ 100% might 
be a reasonable first guess. I'm not sure how it relates to what I 
wrote, though.

> However, the business does not scale linearly with the number of packets processed - which an intended benefit of bursting.
> 

Sure, there's usually a non-linear relationship between the system 
capacity used and the resulting CPU utilization. It can be both in the 
manner you describe below, with the per-packet processing latency 
reduced at higher rates, or the other way around. For example, NIC RX 
LLC stashing may cause a lot of LLC evictions, and generally the 
application might have a larger working set size during high load, so 
there may be forces working in the other direction as well.

It seems to me "busyness" telemetry value should just be lcore thread 
CPU utilization (in total, or with some per-module breakdown). If you 
want to know how much of the system's capacity is used, you need help 
from an application-specific agent, equipped with a model of how CPU 
utilization and capacity relates. Such a heuristic could take other 
factors into account as well, e.g. the average queue sizes, packet 
rates, packet sizes etc.

In my experience, for high touch applications (i.e., those that spends 
thousands of cycles per packet), CPU utilization is a pretty decent 
approximation on how much of the system's capacity is used.

> Here are some real life numbers from our in-house profiler library in a production environment, which says that polling the NIC for packets takes on average:
> 
> 104 cycles when the NIC returns 0 packets,
> 529 cycles when the NIC returns 1 packet,
> 679 cycles when the NIC returns 8 packets, and
> 1275 cycles when the NIC returns a full burst of 32 packets.
> 
> (This includes some overhead from our application, so you will see other numbers in your application.)
> 
>>
>> It seems to me that you basically have two options, if you do
>> application-level "busyness" reporting.
>>
>> Either the application
>> a) reports when a section of useful work begins, and when it ends, as
>> two separate function calls.
>> b) after having taken a time stamp, and having completed a section of
>> code which turned out to be something useful, it reports back to the
>> busyness module with one function call, containing the busy cycles
>> spent.
>>
>> In a), the two calls could be to the same function, with a boolean
>> argument informing the busyness module if this is the beginning of a
>> busy or an idle period. In such case, just pass "num_pkts_dequeued > 0"
>> to the call.
> 
> Our profiler library has a start()and an end() function, and an end_and_start() function for when a section directly follows the preceding section (to only take one timestamp instead of two).
> 

I like the idea of a end_and_start() (except for the name, maybe).

>>
>> What you would like is a solution which avoid ping-pong between idle
>> and
>> busy states (with the resulting time stamping and computations) in
>> scenarios where a lcore thread mix sources of work which often have
>> items available, with sources that do not (e.g., packets in a RX queue
>> versus reassembly timeouts in a core-local timer wheel). It would be
>> better in that situation, to attribute the timer wheel poll cycles as
>> busy cycles.
>>
>> Another crucial aspect is that you want the API to be simple, and code
>> changes to be minimal.
>>
>> It's unclear to me if you need to account for both idle and busy
>> cycles,
>> or only busy cycles, and assume all other cycles are idle. The will be
>> for a traditional 1:1 EAL thread <-> CPU core mapping, but not if the
>> "--lcores" parameter is used to create floating EAL threads, and EAL
>> threads which share the same core, and thus may not be able to use 100%
>> of the TSC cycles.
>>
>>> A pipeline stage that services a QoS scheduler could additionally
>> feed the library with values such as:
>>>    - "100% utilization, definitely more work to do": stopped
>> processing due to some "max work per call" limitation.
>>>    - "waiting, no work until [DELAY] ns": current timeslot has been
>> filled, waiting for the next timeslot to start.
>>>
>>> It is important to note that any pipeline stage processing packets
>> (or some other objects!) might process a different maximum number of
>> objects than the ingress pipeline stage. What I mean is: The number N
>> might not be the same for all pipeline stages.
>>>
>>>
>>> The information should be collected per lcore or thread, also to
>> prevent cache trashing.
>>>
>>> Additionally, it could be collected per pipeline stage too, making
>> the collection two-dimensional. This would essentially make it a
>> profiling library, where you - in addition to seeing how much time is
>> spent working - also can see which work the time is spent on.
>>>
>>
>> If you introduce subcategories of "busy", like "busy-with-X", and
>> "busy-with-Y", the book keeping will be more expensive, since you will
>> transit between states even for 100% busy lcores (which in principle
>> you
>> never, or at least very rarely, need to do if you have only busy and
>> idle as states).
>>
>> If your application is organized as DPDK services, you will get this
>> already today, on a DPDK service level.
>>
>> If you have your application organized as a pipeline, and you use an
>> event device as a scheduler between the stages, that event device has a
>> good opportunity to do this kind of bookkeeping. DSW, for example,
>> keeps
>> track of the average processing latency for events, and how many events
>> of various types have been processed.
>>
> 
> Lots of good input, Mattias. Let's see what others suggest. :-)
> 
>>> As mentioned during the previous discussions, APIs should be provided
>> to make the collected information machine readable, so the application
>> can use it for power management and other purposes.
>>>
>>> One of the simple things I would like to be able to extract from such
>> a library is CPU Utilization (percentage) per lcore. >
>>> And since I want the CPU Utilization to be shown for multiple the
>> time intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second
>> or 1 millisecond) the output data should be exposed as a counter type,
>> so my "loadavg application" can calculate the rate by subtracting the
>> previously obtained value from the current value and divide the
>> difference by the time interval.
>>>
>>
>> I agree. In addition, you also want the "raw data" (lcore busy cycles)
>> so you can do you own sampling, at your own favorite-length intervals.
>>
>>> -Morten
>>>
> 

^ permalink raw reply	[flat|nested] 87+ messages in thread

* RE: [PATCH v7 0/4] Add lcore poll busyness telemetry
  2022-10-10 17:38           ` Mattias Rönnblom
@ 2022-10-12 12:25             ` Morten Brørup
  0 siblings, 0 replies; 87+ messages in thread
From: Morten Brørup @ 2022-10-12 12:25 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, Kevin Laatz, dev
  Cc: anatoly.burakov, Bruce Richardson, Konstantin Ananyev,
	Conor Walsh, David Hunt, Nicolas Chautru, Fan Zhang,
	Ashish Gupta, Akhil Goyal, Chengwen Feng, Ray Kinsella,
	Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob,
	Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli,
	David Marchand

> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Monday, 10 October 2022 19.39
> 
> On 2022-10-10 17:22, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Thursday, 6 October 2022 17.27
> >>
> >> On 2022-10-06 15:25, Morten Brørup wrote:
> >>>> From: Kevin Laatz [mailto:kevin.laatz@intel.com]
> >>>> Sent: Wednesday, 5 October 2022 15.45
> >>>>
> >>>> On 14/09/2022 10:29, Kevin Laatz wrote:
> >>>>> Currently, there is no way to measure lcore polling busyness in a
> >>>> passive
> >>>>> way, without any modifications to the application. This patchset
> >> adds
> >>>> a new
> >>>>> EAL API that will be able to passively track core polling
> busyness.
> >>>> As part
> >>>>> of the set, new telemetry endpoints are added to read the
> generate
> >>>> metrics.
> >>>>>
> >>>>> ---
> >>>>
> >>>> Based on the feedback in the discussions on this patchset, we have
> >>>> decided to revoke the submission of this patchset for the 22.11
> >>>> release.
> >>>>
> >>>> We will re-evaluate the design with the aim to provide a more
> >>>> acceptable
> >>>> solution in a future release.
> >>>
> >>> Good call. Thank you!
> >>>
> >>> I suggest having an open discussion about requirements/expectations
> >> for such a solution, before you implement any code.
> >>>
> >>> We haven't found the golden solution for our application, but we
> have
> >> discussed it quite a lot internally. Here are some of our thoughts:
> >>>
> >>> The application must feed the library with information about how
> much
> >> work it is doing.
> >>>
> >>> E.g. A pipeline stage that polls the NIC for N ingress packets
> could
> >> feed the busyness library with values such as:
> >>>    - "no work": zero packets received,
> >>>    - "25 % utilization": less than N packets received (in this
> >> example: 8 of max 32 packets = 25 %), or
> >>>    - "100% utilization, possibly more work to do": all N packets
> >> received (more packets could be ready in the queue, but we don't
> know).
> >>>
> >>
> >> If some lcore's NIC RX queue always, for every poll operation,
> produces
> >> 8 packets out of a max burst of 32, I would argue that lcore is 100%
> >> busy. With always something to do, it doesn't have a single cycle to
> >> spare.
> >
> > I would argue that if I have four cores each only processing 25 % of
> the packets, one core would suffice instead. Or, the application could
> schedule the function at 1/4 of the frequency it does now (e.g. call
> the function once every 40 microseconds instead of once every 10
> microseconds).
> >
> 
> Do you mean "only processing packets 25% of the time"? If yes, being
> able to replace four core @ 25% utilization with one core @ 100% might
> be a reasonable first guess. I'm not sure how it relates to what I
> wrote, though.

I meant: "only processing 25 % of the maximum number of packets it could have processed (if 100 % utilized)"

A service is allowed to do some fixed maximum amount of work every time its function is called; in my example receive 32 packets. But if there were only 8 packets to receive in a call, the utilization in that call was only 25 % (although the lcore was 100 % active in duration of the function call). So the function can be called less frequently.

If the utilization per packet is linear, then it makes no difference. However, it is non-linear, so it is more efficient to call the function 1/4 of the times, receiving 32 packets each time, than calling it too frequently and only receive 8 packets each time.

> 
> > However, the business does not scale linearly with the number of
> packets processed - which an intended benefit of bursting.
> >
> 
> Sure, there's usually a non-linear relationship between the system
> capacity used and the resulting CPU utilization. It can be both in the
> manner you describe below, with the per-packet processing latency
> reduced at higher rates, or the other way around. For example, NIC RX
> LLC stashing may cause a lot of LLC evictions, and generally the
> application might have a larger working set size during high load, so
> there may be forces working in the other direction as well.
> 
> It seems to me "busyness" telemetry value should just be lcore thread
> CPU utilization (in total, or with some per-module breakdown).

We probably agree that if we cannot discriminate between useless and useful work, then the utilization will always be 100 % minus some cycles known by the library to be non-productive overhead.

I prefer a scale from 0 to 100 % useful, rather than a Boolean discriminating between useless and useful cycles spent in a function.

> If you
> want to know how much of the system's capacity is used, you need help
> from an application-specific agent, equipped with a model of how CPU
> utilization and capacity relates. Such a heuristic could take other
> factors into account as well, e.g. the average queue sizes, packet
> rates, packet sizes etc.

That is exactly what I'm exploring if this library can approximate in some simple way.

> 
> In my experience, for high touch applications (i.e., those that spends
> thousands of cycles per packet), CPU utilization is a pretty decent
> approximation on how much of the system's capacity is used.

I agree.

We can probably also agree that the number of cycles spent roughly follows a formula like: A + B * x, where B is the number of cycles spent to handle the burst itself, and C is the number of cycles it takes to handle an additional packet (or other unit of work) in the burst.

In high touch services, A might be relatively insignificant; but in other applications, A is very significant.

Going back to my example, and using my profiler numbers from below...

If 679 cycles are spent in every RX function call to receive 8 packets, and the function is called 4 times per time unit, it is 679 * 4 = 2.716 cycles spent per time unit.

If the library can tell us that the function is only doing 25 % useful work, we can instead call the function 1 time per time unit. We will still get the same 32 packets per time unit, but only spend 1.275 cycles per time unit.

This is one of the things I am hoping for the library to help the application achieve. (By feeding the library with information about the work done.)

> 
> > Here are some real life numbers from our in-house profiler library in
> a production environment, which says that polling the NIC for packets
> takes on average:
> >
> > 104 cycles when the NIC returns 0 packets,
> > 529 cycles when the NIC returns 1 packet,
> > 679 cycles when the NIC returns 8 packets, and
> > 1275 cycles when the NIC returns a full burst of 32 packets.
> >
> > (This includes some overhead from our application, so you will see
> other numbers in your application.)
> >
> >>
> >> It seems to me that you basically have two options, if you do
> >> application-level "busyness" reporting.
> >>
> >> Either the application
> >> a) reports when a section of useful work begins, and when it ends,
> as
> >> two separate function calls.
> >> b) after having taken a time stamp, and having completed a section
> of
> >> code which turned out to be something useful, it reports back to the
> >> busyness module with one function call, containing the busy cycles
> >> spent.
> >>
> >> In a), the two calls could be to the same function, with a boolean
> >> argument informing the busyness module if this is the beginning of a
> >> busy or an idle period. In such case, just pass "num_pkts_dequeued >
> 0"
> >> to the call.
> >
> > Our profiler library has a start()and an end() function, and an
> end_and_start() function for when a section directly follows the
> preceding section (to only take one timestamp instead of two).
> >
> 
> I like the idea of a end_and_start() (except for the name, maybe).
> 
> >>
> >> What you would like is a solution which avoid ping-pong between idle
> >> and
> >> busy states (with the resulting time stamping and computations) in
> >> scenarios where a lcore thread mix sources of work which often have
> >> items available, with sources that do not (e.g., packets in a RX
> queue
> >> versus reassembly timeouts in a core-local timer wheel). It would be
> >> better in that situation, to attribute the timer wheel poll cycles
> as
> >> busy cycles.
> >>
> >> Another crucial aspect is that you want the API to be simple, and
> code
> >> changes to be minimal.
> >>
> >> It's unclear to me if you need to account for both idle and busy
> >> cycles,
> >> or only busy cycles, and assume all other cycles are idle. The will
> be
> >> for a traditional 1:1 EAL thread <-> CPU core mapping, but not if
> the
> >> "--lcores" parameter is used to create floating EAL threads, and EAL
> >> threads which share the same core, and thus may not be able to use
> 100%
> >> of the TSC cycles.
> >>
> >>> A pipeline stage that services a QoS scheduler could additionally
> >> feed the library with values such as:
> >>>    - "100% utilization, definitely more work to do": stopped
> >> processing due to some "max work per call" limitation.
> >>>    - "waiting, no work until [DELAY] ns": current timeslot has been
> >> filled, waiting for the next timeslot to start.
> >>>
> >>> It is important to note that any pipeline stage processing packets
> >> (or some other objects!) might process a different maximum number of
> >> objects than the ingress pipeline stage. What I mean is: The number
> N
> >> might not be the same for all pipeline stages.
> >>>
> >>>
> >>> The information should be collected per lcore or thread, also to
> >> prevent cache trashing.
> >>>
> >>> Additionally, it could be collected per pipeline stage too, making
> >> the collection two-dimensional. This would essentially make it a
> >> profiling library, where you - in addition to seeing how much time
> is
> >> spent working - also can see which work the time is spent on.
> >>>
> >>
> >> If you introduce subcategories of "busy", like "busy-with-X", and
> >> "busy-with-Y", the book keeping will be more expensive, since you
> will
> >> transit between states even for 100% busy lcores (which in principle
> >> you
> >> never, or at least very rarely, need to do if you have only busy and
> >> idle as states).
> >>
> >> If your application is organized as DPDK services, you will get this
> >> already today, on a DPDK service level.
> >>
> >> If you have your application organized as a pipeline, and you use an
> >> event device as a scheduler between the stages, that event device
> has a
> >> good opportunity to do this kind of bookkeeping. DSW, for example,
> >> keeps
> >> track of the average processing latency for events, and how many
> events
> >> of various types have been processed.
> >>
> >
> > Lots of good input, Mattias. Let's see what others suggest. :-)
> >
> >>> As mentioned during the previous discussions, APIs should be
> provided
> >> to make the collected information machine readable, so the
> application
> >> can use it for power management and other purposes.
> >>>
> >>> One of the simple things I would like to be able to extract from
> such
> >> a library is CPU Utilization (percentage) per lcore. >
> >>> And since I want the CPU Utilization to be shown for multiple the
> >> time intervals (usually 1, 5 or 15 minutes; but perhaps also 1
> second
> >> or 1 millisecond) the output data should be exposed as a counter
> type,
> >> so my "loadavg application" can calculate the rate by subtracting
> the
> >> previously obtained value from the current value and divide the
> >> difference by the time interval.
> >>>
> >>
> >> I agree. In addition, you also want the "raw data" (lcore busy
> cycles)
> >> so you can do you own sampling, at your own favorite-length
> intervals.
> >>
> >>> -Morten
> >>>
> >


^ permalink raw reply	[flat|nested] 87+ messages in thread

end of thread, other threads:[~2022-10-12 12:25 UTC | newest]

Thread overview: 87+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-15 13:12 [PATCH v1 1/2] eal: add lcore busyness telemetry Anatoly Burakov
2022-07-15 13:12 ` [PATCH v1 2/2] eal: add cpuset lcore telemetry entries Anatoly Burakov
2022-07-15 13:35 ` [PATCH v1 1/2] eal: add lcore busyness telemetry Burakov, Anatoly
2022-07-15 13:46 ` Jerin Jacob
2022-07-15 14:11   ` Bruce Richardson
2022-07-15 14:18   ` Burakov, Anatoly
2022-07-15 22:13 ` Morten Brørup
2022-07-16 14:38   ` Thomas Monjalon
2022-07-17  3:10   ` Honnappa Nagarahalli
2022-07-17  9:56     ` Morten Brørup
2022-07-18  9:43       ` Burakov, Anatoly
2022-07-18 10:59         ` Morten Brørup
2022-07-19 12:20           ` Thomas Monjalon
2022-07-18 15:46         ` Stephen Hemminger
2022-08-24 16:24 ` [PATCH v2 0/3] Add lcore poll " Kevin Laatz
2022-08-24 16:24   ` [PATCH v2 1/3] eal: add " Kevin Laatz
2022-08-24 16:24   ` [PATCH v2 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
2022-08-24 16:24   ` [PATCH v2 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
2022-08-25  7:47   ` [PATCH v2 0/3] Add lcore poll busyness telemetry Morten Brørup
2022-08-25 10:53     ` Kevin Laatz
2022-08-25 15:28 ` [PATCH v3 " Kevin Laatz
2022-08-25 15:28   ` [PATCH v3 1/3] eal: add " Kevin Laatz
2022-08-26  7:05     ` Jerin Jacob
2022-08-26  8:07       ` Bruce Richardson
2022-08-26  8:16         ` Jerin Jacob
2022-08-26  8:29           ` Morten Brørup
2022-08-26 15:27             ` Kevin Laatz
2022-08-26 15:46               ` Morten Brørup
2022-08-29 10:41                 ` Bruce Richardson
2022-08-29 10:53                   ` Thomas Monjalon
2022-08-29 12:36                     ` Kevin Laatz
2022-08-29 12:49                       ` Morten Brørup
2022-08-29 13:37                         ` Kevin Laatz
2022-08-29 13:44                           ` Morten Brørup
2022-08-29 14:21                             ` Kevin Laatz
2022-08-29 11:22                   ` Morten Brørup
2022-08-26 22:06     ` Mattias Rönnblom
2022-08-29  8:23       ` Bruce Richardson
2022-08-29 13:16       ` Kevin Laatz
2022-08-30 10:26       ` Kevin Laatz
2022-08-25 15:28   ` [PATCH v3 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
2022-08-25 15:28   ` [PATCH v3 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
2022-09-01 14:39 ` [PATCH v4 0/3] Add lcore poll busyness telemetry Kevin Laatz
2022-09-01 14:39   ` [PATCH v4 1/3] eal: add " Kevin Laatz
2022-09-01 14:39   ` [PATCH v4 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
2022-09-01 14:39   ` [PATCH v4 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
2022-09-02 15:58 ` [PATCH v5 0/3] Add lcore poll busyness telemetry Kevin Laatz
2022-09-02 15:58   ` [PATCH v5 1/3] eal: add " Kevin Laatz
2022-09-03 13:33     ` Jerin Jacob
2022-09-06  9:37       ` Kevin Laatz
2022-09-02 15:58   ` [PATCH v5 2/3] eal: add cpuset lcore telemetry entries Kevin Laatz
2022-09-02 15:58   ` [PATCH v5 3/3] doc: add howto guide for lcore poll busyness Kevin Laatz
2022-09-13 13:19 ` [PATCH v6 0/4] Add lcore poll busyness telemetry Kevin Laatz
2022-09-13 13:19   ` [PATCH v6 1/4] eal: add " Kevin Laatz
2022-09-13 13:48     ` Morten Brørup
2022-09-13 13:19   ` [PATCH v6 2/4] eal: add cpuset lcore telemetry entries Kevin Laatz
2022-09-13 13:19   ` [PATCH v6 3/4] app/test: add unit tests for lcore poll busyness Kevin Laatz
2022-09-13 13:19   ` [PATCH v6 4/4] doc: add howto guide " Kevin Laatz
2022-09-14  9:29 ` [PATCH v7 0/4] Add lcore poll busyness telemetry Kevin Laatz
2022-09-14  9:29   ` [PATCH v7 1/4] eal: add " Kevin Laatz
2022-09-14 14:30     ` Stephen Hemminger
2022-09-16 12:35       ` Kevin Laatz
2022-09-19 10:19     ` Konstantin Ananyev
2022-09-22 17:14       ` Kevin Laatz
2022-09-26  9:37         ` Konstantin Ananyev
2022-09-29 12:41           ` Kevin Laatz
2022-09-30 12:32             ` Jerin Jacob
2022-10-01 14:17             ` Konstantin Ananyev
2022-10-03 20:02               ` Mattias Rönnblom
2022-10-04  9:15                 ` Morten Brørup
2022-10-04 11:57                   ` Bruce Richardson
2022-10-04 14:26                     ` Mattias Rönnblom
2022-10-04 23:30                     ` Konstantin Ananyev
2022-09-30 22:13     ` Mattias Rönnblom
2022-09-14  9:29   ` [PATCH v7 2/4] eal: add cpuset lcore telemetry entries Kevin Laatz
2022-09-14  9:29   ` [PATCH v7 3/4] app/test: add unit tests for lcore poll busyness Kevin Laatz
2022-09-30 22:20     ` Mattias Rönnblom
2022-09-14  9:29   ` [PATCH v7 4/4] doc: add howto guide " Kevin Laatz
2022-09-14 14:33   ` [PATCH v7 0/4] Add lcore poll busyness telemetry Stephen Hemminger
2022-09-16 12:35     ` Kevin Laatz
2022-09-16 14:10       ` Kevin Laatz
2022-10-05 13:44   ` Kevin Laatz
2022-10-06 13:25     ` Morten Brørup
2022-10-06 15:26       ` Mattias Rönnblom
2022-10-10 15:22         ` Morten Brørup
2022-10-10 17:38           ` Mattias Rönnblom
2022-10-12 12:25             ` Morten Brørup

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).