DPDK patches and discussions
* [PATCH 0/4] add support for self monitoring
@ 2022-11-11  9:43 Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                   ` (4 more replies)
  0 siblings, 5 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski; +Cc: thomas, jerinj

This series adds self-monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter causes events to be
stored in a CTF file.

By design, all enabled events are grouped together and the same group
is attached to every lcore that uses the self-monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
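
As a rough illustration of how a named event can end up as raw config
values: an event file under .../events holds comma-separated term=value
pairs, and each term has a matching file under .../format that names the
target config word and bit range. The sketch below is a standalone
approximation of that idea, not the code from this series. The format
table is hard-coded instead of being read from sysfs, its bit layout is
an assumption made for the example, and the names term_format,
lookup_format, bit_range_mask and encode_event are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the per-term files under .../PMU/format/. */
struct term_format {
	const char *term;
	int word;      /* 0 = config, 1 = config1, 2 = config2 */
	int low, high; /* inclusive bit range inside that word */
};

/* Assumed layout, for illustration only. */
static const struct term_format formats[] = {
	{ "event", 0, 0, 7 },
	{ "umask", 0, 8, 15 },
	{ "edge", 0, 18, 18 },
};

static const struct term_format *
lookup_format(const char *term)
{
	size_t i;

	for (i = 0; i < sizeof(formats) / sizeof(formats[0]); i++)
		if (strcmp(formats[i].term, term) == 0)
			return &formats[i];

	return NULL;
}

/* Mask with bits low..high set; requires 0 <= low <= high <= 63. */
static uint64_t
bit_range_mask(int high, int low)
{
	return (~0ULL >> (63 - high)) & (~0ULL << low);
}

/* Encode "term=value,term,..." into config[3]. Returns 0 or -1. */
static int
encode_event(const char *desc, uint64_t config[3])
{
	char buf[256];
	char *term, *next, *eq;
	const struct term_format *fmt;
	uint64_t val;

	if (strlen(desc) >= sizeof(buf))
		return -1;
	strcpy(buf, desc);
	/* sysfs files usually end with a newline */
	buf[strcspn(buf, "\n")] = '\0';

	config[0] = config[1] = config[2] = 0;

	for (term = buf; term && *term; term = next) {
		next = strchr(term, ',');
		if (next)
			*next++ = '\0';

		eq = strchr(term, '=');
		if (eq) {
			*eq = '\0';
			val = strtoull(eq + 1, NULL, 0);
		} else {
			/* a bare term means "set this field to 1" */
			val = 1;
		}

		fmt = lookup_format(term);
		if (!fmt)
			return -1;

		config[fmt->word] |= (val << fmt->low) &
				     bit_range_mask(fmt->high, fmt->low);
	}

	return 0;
}
```

With the assumed table, encode_event("event=0x3c,umask=0x01", config)
leaves 0x13c in config[0]. An unknown term is rejected, which mirrors
why only names listed under sysfs can be enabled.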

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  37 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 103 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 518 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   4 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  32 ++
 20 files changed, 1067 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1



* [PATCH 1/4] eal: add generic support for reading PMU events
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-12-15  8:33   ` Mattias Rönnblom
  2022-11-11  9:43 ` [PATCH 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski; +Cc: thomas, jerinj

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.
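
To make the user-space read path a bit more concrete: a raw hardware
counter read is truncated to the counter width advertised in the mmapped
perf page and then added to the offset kept there. The helper below is a
simplified, standalone sketch of just that arithmetic. The name
normalize_counter and its parameters are chosen for this example only,
it zero-extends with unsigned shifts as the read helper in this patch
does, and it leaves out the retry loop around the page's sequence lock.

```c
#include <stdint.h>

/*
 * Keep only the low `width` bits of a raw counter read, then add the offset.
 * Assumes 1 <= width <= 64; width == 0 would shift by 64, which is undefined.
 */
static uint64_t
normalize_counter(uint64_t raw, unsigned int width, uint64_t offset)
{
	unsigned int shift = 64 - width;

	/* shift the unused upper bits out, then back in as zeros */
	raw <<= shift;
	raw >>= shift;

	return raw + offset;
}
```

For a 48-bit counter, any garbage in the upper 16 bits of the raw read is
dropped before the offset is applied.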

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 455 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   3 +
 10 files changed, 761 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index bd6700ef85..8fc1b20cab 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being one example. However, in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..7d3bd57d1d
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,455 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_malloc.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf), fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	rte_free(group->mmap_pages);
+	rte_free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int ret;
+
+	if (pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->fds), 0);
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->mmap_pages), 0);
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		pmu->name = strdup(name);
+		if (!pmu->name)
+			return -ENOMEM;
+	}
+
+	return pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = rte_zmalloc(NULL, sizeof(*event), 0);
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		rte_free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
+	if (!pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(pmu->name);
+	rte_free(pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&pmu->event_list, event, next);
+		free(event->name);
+		rte_free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(pmu->name);
+	rte_free(pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..5955c22779
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!pmu)
+		return 0;
+
+	group = &pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..e870c87493 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -432,6 +432,8 @@ EXPERIMENTAL {
 	rte_thread_set_priority;
 
 	# added in 22.11
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 	rte_thread_attr_get_affinity;
 	rte_thread_attr_init;
 	rte_thread_attr_set_affinity;
@@ -483,4 +485,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski, Ruifeng Wang; +Cc: thomas, jerinj

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  37 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 103 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 152 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..5efc851cb8
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..6c50a1b3c4
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 5955c22779..67b1194a2a 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski, Bruce Richardson, Konstantin Ananyev; +Cc: thomas, jerinj

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 32 +++++++++++++++++++++++++++++++
 4 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 67b1194a2a..bbe12d100d 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..6ecb27a1eb
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH 4/4] eal: add PMU support to tracing library
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
                   ` (2 preceding siblings ...)
  2022-11-11  9:43 ` [PATCH 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.
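
For illustration, the e=ev1[,ev2,...] expression passed via --trace could be
extracted roughly as shown here. This is a standalone sketch using only POSIX
regex.h, not the code from this patch: extract_events() is a hypothetical
helper name, and only the regular expression is taken from the diff.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of DPDK): copy the first "e=..." event list
 * found in arg into out, without the "e=" prefix.
 * Returns 0 on success, -1 if nothing matched or on error.
 */
static int
extract_events(const char *arg, char *out, size_t outsz)
{
	regmatch_t m;
	regex_t reg;
	size_t len;

	if (outsz == 0)
		return -1;

	/* same expression as used by the patch */
	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
		return -1;

	if (regexec(&reg, arg, 1, &m, 0)) {
		regfree(&reg);
		return -1;
	}
	regfree(&reg);

	/* skip the "e=" prefix and leave room for the terminator */
	len = (size_t)(m.rm_eo - m.rm_so) - 2;
	if (len > outsz - 1)
		len = outsz - 1;

	memcpy(out, arg + m.rm_so + 2, len);
	out[len] = '\0';

	return 0;
}
```

Called on "*pmu.read|e=cpu_cycles,l1d_cache" the helper fills out with the
comma separated event list, which can then be split with strtok() and fed to
rte_pmu_add_event() one name at a time.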

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 8fc1b20cab..977800ea01 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+In contrast to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 7d3bd57d1d..40c454f92a 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -18,6 +18,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -402,11 +403,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so - 2;
+		if (num > sizeof(buf) - 1)
+			num = sizeof(buf) - 1;
+
+		/* skip e= pattern prefix and keep room for the terminator */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num);
+		buf[num] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
 	if (!pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -428,6 +488,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e870c87493..d6ec3f3b0e 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -432,6 +432,7 @@ EXPERIMENTAL {
 	rte_thread_set_priority;
 
 	# added in 22.11
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
 	rte_thread_attr_get_affinity;
-- 
2.25.1



* [PATCH v2 0/4] add support for self monitoring
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
                   ` (3 preceding siblings ...)
  2022-11-11  9:43 ` [PATCH 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2022-11-21 12:11 ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                     ` (4 more replies)
  4 siblings, 5 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs on
isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use the self monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
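
To see which names are accepted on a given machine, the events directory can
simply be listed. The sketch below is only an illustration and is not part of
the series: list_events() is a hypothetical helper, and the directory passed
to it (for example /sys/bus/event_source/devices/cpu/events on many x86
systems) is an assumption that depends on the platform.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of DPDK): print every entry found in an
 * events directory. Returns the number of entries or -errno on failure.
 */
static int
list_events(const char *dir)
{
	struct dirent *dent;
	int num = 0;
	DIR *dirp;

	dirp = opendir(dir);
	if (!dirp)
		return -errno;

	while ((dent = readdir(dirp))) {
		/* skip ".", ".." and hidden entries */
		if (dent->d_name[0] == '.')
			continue;

		printf("%s\n", dent->d_name);
		num++;
	}

	closedir(dirp);

	return num;
}
```

Any name printed this way is a candidate argument for rte_pmu_add_event().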

v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 103 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 519 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   6 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1073 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1



* [PATCH v2 1/4] eal: add generic support for reading PMU events
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.
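
For reference, the arithmetic applied to a raw counter value on the read path
can be sketched in isolation. The helper name pmu_counter_value() is
hypothetical; the shifts mirror what rte_pmu_read_userpage() in this patch
does with pmc_width and offset taken from the mmapped perf page, and width is
assumed to be in the range 1..64.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of DPDK): truncate a raw hardware counter
 * to its implemented width and add the offset accumulated by the kernel.
 * Assumes 1 <= width <= 64; other values would make the shifts undefined.
 */
static uint64_t
pmu_counter_value(uint64_t raw, unsigned int width, uint64_t offset)
{
	/* drop bits above the implemented counter width */
	raw <<= 64 - width;
	raw >>= 64 - width;

	return raw + offset;
}
```

With a 48-bit counter anything stored in the upper 16 bits of the raw value is
discarded before the offset is added.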

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   5 +
 10 files changed, 764 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index bd6700ef85..8fc1b20cab 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. However, in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..dc169fb2cf
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,456 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_malloc.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *pmu;
+
+/*
+ * The following __rte_weak functions provide default no-op implementations. Architectures
+ * should override them if necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	rte_free(group->mmap_pages);
+	rte_free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int ret;
+
+	if (pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->fds), 0);
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->mmap_pages), 0);
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		pmu->name = strdup(name);
+		if (!pmu->name)
+			return -ENOMEM;
+	}
+
+	return pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = rte_calloc(NULL, 1, sizeof(*event), 0);
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		rte_free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
+	if (!pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(pmu->name);
+	rte_free(pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&pmu->event_list, event, next);
+		free(event->name);
+		rte_free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(pmu->name);
+	rte_free(pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..5955c22779
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!pmu)
+		return 0;
+
+	group = &pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..1ebd842f34 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,10 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +487,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH v2 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev, Ruifeng Wang; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
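The arch hooks in this patch save the current value of perf_user_access,
set it to 1, and write the saved value back on cleanup. A minimal standalone
sketch of the two file helpers involved is shown here; the names
attr_read_int() and attr_write_int() are illustrative and not part of the
patch. It assumes the file holds a short decimal integer, terminates the
buffer explicitly because read() does not, and saves errno before close()
can overwrite it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a decimal integer from a small text file such as a procfs knob. */
static int
attr_read_int(const char *path, int *val)
{
	char buf[32];
	ssize_t n;
	int fd, err;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	/* leave room for the terminator, read() does not add one */
	n = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* saved before close() can overwrite it */
	close(fd);
	if (n == -1)
		return -err;

	buf[n] = '\0';
	*val = (int)strtol(buf, NULL, 10);

	return 0;
}

/* Write a decimal integer to an existing text file. */
static int
attr_write_int(const char *path, int val)
{
	char buf[32];
	ssize_t n;
	int fd, len, err;

	fd = open(path, O_WRONLY);
	if (fd == -1)
		return -errno;

	len = snprintf(buf, sizeof(buf), "%d", val);
	n = write(fd, buf, len);
	err = errno;
	close(fd);
	if (n == -1)
		return -err;

	return 0;
}
```

A caller following the pattern of pmu_arch_init()/pmu_arch_fini() would read
the old value once, write 1, and later write the old value back.
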
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 103 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 154 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..10e2984813
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..6c50a1b3c4
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+	buf[ret] = '\0';
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 5955c22779..67b1194a2a 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH v2 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 67b1194a2a..bbe12d100d 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..a2cd849fb1
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH v2 4/4] eal: add PMU support to tracing library
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
                     ` (2 preceding siblings ...)
  2022-11-21 12:11   ` [PATCH v2 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas, Tomasz Duszynski

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
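Events for the tracepoint are taken from every e=ev1[,ev2,...] occurrence
in the --trace argument. A minimal standalone sketch of that extraction
follows. The function collect_events() and the MAX_EVENTS/MAX_NAME limits
are illustrative only and not part of the patch. It uses the same regular
expression as add_events_by_pattern() and terminates the copied list at its
own length, i.e. the match length minus the two characters of the "e="
prefix.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 8
#define MAX_NAME 64

/*
 * Collect event names from every "e=ev1[,ev2,...]" occurrence in pattern.
 * Returns the number of names stored, or -1 if the regex does not compile.
 */
static int
collect_events(const char *pattern, char names[MAX_EVENTS][MAX_NAME])
{
	regmatch_t m;
	regex_t reg;
	char buf[256];
	char *tok;
	size_t len;
	int num = 0;

	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
		return -1;

	while (regexec(&reg, pattern, 1, &m, 0) == 0) {
		/* length of the list only, i.e. without the "e=" prefix */
		len = (size_t)(m.rm_eo - m.rm_so) - 2;
		if (len > sizeof(buf) - 1)
			len = sizeof(buf) - 1;

		memcpy(buf, pattern + m.rm_so + 2, len);
		buf[len] = '\0';

		/* strtok() is not reentrant, which is fine for this sketch */
		for (tok = strtok(buf, ","); tok && num < MAX_EVENTS;
		     tok = strtok(NULL, ","))
			snprintf(names[num++], MAX_NAME, "%s", tok);

		/* continue searching after the current match */
		pattern += m.rm_eo;
	}

	regfree(&reg);

	return num;
}
```

The position of a name in the resulting array matches the index later
passed to rte_eal_trace_pmu_read(), assuming every listed event is added
successfully and in order.
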
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 8fc1b20cab..977800ea01 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index dc169fb2cf..6a417f74a9 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -19,6 +19,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -403,11 +404,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
 	if (!pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -429,6 +489,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1ebd842f34..b49a430c84 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
-- 
2.25.1



* [PATCH v3 0/4] add support for self monitoring
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
                     ` (3 preceding siblings ...)
  2022-11-21 12:11   ` [PATCH v2 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2022-11-29  9:28   ` Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                       ` (5 more replies)
  4 siblings, 6 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs
on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by names, which need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.

v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 104 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 520 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   7 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1076 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1



* [PATCH v3 1/4] eal: add generic support for reading PMU events
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-30  8:32       ` zhoumin
  2022-11-29  9:28     ` [PATCH v3 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
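An event is translated into perf config words in two steps: each term's
format file gives the config word and bit range, and the term's value is
then shifted into that range. A minimal standalone sketch of this follows.
GENMASK_ULL and FIELD_PREP are copied from rte_pmu.c in this patch, while
parse_term_format() is an illustrative name that works on a string instead
of a sysfs file. __builtin_ffsll is a GCC/Clang builtin, so one of those
compilers is assumed.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> (64 - 1 - (h))))
#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))

/*
 * Parse one format description such as "config:0-7" or "config1:3".
 * On success stores the config word index (0..2) in *num and the bit
 * mask in *mask and returns 0, otherwise returns -1.
 */
static int
parse_term_format(const char *desc, int *num, uint64_t *mask)
{
	char word[16];
	int high, low, ret;
	size_t len;

	ret = sscanf(desc, "%15[^:]:%d-%d", word, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single-bit field */
	if (low < 0 || high > 63 || low > high)
		return -1;

	*mask = GENMASK_ULL(high, low);
	/* "config" means word 0, a trailing digit selects word 1 or 2 */
	len = strlen(word);
	*num = isdigit((unsigned char)word[len - 1]) ? word[len - 1] - '0' : 0;
	if (*num > 2)
		return -1;

	return 0;
}
```

For example, a term with value 0x11 and format "config1:8-15" yields mask
0xff00 for config word 1, and FIELD_PREP(mask, 0x11) evaluates to 0x1100.
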
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 457 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   6 +
 10 files changed, 766 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index bd6700ef85..8fc1b20cab 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..6763005903
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,457 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_malloc.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf), fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	rte_free(group->mmap_pages);
+	rte_free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int ret;
+
+	if (rte_pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->fds), 0);
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->mmap_pages), 0);
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	/* d_name is not valid after closedir(), so copy it first */
+	if (dent)
+		rte_pmu->name = strdup(name);
+
+	closedir(dirp);
+	if (dent && !rte_pmu->name)
+		return -ENOMEM;
+
+	return rte_pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = rte_calloc(NULL, 1, sizeof(*event), 0);
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		rte_free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	rte_pmu = rte_calloc(NULL, 1, sizeof(*rte_pmu), RTE_CACHE_LINE_SIZE);
+	if (!rte_pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&rte_pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(rte_pmu->name);
+	rte_free(rte_pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
+		free(event->name);
+		rte_free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(rte_pmu->name);
+	rte_free(rte_pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..e4b4f6b052
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines the generic API and types necessary to set up the PMU and
+ * read selected counters at runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from the register. On error or lack of support
+ *   0 is returned. In other words, a stream of zeros in a trace file
+ *   indicates a problem with reading a particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!rte_pmu)
+		return 0;
+
+	group = &rte_pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= rte_pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..9225f46f67 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,11 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu; # WINDOWS_NO_EXPORT
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +488,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH v3 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev, Ruifeng Wang; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
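Note: pmu_arch_init() and pmu_arch_fini() below follow a save/enable/restore
pattern on /proc/sys/kernel/perf_user_access. A minimal, self-contained sketch
of that pattern is given here for reference. It is not the code from this
patch: attr_read_int(), attr_write_int(), uaccess_enable() and
uaccess_restore() are hypothetical helper names, and the path is a parameter
so the sketch can be tried on a scratch file instead of the real knob.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a decimal integer from a small text file such as a procfs knob. */
static int
attr_read_int(const char *path, int *val)
{
	char buf[32];
	ssize_t len;
	int fd, err;

	assert(path != NULL && val != NULL);

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	len = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* save before close() can overwrite it */
	close(fd);
	if (len == -1)
		return -err;

	buf[len] = '\0'; /* read() does not terminate the buffer */
	*val = (int)strtol(buf, NULL, 10);

	return 0;
}

/* Write a decimal integer to the file, replacing its previous content. */
static int
attr_write_int(const char *path, int val)
{
	char buf[32];
	ssize_t ret;
	int fd, num, err;

	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd == -1)
		return -errno;

	num = snprintf(buf, sizeof(buf), "%d", val);
	ret = write(fd, buf, (size_t)num);
	err = errno;
	close(fd);

	return ret == -1 ? -err : 0;
}

/* Remember the current setting in *saved, then switch the knob to 1. */
static int
uaccess_enable(const char *path, int *saved)
{
	int ret = attr_read_int(path, saved);

	return ret ? ret : attr_write_int(path, 1);
}

/* Put back whatever value was found at init time. */
static int
uaccess_restore(const char *path, int saved)
{
	return attr_write_int(path, saved);
}
```

Taking the path as a parameter is only a convenience of the sketch; the patch
itself hard-codes PERF_USER_ACCESS_PATH.
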
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 104 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 155 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..10e2984813
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..10ec770ead
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		close(fd);
+		return -errno;
+	}
+
+	buf[ret] = '\0';
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index e4b4f6b052..158a616b83 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH v3 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 4/4] eal: add PMU support to tracing library Tomasz Duszynski
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
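Note: the value returned by rte_pmu_pmc_read() is raw. rte_pmu_read_userpage()
from patch 1/4 truncates it to pmc_width bits and adds the offset published in
the mmapped page. A minimal sketch of that arithmetic follows, assuming the
unsigned shifts used in this series. pmc_combine() and pmc_to_count() are
hypothetical names, and no rdpmc instruction is executed, so the sketch builds
on any architecture.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the EDX:EAX halves returned by rdpmc into one 64-bit value. */
static inline uint64_t
pmc_combine(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}

/*
 * Keep only the low `width` bits of a raw counter value and add the
 * kernel-provided offset, mirroring the generic read path.
 */
static inline uint64_t
pmc_to_count(uint64_t raw, unsigned int width, uint64_t offset)
{
	/* shifting a 64-bit value by 64 would be undefined behaviour */
	assert(width >= 1 && width <= 64);

	raw <<= 64 - width;
	raw >>= 64 - width;

	return raw + offset;
}
```

With a 48-bit counter, for example, any bits above bit 47 in the raw value are
discarded before the offset is added.
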
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 158a616b83..3d90f4baf7 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..a2cd849fb1
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH v3 4/4] eal: add PMU support to tracing library
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
                       ` (2 preceding siblings ...)
  2022-11-29  9:28     ` [PATCH v3 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-29 10:42     ` [PATCH v3 0/4] add support for self monitoring Morten Brørup
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas, Tomasz Duszynski

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
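Note: the e=ev1[,ev2,...] matching done by add_events_by_pattern() can be
exercised outside of EAL. The following is a minimal sketch, not the code from
this patch: it reuses the same regular expression, while collect_events(),
MAX_EVENTS and MAX_NAME are hypothetical names made up for illustration, and
names are stored in a caller-provided array instead of being passed to
rte_pmu_add_event().

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 8
#define MAX_NAME 64

/*
 * Collect event names from every e=ev1[,ev2,...] occurrence in arg.
 * Returns the number of names stored in out, or -1 if the regex fails
 * to compile.
 */
static int
collect_events(const char *arg, char out[MAX_EVENTS][MAX_NAME])
{
	regmatch_t m;
	regex_t reg;
	char buf[256];
	char *tok;
	size_t len;
	int n = 0;

	assert(arg != NULL);

	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
		return -1;

	while (n < MAX_EVENTS && regexec(&reg, arg, 1, &m, 0) == 0) {
		/* length of the list without the two-character "e=" prefix */
		len = (size_t)(m.rm_eo - m.rm_so) - 2;
		if (len > sizeof(buf) - 1)
			len = sizeof(buf) - 1;

		memcpy(buf, arg + m.rm_so + 2, len);
		buf[len] = '\0'; /* terminate right after the copied bytes */

		for (tok = strtok(buf, ","); tok && n < MAX_EVENTS;
		     tok = strtok(NULL, ",")) {
			snprintf(out[n], MAX_NAME, "%s", tok);
			n++;
		}

		/* continue searching after the current match */
		arg += m.rm_eo;
	}

	regfree(&reg);

	return n;
}
```

The terminator is written at the copied length, so the buffer never contains
bytes that were not part of the match.
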
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 8fc1b20cab..977800ea01 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support for reading PMU events on ARM64 and x86 (Intel).
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 6763005903..db8f6f43c3 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -20,6 +20,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -404,11 +405,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	rte_pmu = rte_calloc(NULL, 1, sizeof(*rte_pmu), RTE_CACHE_LINE_SIZE);
 	if (!rte_pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -430,6 +490,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(rte_pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 9225f46f67..73803f9601 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
-- 
2.25.1



* RE: [PATCH v3 0/4] add support for self monitoring
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
                       ` (3 preceding siblings ...)
  2022-11-29  9:28     ` [PATCH v3 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2022-11-29 10:42     ` Morten Brørup
  2022-12-13  8:23       ` Tomasz Duszynski
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  5 siblings, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2022-11-29 10:42 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Tuesday, 29 November 2022 10.28
> 
> This series adds self monitoring support i.e allows to configure and
> read performance measurement unit (PMU) counters in runtime without
> using perf utility. This has certain advantages when application runs
> on isolated cores with nohz_full kernel parameter.
> 
> Events can be read directly using rte_pmu_read() or using dedicated
> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> stored inside CTF file.
> 
> By design, all enabled events are grouped together and the same group
> is attached to lcores that use self monitoring functionality.
> 
> Events are enabled by names, which need to be read from standard
> location under sysfs i.e
> 
> /sys/bus/event_source/devices/PMU/events
> 
> where PMU is a core pmu i.e one measuring cpu events. As of today
> raw events are not supported.

Hi Tomasz,

I am very interested in this patch series for fast path profiling purposes. (Not using EAL trace, but our proprietary profiler.)

However, it seems that rte_pmu_read() is quite long-winded compared to rte_pmu_pmc_read().

But perhaps I am just worrying too much, so I will ask: What is the performance cost of using rte_pmu_read() - compared to rte_pmu_pmc_read() - in the fast path?

If there is a non-negligible difference, could you please provide an example of how to configure PMU events and use rte_pmu_pmc_read() in an application?

I would primarily be interested in data cache misses and branch mispredictions. But feel free to make your own choices for the example.



* Re: [PATCH v3 1/4] eal: add generic support for reading PMU events
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-30  8:32       ` zhoumin
  2022-12-13  8:05         ` [EXT] " Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: zhoumin @ 2022-11-30  8:32 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj

Hi Tomasz,

On Tue, Nov 29, 2022 at 5:28 PM, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
>
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
>
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>   app/test/meson.build                  |   1 +
>   app/test/test_pmu.c                   |  41 +++
>   doc/guides/prog_guide/profile_app.rst |   8 +
>   lib/eal/common/meson.build            |   3 +
>   lib/eal/common/pmu_private.h          |  41 +++
>   lib/eal/common/rte_pmu.c              | 457 ++++++++++++++++++++++++++
>   lib/eal/include/meson.build           |   1 +
>   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>   lib/eal/linux/eal.c                   |   4 +
>   lib/eal/version.map                   |   6 +
>   10 files changed, 766 insertions(+)
>   create mode 100644 app/test/test_pmu.c
>   create mode 100644 lib/eal/common/pmu_private.h
>   create mode 100644 lib/eal/common/rte_pmu.c
>   create mode 100644 lib/eal/include/rte_pmu.h
>
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..93b3300309 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -143,6 +143,7 @@ test_sources = files(
>           'test_timer_racecond.c',
>           'test_timer_secondary.c',
>           'test_ticketlock.c',
> +        'test_pmu.c',
>           'test_trace.c',
>           'test_trace_register.c',
>           'test_trace_perf.c',
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..fd331af9ee
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <rte_pmu.h>
> +
> +#include "test.h"
> +
> +static int
> +test_pmu_read(void)
> +{
> +	uint64_t val = 0;
> +	int tries = 10;
> +	int event = -1;
> +
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	if (val == 0)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index bd6700ef85..8fc1b20cab 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,14 @@ Profile Your Application
>   The following sections describe methods of profiling DPDK applications on
>   different architectures.
>   
> +Performance counter based profiling
> +-----------------------------------
> +
> +The majority of architectures support some sort of hardware measurement unit which provides a set of
> +programmable counters that monitor specific events. There are different tools which can gather
> +that information, perf being one example. However, in some scenarios, e.g. when CPU cores are
> +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> +read specific events directly from the application via ``rte_pmu_read()``.
>   
>   Profiling on x86
>   ----------------
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..d6d05b56f3 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -38,6 +38,9 @@ sources += files(
>           'rte_service.c',
>           'rte_version.c',
>   )
> +if is_linux
> +    sources += files('rte_pmu.c')
> +endif
>   if is_linux or is_windows
>       sources += files('eal_common_dynmem.c')
>   endif
> diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> new file mode 100644
> index 0000000000..cade4245e6
> --- /dev/null
> +++ b/lib/eal/common/pmu_private.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +/**
> + * Initialize PMU tracing internals.
> + */
> +void
> +eal_pmu_init(void);
> +
> +/**
> + * Cleanup PMU internals.
> + */
> +void
> +eal_pmu_fini(void);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..6763005903
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,457 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_malloc.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	rte_free(group->mmap_pages);
> +	rte_free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->fds), 0);
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->mmap_pages), 0);
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = rte_calloc(NULL, 1, sizeof(*event), 0);
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		rte_free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = rte_calloc(NULL, 1, sizeof(*rte_pmu), RTE_CACHE_LINE_SIZE);
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	rte_free(rte_pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		rte_free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	rte_free(rte_pmu);
> +}

There may be some problems with the implementation of eal_pmu_fini(),
but I'm not sure.

I checked some test reports for this series. It seems that the
`debug_autotest` test case in the DPDK unit tests has an issue when the
child process in this test case calls rte_exit().

The call chain is as follows:

     test_debug() -> test_exit() -> test_exit_val() -> rte_exit() ->
rte_eal_cleanup() -> eal_pmu_fini().

Judging from the following error message, the issue may be related to freeing memory:

test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: 1
   Cause: test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: 2
   Cause: test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: 255
   Cause: test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: -1
   Cause: test_exit_valEAL: Error: Invalid memory

The above error message disappears when I comment out the call to
eal_pmu_fini() in rte_eal_cleanup().

> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>           'rte_pci_dev_features.h',
>           'rte_per_lcore.h',
>           'rte_pflock.h',
> +        'rte_pmu.h',
>           'rte_random.h',
>           'rte_reciprocal.h',
>           'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..e4b4f6b052
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /**< name of an event */
> +	int index; /**< event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!rte_pmu)
> +		return 0;
> +
> +	group = &rte_pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}
> +
> +	if (index < 0 || index >= rte_pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> +}
> +
> +#else /* !RTE_EXEC_ENV_LINUX */
> +
> +__rte_experimental
> +static int __rte_unused
> +rte_pmu_add_event(__rte_unused const char *name)
> +{
> +	return -1;
> +}
> +
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(__rte_unused int index)
> +{
> +	return 0;
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 8c118d0d9f..751a13b597 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -53,6 +53,7 @@
>   #include "eal_options.h"
>   #include "eal_vfio.h"
>   #include "hotplug_mp.h"
> +#include "pmu_private.h"
>   
>   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
>   
> @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
>   		return -1;
>   	}
>   
> +	eal_pmu_init();
> +
>   	if (rte_eal_tailqs_init() < 0) {
>   		rte_eal_init_alert("Cannot init tail queues for objects");
>   		rte_errno = EFAULT;
> @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
>   	eal_bus_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_pmu_fini();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..9225f46f67 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -440,6 +440,11 @@ EXPERIMENTAL {
>   	rte_thread_detach;
>   	rte_thread_equal;
>   	rte_thread_join;
> +
> +	# added in 23.03
> +	rte_pmu; # WINDOWS_NO_EXPORT
> +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> +	rte_pmu_read; # WINDOWS_NO_EXPORT
>   };
>   
>   INTERNAL {
> @@ -483,4 +488,5 @@ INTERNAL {
>   	rte_mem_map;
>   	rte_mem_page_size;
>   	rte_mem_unmap;
> +	rte_pmu_enable_group;
>   };

Best regards,

Min Zhou


* RE: [EXT] Re: [PATCH v3 1/4] eal: add generic support for reading PMU events
  2022-11-30  8:32       ` zhoumin
@ 2022-12-13  8:05         ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-13  8:05 UTC (permalink / raw)
  To: zhoumin, dev; +Cc: thomas, Jerin Jacob Kollanukkaran

Hello Min, 

> -----Original Message-----
> From: zhoumin <zhoumin@loongson.cn>
> Sent: Wednesday, November 30, 2022 9:33 AM
> To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Subject: [EXT] Re: [PATCH v3 1/4] eal: add generic support for reading PMU events
> 
> Hi Tomasz,
> 

[...] 

> > +void
> > +eal_pmu_fini(void)
> > +{
> > +	struct rte_pmu_event *event, *tmp;
> > +	int lcore_id;
> > +
> > +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> > +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> > +		free(event->name);
> > +		rte_free(event);
> > +	}
> > +
> > +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> > +		cleanup_events(lcore_id);
> > +
> > +	pmu_arch_fini();
> > +	free(rte_pmu->name);
> > +	rte_free(rte_pmu);
> > +}
> 
> There may be some problems with the implementation of eal_pmu_fini(), but I'm not sure.
> 
> I checked some test reports for this series. It seems that the `debug_autotest` test case in the
> DPDK unit tests has an issue when the child process in this test case calls rte_exit().
> 
> The call chain is as follows:
> 
>      test_debug() -> test_exit() -> test_exit_val() -> rte_exit() ->
> rte_eal_cleanup() -> eal_pmu_fini().
> 
> Judging from the following error message, the issue may be related to freeing memory:
> 
> test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: 1
>    Cause: test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: 2
>    Cause: test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: 255
>    Cause: test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: -1
>    Cause: test_exit_valEAL: Error: Invalid memory
> 
> The above error message disappears when I comment out the call to eal_pmu_fini() in
> rte_eal_cleanup().
> 

Thanks for pointing this out. This was apparently caused by the same hugepage memory being
freed multiple times in forked processes.
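
One defensive pattern for this class of problem is to let only the process that
allocated the state release it, so that a forked child running the same cleanup
path becomes a no-op. The sketch below is an illustration only and is not
necessarily what v4 does: struct pmu_state, state_init() and state_fini() are
hypothetical names, and plain calloc()/free() stand in for the hugepage-backed
allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the global PMU state. */
struct pmu_state {
	pid_t owner; /* process that allocated the state */
	char *name;  /* would hold the PMU name; unused in this sketch */
};

static struct pmu_state *state;

/* Allocate the state and remember which process owns it. */
static int
state_init(void)
{
	state = calloc(1, sizeof(*state));
	if (state == NULL)
		return -1;

	state->owner = getpid();

	return 0;
}

/* Returns 1 if resources were released, 0 if the call was a no-op. */
static int
state_fini(void)
{
	/* Nothing to do if never initialized, already freed, or not the owner. */
	if (state == NULL || state->owner != getpid())
		return 0;

	free(state->name);
	free(state);
	state = NULL; /* make repeated calls harmless */

	return 1;
}
```

With this guard a second call to state_fini() in the owning process, or any
call from a forked child, leaves the memory untouched.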

* RE: [PATCH v3 0/4] add support for self monitoring
  2022-11-29 10:42     ` [PATCH v3 0/4] add support for self monitoring Morten Brørup
@ 2022-12-13  8:23       ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-13  8:23 UTC (permalink / raw)
  To: Morten Brørup, dev; +Cc: thomas, Jerin Jacob Kollanukkaran

Hi Morten,

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Tuesday, November 29, 2022 11:43 AM
> To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Subject: [EXT] RE: [PATCH v3 0/4] add support for self monitoring
> 
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Tuesday, 29 November 2022 10.28
> >
> > This series adds self monitoring support, i.e. it allows configuring and
> > reading performance measurement unit (PMU) counters at runtime without
> > using the perf utility. This has certain advantages when an application runs
> > on isolated cores with the nohz_full kernel parameter.
> >
> > Events can be read directly using rte_pmu_read() or using the dedicated
> > tracepoint rte_eal_trace_pmu_read(). The latter will cause events to
> > be stored inside a CTF file.
> >
> > By design, all enabled events are grouped together and the same group
> > is attached to lcores that use self monitoring functionality.
> >
> > Events are enabled by name, and the names need to be read from the standard
> > location under sysfs, i.e.
> >
> > /sys/bus/event_source/devices/PMU/events
> >
> > where PMU is a core PMU, i.e. one measuring CPU events. As of today raw
> > events are not supported.
> 
> Hi Tomasz,
> 
> I am very interested in this patch series for fast path profiling purposes. (Not using EAL trace,
> but our proprietary profiler.)
> 
> However, it seems that rte_pmu_read() is quite longwinded, compared to rte_pmu_pmc_read().
> 

We need a bit of extra logic to set things up before actually reading the counter, but in reality
cycles are mostly consumed by rte_pmu_pmc_read(). This obviously differs among platforms, so if you
want precise measurements you need to get your hands dirty.

That said, below are results coming from dpdk-test after running trace_perf_autotest - just to give you some idea. 

X86-64

RTE>>trace_perf_autotest
Timer running at 3000.00MHz
            void: cycles=17.739375 ns=5.913125
             u64: cycles=17.348296 ns=5.782765
             int: cycles=17.098724 ns=5.699575
           float: cycles=17.099946 ns=5.699982
          double: cycles=17.229702 ns=5.743234
          string: cycles=31.159907 ns=10.386636
         void_fp: cycles=0.679842 ns=0.226614
        read_pmu: cycles=49.325117 ns=16.441706

ARM64 with RTE_ARM_EAL_RDTSC_USE_PMU

RTE>>trace_perf_autotest
Timer running at 2480.00MHz
            void: cycles=9.413568 ns=3.795793
             u64: cycles=9.386003 ns=3.784678
             int: cycles=9.438701 ns=3.805928
           float: cycles=9.359377 ns=3.773942
          double: cycles=9.372279 ns=3.779145
          string: cycles=24.474899 ns=9.868911
         void_fp: cycles=0.505513 ns=0.203836
        read_pmu: cycles=17.442853 ns=7.033409
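
For reading the tables above: the ns column is the cycle count divided by the
timer frequency, and a rough per-call cost of the PMU read can be estimated by
subtracting a baseline tracepoint. The sketch below assumes the u64 tracepoint
is a fair baseline, which is an assumption and not something the test itself
states; cycles_to_ns() and extra_cycles() are hypothetical helper names.

```c
#include <assert.h>

/*
 * Convert a cycle count to nanoseconds for a timer running at freq_mhz.
 * 1 MHz is 1e-3 cycles per nanosecond, hence the factor of 1000.
 */
static double
cycles_to_ns(double cycles, double freq_mhz)
{
	assert(freq_mhz > 0.0);

	return cycles * 1000.0 / freq_mhz;
}

/* Extra cycles spent by one tracepoint relative to a baseline tracepoint. */
static double
extra_cycles(double measured, double baseline)
{
	return measured - baseline;
}
```

Plugging in the numbers above, cycles_to_ns(49.325117, 3000.0) gives about
16.44 ns and cycles_to_ns(17.442853, 2480.0) about 7.03 ns, matching the
read_pmu rows. Relative to u64, read_pmu costs roughly 32 extra cycles on
x86-64 and roughly 8 extra cycles on ARM64.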

> But perhaps I am just worrying too much, so I will ask: What is the performance cost of using
> rte_pmu_read() - compared to rte_pmu_pmc_read() - in the fast path?
> 
> If there is a non-negligible difference, could you please provide an example of how to configure
> PMU events and use rte_pmu_pmc_read() in an application?
> 

The series comes with some documentation, so you can check there how to run it.

> I would primarily be interested in data cache misses and branch mispredictions. But feel free to
> make your own choices for the example.

Raw events are not supported right now, which means you don't have fine-grained control over all events.
You can only use events exposed by the CPU PMU (/sys/bus/event_source/devices/<PMU>/events).
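
To illustrate what a named event resolves to: each file under events/ holds a
comma-separated list of <term>=<value> pairs, and each term has a matching file
under format/ that names the config field and bit range. The sketch below
mirrors the encoding done by parse_event() in patch 1/4, but with the format
table hard-coded instead of read from sysfs and with everything placed in a
single config word. The table contents and the sample strings are assumptions
(typical of an Intel core PMU), and encode_event() is a hypothetical name;
check your own sysfs for the real values.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same helper as in the patch: a mask with bits l..h set. */
#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> (64 - 1 - (h))))

/* Hypothetical stand-in for .../devices/<PMU>/format/<term> ("config:low-high"). */
struct term_format {
	const char *term;
	int low, high; /* bit range inside the config word */
};

static const struct term_format formats[] = {
	{ "event", 0, 7 },  /* assumed: config:0-7 */
	{ "umask", 8, 15 }, /* assumed: config:8-15 */
};

/* Encode "term=value,term=value" into *config. Returns 0 on success, -1 on error. */
static int
encode_event(const char *desc, uint64_t *config)
{
	char buf[128], term[32];
	char *token;
	size_t i;
	int val, n;

	if (strlen(desc) >= sizeof(buf))
		return -1;
	strcpy(buf, desc);

	*config = 0;
	for (token = strtok(buf, ","); token != NULL; token = strtok(NULL, ",")) {
		n = sscanf(token, "%31[^=]=%i", term, &val);
		if (n < 1)
			return -1;
		if (n == 1)
			val = 1; /* a bare term implies value 1, as in the patch */

		for (i = 0; i < sizeof(formats) / sizeof(formats[0]); i++)
			if (strcmp(formats[i].term, term) == 0)
				break;
		if (i == sizeof(formats) / sizeof(formats[0]))
			return -1; /* unknown term */

		/* Shift the value into place and clip it to the field width. */
		*config |= ((uint64_t)val << formats[i].low) &
			   GENMASK_ULL(formats[i].high, formats[i].low);
	}

	return 0;
}
```

Under these assumptions, "event=0x2e,umask=0x41" encodes to 0x412e, which is
the value that would end up in perf_event_attr.config.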



* [PATCH v4 0/4] add support for self monitoring
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
                       ` (4 preceding siblings ...)
  2022-11-29 10:42     ` [PATCH v3 0/4] add support for self monitoring Morten Brørup
@ 2022-12-13 10:43     ` Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                         ` (4 more replies)
  5 siblings, 5 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs
on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by name, and the names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.

v4:
- fix memory freeing issue detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 104 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 519 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   7 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1075 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1


* [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2022-12-13 11:52         ` Morten Brørup
                           ` (2 more replies)
  2022-12-13 10:43       ` [PATCH v4 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                         ` (3 subsequent siblings)
  4 siblings, 3 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   6 +
 10 files changed, 765 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..049fe19fe3
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,456 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	free(group->mmap_pages);
+	free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int ret;
+
+	if (rte_pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu->name = strdup(name);
+		if (!rte_pmu->name)
+			return -ENOMEM;
+	}
+
+	return rte_pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = calloc(1, sizeof(*event));
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	rte_pmu = calloc(1, sizeof(*rte_pmu));
+	if (!rte_pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&rte_pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(rte_pmu->name);
+	free(rte_pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
+		free(event->name);
+		free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(rte_pmu->name);
+	free(rte_pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..e4b4f6b052
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!rte_pmu)
+		return 0;
+
+	group = &rte_pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= rte_pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..9225f46f67 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,11 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu; # WINDOWS_NO_EXPORT
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +488,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH v4 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev, Ruifeng Wang; +Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
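Note: pmu_arch_init() below saves the current value of
/proc/sys/kernel/perf_user_access and restores it in pmu_arch_fini(). A
minimal sketch of reading such an integer attribute with stdio, which
NUL-terminates the buffer on its own, could look as follows. The split into
parse_attr_int() and this variant of read_attr_int() is an illustration
only, not the code from this patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert the text of an attribute, e.g. "1\n", into an int. */
static int
parse_attr_int(const char *buf, int *val)
{
	char *end;
	long num;

	errno = 0;
	num = strtol(buf, &end, 10);
	/* reject empty input, overflow and values that do not fit an int */
	if (end == buf || errno != 0 || num < INT_MIN || num > INT_MAX)
		return -EINVAL;

	*val = (int)num;

	return 0;
}

/* Read the first line of a file and parse it as an int. */
static int
read_attr_int(const char *path, int *val)
{
	char buf[32];
	FILE *fp;
	int ret;

	fp = fopen(path, "r");
	if (fp == NULL)
		return -errno;

	/* fgets() always NUL-terminates what it stores */
	if (fgets(buf, sizeof(buf), fp) != NULL)
		ret = parse_attr_int(buf, val);
	else
		ret = -EINVAL;

	fclose(fp);

	return ret;
}
```

Keeping the parsing separate from the file access makes the conversion
testable without touching /proc.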
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 104 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 155 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..10e2984813
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			/* operand 1 is the counter index, operand 0 is the output */
+			"msr pmselr_el0, %x1\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..10ec770ead
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ] = { 0 };
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index e4b4f6b052..158a616b83 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH v4 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
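Note: rdpmc returns the counter in two 32-bit halves, and
rte_pmu_read_userpage() from patch 1/4 then keeps only the low pmc_width
bits of the raw value before adding the offset. A minimal sketch of those
two arithmetic steps, without the rdpmc instruction itself, could look as
follows. The names pmc_combine() and pmc_trim() are hypothetical.

```c
#include <stdint.h>

/* Join the EDX (high) and EAX (low) halves returned by rdpmc. */
static uint64_t
pmc_combine(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | low;
}

/*
 * Keep only the low `width` bits of a raw counter value.
 * width must be in the range 1..64, otherwise the shift is undefined.
 */
static uint64_t
pmc_trim(uint64_t raw, unsigned int width)
{
	raw <<= 64 - width;
	raw >>= 64 - width;

	return raw;
}
```

With a 48-bit counter, for example, anything found in the upper 16 bits of
the raw value is discarded by the second helper.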
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 158a616b83..3d90f4baf7 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..a2cd849fb1
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH v4 4/4] eal: add PMU support to tracing library
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
                         ` (2 preceding siblings ...)
  2022-12-13 10:43       ` [PATCH v4 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas, mb, zhoumin, Tomasz Duszynski

In order to profile an application one needs to store a significant
number of samples somewhere for later analysis. Since the trace library
supports storing data in the CTF format, let's take advantage of that
and add a dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
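Note: the index passed to rte_eal_trace_pmu_read() follows the order of the
comma-separated event names given via --trace. A minimal sketch of how such
a list maps to indices is shown below. The helper split_events() and the
MAX_NAME limit are hypothetical, and the sketch does not use strtok() the
way add_events() in this patch does.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NAME 32

/*
 * Split a comma-separated list such as "cpu_cycles,l1d_cache" into names.
 * The position of a name in the output array is its event index.
 * Empty tokens and names that do not fit in MAX_NAME are skipped.
 * Returns the number of names stored, at most max.
 */
static int
split_events(const char *list, char names[][MAX_NAME], int max)
{
	int num = 0;

	while (*list != '\0' && num < max) {
		/* length of the current token, up to the next comma or the end */
		size_t len = strcspn(list, ",");

		if (len > 0 && len < MAX_NAME) {
			memcpy(names[num], list, len);
			names[num][len] = '\0';
			num++;
		}

		list += len;
		if (*list == ',')
			list++;
	}

	return num;
}
```

For "cpu_cycles,l1d_cache" this stores cpu_cycles at index 0 and l1d_cache
at index 1, matching the indices used in the documentation added below.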
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 049fe19fe3..be105493ea 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -19,6 +19,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -403,11 +404,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		/* match length without the e= pattern prefix */
+		num = rmatch.rm_eo - rmatch.rm_so - 2;
+		if (num > sizeof(buf) - 1)
+			num = sizeof(buf) - 1;
+
+		memcpy(buf, pattern + rmatch.rm_so + 2, num);
+		buf[num] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	rte_pmu = calloc(1, sizeof(*rte_pmu));
 	if (!rte_pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -429,6 +489,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(rte_pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 9225f46f67..73803f9601 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
-- 
2.25.1
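
For readers following the --trace handling in this patch, the way event
names are pulled out of e=ev1[,ev2,...] can be sketched in isolation. This is
an illustration only, not the code from the series: collect_events(),
MAX_EVENTS and MAX_NAME are hypothetical names, and instead of calling
rte_pmu_add_event() the sketch simply copies each name into a caller-supplied
array. The regular expression is the one used in the patch.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 8   /* hypothetical limit, for illustration only */
#define MAX_NAME 64    /* hypothetical limit, for illustration only */

/*
 * Collect event names from every e=ev1[,ev2,...] occurrence in 'pattern'.
 * Returns the number of names stored in 'names', or -1 if the regular
 * expression fails to compile.
 */
static int
collect_events(const char *pattern, char names[MAX_EVENTS][MAX_NAME])
{
	regmatch_t rmatch;
	char buf[256];
	regex_t reg;
	int count = 0;

	assert(pattern != NULL);

	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
		return -1;

	while (regexec(&reg, pattern, 1, &rmatch, 0) == 0) {
		/* Length of the match without the leading "e=". */
		size_t len = (size_t)(rmatch.rm_eo - rmatch.rm_so) - 2;
		char *token, *save;

		if (len > sizeof(buf) - 1)
			len = sizeof(buf) - 1;
		memcpy(buf, pattern + rmatch.rm_so + 2, len);
		buf[len] = '\0';

		/* Split the comma separated list into single event names. */
		for (token = strtok_r(buf, ",", &save); token != NULL;
		     token = strtok_r(NULL, ",", &save)) {
			if (count == MAX_EVENTS)
				break;
			snprintf(names[count++], MAX_NAME, "%s", token);
		}

		/* Continue searching right after the current match. */
		pattern += rmatch.rm_eo;
	}

	regfree(&reg);

	return count;
}
```

The terminator is written at the copied length, so the buffer always holds a
properly terminated string regardless of how long the match is.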



* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-12-13 11:52         ` Morten Brørup
  2022-12-14  9:38           ` Tomasz Duszynski
  2022-12-15  8:46         ` Mattias Rönnblom
  2023-01-09  7:37         ` Ruifeng Wang
  2 siblings, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2022-12-13 11:52 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Tuesday, 13 December 2022 11.44
> 
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---


> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,456 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >>
> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1))
> & (m))
> +#endif
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures
> should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/format/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is
> implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf) - 1, fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(),
> rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n",
> event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not
> supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED,
> group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE,
> PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	free(group->mmap_pages);
> +	free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group-
> >fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = calloc(rte_pmu->num_group_events,
> sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-
> %d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-
> %d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE,
> PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-
> %d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/events/%s", rte_pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = calloc(1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event-
> >index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>          'rte_pci_dev_features.h',
>          'rte_per_lcore.h',
>          'rte_pflock.h',
> +        'rte_pmu.h',
>          'rte_random.h',
>          'rte_reciprocal.h',
>          'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..e4b4f6b052
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped
> perf_event_mmap_page structures */

There seems to be a lot of indirection involved here. Why are these arrays not statically sized, instead of dynamically allocated?

Also, what is the reason for hiding the actual type, struct perf_event_mmap_page **mmap_pages, behind an opaque void **mmap_pages?

> +	bool enabled; /**< true if group was enabled on particular lcore
> */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under
> /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *rte_pmu;

Again, why not just extern struct rte_pmu, instead of dynamic allocation?

> +
> +/** Each architecture supporting PMU needs to provide its own version
> */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get
> perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under
> /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of
> support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!rte_pmu)
> +		return 0;
> +
> +	group = &rte_pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}

Why is the group not enabled in the setup function, rte_pmu_add_event(), instead of here, in the hot path?

> +
> +	if (index < 0 || index >= rte_pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage((struct perf_event_mmap_page
> *)group->mmap_pages[index]);

Using fixed size arrays instead of multiple indirections via pointers is faster. It could be:

return rte_pmu_read_userpage((struct perf_event_mmap_page *)rte_pmu.group[lcore_id].mmap_pages[index]);

With or without suggested performance improvements...

Series-acked-by: Morten Brørup <mb@smartsharesystems.com>



* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 11:52         ` Morten Brørup
@ 2022-12-14  9:38           ` Tomasz Duszynski
  2022-12-14 10:41             ` Morten Brørup
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2022-12-14  9:38 UTC (permalink / raw)
  To: Morten Brørup, dev; +Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin

Hello Morten, 

Thanks for review. Answers inline. 

[...]

> > +/**
> > + * @file
> > + *
> > + * PMU event tracing operations
> > + *
> > + * This file defines generic API and types necessary to setup PMU and
> > + * read selected counters in runtime.
> > + */
> > +
> > +/**
> > + * A structure describing a group of events.
> > + */
> > +struct rte_pmu_event_group {
> > +	int *fds; /**< array of event descriptors */
> > +	void **mmap_pages; /**< array of pointers to mmapped
> > perf_event_attr structures */
> 
> There seems to be a lot of indirection involved here. Why are these arrays not statically sized,
> instead of dynamically allocated?
> 

Different architectures/PMUs impose different limits on the number of simultaneously enabled counters. So, in order
to relieve the pain of thinking about it and adding macros for each and every arch, I decided to allocate
the number the user wants dynamically. The assumption also holds that the user knows about the tradeoffs of using
too many counters and hence will not enable too many events at once.

> Also, what is the reason for hiding the type struct perf_event_mmap_page **mmap_pages opaque by
> using void **mmap_pages instead?

I think the part doing mmap/munmap was written first, hence void ** was chosen in the first place.

> 
> > +	bool enabled; /**< true if group was enabled on particular lcore
> > */
> > +};
> > +
> > +/**
> > + * A structure describing an event.
> > + */
> > +struct rte_pmu_event {
> > +	char *name; /** name of an event */
> > +	int index; /** event index into fds/mmap_pages */
> > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> > +
> > +/**
> > + * A PMU state container.
> > + */
> > +struct rte_pmu {
> > +	char *name; /** name of core PMU listed under
> > /sys/bus/event_source/devices */
> > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> > event group data */
> > +	int num_group_events; /**< number of events in a group */
> > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> > events */
> > +};
> > +
> > +/** Pointer to the PMU state container */ extern struct rte_pmu
> > +*rte_pmu;
> 
> Again, why not just extern struct rte_pmu, instead of dynamic allocation?
> 

No strong opinion here since this is a matter of personal preference. The dynamic allocation
can be dropped in the next version.

> > +
> > +/** Each architecture supporting PMU needs to provide its own version
> > */
> > +#ifndef rte_pmu_pmc_read
> > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
> > +
> > +/**
> > + * @internal
> > + *
> > + * Read PMU counter.
> > + *
> > + * @param pc
> > + *   Pointer to the mmapped user page.
> > + * @return
> > + *   Counter value read from hardware.
> > + */
> > +__rte_internal
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> > +	uint64_t offset, width, pmc = 0;
> > +	uint32_t seq, index;
> > +	int tries = 100;
> > +
> > +	for (;;) {
> > +		seq = pc->lock;
> > +		rte_compiler_barrier();
> > +		index = pc->index;
> > +		offset = pc->offset;
> > +		width = pc->pmc_width;
> > +
> > +		if (likely(pc->cap_user_rdpmc && index)) {
> > +			pmc = rte_pmu_pmc_read(index - 1);
> > +			pmc <<= 64 - width;
> > +			pmc >>= 64 - width;
> > +		}
> > +
> > +		rte_compiler_barrier();
> > +
> > +		if (likely(pc->lock == seq))
> > +			return pmc + offset;
> > +
> > +		if (--tries == 0) {
> > +			RTE_LOG(DEBUG, EAL, "failed to get
> > perf_event_mmap_page lock\n");
> > +			break;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * @internal
> > + *
> > + * Enable group of events for a given lcore.
> > + *
> > + * @param lcore_id
> > + *   The identifier of the lcore.
> > + * @return
> > + *   0 in case of success, negative value otherwise.
> > + */
> > +__rte_internal
> > +int
> > +rte_pmu_enable_group(int lcore_id);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Add event to the group of enabled events.
> > + *
> > + * @param name
> > + *   Name of an event listed under
> > /sys/bus/event_source/devices/pmu/events.
> > + * @return
> > + *   Event index in case of success, negative value otherwise.
> > + */
> > +__rte_experimental
> > +int
> > +rte_pmu_add_event(const char *name);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Read hardware counter configured to count occurrences of an event.
> > + *
> > + * @param index
> > + *   Index of an event to be read.
> > + * @return
> > + *   Event value read from register. In case of errors or lack of
> > support
> > + *   0 is returned. In other words, stream of zeros in a trace file
> > + *   indicates problem with reading particular PMU event register.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read(int index)
> > +{
> > +	int lcore_id = rte_lcore_id();
> > +	struct rte_pmu_event_group *group;
> > +	int ret;
> > +
> > +	if (!rte_pmu)
> > +		return 0;
> > +
> > +	group = &rte_pmu->group[lcore_id];
> > +	if (!group->enabled) {
> > +		ret = rte_pmu_enable_group(lcore_id);
> > +		if (ret)
> > +			return 0;
> > +
> > +		group->enabled = true;
> > +	}
> 
> Why is the group not enabled in the setup function, rte_pmu_add_event(), instead of here, in the
> hot path?
> 

When this is executed for the very first time the CPU obviously has more work to do,
but afterwards the setup path is not taken, hence far fewer CPU cycles are required.

Setup is executed solely by the main lcore, before the worker lcores are launched, hence some of the info
passed to the perf_event_open() syscall is still missing, the pid (via rte_gettid()) being an example here.
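
The lazy-enable argument above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the code from the patch:
group_state, enable_group() and read_counter() are hypothetical names, the
thread id comes from syscall(SYS_gettid) as a stand-in for rte_gettid(), and
no hardware counter is actually read.

```c
#define _GNU_SOURCE

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-thread state; stands in for struct rte_pmu_event_group. */
struct group_state {
	bool enabled;    /* set once the slow path has run on this thread */
	pid_t tid;       /* thread id that a perf_event_open() call would need */
	int setup_calls; /* counts slow-path executions, for illustration only */
};

static __thread struct group_state state;

/* Slow path: must run on the measuring thread because it needs its tid. */
static int
enable_group(struct group_state *gs)
{
	assert(gs != NULL);

	gs->tid = (pid_t)syscall(SYS_gettid);
	gs->setup_calls++;

	return gs->tid > 0 ? 0 : -1;
}

/* Fast path: after the first call only the 'enabled' check remains. */
static uint64_t
read_counter(void)
{
	if (!state.enabled) {
		if (enable_group(&state))
			return 0;
		state.enabled = true;
	}

	/* A real implementation would read the hardware counter here. */
	return 1;
}
```

Only the first read_counter() call on a given thread pays for the setup;
later calls take a single predictable branch.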

> > +
> > +	if (index < 0 || index >= rte_pmu->num_group_events)
> > +		return 0;
> > +
> > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
> > *)group->mmap_pages[index]);
> 
> Using fixed size arrays instead of multiple indirections via pointers is faster. It could be:
> 
> return rte_pmu_read_userpage((struct perf_event_mmap_page
> *)rte_pmu.group[lcore_id].mmap_pages[index]);
> 
> With or without suggested performance improvements...
> 
> Series-acked-by: Morten Brørup <mb@smartsharesystems.com>



* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-14  9:38           ` Tomasz Duszynski
@ 2022-12-14 10:41             ` Morten Brørup
  2022-12-15  8:22               ` Morten Brørup
  2023-01-05 21:14               ` Tomasz Duszynski
  0 siblings, 2 replies; 139+ messages in thread
From: Morten Brørup @ 2022-12-14 10:41 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

+CC: Mattias, see my comment below about per-thread constructor for this

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 14 December 2022 10.39
> 
> Hello Morten,
> 
> Thanks for review. Answers inline.
> 
> [...]
> 
> > > +/**
> > > + * @file
> > > + *
> > > + * PMU event tracing operations
> > > + *
> > > + * This file defines generic API and types necessary to setup PMU
> and
> > > + * read selected counters in runtime.
> > > + */
> > > +
> > > +/**
> > > + * A structure describing a group of events.
> > > + */
> > > +struct rte_pmu_event_group {
> > > +	int *fds; /**< array of event descriptors */
> > > +	void **mmap_pages; /**< array of pointers to mmapped
> > > perf_event_attr structures */
> >
> > There seems to be a lot of indirection involved here. Why are these
> arrays not statically sized,
> > instead of dynamically allocated?
> >
> 
> Different architectures/pmus impose limits on number of simultaneously
> enabled counters. So in order
> relief the pain of thinking about it and adding macros for each and
> every arch I decided to allocate
> the number user wants dynamically. Also assumption holds that user
> knows about tradeoffs of using
> too many counters hence will not enable too many events at once.

The DPDK convention is to use fixed size arrays (with a maximum size, e.g. RTE_MAX_ETHPORTS) in the fast path, for performance reasons.

Please use fixed size arrays instead of dynamically allocated arrays.
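
A minimal sketch of the fixed-size layout requested above. It assumes a
hypothetical compile-time limit, MAX_NUM_GROUP_EVENTS, which is not named
anywhere in this thread; event_group, group_init() and group_add_fd() are
likewise illustrative names. The mmapped pages use their actual pointer type
through a forward declaration, so no kernel header is needed here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical upper bound; the real limit depends on the PMU. */
#define MAX_NUM_GROUP_EVENTS 8

/* Forward declaration is enough for storing pointers. */
struct perf_event_mmap_page;

/* Per-lcore group with storage embedded in the structure itself. */
struct event_group {
	int fds[MAX_NUM_GROUP_EVENTS];                              /* event descriptors */
	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /* mmapped user pages */
	bool enabled;
};

/* Reset a group so that every slot is marked unused. */
static void
group_init(struct event_group *group)
{
	assert(group != NULL);

	for (size_t i = 0; i < MAX_NUM_GROUP_EVENTS; i++) {
		group->fds[i] = -1;
		group->mmap_pages[i] = NULL;
	}
	group->enabled = false;
}

/* Store a descriptor in the first free slot; -ENOSPC once the array is full. */
static int
group_add_fd(struct event_group *group, int fd)
{
	for (int i = 0; i < MAX_NUM_GROUP_EVENTS; i++) {
		if (group->fds[i] == -1) {
			group->fds[i] = fd;
			return i;
		}
	}

	return -ENOSPC;
}
```

With this layout a lookup such as group->mmap_pages[index] needs no extra
pointer load, at the cost of rejecting events beyond the fixed limit.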

> 
> > Also, what is the reason for hiding the type struct
> perf_event_mmap_page **mmap_pages opaque by
> > using void **mmap_pages instead?
> 
> I think, that part doing mmap/munmap was written first hence void **
> was chosen in the first place.

Please update it, so the actual type is reflected here.

> 
> >
> > > +	bool enabled; /**< true if group was enabled on particular lcore
> > > */
> > > +};
> > > +
> > > +/**
> > > + * A structure describing an event.
> > > + */
> > > +struct rte_pmu_event {
> > > +	char *name; /** name of an event */
> > > +	int index; /** event index into fds/mmap_pages */
> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> > > +
> > > +/**
> > > + * A PMU state container.
> > > + */
> > > +struct rte_pmu {
> > > +	char *name; /** name of core PMU listed under
> > > /sys/bus/event_source/devices */
> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> > > event group data */
> > > +	int num_group_events; /**< number of events in a group */
> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> > > events */

The event_list is used in slow path only, so it can remain a list - i.e. no change requested here. :-)

> > > +};
> > > +
> > > +/** Pointer to the PMU state container */ extern struct rte_pmu
> > > +*rte_pmu;
> >
> > Again, why not just extern struct rte_pmu, instead of dynamic
> allocation?
> >
> 
> No strong opinions here since this is a matter of personal preference.
> Can be removed
> in the next version.

Yes, please.

> 
> > > +
> > > +/** Each architecture supporting PMU needs to provide its own
> version
> > > */
> > > +#ifndef rte_pmu_pmc_read
> > > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Read PMU counter.
> > > + *
> > > + * @param pc
> > > + *   Pointer to the mmapped user page.
> > > + * @return
> > > + *   Counter value read from hardware.
> > > + */
> > > +__rte_internal
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> > > +	uint64_t offset, width, pmc = 0;
> > > +	uint32_t seq, index;
> > > +	int tries = 100;
> > > +
> > > +	for (;;) {

As a matter of personal preference, I would write this loop differently:

+ for (tries = 100; tries != 0; tries--) {

> > > +		seq = pc->lock;
> > > +		rte_compiler_barrier();
> > > +		index = pc->index;
> > > +		offset = pc->offset;
> > > +		width = pc->pmc_width;
> > > +
> > > +		if (likely(pc->cap_user_rdpmc && index)) {

Why "&& index"? The way I read [man perf_event_open], index 0 is perfectly valid.

[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html

> > > +			pmc = rte_pmu_pmc_read(index - 1);
> > > +			pmc <<= 64 - width;
> > > +			pmc >>= 64 - width;
> > > +		}
> > > +
> > > +		rte_compiler_barrier();
> > > +
> > > +		if (likely(pc->lock == seq))
> > > +			return pmc + offset;
> > > +
> > > +		if (--tries == 0) {
> > > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> > > +			break;
> > > +		}

- Remove the 4 above lines of code, and move the debug log message to the end of the function instead.

> > > +	}

+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
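
A minimal standalone sketch of the loop shape suggested above, assuming a hypothetical struct demo_page in place of struct perf_event_mmap_page and fprintf() in place of RTE_LOG(). The names demo_page, demo_read and DEMO_MAX_TRIES are illustrative only and are not part of the patch. The volatile qualifiers stand in for the compiler barriers used in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the few mmapped page fields used here. */
struct demo_page {
	volatile uint32_t lock;   /* sequence counter bumped by the writer */
	volatile uint64_t offset; /* value returned once a stable read is seen */
};

#define DEMO_MAX_TRIES 100

/* Returns the offset on a stable snapshot, 0 after giving up. */
static uint64_t
demo_read(const struct demo_page *pc)
{
	uint64_t offset;
	uint32_t seq;
	int tries;

	assert(pc != NULL);

	/* The retry budget lives in the loop header, so no counter check in the body. */
	for (tries = DEMO_MAX_TRIES; tries != 0; tries--) {
		seq = pc->lock;
		offset = pc->offset;

		if (pc->lock == seq)
			return offset;
	}

	/* Reached only when every attempt observed a changed sequence counter. */
	fprintf(stderr, "failed to get stable snapshot\n");

	return 0;
}
```

With this shape the success path returns from inside the loop, and the log message sits on the single fall-through path at the end of the function.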

> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Enable group of events for a given lcore.
> > > + *
> > > + * @param lcore_id
> > > + *   The identifier of the lcore.
> > > + * @return
> > > + *   0 in case of success, negative value otherwise.
> > > + */
> > > +__rte_internal
> > > +int
> > > +rte_pmu_enable_group(int lcore_id);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Add event to the group of enabled events.
> > > + *
> > > + * @param name
> > > + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> > > + * @return
> > > + *   Event index in case of success, negative value otherwise.
> > > + */
> > > +__rte_experimental
> > > +int
> > > +rte_pmu_add_event(const char *name);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Read hardware counter configured to count occurrences of an event.
> > > + *
> > > + * @param index
> > > + *   Index of an event to be read.
> > > + * @return
> > > + *   Event value read from register. In case of errors or lack of support
> > > + *   0 is returned. In other words, stream of zeros in a trace file
> > > + *   indicates problem with reading particular PMU event register.
> > > + */
> > > +__rte_experimental
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read(int index)

The index type can be changed from int to uint32_t. This also eliminates the "(index < 0" part of the comparison further below in this function.
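
A small standalone sketch of why the unsigned type makes the lower-bound test redundant, assuming the event count is also held in an unsigned variable. demo_index_valid and demo_index_selfcheck are hypothetical names used for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if index selects one of num_events entries. */
static bool
demo_index_valid(uint32_t index, uint32_t num_events)
{
	/* A single comparison is enough: an unsigned index cannot be negative. */
	return index < num_events;
}

/* A negative int converted to uint32_t wraps to a large value and is rejected. */
static void
demo_index_selfcheck(void)
{
	int bad = -1;

	assert(demo_index_valid(0, 4));
	assert(!demo_index_valid(4, 4));
	assert(!demo_index_valid((uint32_t)bad, 4));
}
```

Note that a caller passing a negative error code from the add-event step would still be rejected, only by the upper-bound check instead of a separate "< 0" test.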

> > > +{
> > > +	int lcore_id = rte_lcore_id();
> > > +	struct rte_pmu_event_group *group;
> > > +	int ret;
> > > +
> > > +	if (!rte_pmu)
> > > +		return 0;
> > > +
> > > +	group = &rte_pmu->group[lcore_id];
> > > +	if (!group->enabled) {

Optimized: if (unlikely(!group->enabled)) {

> > > +		ret = rte_pmu_enable_group(lcore_id);
> > > +		if (ret)
> > > +			return 0;
> > > +
> > > +		group->enabled = true;
> > > +	}
> >
> > Why is the group not enabled in the setup function,
> > rte_pmu_add_event(), instead of here, in the hot path?
> >
> 
> When this is executed for the very first time then cpu will have
> obviously more work to do but afterwards setup path is not taken
> hence much less cpu cycles are required.
> 
> Setup is executed by main lcore solely, before lcores are executed
> hence some info passed to SYS_perf_event_open ioctl() is missing,
> pid (via rte_gettid()) being an example here.

OK. Thank you for the explanation. Since it is impossible at setup, it has to be done at runtime.

@Mattias: Another good example of something that would belong in per-thread constructors, as my suggested feature creep in [1].

[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/

> 
> > > +
> > > +	if (index < 0 || index >= rte_pmu->num_group_events)

Optimized: if (unlikely(index >= rte_pmu.num_group_events))

> > > +		return 0;
> > > +
> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> >
> > Using fixed size arrays instead of multiple indirections via pointers
> > is faster. It could be:
> >
> > return rte_pmu_read_userpage((struct perf_event_mmap_page *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >
> > With or without suggested performance improvements...
> >
> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
> 



* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-14 10:41             ` Morten Brørup
@ 2022-12-15  8:22               ` Morten Brørup
  2022-12-16  7:33                 ` Morten Brørup
  2023-01-05 21:14               ` Tomasz Duszynski
  1 sibling, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2022-12-15  8:22 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 14 December 2022 11.41
> 
> +CC: Mattias, see my comment below about per-thread constructor for this
> 
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Wednesday, 14 December 2022 10.39
> >
> > Hello Morten,
> >
> > Thanks for review. Answers inline.
> >
> > [...]
> >
> > > > +__rte_experimental
> > > > +static __rte_always_inline uint64_t
> > > > +rte_pmu_read(int index)
> 
> The index type can be changed from int to uint32_t. This also
> eliminates the "(index < 0" part of the comparison further below in
> this function.
> 
> > > > +{
> > > > +	int lcore_id = rte_lcore_id();
> > > > +	struct rte_pmu_event_group *group;
> > > > +	int ret;
> > > > +
> > > > +	if (!rte_pmu)
> > > > +		return 0;
> > > > +
> > > > +	group = &rte_pmu->group[lcore_id];
> > > > +	if (!group->enabled) {
> 
> Optimized: if (unlikely(!group->enabled)) {
> 
> > > > +		ret = rte_pmu_enable_group(lcore_id);
> > > > +		if (ret)
> > > > +			return 0;
> > > > +
> > > > +		group->enabled = true;
> > > > +	}
> > >
> > > Why is the group not enabled in the setup function,
> > > rte_pmu_add_event(), instead of here, in the hot path?
> > >
> >
> > When this is executed for the very first time then cpu will have
> > obviously more work to do but afterwards setup path is not taken
> > hence much less cpu cycles are required.
> >
> > Setup is executed by main lcore solely, before lcores are executed
> > hence some info passed to SYS_perf_event_open ioctl() is missing,
> > pid (via rte_gettid()) being an example here.
> 
> OK. Thank you for the explanation. Since impossible at setup, it has to
> be done at runtime.
> 
> @Mattias: Another good example of something that would belong in
> per-thread constructors, as my suggested feature creep in [1].
> 
> [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/

I just realized that this initialization is per-lcore (not per thread), so you can use rte_lcore_callback_register() to register a per-lcore initialization function, and move rte_pmu_enable_group(lcore_id) there.

-Morten



* Re: [PATCH 1/4] eal: add generic support for reading PMU events
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-12-15  8:33   ` Mattias Rönnblom
  0 siblings, 0 replies; 139+ messages in thread
From: Mattias Rönnblom @ 2022-12-15  8:33 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj

On 2022-11-11 10:43, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>   app/test/meson.build                  |   1 +
>   app/test/test_pmu.c                   |  41 +++
>   doc/guides/prog_guide/profile_app.rst |   8 +
>   lib/eal/common/meson.build            |   3 +
>   lib/eal/common/pmu_private.h          |  41 +++
>   lib/eal/common/rte_pmu.c              | 455 ++++++++++++++++++++++++++
>   lib/eal/include/meson.build           |   1 +
>   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>   lib/eal/linux/eal.c                   |   4 +
>   lib/eal/version.map                   |   3 +
>   10 files changed, 761 insertions(+)
>   create mode 100644 app/test/test_pmu.c
>   create mode 100644 lib/eal/common/pmu_private.h
>   create mode 100644 lib/eal/common/rte_pmu.c
>   create mode 100644 lib/eal/include/rte_pmu.h
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..93b3300309 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -143,6 +143,7 @@ test_sources = files(
>           'test_timer_racecond.c',
>           'test_timer_secondary.c',
>           'test_ticketlock.c',
> +        'test_pmu.c',
>           'test_trace.c',
>           'test_trace_register.c',
>           'test_trace_perf.c',
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..fd331af9ee
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <rte_pmu.h>
> +
> +#include "test.h"
> +
> +static int
> +test_pmu_read(void)
> +{
> +	uint64_t val = 0;
> +	int tries = 10;
> +	int event = -1;
> +
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	if (val == 0)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index bd6700ef85..8fc1b20cab 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,14 @@ Profile Your Application
>   The following sections describe methods of profiling DPDK applications on
>   different architectures.
>   
> +Performance counter based profiling
> +-----------------------------------
> +
> +The majority of architectures support some sort of hardware measurement unit which provides a set of
> +programmable counters that monitor specific events. There are different tools which can gather
> +that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
> +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> +read specific events directly from application via ``rte_pmu_read()``.
>   
>   Profiling on x86
>   ----------------
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..d6d05b56f3 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -38,6 +38,9 @@ sources += files(
>           'rte_service.c',
>           'rte_version.c',
>   )
> +if is_linux
> +    sources += files('rte_pmu.c')
> +endif
>   if is_linux or is_windows
>       sources += files('eal_common_dynmem.c')
>   endif
> diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> new file mode 100644
> index 0000000000..cade4245e6
> --- /dev/null
> +++ b/lib/eal/common/pmu_private.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +/**
> + * Initialize PMU tracing internals.
> + */
> +void
> +eal_pmu_init(void);
> +
> +/**
> + * Cleanup PMU internals.
> + */
> +void
> +eal_pmu_fini(void);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..7d3bd57d1d
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,455 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_malloc.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
> +
> +struct rte_pmu *pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	rte_free(group->mmap_pages);
> +	rte_free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	int ret;
> +
> +	if (pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = rte_zmalloc(NULL, pmu->num_group_events, sizeof(*group->fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = rte_zmalloc(NULL, pmu->num_group_events, sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		pmu->name = strdup(name);
> +		if (!pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = rte_zmalloc(NULL, 1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		rte_free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
> +	if (!pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(pmu->name);
> +	rte_free(pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&pmu->event_list, event, next);
> +		free(event->name);
> +		rte_free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(pmu->name);
> +	rte_free(pmu);
> +}
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>           'rte_pci_dev_features.h',
>           'rte_per_lcore.h',
>           'rte_pflock.h',
> +        'rte_pmu.h',
>           'rte_random.h',
>           'rte_reciprocal.h',
>           'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..5955c22779
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();

I'm guessing this should be a load-acquire instead. Less heavy-handed 
than a compiler barrier on TSO CPUs, and works on weakly ordered systems 
as well (unlike the compiler barrier).

This looks like an open-coded sequence lock, so take a look in 
rte_seqcount.h for inspiration.

> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!pmu)
> +		return 0;
> +
> +	group = &pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}
> +
> +	if (index < 0 || index >= pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
> +
> +#else /* !RTE_EXEC_ENV_LINUX */
> +
> +__rte_experimental
> +static int __rte_unused
> +rte_pmu_add_event(__rte_unused const char *name)
> +{
> +	return -1;
> +}
> +
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(__rte_unused int index)
> +{
> +	return 0;
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 8c118d0d9f..751a13b597 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -53,6 +53,7 @@
>   #include "eal_options.h"
>   #include "eal_vfio.h"
>   #include "hotplug_mp.h"
> +#include "pmu_private.h"
>   
>   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
>   
> @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
>   		return -1;
>   	}
>   
> +	eal_pmu_init();
> +
>   	if (rte_eal_tailqs_init() < 0) {
>   		rte_eal_init_alert("Cannot init tail queues for objects");
>   		rte_errno = EFAULT;
> @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
>   	eal_bus_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_pmu_fini();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..e870c87493 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -432,6 +432,8 @@ EXPERIMENTAL {
>   	rte_thread_set_priority;
>   
>   	# added in 22.11
> +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> +	rte_pmu_read; # WINDOWS_NO_EXPORT
>   	rte_thread_attr_get_affinity;
>   	rte_thread_attr_init;
>   	rte_thread_attr_set_affinity;
> @@ -483,4 +485,5 @@ INTERNAL {
>   	rte_mem_map;
>   	rte_mem_page_size;
>   	rte_mem_unmap;
> +	rte_pmu_enable_group;
>   };



* Re: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-12-13 11:52         ` Morten Brørup
@ 2022-12-15  8:46         ` Mattias Rönnblom
  2023-01-04 15:47           ` Tomasz Duszynski
  2023-01-09  7:37         ` Ruifeng Wang
  2 siblings, 1 reply; 139+ messages in thread
From: Mattias Rönnblom @ 2022-12-15  8:46 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj, mb, zhoumin

On 2022-12-13 11:43, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>   app/test/meson.build                  |   1 +
>   app/test/test_pmu.c                   |  41 +++
>   doc/guides/prog_guide/profile_app.rst |   8 +
>   lib/eal/common/meson.build            |   3 +
>   lib/eal/common/pmu_private.h          |  41 +++
>   lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
>   lib/eal/include/meson.build           |   1 +
>   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>   lib/eal/linux/eal.c                   |   4 +
>   lib/eal/version.map                   |   6 +
>   10 files changed, 765 insertions(+)
>   create mode 100644 app/test/test_pmu.c
>   create mode 100644 lib/eal/common/pmu_private.h
>   create mode 100644 lib/eal/common/rte_pmu.c
>   create mode 100644 lib/eal/include/rte_pmu.h
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..93b3300309 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -143,6 +143,7 @@ test_sources = files(
>           'test_timer_racecond.c',
>           'test_timer_secondary.c',
>           'test_ticketlock.c',
> +        'test_pmu.c',
>           'test_trace.c',
>           'test_trace_register.c',
>           'test_trace_perf.c',
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..fd331af9ee
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <rte_pmu.h>
> +
> +#include "test.h"
> +
> +static int
> +test_pmu_read(void)
> +{
> +	uint64_t val = 0;
> +	int tries = 10;
> +	int event = -1;
> +
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	if (val == 0)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index 14292d4c25..a8b501fe0c 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,14 @@ Profile Your Application
>   The following sections describe methods of profiling DPDK applications on
>   different architectures.
>   
> +Performance counter based profiling
> +-----------------------------------
> +
> +Majority of architectures support some sort hardware measurement unit which provides a set of
> +programmable counters that monitor specific events. There are different tools which can gather
> +that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
> +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> +read specific events directly from application via ``rte_pmu_read()``.
>   
>   Profiling on x86
>   ----------------
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..d6d05b56f3 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -38,6 +38,9 @@ sources += files(
>           'rte_service.c',
>           'rte_version.c',
>   )
> +if is_linux
> +    sources += files('rte_pmu.c')
> +endif
>   if is_linux or is_windows
>       sources += files('eal_common_dynmem.c')
>   endif
> diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> new file mode 100644
> index 0000000000..cade4245e6
> --- /dev/null
> +++ b/lib/eal/common/pmu_private.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +/**
> + * Initialize PMU tracing internals.
> + */
> +void
> +eal_pmu_init(void);
> +
> +/**
> + * Cleanup PMU internals.
> + */
> +void
> +eal_pmu_fini(void);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..049fe19fe3
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,456 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);

This code might crash in case a long name is supplied, which is maybe 
not what you want. A truncate and a "file not found" is probably better. 
I believe there is a snprintf lookalike with these properties in DPDK.
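
One way to get the truncate-and-fail behaviour with plain C is to check
the return value of snprintf(). A minimal sketch; build_sysfs_path() and
its parameters are hypothetical names used for illustration only, they
are not part of the patch or of DPDK:

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format "<base>/<pmu>/format/<term>" into dst.
 * Returns 0 on success, -ENAMETOOLONG if the result did not fit,
 * -EINVAL if snprintf() reported an error.
 */
static int
build_sysfs_path(char *dst, size_t len, const char *base, const char *pmu, const char *term)
{
	int n = snprintf(dst, len, "%s/%s/format/%s", base, pmu, term);

	if (n < 0)
		return -EINVAL;
	/* snprintf() returns the length it wanted to write, excluding the NUL */
	if ((size_t)n >= len)
		return -ENAMETOOLONG;

	return 0;
}
```

With a check like this the caller can report a clean error instead of
passing a silently truncated path on to fopen().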

> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)

lcore_id is an unsigned int.

> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;

group->fds == NULL

This coding style violation appears throughout the patch set.
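
For illustration, the explicit comparison looks like this in a
standalone snippet. The struct and function names are made up for the
example and only loosely mirror the patch:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-lcore event group in the patch. */
struct demo_group {
	int *fds;
};

static bool
demo_group_has_fds(const struct demo_group *group)
{
	/* pointers are compared against NULL explicitly, not via '!' */
	if (group == NULL || group->fds == NULL)
		return false;

	return true;
}
```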

> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	free(group->mmap_pages);
> +	free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)

unsigned

> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = calloc(1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);

Why is the main lcore left out?

> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>           'rte_pci_dev_features.h',
>           'rte_per_lcore.h',
>           'rte_pflock.h',
> +        'rte_pmu.h',
>           'rte_random.h',
>           'rte_reciprocal.h',
>           'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..e4b4f6b052
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!rte_pmu)
> +		return 0;
> +
> +	group = &rte_pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}
> +
> +	if (index < 0 || index >= rte_pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> +}
> +
> +#else /* !RTE_EXEC_ENV_LINUX */
> +
> +__rte_experimental
> +static int __rte_unused
> +rte_pmu_add_event(__rte_unused const char *name)
> +{
> +	return -1;
> +}
> +
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(__rte_unused int index)
> +{
> +	return 0;
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 8c118d0d9f..751a13b597 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -53,6 +53,7 @@
>   #include "eal_options.h"
>   #include "eal_vfio.h"
>   #include "hotplug_mp.h"
> +#include "pmu_private.h"
>   
>   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
>   
> @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
>   		return -1;
>   	}
>   
> +	eal_pmu_init();
> +
>   	if (rte_eal_tailqs_init() < 0) {
>   		rte_eal_init_alert("Cannot init tail queues for objects");
>   		rte_errno = EFAULT;
> @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
>   	eal_bus_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_pmu_fini();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..9225f46f67 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -440,6 +440,11 @@ EXPERIMENTAL {
>   	rte_thread_detach;
>   	rte_thread_equal;
>   	rte_thread_join;
> +
> +	# added in 23.03
> +	rte_pmu; # WINDOWS_NO_EXPORT
> +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> +	rte_pmu_read; # WINDOWS_NO_EXPORT
>   };
>   
>   INTERNAL {
> @@ -483,4 +488,5 @@ INTERNAL {
>   	rte_mem_map;
>   	rte_mem_page_size;
>   	rte_mem_unmap;
> +	rte_pmu_enable_group;
>   };


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-15  8:22               ` Morten Brørup
@ 2022-12-16  7:33                 ` Morten Brørup
  0 siblings, 0 replies; 139+ messages in thread
From: Morten Brørup @ 2022-12-16  7:33 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom,
	david.marchand

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 15 December 2022 09.22
> 
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Wednesday, 14 December 2022 11.41
> >
> > +CC: Mattias, see my comment below about per-thread constructor for
> > this
> >
> > > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > > Sent: Wednesday, 14 December 2022 10.39
> > >
> > > Hello Morten,
> > >
> > > Thanks for review. Answers inline.
> > >
> > > [...]
> > >
> > > > > +__rte_experimental
> > > > > +static __rte_always_inline uint64_t
> > > > > +rte_pmu_read(int index)
> >
> > The index type can be changed from int to uint32_t. This also
> > eliminates the "(index < 0" part of the comparison further below in
> > this function.
> >
> > > > > +{
> > > > > +	int lcore_id = rte_lcore_id();
> > > > > +	struct rte_pmu_event_group *group;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (!rte_pmu)
> > > > > +		return 0;
> > > > > +
> > > > > +	group = &rte_pmu->group[lcore_id];
> > > > > +	if (!group->enabled) {
> >
> > Optimized: if (unlikely(!group->enabled)) {
> >
> > > > > +		ret = rte_pmu_enable_group(lcore_id);
> > > > > +		if (ret)
> > > > > +			return 0;
> > > > > +
> > > > > +		group->enabled = true;
> > > > > +	}
> > > >
> > > > Why is the group not enabled in the setup function,
> > > rte_pmu_add_event(), instead of here, in the
> > > > hot path?
> > > >
> > >
> > > When this is executed for the very first time then cpu will have
> > > obviously more work to do
> > > but afterwards setup path is not taken hence much less cpu cycles
> are
> > > required.
> > >
> > > Setup is executed by main lcore solely, before lcores are executed
> > > hence some info passed to
> > > SYS_perf_event_open ioctl() is missing, pid (via rte_gettid())
> being
> > an
> > > example here.
> >
> > OK. Thank you for the explanation. Since impossible at setup, it has
> to
> > be done at runtime.
> >
> > @Mattias: Another good example of something that would belong in per-
> > thread constructors, as my suggested feature creep in [1].
> >
> > [1]:
> >
> http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smarts
> > erver.smartshare.dk/
> 
> I just realized that this initialization is per-lcore (not per thread),
> so you can use rte_lcore_callback_register() to register a per-lcore
> initialization function, and move rte_pmu_enable_group(lcore_id) there.

Sorry, Tomasz!

You can't use rte_lcore_callback_register()... it doesn't provide per-lcore thread constructors/destructors the way I thought. :-(
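
For context, the pattern under discussion (per-thread setup done lazily on the first read, guarded by a branch hint so that later calls stay cheap) can be sketched in plain C as follows. This is only an illustration under assumptions: demo_read(), demo_enable() and the thread-local variables are made-up names, and __builtin_expect stands in for DPDK's unlikely().

```c
#include <stdbool.h>
#include <stdint.h>

#define demo_unlikely(x) __builtin_expect(!!(x), 0)

/* Per-thread state; each thread pays the setup cost once. */
static __thread bool demo_enabled;
static __thread unsigned int demo_setup_calls;

/* Slow path: this is where events would be opened and mmapped. */
static int
demo_enable(void)
{
	demo_setup_calls++;
	return 0;
}

static uint64_t
demo_read(void)
{
	if (demo_unlikely(!demo_enabled)) {
		if (demo_enable() != 0)
			return 0;
		demo_enabled = true;
	}

	/* Fast path: placeholder for the actual counter read. */
	return 1;
}
```

Only the first call on a given thread takes the slow path; every later call sees the flag set and falls straight through to the read.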



^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-15  8:46         ` Mattias Rönnblom
@ 2023-01-04 15:47           ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-04 15:47 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: thomas, Jerin Jacob Kollanukkaran, mb, zhoumin

> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Thursday, December 15, 2022 9:46 AM
> To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; mb@smartsharesystems.com;
> zhoumin@loongson.cn
> Subject: [EXT] Re: [PATCH v4 1/4] eal: add generic support for reading PMU events
> 
> External Email
> 
> ----------------------------------------------------------------------
> On 2022-12-13 11:43, Tomasz Duszynski wrote:
> > Add support for programming PMU counters and reading their values in
> > runtime bypassing kernel completely.
> >
> > This is especially useful in cases where CPU cores are isolated
> > (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> > standard perf utility without sacrificing latency and performance.
> >
> > Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > ---
> >   app/test/meson.build                  |   1 +
> >   app/test/test_pmu.c                   |  41 +++
> >   doc/guides/prog_guide/profile_app.rst |   8 +
> >   lib/eal/common/meson.build            |   3 +
> >   lib/eal/common/pmu_private.h          |  41 +++
> >   lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
> >   lib/eal/include/meson.build           |   1 +
> >   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
> >   lib/eal/linux/eal.c                   |   4 +
> >   lib/eal/version.map                   |   6 +
> >   10 files changed, 765 insertions(+)
> >   create mode 100644 app/test/test_pmu.c
> >   create mode 100644 lib/eal/common/pmu_private.h
> >   create mode 100644 lib/eal/common/rte_pmu.c
> >   create mode 100644 lib/eal/include/rte_pmu.h
> >
> > diff --git a/app/test/meson.build b/app/test/meson.build
> > index f34d19e3c3..93b3300309 100644
> > --- a/app/test/meson.build
> > +++ b/app/test/meson.build
> > @@ -143,6 +143,7 @@ test_sources = files(
> >           'test_timer_racecond.c',
> >           'test_timer_secondary.c',
> >           'test_ticketlock.c',
> > +        'test_pmu.c',
> >           'test_trace.c',
> >           'test_trace_register.c',
> >           'test_trace_perf.c',
> > diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> > new file mode 100644
> > index 0000000000..fd331af9ee
> > --- /dev/null
> > +++ b/app/test/test_pmu.c
> > @@ -0,0 +1,41 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(C) 2022 Marvell International Ltd.
> > + */
> > +
> > +#include <rte_pmu.h>
> > +
> > +#include "test.h"
> > +
> > +static int
> > +test_pmu_read(void)
> > +{
> > +	uint64_t val = 0;
> > +	int tries = 10;
> > +	int event = -1;
> > +
> > +	while (tries--)
> > +		val += rte_pmu_read(event);
> > +
> > +	if (val == 0)
> > +		return TEST_FAILED;
> > +
> > +	return TEST_SUCCESS;
> > +}
> > +
> > +static struct unit_test_suite pmu_tests = {
> > +	.suite_name = "pmu autotest",
> > +	.setup = NULL,
> > +	.teardown = NULL,
> > +	.unit_test_cases = {
> > +		TEST_CASE(test_pmu_read),
> > +		TEST_CASES_END()
> > +	}
> > +};
> > +
> > +static int
> > +test_pmu(void)
> > +{
> > +	return unit_test_suite_runner(&pmu_tests);
> > +}
> > +
> > +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> > diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> > index 14292d4c25..a8b501fe0c 100644
> > --- a/doc/guides/prog_guide/profile_app.rst
> > +++ b/doc/guides/prog_guide/profile_app.rst
> > @@ -7,6 +7,14 @@ Profile Your Application
> >   The following sections describe methods of profiling DPDK applications on
> >   different architectures.
> >
> > +Performance counter based profiling
> > +-----------------------------------
> > +
> > +Majority of architectures support some sort hardware measurement unit
> > +which provides a set of programmable counters that monitor specific
> > +events. There are different tools which can gather that information,
> > +perf being an example here. Though in some scenarios, eg. when CPU
> > +cores are isolated (nohz_full) and run dedicated tasks, using perf is
> > +less than ideal. In such cases one can read specific events directly
> > +from application via ``rte_pmu_read()``.
> >
> >   Profiling on x86
> >   ----------------
> > diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> > index 917758cc65..d6d05b56f3 100644
> > --- a/lib/eal/common/meson.build
> > +++ b/lib/eal/common/meson.build
> > @@ -38,6 +38,9 @@ sources += files(
> >           'rte_service.c',
> >           'rte_version.c',
> >   )
> > +if is_linux
> > +    sources += files('rte_pmu.c')
> > +endif
> >   if is_linux or is_windows
> >       sources += files('eal_common_dynmem.c')
> >   endif
> > diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> > new file mode 100644
> > index 0000000000..cade4245e6
> > --- /dev/null
> > +++ b/lib/eal/common/pmu_private.h
> > @@ -0,0 +1,41 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2022 Marvell
> > + */
> > +
> > +#ifndef _PMU_PRIVATE_H_
> > +#define _PMU_PRIVATE_H_
> > +
> > +/**
> > + * Architecture specific PMU init callback.
> > + *
> > + * @return
> > + *   0 in case of success, negative value otherwise.
> > + */
> > +int
> > +pmu_arch_init(void);
> > +
> > +/**
> > + * Architecture specific PMU cleanup callback.
> > + */
> > +void
> > +pmu_arch_fini(void);
> > +
> > +/**
> > + * Apply architecture specific settings to config before passing it to syscall.
> > + */
> > +void
> > +pmu_arch_fixup_config(uint64_t config[3]);
> > +
> > +/**
> > + * Initialize PMU tracing internals.
> > + */
> > +void
> > +eal_pmu_init(void);
> > +
> > +/**
> > + * Cleanup PMU internals.
> > + */
> > +void
> > +eal_pmu_fini(void);
> > +
> > +#endif /* _PMU_PRIVATE_H_ */
> > diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> > new file mode 100644
> > index 0000000000..049fe19fe3
> > --- /dev/null
> > +++ b/lib/eal/common/rte_pmu.c
> > @@ -0,0 +1,456 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(C) 2022 Marvell International Ltd.
> > + */
> > +
> > +#include <ctype.h>
> > +#include <dirent.h>
> > +#include <errno.h>
> > +#include <regex.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <sys/queue.h>
> > +#include <sys/syscall.h>
> > +#include <unistd.h>
> > +
> > +#include <rte_eal_paging.h>
> > +#include <rte_pmu.h>
> > +#include <rte_tailq.h>
> > +
> > +#include "pmu_private.h"
> > +
> > +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> > +
> > +#ifndef GENMASK_ULL
> > +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> > +#endif
> > +
> > +#ifndef FIELD_PREP
> > +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> > +#endif
> > +
> > +struct rte_pmu *rte_pmu;
> > +
> > +/*
> > + * Following __rte_weak functions provide default no-op. Architectures should override them if
> > + * necessary.
> > + */
> > +
> > +int
> > +__rte_weak pmu_arch_init(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +void
> > +__rte_weak pmu_arch_fini(void)
> > +{
> > +}
> > +
> > +void
> > +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> > +{
> > +	RTE_SET_USED(config);
> > +}
> > +
> > +static int
> > +get_term_format(const char *name, int *num, uint64_t *mask)
> > +{
> > +	char *config = NULL;
> > +	char path[PATH_MAX];
> > +	int high, low, ret;
> > +	FILE *fp;
> > +
> > +	/* quiesce -Wmaybe-uninitialized warning */
> > +	*num = 0;
> > +	*mask = 0;
> > +
> > +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
> 
> This code might crash in case a long name is supplied, which is maybe not what you want. A
> truncate and a "file not found" is probably better.
> I believe there is a snprintf lookalike with these properties in DPDK.
> 

In the scenario where 'path' cannot accommodate the whole string, which is pretty unlikely
given that sysfs files have sane names, snprintf will truncate the output and still append
a terminating NUL. Hence fopen will fail. Not sure how this may go wrong.
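
For reference, a minimal sketch of how truncation could be detected explicitly instead of
relying on fopen failing. The helper name build_format_path() is hypothetical and not part
of the patch; it only relies on the standard snprintf return value.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): build
 * "<base>/<pmu>/format/<term>" into buf.
 * Returns 0 on success, -1 if the result did not fit or formatting failed.
 */
static int
build_format_path(char *buf, size_t len, const char *base, const char *pmu, const char *term)
{
	int ret = snprintf(buf, len, "%s/%s/format/%s", base, pmu, term);

	/*
	 * snprintf returns the number of characters it wanted to write,
	 * excluding the terminating NUL. A value >= len means the output
	 * was truncated; buf is still NUL terminated in that case.
	 */
	if (ret < 0 || (size_t)ret >= len)
		return -1;

	return 0;
}
```

With such a check the caller could return early on truncation rather than passing a
shortened path to fopen.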

> > +	fp = fopen(path, "r");
> > +	if (!fp)
> > +		return -errno;
> > +
> > +	errno = 0;
> > +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> > +	if (ret < 2) {
> > +		ret = -ENODATA;
> > +		goto out;
> > +	}
> > +	if (errno) {
> > +		ret = -errno;
> > +		goto out;
> > +	}
> > +
> > +	if (ret == 2)
> > +		high = low;
> > +
> > +	*mask = GENMASK_ULL(high, low);
> > +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> > +	*num = config[strlen(config) - 1];
> > +	*num = isdigit(*num) ? *num - '0' : 0;
> > +
> > +	ret = 0;
> > +out:
> > +	free(config);
> > +	fclose(fp);
> > +
> > +	return ret;
> > +}
> > +
> > +static int
> > +parse_event(char *buf, uint64_t config[3])
> > +{
> > +	char *token, *term;
> > +	int num, ret, val;
> > +	uint64_t mask;
> > +
> > +	config[0] = config[1] = config[2] = 0;
> > +
> > +	token = strtok(buf, ",");
> > +	while (token) {
> > +		errno = 0;
> > +		/* <term>=<value> */
> > +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> > +		if (ret < 1)
> > +			return -ENODATA;
> > +		if (errno)
> > +			return -errno;
> > +		if (ret == 1)
> > +			val = 1;
> > +
> > +		ret = get_term_format(term, &num, &mask);
> > +		free(term);
> > +		if (ret)
> > +			return ret;
> > +
> > +		config[num] |= FIELD_PREP(mask, val);
> > +		token = strtok(NULL, ",");
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +get_event_config(const char *name, uint64_t config[3])
> > +{
> > +	char path[PATH_MAX], buf[BUFSIZ];
> > +	FILE *fp;
> > +	int ret;
> > +
> > +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name,
> > +		 name);
> > +	fp = fopen(path, "r");
> > +	if (!fp)
> > +		return -errno;
> > +
> > +	ret = fread(buf, 1, sizeof(buf), fp);
> > +	if (ret == 0) {
> > +		fclose(fp);
> > +
> > +		return -EINVAL;
> > +	}
> > +	fclose(fp);
> > +	buf[ret] = '\0';
> > +
> > +	return parse_event(buf, config);
> > +}
> > +
> > +static int
> > +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> > +{
> > +	struct perf_event_attr attr = {
> > +		.size = sizeof(struct perf_event_attr),
> > +		.type = PERF_TYPE_RAW,
> > +		.exclude_kernel = 1,
> > +		.exclude_hv = 1,
> > +		.disabled = 1,
> > +	};
> > +
> > +	pmu_arch_fixup_config(config);
> > +
> > +	attr.config = config[0];
> > +	attr.config1 = config[1];
> > +	attr.config2 = config[2];
> > +
> > +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> > +		       group_fd, 0);
> > +}
> > +
> > +static int
> > +open_events(int lcore_id)
> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	struct rte_pmu_event *event;
> > +	uint64_t config[3];
> > +	int num = 0, ret;
> > +
> > +	/* group leader gets created first, with fd = -1 */
> > +	group->fds[0] = -1;
> > +
> > +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> > +		ret = get_event_config(event->name, config);
> > +		if (ret) {
> > +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> > +			continue;
> > +		}
> > +
> > +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> > +		if (ret == -1) {
> > +			if (errno == EOPNOTSUPP)
> > +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> > +
> > +			ret = -errno;
> > +			goto out;
> > +		}
> > +
> > +		group->fds[event->index] = ret;
> > +		num++;
> > +	}
> > +
> > +	return 0;
> > +out:
> > +	for (--num; num >= 0; num--) {
> > +		close(group->fds[num]);
> > +		group->fds[num] = -1;
> > +	}
> > +
> > +
> > +	return ret;
> > +}
> > +
> > +static int
> > +mmap_events(int lcore_id)
> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	void *addr;
> > +	int ret, i;
> > +
> > +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> > +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> > +		if (addr == MAP_FAILED) {
> > +			ret = -errno;
> > +			goto out;
> > +		}
> > +
> > +		group->mmap_pages[i] = addr;
> > +	}
> > +
> > +	return 0;
> > +out:
> > +	for (; i; i--) {
> > +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> > +		group->mmap_pages[i - 1] = NULL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static void
> > +cleanup_events(int lcore_id)
> 
> lcore_id is an unsigned int.
> 

True, unsigned seems to be more common. 

> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	int i;
> > +
> > +	if (!group->fds)
> > +		return;
> 
> group->fds == NULL
> 
> This coding style violation appears throughout the patch set.
> 

Good point. 

> > +
> > +	if (group->fds[0] != -1)
> > +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> > +
> > +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> > +		if (group->mmap_pages[i]) {
> > +			munmap(group->mmap_pages[i], rte_mem_page_size());
> > +			group->mmap_pages[i] = NULL;
> > +		}
> > +
> > +		if (group->fds[i] != -1) {
> > +			close(group->fds[i]);
> > +			group->fds[i] = -1;
> > +		}
> > +	}
> > +
> > +	free(group->mmap_pages);
> > +	free(group->fds);
> > +
> > +	group->mmap_pages = NULL;
> > +	group->fds = NULL;
> > +	group->enabled = false;
> > +}
> > +
> > +int __rte_noinline
> > +rte_pmu_enable_group(int lcore_id)
> 
> unsigned
> 
> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	int ret;
> > +
> > +	if (rte_pmu->num_group_events == 0) {
> > +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> > +
> > +		return 0;
> > +	}
> > +
> > +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
> > +	if (!group->fds) {
> > +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> > +
> > +		return -ENOMEM;
> > +	}
> > +
> > +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
> > +	if (!group->mmap_pages) {
> > +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> > +
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	ret = open_events(lcore_id);
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> > +		goto out;
> > +	}
> > +
> > +	ret = mmap_events(lcore_id);
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> > +		goto out;
> > +	}
> > +
> > +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> > +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n",
> > +			lcore_id);
> > +
> > +		ret = -errno;
> > +		goto out;
> > +	}
> > +
> > +	return 0;
> > +
> > +out:
> > +	cleanup_events(lcore_id);
> > +
> > +	return ret;
> > +}
> > +
> > +static int
> > +scan_pmus(void)
> > +{
> > +	char path[PATH_MAX];
> > +	struct dirent *dent;
> > +	const char *name;
> > +	DIR *dirp;
> > +
> > +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> > +	if (!dirp)
> > +		return -errno;
> > +
> > +	while ((dent = readdir(dirp))) {
> > +		name = dent->d_name;
> > +		if (name[0] == '.')
> > +			continue;
> > +
> > +		/* sysfs entry should either contain cpus or be a cpu */
> > +		if (!strcmp(name, "cpu"))
> > +			break;
> > +
> > +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> > +		if (access(path, F_OK) == 0)
> > +			break;
> > +	}
> > +
> > +	closedir(dirp);
> > +
> > +	if (dent) {
> > +		rte_pmu->name = strdup(name);
> > +		if (!rte_pmu->name)
> > +			return -ENOMEM;
> > +	}
> > +
> > +	return rte_pmu->name ? 0 : -ENODEV;
> > +}
> > +
> > +int
> > +rte_pmu_add_event(const char *name)
> > +{
> > +	struct rte_pmu_event *event;
> > +	char path[PATH_MAX];
> > +
> > +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name,
> > +		 name);
> > +	if (access(path, R_OK))
> > +		return -ENODEV;
> > +
> > +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> > +		if (!strcmp(event->name, name))
> > +			return event->index;
> > +		continue;
> > +	}
> > +
> > +	event = calloc(1, sizeof(*event));
> > +	if (!event)
> > +		return -ENOMEM;
> > +
> > +	event->name = strdup(name);
> > +	if (!event->name) {
> > +		free(event);
> > +
> > +		return -ENOMEM;
> > +	}
> > +
> > +	event->index = rte_pmu->num_group_events++;
> > +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> > +
> > +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name,
> > +		event->index);
> > +
> > +	return event->index;
> > +}
> > +
> > +void
> > +eal_pmu_init(void)
> > +{
> > +	int ret;
> > +
> > +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> > +	if (!rte_pmu) {
> > +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> > +
> > +		return;
> > +	}
> > +
> > +	TAILQ_INIT(&rte_pmu->event_list);
> > +
> > +	ret = scan_pmus();
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> > +		goto out;
> > +	}
> > +
> > +	ret = pmu_arch_init();
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> > +		goto out;
> > +	}
> > +
> > +	return;
> > +out:
> > +	free(rte_pmu->name);
> > +	free(rte_pmu);
> > +}
> > +
> > +void
> > +eal_pmu_fini(void)
> > +{
> > +	struct rte_pmu_event *event, *tmp;
> > +	int lcore_id;
> > +
> > +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> > +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> > +		free(event->name);
> > +		free(event);
> > +	}
> > +
> > +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> > +		cleanup_events(lcore_id);
> 
> Why is the main lcore left out?
> 

The main lcore was omitted because it rarely does any heavy lifting, so the
usefulness of reading counters there is questionable. It can be added for
completeness, though.

Do you have any specific use case in mind?

> > +
> > +	pmu_arch_fini();
> > +	free(rte_pmu->name);
> > +	free(rte_pmu);
> > +}
> > diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> > index cfcd40aaed..3bf830adee 100644
> > --- a/lib/eal/include/meson.build
> > +++ b/lib/eal/include/meson.build
> > @@ -36,6 +36,7 @@ headers += files(
> >           'rte_pci_dev_features.h',
> >           'rte_per_lcore.h',
> >           'rte_pflock.h',
> > +        'rte_pmu.h',
> >           'rte_random.h',
> >           'rte_reciprocal.h',
> >           'rte_seqcount.h',
> > diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> > new file mode 100644
> > index 0000000000..e4b4f6b052
> > --- /dev/null
> > +++ b/lib/eal/include/rte_pmu.h
> > @@ -0,0 +1,204 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2022 Marvell
> > + */
> > +
> > +#ifndef _RTE_PMU_H_
> > +#define _RTE_PMU_H_
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#include <rte_common.h>
> > +#include <rte_compat.h>
> > +
> > +#ifdef RTE_EXEC_ENV_LINUX
> > +
> > +#include <linux/perf_event.h>
> > +
> > +#include <rte_atomic.h>
> > +#include <rte_branch_prediction.h>
> > +#include <rte_lcore.h>
> > +#include <rte_log.h>
> > +
> > +/**
> > + * @file
> > + *
> > + * PMU event tracing operations
> > + *
> > + * This file defines generic API and types necessary to setup PMU and
> > + * read selected counters in runtime.
> > + */
> > +
> > +/**
> > + * A structure describing a group of events.
> > + */
> > +struct rte_pmu_event_group {
> > +	int *fds; /**< array of event descriptors */
> > +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> > +	bool enabled; /**< true if group was enabled on particular lcore */
> > +};
> > +
> > +/**
> > + * A structure describing an event.
> > + */
> > +struct rte_pmu_event {
> > +	char *name; /** name of an event */
> > +	int index; /** event index into fds/mmap_pages */
> > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> > +};
> > +
> > +/**
> > + * A PMU state container.
> > + */
> > +struct rte_pmu {
> > +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> > +	int num_group_events; /**< number of events in a group */
> > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> > +};
> > +
> > +/** Pointer to the PMU state container */
> > +extern struct rte_pmu *rte_pmu;
> > +
> > +/** Each architecture supporting PMU needs to provide its own version */
> > +#ifndef rte_pmu_pmc_read
> > +#define rte_pmu_pmc_read(index) ({ 0; })
> > +#endif
> > +
> > +/**
> > + * @internal
> > + *
> > + * Read PMU counter.
> > + *
> > + * @param pc
> > + *   Pointer to the mmapped user page.
> > + * @return
> > + *   Counter value read from hardware.
> > + */
> > +__rte_internal
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> > +{
> > +	uint64_t offset, width, pmc = 0;
> > +	uint32_t seq, index;
> > +	int tries = 100;
> > +
> > +	for (;;) {
> > +		seq = pc->lock;
> > +		rte_compiler_barrier();
> > +		index = pc->index;
> > +		offset = pc->offset;
> > +		width = pc->pmc_width;
> > +
> > +		if (likely(pc->cap_user_rdpmc && index)) {
> > +			pmc = rte_pmu_pmc_read(index - 1);
> > +			pmc <<= 64 - width;
> > +			pmc >>= 64 - width;
> > +		}
> > +
> > +		rte_compiler_barrier();
> > +
> > +		if (likely(pc->lock == seq))
> > +			return pmc + offset;
> > +
> > +		if (--tries == 0) {
> > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> > +			break;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * @internal
> > + *
> > + * Enable group of events for a given lcore.
> > + *
> > + * @param lcore_id
> > + *   The identifier of the lcore.
> > + * @return
> > + *   0 in case of success, negative value otherwise.
> > + */
> > +__rte_internal
> > +int
> > +rte_pmu_enable_group(int lcore_id);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Add event to the group of enabled events.
> > + *
> > + * @param name
> > + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> > + * @return
> > + *   Event index in case of success, negative value otherwise.
> > + */
> > +__rte_experimental
> > +int
> > +rte_pmu_add_event(const char *name);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Read hardware counter configured to count occurrences of an event.
> > + *
> > + * @param index
> > + *   Index of an event to be read.
> > + * @return
> > + *   Event value read from register. In case of errors or lack of support
> > + *   0 is returned. In other words, stream of zeros in a trace file
> > + *   indicates problem with reading particular PMU event register.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read(int index)
> > +{
> > +	int lcore_id = rte_lcore_id();
> > +	struct rte_pmu_event_group *group;
> > +	int ret;
> > +
> > +	if (!rte_pmu)
> > +		return 0;
> > +
> > +	group = &rte_pmu->group[lcore_id];
> > +	if (!group->enabled) {
> > +		ret = rte_pmu_enable_group(lcore_id);
> > +		if (ret)
> > +			return 0;
> > +
> > +		group->enabled = true;
> > +	}
> > +
> > +	if (index < 0 || index >= rte_pmu->num_group_events)
> > +		return 0;
> > +
> > +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> > +}
> > +
> > +#else /* !RTE_EXEC_ENV_LINUX */
> > +
> > +__rte_experimental
> > +static int __rte_unused
> > +rte_pmu_add_event(__rte_unused const char *name)
> > +{
> > +	return -1;
> > +}
> > +
> > +__rte_experimental
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read(__rte_unused int index)
> > +{
> > +	return 0;
> > +}
> > +
> > +#endif /* RTE_EXEC_ENV_LINUX */
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif /* _RTE_PMU_H_ */
> > diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> > index 8c118d0d9f..751a13b597 100644
> > --- a/lib/eal/linux/eal.c
> > +++ b/lib/eal/linux/eal.c
> > @@ -53,6 +53,7 @@
> >   #include "eal_options.h"
> >   #include "eal_vfio.h"
> >   #include "hotplug_mp.h"
> > +#include "pmu_private.h"
> >
> >   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
> >
> > @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
> >   		return -1;
> >   	}
> >
> > +	eal_pmu_init();
> > +
> >   	if (rte_eal_tailqs_init() < 0) {
> >   		rte_eal_init_alert("Cannot init tail queues for objects");
> >   		rte_errno = EFAULT;
> > @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
> >   	eal_bus_cleanup();
> >   	rte_trace_save();
> >   	eal_trace_fini();
> > +	eal_pmu_fini();
> >   	/* after this point, any DPDK pointers will become dangling */
> >   	rte_eal_memory_detach();
> >   	eal_mp_dev_hotplug_cleanup();
> > diff --git a/lib/eal/version.map b/lib/eal/version.map
> > index 7ad12a7dc9..9225f46f67 100644
> > --- a/lib/eal/version.map
> > +++ b/lib/eal/version.map
> > @@ -440,6 +440,11 @@ EXPERIMENTAL {
> >   	rte_thread_detach;
> >   	rte_thread_equal;
> >   	rte_thread_join;
> > +
> > +	# added in 23.03
> > +	rte_pmu; # WINDOWS_NO_EXPORT
> > +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> > +	rte_pmu_read; # WINDOWS_NO_EXPORT
> >   };
> >
> >   INTERNAL {
> > @@ -483,4 +488,5 @@ INTERNAL {
> >   	rte_mem_map;
> >   	rte_mem_page_size;
> >   	rte_mem_unmap;
> > +	rte_pmu_enable_group;
> >   };


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-14 10:41             ` Morten Brørup
  2022-12-15  8:22               ` Morten Brørup
@ 2023-01-05 21:14               ` Tomasz Duszynski
  2023-01-05 22:07                 ` Morten Brørup
  1 sibling, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-05 21:14 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

Hi Morten, 

A few comments inline. 

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, December 14, 2022 11:41 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; zhoumin@loongson.cn;
>mattias.ronnblom@ericsson.com
>Subject: [EXT] RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
>
>+CC: Mattias, see my comment below about per-thread constructor for this
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Wednesday, 14 December 2022 10.39
>>
>> Hello Morten,
>>
>> Thanks for review. Answers inline.
>>
>> [...]
>>
>> > > +/**
>> > > + * @file
>> > > + *
>> > > + * PMU event tracing operations
>> > > + *
>> > > + * This file defines generic API and types necessary to setup PMU
>> and
>> > > + * read selected counters in runtime.
>> > > + */
>> > > +
>> > > +/**
>> > > + * A structure describing a group of events.
>> > > + */
>> > > +struct rte_pmu_event_group {
>> > > +	int *fds; /**< array of event descriptors */
>> > > +	void **mmap_pages; /**< array of pointers to mmapped
>> > > perf_event_attr structures */
>> >
>> > There seems to be a lot of indirection involved here. Why are these
>> arrays not statically sized,
>> > instead of dynamically allocated?
>> >
>>
>> Different architectures/pmus impose limits on number of simultaneously
>> enabled counters. So in order to relieve the pain of thinking about it and
>> adding macros for each and every arch I decided to allocate the number
>> user wants dynamically. Also assumption holds that user knows about
>> tradeoffs of using too many counters hence will not enable too many
>> events at once.
>
>The DPDK convention is to use fixed size arrays (with a maximum size, e.g. RTE_MAX_ETHPORTS) in the
>fast path, for performance reasons.
>
>Please use fixed size arrays instead of dynamically allocated arrays.
>

I do agree that from a maintenance angle fixed arrays are much more convenient,
but once compiler optimization kicks in, the claim that they perform better does
not necessarily hold true anymore.

For example, in this case performance dropped by ~0.3% with fixed arrays, which is
insignificant imo. So, given the simpler code, the next patchset will use fixed arrays.

>>
>> > Also, what is the reason for hiding the type struct
>> perf_event_mmap_page **mmap_pages opaque by
>> > using void **mmap_pages instead?
>>
>> I think, that part doing mmap/munmap was written first hence void **
>> was chosen in the first place.
>
>Please update it, so the actual type is reflected here.
>
>>
>> >
>> > > +	bool enabled; /**< true if group was enabled on particular lcore
>> > > */
>> > > +};
>> > > +
>> > > +/**
>> > > + * A structure describing an event.
>> > > + */
>> > > +struct rte_pmu_event {
>> > > +	char *name; /** name of an event */
>> > > +	int index; /** event index into fds/mmap_pages */
>> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
>> > > +
>> > > +/**
>> > > + * A PMU state container.
>> > > + */
>> > > +struct rte_pmu {
>> > > +	char *name; /** name of core PMU listed under
>> > > /sys/bus/event_source/devices */
>> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
>> > > event group data */
>> > > +	int num_group_events; /**< number of events in a group */
>> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
>> > > events */
>
>The event_list is used in slow path only, so it can remain a list - i.e. no change requested here.
>:-)
>
>> > > +};
>> > > +
>> > > +/** Pointer to the PMU state container */ extern struct rte_pmu
>> > > +*rte_pmu;
>> >
>> > Again, why not just extern struct rte_pmu, instead of dynamic
>> allocation?
>> >
>>
>> No strong opinions here since this is a matter of personal preference.
>> Can be removed
>> in the next version.
>
>Yes, please.
>
>>
>> > > +
>> > > +/** Each architecture supporting PMU needs to provide its own
>> version
>> > > */
>> > > +#ifndef rte_pmu_pmc_read
>> > > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
>> > > +
>> > > +/**
>> > > + * @internal
>> > > + *
>> > > + * Read PMU counter.
>> > > + *
>> > > + * @param pc
>> > > + *   Pointer to the mmapped user page.
>> > > + * @return
>> > > + *   Counter value read from hardware.
>> > > + */
>> > > +__rte_internal
>> > > +static __rte_always_inline uint64_t rte_pmu_read_userpage(struct
>> > > +perf_event_mmap_page *pc) {
>> > > +	uint64_t offset, width, pmc = 0;
>> > > +	uint32_t seq, index;
>> > > +	int tries = 100;
>> > > +
>> > > +	for (;;) {
>
>As a matter of personal preference, I would write this loop differently:
>
>+ for (tries = 100; tries != 0; tries--) {
>
>> > > +		seq = pc->lock;
>> > > +		rte_compiler_barrier();
>> > > +		index = pc->index;
>> > > +		offset = pc->offset;
>> > > +		width = pc->pmc_width;
>> > > +
>> > > +		if (likely(pc->cap_user_rdpmc && index)) {
>
>Why "&& index"? The way I read [man perf_event_open], index 0 is perfectly valid.
>

Valid indexes start at 1. An index of 0 means that the hw counter is either stopped or not active.
Maybe this is not initially clear from the man page, but there is an example later on showing how
to get the actual number.
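
To illustrate the index handling, here is a minimal, self-contained sketch of the read
loop. It is not the patch code: struct mock_user_page and mock_pmc_read() are hypothetical
stand-ins for struct perf_event_mmap_page and the arch-specific rte_pmu_pmc_read(), and
atomic_signal_fence() is used as a portable compiler barrier in place of
rte_compiler_barrier().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the perf_event_mmap_page fields used below. */
struct mock_user_page {
	uint32_t lock;          /* sequence count, changes while the kernel updates the page */
	uint32_t index;         /* 0: counter stopped/inactive, otherwise hw index + 1 */
	int64_t offset;         /* value to add to the raw hardware count */
	uint16_t pmc_width;     /* width of the hardware counter in bits */
	uint8_t cap_user_rdpmc; /* non-zero if user space may read the counter */
};

/* Hypothetical stub replacing the architecture-specific counter read. */
static uint64_t mock_raw_counter;

static uint64_t
mock_pmc_read(uint32_t hw_index)
{
	(void)hw_index;
	return mock_raw_counter;
}

static uint64_t
mock_read_userpage(const volatile struct mock_user_page *pc)
{
	uint64_t offset, width, pmc;
	uint32_t seq, index;
	int tries;

	for (tries = 100; tries != 0; tries--) {
		pmc = 0;
		seq = pc->lock;
		atomic_signal_fence(memory_order_seq_cst);
		index = pc->index;
		offset = (uint64_t)pc->offset;
		width = pc->pmc_width;

		/* index 0 is not a counter: only read hardware for index >= 1 */
		if (pc->cap_user_rdpmc && index != 0 && width != 0 && width <= 64) {
			pmc = mock_pmc_read(index - 1);
			/* keep only the low 'width' bits of the raw value */
			pmc <<= 64 - width;
			pmc >>= 64 - width;
		}

		atomic_signal_fence(memory_order_seq_cst);

		/* page unchanged while reading: the snapshot is consistent */
		if (pc->lock == seq)
			return pmc + offset;
	}

	return 0;
}
```

When index is 0 the hardware is not touched and only the offset maintained by the kernel
is returned.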

>[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
>
>> > > +			pmc = rte_pmu_pmc_read(index - 1);
>> > > +			pmc <<= 64 - width;
>> > > +			pmc >>= 64 - width;
>> > > +		}
>> > > +
>> > > +		rte_compiler_barrier();
>> > > +
>> > > +		if (likely(pc->lock == seq))
>> > > +			return pmc + offset;
>> > > +
>> > > +		if (--tries == 0) {
>> > > +			RTE_LOG(DEBUG, EAL, "failed to get
>> > > perf_event_mmap_page lock\n");
>> > > +			break;
>> > > +		}
>
>- Remove the 4 above lines of code, and move the debug log message to the end of the function
>instead.
>
>> > > +	}
>
>+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
>
>> > > +
>> > > +	return 0;
>> > > +}
>> > > +
>> > > +/**
>> > > + * @internal
>> > > + *
>> > > + * Enable group of events for a given lcore.
>> > > + *
>> > > + * @param lcore_id
>> > > + *   The identifier of the lcore.
>> > > + * @return
>> > > + *   0 in case of success, negative value otherwise.
>> > > + */
>> > > +__rte_internal
>> > > +int
>> > > +rte_pmu_enable_group(int lcore_id);
>> > > +
>> > > +/**
>> > > + * @warning
>> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> > > + *
>> > > + * Add event to the group of enabled events.
>> > > + *
>> > > + * @param name
>> > > + *   Name of an event listed under
>> > > /sys/bus/event_source/devices/pmu/events.
>> > > + * @return
>> > > + *   Event index in case of success, negative value otherwise.
>> > > + */
>> > > +__rte_experimental
>> > > +int
>> > > +rte_pmu_add_event(const char *name);
>> > > +
>> > > +/**
>> > > + * @warning
>> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> > > + *
>> > > + * Read hardware counter configured to count occurrences of an
>> event.
>> > > + *
>> > > + * @param index
>> > > + *   Index of an event to be read.
>> > > + * @return
>> > > + *   Event value read from register. In case of errors or lack of
>> > > support
>> > > + *   0 is returned. In other words, stream of zeros in a trace
>> file
>> > > + *   indicates problem with reading particular PMU event register.
>> > > + */
>> > > +__rte_experimental
>> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
>
>The index type can be changed from int to uint32_t. This also eliminates the "(index < 0" part of
>the comparison further below in this function.
>

That's true. 
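
A small sketch of the difference, assuming a hypothetical num_group_events bound standing
in for the field in struct rte_pmu. With an unsigned index a negative argument wraps to a
large value, so one comparison covers both ends of the range.

```c
#include <stdint.h>

/* Hypothetical bound, standing in for rte_pmu->num_group_events. */
static uint32_t num_group_events = 4;

/* Signed index: both ends of the range have to be checked. */
static int
index_valid_signed(int index)
{
	return !(index < 0 || index >= (int)num_group_events);
}

/*
 * Unsigned index: a negative value converted to uint32_t becomes a large
 * number, so the upper bound check alone rejects it.
 */
static int
index_valid_unsigned(uint32_t index)
{
	return index < num_group_events;
}
```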

>> > > +{
>> > > +	int lcore_id = rte_lcore_id();
>> > > +	struct rte_pmu_event_group *group;
>> > > +	int ret;
>> > > +
>> > > +	if (!rte_pmu)
>> > > +		return 0;
>> > > +
>> > > +	group = &rte_pmu->group[lcore_id];
>> > > +	if (!group->enabled) {
>
>Optimized: if (unlikely(!group->enabled)) {
>

The compiler will optimize the branch correctly by itself. An extra hint is not required.
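
For readers unfamiliar with the hint under discussion, a minimal sketch of what it looks
like on a lazy-enable path. The unlikely() definition mirrors the usual GCC/Clang
__builtin_expect idiom; struct demo_group and the demo_* functions are hypothetical and
only model the "enable on first read" flow.

```c
#include <stdbool.h>

/* Usual GCC/Clang idiom; guarded in case a header already defines it. */
#ifndef unlikely
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif

/* Hypothetical per-lcore state, modelling only the 'enabled' flag. */
struct demo_group {
	bool enabled;
	int enable_calls; /* counts how often the slow path ran */
};

/* Hypothetical slow path, stands in for the one-time group setup. */
static int
demo_enable_group(struct demo_group *g)
{
	g->enable_calls++;
	return 0;
}

static int
demo_read(struct demo_group *g)
{
	/* taken once per lcore; the hint marks it as the cold branch */
	if (unlikely(!g->enabled)) {
		if (demo_enable_group(g) != 0)
			return 0;
		g->enabled = true;
	}

	return 1;
}
```

Whether the hint changes the generated code depends on the compiler and flags, so it is
worth measuring before adding it.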

>> > > +		ret = rte_pmu_enable_group(lcore_id);
>> > > +		if (ret)
>> > > +			return 0;
>> > > +
>> > > +		group->enabled = true;
>> > > +	}
>> >
>> > Why is the group not enabled in the setup function,
>> rte_pmu_add_event(), instead of here, in the
>> > hot path?
>> >
>>
>> When this is executed for the very first time then cpu will have
>> obviously more work to do but afterwards setup path is not taken hence
>> much less cpu cycles are required.
>>
>> Setup is executed by main lcore solely, before lcores are executed
>> hence some info passed to SYS_perf_event_open ioctl() is missing, pid
>> (via rte_gettid()) being an example here.
>
>OK. Thank you for the explanation. Since impossible at setup, it has to be done at runtime.
>
>@Mattias: Another good example of something that would belong in per-thread constructors, as my
>suggested feature creep in [1].
>
>[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
>
>>
>> > > +
>> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
>
>Optimized: if (unlikely(index >= rte_pmu.num_group_events))
>
>> > > +		return 0;
>> > > +
>> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
>> > > *)group->mmap_pages[index]);
>> >
>> > Using fixed size arrays instead of multiple indirections via
>> > pointers
>> is faster. It could be:
>> >
>> > return rte_pmu_read_userpage((struct perf_event_mmap_page
>> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
>> >
>> > With or without suggested performance improvements...
>> >
>> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
>>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-05 21:14               ` Tomasz Duszynski
@ 2023-01-05 22:07                 ` Morten Brørup
  2023-01-08 15:41                   ` Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2023-01-05 22:07 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 5 January 2023 22.14
> 
> Hi Morten,
> 
> A few comments inline.
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Wednesday, December 14, 2022 11:41 AM
> >
> >+CC: Mattias, see my comment below about per-thread constructor for
> this
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Wednesday, 14 December 2022 10.39
> >>
> >> Hello Morten,
> >>
> >> Thanks for review. Answers inline.
> >>
> >> [...]
> >>
> >> > > +/**
> >> > > + * @file
> >> > > + *
> >> > > + * PMU event tracing operations
> >> > > + *
> >> > > + * This file defines generic API and types necessary to setup
> PMU
> >> and
> >> > > + * read selected counters in runtime.
> >> > > + */
> >> > > +
> >> > > +/**
> >> > > + * A structure describing a group of events.
> >> > > + */
> >> > > +struct rte_pmu_event_group {
> >> > > +	int *fds; /**< array of event descriptors */
> >> > > +	void **mmap_pages; /**< array of pointers to mmapped
> >> > > perf_event_attr structures */
> >> >
> >> > There seems to be a lot of indirection involved here. Why are
> these
> >> arrays not statically sized,
> >> > instead of dynamically allocated?
> >> >
> >>
> >> Different architectures/pmus impose limits on number of
> simultaneously
> >> enabled counters. So in order to relieve the pain of thinking about it
> and
> >> adding macros for each and every arch I decided to allocate the
> number
> >> user wants dynamically. Also assumption holds that user knows about
> >> tradeoffs of using too many counters hence will not enable too many
> >> events at once.
> >
> >The DPDK convention is to use fixed size arrays (with a maximum size,
> e.g. RTE_MAX_ETHPORTS) in the
> >fast path, for performance reasons.
> >
> >Please use fixed size arrays instead of dynamically allocated arrays.
> >
> 
> I do agree that from maintenance angle fixed arrays are much more
> convenient
> but when optimization kicks in then that statement does not necessarily
> hold true anymore.
> 
> For example, in this case performance dropped by ~0.3% which is
> insignificant imo. So
> given simpler code, next patchset will use fixed arrays.

I fail to understand how pointer chasing can perform better than obtaining an address by multiplying by a constant. Modern CPUs work in mysterious ways, and you obviously tested this, so I believe your test results. But in theory, pointer chasing touches more cache lines, and should perform worse in a loaded system where pointers in the chain have been evicted from the cache.

Anyway, you agreed to use fixed arrays, so I am happy. :-)
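
To make the addressing argument concrete, a minimal sketch of a fixed-size layout. All
names and sizes here (DEMO_MAX_LCORE, DEMO_MAX_EVENTS, struct demo_pmu) are hypothetical
and not taken from the patch. The point is that the slot address is the base of one
static object plus constant multiples of the indexes, with no pointer loaded from memory
on the way.

```c
#include <stddef.h>

#define DEMO_MAX_LCORE 4  /* hypothetical, stands in for RTE_MAX_LCORE */
#define DEMO_MAX_EVENTS 8 /* hypothetical per-group event limit */

struct demo_event_group {
	int fds[DEMO_MAX_EVENTS];          /* event descriptors */
	void *mmap_pages[DEMO_MAX_EVENTS]; /* mmapped user pages */
};

struct demo_pmu {
	struct demo_event_group group[DEMO_MAX_LCORE];
};

/* Statically allocated state instead of a pointer to a calloc'ed object. */
static struct demo_pmu demo_pmu;

/* Address computed from lcore_id and index only; no intermediate loads. */
static void **
demo_page_slot(unsigned int lcore_id, unsigned int index)
{
	return &demo_pmu.group[lcore_id].mmap_pages[index];
}
```

The trade-off is memory: every lcore reserves room for the maximum number of events,
whether they are used or not.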

> 
> >>
> >> > Also, what is the reason for hiding the type struct
> >> perf_event_mmap_page **mmap_pages opaque by
> >> > using void **mmap_pages instead?
> >>
> >> I think, that part doing mmap/munmap was written first hence void **
> >> was chosen in the first place.
> >
> >Please update it, so the actual type is reflected here.
> >
> >>
> >> >
> >> > > +	bool enabled; /**< true if group was enabled on particular
> lcore
> >> > > */
> >> > > +};
> >> > > +
> >> > > +/**
> >> > > + * A structure describing an event.
> >> > > + */
> >> > > +struct rte_pmu_event {
> >> > > +	char *name; /** name of an event */
> >> > > +	int index; /** event index into fds/mmap_pages */
> >> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> >> > > +
> >> > > +/**
> >> > > + * A PMU state container.
> >> > > + */
> >> > > +struct rte_pmu {
> >> > > +	char *name; /** name of core PMU listed under
> >> > > /sys/bus/event_source/devices */
> >> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per
> lcore
> >> > > event group data */
> >> > > +	int num_group_events; /**< number of events in a group */
> >> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of
> matching
> >> > > events */
> >
> >The event_list is used in slow path only, so it can remain a list -
> i.e. no change requested here.
> >:-)
> >
> >> > > +};
> >> > > +
> >> > > +/** Pointer to the PMU state container */ extern struct rte_pmu
> >> > > +*rte_pmu;
> >> >
> >> > Again, why not just extern struct rte_pmu, instead of dynamic
> >> allocation?
> >> >
> >>
> >> No strong opinions here since this is a matter of personal
> preference.
> >> Can be removed
> >> in the next version.
> >
> >Yes, please.
> >
> >>
> >> > > +
> >> > > +/** Each architecture supporting PMU needs to provide its own
> >> version
> >> > > */
> >> > > +#ifndef rte_pmu_pmc_read
> >> > > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Read PMU counter.
> >> > > + *
> >> > > + * @param pc
> >> > > + *   Pointer to the mmapped user page.
> >> > > + * @return
> >> > > + *   Counter value read from hardware.
> >> > > + */
> >> > > +__rte_internal
> >> > > +static __rte_always_inline uint64_t
> rte_pmu_read_userpage(struct
> >> > > +perf_event_mmap_page *pc) {
> >> > > +	uint64_t offset, width, pmc = 0;
> >> > > +	uint32_t seq, index;
> >> > > +	int tries = 100;
> >> > > +
> >> > > +	for (;;) {
> >
> >As a matter of personal preference, I would write this loop
> differently:
> >
> >+ for (tries = 100; tries != 0; tries--) {
> >
> >> > > +		seq = pc->lock;
> >> > > +		rte_compiler_barrier();
> >> > > +		index = pc->index;
> >> > > +		offset = pc->offset;
> >> > > +		width = pc->pmc_width;
> >> > > +
> >> > > +		if (likely(pc->cap_user_rdpmc && index)) {
> >
> >Why "&& index"? The way I read [man perf_event_open], index 0 is
> perfectly valid.
> >
> 
> Valid index starts at 1. 0 means that either hw counter is stopped or
> isn't active. Maybe this is not
> initially clear from man but there's example later on how to get actual
> number.

OK. Thanks for the reference.

Please add a comment about the special meaning of index 0 in the code.
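
Something along these lines, as a sketch only. It combines the bounded for loop I suggested with the requested comment. struct userpage is a stand-in that carries only the fields used here (the real type is struct perf_event_mmap_page), pmc_read_stub() is a hypothetical placeholder for the per-architecture rte_pmu_pmc_read(), and the width range check is my addition:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in with only the fields used below; the real structure is
 * struct perf_event_mmap_page from <linux/perf_event.h>. */
struct userpage {
	uint32_t lock;		/* sequence counter, changes when the kernel updates the page */
	uint32_t index;		/* 0: counter stopped or inactive, otherwise hw index + 1 */
	int64_t offset;		/* value to add to the raw counter */
	uint16_t pmc_width;	/* width of the hardware counter in bits */
	unsigned int cap_user_rdpmc : 1;
};

/* Hypothetical placeholder for the per-architecture counter read. */
static uint64_t
pmc_read_stub(uint32_t hw_index)
{
	return 1000 + hw_index;
}

static inline void
compiler_barrier(void)
{
	__asm__ volatile("" ::: "memory");
}

static uint64_t
read_userpage(const volatile struct userpage *pc)
{
	uint64_t offset, width, pmc = 0;
	uint32_t seq, index;
	int tries;

	assert(pc != NULL);

	for (tries = 100; tries != 0; tries--) {
		seq = pc->lock;
		compiler_barrier();
		index = pc->index;
		offset = pc->offset;
		width = pc->pmc_width;

		/* index 0 is special: it means the hardware counter is stopped
		 * or not active. Valid indexes start at 1, hence index - 1. */
		if (pc->cap_user_rdpmc && index && width > 0 && width <= 64) {
			pmc = pmc_read_stub(index - 1);
			pmc <<= 64 - width;
			pmc >>= 64 - width;
		}

		compiler_barrier();

		if (pc->lock == seq)
			return pmc + offset;
	}

	/* No consistent snapshot after all tries; the debug log would go here. */
	return 0;
}
```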

> 
> >[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
> >
> >> > > +			pmc = rte_pmu_pmc_read(index - 1);
> >> > > +			pmc <<= 64 - width;
> >> > > +			pmc >>= 64 - width;
> >> > > +		}
> >> > > +
> >> > > +		rte_compiler_barrier();
> >> > > +
> >> > > +		if (likely(pc->lock == seq))
> >> > > +			return pmc + offset;
> >> > > +
> >> > > +		if (--tries == 0) {
> >> > > +			RTE_LOG(DEBUG, EAL, "failed to get
> >> > > perf_event_mmap_page lock\n");
> >> > > +			break;
> >> > > +		}
> >
> >- Remove the 4 above lines of code, and move the debug log message to
> the end of the function
> >instead.
> >
> >> > > +	}
> >
> >+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> >
> >> > > +
> >> > > +	return 0;
> >> > > +}
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Enable group of events for a given lcore.
> >> > > + *
> >> > > + * @param lcore_id
> >> > > + *   The identifier of the lcore.
> >> > > + * @return
> >> > > + *   0 in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_internal
> >> > > +int
> >> > > +rte_pmu_enable_group(int lcore_id);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Add event to the group of enabled events.
> >> > > + *
> >> > > + * @param name
> >> > > + *   Name of an event listed under
> >> > > /sys/bus/event_source/devices/pmu/events.
> >> > > + * @return
> >> > > + *   Event index in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +int
> >> > > +rte_pmu_add_event(const char *name);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Read hardware counter configured to count occurrences of an
> >> event.
> >> > > + *
> >> > > + * @param index
> >> > > + *   Index of an event to be read.
> >> > > + * @return
> >> > > + *   Event value read from register. In case of errors or lack
> of
> >> > > support
> >> > > + *   0 is returned. In other words, stream of zeros in a trace
> >> file
> >> > > + *   indicates problem with reading particular PMU event
> register.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
> >
> >The index type can be changed from int to uint32_t. This also
> eliminates the "(index < 0" part of
> >the comparison further below in this function.
> >
> 
> That's true.
> 
> >> > > +{
> >> > > +	int lcore_id = rte_lcore_id();
> >> > > +	struct rte_pmu_event_group *group;
> >> > > +	int ret;
> >> > > +
> >> > > +	if (!rte_pmu)
> >> > > +		return 0;
> >> > > +
> >> > > +	group = &rte_pmu->group[lcore_id];
> >> > > +	if (!group->enabled) {
> >
> >Optimized: if (unlikely(!group->enabled)) {
> >
> 
> Compiler will optimize the branch itself correctly. Extra hint is not
> required.

I haven't reviewed the output from this, so I'll take your word for it. I suggested the unlikely() because I previously tested some very simple code, and it optimized for taking the "if":

void testb(bool b)
{
    if (!b)
        exit(1);
    
    exit(99);
}

I guess I should experiment with more realistic code, and update my optimization notes!

You could add the unlikely() for readability purposes. ;-)
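
For reference, a sketch of how the read path could look with the hints and an unsigned index. The likely()/unlikely() macros below are local equivalents of the DPDK ones; struct group, enable_group(), num_group_events and pmu_read_sketch() are hypothetical stand-ins, not the patch's code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local equivalents of DPDK's likely()/unlikely(). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define MAX_EVENTS 8	/* hypothetical limit */

struct group {
	bool enabled;
	uint64_t values[MAX_EVENTS];	/* hypothetical: stands in for the mmapped counters */
};

static struct group grp;
static uint32_t num_group_events;

/* Hypothetical stand-in for rte_pmu_enable_group(); always succeeds here. */
static int
enable_group(struct group *g)
{
	(void)g;
	return 0;
}

static uint64_t
pmu_read_sketch(uint32_t index)
{
	assert(num_group_events <= MAX_EVENTS);

	/* One-time setup, taken only on the first call per group. */
	if (unlikely(!grp.enabled)) {
		if (enable_group(&grp) != 0)
			return 0;
		grp.enabled = true;
	}

	/* Unsigned index: no separate "index < 0" test is needed. */
	if (unlikely(index >= num_group_events))
		return 0;

	return grp.values[index];
}
```

Even if the compiler lays out the branches the same way without the hints, they document which path is expected to be cold.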

> 
> >> > > +		ret = rte_pmu_enable_group(lcore_id);
> >> > > +		if (ret)
> >> > > +			return 0;
> >> > > +
> >> > > +		group->enabled = true;
> >> > > +	}
> >> >
> >> > Why is the group not enabled in the setup function,
> >> rte_pmu_add_event(), instead of here, in the
> >> > hot path?
> >> >
> >>
> >> When this is executed for the very first time then cpu will have
> >> obviously more work to do but afterwards setup path is not taken
> hence
> >> much less cpu cycles are required.
> >>
> >> Setup is executed by main lcore solely, before lcores are executed
> >> hence some info passed to SYS_perf_event_open ioctl() is missing,
> pid
> >> (via rte_gettid()) being an example here.
> >
> >OK. Thank you for the explanation. Since impossible at setup, it has
> to be done at runtime.
> >
> >@Mattias: Another good example of something that would belong in per-
> thread constructors, as my
> >suggested feature creep in [1].
> >
> >[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
> >
> >>
> >> > > +
> >> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
> >
> >Optimized: if (unlikely(index >= rte_pmu.num_group_events))
> >
> >> > > +		return 0;
> >> > > +
> >> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
> >> > > *)group->mmap_pages[index]);
> >> >
> >> > Using fixed size arrays instead of multiple indirections via
> >> > pointers
> >> is faster. It could be:
> >> >
> >> > return rte_pmu_read_userpage((struct perf_event_mmap_page
> >> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >> >
> >> > With our without suggested performance improvements...
> >> >
> >> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>
> 


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-05 22:07                 ` Morten Brørup
@ 2023-01-08 15:41                   ` Tomasz Duszynski
  2023-01-08 16:30                     ` Morten Brørup
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-08 15:41 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Thursday, January 5, 2023 11:08 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; zhoumin@loongson.cn;
>mattias.ronnblom@ericsson.com
>Subject: [EXT] RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Thursday, 5 January 2023 22.14
>>
>> Hi Morten,
>>
>> A few comments inline.
>>
>> >From: Morten Brørup <mb@smartsharesystems.com>
>> >Sent: Wednesday, December 14, 2022 11:41 AM
>> >
>> >External Email
>> >
>> >---------------------------------------------------------------------
>> >-
>> >+CC: Mattias, see my comment below about per-thread constructor for
>> this
>> >
>> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> >> Sent: Wednesday, 14 December 2022 10.39
>> >>
>> >> Hello Morten,
>> >>
>> >> Thanks for review. Answers inline.
>> >>
>> >> [...]
>> >>
>> >> > > +/**
>> >> > > + * @file
>> >> > > + *
>> >> > > + * PMU event tracing operations
>> >> > > + *
>> >> > > + * This file defines generic API and types necessary to setup
>> PMU
>> >> and
>> >> > > + * read selected counters in runtime.
>> >> > > + */
>> >> > > +
>> >> > > +/**
>> >> > > + * A structure describing a group of events.
>> >> > > + */
>> >> > > +struct rte_pmu_event_group {
>> >> > > +	int *fds; /**< array of event descriptors */
>> >> > > +	void **mmap_pages; /**< array of pointers to mmapped
>> >> > > perf_event_attr structures */
>> >> >
>> >> > There seems to be a lot of indirection involved here. Why are
>> these
>> >> arrays not statically sized,
>> >> > instead of dynamically allocated?
>> >> >
>> >>
>> >> Different architectures/pmus impose limits on number of
>> simultaneously
>> >> enabled counters. So in order relief the pain of thinking about it
>> and
>> >> adding macros for each and every arch I decided to allocate the
>> number
>> >> user wants dynamically. Also assumption holds that user knows about
>> >> tradeoffs of using too many counters hence will not enable too many
>> >> events at once.
>> >
>> >The DPDK convention is to use fixed size arrays (with a maximum size,
>> e.g. RTE_MAX_ETHPORTS) in the
>> >fast path, for performance reasons.
>> >
>> >Please use fixed size arrays instead of dynamically allocated arrays.
>> >
>>
>> I do agree that from maintenance angle fixed arrays are much more
>> convenient but when optimization kicks in then that statement does not
>> necessarily hold true anymore.
>>
>> For example, in this case performance dropped by ~0.3% which is
>> insignificant imo. So given simpler code, next patchset will use fixed
>> arrays.
>
>I fail to understand how pointer chasing can perform better than obtaining an address by
>multiplying by a constant. Modern CPUs work in mysterious ways, and you obviously tested this, so I
>believe your test results. But in theory, pointer chasing touches more cache lines, and should
>perform worse in a loaded system where pointers in the chain have been evicted from the cache.
>
>Anyway, you agreed to use fixed arrays, so I am happy. :-)
>
>>
>> >>
>> >> > Also, what is the reason for hiding the type struct
>> >> perf_event_mmap_page **mmap_pages opaque by
>> >> > using void **mmap_pages instead?
>> >>
>> >> I think, that part doing mmap/munmap was written first hence void
>> >> ** was chosen in the first place.
>> >
>> >Please update it, so the actual type is reflected here.
>> >
>> >>
>> >> >
>> >> > > +	bool enabled; /**< true if group was enabled on particular
>> lcore
>> >> > > */
>> >> > > +};
>> >> > > +
>> >> > > +/**
>> >> > > + * A structure describing an event.
>> >> > > + */
>> >> > > +struct rte_pmu_event {
>> >> > > +	char *name; /** name of an event */
>> >> > > +	int index; /** event index into fds/mmap_pages */
>> >> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
>> >> > > +
>> >> > > +/**
>> >> > > + * A PMU state container.
>> >> > > + */
>> >> > > +struct rte_pmu {
>> >> > > +	char *name; /** name of core PMU listed under
>> >> > > /sys/bus/event_source/devices */
>> >> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per
>> lcore
>> >> > > event group data */
>> >> > > +	int num_group_events; /**< number of events in a group */
>> >> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of
>> matching
>> >> > > events */
>> >
>> >The event_list is used in slow path only, so it can remain a list -
>> i.e. no change requested here.
>> >:-)
>> >
>> >> > > +};
>> >> > > +
>> >> > > +/** Pointer to the PMU state container */ extern struct
>> >> > > +rte_pmu *rte_pmu;
>> >> >
>> >> > Again, why not just extern struct rte_pmu, instead of dynamic
>> >> allocation?
>> >> >
>> >>
>> >> No strong opinions here since this is a matter of personal
>> preference.
>> >> Can be removed
>> >> in the next version.
>> >
>> >Yes, please.
>> >
>> >>
>> >> > > +
>> >> > > +/** Each architecture supporting PMU needs to provide its own
>> >> version
>> >> > > */
>> >> > > +#ifndef rte_pmu_pmc_read
>> >> > > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
>> >> > > +
>> >> > > +/**
>> >> > > + * @internal
>> >> > > + *
>> >> > > + * Read PMU counter.
>> >> > > + *
>> >> > > + * @param pc
>> >> > > + *   Pointer to the mmapped user page.
>> >> > > + * @return
>> >> > > + *   Counter value read from hardware.
>> >> > > + */
>> >> > > +__rte_internal
>> >> > > +static __rte_always_inline uint64_t
>> rte_pmu_read_userpage(struct
>> >> > > +perf_event_mmap_page *pc) {
>> >> > > +	uint64_t offset, width, pmc = 0;
>> >> > > +	uint32_t seq, index;
>> >> > > +	int tries = 100;
>> >> > > +
>> >> > > +	for (;;) {
>> >
>> >As a matter of personal preference, I would write this loop
>> differently:
>> >
>> >+ for (tries = 100; tries != 0; tries--) {
>> >
>> >> > > +		seq = pc->lock;
>> >> > > +		rte_compiler_barrier();
>> >> > > +		index = pc->index;
>> >> > > +		offset = pc->offset;
>> >> > > +		width = pc->pmc_width;
>> >> > > +
>> >> > > +		if (likely(pc->cap_user_rdpmc && index)) {
>> >
>> >Why "&& index"? The way I read [man perf_event_open], index 0 is
>> perfectly valid.
>> >
>>
>> Valid index starts at 1. 0 means that either hw counter is stopped or
>> isn't active. Maybe this is not initially clear from man but there's
>> example later on how to get actual number.
>
>OK. Thanks for the reference.
>
>Please add a comment about the special meaning of index 0 in the code.
>
>>
>> >[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
>> >
>> >> > > +			pmc = rte_pmu_pmc_read(index - 1);
>> >> > > +			pmc <<= 64 - width;
>> >> > > +			pmc >>= 64 - width;
>> >> > > +		}
>> >> > > +
>> >> > > +		rte_compiler_barrier();
>> >> > > +
>> >> > > +		if (likely(pc->lock == seq))
>> >> > > +			return pmc + offset;
>> >> > > +
>> >> > > +		if (--tries == 0) {
>> >> > > +			RTE_LOG(DEBUG, EAL, "failed to get
>> >> > > perf_event_mmap_page lock\n");
>> >> > > +			break;
>> >> > > +		}
>> >
>> >- Remove the 4 above lines of code, and move the debug log message to
>> the end of the function
>> >instead.
>> >
>> >> > > +	}
>> >
>> >+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
>> >
>> >> > > +
>> >> > > +	return 0;
>> >> > > +}
>> >> > > +
>> >> > > +/**
>> >> > > + * @internal
>> >> > > + *
>> >> > > + * Enable group of events for a given lcore.
>> >> > > + *
>> >> > > + * @param lcore_id
>> >> > > + *   The identifier of the lcore.
>> >> > > + * @return
>> >> > > + *   0 in case of success, negative value otherwise.
>> >> > > + */
>> >> > > +__rte_internal
>> >> > > +int
>> >> > > +rte_pmu_enable_group(int lcore_id);
>> >> > > +
>> >> > > +/**
>> >> > > + * @warning
>> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> >> > > + *
>> >> > > + * Add event to the group of enabled events.
>> >> > > + *
>> >> > > + * @param name
>> >> > > + *   Name of an event listed under
>> >> > > /sys/bus/event_source/devices/pmu/events.
>> >> > > + * @return
>> >> > > + *   Event index in case of success, negative value otherwise.
>> >> > > + */
>> >> > > +__rte_experimental
>> >> > > +int
>> >> > > +rte_pmu_add_event(const char *name);
>> >> > > +
>> >> > > +/**
>> >> > > + * @warning
>> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> >> > > + *
>> >> > > + * Read hardware counter configured to count occurrences of an
>> >> event.
>> >> > > + *
>> >> > > + * @param index
>> >> > > + *   Index of an event to be read.
>> >> > > + * @return
>> >> > > + *   Event value read from register. In case of errors or lack
>> of
>> >> > > support
>> >> > > + *   0 is returned. In other words, stream of zeros in a trace
>> >> file
>> >> > > + *   indicates problem with reading particular PMU event
>> register.
>> >> > > + */
>> >> > > +__rte_experimental
>> >> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
>> >
>> >The index type can be changed from int to uint32_t. This also
>> eliminates the "(index < 0" part of
>> >the comparison further below in this function.
>> >
>>
>> That's true.
>>
>> >> > > +{
>> >> > > +	int lcore_id = rte_lcore_id();
>> >> > > +	struct rte_pmu_event_group *group;
>> >> > > +	int ret;
>> >> > > +
>> >> > > +	if (!rte_pmu)
>> >> > > +		return 0;
>> >> > > +
>> >> > > +	group = &rte_pmu->group[lcore_id];
>> >> > > +	if (!group->enabled) {
>> >
>> >Optimized: if (unlikely(!group->enabled)) {
>> >
>>
>> Compiler will optimize the branch itself correctly. Extra hint is not
>> required.
>
>I haven't reviewed the output from this, so I'll take your word for it. I suggested the unlikely()
>because I previously tested some very simple code, and it optimized for taking the "if":
>
>void testb(bool b)
>{
>    if (!b)
>        exit(1);
>
>    exit(99);
>}
>
>I guess I should experiment with more realistic code, and update my optimization notes!
>

I think this example may be too simple to draw far-reaching conclusions from. The compiler will
make the fall-through path the more likely one. If I recall correctly, the Intel Optimization
Reference Manual has more information on this.

Let's take a different example.

int main(int argc, char *argv[])
{
        int *p;

        p = malloc(sizeof(*p));
        if (!p)
                return 1;
        *p = atoi(argv[1]);
        if (*p < 0)
                return 2;
        free(p);

        return 0;
}

Compiled with -O3 and disassembled, it gives the following:

00000000000010a0 <main>:
    10a0:       f3 0f 1e fa             endbr64
    10a4:       55                      push   %rbp
    10a5:       bf 04 00 00 00          mov    $0x4,%edi
    10aa:       53                      push   %rbx
    10ab:       48 89 f3                mov    %rsi,%rbx
    10ae:       48 83 ec 08             sub    $0x8,%rsp
    10b2:       e8 d9 ff ff ff          call   1090 <malloc@plt>
    10b7:       48 85 c0                test   %rax,%rax
    10ba:       74 31                   je     10ed <main+0x4d>
    10bc:       48 8b 7b 08             mov    0x8(%rbx),%rdi
    10c0:       ba 0a 00 00 00          mov    $0xa,%edx
    10c5:       31 f6                   xor    %esi,%esi
    10c7:       48 89 c5                mov    %rax,%rbp
    10ca:       e8 b1 ff ff ff          call   1080 <strtol@plt>
    10cf:       49 89 c0                mov    %rax,%r8
    10d2:       b8 02 00 00 00          mov    $0x2,%eax
    10d7:       45 85 c0                test   %r8d,%r8d
    10da:       78 0a                   js     10e6 <main+0x46>
    10dc:       48 89 ef                mov    %rbp,%rdi
    10df:       e8 8c ff ff ff          call   1070 <free@plt>
    10e4:       31 c0                   xor    %eax,%eax
    10e6:       48 83 c4 08             add    $0x8,%rsp
    10ea:       5b                      pop    %rbx
    10eb:       5d                      pop    %rbp
    10ec:       c3                      ret
    10ed:       b8 01 00 00 00          mov    $0x1,%eax
    10f2:       eb f2                   jmp    10e6 <main+0x46>

Looking at both 10ba and 10da, the code appears to be laid out so that the expected path falls through
and taken jumps are avoided. Also, potentially less frequently executed code (at 10ed) is pushed further
down in memory, which optimizes cache line usage.

That said, each and every scenario needs analysis on its own. 

>You could add the unlikely() for readability purposes. ;-)
>

Sure. That won't hurt performance.   

>>
>> >> > > +		ret = rte_pmu_enable_group(lcore_id);
>> >> > > +		if (ret)
>> >> > > +			return 0;
>> >> > > +
>> >> > > +		group->enabled = true;
>> >> > > +	}
>> >> >
>> >> > Why is the group not enabled in the setup function,
>> >> rte_pmu_add_event(), instead of here, in the
>> >> > hot path?
>> >> >
>> >>
>> >> When this is executed for the very first time then cpu will have
>> >> obviously more work to do but afterwards setup path is not taken
>> hence
>> >> much less cpu cycles are required.
>> >>
>> >> Setup is executed by main lcore solely, before lcores are executed
>> >> hence some info passed to SYS_perf_event_open ioctl() is missing,
>> pid
>> >> (via rte_gettid()) being an example here.
>> >
>> >OK. Thank you for the explanation. Since impossible at setup, it has
>> to be done at runtime.
>> >
>> >@Mattias: Another good example of something that would belong in per-
>> thread constructors, as my
>> >suggested feature creep in [1].
>> >
>> >[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
>> >
>> >>
>> >> > > +
>> >> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
>> >
>> >Optimized: if (unlikely(index >= rte_pmu.num_group_events))
>> >
>> >> > > +		return 0;
>> >> > > +
>> >> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
>> >> > > *)group->mmap_pages[index]);
>> >> >
>> >> > Using fixed size arrays instead of multiple indirections via
>> >> > pointers
>> >> is faster. It could be:
>> >> >
>> >> > return rte_pmu_read_userpage((struct perf_event_mmap_page
>> >> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
>> >> >
>> >> > With our without suggested performance improvements...
>> >> >
>> >> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
>> >>
>>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-08 15:41                   ` Tomasz Duszynski
@ 2023-01-08 16:30                     ` Morten Brørup
  0 siblings, 0 replies; 139+ messages in thread
From: Morten Brørup @ 2023-01-08 16:30 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Sunday, 8 January 2023 16.41
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Thursday, January 5, 2023 11:08 PM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Thursday, 5 January 2023 22.14
> >>
> >> Hi Morten,
> >>
> >> A few comments inline.
> >>
> >> >From: Morten Brørup <mb@smartsharesystems.com>
> >> >Sent: Wednesday, December 14, 2022 11:41 AM
> >> >
> >> >External Email
> >> >
> >> >-------------------------------------------------------------------
> --
> >> >-
> >> >+CC: Mattias, see my comment below about per-thread constructor for
> >> this
> >> >
> >> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> >> Sent: Wednesday, 14 December 2022 10.39
> >> >>
> >> >> Hello Morten,
> >> >>
> >> >> Thanks for review. Answers inline.
> >> >>

[...]

> >> >> > > +{
> >> >> > > +	int lcore_id = rte_lcore_id();
> >> >> > > +	struct rte_pmu_event_group *group;
> >> >> > > +	int ret;
> >> >> > > +
> >> >> > > +	if (!rte_pmu)
> >> >> > > +		return 0;
> >> >> > > +
> >> >> > > +	group = &rte_pmu->group[lcore_id];
> >> >> > > +	if (!group->enabled) {
> >> >
> >> >Optimized: if (unlikely(!group->enabled)) {
> >> >
> >>
> >> Compiler will optimize the branch itself correctly. Extra hint is
> not
> >> required.
> >
> >I haven't reviewed the output from this, so I'll take your word for
> it. I suggested the unlikely()
> >because I previously tested some very simple code, and it optimized
> for taking the "if":
> >
> >void testb(bool b)
> >{
> >    if (!b)
> >        exit(1);
> >
> >    exit(99);
> >}
> >
> >I guess I should experiment with more realistic code, and update my
> optimization notes!
> >
> 
> I think this may be too simple to draw far-reaching conclusions from
> it. Compiler will make the
> fall-through path more likely. If I recall Intel Optimization Reference
> Manual has some more
> info on this.

IIRC, the Intel Optimization Reference Manual discusses branch optimization for assembler, not C.

> 
> Lets take a different example.
> 
> int main(int argc, char *argv[])
> {
>         int *p;
> 
>         p = malloc(sizeof(*p));
>         if (!p)
>                 return 1;
>         *p = atoi(argv[1]);
>         if (*p < 0)
>                 return 2;
>         free(p);
> 
>         return 0;
> }
> 
> If compiled with -O3 and disassembled I got below.
> 
> 00000000000010a0 <main>:
>     10a0:       f3 0f 1e fa             endbr64
>     10a4:       55                      push   %rbp
>     10a5:       bf 04 00 00 00          mov    $0x4,%edi
>     10aa:       53                      push   %rbx
>     10ab:       48 89 f3                mov    %rsi,%rbx
>     10ae:       48 83 ec 08             sub    $0x8,%rsp
>     10b2:       e8 d9 ff ff ff          call   1090 <malloc@plt>
>     10b7:       48 85 c0                test   %rax,%rax
>     10ba:       74 31                   je     10ed <main+0x4d>
>     10bc:       48 8b 7b 08             mov    0x8(%rbx),%rdi
>     10c0:       ba 0a 00 00 00          mov    $0xa,%edx
>     10c5:       31 f6                   xor    %esi,%esi
>     10c7:       48 89 c5                mov    %rax,%rbp
>     10ca:       e8 b1 ff ff ff          call   1080 <strtol@plt>
>     10cf:       49 89 c0                mov    %rax,%r8
>     10d2:       b8 02 00 00 00          mov    $0x2,%eax
>     10d7:       45 85 c0                test   %r8d,%r8d
>     10da:       78 0a                   js     10e6 <main+0x46>
>     10dc:       48 89 ef                mov    %rbp,%rdi
>     10df:       e8 8c ff ff ff          call   1070 <free@plt>
>     10e4:       31 c0                   xor    %eax,%eax
>     10e6:       48 83 c4 08             add    $0x8,%rsp
>     10ea:       5b                      pop    %rbx
>     10eb:       5d                      pop    %rbp
>     10ec:       c3                      ret
>     10ed:       b8 01 00 00 00          mov    $0x1,%eax
>     10f2:       eb f2                   jmp    10e6 <main+0x46>
> 
> Looking at both 10ba and 10da suggests that code was laid out in a way
> that jumping is frowned upon. Also
> potentially lest frequently executed code (at 10ed) is pushed further
> down the memory hence optimizing cache line usage.

In my notes, I have (ptr == NULL) marked as considered unlikely, but (int == 0) marked as considered likely. Since group->enabled is bool, I guessed the compiler would treat it like int and consider (!group->enabled) as likely.

Like in your example here, I also have (int < 0) marked as considered unlikely.

> 
> That said, each and every scenario needs analysis on its own.

Agree. Theory is good, validation is better. ;-)

> 
> >You could add the unlikely() for readability purposes. ;-)
> >
> 
> Sure. That won't hurt performance.

I think we are both in agreement about the intentions here, so I won't hold you back with further academic discussions at this point. I might resume the discussion with your next patch version, though. ;-)



^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-12-13 11:52         ` Morten Brørup
  2022-12-15  8:46         ` Mattias Rönnblom
@ 2023-01-09  7:37         ` Ruifeng Wang
  2023-01-09 15:40           ` Tomasz Duszynski
  2 siblings, 1 reply; 139+ messages in thread
From: Ruifeng Wang @ 2023-01-09  7:37 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj, mb, zhoumin, nd

> -----Original Message-----
> From: Tomasz Duszynski <tduszynski@marvell.com>
> Sent: Tuesday, December 13, 2022 6:44 PM
> To: dev@dpdk.org
> Cc: thomas@monjalon.net; jerinj@marvell.com; mb@smartsharesystems.com; zhoumin@loongson.cn;
> Tomasz Duszynski <tduszynski@marvell.com>
> Subject: [PATCH v4 1/4] eal: add generic support for reading PMU events
> 
> Add support for programming PMU counters and reading their values in runtime bypassing
> kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use standard perf utility
> without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>  app/test/meson.build                  |   1 +
>  app/test/test_pmu.c                   |  41 +++
>  doc/guides/prog_guide/profile_app.rst |   8 +
>  lib/eal/common/meson.build            |   3 +
>  lib/eal/common/pmu_private.h          |  41 +++
>  lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>  lib/eal/linux/eal.c                   |   4 +
>  lib/eal/version.map                   |   6 +
>  10 files changed, 765 insertions(+)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/eal/common/pmu_private.h
>  create mode 100644 lib/eal/common/rte_pmu.c
>  create mode 100644 lib/eal/include/rte_pmu.h
> 
<snip>
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..049fe19fe3
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,456 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
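
For readers following along, a small sketch of what these two macros compute. The macro bodies are copied from the patch above; place_field() is a hypothetical helper added only for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Mask with bits l..h (inclusive) set. */
#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> (64 - 1 - (h))))
/* Shift v to the lowest set bit of mask m and drop anything outside m. */
#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))

/* Hypothetical helper: place val into bits [high:low] of a config word. */
static uint64_t
place_field(int high, int low, uint64_t val)
{
	assert(low >= 0 && low <= high && high < 64);
	return FIELD_PREP(GENMASK_ULL(high, low), val);
}
```

For example, a term described as bits 8-15 with value 0x12 ends up as 0x1200, and a value wider than the field is truncated to the field width.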
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(),

Looks like using '0' instead of rte_gettid() has the same effect. A small optimization.

> rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	free(group->mmap_pages);
> +	free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);

Better to check if rte_pmu is available.
See below.

> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = calloc(1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	free(rte_pmu);

Set rte_pmu to NULL to prevent unintentional use?

> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {

rte_pmu can be unavailable if init fails. Better to check before accessing.
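
A minimal sketch of the kind of guard meant here. It uses a hypothetical
pmu_ctx structure and ctx_* helpers as stand-ins, not the real struct rte_pmu
or the functions in this patch, and only models the name field:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_pmu; only the name is modelled. */
struct pmu_ctx {
	char *name;
};

static struct pmu_ctx *ctx;

/* Passing NULL as name simulates a failed PMU scan. */
static int
ctx_init(const char *name)
{
	int ret = -ENODEV;

	ctx = calloc(1, sizeof(*ctx));
	if (ctx == NULL)
		return -ENOMEM;

	if (name == NULL)
		goto out;

	ctx->name = malloc(strlen(name) + 1);
	if (ctx->name == NULL) {
		ret = -ENOMEM;
		goto out;
	}
	strcpy(ctx->name, name);

	return 0;
out:
	free(ctx->name);
	free(ctx);
	ctx = NULL; /* leave nothing dangling for later callers */

	return ret;
}

/* Public entry points bail out early when init did not succeed. */
static int
ctx_add_event(const char *event)
{
	if (ctx == NULL || event == NULL)
		return -ENODEV;

	return 0; /* placeholder for the real bookkeeping */
}

static void
ctx_fini(void)
{
	if (ctx == NULL)
		return;

	free(ctx->name);
	free(ctx);
	ctx = NULL;
}
```

With this shape a failed init followed by add-event or fini is harmless: the
former returns -ENODEV and the latter is a no-op.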

> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
<snip>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-09  7:37         ` Ruifeng Wang
@ 2023-01-09 15:40           ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-09 15:40 UTC (permalink / raw)
  To: Ruifeng Wang, dev; +Cc: thomas, Jerin Jacob Kollanukkaran, mb, zhoumin, nd

Hi Ruifeng, 

>-----Original Message-----
>From: Ruifeng Wang <Ruifeng.Wang@arm.com>
>Sent: Monday, January 9, 2023 8:37 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; mb@smartsharesystems.com;
>zhoumin@loongson.cn; nd <nd@arm.com>
>Subject: [EXT] RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
>
>> -----Original Message-----
>> From: Tomasz Duszynski <tduszynski@marvell.com>
>> Sent: Tuesday, December 13, 2022 6:44 PM
>> To: dev@dpdk.org
>> Cc: thomas@monjalon.net; jerinj@marvell.com; mb@smartsharesystems.com;
>> zhoumin@loongson.cn; Tomasz Duszynski <tduszynski@marvell.com>
>> Subject: [PATCH v4 1/4] eal: add generic support for reading PMU
>> events
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>>  app/test/meson.build                  |   1 +
>>  app/test/test_pmu.c                   |  41 +++
>>  doc/guides/prog_guide/profile_app.rst |   8 +
>>  lib/eal/common/meson.build            |   3 +
>>  lib/eal/common/pmu_private.h          |  41 +++
>>  lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
>>  lib/eal/include/meson.build           |   1 +
>>  lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>>  lib/eal/linux/eal.c                   |   4 +
>>  lib/eal/version.map                   |   6 +
>>  10 files changed, 765 insertions(+)
>>  create mode 100644 app/test/test_pmu.c
>>  create mode 100644 lib/eal/common/pmu_private.h
>>  create mode 100644 lib/eal/common/rte_pmu.c
>>  create mode 100644 lib/eal/include/rte_pmu.h
>>
><snip>
>> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
>> new file mode 100644
>> index 0000000000..049fe19fe3
>> --- /dev/null
>> +++ b/lib/eal/common/rte_pmu.c
>> @@ -0,0 +1,456 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(C) 2022 Marvell International Ltd.
>> + */
>> +
>> +#include <ctype.h>
>> +#include <dirent.h>
>> +#include <errno.h>
>> +#include <regex.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <sys/queue.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +#include <rte_eal_paging.h>
>> +#include <rte_pmu.h>
>> +#include <rte_tailq.h>
>> +
>> +#include "pmu_private.h"
>> +
>> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
>> +
>> +#ifndef GENMASK_ULL
>> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
>> +#endif
>> +
>> +#ifndef FIELD_PREP
>> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
>> +#endif
>> +
>> +struct rte_pmu *rte_pmu;
>> +
>> +/*
>> + * Following __rte_weak functions provide default no-op. Architectures should override them if
>> + * necessary.
>> + */
>> +
>> +int
>> +__rte_weak pmu_arch_init(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fini(void)
>> +{
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
>> +{
>> +	RTE_SET_USED(config);
>> +}
>> +
>> +static int
>> +get_term_format(const char *name, int *num, uint64_t *mask)
>> +{
>> +	char *config = NULL;
>> +	char path[PATH_MAX];
>> +	int high, low, ret;
>> +	FILE *fp;
>> +
>> +	/* quiesce -Wmaybe-uninitialized warning */
>> +	*num = 0;
>> +	*mask = 0;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
>> +	fp = fopen(path, "r");
>> +	if (!fp)
>> +		return -errno;
>> +
>> +	errno = 0;
>> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
>> +	if (ret < 2) {
>> +		ret = -ENODATA;
>> +		goto out;
>> +	}
>> +	if (errno) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	if (ret == 2)
>> +		high = low;
>> +
>> +	*mask = GENMASK_ULL(high, low);
>> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
>> +	*num = config[strlen(config) - 1];
>> +	*num = isdigit(*num) ? *num - '0' : 0;
>> +
>> +	ret = 0;
>> +out:
>> +	free(config);
>> +	fclose(fp);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +parse_event(char *buf, uint64_t config[3])
>> +{
>> +	char *token, *term;
>> +	int num, ret, val;
>> +	uint64_t mask;
>> +
>> +	config[0] = config[1] = config[2] = 0;
>> +
>> +	token = strtok(buf, ",");
>> +	while (token) {
>> +		errno = 0;
>> +		/* <term>=<value> */
>> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
>> +		if (ret < 1)
>> +			return -ENODATA;
>> +		if (errno)
>> +			return -errno;
>> +		if (ret == 1)
>> +			val = 1;
>> +
>> +		ret = get_term_format(term, &num, &mask);
>> +		free(term);
>> +		if (ret)
>> +			return ret;
>> +
>> +		config[num] |= FIELD_PREP(mask, val);
>> +		token = strtok(NULL, ",");
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +get_event_config(const char *name, uint64_t config[3])
>> +{
>> +	char path[PATH_MAX], buf[BUFSIZ];
>> +	FILE *fp;
>> +	int ret;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
>> +	fp = fopen(path, "r");
>> +	if (!fp)
>> +		return -errno;
>> +
>> +	ret = fread(buf, 1, sizeof(buf), fp);
>> +	if (ret == 0) {
>> +		fclose(fp);
>> +
>> +		return -EINVAL;
>> +	}
>> +	fclose(fp);
>> +	buf[ret] = '\0';
>> +
>> +	return parse_event(buf, config);
>> +}
>> +
>> +static int
>> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
>> +{
>> +	struct perf_event_attr attr = {
>> +		.size = sizeof(struct perf_event_attr),
>> +		.type = PERF_TYPE_RAW,
>> +		.exclude_kernel = 1,
>> +		.exclude_hv = 1,
>> +		.disabled = 1,
>> +	};
>> +
>> +	pmu_arch_fixup_config(config);
>> +
>> +	attr.config = config[0];
>> +	attr.config1 = config[1];
>> +	attr.config2 = config[2];
>> +
>> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(),
>
>Looks like using '0' instead of rte_gettid() has the same effect. A small optimization.
>
>> rte_lcore_to_cpu_id(lcore_id),
>> +		       group_fd, 0);
>> +}
>> +
>> +static int
>> +open_events(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	struct rte_pmu_event *event;
>> +	uint64_t config[3];
>> +	int num = 0, ret;
>> +
>> +	/* group leader gets created first, with fd = -1 */
>> +	group->fds[0] = -1;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
>> +		ret = get_event_config(event->name, config);
>> +		if (ret) {
>> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
>> +			continue;
>> +		}
>> +
>> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
>> +		if (ret == -1) {
>> +			if (errno == EOPNOTSUPP)
>> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
>> +
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->fds[event->index] = ret;
>> +		num++;
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (--num; num >= 0; num--) {
>> +		close(group->fds[num]);
>> +		group->fds[num] = -1;
>> +	}
>> +
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +mmap_events(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	void *addr;
>> +	int ret, i;
>> +
>> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
>> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
>> +		if (addr == MAP_FAILED) {
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->mmap_pages[i] = addr;
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (; i; i--) {
>> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
>> +		group->mmap_pages[i - 1] = NULL;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static void
>> +cleanup_events(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	int i;
>> +
>> +	if (!group->fds)
>> +		return;
>> +
>> +	if (group->fds[0] != -1)
>> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
>> +
>> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
>> +		if (group->mmap_pages[i]) {
>> +			munmap(group->mmap_pages[i], rte_mem_page_size());
>> +			group->mmap_pages[i] = NULL;
>> +		}
>> +
>> +		if (group->fds[i] != -1) {
>> +			close(group->fds[i]);
>> +			group->fds[i] = -1;
>> +		}
>> +	}
>> +
>> +	free(group->mmap_pages);
>> +	free(group->fds);
>> +
>> +	group->mmap_pages = NULL;
>> +	group->fds = NULL;
>> +	group->enabled = false;
>> +}
>> +
>> +int __rte_noinline
>> +rte_pmu_enable_group(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	int ret;
>> +
>> +	if (rte_pmu->num_group_events == 0) {
>> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
>> +
>> +		return 0;
>> +	}
>> +
>> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
>> +	if (!group->fds) {
>> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
>> +
>> +		return -ENOMEM;
>> +	}
>> +
>> +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
>> +	if (!group->mmap_pages) {
>> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
>> +
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	ret = open_events(lcore_id);
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
>> +		goto out;
>> +	}
>> +
>> +	ret = mmap_events(lcore_id);
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
>> +		goto out;
>> +	}
>> +
>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
>> +
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	return 0;
>> +
>> +out:
>> +	cleanup_events(lcore_id);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +scan_pmus(void)
>> +{
>> +	char path[PATH_MAX];
>> +	struct dirent *dent;
>> +	const char *name;
>> +	DIR *dirp;
>> +
>> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
>> +	if (!dirp)
>> +		return -errno;
>> +
>> +	while ((dent = readdir(dirp))) {
>> +		name = dent->d_name;
>> +		if (name[0] == '.')
>> +			continue;
>> +
>> +		/* sysfs entry should either contain cpus or be a cpu */
>> +		if (!strcmp(name, "cpu"))
>> +			break;
>> +
>> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
>> +		if (access(path, F_OK) == 0)
>> +			break;
>> +	}
>> +
>> +	closedir(dirp);
>> +
>> +	if (dent) {
>> +		rte_pmu->name = strdup(name);
>> +		if (!rte_pmu->name)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	return rte_pmu->name ? 0 : -ENODEV;
>> +}
>> +
>> +int
>> +rte_pmu_add_event(const char *name)
>> +{
>> +	struct rte_pmu_event *event;
>> +	char path[PATH_MAX];
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
>
>Better to check if rte_pmu is available.
>See below.
>
>> +	if (access(path, R_OK))
>> +		return -ENODEV;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
>> +		if (!strcmp(event->name, name))
>> +			return event->index;
>> +		continue;
>> +	}
>> +
>> +	event = calloc(1, sizeof(*event));
>> +	if (!event)
>> +		return -ENOMEM;
>> +
>> +	event->name = strdup(name);
>> +	if (!event->name) {
>> +		free(event);
>> +
>> +		return -ENOMEM;
>> +	}
>> +
>> +	event->index = rte_pmu->num_group_events++;
>> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
>> +
>> +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name, event->index);
>> +
>> +	return event->index;
>> +}
>> +
>> +void
>> +eal_pmu_init(void)
>> +{
>> +	int ret;
>> +
>> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
>> +	if (!rte_pmu) {
>> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
>> +
>> +		return;
>> +	}
>> +
>> +	TAILQ_INIT(&rte_pmu->event_list);
>> +
>> +	ret = scan_pmus();
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
>> +		goto out;
>> +	}
>> +
>> +	ret = pmu_arch_init();
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
>> +		goto out;
>> +	}
>> +
>> +	return;
>> +out:
>> +	free(rte_pmu->name);
>> +	free(rte_pmu);
>
>Set rte_pmu to NULL to prevent unintentional use?
>

The next series will use a global pmu instance, so this will no longer be
required, though your suggestion may still be applied to other pointers around.

>> +}
>> +
>> +void
>> +eal_pmu_fini(void)
>> +{
>> +	struct rte_pmu_event *event, *tmp;
>> +	int lcore_id;
>> +
>> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
>
>rte_pmu can be unavailable if init fails. Better to check before accessing.
>

Yep. 

>> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
>> +		free(event->name);
>> +		free(event);
>> +	}
>> +
>> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
>> +		cleanup_events(lcore_id);
>> +
>> +	pmu_arch_fini();
>> +	free(rte_pmu->name);
>> +	free(rte_pmu);
>> +}
><snip>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v5 0/4] add support for self monitoring
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
                         ` (3 preceding siblings ...)
  2022-12-13 10:43       ` [PATCH v4 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-01-10 23:46       ` Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                           ` (7 more replies)
  4 siblings, 8 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

This series adds self-monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self-monitoring functionality.

Events are enabled by name, and the names need to be read from the
standard location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
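
For illustration only, a minimal sketch of how the terms of a named event
could be packed into the three perf config words. It assumes per-term format
strings of the form "config:0-7" or "config1:3", as found under the PMU's
format directory. The helper names (bit_mask, field_prep, parse_format,
encode_term) and the sample strings are hypothetical and not part of this
series; the sketch works on strings rather than reading sysfs:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mask with bits low..high set (inclusive), 0 <= low <= high <= 63. */
static uint64_t
bit_mask(int high, int low)
{
	assert(low >= 0 && high >= low && high <= 63);

	return (~0ULL << low) & (~0ULL >> (63 - high));
}

/* Shift val to the lowest set bit of mask and drop bits outside of it. */
static uint64_t
field_prep(uint64_t mask, uint64_t val)
{
	return (val << (__builtin_ffsll((long long)mask) - 1)) & mask;
}

/*
 * Parse "<config>:<low>[-<high>]". A trailing digit in <config> selects
 * the config word (0 is implied when missing). Returns 0 or -1.
 */
static int
parse_format(const char *fmt, int *num, uint64_t *mask)
{
	char name[16];
	int low, high, n;
	char last;

	n = sscanf(fmt, "%15[^:]:%d-%d", name, &low, &high);
	if (n < 2)
		return -1;
	if (n == 2)
		high = low; /* single-bit field */
	if (low < 0 || high < low || high > 63)
		return -1;

	last = name[strlen(name) - 1];
	*num = (last >= '0' && last <= '9') ? last - '0' : 0;
	if (*num > 2)
		return -1;
	*mask = bit_mask(high, low);

	return 0;
}

/* OR one term value into the config word selected by its format. */
static int
encode_term(const char *fmt, uint64_t val, uint64_t config[3])
{
	uint64_t mask;
	int num;

	if (parse_format(fmt, &num, &mask) != 0)
		return -1;

	config[num] |= field_prep(mask, val);

	return 0;
}
```

For example, encoding value 0x3c with format "config:0-7" and value 0x01 with
format "config:8-15" into a zeroed array leaves config[0] equal to 0x013c.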

v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 +++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 104 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 504 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  10 +
 lib/eal/include/rte_pmu.h                | 202 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   7 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1054 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-11  9:05           ` Morten Brørup
  2023-01-10 23:46         ` [PATCH v5 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                           ` (6 subsequent siblings)
  7 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 435 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 199 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   6 +
 10 files changed, 739 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..9a90aaffdb
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being one example. However, in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..67e8ffefb2
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,435 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu rte_pmu;
+
+/*
+ * The following __rte_weak functions provide default no-ops. Architectures should override
+ * them if necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp); /* leave room for the terminator */
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
+}
+
+static int
+open_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	unsigned int i;
+
+	if (group->fds == NULL)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	int ret;
+
+	if (rte_pmu.num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to reset events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL)
+			return -ENOMEM;
+	}
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = calloc(1, sizeof(*event));
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	unsigned int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free(event->name);
+		free(event);
+	}
+
+	RTE_LCORE_FOREACH(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..6968b35545
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 16
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, a stream of zeros in a trace file
+ *   indicates a problem with reading a particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group;
+	int ret, lcore_id = rte_lcore_id();
+
+	group = &rte_pmu.group[lcore_id];
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused unsigned int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..1717b221b4 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,11 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu; # WINDOWS_NO_EXPORT
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +488,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group; # WINDOWS_NO_EXPORT
 };
-- 
2.34.1
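
How parse_event() and get_term_format() in patch 1/4 assemble the perf
config words may be easier to see in isolation. The following is a minimal
standalone sketch, not code from the series. It reuses the GENMASK_ULL and
FIELD_PREP definitions from lib/eal/common/rte_pmu.c; parse_format() and
encode_term() are hypothetical helpers that take the sysfs format string
(for example "config:8-15") as an argument instead of reading it from /sys.
The macros assume a GCC-compatible compiler because of __builtin_ffsll.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same definitions as in lib/eal/common/rte_pmu.c from the patch. */
#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))

/*
 * Hypothetical helper: parse a format string such as "config:0-7" or
 * "config1:32" into a config word index (0..2) and a bit mask.
 */
static int
parse_format(const char *fmt, int *num, uint64_t *mask)
{
	char name[32];
	int low, high, ret;
	size_t len;

	ret = sscanf(fmt, "%31[^:]:%d-%d", name, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single-bit field */
	if (low < 0 || high > 63 || low > high)
		return -1;

	*mask = GENMASK_ULL(high, low);

	/* A trailing digit selects config1/config2; no digit means config. */
	len = strlen(name);
	*num = (name[len - 1] >= '0' && name[len - 1] <= '9') ? name[len - 1] - '0' : 0;
	if (*num > 2)
		return -1;

	return 0;
}

/* Hypothetical helper: place val into the field described by fmt. */
static int
encode_term(const char *fmt, uint64_t val, uint64_t config[3])
{
	uint64_t mask;
	int num;

	if (parse_format(fmt, &num, &mask))
		return -1;

	config[num] |= FIELD_PREP(mask, val);

	return 0;
}
```

With a zeroed config array, encoding 0x3c through "config:0-7" and 0x01
through "config:8-15" leaves config[0] equal to 0x013c, and encoding 1
through "config1:32" sets bit 32 of config[1].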



* [PATCH v5 2/4] eal/arm: support reading ARM PMU events in runtime
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                           ` (5 subsequent siblings)
  7 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev, Ruifeng Wang
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 104 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 155 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 9a90aaffdb..e19819c31a 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..729f3d4dfe
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..4cbbe6f31d
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ] = { 0 }; /* zero-filled so strtol() sees a terminated string */
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 6968b35545..9185d05ca3 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.34.1



* [PATCH v5 3/4] eal/x86: support reading Intel PMU events in runtime
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 4/4] eal: add PMU support to tracing library Tomasz Duszynski
                           ` (4 subsequent siblings)
  7 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index e19819c31a..79f83a1925 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 9185d05ca3..0345746940 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..f241b80bc9
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.34.1
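
The value returned by rte_pmu_pmc_read() is not used as-is:
rte_pmu_read_userpage() from patch 1/4 sign-extends it to pmc_width bits
and adds the kernel-maintained offset. Below is a minimal standalone sketch
of that arithmetic. combine_halves() and apply_offset() are hypothetical
names chosen for illustration, and the right shift of a negative value
relies on arithmetic-shift behaviour, which GCC and Clang provide.

```c
#include <stdint.h>

/* Combine the two 32-bit halves the way the x86 rte_pmu_pmc_read() does. */
static uint64_t
combine_halves(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 32);
}

/*
 * Sign-extend a raw counter that is `width` bits wide and add it to the
 * offset taken from the mmapped user page.
 */
static uint64_t
apply_offset(uint64_t offset, uint64_t raw, unsigned int width)
{
	int64_t pmc;

	if (width == 0 || width > 64)
		return offset; /* nothing sensible to add */

	/* Shift left as unsigned to avoid undefined behaviour, then back. */
	pmc = (int64_t)(raw << (64 - width));
	pmc >>= 64 - width;

	return offset + (uint64_t)pmc;
}
```

For a 48-bit counter, a raw value with all 48 bits set is treated as -1, so
an offset of 100 yields 99, while a raw value of 5 yields 105.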



* [PATCH v5 4/4] eal: add PMU support to tracing library
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (2 preceding siblings ...)
  2023-01-10 23:46         ` [PATCH v5 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
                           ` (3 subsequent siblings)
  7 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori
  Cc: thomas, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin, Tomasz Duszynski

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 +
 lib/eal/common/rte_pmu.c                 | 73 +++++++++++++++++++++++-
 lib/eal/include/rte_eal_trace.h          | 10 ++++
 lib/eal/version.map                      |  1 +
 7 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers the dedicated
+tracepoint ``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..10d5b99084 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 67e8ffefb2..fd0df3b756 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -19,6 +19,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -361,6 +362,12 @@ rte_pmu_add_event(const char *name)
 	struct rte_pmu_event *event;
 	char path[PATH_MAX];
 
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
 	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
 	if (access(path, R_OK))
 		return -ENODEV;
@@ -372,11 +379,11 @@ rte_pmu_add_event(const char *name)
 	}
 
 	event = calloc(1, sizeof(*event));
-	if (!event)
+	if (event == NULL)
 		return -ENOMEM;
 
 	event->name = strdup(name);
-	if (!event->name) {
+	if (event->name == NULL) {
 		free(event);
 
 		return -ENOMEM;
@@ -390,11 +397,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (trace == NULL)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	TAILQ_INIT(&rte_pmu.event_list);
 
 	ret = scan_pmus();
@@ -409,6 +475,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(rte_pmu.name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..9b35af75d5 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,15 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1717b221b4..d87a867e5b 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
-- 
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (3 preceding siblings ...)
  2023-01-10 23:46         ` [PATCH v5 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-01-11  0:32         ` Tyler Retzlaff
  2023-01-11  9:31           ` Morten Brørup
  2023-01-11  9:39           ` [EXT] " Tomasz Duszynski
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
                           ` (2 subsequent siblings)
  7 siblings, 2 replies; 139+ messages in thread
From: Tyler Retzlaff @ 2023-01-11  0:32 UTC (permalink / raw)
  To: Tomasz Duszynski, bruce.richardson, mb
  Cc: dev, thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin

hi,

don't interpret this as an objection to the functionality but this looks
like a clear example of something that doesn't belong in the EAL. has
there been a discussion as to whether or not this should be in a
separate library?

a basic test is whether or not an implementation exists or can be
reasonably provided for all platforms and that isn't strictly evident
here. red flag is to see yet more code being added conditionally
compiled for a single platform.

Morten, Bruce comments?

thanks

On Wed, Jan 11, 2023 at 12:46:38AM +0100, Tomasz Duszynski wrote:
> This series adds self monitoring support i.e allows to configure and
> read performance measurement unit (PMU) counters in runtime without
> using perf utility. This has certain advantages when application runs on
> isolated cores with nohz_full kernel parameter.
> 
> Events can be read directly using rte_pmu_read() or using dedicated
> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> stored inside CTF file.
> 
> By design, all enabled events are grouped together and the same group
> is attached to lcores that use self monitoring functionality.
> 
> Events are enabled by names, which need to be read from standard
> location under sysfs i.e
> 
> /sys/bus/event_source/devices/PMU/events
> 
> where PMU is a core pmu i.e one measuring cpu events. As of today
> raw events are not supported.
> 
> v5:
> - address review comments
> - fix sign extension while reading pmu on x86
> - fix regex mentioned in doc
> - various minor changes/improvements here and there
> v4:
> - fix freeing mem detected by debug_autotest
> v3:
> - fix shared build
> v2:
> - fix problems reported by test build infra
> 
> Tomasz Duszynski (4):
>   eal: add generic support for reading PMU events
>   eal/arm: support reading ARM PMU events in runtime
>   eal/x86: support reading Intel PMU events in runtime
>   eal: add PMU support to tracing library
> 
>  app/test/meson.build                     |   1 +
>  app/test/test_pmu.c                      |  47 +++
>  app/test/test_trace_perf.c               |   4 +
>  doc/guides/prog_guide/profile_app.rst    |  13 +
>  doc/guides/prog_guide/trace_lib.rst      |  32 ++
>  lib/eal/arm/include/meson.build          |   1 +
>  lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
>  lib/eal/arm/meson.build                  |   4 +
>  lib/eal/arm/rte_pmu.c                    | 104 +++++
>  lib/eal/common/eal_common_trace_points.c |   3 +
>  lib/eal/common/meson.build               |   3 +
>  lib/eal/common/pmu_private.h             |  41 ++
>  lib/eal/common/rte_pmu.c                 | 504 +++++++++++++++++++++++
>  lib/eal/include/meson.build              |   1 +
>  lib/eal/include/rte_eal_trace.h          |  10 +
>  lib/eal/include/rte_pmu.h                | 202 +++++++++
>  lib/eal/linux/eal.c                      |   4 +
>  lib/eal/version.map                      |   7 +
>  lib/eal/x86/include/meson.build          |   1 +
>  lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
>  20 files changed, 1054 insertions(+)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
>  create mode 100644 lib/eal/arm/rte_pmu.c
>  create mode 100644 lib/eal/common/pmu_private.h
>  create mode 100644 lib/eal/common/rte_pmu.c
>  create mode 100644 lib/eal/include/rte_pmu.h
>  create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h
> 
> --
> 2.34.1

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-11  9:05           ` Morten Brørup
  2023-01-11 16:20             ` Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2023-01-11  9:05 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, jerinj, Ruifeng.Wang, mattias.ronnblom, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 11 January 2023 00.47
> 
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---

[...]

> +static int
> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int
> group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, 0,
> rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
> +}

If SYS_perf_event_open() must be called from the worker thread itself, then lcore_id must not be passed as a parameter to do_perf_event_open(). Otherwise, I would expect to be able to call do_perf_event_open() from the main thread and pass any lcore_id of a worker thread.
This comment applies to all functions that must be called from the worker thread itself. It also applies to the functions that call such functions.

[...]

> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
> /**< array of user pages */
> +	bool enabled; /**< true if group was enabled on particular lcore
> */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	unsigned int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};

Move the "enabled" field up, making it the first field in struct rte_pmu_event_group. This might reduce the number of instructions required to check (!group->enabled) in rte_pmu_read().

Also, each instance of the structure is used individually per lcore, so the structure should be cache line aligned to avoid unnecessarily crossing cache lines.

I.e.:

struct rte_pmu_event_group {
	bool enabled; /**< true if group was enabled on particular lcore */
	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
} __rte_cache_aligned;
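
To illustrate the layout point outside of DPDK, here is a minimal standalone sketch. It assumes a 64-byte cache line and uses C11 alignas in place of __rte_cache_aligned; CACHE_LINE_SIZE, NUM_GROUP_EVENTS (8 is an arbitrary value) and struct event_group are hypothetical stand-ins, not names or values from the patch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration only; not taken from the patch. */
#define CACHE_LINE_SIZE 64
#define NUM_GROUP_EVENTS 8

/* Stand-in for struct rte_pmu_event_group with "enabled" placed first. */
struct event_group {
	alignas(CACHE_LINE_SIZE) bool enabled; /* flag checked on every read */
	int fds[NUM_GROUP_EVENTS];             /* event descriptors */
	void *mmap_pages[NUM_GROUP_EVENTS];    /* user pages */
};

/* Compile-time checks: each array element starts on its own cache line. */
static_assert(alignof(struct event_group) == CACHE_LINE_SIZE, "alignment");
static_assert(sizeof(struct event_group) % CACHE_LINE_SIZE == 0, "padding");
static_assert(offsetof(struct event_group, enabled) == 0, "enabled first");
```

With this layout, an array indexed by lcore id never has two lcores sharing a cache line for their group data, and the flag sits at offset 0 of the element.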

> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under
> /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> event group data */
> +	unsigned int num_group_events; /**< number of events in a group
> */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu rte_pmu;

Just "The PMU state container". It is not a pointer anymore. :-)

[...]

> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t width, offset;
> +	uint32_t seq, index;
> +	int64_t pmc;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +

Please add a comment here about the special meaning of index == 0.

> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +			offset += pmc;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return offset;
> +	}
> +
> +	return 0;
> +}
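
For readers following the shift pair in the loop above, here is a minimal standalone sketch of the sign-extension step, assuming a counter narrower than 64 bits. The helper name sign_extend_counter is hypothetical and this is not the patch's code; it only shows how the low "width" bits of a raw value are turned into a signed delta.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sign-extend the low "width" bits of a raw counter value.
 * The left shift is done on an unsigned value to avoid signed overflow.
 * The conversion to int64_t and the right shift of a negative value are
 * implementation-defined in C; GCC and Clang use two's complement and an
 * arithmetic shift, which is what this sketch relies on.
 */
static inline int64_t
sign_extend_counter(uint64_t raw, unsigned int width)
{
	assert(width >= 1 && width <= 64);

	unsigned int shift = 64 - width;

	return (int64_t)(raw << shift) >> shift;
}
```

With a 48-bit counter, a raw value of 0xFFFFFFFFFFFF comes out as -1, while any bits above bit 47 are discarded.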

[...]

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of
> support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(unsigned int index)
> +{
> +	struct rte_pmu_event_group *group;
> +	int ret, lcore_id = rte_lcore_id();
> +
> +	group = &rte_pmu.group[lcore_id];
> +	if (unlikely(!group->enabled)) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;

group->enabled should be set inside rte_pmu_enable_group(), not here.

> +	}
> +
> +	if (unlikely(index >= rte_pmu.num_group_events))
> +		return 0;
> +
> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
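
A rough sketch of what I mean by setting the flag inside the enable function, using hypothetical stand-in names (struct group, enable_group, read_counter) and a boolean hw_ok that merely models whether setup succeeds. It is not the patch's code, only the shape of the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; names and layout are not taken from the patch. */
struct group {
	bool enabled;
	uint64_t counter; /* placeholder for the mmapped user page */
};

/* Enable the group and record that fact in one place. Returns 0 on success. */
static int
enable_group(struct group *g, bool hw_ok)
{
	assert(g != NULL);

	if (!hw_ok)
		return -1; /* models a failed setup; the flag stays false */

	g->enabled = true;
	return 0;
}

/* Read path: the caller no longer writes g->enabled itself. */
static uint64_t
read_counter(struct group *g, bool hw_ok)
{
	if (!g->enabled && enable_group(g, hw_ok) != 0)
		return 0;

	return g->counter;
}
```

This keeps the state change next to the code that performs the setup, so the inline read path only ever tests the flag.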



^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
@ 2023-01-11  9:31           ` Morten Brørup
  2023-01-11 14:24             ` Tomasz Duszynski
  2023-01-11  9:39           ` [EXT] " Tomasz Duszynski
  1 sibling, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2023-01-11  9:31 UTC (permalink / raw)
  To: Tyler Retzlaff, Tomasz Duszynski, bruce.richardson
  Cc: dev, thomas, jerinj, Ruifeng.Wang, mattias.ronnblom, zhoumin

> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Wednesday, 11 January 2023 01.32
> 

[Moved Tyler's post down here.]

> 
> hi,
> 
> don't interpret this as an objection to the functionality but this
> looks
> like a clear example of something that doesn't belong in the EAL. has
> there been a discussion as to whether or not this should be in a
> separate library?

IIRC, there has been no such discussion.

Although I agree that this doesn't belong in EAL, I would point to the trace library as a reference for allowing it into the EAL.

For the record, I also oppose the trace library being part of the EAL.

On the other hand, it would be interesting to determine whether it is *impossible* to add this functionality as a normal DPDK library, i.e. outside of the EAL, or whether there is an unavoidable tie-in to the EAL.

@Tomasz, if this is impossible, please describe the unavoidable tie-in to the EAL. No need for a long report, just a few words. You (and this functionality) shouldn't suffer from our long term ambition to move stuff out of the EAL.

> 
> a basic test is whether or not an implementation exists or can be
> reasonably provided for all platforms and that isn't strictly evident
> here. red flag is to see yet more code being added conditionally
> compiled for a single platform.

Another basic test: Can DPDK applications run without it? If they can, an Environment Abstraction Layer does not need to have it, and thus it does not need to be part of the EAL.

> 
> Morten, Bruce comments?
> 
> thanks

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
  2023-01-11  9:31           ` Morten Brørup
@ 2023-01-11  9:39           ` Tomasz Duszynski
  2023-01-11 21:05             ` Tyler Retzlaff
  1 sibling, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-11  9:39 UTC (permalink / raw)
  To: Tyler Retzlaff, bruce.richardson, mb
  Cc: dev, thomas, Jerin Jacob Kollanukkaran, mb, Ruifeng.Wang,
	mattias.ronnblom, zhoumin

Hi Tyler,

>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Wednesday, January 11, 2023 1:32 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; bruce.richardson@intel.com; mb@smartsharesystems.com
>Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>mb@smartsharesystems.com; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
>
>External Email
>
>----------------------------------------------------------------------
>hi,
>
>don't interpret this as an objection to the functionality but this looks like a clear example of
>something that doesn't belong in the EAL. has there been a discussion as to whether or not this
>should be in a separate library?

No, I don't recall anybody having any concerns about the code placement. The rationale behind
making this part of eal was that tracing itself is a part of eal, and since this was meant to be
an extension to tracing, the code placement decision came naturally.

During the development phase the idea evolved a bit, and what initially was supposed to be solely
yet another tracepoint became a generic API to read the pmu plus a tracepoint based on it. This
means both can be used independently.

That said, since this code has both platform-agnostic and platform-specific parts, it can be organized either as:
1. library + eal platform code
2. all under eal 

Either approach seems legit. Thoughts?

>
>a basic test is whether or not an implementation exists or can be reasonably provided for all
>platforms and that isn't strictly evident here. red flag is to see yet more code being added
>conditionally compiled for a single platform.

Even libs are not entirely pristine and have platform specific ifdefs lurking so not sure where
this red flag is coming from. 

>
>Morten, Bruce comments?
>
>thanks
>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  9:31           ` Morten Brørup
@ 2023-01-11 14:24             ` Tomasz Duszynski
  2023-01-11 14:32               ` Bruce Richardson
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-11 14:24 UTC (permalink / raw)
  To: Morten Brørup, Tyler Retzlaff, bruce.richardson
  Cc: dev, thomas, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin



>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, January 11, 2023 10:31 AM
>To: Tyler Retzlaff <roretzla@linux.microsoft.com>; Tomasz Duszynski <tduszynski@marvell.com>;
>bruce.richardson@intel.com
>Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>Subject: [EXT] RE: [PATCH v5 0/4] add support for self monitoring
>
>External Email
>
>----------------------------------------------------------------------
>> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
>> Sent: Wednesday, 11 January 2023 01.32
>>
>
>[Moved Tyler's post down here.]
>
>>
>> hi,
>>
>> don't interpret this as an objection to the functionality but this
>> looks like a clear example of something that doesn't belong in the
>> EAL. has there been a discussion as to whether or not this should be
>> in a separate library?
>
>IIRC, there has been no such discussion.
>
>Although I agree that this doesn't belong in EAL, I would point to the trace library as a reference
>for allowing it into the EAL.
>
>For the record, I also oppose the trace library being part of the EAL.
>
>On the other hand, it would be interesting to determine whether it is *impossible* to add this
>functionality as a normal DPDK library, i.e. outside of the EAL, or whether there is an
>unavoidable tie-in to the EAL.
>
>@Tomasz, if this is impossible, please describe the unavoidable tie-in to the EAL. No need for a
>long report, just a few words. You (and this functionality) shouldn't suffer from our long term
>ambition to move stuff out of the EAL.
>

You can read about rationale here https://lore.kernel.org/dpdk-dev/DM4PR18MB436872EBC5922084C5DAFC1DD2FC9@DM4PR18MB4368.namprd18.prod.outlook.com/#t

As for a hard no-no, there isn't any in fact. There are some tradeoffs though.

For example, it seems eal cannot depend on other libs, so if someone needs to
fine-tune some part of EAL for whatever reason, then the relevant part needs to
be modified each and every time, i.e. specific includes and tracepoints need to be added each time.

On the other hand, if this is coupled with eal then adding tracepoints to some parts
will be easier. Or they can just be added to specific points and live there.

No strong opinions besides that. I'd like to know what others think.

>>
>> a basic test is whether or not an implementation exists or can be
>> reasonably provided for all platforms and that isn't strictly evident
>> here. red flag is to see yet more code being added conditionally
>> compiled for a single platform.
>
>Another basic test: Can DPDK applications run without it? If they can, an Environment Abstraction
>Layer does not need to have it, and thus it does not need to be part of the EAL.
>
>>
>> Morten, Bruce comments?
>>
>> thanks

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11 14:24             ` Tomasz Duszynski
@ 2023-01-11 14:32               ` Bruce Richardson
  0 siblings, 0 replies; 139+ messages in thread
From: Bruce Richardson @ 2023-01-11 14:32 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: Morten Brørup, Tyler Retzlaff, dev, thomas,
	Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin

On Wed, Jan 11, 2023 at 02:24:28PM +0000, Tomasz Duszynski wrote:
> 
> 
> >-----Original Message-----
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Wednesday, January 11, 2023 10:31 AM
> >To: Tyler Retzlaff <roretzla@linux.microsoft.com>; Tomasz Duszynski <tduszynski@marvell.com>;
> >bruce.richardson@intel.com
> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> >Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
> >Subject: [EXT] RE: [PATCH v5 0/4] add support for self monitoring
> >
> >External Email
> >
> >----------------------------------------------------------------------
> >> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> >> Sent: Wednesday, 11 January 2023 01.32
> >>
> >
> >[Moved Tyler's post down here.]
> >
> >>
> >> hi,
> >>
> >> don't interpret this as an objection to the functionality but this
> >> looks like a clear example of something that doesn't belong in the
> >> EAL. has there been a discussion as to whether or not this should be
> >> in a separate library?
> >
> >IIRC, there has been no such discussion.
> >
> >Although I agree that this doesn't belong in EAL, I would point to the trace library as a reference
> >for allowing it into the EAL.
> >
> >For the record, I also oppose the trace library being part of the EAL.
> >
> >On the other hand, it would be interesting to determine whether it is *impossible* to add this
> >functionality as any other normal DPDK library, i.e. outside of the EAL, or if there is an
> >unavoidable tie-in to the EAL.
> >
> >@Tomasz, if this is impossible, please describe the unavoidable tie-in to the EAL. No need for a
> >long report, just a few words. You (and this functionality) shouldn't suffer from our long term
> >ambition to move stuff out of the EAL.
> >
> 
> You can read about rationale here https://lore.kernel.org/dpdk-dev/DM4PR18MB436872EBC5922084C5DAFC1DD2FC9@DM4PR18MB4368.namprd18.prod.outlook.com/#t
> 
> As for the NO-NO there isn't any in fact. There are some tradeoffs though.
>
> For example, it seems EAL cannot depend on other libs, so if someone needs to
> fine-tune some part of EAL for whatever reason, then the relevant part needs to
> be modified each and every time, i.e. specific includes and tracepoints need to be added each time.
>
Well, EAL can depend on other libs, but then those libs cannot in turn
directly depend upon DPDK. This is where breaking out first some of the
smaller widely used parts of DPDK  e.g. logging, would be good, as it would
then in turn allow other, potentially bigger parts of EAL to be taken out.

See [1] for a rough first attempt at this, which allows simplification of
telemetry as it no longer needs a "dependency injection" style to have
logging. Moving out logging would also allow logging from kvargs library
too - another lib which is used by EAL rather than depending on it.
Similarly for tracing functionality - if that were pulled out of EAL, it
could be used by telemetry, kvargs and any other parts removed from EAL.

/Bruce

[1] http://patches.dpdk.org/project/dpdk/list/?series=24453

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-11  9:05           ` Morten Brørup
@ 2023-01-11 16:20             ` Tomasz Duszynski
  2023-01-11 16:54               ` Morten Brørup
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-11 16:20 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin



>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, January 11, 2023 10:06 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>Subject: [EXT] RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Wednesday, 11 January 2023 00.47
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>
>[...]
>
>> +static int
>> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int
>> group_fd)
>> +{
>> +	struct perf_event_attr attr = {
>> +		.size = sizeof(struct perf_event_attr),
>> +		.type = PERF_TYPE_RAW,
>> +		.exclude_kernel = 1,
>> +		.exclude_hv = 1,
>> +		.disabled = 1,
>> +	};
>> +
>> +	pmu_arch_fixup_config(config);
>> +
>> +	attr.config = config[0];
>> +	attr.config1 = config[1];
>> +	attr.config2 = config[2];
>> +
>> +	return syscall(SYS_perf_event_open, &attr, 0,
>> rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
>> +}
>
>If SYS_perf_event_open() must be called from the worker thread itself, then lcore_id must not be
>passed as a parameter to do_perf_event_open(). Otherwise, I would expect to be able to call
>do_perf_event_open() from the main thread and pass any lcore_id of a worker thread.
>This comment applies to all functions that must be called from the worker thread itself. It also
>applies to the functions that call such functions.
>

Lcore_id is being passed around so that we don't need to call rte_lcore_id() each and every time. 

>[...]
>
>> +/**
>> + * A structure describing a group of events.
>> + */
>> +struct rte_pmu_event_group {
>> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
>> /**< array of user pages */
>> +	bool enabled; /**< true if group was enabled on particular lcore
>> */
>> +};
>> +
>> +/**
>> + * A structure describing an event.
>> + */
>> +struct rte_pmu_event {
>> +	char *name; /** name of an event */
>> +	unsigned int index; /** event index into fds/mmap_pages */
>> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
>> +};
>
>Move the "enabled" field up, making it the first field in this structure. This might reduce the
>number of instructions required to check (!group->enabled) in rte_pmu_read().
>

This will be called once and no, this will not produce more instructions. Why should it?
In both cases the compiler will need to load data at some offset, and archs do have instructions for that.

>Also, each instance of the structure is used individually per lcore, so the structure should be
>cache line aligned to avoid unnecessarily crossing cache lines.
>
>I.e.:
>
>struct rte_pmu_event_group {
>	bool enabled; /**< true if group was enabled on particular lcore */
>	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>} __rte_cache_aligned;

Yes, this can be aligned. While at it, I'd be more inclined to move mmap_pages up instead of enabled.

>
>> +
>> +/**
>> + * A PMU state container.
>> + */
>> +struct rte_pmu {
>> +	char *name; /** name of core PMU listed under
>> /sys/bus/event_source/devices */
>> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
>> event group data */
>> +	unsigned int num_group_events; /**< number of events in a group
>> */
>> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
>> events */
>> +};
>> +
>> +/** Pointer to the PMU state container */
>> +extern struct rte_pmu rte_pmu;
>
>Just "The PMU state container". It is not a pointer anymore. :-)
>

Good catch.

>[...]
>
>> +/**
>> + * @internal
>> + *
>> + * Read PMU counter.
>> + *
>> + * @param pc
>> + *   Pointer to the mmapped user page.
>> + * @return
>> + *   Counter value read from hardware.
>> + */
>> +__rte_internal
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> +{
>> +	uint64_t width, offset;
>> +	uint32_t seq, index;
>> +	int64_t pmc;
>> +
>> +	for (;;) {
>> +		seq = pc->lock;
>> +		rte_compiler_barrier();
>> +		index = pc->index;
>> +		offset = pc->offset;
>> +		width = pc->pmc_width;
>> +
>
>Please add a comment here about the special meaning of index == 0.

Okay. 

>
>> +		if (likely(pc->cap_user_rdpmc && index)) {
>> +			pmc = rte_pmu_pmc_read(index - 1);
>> +			pmc <<= 64 - width;
>> +			pmc >>= 64 - width;
>> +			offset += pmc;
>> +		}
>> +
>> +		rte_compiler_barrier();
>> +
>> +		if (likely(pc->lock == seq))
>> +			return offset;
>> +	}
>> +
>> +	return 0;
>> +}
>
>[...]
>
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Read hardware counter configured to count occurrences of an event.
>> + *
>> + * @param index
>> + *   Index of an event to be read.
>> + * @return
>> + *   Event value read from register. In case of errors or lack of
>> support
>> + *   0 is returned. In other words, stream of zeros in a trace file
>> + *   indicates problem with reading particular PMU event register.
>> + */
>> +__rte_experimental
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read(unsigned int index)
>> +{
>> +	struct rte_pmu_event_group *group;
>> +	int ret, lcore_id = rte_lcore_id();
>> +
>> +	group = &rte_pmu.group[lcore_id];
>> +	if (unlikely(!group->enabled)) {
>> +		ret = rte_pmu_enable_group(lcore_id);
>> +		if (ret)
>> +			return 0;
>> +
>> +		group->enabled = true;
>
>Group->enabled should be set inside rte_pmu_enable_group(), not here.
>

This is easier to follow imo and not against coding guidelines so I prefer to leave it as is.  

>> +	}
>> +
>> +	if (unlikely(index >= rte_pmu.num_group_events))
>> +		return 0;
>> +
>> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
>> +}
>
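
The retry loop in the quoted rte_pmu_read_userpage() follows a sequence-counter
pattern: snapshot pc->lock, read the fields, then re-check pc->lock and retry if
it changed in between. A minimal sketch of that pattern, assuming a simplified
stand-in structure (counter_page and read_consistent are hypothetical names, not
part of the patch) and using atomic_signal_fence() as a portable compiler barrier
in place of rte_compiler_barrier():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the fields of the mmapped page used by the loop. */
struct counter_page {
	volatile uint32_t lock;   /* sequence number, changed by the writer on update */
	volatile uint64_t offset; /* value published by the writer */
};

/* Return a value of offset that was not torn by a concurrent update. */
static uint64_t
read_consistent(const struct counter_page *pg)
{
	uint32_t seq;
	uint64_t value;

	assert(pg != NULL);
	for (;;) {
		seq = pg->lock;
		/* Compiler barrier: keep the data read after the first lock read. */
		atomic_signal_fence(memory_order_seq_cst);
		value = pg->offset;
		/* Compiler barrier: keep the data read before the lock re-check. */
		atomic_signal_fence(memory_order_seq_cst);
		if (pg->lock == seq)
			return value; /* sequence unchanged, snapshot is consistent */
	}
}
```

Note that the loop only exits through the return inside it, so a trailing
"return 0" after such a loop is never reached.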


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-11 16:20             ` Tomasz Duszynski
@ 2023-01-11 16:54               ` Morten Brørup
  0 siblings, 0 replies; 139+ messages in thread
From: Morten Brørup @ 2023-01-11 16:54 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 11 January 2023 17.21
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Wednesday, January 11, 2023 10:06 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Wednesday, 11 January 2023 00.47
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >[...]
> >
> >> +static int
> >> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int
> >> group_fd)
> >> +{
> >> +	struct perf_event_attr attr = {
> >> +		.size = sizeof(struct perf_event_attr),
> >> +		.type = PERF_TYPE_RAW,
> >> +		.exclude_kernel = 1,
> >> +		.exclude_hv = 1,
> >> +		.disabled = 1,
> >> +	};
> >> +
> >> +	pmu_arch_fixup_config(config);
> >> +
> >> +	attr.config = config[0];
> >> +	attr.config1 = config[1];
> >> +	attr.config2 = config[2];
> >> +
> >> +	return syscall(SYS_perf_event_open, &attr, 0,
> >> rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
> >> +}
> >
> >If SYS_perf_event_open() must be called from the worker thread itself,
> then lcore_id must not be
> >passed as a parameter to do_perf_event_open(). Otherwise, I would
> expect to be able to call
> >do_perf_event_open() from the main thread and pass any lcore_id of a
> worker thread.
> >This comment applies to all functions that must be called from the
> worker thread itself. It also
> >applies to the functions that call such functions.
> >
> 
> Lcore_id is being passed around so that we don't need to call
> rte_lcore_id() each and every time.

Please take a look at the rte_lcore_id() implementation. :-)

Regardless, my argument still stands: If a function cannot be called with the lcore_id parameter set to any valid lcore id, it should not be a parameter to the function.
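
To illustrate the point, here is a minimal sketch of a function that can only act
on the calling thread and therefore takes no lcore_id parameter. It assumes that
the lcore id is available as a cheap per-thread variable; this_lcore_id,
current_lcore_id() and enable_group_on_this_lcore() are hypothetical stand-ins,
not DPDK API:

```c
#include <assert.h>

#define MAX_LCORE 8 /* stand-in for RTE_MAX_LCORE */

/* Stand-in for the per-thread lcore id that a getter like rte_lcore_id() reads. */
static _Thread_local unsigned int this_lcore_id;

/* Per-lcore "group enabled" flags. */
static unsigned int enabled[MAX_LCORE];

static unsigned int
current_lcore_id(void)
{
	return this_lcore_id; /* a single thread-local load */
}

/*
 * No lcore_id parameter: the function always operates on the calling thread,
 * so a caller cannot pass the id of some other lcore by mistake.
 */
static int
enable_group_on_this_lcore(void)
{
	unsigned int lcore_id = current_lcore_id();

	assert(lcore_id < MAX_LCORE);
	enabled[lcore_id] = 1;
	return 0;
}
```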

> 
> >[...]
> >
> >> +/**
> >> + * A structure describing a group of events.
> >> + */
> >> +struct rte_pmu_event_group {
> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
> >> /**< array of user pages */
> >> +	bool enabled; /**< true if group was enabled on particular lcore
> >> */
> >> +};
> >> +
> >> +/**
> >> + * A structure describing an event.
> >> + */
> >> +struct rte_pmu_event {
> >> +	char *name; /** name of an event */
> >> +	unsigned int index; /** event index into fds/mmap_pages */
> >> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> >> +};
> >
> >Move the "enabled" field up, making it the first field in this
> structure. This might reduce the
> >number of instructions required to check (!group->enabled) in
> rte_pmu_read().
> >
> 
> This will be called once and no this will not produce more
> instructions. Why should it?

It seems I did not describe my intention clearly here. rte_pmu_read() is a hot function, where the comparison "if (!group->enabled)" itself will be executed many times.

> In both cases compiler will need to load data at some offset and archs
> do have instructions for that.

Yes, the instructions are: address = BASE + sizeof(struct rte_pmu_event_group) * lcore_id + offsetof(struct rte_pmu_event_group, enabled).

I meant you could avoid the extra instructions stemming from the addition: "+ offsetof()". But you are right... Both BASE and offsetof(struct rte_pmu_event_group, enabled) are known in advance, and can be merged at compile time to avoid the addition.
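
The address computation above can be checked with a small sketch. The structure
below is a simplified stand-in for struct rte_pmu_event_group; NUM_EVENTS,
NUM_LCORES and enabled_address() are assumptions made for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_EVENTS 8 /* stand-in for MAX_NUM_GROUP_EVENTS */
#define NUM_LCORES 4 /* stand-in for RTE_MAX_LCORE */

/* Same field order as the structure in the patch, with enabled last. */
struct event_group {
	int fds[NUM_EVENTS];
	void *mmap_pages[NUM_EVENTS];
	bool enabled;
};

static struct event_group groups[NUM_LCORES];

/* address = BASE + sizeof(group) * lcore_id + offsetof(group, enabled) */
static uintptr_t
enabled_address(unsigned int lcore_id)
{
	assert(lcore_id < NUM_LCORES);
	return (uintptr_t)groups
		+ sizeof(struct event_group) * lcore_id
		+ offsetof(struct event_group, enabled);
}
```

Since the base address of a global array and the offsetof() term are both
constants, only the multiplication by lcore_id depends on a runtime value.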

> 
> >Also, each instance of the structure is used individually per lcore,
> so the structure should be
> >cache line aligned to avoid unnecessarily crossing cache lines.
> >
> >I.e.:
> >
> >struct rte_pmu_event_group {
> >	bool enabled; /**< true if group was enabled on particular lcore */
> >	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >} __rte_cache_aligned;
> 
> Yes, this can be aligned. While at it, I'd be more inclined to move
> mmap_pages up instead of enable.

Yes, moving up mmap_pages is better.
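
A minimal sketch of the layout agreed on here, with mmap_pages first and the
whole structure cache line aligned. It uses C11 alignas in place of
__rte_cache_aligned and assumes a 64-byte cache line; CACHE_LINE, NUM_EVENTS and
group_stride() are hypothetical names:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size */
#define NUM_EVENTS 8  /* stand-in for MAX_NUM_GROUP_EVENTS */

struct event_group {
	/* Aligning the first member aligns the whole structure. */
	alignas(CACHE_LINE) void *mmap_pages[NUM_EVENTS]; /* read by every counter read */
	int fds[NUM_EVENTS];                              /* only needed at setup/teardown */
	bool enabled;
};

/* Distance between consecutive per-lcore entries in an array of groups. */
static size_t
group_stride(void)
{
	static_assert(alignof(struct event_group) == CACHE_LINE,
		      "group must be cache line aligned");
	return sizeof(struct event_group);
}
```

Because sizeof() is rounded up to the alignment, entries of two different lcores
never share a cache line.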

> 
> >
> >> +
> >> +/**
> >> + * A PMU state container.
> >> + */
> >> +struct rte_pmu {
> >> +	char *name; /** name of core PMU listed under
> >> /sys/bus/event_source/devices */
> >> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> >> event group data */
> >> +	unsigned int num_group_events; /**< number of events in a group
> >> */
> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> >> events */
> >> +};
> >> +
> >> +/** Pointer to the PMU state container */
> >> +extern struct rte_pmu rte_pmu;
> >
> >Just "The PMU state container". It is not a pointer anymore. :-)
> >
> 
> Good catch.
> 
> >[...]
> >
> >> +/**
> >> + * @internal
> >> + *
> >> + * Read PMU counter.
> >> + *
> >> + * @param pc
> >> + *   Pointer to the mmapped user page.
> >> + * @return
> >> + *   Counter value read from hardware.
> >> + */
> >> +__rte_internal
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> +{
> >> +	uint64_t width, offset;
> >> +	uint32_t seq, index;
> >> +	int64_t pmc;
> >> +
> >> +	for (;;) {
> >> +		seq = pc->lock;
> >> +		rte_compiler_barrier();
> >> +		index = pc->index;
> >> +		offset = pc->offset;
> >> +		width = pc->pmc_width;
> >> +
> >
> >Please add a comment here about the special meaning of index == 0.
> 
> Okay.
> 
> >
> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> +			pmc <<= 64 - width;
> >> +			pmc >>= 64 - width;
> >> +			offset += pmc;
> >> +		}
> >> +
> >> +		rte_compiler_barrier();
> >> +
> >> +		if (likely(pc->lock == seq))
> >> +			return offset;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >
> >[...]
> >
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read hardware counter configured to count occurrences of an
> event.
> >> + *
> >> + * @param index
> >> + *   Index of an event to be read.
> >> + * @return
> >> + *   Event value read from register. In case of errors or lack of
> >> support
> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> + *   indicates problem with reading particular PMU event register.
> >> + */
> >> +__rte_experimental
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read(unsigned int index)
> >> +{
> >> +	struct rte_pmu_event_group *group;
> >> +	int ret, lcore_id = rte_lcore_id();
> >> +
> >> +	group = &rte_pmu.group[lcore_id];
> >> +	if (unlikely(!group->enabled)) {
> >> +		ret = rte_pmu_enable_group(lcore_id);
> >> +		if (ret)
> >> +			return 0;
> >> +
> >> +		group->enabled = true;
> >
> >Group->enabled should be set inside rte_pmu_enable_group(), not here.
> >
> 
> This is easier to follow imo and not against coding guidelines so I
> prefer to leave it as is.

OK. It makes the rte_pmu_read() source code slightly shorter, but probably has zero effect on the generated code. No strong preference - feel free to follow your personal preference on this.

> 
> >> +	}
> >> +
> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> +		return 0;
> >> +
> >> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> >> +}
> >
> 
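
On the requested comment about index == 0: as far as I understand the
perf_event_mmap_page convention, index is zero when the event is not currently
scheduled on a hardware counter, so only offset is valid; otherwise index - 1 is
the counter to read. The two shifts in the quoted code then sign-extend the raw
value, which is only pmc_width bits wide. A minimal sketch of both steps, with
hypothetical helper names and the raw hardware read passed in as a parameter:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sign-extend a raw counter value that is only `width` bits wide.
 * Same effect as "pmc <<= 64 - width; pmc >>= 64 - width;" but the left shift
 * is done on an unsigned value. The right shift of a negative value is
 * arithmetic on GCC and Clang (implementation-defined in ISO C).
 */
static int64_t
sign_extend_counter(uint64_t raw, unsigned int width)
{
	unsigned int shift;

	assert(width >= 1 && width <= 64);
	shift = 64 - width;
	return (int64_t)(raw << shift) >> shift;
}

/*
 * Combine the offset from the user page with a raw hardware read.
 * index == 0: event is not on a hardware counter, so use offset alone.
 */
static uint64_t
counter_value(uint64_t offset, uint32_t index, uint64_t raw, unsigned int width)
{
	if (index == 0)
		return offset;
	return offset + (uint64_t)sign_extend_counter(raw, width);
}
```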


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  9:39           ` [EXT] " Tomasz Duszynski
@ 2023-01-11 21:05             ` Tyler Retzlaff
  2023-01-13  7:44               ` Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Tyler Retzlaff @ 2023-01-11 21:05 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: bruce.richardson, mb, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin

On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
> Hi Tyler,
> 
> >-----Original Message-----
> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >Sent: Wednesday, January 11, 2023 1:32 AM
> >To: Tomasz Duszynski <tduszynski@marvell.com>; bruce.richardson@intel.com; mb@smartsharesystems.com
> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> >mb@smartsharesystems.com; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
> >Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
> >
> >External Email
> >
> >----------------------------------------------------------------------
> >hi,
> >
> >don't interpret this as an objection to the functionality but this looks like a clear example of
> >something that doesn't belong in the EAL. has there been a discussion as to whether or not this
> >should be in a separate library?
> 
> No, I don't recall anybody having any concerns about the code placement. Rationale behind 
> making this part of eal was based on the fact that tracing itself is a part of eal and
> since this was meant to be extension to tracing, code placement decision came out naturally. 
> 
> During development phase idea evolved a bit and what initially was supposed to be solely yet
> another tracepoint become generic API to read pmu and tracepoint based on that. Which means
> both can be used independently. 
> 
> That said, since this code has both platform agnostic and platform specific parts this can either be split into: 
> 1. library + eal platform code
> 2. all under eal 
> 
> Either approach seems legit. Thoughts?
> 
> >
> >a basic test is whether or not an implementation exists or can be reasonably provided for all
> >platforms and that isn't strictly evident here. red flag is to see yet more code being added
> >conditionally compiled for a single platform.
> 
> Even libs are not entirely pristine and have platform specific ifdefs lurking so not sure where
> this red flag is coming from. 

i think red flag was probably the wrong term to use, sorry for that.
rather i should say it is an indicator that the api probably doesn't
belong in the eal.

fundamentally the purpose of the abstraction library is to relieve the
application from having to do conditional compilation and/or execution for
the subject apis coming from eal. including and exporting apis that work
for only one platform is in direct contradiction.
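
one way to picture this: the library exports the same entry point on every
platform and falls back to a "not supported" result where no backend exists, so
the application needs no #ifdef of its own. a minimal sketch, where pmu_read(),
pmu_read_fn and pmu_backend are hypothetical names and not part of the patch
series:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Signature every platform backend would implement. */
typedef int (*pmu_read_fn)(unsigned int index, uint64_t *value);

/* Default backend for platforms without an implementation. */
static int
pmu_read_unsupported(unsigned int index, uint64_t *value)
{
	(void)index;
	(void)value; /* output is left untouched */
	return -ENOTSUP;
}

/* A platform-specific backend would replace this pointer at init time. */
static pmu_read_fn pmu_backend = pmu_read_unsupported;

/* The single entry point applications call, identical on all platforms. */
static int
pmu_read(unsigned int index, uint64_t *value)
{
	assert(value != NULL);
	return pmu_backend(index, value);
}
```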

please explore adding this as a separate library, it is understood that
there are tradeoffs involved.

thanks!


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11 21:05             ` Tyler Retzlaff
@ 2023-01-13  7:44               ` Tomasz Duszynski
  2023-01-13 19:22                 ` Tyler Retzlaff
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-13  7:44 UTC (permalink / raw)
  To: Tyler Retzlaff
  Cc: bruce.richardson, mb, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin



>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Wednesday, January 11, 2023 10:06 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: bruce.richardson@intel.com; mb@smartsharesystems.com; dev@dpdk.org; thomas@monjalon.net; Jerin
>Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com;
>zhoumin@loongson.cn
>Subject: Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
>
>On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
>> Hi Tyler,
>>
>> >-----Original Message-----
>> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>> >Sent: Wednesday, January 11, 2023 1:32 AM
>> >To: Tomasz Duszynski <tduszynski@marvell.com>;
>> >bruce.richardson@intel.com; mb@smartsharesystems.com
>> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran
>> ><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
>> >mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>> >Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
>> >
>> >External Email
>> >
>> >---------------------------------------------------------------------
>> >-
>> >hi,
>> >
>> >don't interpret this as an objection to the functionality but this
>> >looks like a clear example of something that doesn't belong in the
>> >EAL. has there been a discussion as to whether or not this should be in a separate library?
>>
>> No, I don't recall anybody having any concerns about the code
>> placement. Rationale behind making this part of eal was based on the
>> fact that tracing itself is a part of eal and since this was meant to be extension to tracing,
>code placement decision came out naturally.
>>
>> During development phase idea evolved a bit and what initially was
>> supposed to be solely yet another tracepoint become generic API to
>> read pmu and tracepoint based on that. Which means both can be used independently.
>>
>> That said, since this code has both platform agnostic and platform specific parts this can either
>be split into:
>> 1. library + eal platform code
>> 2. all under eal
>>
>> Either approach seems legit. Thoughts?
>>
>> >
>> >a basic test is whether or not an implementation exists or can be
>> >reasonably provided for all platforms and that isn't strictly evident
>> >here. red flag is to see yet more code being added conditionally compiled for a single platform.
>>
>> Even libs are not entirely pristine and have platform specific ifdefs
>> lurking so not sure where this red flag is coming from.
>
>i think red flag was probably the wrong term to use sorry for that.
>rather i should say it is an indicator that the api probably doesn't belong in the eal.
>
>fundamentally the purpose of the abstraction library is to relieve the application from having to
>do conditional compilation and/or execution for the subject apis coming from eal. including and
>exporting apis that work for only one platform is in direct contradiction.
>
>please explore adding this as a separate library, it is understood that there are tradeoffs
>involved.
>
>thanks!

Any ideas how to name the library?


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-13  7:44               ` Tomasz Duszynski
@ 2023-01-13 19:22                 ` Tyler Retzlaff
  2023-01-14  9:53                   ` Morten Brørup
  0 siblings, 1 reply; 139+ messages in thread
From: Tyler Retzlaff @ 2023-01-13 19:22 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: bruce.richardson, mb, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin

On Fri, Jan 13, 2023 at 07:44:57AM +0000, Tomasz Duszynski wrote:
> 
> 
> >-----Original Message-----
> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >Sent: Wednesday, January 11, 2023 10:06 PM
> >To: Tomasz Duszynski <tduszynski@marvell.com>
> >Cc: bruce.richardson@intel.com; mb@smartsharesystems.com; dev@dpdk.org; thomas@monjalon.net; Jerin
> >Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com;
> >zhoumin@loongson.cn
> >Subject: Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
> >
> >On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
> >> Hi Tyler,
> >>
> >> >-----Original Message-----
> >> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >> >Sent: Wednesday, January 11, 2023 1:32 AM
> >> >To: Tomasz Duszynski <tduszynski@marvell.com>;
> >> >bruce.richardson@intel.com; mb@smartsharesystems.com
> >> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran
> >> ><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
> >> >mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
> >> >Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
> >> >
> >> >External Email
> >> >
> >> >---------------------------------------------------------------------
> >> >-
> >> >hi,
> >> >
> >> >don't interpret this as an objection to the functionality but this
> >> >looks like a clear example of something that doesn't belong in the
> >> >EAL. has there been a discussion as to whether or not this should be in a separate library?
> >>
> >> No, I don't recall anybody having any concerns about the code
> >> placement. Rationale behind making this part of eal was based on the
> >> fact that tracing itself is a part of eal and since this was meant to be extension to tracing,
> >code placement decision came out naturally.
> >>
> >> During development phase idea evolved a bit and what initially was
> >> supposed to be solely yet another tracepoint become generic API to
> >> read pmu and tracepoint based on that. Which means both can be used independently.
> >>
> >> That said, since this code has both platform agnostic and platform specific parts this can either
> >be split into:
> >> 1. library + eal platform code
> >> 2. all under eal
> >>
> >> Either approach seems legit. Thoughts?
> >>
> >> >
> >> >a basic test is whether or not an implementation exists or can be
> >> >reasonably provided for all platforms and that isn't strictly evident
> >> >here. red flag is to see yet more code being added conditionally compiled for a single platform.
> >>
> >> Even libs are not entirely pristine and have platform specific ifdefs
> >> lurking so not sure where this red flag is coming from.
> >
> >i think red flag was probably the wrong term to use sorry for that.
> >rather i should say it is an indicator that the api probably doesn't belong in the eal.
> >
> >fundamentally the purpose of the abstraction library is to relieve the application from having to
> >do conditional compilation and/or execution for the subject apis coming from eal. including and
> >exporting apis that work for only one platform is in direct contradiction.
> >
> >please explore adding this as a separate library, it is understood that there are tradeoffs
> >involved.
> >
> >thanks!
> 
> Any ideas how to name the library?

naming is always so hard and i'm definitely not authoritative.

it seems like lib/pmu would be the least churn against your existing
patch series, here are some other suggestions that might work.

lib/pmu (measuring unit)
lib/pmc (measuring counters)
lib/pcq (counter query)
lib/pmq (measuring query)


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-13 19:22                 ` Tyler Retzlaff
@ 2023-01-14  9:53                   ` Morten Brørup
  0 siblings, 0 replies; 139+ messages in thread
From: Morten Brørup @ 2023-01-14  9:53 UTC (permalink / raw)
  To: Tyler Retzlaff, Tomasz Duszynski
  Cc: bruce.richardson, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin


> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Friday, 13 January 2023 20.22
> 
> On Fri, Jan 13, 2023 at 07:44:57AM +0000, Tomasz Duszynski wrote:
> >
> > >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> > >Sent: Wednesday, January 11, 2023 10:06 PM
> > >
> > >On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
> > >> Hi Tyler,
> > >>
> > >> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> > >> >Sent: Wednesday, January 11, 2023 1:32 AM
> > >> >
> > >> >hi,
> > >> >
> > >> >don't interpret this as an objection to the functionality but
> this
> > >> >looks like a clear example of something that doesn't belong in
> the
> > >> >EAL. has there been a discussion as to whether or not this should
> be in a separate library?
> > >>
> > >> No, I don't recall anybody having any concerns about the code
> > >> placement. Rationale behind making this part of eal was based on
> the
> > >> fact that tracing itself is a part of eal and since this was meant
> to be extension to tracing,
> > >code placement decision came out naturally.
> > >>
> > >> During development phase idea evolved a bit and what initially was
> > >> supposed to be solely yet another tracepoint become generic API to
> > >> read pmu and tracepoint based on that. Which means both can be
> used independently.
> > >>
> > >> That said, since this code has both platform agnostic and platform
> specific parts this can either
> > >be split into:
> > >> 1. library + eal platform code
> > >> 2. all under eal
> > >>
> > >> Either approach seems legit. Thoughts?
> > >>
> > >> >
> > >> >a basic test is whether or not an implementation exists or can be
> > >> >reasonably provided for all platforms and that isn't strictly
> evident
> > >> >here. red flag is to see yet more code being added conditionally
> compiled for a single platform.
> > >>
> > >> Even libs are not entirely pristine and have platform specific
> ifdefs
> > >> lurking so not sure where this red flag is coming from.
> > >
> > >i think red flag was probably the wrong term to use sorry for that.
> > >rather i should say it is an indicator that the api probably doesn't
> belong in the eal.
> > >
> > >fundamentally the purpose of the abstraction library is to relieve
> the application from having to
> > >do conditional compilation and/or execution for the subject apis
> coming from eal. including and
> > >exporting apis that work for only one platform is in direct
> contradiction.
> > >
> > >please explore adding this as a separate library, it is understood
> that there are tradeoffs
> > >involved.
> > >
> > >thanks!
> >
> > Any ideas how to name the library?
> 
> naming is always so hard and i'm definitely not authoritative.
> 
> it seems like lib/pmu would be the least churn against your existing
> patch series, here are some other suggestions that might work.

+1 to lib/pmu

Less work, as Tyler already mentioned. Furthermore:

Both Intel and ARM use the term Performance Monitoring Unit (abbreviated PMU).

Microsoft does too [1].

[1]: https://learn.microsoft.com/en-us/windows-hardware/test/wpt/recording-pmu-events

RISC-V uses the term Hardware Performance Monitor (abbreviated HPM).
I haven't checked other CPU vendors.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v6 0/4] add support for self monitoring
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (4 preceding siblings ...)
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
@ 2023-01-19 23:39         ` Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                             ` (4 more replies)
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
  2023-02-16 20:56         ` [PATCH v5 0/4] add support for self monitoring Liang Ma
  7 siblings, 5 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
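
To see which names are available on a given machine, one can walk that
directory. The snippet below is only a rough sketch and not part of the
series: list_pmu_events() is a hypothetical helper, and the directory it is
given is assumed to follow the sysfs layout above. Entries containing a dot
are skipped because some PMUs also publish <event>.scale and <event>.unit
files next to the event itself.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: count (and optionally print) the event names found in
 * an events directory such as /sys/bus/event_source/devices/<PMU>/events.
 * Returns the number of events, or -1 if the directory cannot be opened.
 */
static int
list_pmu_events(const char *events_dir, int print)
{
	struct dirent *dent;
	int count = 0;
	DIR *dirp;

	dirp = opendir(events_dir);
	if (dirp == NULL)
		return -1;

	while ((dent = readdir(dirp)) != NULL) {
		/* skip ".", ".." and auxiliary files like <event>.unit */
		if (strchr(dent->d_name, '.') != NULL)
			continue;
		if (print)
			printf("%s\n", dent->d_name);
		count++;
	}
	closedir(dirp);

	return count;
}
```

Any name printed this way would be a candidate argument for
rte_pmu_add_event().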

v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   4 +
 app/test/test_pmu.c                      |  48 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   7 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   3 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 +++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 497 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 226 +++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  20 +
 23 files changed, 1100 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1



* [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-01-20  9:46             ` Morten Brørup
  2023-01-20 18:29             ` Tyler Retzlaff
  2023-01-19 23:39           ` [PATCH v6 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                             ` (3 subsequent siblings)
  4 siblings, 2 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   4 +
 app/test/test_pmu.c                    |  42 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   7 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 206 ++++++++++++
 lib/pmu/version.map                    |  19 ++
 13 files changed, 773 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..b2c2a618b1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
     test_sources += ['test_metrics.c']
     fast_tests += [['metrics_autotest', true, true]]
 endif
+if is_linux
+    test_sources += ['test_pmu.c']
+    fast_tests += [['pmu_autotest', true, true]]
+endif
 if not is_windows and dpdk_conf.has('RTE_LIB_TELEMETRY')
     test_sources += ['test_telemetry_json.c', 'test_telemetry_data.c']
     fast_tests += [['telemetry_json_autotest', true, true]]
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..7c3cf18ed9
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being one example. In some scenarios, however, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 92ec1e4b88..f43bd62376 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -57,6 +57,13 @@ New Features
 
 * **Added multi-process support for axgbe PMD.**
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 
 Removed Items
 -------------
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..f8369b9dc7
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(lcore_id);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(lcore_id);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL)
+			return -ENOMEM;
+	}
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit, e.g. in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	unsigned int i;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free(event->name);
+		free(event);
+	}
+
+	for (i = 0; i < rte_pmu.num_group_events; i++)
+		cleanup_events(i);
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..42c764fa9e
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,206 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to set up PMU and
+ * read selected counters at runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_lcore.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	group = &rte_pmu.group[lcore_id];
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..e15e21156a
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,19 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1
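
How get_term_format() and parse_event() in the patch above turn a sysfs term
into bits of config[] may be easier to follow with a small standalone sketch.
The helper names below (bit_mask, parse_term_format, field_prep) are made up
for this illustration, and a fixed-size buffer is used instead of the %m
allocation in the patch. Only the format strings, such as "config:0-7", follow
the sysfs convention.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mask with bits low..high (inclusive) set; assumes 0 <= low <= high <= 63. */
static uint64_t
bit_mask(int high, int low)
{
	return (~0ULL << low) & (~0ULL >> (63 - high));
}

/*
 * Parse a format string such as "config:0-7" or "config1:32". On success
 * store the config word index (0, 1 or 2) and the bit mask of the field.
 */
static int
parse_term_format(const char *fmt, int *num, uint64_t *mask)
{
	int low, high, ret;
	char word[16];
	char last;

	ret = sscanf(fmt, "%15[^:]:%d-%d", word, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single-bit field */
	if (low < 0 || high > 63 || low > high)
		return -1;

	/* a trailing digit selects config1/config2, plain "config" means 0 */
	last = word[strlen(word) - 1];
	*num = (last >= '0' && last <= '9') ? last - '0' : 0;
	if (*num > 2)
		return -1;

	*mask = bit_mask(high, low);

	return 0;
}

/* Shift val up to the lowest set bit of mask and truncate it to the field. */
static uint64_t
field_prep(uint64_t mask, uint64_t val)
{
	if (mask == 0)
		return 0;

	return (val << __builtin_ctzll(mask)) & mask;
}
```

On x86 the event term is typically described as config:0-7 and umask as
config:8-15, so under this scheme an event string like event=0x3c,umask=0x00
would end up as config[0] = 0x3c.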

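The retry loop in rte_pmu_read_userpage() follows the usual sequence-counter
pattern: sample the lock word, read the fields, then start over if the lock
word changed in the meantime. A minimal single-file sketch of that pattern is
shown below. struct user_page is a hypothetical stand-in for the few fields of
struct perf_event_mmap_page used here, and fake_hw_read() replaces the
architecture specific counter read so that the example runs anywhere.

```c
#include <stdint.h>

/* Hypothetical stand-in for the perf user page fields used by the reader. */
struct user_page {
	volatile uint32_t lock; /* sequence counter bumped by the writer */
	uint32_t index;         /* hardware counter index + 1, 0 if unusable */
	int64_t offset;         /* value accumulated by the kernel so far */
};

#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Stub for the architecture specific counter read (rdpmc, mrs, ...). */
static uint64_t
fake_hw_read(uint32_t index)
{
	return 100 + index;
}

static uint64_t
read_counter(const struct user_page *pg, uint64_t (*read_hw)(uint32_t))
{
	uint32_t seq, index;
	uint64_t value;

	do {
		seq = pg->lock;
		compiler_barrier();

		index = pg->index;
		value = (uint64_t)pg->offset;
		/* index 0 means the counter cannot be read from userspace */
		if (index != 0)
			value += read_hw(index - 1);

		compiler_barrier();
		/* retry if the writer updated the page while we were reading */
	} while (pg->lock != seq);

	return value;
}
```

The sketch leaves out the sign extension to pmc_width and the cap_user_rdpmc
check that the patch performs.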


* [PATCH v6 2/4] pmu: support reading ARM PMU events in runtime
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 7c3cf18ed9..4cdc71791e 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ] = { 0 };
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 42c764fa9e..4808d90eb9 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_lcore.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1
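
The save/enable/restore handling of perf_user_access in pmu_arch_init() and
pmu_arch_fini() can be tried out against an ordinary file. The following is a
sketch under that assumption and not the code from the patch: the helper names
are hypothetical, the path is passed in as a parameter, the buffer is NUL
terminated before parsing, and errno is saved before close() has a chance to
overwrite it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int
read_int_attr(const char *path, int *val)
{
	char buf[32];
	int fd, err;
	ssize_t n;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	n = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* keep the read() error, close() may change errno */
	close(fd);
	if (n < 0)
		return -err;

	buf[n] = '\0';
	*val = (int)strtol(buf, NULL, 10);

	return 0;
}

static int
write_int_attr(const char *path, int val)
{
	int fd, len, err;
	char buf[32];
	ssize_t n;

	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd == -1)
		return -errno;

	len = snprintf(buf, sizeof(buf), "%d", val);
	n = write(fd, buf, (size_t)len);
	err = errno;
	close(fd);
	if (n < 0)
		return -err;

	return n == len ? 0 : -EIO;
}

/* Enable the setting and remember the previous value in *saved. */
static int
enable_and_save(const char *path, int *saved)
{
	int ret;

	ret = read_int_attr(path, saved);
	if (ret)
		return ret;

	return *saved == 1 ? 0 : write_int_attr(path, 1);
}
```

Restoring is then a matter of calling write_int_attr() with the saved value
once counters are no longer being read.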



* [PATCH v6 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 4cdc71791e..dc7a9cdb27 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -17,6 +17,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 4808d90eb9..617732361c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v6 4/4] eal: add PMU support to tracing library
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
                             ` (2 preceding siblings ...)
  2023-01-19 23:39           ` [PATCH v6 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: thomas, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

In order to profile an app one needs to store a significant number of samples
somewhere for later analysis. Since the trace library supports storing data
in CTF format, let's take advantage of that and add a dedicated PMU
tracepoint.
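
As an illustration of how the event list that accompanies the tracepoint can
be pulled out of a --trace argument, here is a minimal standalone sketch. It
reuses the regular expression from rte_pmu_add_events_by_pattern() in this
patch; the helper name extract_event_list() and its return convention are
hypothetical and not part of the series.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): copy the comma-separated
 * event list that follows the first "e=" in a --trace argument into out.
 * Returns 0 on success, -1 if no list is found or out is too small.
 */
static int
extract_event_list(const char *arg, char *out, size_t outsz)
{
	regmatch_t m;
	regex_t re;
	size_t len;
	int rc = -1;

	assert(arg != NULL && out != NULL);

	/* same expression as used by rte_pmu_add_events_by_pattern() */
	if (regcomp(&re, "e=([_[:alnum:]-],?)+", REG_EXTENDED) != 0)
		return -1;

	if (regexec(&re, arg, 1, &m, 0) == 0) {
		/* skip the two-character "e=" prefix */
		len = (size_t)(m.rm_eo - m.rm_so) - 2;
		if (len < outsz) {
			memcpy(out, arg + m.rm_so + 2, len);
			out[len] = '\0';
			rc = 0;
		}
	}

	regfree(&re);
	return rc;
}
```

Unlike the patch, this sketch stops at the first match and reports a too-small
buffer as an error instead of truncating; both choices are for brevity only.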

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  3 ++
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 159 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers the dedicated
+tracepoint ``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..a8e97ee1ec 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` listing the particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either ``cpu`` or a directory
+containing a ``cpus`` file.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order
+of events specified via the ``--trace`` parameter. In the following example index
+``0`` corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 5caaac8e59..3631d0032b 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..1e46ce549a 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..afb459b198 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +282,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..eddb45bebf 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,9 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index f8369b9dc7..3241b8c748 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -375,6 +375,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret = 0;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 617732361c..f642b721e8 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -166,6 +166,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index e15e21156a..4646eefd2b 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -7,6 +7,7 @@ EXPERIMENTAL {
 
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-20  9:46             ` Morten Brørup
  2023-01-26  9:40               ` Tomasz Duszynski
  2023-01-20 18:29             ` Tyler Retzlaff
  1 sibling, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2023-01-20  9:46 UTC (permalink / raw)
  To: Tomasz Duszynski, dev, Thomas Monjalon
  Cc: jerinj, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Friday, 20 January 2023 00.39
> 
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---

If you insist on passing lcore_id around as a function parameter, the function description must mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a requirement, including all functions that use those functions.

Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use rte_lcore_id() instead.
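
A minimal sketch of the difference between the two options. All names here
(sketch_lcore_id(), counters[], read_with_param(), read_current()) are
hypothetical stand-ins that only mimic rte_lcore_id() and per-lcore state;
none of them come from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LCORE 4 /* stand-in for RTE_MAX_LCORE */

/* Stand-in for rte_lcore_id(); a real build would include <rte_lcore.h>. */
static unsigned int current_lcore;

static unsigned int
sketch_lcore_id(void)
{
	return current_lcore;
}

/* Hypothetical per-lcore state, indexed by lcore id. */
static uint64_t counters[MAX_LCORE];

/*
 * Option 1: the caller passes lcore_id. The documentation then has to
 * state that it must equal the id of the lcore executing the call.
 */
static uint64_t
read_with_param(unsigned int lcore_id)
{
	assert(lcore_id < MAX_LCORE);
	return counters[lcore_id];
}

/*
 * Option 2: no parameter. The id is looked up internally, so a caller
 * cannot pass a mismatching value.
 */
static uint64_t
read_current(void)
{
	return read_with_param(sketch_lcore_id());
}
```

With the second form the requirement is enforced by construction rather than
by documentation.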



^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-01-20  9:46             ` Morten Brørup
@ 2023-01-20 18:29             ` Tyler Retzlaff
  2023-01-26  9:05               ` [EXT] " Tomasz Duszynski
  1 sibling, 1 reply; 139+ messages in thread
From: Tyler Retzlaff @ 2023-01-20 18:29 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, Thomas Monjalon, jerinj, mb, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson

On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   4 +
>  app/test/test_pmu.c                    |  42 +++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |   8 +
>  doc/guides/rel_notes/release_23_03.rst |   7 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  29 ++
>  lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 206 ++++++++++++
>  lib/pmu/version.map                    |  19 ++
>  13 files changed, 773 insertions(+), 1 deletion(-)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/pmu/meson.build
>  create mode 100644 lib/pmu/pmu_private.h
>  create mode 100644 lib/pmu/rte_pmu.c
>  create mode 100644 lib/pmu/rte_pmu.h
>  create mode 100644 lib/pmu/version.map
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9a0f416d2e..9f13eafd95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>  F: lib/node/
>  
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski <tduszynski@marvell.com>
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
>  
>  Test Applications
>  -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..b2c2a618b1 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
>      test_sources += ['test_metrics.c']
>      fast_tests += [['metrics_autotest', true, true]]
>  endif
> +if is_linux
> +    test_sources += ['test_pmu.c']
> +    fast_tests += [['pmu_autotest', true, true]]
> +endif

traditionally we don't conditionally include tests at the meson.build
level, instead we run all tests and have them skip when executed for
unsupported exec environments.

you can take a look at test_eventdev.c as an example for a test that is
skipped on windows, i'm sure it could be adapted to skip on freebsd if
you aren't supporting it.
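
a rough sketch of that pattern, assuming stand-in result codes
(SKETCH_TEST_SUCCESS / SKETCH_TEST_SKIPPED are hypothetical names and values,
used here in place of the test framework's own definitions) and using the
compiler's __linux__ macro in place of RTE_EXEC_ENV_LINUX:

```c
/* Hypothetical stand-ins for the test framework's result codes. */
#define SKETCH_TEST_SUCCESS 0
#define SKETCH_TEST_SKIPPED 77

/*
 * The test is compiled on every platform; unsupported environments
 * report "skipped" at run time instead of being excluded from the build.
 */
static int
test_pmu_sketch(void)
{
#ifndef __linux__
	return SKETCH_TEST_SKIPPED;
#else
	/* the real test body would run here */
	return SKETCH_TEST_SUCCESS;
#endif
}
```

the meson.build entry then needs no is_linux condition.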


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH 0/2] add platform bus
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (5 preceding siblings ...)
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
@ 2023-01-25 10:33         ` Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
                             ` (2 more replies)
  2023-02-16 20:56         ` [PATCH v5 0/4] add support for self monitoring Liang Ma
  7 siblings, 3 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:33 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, stephen, chenbo.xia, Tomasz Duszynski

The platform bus is a Linux bus that manages devices which have no
built-in discovery mechanism. Linux learns about platform devices
directly from the device tree during boot.

Afterwards, if userspace wants to use a particular device, a driver that
is usually a mixture of vdev and rawdev gets developed.

To simplify that, introduce a DPDK platform bus which provides an
auto-probe experience and separates the bus logic from the driver itself.

For now, only devices backed by the vfio-platform kernel driver
are supported, though other options may be added if necessary.

Tomasz Duszynski (2):
  lib: add helper to read strings from sysfs files
  bus: add platform bus

 MAINTAINERS                                |   4 +
 app/test/test_eal_fs.c                     | 108 +++-
 doc/guides/rel_notes/release_23_03.rst     |   5 +
 drivers/bus/meson.build                    |   1 +
 drivers/bus/platform/bus_platform_driver.h | 174 ++++++
 drivers/bus/platform/meson.build           |  16 +
 drivers/bus/platform/platform.c            | 604 +++++++++++++++++++++
 drivers/bus/platform/platform_params.c     |  70 +++
 drivers/bus/platform/private.h             |  48 ++
 drivers/bus/platform/version.map           |  10 +
 lib/eal/common/eal_filesystem.h            |   6 +
 lib/eal/unix/eal_filesystem.c              |  24 +-
 lib/eal/version.map                        |   1 +
 13 files changed, 1053 insertions(+), 18 deletions(-)
 create mode 100644 drivers/bus/platform/bus_platform_driver.h
 create mode 100644 drivers/bus/platform/meson.build
 create mode 100644 drivers/bus/platform/platform.c
 create mode 100644 drivers/bus/platform/platform_params.c
 create mode 100644 drivers/bus/platform/private.h
 create mode 100644 drivers/bus/platform/version.map

--
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
@ 2023-01-25 10:33           ` Tomasz Duszynski
  2023-01-25 10:39             ` Thomas Monjalon
  2023-01-25 10:33           ` [PATCH 2/2] bus: add platform bus Tomasz Duszynski
  2023-01-25 10:41           ` [PATCH 0/2] " Tomasz Duszynski
  2 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:33 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, stephen, chenbo.xia, Tomasz Duszynski

Reading strings from sysfs files is a recurring pattern,
hence add a helper for doing that.
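
For readers skimming the series, a minimal standalone sketch of the pattern
being factored out. The name read_first_line() is hypothetical; the helper
added by this patch is eal_parse_sysfs_string(), which additionally logs
failures through RTE_LOG.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for the helper added by this patch: read the
 * first line of a file (at most size - 1 bytes, newline included) into
 * str. Returns 0 on success, -1 on failure.
 */
static int
read_first_line(const char *filename, char *str, size_t size)
{
	FILE *f;

	assert(filename != NULL && str != NULL && size > 0);

	f = fopen(filename, "r");
	if (f == NULL)
		return -1;

	if (fgets(str, (int)size, f) == NULL) {
		fclose(f);
		return -1;
	}

	fclose(f);
	return 0;
}
```

Note that the trailing newline is kept, so a caller comparing against a
fixed string has to account for it.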

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_eal_fs.c          | 108 ++++++++++++++++++++++++++++----
 lib/eal/common/eal_filesystem.h |   6 ++
 lib/eal/unix/eal_filesystem.c   |  24 ++++---
 lib/eal/version.map             |   1 +
 4 files changed, 121 insertions(+), 18 deletions(-)

diff --git a/app/test/test_eal_fs.c b/app/test/test_eal_fs.c
index b3686edcb4..6c373fc7f1 100644
--- a/app/test/test_eal_fs.c
+++ b/app/test/test_eal_fs.c
@@ -20,12 +20,33 @@ test_eal_fs(void)
 
 #else
 
+static int
+temp_create(char *filename, size_t len)
+{
+	char file_template[] = "/tmp/eal_test_XXXXXX";
+	char proc_path[PATH_MAX];
+	int fd;
+
+	fd = mkstemp(file_template);
+	if (fd == -1) {
+		perror("mkstemp() failure");
+		return -1;
+	}
+
+	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
+	if (readlink(proc_path, filename, len) < 0) {
+		perror("readlink() failure");
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
 static int
 test_parse_sysfs_value(void)
 {
 	char filename[PATH_MAX] = "";
-	char proc_path[PATH_MAX];
-	char file_template[] = "/tmp/eal_test_XXXXXX";
 	int tmp_file_handle = -1;
 	FILE *fd = NULL;
 	unsigned valid_number;
@@ -40,16 +61,10 @@ test_parse_sysfs_value(void)
 
 	/* get a temporary filename to use for all tests - create temp file handle and then
 	 * use /proc to get the actual file that we can open */
-	tmp_file_handle = mkstemp(file_template);
-	if (tmp_file_handle == -1) {
-		perror("mkstemp() failure");
+	tmp_file_handle = temp_create(filename, sizeof(filename));
+	if (tmp_file_handle < 0)
 		goto error;
-	}
-	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", tmp_file_handle);
-	if (readlink(proc_path, filename, sizeof(filename)) < 0) {
-		perror("readlink() failure");
-		goto error;
-	}
+
 	printf("Temporary file is: %s\n", filename);
 
 	/* test we get an error value if we use file before it's created */
@@ -175,11 +190,82 @@ test_parse_sysfs_value(void)
 	return -1;
 }
 
+static int
+test_parse_sysfs_string(void)
+{
+	const char *teststr = "the quick brown dog jumps over the lazy fox\n";
+	char filename[PATH_MAX] = "";
+	char buf[BUFSIZ] = { };
+	int tmp_file_handle;
+	FILE *fd = NULL;
+
+#ifdef RTE_EXEC_ENV_FREEBSD
+	/* BSD doesn't have /proc/pid/fd */
+	return 0;
+#endif
+	printf("Testing function eal_parse_sysfs_string()\n");
+
+	/* get a temporary filename to use for all tests - create temp file handle and then
+	 * use /proc to get the actual file that we can open
+	 */
+	tmp_file_handle = temp_create(filename, sizeof(filename));
+	if (tmp_file_handle < 0)
+		goto error;
+
+	printf("Temporary file is: %s\n", filename);
+
+	/* test we get an error value if the file does not exist */
+	printf("Test reading a missing file ...\n");
+	if (eal_parse_sysfs_string("/dev/not-quite-null", buf, sizeof(buf)) == 0) {
+		printf("Error with eal_parse_sysfs_string() - returned success on reading missing file\n");
+		goto error;
+	}
+	printf("Confirmed return error when reading missing file\n");
+
+	/* test reading a string from file */
+	printf("Test reading string ...\n");
+	fd = fopen(filename, "w");
+	if (fd == NULL) {
+		printf("line %d, Error opening %s: %s\n", __LINE__, filename, strerror(errno));
+		goto error;
+	}
+	fprintf(fd, "%s", teststr);
+	fclose(fd);
+	fd = NULL;
+	if (eal_parse_sysfs_string(filename, buf, sizeof(buf) - 1) < 0) {
+		printf("eal_parse_sysfs_string() returned error - test failed\n");
+		goto error;
+	}
+	if (strcmp(teststr, buf)) {
+		printf("Invalid string read by eal_parse_sysfs_string() - test failed\n");
+		goto error;
+	}
+	/* don't print newline */
+	buf[strlen(buf) - 1] = '\0';
+	printf("Read '%s\\n' ok\n", buf);
+
+	close(tmp_file_handle);
+	unlink(filename);
+	printf("eal_parse_sysfs_string() - OK\n");
+	return 0;
+
+error:
+	if (fd)
+		fclose(fd);
+	if (tmp_file_handle >= 0)
+		close(tmp_file_handle);
+	if (filename[0] != '\0')
+		unlink(filename);
+	return -1;
+}
+
 static int
 test_eal_fs(void)
 {
 	if (test_parse_sysfs_value() < 0)
 		return -1;
+	if (test_parse_sysfs_string() < 0)
+		return -1;
 	return 0;
 }
 
diff --git a/lib/eal/common/eal_filesystem.h b/lib/eal/common/eal_filesystem.h
index 5d21f07c20..ac6449f529 100644
--- a/lib/eal/common/eal_filesystem.h
+++ b/lib/eal/common/eal_filesystem.h
@@ -104,4 +104,10 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id
  * Used to read information from files on /sys */
 int eal_parse_sysfs_value(const char *filename, unsigned long *val);
 
+/** Function to read a string from a file on the filesystem.
+ * Used to read information from files in /sys
+ */
+__rte_internal
+int eal_parse_sysfs_string(const char *filename, char *str, size_t size);
+
 #endif /* EAL_FILESYSTEM_H */
diff --git a/lib/eal/unix/eal_filesystem.c b/lib/eal/unix/eal_filesystem.c
index afbab9368a..8ed10094be 100644
--- a/lib/eal/unix/eal_filesystem.c
+++ b/lib/eal/unix/eal_filesystem.c
@@ -76,12 +76,9 @@ int eal_create_runtime_dir(void)
 	return 0;
 }
 
-/* parse a sysfs (or other) file containing one integer value */
-int eal_parse_sysfs_value(const char *filename, unsigned long *val)
+int eal_parse_sysfs_string(const char *filename, char *str, size_t size)
 {
 	FILE *f;
-	char buf[BUFSIZ];
-	char *end = NULL;
 
 	if ((f = fopen(filename, "r")) == NULL) {
 		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs value %s\n",
@@ -89,19 +86,32 @@ int eal_parse_sysfs_value(const char *filename, unsigned long *val)
 		return -1;
 	}
 
-	if (fgets(buf, sizeof(buf), f) == NULL) {
+	if (fgets(str, size, f) == NULL) {
 		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs value %s\n",
 			__func__, filename);
 		fclose(f);
 		return -1;
 	}
+	fclose(f);
+	return 0;
+}
+
+/* parse a sysfs (or other) file containing one integer value */
+int eal_parse_sysfs_value(const char *filename, unsigned long *val)
+{
+	char buf[BUFSIZ];
+	char *end = NULL;
+	int ret;
+
+	ret = eal_parse_sysfs_string(filename, buf, sizeof(buf));
+	if (ret < 0)
+		return ret;
+
 	*val = strtoul(buf, &end, 0);
 	if ((buf[0] == '\0') || (end == NULL) || (*end != '\n')) {
 		RTE_LOG(ERR, EAL, "%s(): cannot parse sysfs value %s\n",
 				__func__, filename);
-		fclose(f);
 		return -1;
 	}
-	fclose(f);
 	return 0;
 }
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..9118bb6228 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -445,6 +445,7 @@ EXPERIMENTAL {
 INTERNAL {
 	global:
 
+	eal_parse_sysfs_string; # WINDOWS_NO_EXPORT
 	rte_bus_register;
 	rte_bus_unregister;
 	rte_eal_get_baseaddr;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH 2/2] bus: add platform bus
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
@ 2023-01-25 10:33           ` Tomasz Duszynski
  2023-01-25 10:41           ` [PATCH 0/2] " Tomasz Duszynski
  2 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:33 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski; +Cc: jerinj, stephen, chenbo.xia

The platform bus is a software bus under Linux that manages devices which
generally do not have built-in discovery mechanisms. Linux normally
learns about platform devices directly from the device tree during
the boot-up phase.

Up to this point, whenever a userspace app needed control over
a platform device, or a range thereof, some sort of driver being
a mixture of vdev/rawdev was required.

In order to simplify this task, provide an auto-probe
experience and separate bus logic from the driver itself,
add platform bus support.

Currently only devices backed by the vfio-platform kernel driver
are supported.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 MAINTAINERS                                |   4 +
 doc/guides/rel_notes/release_23_03.rst     |   5 +
 drivers/bus/meson.build                    |   1 +
 drivers/bus/platform/bus_platform_driver.h | 174 ++++++
 drivers/bus/platform/meson.build           |  16 +
 drivers/bus/platform/platform.c            | 604 +++++++++++++++++++++
 drivers/bus/platform/platform_params.c     |  70 +++
 drivers/bus/platform/private.h             |  48 ++
 drivers/bus/platform/version.map           |  10 +
 9 files changed, 932 insertions(+)
 create mode 100644 drivers/bus/platform/bus_platform_driver.h
 create mode 100644 drivers/bus/platform/meson.build
 create mode 100644 drivers/bus/platform/platform.c
 create mode 100644 drivers/bus/platform/platform_params.c
 create mode 100644 drivers/bus/platform/private.h
 create mode 100644 drivers/bus/platform/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..b02666710c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -581,6 +581,10 @@ VDEV bus driver
 F: drivers/bus/vdev/
 F: app/test/test_vdev.c
 
+Platform bus driver
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: drivers/bus/platform
+
 VMBUS bus driver
 M: Long Li <longli@microsoft.com>
 F: drivers/bus/vmbus/
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 84b112a8b1..74b2b1e3ff 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -57,6 +57,11 @@ New Features
 
 * **Added multi-process support for axgbe PMD.**
 
+* **Added platform bus support.**
+
+  A platform bus provides a way to use Linux platform devices which
+  are compatible with vfio-platform kernel driver.
+
 * **Updated Corigine nfp driver.**
 
   * Added support for meter options.
diff --git a/drivers/bus/meson.build b/drivers/bus/meson.build
index 45eab5233d..6d2520c543 100644
--- a/drivers/bus/meson.build
+++ b/drivers/bus/meson.build
@@ -7,6 +7,7 @@ drivers = [
         'fslmc',
         'ifpga',
         'pci',
+        'platform',
         'vdev',
         'vmbus',
 ]
diff --git a/drivers/bus/platform/bus_platform_driver.h b/drivers/bus/platform/bus_platform_driver.h
new file mode 100644
index 0000000000..8291c7f3f6
--- /dev/null
+++ b/drivers/bus/platform/bus_platform_driver.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#ifndef _BUS_PLATFORM_DRIVER_H_
+#define _BUS_PLATFORM_DRIVER_H_
+
+/**
+ * @file
+ * Platform bus interface.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include <dev_driver.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_os.h>
+
+/* Forward declarations */
+struct rte_platform_bus;
+struct rte_platform_device;
+struct rte_platform_driver;
+
+/**
+ * Initialization function for the driver called during platform device probing.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+typedef int (rte_platform_probe_t)(struct rte_platform_device *pdev);
+
+/**
+ * Removal function for the driver called during platform device removal.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+typedef int (rte_platform_remove_t)(struct rte_platform_device *pdev);
+
+/**
+ * Driver specific DMA mapping.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @param addr
+ *   Starting virtual address of memory to be mapped.
+ * @param iova
+ *   Starting IOVA address of memory to be mapped.
+ * @param len
+ *   Length of memory segment being mapped.
+ * @return
+ *   - 0 on success, negative value and rte_errno is set otherwise.
+ */
+typedef int (rte_platform_dma_map_t)(struct rte_platform_device *pdev, void *addr, uint64_t iova,
+				     size_t len);
+
+/**
+ * Driver specific DMA unmapping.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @param addr
+ *   Starting virtual address of memory to be unmapped.
+ * @param iova
+ *   Starting IOVA address of memory to be unmapped.
+ * @param len
+ *   Length of memory segment being unmapped.
+ * @return
+ *   - 0 on success, negative value and rte_errno is set otherwise.
+ */
+typedef int (rte_platform_dma_unmap_t)(struct rte_platform_device *pdev, void *addr, uint64_t iova,
+				       size_t len);
+
+/**
+ * A structure describing a platform device resource.
+ */
+struct rte_platform_resource {
+	char *name; /**< Resource name specified via reg-names prop in device-tree */
+	struct rte_mem_resource mem; /**< Memory resource */
+};
+
+/**
+ * A structure describing a platform device.
+ */
+struct rte_platform_device {
+	RTE_TAILQ_ENTRY(rte_platform_device) next; /**< Next attached platform device */
+	struct rte_device device; /**< Core device */
+	struct rte_platform_driver *driver; /**< Matching device driver */
+	char name[RTE_DEV_NAME_MAX_LEN]; /**< Device name */
+	unsigned int num_resource; /**< Number of device resources */
+	struct rte_platform_resource *resource; /**< Device resources */
+	int dev_fd; /**< VFIO device fd */
+};
+
+/**
+ * A structure describing a platform device driver.
+ */
+struct rte_platform_driver {
+	RTE_TAILQ_ENTRY(rte_platform_driver) next; /**< Next available platform driver */
+	struct rte_driver driver; /**< Core driver */
+	rte_platform_probe_t *probe;  /**< Device probe function */
+	rte_platform_remove_t *remove; /**< Device remove function */
+	rte_platform_dma_map_t *dma_map; /**< Device DMA map function */
+	rte_platform_dma_unmap_t *dma_unmap; /**< Device DMA unmap function */
+	uint32_t drv_flags; /**< Driver flags RTE_PLATFORM_DRV_* */
+};
+
+/** Device driver needs IOVA as VA and cannot work with IOVA as PA */
+#define RTE_PLATFORM_DRV_NEED_IOVA_AS_VA 0x0001
+
+/**
+ * @internal
+ * Helper macros used to convert core device to platform device.
+ */
+#define RTE_DEV_TO_PLATFORM_DEV(ptr) \
+	container_of(ptr, struct rte_platform_device, device)
+
+#define RTE_DEV_TO_PLATFORM_DEV_CONST(ptr) \
+	container_of(ptr, const struct rte_platform_device, device)
+
+/**
+ * Register a platform device driver.
+ *
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param pdrv
+ *   A pointer to a rte_platform_driver structure describing driver to be registered.
+ */
+__rte_internal
+void rte_platform_register(struct rte_platform_driver *pdrv);
+
+/** Helper for platform driver registration. */
+#define RTE_PMD_REGISTER_PLATFORM(nm, platform_drv) \
+static const char *pdrvinit_ ## nm ## _alias; \
+RTE_INIT(pdrvinitfn_ ##nm) \
+{ \
+	(platform_drv).driver.name = RTE_STR(nm); \
+	(platform_drv).driver.alias = pdrvinit_ ## nm ## _alias; \
+	rte_platform_register(&(platform_drv)); \
+} \
+RTE_PMD_EXPORT_NAME(nm, __COUNTER__)
+
+/** Helper for setting platform driver alias. */
+#define RTE_PMD_REGISTER_ALIAS(nm, alias) \
+static const char *pdrvinit_ ## nm ## _alias = RTE_STR(alias)
+
+/**
+ * Unregister a platform device driver.
+ *
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param pdrv
+ *   A pointer to a rte_platform_driver structure describing driver to be unregistered.
+ */
+__rte_internal
+void rte_platform_unregister(struct rte_platform_driver *pdrv);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _BUS_PLATFORM_DRIVER_H_ */
diff --git a/drivers/bus/platform/meson.build b/drivers/bus/platform/meson.build
new file mode 100644
index 0000000000..417d7b81f8
--- /dev/null
+++ b/drivers/bus/platform/meson.build
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell.
+#
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+deps += ['kvargs']
+sources = files(
+        'platform_params.c',
+        'platform.c',
+)
+driver_sdk_headers += files('bus_platform_driver.h')
diff --git a/drivers/bus/platform/platform.c b/drivers/bus/platform/platform.c
new file mode 100644
index 0000000000..b43a5b9153
--- /dev/null
+++ b/drivers/bus/platform/platform.c
@@ -0,0 +1,604 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#include <dirent.h>
+#include <inttypes.h>
+#include <linux/vfio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <unistd.h>
+
+#include <bus_driver.h>
+#include <bus_platform_driver.h>
+#include <eal_filesystem.h>
+#include <rte_bus.h>
+#include <rte_devargs.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_string_fns.h>
+#include <rte_vfio.h>
+
+#include "private.h"
+
+#define PLATFORM_BUS_DEVICES_PATH "/sys/bus/platform/devices"
+
+void
+rte_platform_register(struct rte_platform_driver *pdrv)
+{
+	TAILQ_INSERT_TAIL(&platform_bus.driver_list, pdrv, next);
+}
+
+void
+rte_platform_unregister(struct rte_platform_driver *pdrv)
+{
+	TAILQ_REMOVE(&platform_bus.driver_list, pdrv, next);
+}
+
+static struct rte_devargs *
+dev_devargs(const char *dev_name)
+{
+	struct rte_devargs *devargs;
+
+	RTE_EAL_DEVARGS_FOREACH("platform", devargs) {
+		if (!strcmp(devargs->name, dev_name))
+			return devargs;
+	}
+
+	return NULL;
+}
+
+static bool
+dev_allowed(const char *dev_name)
+{
+	struct rte_devargs *devargs;
+
+	devargs = dev_devargs(dev_name);
+	if (devargs == NULL)
+		return true;
+
+	switch (platform_bus.bus.conf.scan_mode) {
+	case RTE_BUS_SCAN_UNDEFINED:
+	case RTE_BUS_SCAN_ALLOWLIST:
+		if (devargs->policy == RTE_DEV_ALLOWED)
+			return true;
+		break;
+	case RTE_BUS_SCAN_BLOCKLIST:
+		if (devargs->policy == RTE_DEV_BLOCKED)
+			return false;
+		break;
+	}
+
+	return true;
+}
+
+static int
+dev_add(const char *dev_name)
+{
+	struct rte_platform_device *pdev, *tmp;
+	char path[PATH_MAX];
+	unsigned long val;
+
+	pdev = calloc(1, sizeof(*pdev));
+	if (pdev == NULL)
+		return -ENOMEM;
+
+	rte_strscpy(pdev->name, dev_name, sizeof(pdev->name));
+	pdev->device.name = pdev->name;
+	pdev->device.devargs = dev_devargs(dev_name);
+	pdev->device.bus = &platform_bus.bus;
+	snprintf(path, sizeof(path), PLATFORM_BUS_DEVICES_PATH "/%s/numa_node", dev_name);
+	pdev->device.numa_node = eal_parse_sysfs_value(path, &val) ? rte_socket_id() : val;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(tmp) {
+		if (!strcmp(tmp->name, pdev->name)) {
+			PLATFORM_LOG(INFO, "device %s already added\n", pdev->name);
+			if (tmp->device.devargs != pdev->device.devargs)
+				rte_devargs_remove(pdev->device.devargs);
+			/* duplicate: drop the new copy and stop, never insert freed memory */
+			free(pdev);
+			return 0;
+		}
+	}
+
+	TAILQ_INSERT_HEAD(&platform_bus.device_list, pdev, next);
+
+	PLATFORM_LOG(INFO, "adding device %s to the list\n", dev_name);
+
+	return 0;
+}
+
+static char *
+dev_kernel_driver_name(const char *dev_name)
+{
+	char path[PATH_MAX], buf[BUFSIZ] = { };
+	char *kdrv;
+	int ret;
+
+	snprintf(path, sizeof(path), PLATFORM_BUS_DEVICES_PATH "/%s/driver", dev_name);
+	/* save space for NUL */
+	ret = readlink(path, buf, sizeof(buf) - 1);
+	if (ret <= 0)
+		return NULL;
+
+	/* last token is kernel driver name */
+	kdrv = strrchr(buf, '/');
+	if (kdrv != NULL)
+		return strdup(kdrv + 1);
+
+	return NULL;
+}
+
+static bool
+dev_is_bound_vfio_platform(const char *dev_name)
+{
+	char *kdrv;
+	int ret;
+
+	kdrv = dev_kernel_driver_name(dev_name);
+	if (!kdrv)
+		return false;
+
+	ret = strcmp(kdrv, "vfio-platform");
+	free(kdrv);
+
+	return ret == 0;
+}
+
+static int
+platform_bus_scan(void)
+{
+	const struct dirent *ent;
+	const char *dev_name;
+	int ret = 0;
+	DIR *dp;
+
+	if ((dp = opendir(PLATFORM_BUS_DEVICES_PATH)) == NULL) {
+		PLATFORM_LOG(INFO, "failed to open %s\n", PLATFORM_BUS_DEVICES_PATH);
+		return -errno;
+	}
+
+	while ((ent = readdir(dp))) {
+		dev_name = ent->d_name;
+		if (dev_name[0] == '.')
+			continue;
+
+		if (!dev_allowed(dev_name))
+			continue;
+
+		if (!dev_is_bound_vfio_platform(dev_name))
+			continue;
+
+		ret = dev_add(dev_name);
+		if (ret)
+			break;
+	}
+
+	closedir(dp);
+
+	return ret;
+}
+
+static int
+device_map_resource_offset(struct rte_platform_device *pdev, struct rte_platform_resource *res,
+			   size_t offset)
+{
+	res->mem.addr = mmap(NULL, res->mem.len, PROT_READ | PROT_WRITE, MAP_SHARED, pdev->dev_fd,
+			     offset);
+	if (res->mem.addr == MAP_FAILED)
+		return -errno;
+
+	PLATFORM_LOG(DEBUG, "adding resource va = %p len = %"PRIu64" name = %s\n", res->mem.addr,
+		     res->mem.len, res->name);
+
+	return 0;
+}
+
+static void
+device_unmap_resources(struct rte_platform_device *pdev)
+{
+	struct rte_platform_resource *res;
+	unsigned int i;
+
+	for (i = 0; i < pdev->num_resource; i++) {
+		res = &pdev->resource[i];
+		munmap(res->mem.addr, res->mem.len);
+		free(res->name);
+	}
+
+	free(pdev->resource);
+	pdev->resource = NULL;
+	pdev->num_resource = 0;
+}
+
+static char *
+of_resource_name(const char *dev_name, int index)
+{
+	char path[PATH_MAX], buf[BUFSIZ] = { };
+	int num = 0, ret;
+	char *name;
+
+	snprintf(path, sizeof(path), PLATFORM_BUS_DEVICES_PATH "/%s/of_node/reg-names", dev_name);
+	ret = eal_parse_sysfs_string(path, buf, sizeof(buf) - 1);
+	if (ret)
+		return NULL;
+
+	for (name = buf; name < buf + sizeof(buf) && *name != '\0'; name += strlen(name) + 1) {
+		if (num++ != index)
+			continue;
+		return strdup(name);
+	}
+
+	return NULL;
+}
+
+static int
+device_map_resources(struct rte_platform_device *pdev, unsigned int num)
+{
+	struct rte_platform_resource *res;
+	unsigned int i;
+	int ret;
+
+	if (num == 0)
+		PLATFORM_LOG(WARNING, "device %s has no resources\n", pdev->name);
+
+	pdev->resource = calloc(num, sizeof(*pdev->resource));
+	if (pdev->resource == NULL)
+		return -ENOMEM;
+
+	for (i = 0; i < num; i++) {
+		struct vfio_region_info reg_info = {
+			.argsz = sizeof(reg_info),
+			.index = i,
+		};
+
+		ret = ioctl(pdev->dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+		if (ret) {
+			ret = -errno;
+			PLATFORM_LOG(ERR, "failed to get region info at %u\n", i);
+			goto out;
+		}
+
+		res = &pdev->resource[i];
+		res->name = of_resource_name(pdev->name, reg_info.index);
+		res->mem.len = reg_info.size;
+		ret = device_map_resource_offset(pdev, res, reg_info.offset);
+		if (ret) {
+			PLATFORM_LOG(ERR, "failed to ioremap resource at %d\n", i);
+			goto out;
+		}
+
+		pdev->num_resource++;
+	}
+
+	return 0;
+out:
+	device_unmap_resources(pdev);
+
+	return ret;
+}
+
+static void
+device_cleanup(struct rte_platform_device *pdev)
+{
+	device_unmap_resources(pdev);
+	rte_vfio_release_device(PLATFORM_BUS_DEVICES_PATH, pdev->name, pdev->dev_fd);
+}
+
+static int
+device_setup(struct rte_platform_device *pdev)
+{
+	struct vfio_device_info dev_info = { .argsz = sizeof(dev_info), };
+	const char *name = pdev->name;
+	int ret;
+
+	ret = rte_vfio_setup_device(PLATFORM_BUS_DEVICES_PATH, name, &pdev->dev_fd, &dev_info);
+	if (ret) {
+		PLATFORM_LOG(ERR, "failed to setup %s\n", name);
+		return -ENODEV;
+	}
+
+	if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PLATFORM)) {
+		PLATFORM_LOG(ERR, "device not backed by vfio-platform\n");
+		ret = -ENOTSUP;
+		goto out;
+	}
+
+	ret = device_map_resources(pdev, dev_info.num_regions);
+	if (ret) {
+		PLATFORM_LOG(ERR, "failed to setup platform resources\n");
+		goto out;
+	}
+
+	return 0;
+out:
+	device_cleanup(pdev);
+
+	return ret;
+}
+
+static int
+driver_call_probe(struct rte_platform_driver *pdrv, struct rte_platform_device *pdev)
+{
+	int ret;
+
+	if (rte_dev_is_probed(&pdev->device))
+		return -EBUSY;
+
+	if (pdrv->probe) {
+		pdev->driver = pdrv;
+		ret = pdrv->probe(pdev);
+		if (ret)
+			return ret;
+	}
+
+	pdev->device.driver = &pdrv->driver;
+
+	return 0;
+}
+
+static int
+driver_probe_device(struct rte_platform_driver *pdrv, struct rte_platform_device *pdev)
+{
+	enum rte_iova_mode iova_mode;
+	int ret;
+
+	iova_mode = rte_eal_iova_mode();
+	if (pdrv->drv_flags & RTE_PLATFORM_DRV_NEED_IOVA_AS_VA && iova_mode != RTE_IOVA_VA) {
+		PLATFORM_LOG(ERR, "driver %s expects VA IOVA mode but current mode is PA\n",
+			     pdrv->driver.name);
+		return -EINVAL;
+	}
+
+	ret = device_setup(pdev);
+	if (ret)
+		return ret;
+
+	ret = driver_call_probe(pdrv, pdev);
+	if (ret)
+		device_cleanup(pdev);
+
+	return ret;
+}
+
+static bool
+driver_match_device(struct rte_platform_driver *pdrv, struct rte_platform_device *pdev)
+{
+	bool match = false;
+	char *kdrv;
+
+	kdrv = dev_kernel_driver_name(pdev->name);
+	if (!kdrv)
+		return false;
+
+	/* match by driver name */
+	if (!strcmp(kdrv, pdrv->driver.name)) {
+		match = true;
+		goto out;
+	}
+
+	/* match by driver alias */
+	if (pdrv->driver.alias != NULL && !strcmp(kdrv, pdrv->driver.alias)) {
+		match = true;
+		goto out;
+	}
+
+	/* match by device name */
+	if (!strcmp(pdev->name, pdrv->driver.name))
+		match = true;
+
+out:
+	free(kdrv);
+
+	return match;
+}
+
+
+static int
+device_attach(struct rte_platform_device *pdev)
+{
+	struct rte_platform_driver *pdrv;
+
+	FOREACH_DRIVER_ON_PLATFORM_BUS(pdrv) {
+		if (driver_match_device(pdrv, pdev))
+			break;
+	}
+
+	if (pdrv == NULL)
+		return -ENODEV;
+
+	return driver_probe_device(pdrv, pdev);
+}
+
+static int
+platform_bus_probe(void)
+{
+	struct rte_platform_device *pdev;
+	int ret;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(pdev) {
+		ret = device_attach(pdev);
+		if (ret == -EBUSY) {
+			PLATFORM_LOG(DEBUG, "device %s already probed\n", pdev->name);
+			continue;
+		}
+		if (ret)
+			PLATFORM_LOG(ERR, "failed to probe %s\n", pdev->name);
+	}
+
+	return 0;
+}
+
+static struct rte_device *
+platform_bus_find_device(const struct rte_device *start, rte_dev_cmp_t cmp, const void *data)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = start ? RTE_TAILQ_NEXT(RTE_DEV_TO_PLATFORM_DEV_CONST(start), next) :
+		       RTE_TAILQ_FIRST(&platform_bus.device_list);
+	while (pdev) {
+		if (cmp(&pdev->device, data) == 0)
+			return &pdev->device;
+
+		pdev = RTE_TAILQ_NEXT(pdev, next);
+	}
+
+	return NULL;
+}
+
+static int
+platform_bus_plug(struct rte_device *dev)
+{
+	struct rte_platform_device *pdev;
+
+	if (!dev_allowed(dev->name))
+		return -EPERM;
+
+	if (!dev_is_bound_vfio_platform(dev->name))
+		return -EPERM;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL)
+		return -EINVAL;
+
+	return device_attach(pdev);
+}
+
+static void
+device_release_driver(struct rte_platform_device *pdev)
+{
+	struct rte_platform_driver *pdrv;
+	int ret;
+
+	pdrv = pdev->driver;
+	if (pdrv != NULL && pdrv->remove != NULL) {
+		ret = pdrv->remove(pdev);
+		if (ret)
+			PLATFORM_LOG(WARNING, "failed to remove %s\n", pdev->name);
+	}
+
+	pdev->device.driver = NULL;
+	pdev->driver = NULL;
+}
+
+static int
+platform_bus_unplug(struct rte_device *dev)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL)
+		return -EINVAL;
+
+	device_release_driver(pdev);
+	device_cleanup(pdev);
+	rte_devargs_remove(pdev->device.devargs);
+	free(pdev);
+
+	return 0;
+}
+
+static int
+platform_bus_parse(const char *name, void *addr)
+{
+	struct rte_platform_device *pdev;
+	const char **out = addr;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(pdev) {
+		if (!strcmp(name, pdev->name))
+			break;
+	}
+
+	if (pdev && addr)
+		*out = name;
+
+	return pdev ? 0 : -ENODEV;
+}
+
+static int
+platform_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL || pdev->driver == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	if (pdev->driver->dma_map != NULL)
+		return pdev->driver->dma_map(pdev, addr, iova, len);
+
+	return rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, (uint64_t)addr, iova, len);
+}
+
+static int
+platform_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL || pdev->driver == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	if (pdev->driver->dma_unmap != NULL)
+		return pdev->driver->dma_unmap(pdev, addr, iova, len);
+
+	return rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, (uint64_t)addr, iova,
+					    len);
+}
+
+static enum rte_iova_mode
+platform_bus_get_iommu_class(void)
+{
+	struct rte_platform_driver *pdrv;
+	struct rte_platform_device *pdev;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(pdev) {
+		pdrv = pdev->driver;
+		if (pdrv != NULL && pdrv->drv_flags & RTE_PLATFORM_DRV_NEED_IOVA_AS_VA)
+			return RTE_IOVA_VA;
+	}
+
+	return RTE_IOVA_DC;
+}
+
+static int
+platform_bus_cleanup(void)
+{
+	struct rte_platform_device *pdev, *tmp;
+
+	RTE_TAILQ_FOREACH_SAFE(pdev, &platform_bus.device_list, next, tmp) {
+		TAILQ_REMOVE(&platform_bus.device_list, pdev, next);
+		platform_bus_unplug(&pdev->device);
+	}
+
+	return 0;
+}
+
+struct rte_platform_bus platform_bus = {
+	.bus = {
+		.scan = platform_bus_scan,
+		.probe = platform_bus_probe,
+		.find_device = platform_bus_find_device,
+		.plug = platform_bus_plug,
+		.unplug = platform_bus_unplug,
+		.parse = platform_bus_parse,
+		.dma_map = platform_bus_dma_map,
+		.dma_unmap = platform_bus_dma_unmap,
+		.get_iommu_class = platform_bus_get_iommu_class,
+		.dev_iterate = platform_bus_dev_iterate,
+		.cleanup = platform_bus_cleanup,
+	},
+	.device_list = TAILQ_HEAD_INITIALIZER(platform_bus.device_list),
+	.driver_list = TAILQ_HEAD_INITIALIZER(platform_bus.driver_list),
+};
+
+RTE_REGISTER_BUS(platform_bus, platform_bus.bus);
+RTE_LOG_REGISTER_DEFAULT(platform_bus_logtype, NOTICE);
diff --git a/drivers/bus/platform/platform_params.c b/drivers/bus/platform/platform_params.c
new file mode 100644
index 0000000000..d199c0c586
--- /dev/null
+++ b/drivers/bus/platform/platform_params.c
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#include <string.h>
+#include <errno.h>
+
+#include <rte_bus.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_kvargs.h>
+
+#include "bus_platform_driver.h"
+#include "private.h"
+
+enum platform_params {
+	RTE_PLATFORM_PARAM_NAME,
+};
+
+static const char * const platform_params_keys[] = {
+	[RTE_PLATFORM_PARAM_NAME] = "name",
+	NULL
+};
+
+static int
+platform_dev_match(const struct rte_device *dev, const void *_kvlist)
+{
+	const char *key = platform_params_keys[RTE_PLATFORM_PARAM_NAME];
+	const struct rte_kvargs *kvlist = _kvlist;
+	const char *name;
+
+	/* no kvlist arg, all devices match */
+	if (kvlist == NULL)
+		return 0;
+
+	/* if key is present in kvlist and does not match, filter device */
+	name = rte_kvargs_get(kvlist, key);
+	if (name != NULL && strcmp(name, dev->name))
+		return -1;
+
+	return 0;
+}
+
+void *
+platform_bus_dev_iterate(const void *start, const char *str,
+			 const struct rte_dev_iterator *it __rte_unused)
+{
+	rte_bus_find_device_t find_device;
+	struct rte_kvargs *kvargs = NULL;
+	struct rte_device *dev;
+
+	if (str != NULL) {
+		kvargs = rte_kvargs_parse(str, platform_params_keys);
+		if (!kvargs) {
+			PLATFORM_LOG(ERR, "cannot parse argument list %s\n", str);
+			rte_errno = EINVAL;
+			return NULL;
+		}
+	}
+
+	find_device = platform_bus.bus.find_device;
+	/* free kvargs on every path, including when find_device is unset */
+	dev = NULL;
+	if (find_device != NULL)
+		dev = find_device(start, platform_dev_match, kvargs);
+	rte_kvargs_free(kvargs);
+
+	return dev;
+}
diff --git a/drivers/bus/platform/private.h b/drivers/bus/platform/private.h
new file mode 100644
index 0000000000..dcd992f8a7
--- /dev/null
+++ b/drivers/bus/platform/private.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#ifndef _PLATFORM_PRIVATE_H_
+#define _PLATFORM_PRIVATE_H_
+
+#include <bus_driver.h>
+#include <rte_bus.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_log.h>
+#include <rte_os.h>
+
+#include "bus_platform_driver.h"
+
+extern struct rte_platform_bus platform_bus;
+extern int platform_bus_logtype;
+
+/* Platform bus iterators. */
+#define FOREACH_DEVICE_ON_PLATFORM_BUS(p) \
+	RTE_TAILQ_FOREACH(p, &(platform_bus.device_list), next)
+
+#define FOREACH_DRIVER_ON_PLATFORM_BUS(p) \
+	RTE_TAILQ_FOREACH(p, &(platform_bus.driver_list), next)
+
+/*
+ * Structure describing platform bus.
+ */
+struct rte_platform_bus {
+	struct rte_bus bus; /* Core bus */
+	RTE_TAILQ_HEAD(, rte_platform_device) device_list; /* List of bus devices */
+	RTE_TAILQ_HEAD(, rte_platform_driver) driver_list; /* List of bus drivers */
+};
+
+#define PLATFORM_LOG(level, ...) \
+	rte_log(RTE_LOG_ ## level, platform_bus_logtype, \
+		RTE_FMT("platform bus: " RTE_FMT_HEAD(__VA_ARGS__,), \
+			RTE_FMT_TAIL(__VA_ARGS__,)))
+
+/*
+ * Iterate registered platform devices and find one that matches provided string.
+ */
+void *
+platform_bus_dev_iterate(const void *start, const char *str,
+			 const struct rte_dev_iterator *it __rte_unused);
+
+#endif /* _PLATFORM_PRIVATE_H_ */
diff --git a/drivers/bus/platform/version.map b/drivers/bus/platform/version.map
new file mode 100644
index 0000000000..bacce4da08
--- /dev/null
+++ b/drivers/bus/platform/version.map
@@ -0,0 +1,10 @@
+DPDK_23 {
+	local: *;
+};
+
+INTERNAL {
+	global:
+
+	rte_platform_register;
+	rte_platform_unregister;
+};
-- 
2.34.1
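
For readers following of_resource_name() in the patch above: the reg-names property is a run of NUL-separated strings, so the walk has to stop at the end of the data rather than when the pointer becomes NULL (which never happens). Below is a minimal standalone sketch of that walk. The helper name reg_name_at() and its explicit length argument are assumptions made for illustration; they are not part of the patch.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a heap copy of the index-th string in a buffer of NUL-separated
 * strings, or NULL if there is no such entry. len is the number of valid
 * bytes in buf. Hypothetical helper, not part of the patch.
 */
static char *
reg_name_at(const char *buf, size_t len, unsigned int index)
{
	const char *p = buf, *end = buf + len;
	unsigned int num = 0;

	/* stop at the end of the data or at the first empty entry */
	while (p < end && *p != '\0') {
		const char *nul = memchr(p, '\0', (size_t)(end - p));
		size_t n = nul != NULL ? (size_t)(nul - p) : (size_t)(end - p);

		if (num++ == index) {
			char *copy = malloc(n + 1);

			if (copy == NULL)
				return NULL;
			memcpy(copy, p, n);
			copy[n] = '\0';
			return copy;
		}

		if (nul == NULL)
			break; /* unterminated last entry, nothing follows */
		p = nul + 1; /* skip the string and its terminator */
	}

	return NULL;
}
```

With a buffer holding "ctrl", NUL, "data", NUL, index 1 yields a copy of "data" and index 2 yields NULL. The caller owns the returned string and frees it, as the patch does for res->name.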


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
@ 2023-01-25 10:39             ` Thomas Monjalon
  2023-01-25 16:16               ` Tyler Retzlaff
  2023-01-26  8:35               ` Tomasz Duszynski
  0 siblings, 2 replies; 139+ messages in thread
From: Thomas Monjalon @ 2023-01-25 10:39 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, jerinj, stephen, chenbo.xia, david.marchand, bruce.richardson

25/01/2023 11:33, Tomasz Duszynski:
> Reading strings from sysfs files is a re-occurring pattern
> hence add helper for doing that.

In general it would be nice to clean up sysfs parsing in libs and drivers,
so they all use some functions from EAL.




^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH 0/2] add platform bus
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 2/2] bus: add platform bus Tomasz Duszynski
@ 2023-01-25 10:41           ` Tomasz Duszynski
  2 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:41 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, stephen, chenbo.xia

This was mistakenly appended to this thread, please ignore it. I've just sent the series again.

>-----Original Message-----
>From: Tomasz Duszynski <tduszynski@marvell.com>
>Sent: Wednesday, January 25, 2023 11:33 AM
>To: dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>stephen@networkplumber.org; chenbo.xia@intel.com; Tomasz Duszynski <tduszynski@marvell.com>
>Subject: [PATCH 0/2] add platform bus
>
>Platform bus is a bus under Linux which manages devices that do not have any discovery mechanism
>built in. Linux learns about platform devices directly from the device tree during the boot-up phase.
>
>Afterwards, if userspace wants to use a particular device, a driver gets developed, usually as a
>mixture of vdev/rawdev.
>
>In order to simplify that, introduce a DPDK platform bus which provides an auto-probe experience and
>separates the bus logic from the driver itself.
>
>For now, only devices backed by the vfio-platform kernel driver are supported, though other
>options may be added if necessary.
>
>Tomasz Duszynski (2):
>  lib: add helper to read strings from sysfs files
>  bus: add platform bus
>
> MAINTAINERS                                |   4 +
> app/test/test_eal_fs.c                     | 108 +++-
> doc/guides/rel_notes/release_23_03.rst     |   5 +
> drivers/bus/meson.build                    |   1 +
> drivers/bus/platform/bus_platform_driver.h | 174 ++++++
> drivers/bus/platform/meson.build           |  16 +
> drivers/bus/platform/platform.c            | 604 +++++++++++++++++++++
> drivers/bus/platform/platform_params.c     |  70 +++
> drivers/bus/platform/private.h             |  48 ++
> drivers/bus/platform/version.map           |  10 +
> lib/eal/common/eal_filesystem.h            |   6 +
> lib/eal/unix/eal_filesystem.c              |  24 +-
> lib/eal/version.map                        |   1 +
> 13 files changed, 1053 insertions(+), 18 deletions(-)
> create mode 100644 drivers/bus/platform/bus_platform_driver.h
> create mode 100644 drivers/bus/platform/meson.build
> create mode 100644 drivers/bus/platform/platform.c
> create mode 100644 drivers/bus/platform/platform_params.c
> create mode 100644 drivers/bus/platform/private.h
> create mode 100644 drivers/bus/platform/version.map
>
>--
>2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:39             ` Thomas Monjalon
@ 2023-01-25 16:16               ` Tyler Retzlaff
  2023-01-26  8:30                 ` [EXT] " Tomasz Duszynski
  2023-01-26  8:35               ` Tomasz Duszynski
  1 sibling, 1 reply; 139+ messages in thread
From: Tyler Retzlaff @ 2023-01-25 16:16 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Tomasz Duszynski, dev, jerinj, stephen, chenbo.xia,
	david.marchand, bruce.richardson

On Wed, Jan 25, 2023 at 11:39:30AM +0100, Thomas Monjalon wrote:
> 25/01/2023 11:33, Tomasz Duszynski:
> > Reading strings from sysfs files is a re-occurring pattern
> > hence add helper for doing that.
> 
> In general it would be nice to clean up sysfs parsing in libs and drivers,
> so they all use some functions from EAL.

maybe there should be a general utility library for dealing with sysfs
separate from the core EAL that drivers / platform specific libs can
share?

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 16:16               ` Tyler Retzlaff
@ 2023-01-26  8:30                 ` Tomasz Duszynski
  2023-01-26 17:21                   ` Tyler Retzlaff
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  8:30 UTC (permalink / raw)
  To: Tyler Retzlaff, Thomas Monjalon
  Cc: dev, Jerin Jacob Kollanukkaran, stephen, chenbo.xia,
	david.marchand, bruce.richardson


>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Wednesday, January 25, 2023 5:16 PM
>To: Thomas Monjalon <thomas@monjalon.net>
>Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; stephen@networkplumber.org; chenbo.xia@intel.com; david.marchand@redhat.com;
>bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
>
>External Email
>
>----------------------------------------------------------------------
>On Wed, Jan 25, 2023 at 11:39:30AM +0100, Thomas Monjalon wrote:
>> 25/01/2023 11:33, Tomasz Duszynski:
>> > Reading strings from sysfs files is a re-occurring pattern hence add
>> > helper for doing that.
>>
>> In general it would be nice to clean up sysfs parsing in libs and
>> drivers, so they all use some functions from EAL.
>
>maybe there should be a general utility library for dealing with sysfs separate from the core EAL
>that drivers / platform specific libs can share?

Reading/writing of sysfs files is scattered around the codebase, and this has been piling up
with each and every new pmd/lib that requires it. So generally a few simple utility functions
in one place may be a good idea.

Would the following make sense?

rte_sysfs_write_int()
rte_sysfs_write_string()
rte_sysfs_read_int()
rte_sysfs_read_string() 

Also, the pattern where a file gets opened once and is written to repeatedly until closed seems to
recur as well, so there might be some utils for that too. Thoughts?
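
To make the proposal a bit more concrete, here is a rough sketch of what a read-string helper could look like. It is only an illustration built on plain stdio: the name sysfs_read_string(), the return convention (0 or a negative errno value) and the newline stripping are all assumptions, not an agreed API and not the helper added in patch 1/2.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a file into buf (at most sz - 1 bytes) and strip the
 * trailing newline. Returns 0 on success or a negative errno value.
 * Hypothetical stand-in for the proposed rte_sysfs_read_string().
 */
static int
sysfs_read_string(const char *path, char *buf, size_t sz)
{
	FILE *f;

	if (path == NULL || buf == NULL || sz == 0)
		return -EINVAL;
	if (sz > INT_MAX)
		sz = INT_MAX; /* fgets() takes an int size */

	f = fopen(path, "r");
	if (f == NULL)
		return -errno;

	if (fgets(buf, (int)sz, f) == NULL) {
		/* empty file or read error */
		fclose(f);
		return -EIO;
	}
	fclose(f);

	/* sysfs attributes usually end with a newline, drop it */
	buf[strcspn(buf, "\n")] = '\0';

	return 0;
}
```

A caller would pass a path under /sys together with a stack buffer and check the return value before using the string. An integer variant could sit on top of this with strtol(), which would keep the parsing rules in one place.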

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:39             ` Thomas Monjalon
  2023-01-25 16:16               ` Tyler Retzlaff
@ 2023-01-26  8:35               ` Tomasz Duszynski
  1 sibling, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  8:35 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Jerin Jacob Kollanukkaran, stephen, chenbo.xia,
	david.marchand, bruce.richardson



>-----Original Message-----
>From: Thomas Monjalon <thomas@monjalon.net>
>Sent: Wednesday, January 25, 2023 11:40 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; stephen@networkplumber.org;
>chenbo.xia@intel.com; david.marchand@redhat.com; bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
>
>External Email
>
>----------------------------------------------------------------------
>25/01/2023 11:33, Tomasz Duszynski:
>> Reading strings from sysfs files is a re-occurring pattern hence add
>> helper for doing that.
>
>In general it would be nice to clean up sysfs parsing in libs and drivers, so they all use some
>functions from EAL.
>

That's generally true. Here I wanted to keep tree-wide changes out of otherwise unrelated work,
i.e. a new bus, and do the cleanup later on, i.e. use this read-string util where applicable.
 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-20 18:29             ` Tyler Retzlaff
@ 2023-01-26  9:05               ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  9:05 UTC (permalink / raw)
  To: Tyler Retzlaff
  Cc: dev, Thomas Monjalon, Jerin Jacob Kollanukkaran, mb,
	Ruifeng.Wang, mattias.ronnblom, zhoumin, bruce.richardson



>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Friday, January 20, 2023 7:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>>  MAINTAINERS                            |   5 +
>>  app/test/meson.build                   |   4 +
>>  app/test/test_pmu.c                    |  42 +++
>>  doc/api/doxy-api-index.md              |   3 +-
>>  doc/api/doxy-api.conf.in               |   1 +
>>  doc/guides/prog_guide/profile_app.rst  |   8 +
>>  doc/guides/rel_notes/release_23_03.rst |   7 +
>>  lib/meson.build                        |   1 +
>>  lib/pmu/meson.build                    |  13 +
>>  lib/pmu/pmu_private.h                  |  29 ++
>>  lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
>>  lib/pmu/rte_pmu.h                      | 206 ++++++++++++
>>  lib/pmu/version.map                    |  19 ++
>>  13 files changed, 773 insertions(+), 1 deletion(-)
>>  create mode 100644 app/test/test_pmu.c
>>  create mode 100644 lib/pmu/meson.build
>>  create mode 100644 lib/pmu/pmu_private.h
>>  create mode 100644 lib/pmu/rte_pmu.c
>>  create mode 100644 lib/pmu/rte_pmu.h
>>  create mode 100644 lib/pmu/version.map
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 9a0f416d2e..9f13eafd95 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>>  F: lib/node/
>>
>> +PMU - EXPERIMENTAL
>> +M: Tomasz Duszynski <tduszynski@marvell.com>
>> +F: lib/pmu/
>> +F: app/test/test_pmu*
>> +
>>
>>  Test Applications
>>  -----------------
>> diff --git a/app/test/meson.build b/app/test/meson.build
>> index f34d19e3c3..b2c2a618b1 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
>>      test_sources += ['test_metrics.c']
>>      fast_tests += [['metrics_autotest', true, true]]
>>  endif
>> +if is_linux
>> +    test_sources += ['test_pmu.c']
>> +    fast_tests += [['pmu_autotest', true, true]]
>> +endif
>
>traditionally we don't conditionally include tests at the meson.build level, instead we run all
>tests and have them skip when executed for unsupported exec environments.
>
>you can take a look at test_eventdev.c as an example for a test that is skipped on windows, i'm
>sure it could be adapted to skip on freebsd if you aren't supporting it.

Right, this looks better. Thanks for pointing this out. 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-20  9:46             ` Morten Brørup
@ 2023-01-26  9:40               ` Tomasz Duszynski
  2023-01-26 12:29                 ` Morten Brørup
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  9:40 UTC (permalink / raw)
  To: Morten Brørup, dev, Thomas Monjalon
  Cc: Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson, roretzla

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Friday, January 20, 2023 10:47 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Friday, 20 January 2023 00.39
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>
>If you insist on passing lcore_id around as a function parameter, the function description must
>mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a
>requirement, including all functions that use those functions.
>

Not sure why you are insisting so much on removing that rte_lcore_id(). Yes, that macro evaluates
to an integer, but if you don't think about the internals it resembles a function call.

Then the natural pattern is to call it once and reuse the result if possible. Passing lcore_id
around implies that calls are per-lcore; why would that confuse anyone reading the code?

Besides, all functions taking it are internal, hence you cannot call them elsewhere.

>Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use
>rte_lcore_id() instead.
>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26  9:40               ` Tomasz Duszynski
@ 2023-01-26 12:29                 ` Morten Brørup
  2023-01-26 12:59                   ` Bruce Richardson
  2023-01-26 15:17                   ` Tomasz Duszynski
  0 siblings, 2 replies; 139+ messages in thread
From: Morten Brørup @ 2023-01-26 12:29 UTC (permalink / raw)
  To: Tomasz Duszynski, dev, Thomas Monjalon
  Cc: Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson, roretzla

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 26 January 2023 10.40
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Friday, January 20, 2023 10:47 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Friday, 20 January 2023 00.39
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >If you insist on passing lcore_id around as a function parameter, the
> function description must
> >mention that the lcore_id parameter must be set to rte_lcore_id() for
> the functions where this is a
> >requirement, including all functions that use those functions.

Perhaps I'm stating this wrong, so let me try to rephrase:

As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.

Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.

[1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/

> >
> 
> Not sure why are you insisting so much on removing that rte_lcore_id().
> Yes that macro evaluates
> to integer but if you don't think about internals this resembles a
> function call.

I agree with this argument. And for that reason, passing lcore_id around could be relevant.

I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.

> 
> Then natural pattern is to call it once and reuse results if possible.

Yes, and I would usually agree to using this pattern.

> Passing lcore_id around
> implies that calls are per l-core, why would that confuse anyone
> reading that code?

This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!

E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().

[2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315

> 
> Besides, all functions taking it are internal stuff hence you cannot
> call it elsewhere.

OK. I agree that this reduces the risk of incorrect use.

Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.

And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.

It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).
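
To make the two shapes under discussion concrete, here is a minimal sketch. It is not code from the patch: this_lcore_id, current_lcore_id(), enable_group_for() and enable_group() are hypothetical stand-ins, and the thread-local variable only imitates what rte_lcore_id() provides.

```c
#include <assert.h>

#define MAX_LCORES 4

/* Hypothetical stand-in for the per-thread lcore id (not the real EAL). */
static _Thread_local unsigned int this_lcore_id;

/* Per-lcore state, indexed by lcore id. */
static int group_enabled[MAX_LCORES];

static unsigned int
current_lcore_id(void)
{
	return this_lcore_id;
}

/*
 * Shape 1: the caller passes lcore_id. Because the work must happen on the
 * calling thread, the function has to document and enforce
 * lcore_id == current_lcore_id(); any other value is rejected.
 */
static int
enable_group_for(unsigned int lcore_id)
{
	if (lcore_id >= MAX_LCORES || lcore_id != current_lcore_id())
		return -1;

	group_enabled[lcore_id] = 1;
	return 0;
}

/*
 * Shape 2: no parameter. The function fetches the id itself, so there is
 * no way to pass an lcore id that belongs to another thread.
 */
static int
enable_group(void)
{
	unsigned int lcore_id = current_lcore_id();

	if (lcore_id >= MAX_LCORES)
		return -1;

	group_enabled[lcore_id] = 1;
	return 0;
}
```

With shape 1, a call such as enable_group_for(2) made from the thread whose id is 1 can only fail at runtime; with shape 2 that mistake cannot be expressed at all.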

> 
> >Alternatively, follow my previous suggestion: Omit the lcore_id
> function parameter, and use
> >rte_lcore_id() instead.
> >
> 


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26 12:29                 ` Morten Brørup
@ 2023-01-26 12:59                   ` Bruce Richardson
  2023-01-26 15:28                     ` [EXT] " Tomasz Duszynski
  2023-01-26 15:17                   ` Tomasz Duszynski
  1 sibling, 1 reply; 139+ messages in thread
From: Bruce Richardson @ 2023-01-26 12:59 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Tomasz Duszynski, dev, Thomas Monjalon,
	Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, roretzla

On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Thursday, 26 January 2023 10.40
> > 
> > >From: Morten Brørup <mb@smartsharesystems.com>
> > >Sent: Friday, January 20, 2023 10:47 AM
> > >
> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > >> Sent: Friday, 20 January 2023 00.39
> > >>
> > >> Add support for programming PMU counters and reading their values in
> > >> runtime bypassing kernel completely.
> > >>
> > >> This is especially useful in cases where CPU cores are isolated
> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> > >> standard perf utility without sacrificing latency and performance.
> > >>
> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > >> ---
> > >
> > >If you insist on passing lcore_id around as a function parameter, the
> > function description must
> > >mention that the lcore_id parameter must be set to rte_lcore_id() for
> > the functions where this is a
> > >requirement, including all functions that use those functions.
> 
> Perhaps I'm stating this wrong, so let me try to rephrase:
> 
> As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.
> 
> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.
> 
> [1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
> 
> > >
> > 
> > Not sure why are you insisting so much on removing that rte_lcore_id().
> > Yes that macro evaluates
> > to integer but if you don't think about internals this resembles a
> > function call.
> 
> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
> 
> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.
> 
> > 
> > Then natural pattern is to call it once and reuse results if possible.
> 
> Yes, and I would usually agree to using this pattern.
> 
> > Passing lcore_id around
> > implies that calls are per l-core, why would that confuse anyone
> > reading that code?
> 
> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!
> 
> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().
> 
> [2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
> 
> > 
> > Besides, all functions taking it are internal stuff hence you cannot
> > call it elsewhere.
> 
> OK. I agree that this reduces the risk of incorrect use.
> 
> Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.
> 
> And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.
> 
> It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).
> 
I would tend to agree with you Morten. If the lcore_id parameter to the
function must be rte_lcore_id(), then I think it's error prone to have that
as an explicit parameter, and that the function should always get the core
id itself.

Another possible complication: how does this work with threads that are
not pinned to a particular physical core? Do things work as expected in
that case?

/Bruce

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26 12:29                 ` Morten Brørup
  2023-01-26 12:59                   ` Bruce Richardson
@ 2023-01-26 15:17                   ` Tomasz Duszynski
  1 sibling, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-26 15:17 UTC (permalink / raw)
  To: Morten Brørup, dev, Thomas Monjalon
  Cc: Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson, roretzla

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Thursday, January 26, 2023 1:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Thursday, 26 January 2023 10.40
>>
>> >From: Morten Brørup <mb@smartsharesystems.com>
>> >Sent: Friday, January 20, 2023 10:47 AM
>> >
>> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> >> Sent: Friday, 20 January 2023 00.39
>> >>
>> >> Add support for programming PMU counters and reading their values
>> >> in runtime bypassing kernel completely.
>> >>
>> >> This is especially useful in cases where CPU cores are isolated
>> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> >> standard perf utility without sacrificing latency and performance.
>> >>
>> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> >> ---
>> >
>> >If you insist on passing lcore_id around as a function parameter, the
>> function description must
>> >mention that the lcore_id parameter must be set to rte_lcore_id() for
>> the functions where this is a
>> >requirement, including all functions that use those functions.
>
>Perhaps I'm stating this wrong, so let me try to rephrase:
>
>As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>
>Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call
>those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter,
>which you at the bottom of this email [1] explained is not possible.
>
>[1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
>
>> >
>>
>> Not sure why are you insisting so much on removing that rte_lcore_id().
>> Yes that macro evaluates
>> to integer but if you don't think about internals this resembles a
>> function call.
>
>I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>
>I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>
>>
>> Then natural pattern is to call it once and reuse results if possible.
>
>Yes, and I would usually agree to using this pattern.
>
>> Passing lcore_id around
>> implies that calls are per l-core, why would that confuse anyone
>> reading that code?
>
>This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>
>E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>

Oh, now I got your point!

Okay then, if this is going to cause confusion because of misleading
self-documenting code I'll change that.  

>[2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>
>>
>> Besides, all functions taking it are internal stuff hence you cannot
>> call it elsewhere.
>
>OK. I agree that this reduces the risk of incorrect use.
>
>Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>
>And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>
>It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>
>>
>> >Alternatively, follow my previous suggestion: Omit the lcore_id
>> function parameter, and use
>> >rte_lcore_id() instead.
>> >
>>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26 12:59                   ` Bruce Richardson
@ 2023-01-26 15:28                     ` Tomasz Duszynski
  2023-02-02 14:27                       ` Morten Brørup
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-01-26 15:28 UTC (permalink / raw)
  To: Bruce Richardson, Morten Brørup
  Cc: dev, Thomas Monjalon, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin, roretzla



>-----Original Message-----
>From: Bruce Richardson <bruce.richardson@intel.com>
>Sent: Thursday, January 26, 2023 1:59 PM
>To: Morten Brørup <mb@smartsharesystems.com>
>Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>;
>Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; roretzla@linux.microsoft.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
>> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > Sent: Thursday, 26 January 2023 10.40
>> >
>> > >From: Morten Brørup <mb@smartsharesystems.com>
>> > >Sent: Friday, January 20, 2023 10:47 AM
>> > >
>> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > >> Sent: Friday, 20 January 2023 00.39
>> > >>
>> > >> Add support for programming PMU counters and reading their values
>> > >> in runtime bypassing kernel completely.
>> > >>
>> > >> This is especially useful in cases where CPU cores are isolated
>> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> > >> standard perf utility without sacrificing latency and performance.
>> > >>
>> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> > >> ---
>> > >
>> > >If you insist on passing lcore_id around as a function parameter,
>> > >the
>> > function description must
>> > >mention that the lcore_id parameter must be set to rte_lcore_id()
>> > >for
>> > the functions where this is a
>> > >requirement, including all functions that use those functions.
>>
>> Perhaps I'm stating this wrong, so let me try to rephrase:
>>
>> As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>>
>> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to
>call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a
>parameter, which you at the bottom of this email [1] explained is not possible.
>>
>> [1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
>>
>> > >
>> >
>> > Not sure why are you insisting so much on removing that rte_lcore_id().
>> > Yes that macro evaluates
>> > to integer but if you don't think about internals this resembles a
>> > function call.
>>
>> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>>
>> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>>
>> >
>> > Then natural pattern is to call it once and reuse results if possible.
>>
>> Yes, and I would usually agree to using this pattern.
>>
>> > Passing lcore_id around
>> > implies that calls are per l-core, why would that confuse anyone
>> > reading that code?
>>
>> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>>
>> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>>
>> [2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>>
>> >
>> > Besides, all functions taking it are internal stuff hence you cannot
>> > call it elsewhere.
>>
>> OK. I agree that this reduces the risk of incorrect use.
>>
>> Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>>
>> And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>>
>> It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>>
>I would tend to agree with you Morten. If the lcore_id parameter to the function must be
>rte_lcore_id(), then I think it's error prone to have that as an explicit parameter, and that the
>function should always get the core id itself.
>
>Other possible complication is - how does this work with threads that are not pinned to a
>particular physical core? Do things work as expected in that case?
>

It's assumed that once a set of counters is enabled on a particular lcore, the thread shouldn't be
migrating back and forth, for obvious reasons.

But once scheduled elsewhere, everything should still work as expected.

>/Bruce

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-26  8:30                 ` [EXT] " Tomasz Duszynski
@ 2023-01-26 17:21                   ` Tyler Retzlaff
  0 siblings, 0 replies; 139+ messages in thread
From: Tyler Retzlaff @ 2023-01-26 17:21 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: Thomas Monjalon, dev, Jerin Jacob Kollanukkaran, stephen,
	chenbo.xia, david.marchand, bruce.richardson

On Thu, Jan 26, 2023 at 08:30:01AM +0000, Tomasz Duszynski wrote:
> 
> >-----Original Message-----
> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >Sent: Wednesday, January 25, 2023 5:16 PM
> >To: Thomas Monjalon <thomas@monjalon.net>
> >Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Jerin Jacob Kollanukkaran
> ><jerinj@marvell.com>; stephen@networkplumber.org; chenbo.xia@intel.com; david.marchand@redhat.com;
> >bruce.richardson@intel.com
> >Subject: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
> >
> >On Wed, Jan 25, 2023 at 11:39:30AM +0100, Thomas Monjalon wrote:
> >> 25/01/2023 11:33, Tomasz Duszynski:
> >> > Reading strings from sysfs files is a re-occurring pattern hence add
> >> > helper for doing that.
> >>
> >> In general it would be nice to clean up sysfs parsing in libs and
> >> drivers, so they all use some functions from EAL.
> >
> >maybe there should be a general utility library for dealing with sysfs separate from the core EAL
> >that drivers / platform specific libs can share?
> 
> reading/writing of sysfs files is scattered around the codebase and this has been piling up
> with each and every new pmd/lib that requires it. So generally a few simple utility functions
> in one place may be a good idea.

i'm an advocate of smaller libraries that tackle a subject area and do
so well. even better if they can be unit tested without dragging in a
lot of dependencies or bootstrapping other unrelated subsystems.

it is also in alignment with trying to de-bloat eal which i think there
is increasing interest in.

> 
> Would following make sense?
> 
> rte_sysfs_write_int()
> rte_sysfs_write_string()
> rte_sysfs_read_int()
> rte_sysfs_read_string() 
> 
> Also seems that pattern where file gets opened once and keeps being written to until closed is 
> reoccurring as well. So there might be some utils for that as well. Thoughts? 

i guess the answer here is whatever makes a simple intuitive api for
sysfs access, i don't contribute much on the linux side to dpdk so can't
speak to what makes a good api here, but i imagine others can in review.
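
for the sake of discussion, a rough sketch of what two of the read helpers might look like. the names sysfs_read_string() and sysfs_read_int() and their signatures are hypothetical, they are not an existing dpdk api and not what the patch adds; it only assumes the usual one-value-per-file sysfs layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: read the first line of a file into buf and strip
 * the trailing newline. Returns 0 on success, -1 on any error.
 */
static int
sysfs_read_string(const char *path, char *buf, size_t len)
{
	FILE *fp;

	if (path == NULL || buf == NULL || len == 0)
		return -1;

	fp = fopen(path, "r");
	if (fp == NULL)
		return -1;

	if (fgets(buf, (int)len, fp) == NULL) {
		fclose(fp);
		return -1;
	}
	fclose(fp);

	/* Cut the string at the first newline, if there is one. */
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Hypothetical helper: read an integer (decimal, octal or 0x-prefixed
 * hex) from a file. Returns 0 on success, -1 on any error.
 */
static int
sysfs_read_int(const char *path, long *val)
{
	char buf[64];
	char *end;

	if (val == NULL || sysfs_read_string(path, buf, sizeof(buf)) < 0)
		return -1;

	errno = 0;
	*val = strtol(buf, &end, 0);
	if (errno != 0 || end == buf)
		return -1;

	return 0;
}
```

keeping the helpers free of eal dependencies like this would also make them easy to unit test against ordinary files.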

thanks

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v7 0/4] add support for self monitoring
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
                             ` (3 preceding siblings ...)
  2023-01-19 23:39           ` [PATCH v6 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-01 13:17           ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                               ` (5 more replies)
  4 siblings, 6 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, Tomasz Duszynski

This series adds self-monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self-monitoring functionality.

Events are enabled by names, which need to be read from standard
location under sysfs i.e

/sys/bus/event_source/devices/PMU/events

where PMU is a core pmu i.e one measuring cpu events. As of today
raw events are not supported.
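
As an illustration of the lookup described above, here is a minimal sketch that builds the sysfs path for a named event and extracts one term from an event description. It assumes the usual perf sysfs format, where an event file holds comma-separated term=value pairs such as "event=0xc4,umask=0x20". The helper names pmu_event_path() and pmu_event_term() are hypothetical and are not part of this series.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build the path of an event file, i.e.
 * /sys/bus/event_source/devices/<pmu>/events/<event>.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int
pmu_event_path(const char *pmu, const char *event, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/bus/event_source/devices/%s/events/%s",
			 pmu, event);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: find "term=value" in a comma-separated event
 * description and store the numeric value in *val.
 * Returns 0 if the term was found, -1 otherwise.
 */
static int
pmu_event_term(const char *desc, const char *term, unsigned long *val)
{
	size_t tlen = strlen(term);
	const char *p = desc;

	while (*p != '\0') {
		/* Length of the current "term=value" chunk. */
		size_t n = strcspn(p, ",");

		if (n > tlen && strncmp(p, term, tlen) == 0 && p[tlen] == '=') {
			*val = strtoul(p + tlen + 1, NULL, 0);
			return 0;
		}

		p += n;
		if (*p == ',')
			p++;
	}

	return -1;
}
```

The event name itself is whatever the directory lists for the core PMU on a given machine, so the sketch takes both the PMU and the event name as plain strings.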

v7:
- use per-lcore event group instead of global table indexed by lcore-id
- don't add pmu_autotest to fast tests due to lack of support on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  61 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   7 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   3 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 525 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 225 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  21 +
 23 files changed, 1138 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v7 1/4] lib: add generic support for reading PMU events
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   1 +
 app/test/test_pmu.c                    |  55 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   7 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 205 +++++++++++
 lib/pmu/version.map                    |  20 ++
 13 files changed, 811 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..7b6b69dcf1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..b30db35724
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set
+of programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 84b112a8b1..7e6062022a 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 * **Added multi-process support for axgbe PMD.**
 
 * **Updated Corigine nfp driver.**
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..4cf3161155
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,464 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g. in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..e360375a0c
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..50fb0f354e
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,20 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1
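
For readers following get_term_format() and parse_event() above, here is a
minimal standalone sketch of how a sysfs format description such as
"config1:13" or "config:0-7" turns into a config word index and a bit mask,
and how a term value is then placed into that mask. The helper names
(genmask_ull, field_prep, parse_term_format) are hypothetical and chosen for
illustration only; they mirror the GENMASK_ULL/FIELD_PREP macros in the patch
but are not part of it. The sketch parses a string instead of opening a file
and relies on the GCC/Clang builtin __builtin_ffsll, as the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Build a mask with bits low..high (inclusive) set; needs low <= high < 64. */
static uint64_t
genmask_ull(unsigned int high, unsigned int low)
{
	assert(low <= high && high < 64);
	return (~0ULL << low) & (~0ULL >> (63 - high));
}

/* Shift val up to the lowest set bit of mask and drop bits that do not fit. */
static uint64_t
field_prep(uint64_t mask, uint64_t val)
{
	if (mask == 0)
		return 0;
	return (val << (__builtin_ffsll((long long)mask) - 1)) & mask;
}

/*
 * Parse a format description: "<word>:<low>-<high>" or "<word>:<bit>".
 * A trailing digit 0..2 in <word> selects the config word, no digit means 0.
 * Returns 0 on success, -1 on malformed input.
 */
static int
parse_term_format(const char *desc, int *num, uint64_t *mask)
{
	char word[16];
	int low, high, ret;
	char last;

	ret = sscanf(desc, "%15[^:]:%d-%d", word, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single-bit field */
	if (low < 0 || high < low || high > 63)
		return -1;

	last = word[strlen(word) - 1];
	*num = (last >= '0' && last <= '2') ? last - '0' : 0;
	*mask = genmask_ull((unsigned int)high, (unsigned int)low);

	return 0;
}
```

With these helpers, a term "event=0x3c" whose format reads "config:0-7" would
be applied as config[0] |= field_prep(0xff, 0x3c), which is the same step
parse_event() performs for every comma-separated term.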



* [PATCH v7 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index b30db35724..c53a1bc2f1 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index e360375a0c..b18938dab1 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1
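
As a side note to read_attr_int() above, the following is a sketch of a more
defensive variant, assuming one wants explicit NUL termination before strtol()
and wants errno preserved across close(). The split into parse_attr_buf() and
read_attr_int() is an illustration only; parse_attr_buf is a hypothetical
helper and none of this is taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Parse a decimal integer from a buffer that is not NUL-terminated. */
static int
parse_attr_buf(const char *buf, size_t len, int *val)
{
	char tmp[32];
	char *end;
	long v;

	assert(buf != NULL && val != NULL);

	if (len == 0 || len >= sizeof(tmp))
		return -EINVAL;

	memcpy(tmp, buf, len);
	tmp[len] = '\0';

	errno = 0;
	v = strtol(tmp, &end, 10);
	if (errno != 0 || end == tmp || v < INT_MIN || v > INT_MAX)
		return -EINVAL;

	*val = (int)v;

	return 0;
}

/* Read an integer attribute such as /proc/sys/kernel/perf_user_access. */
static int
read_attr_int(const char *path, int *val)
{
	char buf[32];
	ssize_t n;
	int fd, err;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	n = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* save before close() can overwrite it */
	close(fd);
	if (n == -1)
		return -err;

	return parse_attr_buf(buf, (size_t)n, val);
}
```

The trailing newline that procfs appends is harmless here because strtol()
stops at the first non-digit character.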



* [PATCH v7 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 4/4] eal: add PMU support to tracing library Tomasz Duszynski
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index c53a1bc2f1..07cdc8f5ec 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -28,6 +28,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b18938dab1..0f7004c31c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1
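
To make the arithmetic in rte_pmu_read_userpage() and rte_pmu_pmc_read()
easier to follow, here is a small standalone sketch of the two steps applied
to a raw counter: joining the EAX/EDX halves returned by rdpmc, and
sign-extending the pmc_width-bit value before adding it to the offset kept in
the mmapped user page. The function names are hypothetical and the sketch
takes plain integers instead of touching hardware. It assumes the arithmetic
right shift and modular unsigned-to-signed conversion that GCC and Clang
provide.

```c
#include <assert.h>
#include <stdint.h>

/* Join the two 32-bit halves that rdpmc returns in EAX (low) and EDX (high). */
static uint64_t
combine_halves(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 32);
}

/* Sign-extend a raw counter that is `width` bits wide (1..64) to 64 bits. */
static int64_t
sign_extend(uint64_t raw, unsigned int width)
{
	unsigned int shift;

	assert(width >= 1 && width <= 64);
	shift = 64 - width;

	/* Shift left as unsigned, then shift right as signed to copy the top bit. */
	return (int64_t)(raw << shift) >> shift;
}

/* Event count = offset from the user page + sign-extended hardware value. */
static uint64_t
event_count(uint64_t offset, uint64_t raw, unsigned int width)
{
	return offset + (uint64_t)sign_extend(raw, width);
}
```

For a 48-bit counter, a raw value of 0xffffffffffff is treated as -1, so an
offset of 1000 yields 999. This is why the shift pair in
rte_pmu_read_userpage() operates on a signed variable.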



* [PATCH v7 4/4] eal: add PMU support to tracing library
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
                               ` (2 preceding siblings ...)
  2023-02-01 13:17             ` [PATCH v7 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:51             ` [PATCH v7 0/4] add support for self monitoring Morten Brørup
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, mattias.ronnblom, mb,
	thomas, zhoumin

In order to profile an app one needs to store a significant amount of
samples somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  3 ++
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 159 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively tracing library can be used which offers dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..a8e97ee1ec 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via ``--trace`` EAL option by passing both expression
+matching PMU tracepoint name i.e ``lib.eal.pmu.read`` and expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints this does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 5caaac8e59..3631d0032b 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..1e46ce549a 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..afb459b198 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +282,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..eddb45bebf 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,9 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index 4cf3161155..ae880c72b7 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -402,6 +402,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret = 0;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 0f7004c31c..0f6250e81f 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -169,6 +169,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index 50fb0f354e..20a27d085c 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -8,6 +8,7 @@ EXPERIMENTAL {
 	per_lcore__event_group;
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v7 0/4] add support for self monitoring
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
                               ` (3 preceding siblings ...)
  2023-02-01 13:17             ` [PATCH v7 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-01 13:51             ` Morten Brørup
  2023-02-02  7:54               ` Tomasz Duszynski
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  5 siblings, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2023-02-01 13:51 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, thomas, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 1 February 2023 14.18
> 
> This series adds self monitoring support i.e allows to configure and
> read performance measurement unit (PMU) counters in runtime without
> using perf utility. This has certain adventages when application runs
> on
> isolated cores with nohz_full kernel parameter.
> 
> Events can be read directly using rte_pmu_read() or using dedicated
> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> stored inside CTF file.
> 
> By design, all enabled events are grouped together and the same group
> is attached to lcores that use self monitoring funtionality.
> 
> Events are enabled by names, which need to be read from standard
> location under sysfs i.e
> 
> /sys/bus/event_source/devices/PMU/events
> 
> where PMU is a core pmu i.e one measuring cpu events. As of today
> raw events are not supported.

I like the modifications in v7.

Series-acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v7 0/4] add support for self monitoring
  2023-02-01 13:51             ` [PATCH v7 0/4] add support for self monitoring Morten Brørup
@ 2023-02-02  7:54               ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  7:54 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson,
	Jerin Jacob Kollanukkaran, mattias.ronnblom, thomas, zhoumin

Hi Morten,

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, February 1, 2023 2:51 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: roretzla@linux.microsoft.com; Ruifeng.Wang@arm.com; bruce.richardson@intel.com; Jerin Jacob
>Kollanukkaran <jerinj@marvell.com>; mattias.ronnblom@ericsson.com; thomas@monjalon.net;
>zhoumin@loongson.cn
>Subject: [EXT] RE: [PATCH v7 0/4] add support for self monitoring
>
>External Email
>
>----------------------------------------------------------------------
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Wednesday, 1 February 2023 14.18
>>
>> This series adds self monitoring support i.e allows to configure and
>> read performance measurement unit (PMU) counters in runtime without
>> using perf utility. This has certain adventages when application runs
>> on isolated cores with nohz_full kernel parameter.
>>
>> Events can be read directly using rte_pmu_read() or using dedicated
>> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to
>> be stored inside CTF file.
>>
>> By design, all enabled events are grouped together and the same group
>> is attached to lcores that use self monitoring funtionality.
>>
>> Events are enabled by names, which need to be read from standard
>> location under sysfs i.e
>>
>> /sys/bus/event_source/devices/PMU/events
>>
>> where PMU is a core pmu i.e one measuring cpu events. As of today raw
>> events are not supported.
>
>I like the modifications in v7.
>
>Series-acked-by: Morten Brørup <mb@smartsharesystems.com>

Thanks. 

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v8 0/4] add support for self monitoring
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
                               ` (4 preceding siblings ...)
  2023-02-01 13:51             ` [PATCH v7 0/4] add support for self monitoring Morten Brørup
@ 2023-02-02  9:43             ` Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                                 ` (4 more replies)
  5 siblings, 5 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, Tomasz Duszynski

This series adds self monitoring support, i.e. allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs on
isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by names, which need to be read from standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
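
To see which names can be passed, the events directory can simply be
listed. A minimal sketch, assuming a hypothetical helper list_pmu_events()
that is not part of this series and that prints every non-hidden entry
as-is:

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"

/*
 * Hypothetical helper (not part of the series): print the entries found
 * under <devices>/<pmu>/events. Returns the number of entries printed or
 * a negative errno value.
 */
static int
list_pmu_events(const char *pmu)
{
	char path[4096];
	struct dirent *dent;
	DIR *dirp;
	int n, count = 0;

	assert(pmu != NULL);

	n = snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events", pmu);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	dirp = opendir(path);
	if (dirp == NULL)
		return -errno;

	while ((dent = readdir(dirp)) != NULL) {
		/* skip "." and ".." as well as hidden entries */
		if (dent->d_name[0] == '.')
			continue;
		printf("%s\n", dent->d_name);
		count++;
	}

	closedir(dirp);
	return count;
}
```

Calling list_pmu_events("cpu") would be the typical use on a machine that
exposes a PMU named cpu; on other machines the name of the core PMU
directory has to be substituted.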

v8:
- just rebase series
v7:
- use per-lcore event group instead of global table index by lcore-id
- don't add pmu_autotest to fast tests due to lack of support on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  61 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   9 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   1 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 525 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 225 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  21 +
 23 files changed, 1138 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v8 1/4] lib: add generic support for reading PMU events
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02 10:32                 ` Ruifeng Wang
  2023-02-02  9:43               ` [PATCH v8 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin

Add support for programming PMU counters and reading their values
in runtime bypassing kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
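Note (illustration only, not part of the patch): get_term_format() and
parse_event() below turn a sysfs format entry such as ``config:8-15`` into
a bit mask and then place the term value under that mask. A minimal sketch
of that arithmetic, reusing the two macros from the patch and assuming a
hypothetical helper encode_term() plus made-up bit ranges and values:

```c
#include <assert.h>
#include <stdint.h>

/* same definitions as in lib/pmu/rte_pmu.c from this patch */
#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))

/*
 * Hypothetical helper (not part of the patch): OR val into bits
 * [low, high] of config. Bits of val that do not fit are dropped.
 */
static uint64_t
encode_term(uint64_t config, int low, int high, uint64_t val)
{
	uint64_t mask;

	assert(low >= 0 && low <= high && high < 64);

	mask = GENMASK_ULL(high, low);

	return config | FIELD_PREP(mask, val);
}
```

With a term occupying bits 0-7 set to 0x3c and a second term occupying
bits 8-15 set to 0x01, two successive calls build the config word 0x013c.
The macros rely on the GCC/Clang builtin __builtin_ffsll().
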
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   1 +
 app/test/test_pmu.c                    |  55 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   9 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 205 +++++++++++
 lib/pmu/version.map                    |  20 ++
 13 files changed, 813 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..7b6b69dcf1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..a9bfb1a427
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 73f5d94e14..733541d56c 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -55,10 +55,19 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 * **Updated AMD axgbe driver.**
 
   * Added multi-process support.
 
+* **Added multi-process support for axgbe PMD.**
+
 * **Updated Corigine nfp driver.**
 
   * Added support for meter options.
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..4cf3161155
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,464 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g. in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..e360375a0c
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to set up PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, a stream of zeros in a trace file
+ *   indicates a problem with reading a particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..50fb0f354e
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,20 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1


* [PATCH v8 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index a9bfb1a427..623e04b691 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+	buf[ret] = '\0';
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index e360375a0c..b18938dab1 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1


* [PATCH v8 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 623e04b691..614395482f 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -28,6 +28,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b18938dab1..0f7004c31c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1


* [PATCH v8 4/4] eal: add PMU support to tracing library
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
                                 ` (2 preceding siblings ...)
  2023-02-02  9:43               ` [PATCH v8 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, mattias.ronnblom, mb,
	thomas, zhoumin

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  1 +
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..a8e97ee1ec 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+In contrast to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 75162b722d..8796052d0c 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..1e46ce549a 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..afb459b198 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +282,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 6523102157..2f8f66874b 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -441,6 +441,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_thread_set_name;
 };
 
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index 4cf3161155..ae880c72b7 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -402,6 +402,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret = 0;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 0f7004c31c..0f6250e81f 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -169,6 +169,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index 50fb0f354e..20a27d085c 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -8,6 +8,7 @@ EXPERIMENTAL {
 	per_lcore__event_group;
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1


* RE: [PATCH v8 1/4] lib: add generic support for reading PMU events
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-02 10:32                 ` Ruifeng Wang
  0 siblings, 0 replies; 139+ messages in thread
From: Ruifeng Wang @ 2023-02-02 10:32 UTC (permalink / raw)
  To: Tomasz Duszynski, dev, thomas
  Cc: roretzla, bruce.richardson, jerinj, mattias.ronnblom, mb, zhoumin, nd

> -----Original Message-----
> From: Tomasz Duszynski <tduszynski@marvell.com>
> Sent: Thursday, February 2, 2023 5:44 PM
> To: dev@dpdk.org; thomas@monjalon.net; Tomasz Duszynski <tduszynski@marvell.com>
> Cc: roretzla@linux.microsoft.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com; jerinj@marvell.com; mattias.ronnblom@ericsson.com;
> mb@smartsharesystems.com; zhoumin@loongson.cn
> Subject: [PATCH v8 1/4] lib: add generic support for reading PMU events
> 
> Add support for programming PMU counters and reading their values in runtime bypassing
> kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use standard perf utility
> without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   1 +
>  app/test/test_pmu.c                    |  55 +++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |   8 +
>  doc/guides/rel_notes/release_23_03.rst |   9 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  29 ++
>  lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 205 +++++++++++
>  lib/pmu/version.map                    |  20 ++
>  13 files changed, 813 insertions(+), 1 deletion(-)  create mode 100644
> app/test/test_pmu.c  create mode 100644 lib/pmu/meson.build  create mode 100644
> lib/pmu/pmu_private.h  create mode 100644 lib/pmu/rte_pmu.c  create mode 100644
> lib/pmu/rte_pmu.h  create mode 100644 lib/pmu/version.map
>
 
<snip>

> diff --git a/doc/guides/rel_notes/release_23_03.rst
> b/doc/guides/rel_notes/release_23_03.rst
> index 73f5d94e14..733541d56c 100644
> --- a/doc/guides/rel_notes/release_23_03.rst
> +++ b/doc/guides/rel_notes/release_23_03.rst
> @@ -55,10 +55,19 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **Added PMU library.**
> +
> +  Added a new PMU (performance measurement unit) library which allows

Overall looks good to me. Just a minor comment.
Should it be 'performance *monitoring* unit'?
I see the same terminology is used across architectures. It will be better if we align with that.

Thanks.

<snip>

* [PATCH v9 0/4] add support for self monitoring
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
                                 ` (3 preceding siblings ...)
  2023-02-02  9:43               ` [PATCH v8 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-02 12:49               ` Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                                   ` (4 more replies)
  4 siblings, 5 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters in runtime without
using the perf utility. This has certain advantages when an application runs on
isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by names, which need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
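
As a rough illustration of the sysfs lookup described above, the sketch
below builds the path of a named event and checks that it is readable,
similar in spirit to the access() check done in rte_pmu_add_event(). The
helper names build_event_path() and event_available() are hypothetical
and exist only for this example; they are not part of the series.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"

/* Hypothetical helper: format the sysfs path of event 'name' exposed by 'pmu'. */
static int
build_event_path(char *buf, size_t len, const char *pmu, const char *name)
{
	int n;

	n = snprintf(buf, len, EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu, name);
	/* treat truncated output as an error instead of probing a wrong path */
	if (n < 0 || (size_t)n >= len)
		return -ENAMETOOLONG;

	return 0;
}

/* Hypothetical helper: 0 if the event is listed and readable, negative errno otherwise. */
static int
event_available(const char *pmu, const char *name)
{
	char path[4096];
	int ret;

	ret = build_event_path(path, sizeof(path), pmu, name);
	if (ret)
		return ret;

	return access(path, R_OK) == 0 ? 0 : -errno;
}
```

For instance, event_available("cpu", "cpu-cycles") is expected to return 0
only on a machine whose core PMU is exposed as "cpu" and lists that event.
On other systems the PMU directory name has to be discovered first, as
scan_pmus() does in patch 1/4.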

v9:
- fix 'maybe-uninitialized' warning reported by CI
v8:
- just rebase series
v7:
- use per-lcore event group instead of global table indexed by lcore-id
- don't add pmu_autotest to fast tests because it is not supported on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  61 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   9 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   1 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 525 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 225 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  21 +
 23 files changed, 1138 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1



* [PATCH v9 1/4] lib: add generic support for reading PMU events
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-06 11:02                   ` David Marchand
  2023-02-02 12:49                 ` [PATCH v9 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.
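
For readers unfamiliar with the technique, the lock-free read of a counter
through the mmapped perf user page can be sketched as below. This is a
simplified illustration under stated assumptions, not the code in this patch:
struct user_page is a hypothetical, trimmed-down stand-in for
struct perf_event_mmap_page, and fake_read_hw() replaces the
architecture-specific counter read so the sketch runs anywhere.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct perf_event_mmap_page. */
struct user_page {
	uint32_t lock;      /* sequence counter bumped by the kernel on update */
	uint32_t index;     /* hardware counter index + 1, 0 if unavailable */
	int64_t offset;     /* value accumulated by the kernel so far */
	uint16_t pmc_width; /* width of the hardware counter in bits */
};

/* Sign-extend a raw counter value that is only `width` bits wide. */
static int64_t
sign_extend(uint64_t raw, unsigned int width)
{
	uint64_t sign, mask;

	assert(width >= 1 && width <= 64);
	sign = 1ULL << (width - 1);
	mask = width == 64 ? ~0ULL : (1ULL << width) - 1;

	raw &= mask;
	return (int64_t)((raw ^ sign) - sign);
}

/* Stand-in for rdpmc/mrs: pretends every counter holds a 48-bit "-1". */
static uint64_t
fake_read_hw(uint32_t index)
{
	(void)index;
	return 0xFFFFFFFFFFFFULL;
}

/* read_hw is a placeholder for an architecture-specific counter read. */
static uint64_t
read_counter(const volatile struct user_page *pc, uint64_t (*read_hw)(uint32_t))
{
	uint32_t seq, index;
	int64_t count;

	do {
		seq = pc->lock;
		atomic_signal_fence(memory_order_seq_cst); /* compiler barrier */

		index = pc->index;
		count = pc->offset;
		/* index 0 means the counter cannot be read from userspace */
		if (index != 0)
			count += sign_extend(read_hw(index - 1), pc->pmc_width);

		atomic_signal_fence(memory_order_seq_cst);
	} while (pc->lock != seq); /* retry if the page changed meanwhile */

	return (uint64_t)count;
}
```

The retry loop is what makes the read safe without a syscall: if the kernel
rewrites the page while it is being sampled, the sequence number changes and
the values are read again.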

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   1 +
 app/test/test_pmu.c                    |  55 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   9 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 205 +++++++++++
 lib/pmu/version.map                    |  20 ++
 13 files changed, 813 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..7b6b69dcf1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..a9bfb1a427
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides
+a set of programmable counters that monitor specific events. There are different tools which
+can gather that information, perf being one example. However, in some scenarios, e.g. when CPU
+cores are isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such
+cases one can read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 73f5d94e14..733541d56c 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -55,10 +55,19 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 * **Updated AMD axgbe driver.**
 
   * Added multi-process support.
 
+* **Added multi-process support for axgbe PMD.**
+
 * **Updated Corigine nfp driver.**
 
   * Added support for meter options.
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..4cf3161155
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,464 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf), fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..e360375a0c
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..50fb0f354e
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,20 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1



* [PATCH v9 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading ARM PMU events in runtime.
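
On arm64 the patch has to switch on userspace counter access through
/proc/sys/kernel/perf_user_access and put the previous value back on cleanup.
The save/enable/restore pattern can be sketched as follows. This is an
illustrative variant rather than the code below: the helpers take the path as
a parameter so they can be tried on an ordinary file, they terminate the
buffer before calling strtol(), and they save errno before close() can
overwrite it. enable_user_access() is a hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Read a small decimal integer from a file; 0 on success, -errno on failure. */
static int
read_attr_int(const char *path, int *val)
{
	char buf[32];
	ssize_t len;
	int fd, err;

	assert(path != NULL && val != NULL);

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	len = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* keep read()'s errno; close() may overwrite it */
	close(fd);
	if (len == -1)
		return -err;

	buf[len] = '\0'; /* read() does not NUL-terminate */
	*val = (int)strtol(buf, NULL, 10);

	return 0;
}

/* Write a decimal integer to a file; 0 on success, negative errno on failure. */
static int
write_attr_int(const char *path, int val)
{
	char buf[32];
	ssize_t ret;
	int fd, len, err;

	fd = open(path, O_WRONLY | O_TRUNC);
	if (fd == -1)
		return -errno;

	len = snprintf(buf, sizeof(buf), "%d", val);
	ret = write(fd, buf, (size_t)len);
	err = errno;
	close(fd);

	return ret == len ? 0 : (ret == -1 ? -err : -EIO);
}

/* Remember the current setting in *saved, then switch user access on. */
static int
enable_user_access(const char *path, int *saved)
{
	int ret = read_attr_int(path, saved);

	if (ret != 0)
		return ret;

	return *saved == 1 ? 0 : write_attr_int(path, 1);
}
```

Cleanup is then a single write_attr_int(path, saved) call. Writing to the real
sysctl needs sufficient privileges, so callers should be prepared for a
negative errno value such as -EACCES.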

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index a9bfb1a427..623e04b691 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index e360375a0c..b18938dab1 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1



* [PATCH v9 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 623e04b691..614395482f 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -28,6 +28,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b18938dab1..0f7004c31c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1



* [PATCH v9 4/4] eal: add PMU support to tracing library
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
                                   ` (2 preceding siblings ...)
  2023-02-02 12:49                 ` [PATCH v9 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, mattias.ronnblom, mb,
	thomas, zhoumin

In order to profile an app, one needs to store a significant number of
samples somewhere for later analysis. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  1 +
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers the dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..a8e97ee1ec 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order
+of events specified via the ``--trace`` parameter. In the following example, index
+``0`` corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 75162b722d..8796052d0c 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..1e46ce549a 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..afb459b198 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +282,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 6523102157..2f8f66874b 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -441,6 +441,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_thread_set_name;
 };
 
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index 4cf3161155..f1c5630344 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -402,6 +402,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret = 0;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 0f7004c31c..0f6250e81f 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -169,6 +169,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index 50fb0f354e..20a27d085c 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -8,6 +8,7 @@ EXPERIMENTAL {
 	per_lcore__event_group;
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1
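
To make the e=ev1[,ev2,...] matching in rte_pmu_add_events_by_pattern()
easier to follow, here is a minimal standalone sketch of the same idea using
POSIX regex. It is not the patch code: collect_events(), MAX_EVENTS and
MAX_NAME are hypothetical names, and instead of calling rte_pmu_add_event()
the sketch only copies the event names into a caller-provided array. It also
splits the list with strcspn() rather than strtok(), which is an assumption
made here to avoid modifying a temporary copy.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 8	/* hypothetical limit, not taken from the patch */
#define MAX_NAME 64	/* hypothetical limit, not taken from the patch */

/*
 * Hypothetical helper: find every e=ev1[,ev2,...] occurrence in pattern and
 * copy the event names into names[]. Returns the number of names stored, or
 * -1 if the regular expression cannot be compiled.
 */
static int
collect_events(const char *pattern, char names[MAX_EVENTS][MAX_NAME])
{
	regmatch_t m;
	regex_t reg;
	int count = 0;

	assert(pattern != NULL);

	/* same expression as used in the patch */
	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
		return -1;

	while (count < MAX_EVENTS && regexec(&reg, pattern, 1, &m, 0) == 0) {
		const char *p = pattern + m.rm_so + 2;	/* skip the "e=" prefix */
		const char *end = pattern + m.rm_eo;

		while (p < end && count < MAX_EVENTS) {
			/* name runs up to the next comma or the end of the match */
			size_t len = strcspn(p, ",");

			if (len > (size_t)(end - p))
				len = (size_t)(end - p);
			if (len > 0)
				snprintf(names[count++], MAX_NAME, "%.*s", (int)len, p);

			p += len;
			if (p < end)
				p++;	/* skip the comma */
		}

		/* continue searching after this match */
		pattern += m.rm_eo;
	}

	regfree(&reg);

	return count;
}
```

With the --trace value from the documentation above,
'.*pmu.read\|e=cpu_cycles,l1d_cache', the helper stores cpu_cycles at
index 0 and l1d_cache at index 1, which is the order later used by
rte_eal_trace_pmu_read(index).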


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26 15:28                     ` [EXT] " Tomasz Duszynski
@ 2023-02-02 14:27                       ` Morten Brørup
  0 siblings, 0 replies; 139+ messages in thread
From: Morten Brørup @ 2023-02-02 14:27 UTC (permalink / raw)
  To: Tomasz Duszynski, Bruce Richardson
  Cc: dev, Thomas Monjalon, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin, roretzla

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 26 January 2023 16.28
> 
> >From: Bruce Richardson <bruce.richardson@intel.com>
> >Sent: Thursday, January 26, 2023 1:59 PM
> >
> >Other possible complication is - how does this work with threads that
> are not pinned to a
> >particular physical core? Do things work as expected in that case?
> >
> 
> It's assumed that once a set of counters is enabled on a particular lcore
> then this thread shouldn't be migrating
> back and forth, for obvious reasons.
> 
> But, once scheduled elsewhere all should still work as expected.

Just to elaborate what Tomasz stated here...

The patch contains this line (code comments are mine):

return syscall(SYS_perf_event_open, &attr, /*pid*/ 0, /*cpu*/ -1, group_fd, 0);

And man 2 perf_event_open [1] says:

pid == 0 and cpu == -1:
	This measures the calling process/thread on any CPU.

So it should work just fine, even when a thread is not pinned to a particular physical core.

[1]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
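
For illustration, the pid == 0 / cpu == -1 combination can be sketched in
isolation as below. This is a simplified example and not the patch code:
init_cycles_attr() and open_for_calling_thread() are hypothetical helper
names, and the sketch assumes the generic PERF_TYPE_HARDWARE cycles event
instead of the PERF_TYPE_RAW configuration that the patch builds from sysfs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for syscall() */
#endif
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: describe a user-space-only CPU cycles counter. */
static void
init_cycles_attr(struct perf_event_attr *attr)
{
	assert(attr != NULL);

	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->exclude_kernel = 1;
	attr->exclude_hv = 1;
	attr->disabled = 1;
}

/*
 * Hypothetical helper: open the counter for the calling thread (pid == 0)
 * on whichever CPU it happens to run on (cpu == -1).
 * Returns a file descriptor, or a negative errno value on failure.
 */
static int
open_for_calling_thread(struct perf_event_attr *attr, int group_fd)
{
	long fd;

	assert(attr != NULL);

	fd = syscall(SYS_perf_event_open, attr, (pid_t)0, -1, group_fd, 0UL);

	return fd == -1 ? -errno : (int)fd;
}
```

A caller would pass group_fd == -1 to create a group leader and then pass the
leader's descriptor for the remaining events. Whether the call succeeds
depends on the system configuration, so the return value should always be
checked.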


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v9 1/4] lib: add generic support for reading PMU events
  2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-06 11:02                   ` David Marchand
  2023-02-09 11:09                     ` [EXT] " Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: David Marchand @ 2023-02-06 11:02 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, Thomas Monjalon, roretzla, Ruifeng.Wang, bruce.richardson,
	jerinj, mattias.ronnblom, mb, zhoumin

Hello,

On Thu, Feb 2, 2023 at 1:50 PM Tomasz Duszynski <tduszynski@marvell.com> wrote:
>
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
>
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.

For my understanding, what OS capabilities/permissions are required to
use this library?


>
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   1 +
>  app/test/test_pmu.c                    |  55 +++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |   8 +
>  doc/guides/rel_notes/release_23_03.rst |   9 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  29 ++
>  lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 205 +++++++++++
>  lib/pmu/version.map                    |  20 ++
>  13 files changed, 813 insertions(+), 1 deletion(-)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/pmu/meson.build
>  create mode 100644 lib/pmu/pmu_private.h
>  create mode 100644 lib/pmu/rte_pmu.c
>  create mode 100644 lib/pmu/rte_pmu.h
>  create mode 100644 lib/pmu/version.map
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9a0f416d2e..9f13eafd95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>  F: lib/node/
>
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski <tduszynski@marvell.com>
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
>
>  Test Applications
>  -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..7b6b69dcf1 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -111,6 +111,7 @@ test_sources = files(
>          'test_reciprocal_division_perf.c',
>          'test_red.c',
>          'test_pie.c',
> +        'test_pmu.c',
>          'test_reorder.c',
>          'test_rib.c',
>          'test_rib6.c',

This code adds a new test.
This test should be added to an existing testsuite, like fast-tests etc...


> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..a9bfb1a427
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2023 Marvell International Ltd.
> + */
> +
> +#include "test.h"
> +
> +#ifndef RTE_EXEC_ENV_LINUX
> +
> +static int
> +test_pmu(void)
> +{
> +       printf("pmu_autotest only supported on Linux, skipping test\n");
> +       return TEST_SKIPPED;
> +}
> +
> +#else
> +
> +#include <rte_pmu.h>
> +
> +static int
> +test_pmu_read(void)
> +{
> +       int tries = 10, event = -1;
> +       uint64_t val = 0;
> +
> +       if (rte_pmu_init() < 0)
> +               return TEST_FAILED;
> +
> +       while (tries--)
> +               val += rte_pmu_read(event);
> +
> +       rte_pmu_fini();
> +
> +       return val ? TEST_SUCCESS : TEST_FAILED;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +       .suite_name = "pmu autotest",
> +       .setup = NULL,
> +       .teardown = NULL,
> +       .unit_test_cases = {
> +               TEST_CASE(test_pmu_read),
> +               TEST_CASES_END()
> +       }
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +       return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index de488c7abf..7f1938f92f 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -222,7 +222,8 @@ The public API headers are grouped by topics:
>    [log](@ref rte_log.h),
>    [errno](@ref rte_errno.h),
>    [trace](@ref rte_trace.h),
> -  [trace_point](@ref rte_trace_point.h)
> +  [trace_point](@ref rte_trace_point.h),
> +  [pmu](@ref rte_pmu.h)
>
>  - **misc**:
>    [EAL config](@ref rte_eal.h),
> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
> index f0886c3bd1..920e615996 100644
> --- a/doc/api/doxy-api.conf.in
> +++ b/doc/api/doxy-api.conf.in
> @@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
>                            @TOPDIR@/lib/pci \
>                            @TOPDIR@/lib/pdump \
>                            @TOPDIR@/lib/pipeline \
> +                          @TOPDIR@/lib/pmu \
>                            @TOPDIR@/lib/port \
>                            @TOPDIR@/lib/power \
>                            @TOPDIR@/lib/rawdev \
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index 14292d4c25..a8b501fe0c 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,14 @@ Profile Your Application
>  The following sections describe methods of profiling DPDK applications on
>  different architectures.
>
> +Performance counter based profiling
> +-----------------------------------
> +
> +Majority of architectures support some sort hardware measurement unit which provides a set of
> +programmable counters that monitor specific events. There are different tools which can gather
> +that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
> +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> +read specific events directly from application via ``rte_pmu_read()``.

We need a common definition in the documentation of what PMU stands
for and use it consistently.
I am not sure this documentation is the best place, but at least, I'd
prefer we go with "Performance Monitoring Unit" and stick to it.

Plus, this block is a bit hard to read too, what do you think of:

"""
A majority of architectures support some performance monitoring unit (PMU).
Such unit provides programmable counters that monitor specific events.

Different tools gather that information, like for example perf.
However, in some scenarios when CPU cores are isolated (nohz_full) and
run dedicated tasks, interrupting those tasks with perf may be
undesirable.
In such cases, an application can use the PMU library to read such
events via ``rte_pmu_read()``.
"""

And, a double newline is used between sections in this doc.


>
>  Profiling on x86
>  ----------------
> diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
> index 73f5d94e14..733541d56c 100644
> --- a/doc/guides/rel_notes/release_23_03.rst
> +++ b/doc/guides/rel_notes/release_23_03.rst
> @@ -55,10 +55,19 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
>
> +* **Added PMU library.**
> +
> +  Added a new PMU (performance measurement unit) library which allows applications

Performance Monitoring Unit.

> +  to perform self monitoring activities without depending on external utilities like perf.

> +  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
> +  can be stored in CTF format for further analysis.

Afaiu, this integration comes later in the series.
This part of the RN update should go with it.


> +
>  * **Updated AMD axgbe driver.**
>
>    * Added multi-process support.
>
> +* **Added multi-process support for axgbe PMD.**
> +
>  * **Updated Corigine nfp driver.**
>

Unrelated rebase damage.. please pay attention to such detail.


>    * Added support for meter options.
> diff --git a/lib/meson.build b/lib/meson.build
> index a90fee31b7..7132131b5c 100644
> --- a/lib/meson.build
> +++ b/lib/meson.build
> @@ -11,6 +11,7 @@
>  libraries = [
>          'kvargs', # eal depends on kvargs
>          'telemetry', # basic info querying
> +        'pmu',
>          'eal', # everything depends on eal
>          'ring',
>          'rcu', # rcu depends on ring
> diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
> new file mode 100644
> index 0000000000..a4160b494e
> --- /dev/null
> +++ b/lib/pmu/meson.build
> @@ -0,0 +1,13 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(C) 2023 Marvell International Ltd.
> +
> +if not is_linux
> +    build = false
> +    reason = 'only supported on Linux'
> +    subdir_done()
> +endif
> +
> +includes = [global_inc]
> +
> +sources = files('rte_pmu.c')
> +headers = files('rte_pmu.h')
> diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
> new file mode 100644
> index 0000000000..849549b125
> --- /dev/null
> +++ b/lib/pmu/pmu_private.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.

Please describe config[].


> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
> new file mode 100644
> index 0000000000..4cf3161155
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.c
> @@ -0,0 +1,464 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2023 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>

Asking to be sure because I did not check:
do we need all those includes, or is this just copy/pasted from somewhere else?


> +
> +#include <rte_atomic.h>
> +#include <rte_per_lcore.h>
> +#include <rte_pmu.h>
> +#include <rte_spinlock.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL

This macro is copy/pasted all over the dpdk tree...
This is worth a cleanup later, read: I am not asking for it as part of
this series.

However, here, there is no need for protecting against its definition.


> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif

Idem.
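
As a side note for readers of the two macros quoted above, their arithmetic
can be checked in isolation. The sketch below rewrites them as inline
functions; genmask_ull() and field_prep() are hypothetical names used only
here, and the assert() preconditions are assumptions that the macros
themselves do not enforce.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with bits l..h set, for 0 <= l <= h <= 63. */
static inline uint64_t
genmask_ull(unsigned int h, unsigned int l)
{
	assert(l <= h && h < 64);

	/* bits l..63 set, intersected with bits 0..h set */
	return (~0ULL - (1ULL << l) + 1) & (~0ULL >> (64 - 1 - h));
}

/* Hypothetical helper: shift v to the lowest set bit of m and clip it to m. */
static inline uint64_t
field_prep(uint64_t m, uint64_t v)
{
	assert(m != 0);

	return (v << (__builtin_ffsll((long long)m) - 1)) & m;
}
```

For example, a sysfs format entry such as "config:8-15" would yield
genmask_ull(15, 8) == 0xff00, and field_prep(0xff00, 0x12) places the value
in that field as 0x1200.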


> +
> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> +struct rte_pmu rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */

Prefer using per architectures #ifdef.
It is easier to get a simple link error than use weak symbols that
make it look like it could work on some arch.


> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +       return 0;

Add a debug log message indicating that this arch does not support PMU.


> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
> +{
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +       char *config = NULL;
> +       char path[PATH_MAX];
> +       int high, low, ret;
> +       FILE *fp;

Reverse xmas tree when possible.


> +
> +       /* quiesce -Wmaybe-uninitialized warning */

This comment just seems to be a note for yourself.
What was the issue exactly?


> +       *num = 0;
> +       *mask = 0;
> +
> +       snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
> +       fp = fopen(path, "r");
> +       if (fp == NULL)
> +               return -errno;
> +
> +       errno = 0;
> +       ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +       if (ret < 2) {
> +               ret = -ENODATA;
> +               goto out;
> +       }
> +       if (errno) {
> +               ret = -errno;
> +               goto out;
> +       }
> +
> +       if (ret == 2)
> +               high = low;
> +
> +       *mask = GENMASK_ULL(high, low);
> +       /* Last digit should be [012]. If last digit is missing 0 is implied. */
> +       *num = config[strlen(config) - 1];
> +       *num = isdigit(*num) ? *num - '0' : 0;
> +
> +       ret = 0;
> +out:
> +       free(config);
> +       fclose(fp);
> +
> +       return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +       char *token, *term;
> +       int num, ret, val;
> +       uint64_t mask;
> +
> +       config[0] = config[1] = config[2] = 0;
> +
> +       token = strtok(buf, ",");
> +       while (token) {
> +               errno = 0;
> +               /* <term>=<value> */
> +               ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +               if (ret < 1)
> +                       return -ENODATA;
> +               if (errno)
> +                       return -errno;
> +               if (ret == 1)
> +                       val = 1;
> +
> +               ret = get_term_format(term, &num, &mask);
> +               free(term);
> +               if (ret)
> +                       return ret;
> +
> +               config[num] |= FIELD_PREP(mask, val);
> +               token = strtok(NULL, ",");
> +       }
> +
> +       return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +       char path[PATH_MAX], buf[BUFSIZ];
> +       FILE *fp;
> +       int ret;
> +
> +       snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +       fp = fopen(path, "r");
> +       if (fp == NULL)
> +               return -errno;
> +
> +       ret = fread(buf, 1, sizeof(buf), fp);
> +       if (ret == 0) {
> +               fclose(fp);
> +
> +               return -EINVAL;
> +       }
> +       fclose(fp);
> +       buf[ret] = '\0';
> +
> +       return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int group_fd)
> +{
> +       struct perf_event_attr attr = {
> +               .size = sizeof(struct perf_event_attr),
> +               .type = PERF_TYPE_RAW,
> +               .exclude_kernel = 1,
> +               .exclude_hv = 1,
> +               .disabled = 1,
> +       };
> +
> +       pmu_arch_fixup_config(config);
> +
> +       attr.config = config[0];
> +       attr.config1 = config[1];
> +       attr.config2 = config[2];
> +
> +       return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
> +}
> +
> +static int
> +open_events(struct rte_pmu_event_group *group)
> +{
> +       struct rte_pmu_event *event;
> +       uint64_t config[3];
> +       int num = 0, ret;
> +
> +       /* group leader gets created first, with fd = -1 */
> +       group->fds[0] = -1;
> +
> +       TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +               ret = get_event_config(event->name, config);
> +               if (ret)
> +                       continue;
> +
> +               ret = do_perf_event_open(config, group->fds[0]);
> +               if (ret == -1) {
> +                       ret = -errno;
> +                       goto out;
> +               }
> +
> +               group->fds[event->index] = ret;
> +               num++;
> +       }
> +
> +       return 0;
> +out:
> +       for (--num; num >= 0; num--) {
> +               close(group->fds[num]);
> +               group->fds[num] = -1;
> +       }
> +
> +
> +       return ret;
> +}
> +
> +static int
> +mmap_events(struct rte_pmu_event_group *group)
> +{
> +       long page_size = sysconf(_SC_PAGE_SIZE);
> +       unsigned int i;
> +       void *addr;
> +       int ret;
> +
> +       for (i = 0; i < rte_pmu.num_group_events; i++) {
> +               addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
> +               if (addr == MAP_FAILED) {
> +                       ret = -errno;
> +                       goto out;
> +               }
> +
> +               group->mmap_pages[i] = addr;
> +       }
> +
> +       return 0;
> +out:
> +       for (; i; i--) {
> +               munmap(group->mmap_pages[i - 1], page_size);
> +               group->mmap_pages[i - 1] = NULL;
> +       }
> +
> +       return ret;
> +}
> +
> +static void
> +cleanup_events(struct rte_pmu_event_group *group)
> +{
> +       unsigned int i;
> +
> +       if (group->fds[0] != -1)
> +               ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +       for (i = 0; i < rte_pmu.num_group_events; i++) {
> +               if (group->mmap_pages[i]) {
> +                       munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
> +                       group->mmap_pages[i] = NULL;
> +               }
> +
> +               if (group->fds[i] != -1) {
> +                       close(group->fds[i]);
> +                       group->fds[i] = -1;
> +               }
> +       }
> +
> +       group->enabled = false;
> +}
> +
> +int __rte_noinline

This symbol is exported out of this library, no need for noinline.


> +rte_pmu_enable_group(void)
> +{
> +       struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> +       int ret;
> +
> +       if (rte_pmu.num_group_events == 0)
> +               return -ENODEV;
> +
> +       ret = open_events(group);
> +       if (ret)
> +               goto out;
> +
> +       ret = mmap_events(group);
> +       if (ret)
> +               goto out;
> +
> +       if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> +               ret = -errno;
> +               goto out;
> +       }
> +
> +       if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +               ret = -errno;
> +               goto out;
> +       }
> +
> +       rte_spinlock_lock(&rte_pmu.lock);
> +       TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> +       rte_spinlock_unlock(&rte_pmu.lock);
> +       group->enabled = true;
> +
> +       return 0;
> +
> +out:
> +       cleanup_events(group);
> +
> +       return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +       char path[PATH_MAX];
> +       struct dirent *dent;
> +       const char *name;
> +       DIR *dirp;
> +
> +       dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +       if (dirp == NULL)
> +               return -errno;
> +
> +       while ((dent = readdir(dirp))) {
> +               name = dent->d_name;
> +               if (name[0] == '.')
> +                       continue;
> +
> +               /* sysfs entry should either contain cpus or be a cpu */
> +               if (!strcmp(name, "cpu"))
> +                       break;
> +
> +               snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +               if (access(path, F_OK) == 0)
> +                       break;
> +       }
> +
> +       if (dent) {
> +               rte_pmu.name = strdup(name);
> +               if (rte_pmu.name == NULL) {
> +                       closedir(dirp);
> +
> +                       return -ENOMEM;
> +               }
> +       }
> +
> +       closedir(dirp);
> +
> +       return rte_pmu.name ? 0 : -ENODEV;
> +}
> +
> +static struct rte_pmu_event *
> +new_event(const char *name)
> +{
> +       struct rte_pmu_event *event;
> +
> +       event = calloc(1, sizeof(*event));
> +       if (event == NULL)
> +               goto out;
> +
> +       event->name = strdup(name);
> +       if (event->name == NULL) {
> +               free(event);
> +               event = NULL;
> +       }
> +
> +out:
> +       return event;
> +}
> +
> +static void
> +free_event(struct rte_pmu_event *event)
> +{
> +       free(event->name);
> +       free(event);
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +       struct rte_pmu_event *event;
> +       char path[PATH_MAX];
> +
> +       if (rte_pmu.name == NULL)
> +               return -ENODEV;
> +
> +       if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
> +               return -ENOSPC;
> +
> +       snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +       if (access(path, R_OK))
> +               return -ENODEV;
> +
> +       TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +               if (!strcmp(event->name, name))
> +                       return event->index;
> +               continue;
> +       }
> +
> +       event = new_event(name);
> +       if (event == NULL)
> +               return -ENOMEM;
> +
> +       event->index = rte_pmu.num_group_events++;
> +       TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
> +
> +       return event->index;
> +}
> +
> +int
> +rte_pmu_init(void)
> +{
> +       int ret;
> +
> +       /* Allow calling init from multiple contexts within a single thread. This simplifies
> +        * resource management a bit e.g in case fast-path tracepoint has already been enabled
> +        * via command line but application doesn't care enough and performs init/fini again.
> +        */
> +       if (rte_pmu.initialized) {

This is an integer so check against 0 explicitly (there may be other
cases in this patch, I did not recheck the whole patch).


> +               rte_pmu.initialized++;
> +               return 0;
> +       }
> +
> +       ret = scan_pmus();
> +       if (ret)
> +               goto out;
> +
> +       ret = pmu_arch_init();
> +       if (ret)
> +               goto out;
> +
> +       TAILQ_INIT(&rte_pmu.event_list);
> +       TAILQ_INIT(&rte_pmu.event_group_list);
> +       rte_spinlock_init(&rte_pmu.lock);
> +       rte_pmu.initialized = 1;
> +
> +       return 0;
> +out:
> +       free(rte_pmu.name);
> +       rte_pmu.name = NULL;
> +
> +       return ret;
> +}
> +
> +void
> +rte_pmu_fini(void)
> +{
> +       struct rte_pmu_event_group *group, *tmp_group;
> +       struct rte_pmu_event *event, *tmp_event;
> +
> +       /* cleanup once init count drops to zero */
> +       if (!rte_pmu.initialized || --rte_pmu.initialized)
> +               return;
> +
> +       RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
> +               TAILQ_REMOVE(&rte_pmu.event_list, event, next);
> +               free_event(event);
> +       }
> +
> +       RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
> +               TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
> +               cleanup_events(group);
> +       }
> +
> +       pmu_arch_fini();
> +       free(rte_pmu.name);
> +       rte_pmu.name = NULL;
> +       rte_pmu.num_group_events = 0;
> +}
> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> new file mode 100644
> index 0000000000..e360375a0c
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +#include <rte_spinlock.h>
> +
> +/** Maximum number of events in a group */
> +#define MAX_NUM_GROUP_EVENTS 8
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +       struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> +       int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> +       bool enabled; /**< true if group was enabled on particular lcore */
> +       TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> +} __rte_cache_aligned;

One problem for the future is that we have a fixed size fd array.
Do we need to expose this whole structure to the application?


> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +       char *name; /**< name of an event */
> +       unsigned int index; /**< event index into fds/mmap_pages */

This is an internal consideration.
Do we need to expose this to the application?


> +       TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +       char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> +       rte_spinlock_t lock; /**< serialize access to event group list */
> +       TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> +       unsigned int num_group_events; /**< number of events in a group */
> +       TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +       unsigned int initialized; /**< initialization counter */
> +};

Idem, do we need to expose this to the application?


> +
> +/** lcore event group */
> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> +
> +/** PMU state container */
> +extern struct rte_pmu rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +       uint64_t width, offset;
> +       uint32_t seq, index;
> +       int64_t pmc;
> +
> +       for (;;) {
> +               seq = pc->lock;
> +               rte_compiler_barrier();
> +               index = pc->index;
> +               offset = pc->offset;
> +               width = pc->pmc_width;
> +
> +               /* index set to 0 means that particular counter cannot be used */
> +               if (likely(pc->cap_user_rdpmc && index)) {
> +                       pmc = rte_pmu_pmc_read(index - 1);
> +                       pmc <<= 64 - width;
> +                       pmc >>= 64 - width;
> +                       offset += pmc;
> +               }
> +
> +               rte_compiler_barrier();
> +
> +               if (likely(pc->lock == seq))
> +                       return offset;
> +       }
> +
> +       return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events on the calling lcore.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal

Unless I missed something, this symbol is called from rte_pmu_read()
so this makes rte_pmu_read() itself internal.
So external applications won't be able to use the PMU API.

This can probably be confirmed by adding some call to the PMU API in
an examples/.


> +int
> +rte_pmu_enable_group(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Initialize PMU library.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_init(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
> + */
> +__rte_experimental
> +void
> +rte_pmu_fini(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(unsigned int index)
> +{
> +       struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> +       int ret;
> +
> +       if (unlikely(!rte_pmu.initialized))
> +               return 0;
> +
> +       if (unlikely(!group->enabled)) {
> +               ret = rte_pmu_enable_group();
> +               if (ret)
> +                       return 0;
> +       }
> +
> +       if (unlikely(index >= rte_pmu.num_group_events))
> +               return 0;
> +
> +       return rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
> new file mode 100644
> index 0000000000..50fb0f354e
> --- /dev/null
> +++ b/lib/pmu/version.map
> @@ -0,0 +1,20 @@
> +DPDK_23 {
> +       local: *;
> +};
> +
> +EXPERIMENTAL {
> +       global:
> +
> +       per_lcore__event_group;
> +       rte_pmu;
> +       rte_pmu_add_event;
> +       rte_pmu_fini;
> +       rte_pmu_init;
> +       rte_pmu_read;
> +};
> +
> +INTERNAL {
> +       global:
> +
> +       rte_pmu_enable_group;
> +};
> --
> 2.34.1
>


-- 
David Marchand



* RE: [EXT] Re: [PATCH v9 1/4] lib: add generic support for reading PMU events
  2023-02-06 11:02                   ` David Marchand
@ 2023-02-09 11:09                     ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-09 11:09 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Thomas Monjalon, roretzla, Ruifeng.Wang, bruce.richardson,
	Jerin Jacob Kollanukkaran, mattias.ronnblom, mb, zhoumin

Hi David, 

Thanks for the review. Comments inline.

>-----Original Message-----
>From: David Marchand <david.marchand@redhat.com>
>Sent: Monday, February 6, 2023 12:03 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>; roretzla@linux.microsoft.com;
>Ruifeng.Wang@arm.com; bruce.richardson@intel.com; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>mattias.ronnblom@ericsson.com; mb@smartsharesystems.com; zhoumin@loongson.cn
>Subject: [EXT] Re: [PATCH v9 1/4] lib: add generic support for reading PMU events
>
>Hello,
>
>On Thu, Feb 2, 2023 at 1:50 PM Tomasz Duszynski <tduszynski@marvell.com> wrote:
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>
>For my understanding, what OS capability/permission are required to use this library?
>

On x86, self-monitoring only needs a kernel built with perf events enabled and the
/proc/sys/kernel/perf_event_paranoid knob set to 2, which should be the default
value anyway unless some script has changed it.

On ARM64 you additionally need to set /proc/sys/kernel/perf_user_access to bypass
the kernel when accessing hardware counters.
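
A quick way to check both knobs from C, as a minimal sketch only. The helpers
read_knob() and report_knob() are made up for illustration and are not part of
this series:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of the series: read a single integer
 * from a procfs/sysfs file. Returns 0 on success, negative errno otherwise.
 */
static int
read_knob(const char *path, long *out)
{
	FILE *fp;
	int ret = 0;

	assert(path != NULL && out != NULL);

	fp = fopen(path, "r");
	if (fp == NULL)
		return -errno;

	if (fscanf(fp, "%ld", out) != 1)
		ret = -ENODATA;

	fclose(fp);

	return ret;
}

/* Print the current value of a knob, or the reason it could not be read. */
static void
report_knob(const char *path)
{
	long val;
	int ret = read_knob(path, &val);

	if (ret == 0)
		printf("%s = %ld\n", path, val);
	else
		printf("%s: unreadable (error %d)\n", path, -ret);
}
```

One would call report_knob("/proc/sys/kernel/perf_event_paranoid") and, on ARM64,
report_knob("/proc/sys/kernel/perf_user_access") before calling rte_pmu_init().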

>
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>>  MAINTAINERS                            |   5 +
>>  app/test/meson.build                   |   1 +
>>  app/test/test_pmu.c                    |  55 +++
>>  doc/api/doxy-api-index.md              |   3 +-
>>  doc/api/doxy-api.conf.in               |   1 +
>>  doc/guides/prog_guide/profile_app.rst  |   8 +
>>  doc/guides/rel_notes/release_23_03.rst |   9 +
>>  lib/meson.build                        |   1 +
>>  lib/pmu/meson.build                    |  13 +
>>  lib/pmu/pmu_private.h                  |  29 ++
>>  lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
>>  lib/pmu/rte_pmu.h                      | 205 +++++++++++
>>  lib/pmu/version.map                    |  20 ++
>>  13 files changed, 813 insertions(+), 1 deletion(-)  create mode
>> 100644 app/test/test_pmu.c  create mode 100644 lib/pmu/meson.build
>> create mode 100644 lib/pmu/pmu_private.h  create mode 100644
>> lib/pmu/rte_pmu.c  create mode 100644 lib/pmu/rte_pmu.h  create mode
>> 100644 lib/pmu/version.map
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS index 9a0f416d2e..9f13eafd95
>> 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>>  F: lib/node/
>>
>> +PMU - EXPERIMENTAL
>> +M: Tomasz Duszynski <tduszynski@marvell.com>
>> +F: lib/pmu/
>> +F: app/test/test_pmu*
>> +
>>
>>  Test Applications
>>  -----------------
>> diff --git a/app/test/meson.build b/app/test/meson.build index
>> f34d19e3c3..7b6b69dcf1 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -111,6 +111,7 @@ test_sources = files(
>>          'test_reciprocal_division_perf.c',
>>          'test_red.c',
>>          'test_pie.c',
>> +        'test_pmu.c',
>>          'test_reorder.c',
>>          'test_rib.c',
>>          'test_rib6.c',
>
>This code adds a new test.
>This test should be added to an existing testsuite, like fast-tests etc...
>
>
>> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c new file mode
>> 100644 index 0000000000..a9bfb1a427
>> --- /dev/null
>> +++ b/app/test/test_pmu.c
>> @@ -0,0 +1,55 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(C) 2023 Marvell International Ltd.
>> + */
>> +
>> +#include "test.h"
>> +
>> +#ifndef RTE_EXEC_ENV_LINUX
>> +
>> +static int
>> +test_pmu(void)
>> +{
>> +       printf("pmu_autotest only supported on Linux, skipping test\n");
>> +       return TEST_SKIPPED;
>> +}
>> +
>> +#else
>> +
>> +#include <rte_pmu.h>
>> +
>> +static int
>> +test_pmu_read(void)
>> +{
>> +       int tries = 10, event = -1;
>> +       uint64_t val = 0;
>> +
>> +       if (rte_pmu_init() < 0)
>> +               return TEST_FAILED;
>> +
>> +       while (tries--)
>> +               val += rte_pmu_read(event);
>> +
>> +       rte_pmu_fini();
>> +
>> +       return val ? TEST_SUCCESS : TEST_FAILED; }
>> +
>> +static struct unit_test_suite pmu_tests = {
>> +       .suite_name = "pmu autotest",
>> +       .setup = NULL,
>> +       .teardown = NULL,
>> +       .unit_test_cases = {
>> +               TEST_CASE(test_pmu_read),
>> +               TEST_CASES_END()
>> +       }
>> +};
>> +
>> +static int
>> +test_pmu(void)
>> +{
>> +       return unit_test_suite_runner(&pmu_tests);
>> +}
>> +
>> +#endif /* RTE_EXEC_ENV_LINUX */
>> +
>> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index de488c7abf..7f1938f92f 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -222,7 +222,8 @@ The public API headers are grouped by topics:
>>    [log](@ref rte_log.h),
>>    [errno](@ref rte_errno.h),
>>    [trace](@ref rte_trace.h),
>> -  [trace_point](@ref rte_trace_point.h)
>> +  [trace_point](@ref rte_trace_point.h),  [pmu](@ref rte_pmu.h)
>>
>>  - **misc**:
>>    [EAL config](@ref rte_eal.h),
>> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in index
>> f0886c3bd1..920e615996 100644
>> --- a/doc/api/doxy-api.conf.in
>> +++ b/doc/api/doxy-api.conf.in
>> @@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
>>                            @TOPDIR@/lib/pci \
>>                            @TOPDIR@/lib/pdump \
>>                            @TOPDIR@/lib/pipeline \
>> +                          @TOPDIR@/lib/pmu \
>>                            @TOPDIR@/lib/port \
>>                            @TOPDIR@/lib/power \
>>                            @TOPDIR@/lib/rawdev \ diff --git
>> a/doc/guides/prog_guide/profile_app.rst
>> b/doc/guides/prog_guide/profile_app.rst
>> index 14292d4c25..a8b501fe0c 100644
>> --- a/doc/guides/prog_guide/profile_app.rst
>> +++ b/doc/guides/prog_guide/profile_app.rst
>> @@ -7,6 +7,14 @@ Profile Your Application  The following sections
>> describe methods of profiling DPDK applications on  different
>> architectures.
>>
>> +Performance counter based profiling
>> +-----------------------------------
>> +
>> +Majority of architectures support some sort hardware measurement unit
>> +which provides a set of programmable counters that monitor specific
>> +events. There are different tools which can gather that information,
>> +perf being an example here. Though in some scenarios, eg. when CPU
>> +cores are isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such
>cases one can read specific events directly from application via ``rte_pmu_read()``.
>
>We need a common definition in the documentation of what PMU stands for and use it consistently.
>I am not sure this documentation is the best place, but at least, I'd prefer we go with
>"Performance Monitoring Unit" and stick to it.

For the time being I think it's good enough. Frankly, I don't have a better idea where to put that.

>
>Plus, this block is a bit hard to read too, what do you think of:
>
>"""
>A majority of architectures support some performance monitoring unit (PMU).
>Such unit provides programmable counters that monitor specific events.
>
>Different tools gather that information, like for example perf.
>However, in some scenarios when CPU cores are isolated (nohz_full) and run dedicated tasks,
>interrupting those tasks with perf may be undesirable.
>In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
>"""
>
>And, a double newline is used between sections in this doc.
>
>

No problem. 

>>
>>  Profiling on x86
>>  ----------------
>> diff --git a/doc/guides/rel_notes/release_23_03.rst
>> b/doc/guides/rel_notes/release_23_03.rst
>> index 73f5d94e14..733541d56c 100644
>> --- a/doc/guides/rel_notes/release_23_03.rst
>> +++ b/doc/guides/rel_notes/release_23_03.rst
>> @@ -55,10 +55,19 @@ New Features
>>       Also, make sure to start the actual text at the margin.
>>       =======================================================
>>
>> +* **Added PMU library.**
>> +
>> +  Added a new PMU (performance measurement unit) library which allows
>> + applications
>
>Performance Monitoring Unit.
>
>> +  to perform self monitoring activities without depending on external utilities like perf.
>
>> +  After integration with :doc:`../prog_guide/trace_lib` data gathered
>> + from hardware counters  can be stored in CTF format for further analysis.
>
>Afaiu, this integration comes later in the series.
>This part of the RN update should go with it.
>
>
>> +
>>  * **Updated AMD axgbe driver.**
>>
>>    * Added multi-process support.
>>
>> +* **Added multi-process support for axgbe PMD.**
>> +
>>  * **Updated Corigine nfp driver.**
>>
>
>Unrelated rebase damage.. please pay attention to such detail.
>

Thanks, that crept in somehow.

>
>>    * Added support for meter options.
>> diff --git a/lib/meson.build b/lib/meson.build index
>> a90fee31b7..7132131b5c 100644
>> --- a/lib/meson.build
>> +++ b/lib/meson.build
>> @@ -11,6 +11,7 @@
>>  libraries = [
>>          'kvargs', # eal depends on kvargs
>>          'telemetry', # basic info querying
>> +        'pmu',
>>          'eal', # everything depends on eal
>>          'ring',
>>          'rcu', # rcu depends on ring
>> diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build new file mode
>> 100644 index 0000000000..a4160b494e
>> --- /dev/null
>> +++ b/lib/pmu/meson.build
>> @@ -0,0 +1,13 @@
>> +# SPDX-License-Identifier: BSD-3-Clause # Copyright(C) 2023 Marvell
>> +International Ltd.
>> +
>> +if not is_linux
>> +    build = false
>> +    reason = 'only supported on Linux'
>> +    subdir_done()
>> +endif
>> +
>> +includes = [global_inc]
>> +
>> +sources = files('rte_pmu.c')
>> +headers = files('rte_pmu.h')
>> diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h new file
>> mode 100644 index 0000000000..849549b125
>> --- /dev/null
>> +++ b/lib/pmu/pmu_private.h
>> @@ -0,0 +1,29 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2023 Marvell
>> + */
>> +
>> +#ifndef _PMU_PRIVATE_H_
>> +#define _PMU_PRIVATE_H_
>> +
>> +/**
>> + * Architecture specific PMU init callback.
>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>> +int
>> +pmu_arch_init(void);
>> +
>> +/**
>> + * Architecture specific PMU cleanup callback.
>> + */
>> +void
>> +pmu_arch_fini(void);
>> +
>> +/**
>> + * Apply architecture specific settings to config before passing it to syscall.
>
>Please describe config[].
>

Well, the problem here is that each and every arch may expect a different config, so
anyone adding new stuff would need to consult the kernel sources anyway.

What kind of documentation would you like to see here?

>
>> + */
>> +void
>> +pmu_arch_fixup_config(uint64_t config[3]);
>> +
>> +#endif /* _PMU_PRIVATE_H_ */
>> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c new file mode
>> 100644 index 0000000000..4cf3161155
>> --- /dev/null
>> +++ b/lib/pmu/rte_pmu.c
>> @@ -0,0 +1,464 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(C) 2023 Marvell International Ltd.
>> + */
>> +
>> +#include <ctype.h>
>> +#include <dirent.h>
>> +#include <errno.h>
>> +#include <regex.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <sys/queue.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>
>Asking to be sure because I did not check:
>do we need all those includes, or is this just copy/pasted from somewhere else?
>

No, it was not copy/pasted. Each header exports something used in this file. I'll
double-check whether all of them are still required.

>
>> +
>> +#include <rte_atomic.h>
>> +#include <rte_per_lcore.h>
>> +#include <rte_pmu.h>
>> +#include <rte_spinlock.h>
>> +#include <rte_tailq.h>
>> +
>> +#include "pmu_private.h"
>> +
>> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
>> +
>> +#ifndef GENMASK_ULL
>
>This macro is copy/pasted all over the dpdk tree...
>This is worth a cleanup later, read: I am not asking for it as part of this series.
>
>However, here, there is no need for protecting against its definition.
>

Okay. 
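
For reference, a small sketch of what the two helpers compute. The macros are
copied from the patch; encode_term() and the sample field layout (bits 8-15) are
assumptions made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))

/*
 * Hypothetical helper: place 'val' into bits [high:low] of a config word,
 * i.e. what a format entry such as "config:8-15" asks for.
 */
static uint64_t
encode_term(int high, int low, uint64_t val)
{
	uint64_t mask;

	assert(low >= 0 && low <= high && high < 64);

	/* All ones in bits low..high, zeros elsewhere. */
	mask = GENMASK_ULL(high, low);

	/* Shift val to the lowest set bit of mask and drop any excess bits. */
	return FIELD_PREP(mask, val);
}
```

With these definitions GENMASK_ULL(15, 8) is 0xff00 and encode_term(15, 8, 0x3c)
is 0x3c00. A value wider than the field is truncated by the final mask.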

>
>> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >>
>> +((64 - 1 - (h))))) #endif
>> +
>> +#ifndef FIELD_PREP
>> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1))
>> +& (m)) #endif
>
>Idem.
>
>
>> +
>> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> +struct rte_pmu rte_pmu;
>> +
>> +/*
>> + * Following __rte_weak functions provide default no-op.
>> +Architectures should override them if
>> + * necessary.
>> + */
>
>Prefer using per architectures #ifdef.
>It is easier to get a simple link error than use weak symbols that make it look like it could work
>on some arch.
>

The rationale for that was actually to not break anything. That means you can compile that
code for every arch, except that on unsupported ones you'll see a stream of zeros inside a trace file.
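
To make the trade-off concrete, a minimal sketch of the weak-default approach.
The _sketch names are hypothetical and GCC/clang attribute syntax is assumed; in
the patch this is spelled __rte_weak:

```c
#include <stdint.h>

/*
 * Weak default: picked by the linker only when no architecture file
 * provides a strong definition of the same symbol.
 */
__attribute__((weak)) int
pmu_arch_init_sketch(void)
{
	return 0;
}

/* Fallback counter read for architectures without support: always 0. */
static inline uint64_t
pmu_pmc_read_sketch(unsigned int index)
{
	(void)index;

	return 0;
}
```

The build succeeds everywhere and unsupported architectures just report zeros.
With a per-architecture #ifdef the same situation would end in a link error.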

>
>> +
>> +int
>> +__rte_weak pmu_arch_init(void)
>> +{
>> +       return 0;
>
>Add a debug log message indicating that this arch does not support PMU.
>

Prove me wrong, but logs are part of eal and, unless they are moved to a separate
library, this lib shouldn't call these APIs. Otherwise we'll introduce a dependency
on eal, which we wanted to avoid in the first place.

>
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fini(void)
>> +{
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3]) { }
>> +
>> +static int
>> +get_term_format(const char *name, int *num, uint64_t *mask) {
>> +       char *config = NULL;
>> +       char path[PATH_MAX];
>> +       int high, low, ret;
>> +       FILE *fp;
>
>Reverse xmas tree when possible.
>

Okay. 

>
>> +
>> +       /* quiesce -Wmaybe-uninitialized warning */
>
>This comment just seems to be a note for yourself.
>What was the issue exactly?
>

Generally speaking, the compiler thinks that the caller of this function may use
'num' even though the callee returned an error.

[1/198] Compiling C object lib/librte_pmu.a.p/pmu_rte_pmu.c.o
In function 'parse_event',
    inlined from 'get_event_config' at ../lib/pmu/rte_pmu.c:157:9:
../lib/pmu/rte_pmu.c:129:23: warning: 'num' may be used uninitialized [-Wmaybe-uninitialized]
  129 |                 config[num] |= FIELD_PREP(mask, val);
      |                 ~~~~~~^~~~~
../lib/pmu/rte_pmu.c: In function 'get_event_config':
../lib/pmu/rte_pmu.c:107:13: note: 'num' was declared here
  107 |         int num, ret, val;
      |

>
>> +       *num = 0;
>> +       *mask = 0;
>> +
>> +       snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name,
>name);
>> +       fp = fopen(path, "r");
>> +       if (fp == NULL)
>> +               return -errno;
>> +
>> +       errno = 0;
>> +       ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
>> +       if (ret < 2) {
>> +               ret = -ENODATA;
>> +               goto out;
>> +       }
>> +       if (errno) {
>> +               ret = -errno;
>> +               goto out;
>> +       }
>> +
>> +       if (ret == 2)
>> +               high = low;
>> +
>> +       *mask = GENMASK_ULL(high, low);
>> +       /* Last digit should be [012]. If last digit is missing 0 is implied. */
>> +       *num = config[strlen(config) - 1];
>> +       *num = isdigit(*num) ? *num - '0' : 0;
>> +
>> +       ret = 0;
>> +out:
>> +       free(config);
>> +       fclose(fp);
>> +
>> +       return ret;
>> +}
>> +
>> +static int
>> +parse_event(char *buf, uint64_t config[3]) {
>> +       char *token, *term;
>> +       int num, ret, val;
>> +       uint64_t mask;
>> +
>> +       config[0] = config[1] = config[2] = 0;
>> +
>> +       token = strtok(buf, ",");
>> +       while (token) {
>> +               errno = 0;
>> +               /* <term>=<value> */
>> +               ret = sscanf(token, "%m[^=]=%i", &term, &val);
>> +               if (ret < 1)
>> +                       return -ENODATA;
>> +               if (errno)
>> +                       return -errno;
>> +               if (ret == 1)
>> +                       val = 1;
>> +
>> +               ret = get_term_format(term, &num, &mask);
>> +               free(term);
>> +               if (ret)
>> +                       return ret;
>> +
>> +               config[num] |= FIELD_PREP(mask, val);
>> +               token = strtok(NULL, ",");
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +static int
>> +get_event_config(const char *name, uint64_t config[3]) {
>> +       char path[PATH_MAX], buf[BUFSIZ];
>> +       FILE *fp;
>> +       int ret;
>> +
>> +       snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name,
>name);
>> +       fp = fopen(path, "r");
>> +       if (fp == NULL)
>> +               return -errno;
>> +
>> +       ret = fread(buf, 1, sizeof(buf), fp);
>> +       if (ret == 0) {
>> +               fclose(fp);
>> +
>> +               return -EINVAL;
>> +       }
>> +       fclose(fp);
>> +       buf[ret] = '\0';
>> +
>> +       return parse_event(buf, config); }
>> +
>> +static int
>> +do_perf_event_open(uint64_t config[3], int group_fd) {
>> +       struct perf_event_attr attr = {
>> +               .size = sizeof(struct perf_event_attr),
>> +               .type = PERF_TYPE_RAW,
>> +               .exclude_kernel = 1,
>> +               .exclude_hv = 1,
>> +               .disabled = 1,
>> +       };
>> +
>> +       pmu_arch_fixup_config(config);
>> +
>> +       attr.config = config[0];
>> +       attr.config1 = config[1];
>> +       attr.config2 = config[2];
>> +
>> +       return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd,
>> +0); }
>> +
>> +static int
>> +open_events(struct rte_pmu_event_group *group) {
>> +       struct rte_pmu_event *event;
>> +       uint64_t config[3];
>> +       int num = 0, ret;
>> +
>> +       /* group leader gets created first, with fd = -1 */
>> +       group->fds[0] = -1;
>> +
>> +       TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> +               ret = get_event_config(event->name, config);
>> +               if (ret)
>> +                       continue;
>> +
>> +               ret = do_perf_event_open(config, group->fds[0]);
>> +               if (ret == -1) {
>> +                       ret = -errno;
>> +                       goto out;
>> +               }
>> +
>> +               group->fds[event->index] = ret;
>> +               num++;
>> +       }
>> +
>> +       return 0;
>> +out:
>> +       for (--num; num >= 0; num--) {
>> +               close(group->fds[num]);
>> +               group->fds[num] = -1;
>> +       }
>> +
>> +
>> +       return ret;
>> +}
>> +
>> +static int
>> +mmap_events(struct rte_pmu_event_group *group) {
>> +       long page_size = sysconf(_SC_PAGE_SIZE);
>> +       unsigned int i;
>> +       void *addr;
>> +       int ret;
>> +
>> +       for (i = 0; i < rte_pmu.num_group_events; i++) {
>> +               addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
>> +               if (addr == MAP_FAILED) {
>> +                       ret = -errno;
>> +                       goto out;
>> +               }
>> +
>> +               group->mmap_pages[i] = addr;
>> +       }
>> +
>> +       return 0;
>> +out:
>> +       for (; i; i--) {
>> +               munmap(group->mmap_pages[i - 1], page_size);
>> +               group->mmap_pages[i - 1] = NULL;
>> +       }
>> +
>> +       return ret;
>> +}
>> +
>> +static void
>> +cleanup_events(struct rte_pmu_event_group *group) {
>> +       unsigned int i;
>> +
>> +       if (group->fds[0] != -1)
>> +               ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE,
>> + PERF_IOC_FLAG_GROUP);
>> +
>> +       for (i = 0; i < rte_pmu.num_group_events; i++) {
>> +               if (group->mmap_pages[i]) {
>> +                       munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
>> +                       group->mmap_pages[i] = NULL;
>> +               }
>> +
>> +               if (group->fds[i] != -1) {
>> +                       close(group->fds[i]);
>> +                       group->fds[i] = -1;
>> +               }
>> +       }
>> +
>> +       group->enabled = false;
>> +}
>> +
>> +int __rte_noinline
>
>This symbol is exported out of this library, no need for noinline.
>

I recall that it was added deliberately because this function was actually being inlined.
But given that the code has changed a bit over multiple revisions, this may no longer be necessary.

>
>> +rte_pmu_enable_group(void)
>> +{
>> +       struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> +       int ret;
>> +
>> +       if (rte_pmu.num_group_events == 0)
>> +               return -ENODEV;
>> +
>> +       ret = open_events(group);
>> +       if (ret)
>> +               goto out;
>> +
>> +       ret = mmap_events(group);
>> +       if (ret)
>> +               goto out;
>> +
>> +       if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
>> +               ret = -errno;
>> +               goto out;
>> +       }
>> +
>> +       if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>> +               ret = -errno;
>> +               goto out;
>> +       }
>> +
>> +       rte_spinlock_lock(&rte_pmu.lock);
>> +       TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
>> +       rte_spinlock_unlock(&rte_pmu.lock);
>> +       group->enabled = true;
>> +
>> +       return 0;
>> +
>> +out:
>> +       cleanup_events(group);
>> +
>> +       return ret;
>> +}
>> +
>> +static int
>> +scan_pmus(void)
>> +{
>> +       char path[PATH_MAX];
>> +       struct dirent *dent;
>> +       const char *name;
>> +       DIR *dirp;
>> +
>> +       dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
>> +       if (dirp == NULL)
>> +               return -errno;
>> +
>> +       while ((dent = readdir(dirp))) {
>> +               name = dent->d_name;
>> +               if (name[0] == '.')
>> +                       continue;
>> +
>> +               /* sysfs entry should either contain cpus or be a cpu */
>> +               if (!strcmp(name, "cpu"))
>> +                       break;
>> +
>> +               snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
>> +               if (access(path, F_OK) == 0)
>> +                       break;
>> +       }
>> +
>> +       if (dent) {
>> +               rte_pmu.name = strdup(name);
>> +               if (rte_pmu.name == NULL) {
>> +                       closedir(dirp);
>> +
>> +                       return -ENOMEM;
>> +               }
>> +       }
>> +
>> +       closedir(dirp);
>> +
>> +       return rte_pmu.name ? 0 : -ENODEV;
>> +}
>> +
>> +static struct rte_pmu_event *
>> +new_event(const char *name)
>> +{
>> +       struct rte_pmu_event *event;
>> +
>> +       event = calloc(1, sizeof(*event));
>> +       if (event == NULL)
>> +               goto out;
>> +
>> +       event->name = strdup(name);
>> +       if (event->name == NULL) {
>> +               free(event);
>> +               event = NULL;
>> +       }
>> +
>> +out:
>> +       return event;
>> +}
>> +
>> +static void
>> +free_event(struct rte_pmu_event *event)
>> +{
>> +       free(event->name);
>> +       free(event);
>> +}
>> +
>> +int
>> +rte_pmu_add_event(const char *name)
>> +{
>> +       struct rte_pmu_event *event;
>> +       char path[PATH_MAX];
>> +
>> +       if (rte_pmu.name == NULL)
>> +               return -ENODEV;
>> +
>> +       if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
>> +               return -ENOSPC;
>> +
>> +       snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> +       if (access(path, R_OK))
>> +               return -ENODEV;
>> +
>> +       TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> +               if (!strcmp(event->name, name))
>> +                       return event->index;
>> +               continue;
>> +       }
>> +
>> +       event = new_event(name);
>> +       if (event == NULL)
>> +               return -ENOMEM;
>> +
>> +       event->index = rte_pmu.num_group_events++;
>> +       TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
>> +
>> +       return event->index;
>> +}
>> +
>> +int
>> +rte_pmu_init(void)
>> +{
>> +       int ret;
>> +
>> +       /* Allow calling init from multiple contexts within a single thread. This simplifies
>> +        * resource management a bit e.g in case fast-path tracepoint has already been enabled
>> +        * via command line but application doesn't care enough and performs init/fini again.
>> +        */
>> +       if (rte_pmu.initialized) {
>
>This is an integer so check against 0 explicitly (there may be other cases in this patch, I did not
>recheck the whole patch).
>

Okay. 

>
>> +               rte_pmu.initialized++;
>> +               return 0;
>> +       }
>> +
>> +       ret = scan_pmus();
>> +       if (ret)
>> +               goto out;
>> +
>> +       ret = pmu_arch_init();
>> +       if (ret)
>> +               goto out;
>> +
>> +       TAILQ_INIT(&rte_pmu.event_list);
>> +       TAILQ_INIT(&rte_pmu.event_group_list);
>> +       rte_spinlock_init(&rte_pmu.lock);
>> +       rte_pmu.initialized = 1;
>> +
>> +       return 0;
>> +out:
>> +       free(rte_pmu.name);
>> +       rte_pmu.name = NULL;
>> +
>> +       return ret;
>> +}
>> +
>> +void
>> +rte_pmu_fini(void)
>> +{
>> +       struct rte_pmu_event_group *group, *tmp_group;
>> +       struct rte_pmu_event *event, *tmp_event;
>> +
>> +       /* cleanup once init count drops to zero */
>> +       if (!rte_pmu.initialized || --rte_pmu.initialized)
>> +               return;
>> +
>> +       RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
>> +               TAILQ_REMOVE(&rte_pmu.event_list, event, next);
>> +               free_event(event);
>> +       }
>> +
>> +       RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
>> +               TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
>> +               cleanup_events(group);
>> +       }
>> +
>> +       pmu_arch_fini();
>> +       free(rte_pmu.name);
>> +       rte_pmu.name = NULL;
>> +       rte_pmu.num_group_events = 0;
>> +}
>> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>> new file mode 100644
>> index 0000000000..e360375a0c
>> --- /dev/null
>> +++ b/lib/pmu/rte_pmu.h
>> @@ -0,0 +1,205 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2023 Marvell
>> + */
>> +
>> +#ifndef _RTE_PMU_H_
>> +#define _RTE_PMU_H_
>> +
>> +/**
>> + * @file
>> + *
>> + * PMU event tracing operations
>> + *
>> + * This file defines generic API and types necessary to setup PMU and
>> + * read selected counters in runtime.
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <linux/perf_event.h>
>> +
>> +#include <rte_atomic.h>
>> +#include <rte_branch_prediction.h>
>> +#include <rte_common.h>
>> +#include <rte_compat.h>
>> +#include <rte_spinlock.h>
>> +
>> +/** Maximum number of events in a group */
>> +#define MAX_NUM_GROUP_EVENTS 8
>> +
>> +/**
>> + * A structure describing a group of events.
>> + */
>> +struct rte_pmu_event_group {
>> +       struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>> +       int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> +       bool enabled; /**< true if group was enabled on particular lcore */
>> +       TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>> +} __rte_cache_aligned;
>
>One problem for the future is that we have a fixed size fd array.

This number can be increased if needed, but the rationale was that PMUs come with
a relatively small number of hw counters. Here events are grouped together
(i.e. scheduled together), which means there must be enough hw counters for all
events for things to work.

>Do we need to expose this whole structure to the application?
>

Probably some part of it could have been hidden but all those structures are so small
that presumably such partitioning would not bring much to the table.

>
>> +
>> +/**
>> + * A structure describing an event.
>> + */
>> +struct rte_pmu_event {
>> +       char *name; /**< name of an event */
>> +       unsigned int index; /**< event index into fds/mmap_pages */
>
>This is an internal consideration.
>Do we need to expose this to the application?
>
>
>> +       TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
>> +};
>> +
>> +/**
>> + * A PMU state container.
>> + */
>> +struct rte_pmu {
>> +       char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>> +       rte_spinlock_t lock; /**< serialize access to event group list */
>> +       TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>> +       unsigned int num_group_events; /**< number of events in a group */
>> +       TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>> +       unsigned int initialized; /**< initialization counter */
>> +};
>
>Idem, do we need to expose this to the application?
>
>
>> +
>> +/** lcore event group */
>> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> +
>> +/** PMU state container */
>> +extern struct rte_pmu rte_pmu;
>> +
>> +/** Each architecture supporting PMU needs to provide its own version */
>> +#ifndef rte_pmu_pmc_read
>> +#define rte_pmu_pmc_read(index) ({ 0; })
>> +#endif
>> +
>> +/**
>> + * @internal
>> + *
>> + * Read PMU counter.
>> + *
>> + * @param pc
>> + *   Pointer to the mmapped user page.
>> + * @return
>> + *   Counter value read from hardware.
>> + */
>> +__rte_internal
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> +{
>> +       uint64_t width, offset;
>> +       uint32_t seq, index;
>> +       int64_t pmc;
>> +
>> +       for (;;) {
>> +               seq = pc->lock;
>> +               rte_compiler_barrier();
>> +               index = pc->index;
>> +               offset = pc->offset;
>> +               width = pc->pmc_width;
>> +
>> +               /* index set to 0 means that particular counter cannot be used */
>> +               if (likely(pc->cap_user_rdpmc && index)) {
>> +                       pmc = rte_pmu_pmc_read(index - 1);
>> +                       pmc <<= 64 - width;
>> +                       pmc >>= 64 - width;
>> +                       offset += pmc;
>> +               }
>> +
>> +               rte_compiler_barrier();
>> +
>> +               if (likely(pc->lock == seq))
>> +                       return offset;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +/**
>> + * @internal
>> + *
>> + * Enable group of events on the calling lcore.
>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>> +__rte_internal
>
>Unless I missed something, this symbol is called from rte_pmu_read() so this makes rte_pmu_read()
>itself internal.
>So external applications won't be able to use the PMU API.
>
>This can probably be confirmed by adding some call to the PMU API in an examples/.
>

Good point actually. This was not that obvious when I looked at the patch introducing that change.
So in this case it needs to be exported, but given that an app should not call it directly, maybe I'll
just make the intent clear by renaming it, perhaps to __rte_pmu_enable_group() or something
similar.

>
>> +int
>> +rte_pmu_enable_group(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Initialize PMU library.
>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>> +__rte_experimental
>> +int
>> +rte_pmu_init(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>> + */
>> +__rte_experimental
>> +void
>> +rte_pmu_fini(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Add event to the group of enabled events.
>> + *
>> + * @param name
>> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>> + * @return
>> + *   Event index in case of success, negative value otherwise.
>> + */
>> +__rte_experimental
>> +int
>> +rte_pmu_add_event(const char *name);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Read hardware counter configured to count occurrences of an event.
>> + *
>> + * @param index
>> + *   Index of an event to be read.
>> + * @return
>> + *   Event value read from register. In case of errors or lack of support
>> + *   0 is returned. In other words, stream of zeros in a trace file
>> + *   indicates problem with reading particular PMU event register.
>> + */
>> +__rte_experimental
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read(unsigned int index)
>> +{
>> +       struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> +       int ret;
>> +
>> +       if (unlikely(!rte_pmu.initialized))
>> +               return 0;
>> +
>> +       if (unlikely(!group->enabled)) {
>> +               ret = rte_pmu_enable_group();
>> +               if (ret)
>> +                       return 0;
>> +       }
>> +
>> +       if (unlikely(index >= rte_pmu.num_group_events))
>> +               return 0;
>> +
>> +       return rte_pmu_read_userpage(group->mmap_pages[index]);
>> +}
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_PMU_H_ */
>> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
>> new file mode 100644
>> index 0000000000..50fb0f354e
>> --- /dev/null
>> +++ b/lib/pmu/version.map
>> @@ -0,0 +1,20 @@
>> +DPDK_23 {
>> +       local: *;
>> +};
>> +
>> +EXPERIMENTAL {
>> +       global:
>> +
>> +       per_lcore__event_group;
>> +       rte_pmu;
>> +       rte_pmu_add_event;
>> +       rte_pmu_fini;
>> +       rte_pmu_init;
>> +       rte_pmu_read;
>> +};
>> +
>> +INTERNAL {
>> +       global:
>> +
>> +       rte_pmu_enable_group;
>> +};
>> --
>> 2.34.1
>>
>
>
>--
>David Marchand



* [PATCH v10 0/4] add support for self monitoring
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
                                   ` (3 preceding siblings ...)
  2023-02-02 12:49                 ` [PATCH v9 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-13 11:31                 ` Tomasz Duszynski
  2023-02-13 11:31                   ` [PATCH v10 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                                     ` (4 more replies)
  4 siblings, 5 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-13 11:31 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, david.marchand,
	Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs on
isolated cores running dedicated tasks.

Events can be read directly using rte_pmu_read() or using dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
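
As an illustration of the name lookup described above, here is a minimal
sketch (not part of this series) of how an event name maps onto its sysfs
path. The helper names pmu_event_path() and pmu_event_exists() are made up
for the example, and the PMU name is passed in by the caller as an
assumption; the library itself discovers that name by scanning the devices
directory.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"

/*
 * Hypothetical helper: build "<devices>/<pmu>/events/<event>" in buf.
 * Returns 0 on success, -EINVAL for empty names and -ENAMETOOLONG
 * if buf is too small to hold the whole path.
 */
static int
pmu_event_path(const char *pmu, const char *event, char *buf, size_t len)
{
	int n;

	assert(pmu != NULL && event != NULL && buf != NULL);

	if (strlen(pmu) == 0 || strlen(event) == 0)
		return -EINVAL;

	n = snprintf(buf, len, EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu, event);
	if (n < 0 || (size_t)n >= len)
		return -ENAMETOOLONG;

	return 0;
}

/*
 * Hypothetical helper: returns 0 if the named event can be read under
 * the given PMU, a negative errno-style value otherwise.
 */
static int
pmu_event_exists(const char *pmu, const char *event)
{
	char path[4096];
	int ret;

	ret = pmu_event_path(pmu, event, path, sizeof(path));
	if (ret)
		return ret;

	return access(path, R_OK) == 0 ? 0 : -ENODEV;
}
```

A caller would pass the core PMU directory name (for example "cpu" on many
x86 systems, which is an assumption about the host) together with one of the
file names found in its events directory, and treat a non-zero return as
"event not available".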

v10:
- check permissions before using counters
- do not use internal symbols in exported functions
- address review comments
v9:
- fix 'maybe-uninitialized' warning reported by CI
v8:
- just rebase series
v7:
- use per-lcore event group instead of global table indexed by lcore-id
- don't add pmu_autotest to fast tests due to lack of support on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra
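
The v5 entry above mentions a sign-extension fix. As a rough illustration of
the operation involved (a sketch under assumptions, not the code from this
series), a raw counter value that is only `width` bits wide has to be widened
to 64 bits before it is added to the offset from the user page. The code
quoted earlier in the thread does this with a pair of shifts by 64 - width;
the hypothetical helper below, sign_extend_counter(), reaches the same result
without shifting signed values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: interpret the low `width` bits of raw as a
 * two's complement number and widen it to 64 bits.
 */
static int64_t
sign_extend_counter(uint64_t raw, unsigned int width)
{
	uint64_t sign_bit, mask;

	assert(width >= 1 && width <= 64);

	/* Nothing to widen when the counter already uses all 64 bits. */
	if (width == 64)
		return (int64_t)raw;

	mask = (UINT64_C(1) << width) - 1;
	sign_bit = UINT64_C(1) << (width - 1);
	raw &= mask;

	/* Flip the sign bit, then subtract it back out as a signed value. */
	return (int64_t)(raw ^ sign_bit) - (int64_t)sign_bit;
}
```

With a 48-bit counter, for instance, a raw value with all 48 bits set comes
out as -1 rather than a large positive number, so adding it to the offset
behaves like a subtraction of one.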

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   2 +
 app/test/test_pmu.c                      |  68 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  17 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   7 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   1 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  32 ++
 lib/pmu/rte_pmu.c                        | 521 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 232 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  16 +
 23 files changed, 1149 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1



* [PATCH v10 1/4] lib: add generic support for reading PMU events
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-02-13 11:31                   ` Tomasz Duszynski
  2023-02-16  7:39                     ` Ruifeng Wang
  2023-02-13 11:31                   ` [PATCH v10 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-13 11:31 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin, david.marchand

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated,
i.e. run dedicated tasks. In such cases one cannot use the standard
perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   2 +
 app/test/test_pmu.c                    |  62 ++++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |  12 +
 doc/guides/rel_notes/release_23_03.rst |   7 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  32 ++
 lib/pmu/rte_pmu.c                      | 460 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 212 ++++++++++++
 lib/pmu/version.map                    |  15 +
 13 files changed, 824 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 3495946d0f..d37f242120 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/

+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+

 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..6b61b7fc32 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
@@ -239,6 +240,7 @@ fast_tests = [
         ['kni_autotest', false, true],
         ['kvargs_autotest', true, true],
         ['member_autotest', true, true],
+        ['pmu_autotest', true, true],
         ['power_cpufreq_autotest', false, true],
         ['power_autotest', true, true],
         ['power_kvm_vm_autotest', false, true],
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..a64564b5f5
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	const char *name = NULL;
+	int tries = 10, event;
+	uint64_t val = 0;
+
+	if (name == NULL) {
+		printf("PMU not supported on this arch\n");
+		return TEST_SKIPPED;
+	}
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	event = rte_pmu_add_event(name);
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 2deec7ea19..a8e04a195d 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -223,7 +223,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)

 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index e859426099..350b5a8c94 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..89e38cd301 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,18 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.

+Performance counter based profiling
+-----------------------------------
+
+Majority of architectures support some performance monitoring unit (PMU).
+Such unit provides programmable counters that monitor specific events.
+
+Different tools gather that information, like for example perf.
+However, in some scenarios when CPU cores are isolated and run
+dedicated tasks interrupting those tasks with perf may be undesirable.
+
+In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
+

 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index ab998a5357..20622efe58 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -147,6 +147,13 @@ New Features
   * Added support to capture packets at each graph node with packet metadata and
     node name.

+* **Added PMU library.**
+
+  Added a new performance monitoring unit (PMU) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+

 Removed Items
 -------------
diff --git a/lib/meson.build b/lib/meson.build
index 450c061d2b..8a42d45d20 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..b9f8c1ddc8
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ *
+ * @param config
+ *   Architecture specific event configuration. Consult kernel sources for available options.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..950f999cb7
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,460 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char path[PATH_MAX];
+	char *config = NULL;
+	int high, low, ret;
+	FILE *fp;
+
+	*num = *mask = 0;
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf), fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+		if (!group->mmap_pages[i]->cap_user_rdpmc) {
+			ret = -EPERM;
+			goto out;
+		}
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int
+__rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized != 0) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..6b664c3336
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read PMU counter.
+ *
+ * @warning This should not be called directly.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+static __rte_always_inline uint64_t
+__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @warning This should not be called directly.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+__rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @warning This should not be called directly.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, a stream of zeros in a trace file
+ *   indicates a problem with reading a particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = __rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return __rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..39a4f279c1
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,15 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	__rte_pmu_enable_group;
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
--
2.34.1



* [PATCH v10 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-13 11:31                   ` [PATCH v10 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-13 11:31                   ` Tomasz Duszynski
  2023-02-16  7:41                     ` Ruifeng Wang
  2023-02-13 11:31                   ` [PATCH v10 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-13 11:31 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, david.marchand

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index a64564b5f5..7550223dc0 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -24,6 +24,10 @@ test_pmu_read(void)
 	int tries = 10, event;
 	uint64_t val = 0;
 
+#if defined(RTE_ARCH_ARM64)
+	name = "cpu_cycles";
+#endif
+
 	if (name == NULL) {
 		printf("PMU not supported on this arch\n");
 		return TEST_SKIPPED;
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
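+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		ret = -errno; /* save errno before close() can clobber it */
+		close(fd);
+		return ret;
+	}
+	buf[ret] = '\0'; /* strtol() needs a NUL-terminated string */
+	*val = strtol(buf, NULL, 10);
+	close(fd);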
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		ret = -errno; /* save errno before close() can clobber it */
+		close(fd);
+		return ret;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 6b664c3336..bcc8e3b22d 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n" /* operand 1 is the input index */
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1



* [PATCH v10 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-13 11:31                   ` [PATCH v10 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-13 11:31                   ` [PATCH v10 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-13 11:31                   ` Tomasz Duszynski
  2023-02-13 11:31                   ` [PATCH v10 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-13 11:31 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, david.marchand

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 7550223dc0..cf6de85d88 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	name = "cpu_cycles";
+#elif defined(RTE_ARCH_X86_64)
+	name = "cpu-cycles";
 #endif
 
 	if (name == NULL) {
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index bcc8e3b22d..b1d1c17bc5 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1



* [PATCH v10 4/4] eal: add PMU support to tracing library
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
                                     ` (2 preceding siblings ...)
  2023-02-13 11:31                   ` [PATCH v10 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-13 11:31                   ` Tomasz Duszynski
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-13 11:31 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, mattias.ronnblom, mb,
	thomas, zhoumin, david.marchand

In order to profile an app one needs to store a significant number of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  1 +
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 157 insertions(+), 1 deletion(-)
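
Note for reviewers, for illustration only and not part of the patch: below is
a standalone sketch of how an e=ev1[,ev2,...] expression can be pulled out of
a --trace argument with POSIX regex, along the lines of what
rte_pmu_add_events_by_pattern() does. The helper name extract_events(), the
MAX_EVENTS/MAX_NAME limits and the fixed-size output array are assumptions made
for this example; the patch itself passes each name to rte_pmu_add_event().

```c
#include <regex.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENTS 8	/* assumed limit for this sketch */
#define MAX_NAME 64	/* assumed limit for this sketch */

/* Hypothetical helper: collect event names from every e=... occurrence.
 * Returns the number of names stored in out, or -1 if the regex is invalid.
 */
static int
extract_events(const char *pattern, char out[MAX_EVENTS][MAX_NAME])
{
	regmatch_t m;
	regex_t reg;
	char buf[256];
	int count = 0;

	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
		return -1;

	while (count < MAX_EVENTS && regexec(&reg, pattern, 1, &m, 0) == 0) {
		/* length of the match without the leading "e=" */
		size_t len = (size_t)(m.rm_eo - m.rm_so) - 2;
		char *tok;

		if (len >= sizeof(buf))
			len = sizeof(buf) - 1;
		memcpy(buf, pattern + m.rm_so + 2, len);
		buf[len] = '\0';

		/* split the comma separated list into single names */
		for (tok = strtok(buf, ","); tok != NULL && count < MAX_EVENTS;
		     tok = strtok(NULL, ",")) {
			snprintf(out[count], MAX_NAME, "%s", tok);
			count++;
		}

		/* continue searching after the current match */
		pattern += m.rm_eo;
	}

	regfree(&reg);

	return count;
}
```

With the argument from the documentation example, i.e.
'.*pmu.read\|e=cpu_cycles,l1d_cache', the sketch is expected to yield
cpu_cycles followed by l1d_cache, which is also the order of the indexes
passed to the tracepoint.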

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 89e38cd301..c4dfe85c3b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -19,6 +19,11 @@ dedicated tasks interrupting those tasks with perf may be undesirable.
 
 In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 3e0ea5835c..9c81936e35 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance monitoring unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Unlike other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 75162b722d..8796052d0c 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 051f89809c..9d6faa19ed 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -77,3 +77,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 6f5c022558..c7da83c480 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -285,6 +288,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 2ae57ee78a..01e7a099d2 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -441,6 +441,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_lcore_register_usage_cb;
 	rte_thread_create_control;
 	rte_thread_set_name;
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index 950f999cb7..862edcb1e3 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -398,6 +398,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret = 0;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b1d1c17bc5..e1c3bb5e56 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -176,6 +176,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index 39a4f279c1..e16b3ff009 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -9,6 +9,7 @@ EXPERIMENTAL {
 	per_lcore__event_group;
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1



* RE: [PATCH v10 1/4] lib: add generic support for reading PMU events
  2023-02-13 11:31                   ` [PATCH v10 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-16  7:39                     ` Ruifeng Wang
  2023-02-16 14:44                       ` Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Ruifeng Wang @ 2023-02-16  7:39 UTC (permalink / raw)
  To: Tomasz Duszynski, dev, thomas
  Cc: roretzla, bruce.richardson, jerinj, mattias.ronnblom, mb,
	zhoumin, david.marchand, nd

> -----Original Message-----
> From: Tomasz Duszynski <tduszynski@marvell.com>
> Sent: Monday, February 13, 2023 7:32 PM
> To: dev@dpdk.org; thomas@monjalon.net; Tomasz Duszynski <tduszynski@marvell.com>
> Cc: roretzla@linux.microsoft.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com; jerinj@marvell.com; mattias.ronnblom@ericsson.com;
> mb@smartsharesystems.com; zhoumin@loongson.cn; david.marchand@redhat.com
> Subject: [PATCH v10 1/4] lib: add generic support for reading PMU events
> 
> Add support for programming PMU counters and reading their values in runtime bypassing
> kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated i.e run dedicated tasks.
> In such cases one cannot use standard perf utility without sacrificing latency and
> performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   2 +
>  app/test/test_pmu.c                    |  62 ++++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |  12 +
>  doc/guides/rel_notes/release_23_03.rst |   7 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  32 ++
>  lib/pmu/rte_pmu.c                      | 460 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 212 ++++++++++++
>  lib/pmu/version.map                    |  15 +
>  13 files changed, 824 insertions(+), 1 deletion(-)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/pmu/meson.build
>  create mode 100644 lib/pmu/pmu_private.h
>  create mode 100644 lib/pmu/rte_pmu.c
>  create mode 100644 lib/pmu/rte_pmu.h
>  create mode 100644 lib/pmu/version.map
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3495946d0f..d37f242120 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>  F: lib/node/
> 
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski <tduszynski@marvell.com>
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
> 
>  Test Applications
>  -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..6b61b7fc32 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -111,6 +111,7 @@ test_sources = files(
>          'test_reciprocal_division_perf.c',
>          'test_red.c',
>          'test_pie.c',
> +        'test_pmu.c',
>          'test_reorder.c',
>          'test_rib.c',
>          'test_rib6.c',
> @@ -239,6 +240,7 @@ fast_tests = [
>          ['kni_autotest', false, true],
>          ['kvargs_autotest', true, true],
>          ['member_autotest', true, true],
> +        ['pmu_autotest', true, true],
>          ['power_cpufreq_autotest', false, true],
>          ['power_autotest', true, true],
>          ['power_kvm_vm_autotest', false, true],
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..a64564b5f5
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2023 Marvell International Ltd.
> + */
> +
> +#include "test.h"
> +
> +#ifndef RTE_EXEC_ENV_LINUX
> +
> +static int
> +test_pmu(void)
> +{
> +	printf("pmu_autotest only supported on Linux, skipping test\n");
> +	return TEST_SKIPPED;
> +}
> +
> +#else
> +
> +#include <rte_pmu.h>
> +
> +static int
> +test_pmu_read(void)
> +{
> +	const char *name = NULL;
> +	int tries = 10, event;
> +	uint64_t val = 0;
> +
> +	if (name == NULL) {
> +		printf("PMU not supported on this arch\n");
> +		return TEST_SKIPPED;
> +	}
> +
> +	if (rte_pmu_init() < 0)
> +		return TEST_FAILED;

Can we return TEST_SKIPPED here?
On aarch64, this feature requires kernel version >= 5.17. CI setups that don't meet this requirement
will start to report failures when running fast_tests.

> +
> +	event = rte_pmu_add_event(name);
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	rte_pmu_fini();
> +
> +	return val ? TEST_SUCCESS : TEST_FAILED;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
<snip>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v10 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-13 11:31                   ` [PATCH v10 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-16  7:41                     ` Ruifeng Wang
  0 siblings, 0 replies; 139+ messages in thread
From: Ruifeng Wang @ 2023-02-16  7:41 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: roretzla, bruce.richardson, jerinj, mattias.ronnblom, mb, thomas,
	zhoumin, david.marchand, nd

> -----Original Message-----
> From: Tomasz Duszynski <tduszynski@marvell.com>
> Sent: Monday, February 13, 2023 7:32 PM
> To: dev@dpdk.org; Tomasz Duszynski <tduszynski@marvell.com>; Ruifeng Wang
> <Ruifeng.Wang@arm.com>
> Cc: roretzla@linux.microsoft.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com; jerinj@marvell.com; mattias.ronnblom@ericsson.com;
> mb@smartsharesystems.com; thomas@monjalon.net; zhoumin@loongson.cn;
> david.marchand@redhat.com
> Subject: [PATCH v10 2/4] pmu: support reading ARM PMU events in runtime
> 
> Add support for reading ARM PMU events in runtime.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  app/test/test_pmu.c         |  4 ++
>  lib/pmu/meson.build         |  7 +++
>  lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h           |  4 ++
>  lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
>  5 files changed, 139 insertions(+)
>  create mode 100644 lib/pmu/pmu_arm64.c
>  create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
> 
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>

Thanks.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v10 1/4] lib: add generic support for reading PMU events
  2023-02-16  7:39                     ` Ruifeng Wang
@ 2023-02-16 14:44                       ` Tomasz Duszynski
  0 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-16 14:44 UTC (permalink / raw)
  To: Ruifeng Wang, dev, thomas
  Cc: roretzla, bruce.richardson, Jerin Jacob Kollanukkaran,
	mattias.ronnblom, mb, zhoumin, david.marchand, nd

[...]

>> +
>> +	if (rte_pmu_init() < 0)
>> +		return TEST_FAILED;
>
>Can we return TEST_SKIPPED here?
>On aarch64, this feature requires kernel version >= 5.17. CI setups that don't meet this requirement
>will start to report failures when running fast_tests.
>

Okay. I think that's good enough for CI. 

>> +
>> +	event = rte_pmu_add_event(name);
>> +	while (tries--)
>> +		val += rte_pmu_read(event);
>> +
>> +	rte_pmu_fini();
>> +
>> +	return val ? TEST_SUCCESS : TEST_FAILED;
>> +}
>> +
>> +static struct unit_test_suite pmu_tests = {
>> +	.suite_name = "pmu autotest",
>> +	.setup = NULL,
>> +	.teardown = NULL,
>> +	.unit_test_cases = {
>> +		TEST_CASE(test_pmu_read),
>> +		TEST_CASES_END()
>> +	}
>> +};
>> +
><snip>

^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v11 0/4] add support for self monitoring
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
                                     ` (3 preceding siblings ...)
  2023-02-13 11:31                   ` [PATCH v10 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-16 17:54                   ` Tomasz Duszynski
  2023-02-16 17:54                     ` [PATCH v11 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                                       ` (5 more replies)
  4 siblings, 6 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-16 17:54 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, david.marchand,
	Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores running dedicated tasks.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by names, which need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
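
For illustration only, here is a rough sketch of how a named event could be
turned into perf config words, loosely following parse_event() and
get_term_format() from patch 1/4. It assumes an event file holds a
comma-separated list of <term>=<value> tokens and that each term has a format
description such as "config:0-7". The helper names decode_format() and
apply_term() are hypothetical and not part of the series, and the strings are
passed in directly instead of being read from sysfs.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Same bit helpers as defined in lib/pmu/rte_pmu.c in this series. */
#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))

/*
 * Hypothetical helper: decode a format description such as "config:0-7"
 * or "config1:8" into a config word index and a bit mask.
 * Returns 0 on success, -1 on malformed input.
 */
static int
decode_format(const char *fmt, int *num, uint64_t *mask)
{
	char word[16];
	int low, high, ret;
	size_t len;

	ret = sscanf(fmt, "%15[^:]:%d-%d", word, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single-bit field */
	if (low < 0 || high > 63 || low > high)
		return -1;

	len = strlen(word);
	assert(len > 0); /* %[ only succeeds after matching at least one char */

	*mask = GENMASK_ULL(high, low);
	/* A trailing digit selects config1/config2; no digit means config. */
	*num = isdigit((unsigned char)word[len - 1]) ? word[len - 1] - '0' : 0;

	return *num > 2 ? -1 : 0;
}

/*
 * Hypothetical helper: apply one "<term>=<value>" token to config[],
 * given the format description of that term.
 */
static int
apply_term(const char *token, const char *fmt, uint64_t config[3])
{
	char term[32];
	uint64_t mask;
	int num, val, ret;

	ret = sscanf(token, "%31[^=]=%i", term, &val);
	if (ret < 1)
		return -1;
	if (ret == 1)
		val = 1; /* a bare term is treated as <term>=1, as in the patch */

	if (decode_format(fmt, &num, &mask))
		return -1;

	config[num] |= FIELD_PREP(mask, val);

	return 0;
}
```

With formats "config:0-7" for event and "config:8-15" for umask, the tokens
"event=0x3c" and "umask=0x01" would combine into config[0] == 0x13c. The
actual layout depends on the PMU and has to be taken from sysfs.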

v11:
- skip fast test in case init fails
v10:
- check permissions before using counters
- do not use internal symbols in exported functions
- address review comments
v9:
- fix 'maybe-uninitialized' warning reported by CI
v8:
- just rebase series
v7:
- use per-lcore event group instead of global table indexed by lcore-id
- don't add pmu_autotest to fast tests due to lack of support on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra
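
As background for the v5 sign extension item: __rte_pmu_read_userpage() in
patch 1/4 shifts the raw counter left and then right by (64 - pmc_width)
before adding it to the offset. A minimal standalone sketch of that step is
shown below. The function name is hypothetical, and the sketch shifts left on
an unsigned value before converting, relying on the usual GCC/Clang behaviour
of arithmetic right shift for signed integers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: sign-extend a raw counter value that is only
 * 'width' bits wide (1..64) to a full 64-bit signed value.
 */
static int64_t
sign_extend_counter(uint64_t raw, unsigned int width)
{
	uint64_t shifted;

	assert(width >= 1 && width <= 64);

	/* Move the counter's top bit into bit 63 using an unsigned shift. */
	shifted = raw << (64 - width);

	/* Shift back arithmetically so that the top bit gets replicated. */
	return (int64_t)shifted >> (64 - width);
}
```

For a 48-bit counter, a raw value of 0xFFFFFFFFFFFF comes out as -1, so adding
it to the offset from the user page behaves like a subtraction instead of a
large positive jump.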

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   2 +
 app/test/test_pmu.c                      |  68 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  17 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   7 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   1 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  32 ++
 lib/pmu/rte_pmu.c                        | 521 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 232 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  16 +
 23 files changed, 1149 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-02-16 17:54                     ` Tomasz Duszynski
  2023-02-16 23:50                       ` Konstantin Ananyev
  2023-02-21  2:17                       ` Konstantin Ananyev
  2023-02-16 17:55                     ` [PATCH v11 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                                       ` (4 subsequent siblings)
  5 siblings, 2 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-16 17:54 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin, david.marchand

Add support for programming PMU counters and reading their values
in runtime bypassing kernel completely.

This is especially useful in cases where CPU cores are isolated,
i.e. run dedicated tasks. In such cases one cannot use the standard
perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   2 +
 app/test/test_pmu.c                    |  62 ++++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |  12 +
 doc/guides/rel_notes/release_23_03.rst |   7 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  32 ++
 lib/pmu/rte_pmu.c                      | 460 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 212 ++++++++++++
 lib/pmu/version.map                    |  15 +
 13 files changed, 824 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 3495946d0f..d37f242120 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..6b61b7fc32 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
@@ -239,6 +240,7 @@ fast_tests = [
         ['kni_autotest', false, true],
         ['kvargs_autotest', true, true],
         ['member_autotest', true, true],
+        ['pmu_autotest', true, true],
         ['power_cpufreq_autotest', false, true],
         ['power_autotest', true, true],
         ['power_kvm_vm_autotest', false, true],
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..c257638e8b
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	const char *name = NULL;
+	int tries = 10, event;
+	uint64_t val = 0;
+
+	if (name == NULL) {
+		printf("PMU not supported on this arch\n");
+		return TEST_SKIPPED;
+	}
+
+	if (rte_pmu_init() < 0)
+		return TEST_SKIPPED;
+
+	event = rte_pmu_add_event(name);
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 2deec7ea19..a8e04a195d 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -223,7 +223,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index e859426099..350b5a8c94 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..89e38cd301 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,18 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+Majority of architectures support some performance monitoring unit (PMU).
+Such unit provides programmable counters that monitor specific events.
+
+Different tools gather that information, like for example perf.
+However, in some scenarios when CPU cores are isolated and run
+dedicated tasks interrupting those tasks with perf may be undesirable.
+
+In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
+
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index ab998a5357..20622efe58 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -147,6 +147,13 @@ New Features
   * Added support to capture packets at each graph node with packet metadata and
     node name.
 
+* **Added PMU library.**
+
+  Added a new performance monitoring unit (PMU) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 
 Removed Items
 -------------
diff --git a/lib/meson.build b/lib/meson.build
index 450c061d2b..8a42d45d20 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..b9f8c1ddc8
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ *
+ * @param config
+ *   Architecture specific event configuration. Consult kernel sources for available options.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..950f999cb7
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,460 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char path[PATH_MAX];
+	char *config = NULL;
+	int high, low, ret;
+	FILE *fp;
+
+	*num = *mask = 0;
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf), fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+		if (!group->mmap_pages[i]->cap_user_rdpmc) {
+			ret = -EPERM;
+			goto out;
+		}
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int
+__rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized != 0) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..6b664c3336
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,212 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read PMU counter.
+ *
+ * @warning This should be not called directly.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+static __rte_always_inline uint64_t
+__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @warning This should be not called directly.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+__rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @warning This should be not called directly.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = __rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return __rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..39a4f279c1
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,15 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	__rte_pmu_enable_group;
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
-- 
2.34.1
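
The shift pair applied to the raw counter value near the end of the user-page read path above can be sketched in isolation. This is a minimal illustration, not the library code: the helper name trim_counter() is hypothetical, and it assumes the width is between 1 and 64 (a width of 0 would shift by 64 bits, which is undefined in C).

```c
#include <stdint.h>

/*
 * Hypothetical helper: keep only the low `width` bits of a raw counter
 * value. Shifting an unsigned value left and then right by (64 - width)
 * discards the upper bits. `width` must be in the range 1..64.
 */
static inline uint64_t
trim_counter(uint64_t raw, unsigned int width)
{
	raw <<= 64 - width;
	raw >>= 64 - width;

	return raw;
}
```

For a 48-bit counter, for instance, any stale upper 16 bits are dropped before the value is added to the offset.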


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v11 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-16 17:54                     ` [PATCH v11 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-16 17:55                     ` Tomasz Duszynski
  2023-02-16 17:55                     ` [PATCH v11 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-16 17:55 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, david.marchand

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index c257638e8b..e0220e3c59 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -24,6 +24,10 @@ test_pmu_read(void)
 	int tries = 10, event;
 	uint64_t val = 0;
 
+#if defined(RTE_ARCH_ARM64)
+	name = "cpu_cycles";
+#endif
+
 	if (name == NULL) {
 		printf("PMU not supported on this arch\n");
 		return TEST_SKIPPED;
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+	buf[ret] = '\0';
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 6b664c3336..bcc8e3b22d 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1
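
For readers following pmu_arch_init(), here is a standalone sketch of the "read one integer from a procfs-style file" pattern. It is an illustration under assumptions, not the patch code: read_int_attr() is a hypothetical name, the buffer size is arbitrary, and the sketch terminates the buffer explicitly (read() does not add a NUL) and saves errno before close() can overwrite it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read a decimal integer from a small text file. */
static int
read_int_attr(const char *path, int *val)
{
	char buf[64];
	ssize_t n;
	int fd, err;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	/* leave room for the terminator that read() does not add */
	n = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* save before close() has a chance to change it */
	close(fd);
	if (n == -1)
		return -err;

	buf[n] = '\0';
	*val = (int)strtol(buf, NULL, 10);

	return 0;
}
```

On a kernel that exposes it, calling read_int_attr("/proc/sys/kernel/perf_user_access", &v) would report whether userspace counter access is already enabled.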


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v11 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-16 17:54                     ` [PATCH v11 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-16 17:55                     ` [PATCH v11 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-16 17:55                     ` Tomasz Duszynski
  2023-02-16 17:55                     ` [PATCH v11 4/4] eal: add PMU support to tracing library Tomasz Duszynski
                                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-16 17:55 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, david.marchand

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index e0220e3c59..bed7101a5d 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	name = "cpu_cycles";
+#elif defined(RTE_ARCH_X86_64)
+	name = "cpu-cycles";
 #endif
 
 	if (name == NULL) {
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index bcc8e3b22d..b1d1c17bc5 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1
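
The return expression in rte_pmu_pmc_read() relies on both halves being 64-bit before the shift. RDPMC returns the low 32 bits of the counter in EAX and the high bits in EDX, so the combination can be sketched without inline assembly. The helper name combine_halves() is hypothetical and only serves the illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: join two register halves into one 64-bit value.
 * `high` is declared as uint64_t so that shifting by 32 is well defined;
 * shifting a 32-bit operand by 32 would be undefined behaviour in C.
 */
static inline uint64_t
combine_halves(uint64_t low, uint64_t high)
{
	return low | (high << 32);
}
```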


^ permalink raw reply	[flat|nested] 139+ messages in thread

* [PATCH v11 4/4] eal: add PMU support to tracing library
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
                                       ` (2 preceding siblings ...)
  2023-02-16 17:55                     ` [PATCH v11 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-16 17:55                     ` Tomasz Duszynski
  2023-02-16 18:03                     ` [PATCH v11 0/4] add support for self monitoring Ruifeng Wang
  2023-05-04  8:02                     ` David Marchand
  5 siblings, 0 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-16 17:55 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, mattias.ronnblom, mb,
	thomas, zhoumin, david.marchand

In order to profile an app one needs to store a significant amount of
samples somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  1 +
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 89e38cd301..c4dfe85c3b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -19,6 +19,11 @@ dedicated tasks interrupting those tasks with perf may be undesirable.
 
 In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 3e0ea5835c..9c81936e35 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance monitoring unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 75162b722d..8796052d0c 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 051f89809c..9d6faa19ed 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -77,3 +77,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 6f5c022558..c7da83c480 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -285,6 +288,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 2ae57ee78a..01e7a099d2 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -441,6 +441,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_lcore_register_usage_cb;
 	rte_thread_create_control;
 	rte_thread_set_name;
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index 950f999cb7..862edcb1e3 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -398,6 +398,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret = 0;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b1d1c17bc5..e1c3bb5e56 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -176,6 +176,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index 39a4f279c1..e16b3ff009 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -9,6 +9,7 @@ EXPERIMENTAL {
 	per_lcore__event_group;
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [PATCH v11 0/4] add support for self monitoring
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
                                       ` (3 preceding siblings ...)
  2023-02-16 17:55                     ` [PATCH v11 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-16 18:03                     ` Ruifeng Wang
  2023-05-04  8:02                     ` David Marchand
  5 siblings, 0 replies; 139+ messages in thread
From: Ruifeng Wang @ 2023-02-16 18:03 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: roretzla, bruce.richardson, jerinj, mattias.ronnblom, mb, thomas,
	zhoumin, david.marchand, nd

> -----Original Message-----
> From: Tomasz Duszynski <tduszynski@marvell.com>
> Sent: Friday, February 17, 2023 1:55 AM
> To: dev@dpdk.org
> Cc: roretzla@linux.microsoft.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com; jerinj@marvell.com; mattias.ronnblom@ericsson.com;
> mb@smartsharesystems.com; thomas@monjalon.net; zhoumin@loongson.cn;
> david.marchand@redhat.com; Tomasz Duszynski <tduszynski@marvell.com>
> Subject: [PATCH v11 0/4] add support for self monitoring
> 
> This series adds self monitoring support i.e allows to configure and read performance
> measurement unit (PMU) counters in runtime without using perf utility. This has certain
> advantages when application runs on isolated cores running dedicated tasks.
> 
> Events can be read directly using rte_pmu_read() or using dedicated tracepoint
> rte_eal_trace_pmu_read(). The latter will cause events to be stored inside CTF file.
> 
> By design, all enabled events are grouped together and the same group is attached to
> lcores that use self monitoring functionality.
> 
> Events are enabled by names, which need to be read from standard location under sysfs i.e
> 
> /sys/bus/event_source/devices/PMU/events
> 
> where PMU is a core pmu i.e one measuring cpu events. As of today raw events are not
> supported.
> 
> v11:
> - skip fast test in case init fails
> v10:
> - check permissions before using counters
> - do not use internal symbols in exported functions
> - address review comments
> v9:
> - fix 'maybe-uninitialized' warning reported by CI
> v8:
> - just rebase series
> v7:
> - use per-lcore event group instead of global table index by lcore-id
> - don't add pmu_autotest to fast tests due to lack of support on
>   every arch
> v6:
> - move codebase to the separate library
> - address review comments
> v5:
> - address review comments
> - fix sign extension while reading pmu on x86
> - fix regex mentioned in doc
> - various minor changes/improvements here and there
> v4:
> - fix freeing mem detected by debug_autotest
> v3:
> - fix shared build
> v2:
> - fix problems reported by test build infra
> 
> Tomasz Duszynski (4):
>   lib: add generic support for reading PMU events
>   pmu: support reading ARM PMU events in runtime
>   pmu: support reading Intel x86_64 PMU events in runtime
>   eal: add PMU support to tracing library
> 
>  MAINTAINERS                              |   5 +
>  app/test/meson.build                     |   2 +
>  app/test/test_pmu.c                      |  68 +++
>  app/test/test_trace_perf.c               |  10 +
>  doc/api/doxy-api-index.md                |   3 +-
>  doc/api/doxy-api.conf.in                 |   1 +
>  doc/guides/prog_guide/profile_app.rst    |  17 +
>  doc/guides/prog_guide/trace_lib.rst      |  32 ++
>  doc/guides/rel_notes/release_23_03.rst   |   7 +
>  lib/eal/common/eal_common_trace.c        |  13 +-
>  lib/eal/common/eal_common_trace_points.c |   5 +
>  lib/eal/include/rte_eal_trace.h          |  13 +
>  lib/eal/meson.build                      |   3 +
>  lib/eal/version.map                      |   1 +
>  lib/meson.build                          |   1 +
>  lib/pmu/meson.build                      |  21 +
>  lib/pmu/pmu_arm64.c                      |  94 ++++
>  lib/pmu/pmu_private.h                    |  32 ++
>  lib/pmu/rte_pmu.c                        | 521 +++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                        | 232 ++++++++++
>  lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
>  lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
>  lib/pmu/version.map                      |  16 +
>  23 files changed, 1149 insertions(+), 2 deletions(-)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/pmu/meson.build
>  create mode 100644 lib/pmu/pmu_arm64.c
>  create mode 100644 lib/pmu/pmu_private.h
>  create mode 100644 lib/pmu/rte_pmu.c
>  create mode 100644 lib/pmu/rte_pmu.h
>  create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
>  create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
>  create mode 100644 lib/pmu/version.map
> 
> --
> 2.34.1

For the series,
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (6 preceding siblings ...)
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
@ 2023-02-16 20:56         ` Liang Ma
  7 siblings, 0 replies; 139+ messages in thread
From: Liang Ma @ 2023-02-16 20:56 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin

PMU is a kind of MSR; more precisely, it is a per-core MSR.
Every MSR read leads to an IPI (please refer to the kernel
implementation of the MSR driver). The IPI will disturb the DPDK application
because of the userspace/kernel context switch, which impacts
tail latency.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-16 17:54                     ` [PATCH v11 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-16 23:50                       ` Konstantin Ananyev
  2023-02-17  8:49                         ` [EXT] " Tomasz Duszynski
  2023-02-21  2:17                       ` Konstantin Ananyev
  1 sibling, 1 reply; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-16 23:50 UTC (permalink / raw)
  To: dev

16/02/2023 17:54, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> i.e run dedicated tasks. In such cases one cannot use standard
> perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>   MAINTAINERS                            |   5 +
>   app/test/meson.build                   |   2 +
>   app/test/test_pmu.c                    |  62 ++++
>   doc/api/doxy-api-index.md              |   3 +-
>   doc/api/doxy-api.conf.in               |   1 +
>   doc/guides/prog_guide/profile_app.rst  |  12 +
>   doc/guides/rel_notes/release_23_03.rst |   7 +
>   lib/meson.build                        |   1 +
>   lib/pmu/meson.build                    |  13 +
>   lib/pmu/pmu_private.h                  |  32 ++
>   lib/pmu/rte_pmu.c                      | 460 +++++++++++++++++++++++++
>   lib/pmu/rte_pmu.h                      | 212 ++++++++++++
>   lib/pmu/version.map                    |  15 +
>   13 files changed, 824 insertions(+), 1 deletion(-)
>   create mode 100644 app/test/test_pmu.c
>   create mode 100644 lib/pmu/meson.build
>   create mode 100644 lib/pmu/pmu_private.h
>   create mode 100644 lib/pmu/rte_pmu.c
>   create mode 100644 lib/pmu/rte_pmu.h
>   create mode 100644 lib/pmu/version.map
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3495946d0f..d37f242120 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>   M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>   F: lib/node/
>   
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski <tduszynski@marvell.com>
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
>   
>   Test Applications
>   -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..6b61b7fc32 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -111,6 +111,7 @@ test_sources = files(
>           'test_reciprocal_division_perf.c',
>           'test_red.c',
>           'test_pie.c',
> +        'test_pmu.c',
>           'test_reorder.c',
>           'test_rib.c',
>           'test_rib6.c',
> @@ -239,6 +240,7 @@ fast_tests = [
>           ['kni_autotest', false, true],
>           ['kvargs_autotest', true, true],
>           ['member_autotest', true, true],
> +        ['pmu_autotest', true, true],
>           ['power_cpufreq_autotest', false, true],
>           ['power_autotest', true, true],
>           ['power_kvm_vm_autotest', false, true],
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..c257638e8b
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,62 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2023 Marvell International Ltd.
> + */
> +
> +#include "test.h"
> +
> +#ifndef RTE_EXEC_ENV_LINUX
> +
> +static int
> +test_pmu(void)
> +{
> +	printf("pmu_autotest only supported on Linux, skipping test\n");
> +	return TEST_SKIPPED;
> +}
> +
> +#else
> +
> +#include <rte_pmu.h>
> +
> +static int
> +test_pmu_read(void)
> +{
> +	const char *name = NULL;
> +	int tries = 10, event;
> +	uint64_t val = 0;
> +
> +	if (name == NULL) {
> +		printf("PMU not supported on this arch\n");
> +		return TEST_SKIPPED;
> +	}
> +
> +	if (rte_pmu_init() < 0)
> +		return TEST_SKIPPED;
> +
> +	event = rte_pmu_add_event(name);
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	rte_pmu_fini();
> +
> +	return val ? TEST_SUCCESS : TEST_FAILED;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index 2deec7ea19..a8e04a195d 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -223,7 +223,8 @@ The public API headers are grouped by topics:
>     [log](@ref rte_log.h),
>     [errno](@ref rte_errno.h),
>     [trace](@ref rte_trace.h),
> -  [trace_point](@ref rte_trace_point.h)
> +  [trace_point](@ref rte_trace_point.h),
> +  [pmu](@ref rte_pmu.h)
>   
>   - **misc**:
>     [EAL config](@ref rte_eal.h),
> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
> index e859426099..350b5a8c94 100644
> --- a/doc/api/doxy-api.conf.in
> +++ b/doc/api/doxy-api.conf.in
> @@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
>                             @TOPDIR@/lib/pci \
>                             @TOPDIR@/lib/pdump \
>                             @TOPDIR@/lib/pipeline \
> +                          @TOPDIR@/lib/pmu \
>                             @TOPDIR@/lib/port \
>                             @TOPDIR@/lib/power \
>                             @TOPDIR@/lib/rawdev \
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index 14292d4c25..89e38cd301 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,18 @@ Profile Your Application
>   The following sections describe methods of profiling DPDK applications on
>   different architectures.
>   
> +Performance counter based profiling
> +-----------------------------------
> +
> +The majority of architectures support some kind of performance monitoring unit (PMU).
> +Such a unit provides programmable counters that monitor specific events.
> +
> +Different tools, for example perf, gather that information.
> +However, in some scenarios when CPU cores are isolated and run
> +dedicated tasks, interrupting those tasks with perf may be undesirable.
> +
> +In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
> +
>   
>   Profiling on x86
>   ----------------
> diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
> index ab998a5357..20622efe58 100644
> --- a/doc/guides/rel_notes/release_23_03.rst
> +++ b/doc/guides/rel_notes/release_23_03.rst
> @@ -147,6 +147,13 @@ New Features
>     * Added support to capture packets at each graph node with packet metadata and
>       node name.
>   
> +* **Added PMU library.**
> +
> +  Added a new performance monitoring unit (PMU) library which allows applications
> +  to perform self monitoring activities without depending on external utilities like perf.
> +  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
> +  can be stored in CTF format for further analysis.
> +
>   
>   Removed Items
>   -------------
> diff --git a/lib/meson.build b/lib/meson.build
> index 450c061d2b..8a42d45d20 100644
> --- a/lib/meson.build
> +++ b/lib/meson.build
> @@ -11,6 +11,7 @@
>   libraries = [
>           'kvargs', # eal depends on kvargs
>           'telemetry', # basic info querying
> +        'pmu',
>           'eal', # everything depends on eal
>           'ring',
>           'rcu', # rcu depends on ring
> diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
> new file mode 100644
> index 0000000000..a4160b494e
> --- /dev/null
> +++ b/lib/pmu/meson.build
> @@ -0,0 +1,13 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(C) 2023 Marvell International Ltd.
> +
> +if not is_linux
> +    build = false
> +    reason = 'only supported on Linux'
> +    subdir_done()
> +endif
> +
> +includes = [global_inc]
> +
> +sources = files('rte_pmu.c')
> +headers = files('rte_pmu.h')
> diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
> new file mode 100644
> index 0000000000..b9f8c1ddc8
> --- /dev/null
> +++ b/lib/pmu/pmu_private.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + *
> + * @param config
> + *   Architecture specific event configuration. Consult kernel sources for available options.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
> new file mode 100644
> index 0000000000..950f999cb7
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.c
> @@ -0,0 +1,460 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2023 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_per_lcore.h>
> +#include <rte_pmu.h>
> +#include <rte_spinlock.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"


I suppose that path (like the whole implementation) is Linux-specific?
If so, wouldn't it make sense to have it under a linux subdirectory?

> +
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +
> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> +struct rte_pmu rte_pmu;

Do we really need struct declaration here?
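
For reference, the two macros quoted just above build a bit mask from a
high/low bit pair and place a value into that mask. A minimal sketch of the
same arithmetic written as plain functions, assuming GCC or Clang for
__builtin_ffsll (genmask_ull and field_prep are illustrative names, not part
of the patch):

```c
#include <stdint.h>

/* Mask with bits l..h (inclusive) set; assumes 0 <= l <= h <= 63. */
static inline uint64_t
genmask_ull(unsigned int h, unsigned int l)
{
	return (~0ULL - (1ULL << l) + 1) & (~0ULL >> (64 - 1 - h));
}

/*
 * Shift val to the lowest set bit of mask and drop anything that
 * does not fit. mask must be non-zero: __builtin_ffsll(0) returns 0,
 * which would make the shift count negative.
 */
static inline uint64_t
field_prep(uint64_t mask, uint64_t val)
{
	return (val << (__builtin_ffsll((long long)mask) - 1)) & mask;
}
```

With a format entry such as "config:8-15", low is 8 and high is 15, so the
mask covers bits 8..15 and a value is shifted left by 8 before being OR-ed
into the config word. Values wider than the field are silently truncated.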


> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
> +{
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char path[PATH_MAX];
> +	char *config = NULL;
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	*num = *mask = 0;
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
> +	fp = fopen(path, "r");
> +	if (fp == NULL)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +	fp = fopen(path, "r");
> +	if (fp == NULL)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
> +}
> +
> +static int
> +open_events(struct rte_pmu_event_group *group)
> +{
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret)
> +			continue;
> +
> +		ret = do_perf_event_open(config, group->fds[0]);
> +		if (ret == -1) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(struct rte_pmu_event_group *group)
> +{
> +	long page_size = sysconf(_SC_PAGE_SIZE);
> +	unsigned int i;
> +	void *addr;
> +	int ret;
> +
> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
> +			ret = -EPERM;
> +			goto out;
> +		}
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], page_size);
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(struct rte_pmu_event_group *group)
> +{
> +	unsigned int i;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	group->enabled = false;
> +}
> +
> +int
> +__rte_pmu_enable_group(void)
> +{
> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> +	int ret;
> +
> +	if (rte_pmu.num_group_events == 0)
> +		return -ENODEV;
> +
> +	ret = open_events(group);
> +	if (ret)
> +		goto out;
> +
> +	ret = mmap_events(group);
> +	if (ret)
> +		goto out;
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	rte_spinlock_lock(&rte_pmu.lock);
> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> +	rte_spinlock_unlock(&rte_pmu.lock);
> +	group->enabled = true;
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(group);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (dirp == NULL)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	if (dent) {
> +		rte_pmu.name = strdup(name);
> +		if (rte_pmu.name == NULL) {
> +			closedir(dirp);
> +
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	closedir(dirp);
> +
> +	return rte_pmu.name ? 0 : -ENODEV;
> +}
> +
> +static struct rte_pmu_event *
> +new_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +
> +	event = calloc(1, sizeof(*event));
> +	if (event == NULL)
> +		goto out;
> +
> +	event->name = strdup(name);
> +	if (event->name == NULL) {
> +		free(event);
> +		event = NULL;
> +	}
> +
> +out:
> +	return event;
> +}
> +
> +static void
> +free_event(struct rte_pmu_event *event)
> +{
> +	free(event->name);
> +	free(event);
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	if (rte_pmu.name == NULL)
> +		return -ENODEV;
> +
> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
> +		return -ENOSPC;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = new_event(name);
> +	if (event == NULL)
> +		return -ENOMEM;
> +
> +	event->index = rte_pmu.num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
> +
> +	return event->index;
> +}
> +
> +int
> +rte_pmu_init(void)
> +{
> +	int ret;
> +
> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
> +	 * via command line but application doesn't care enough and performs init/fini again.
> +	 */
> +	if (rte_pmu.initialized != 0) {
> +		rte_pmu.initialized++;
> +		return 0;
> +	}
> +
> +	ret = scan_pmus();
> +	if (ret)
> +		goto out;
> +
> +	ret = pmu_arch_init();
> +	if (ret)
> +		goto out;
> +
> +	TAILQ_INIT(&rte_pmu.event_list);
> +	TAILQ_INIT(&rte_pmu.event_group_list);
> +	rte_spinlock_init(&rte_pmu.lock);
> +	rte_pmu.initialized = 1;
> +
> +	return 0;
> +out:
> +	free(rte_pmu.name);
> +	rte_pmu.name = NULL;
> +
> +	return ret;
> +}
> +
> +void
> +rte_pmu_fini(void)
> +{
> +	struct rte_pmu_event_group *group, *tmp_group;
> +	struct rte_pmu_event *event, *tmp_event;
> +
> +	/* cleanup once init count drops to zero */
> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
> +		return;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
> +		free_event(event);
> +	}
> +
> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
> +		cleanup_events(group);
> +	}
> +
> +	pmu_arch_fini();
> +	free(rte_pmu.name);
> +	rte_pmu.name = NULL;
> +	rte_pmu.num_group_events = 0;
> +}
> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> new file mode 100644
> index 0000000000..6b664c3336
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.h
> @@ -0,0 +1,212 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +#include <rte_spinlock.h>
> +
> +/** Maximum number of events in a group */
> +#define MAX_NUM_GROUP_EVENTS 8
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> +} __rte_cache_aligned;
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /**< name of an event */
> +	unsigned int index; /**< event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> +	rte_spinlock_t lock; /**< serialize access to event group list */
> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> +	unsigned int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +	unsigned int initialized; /**< initialization counter */
> +};
> +
> +/** lcore event group */
> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> +
> +/** PMU state container */
> +extern struct rte_pmu rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read PMU counter.
> + *
> + * @warning This should be not called directly.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +static __rte_always_inline uint64_t
> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t width, offset;
> +	uint32_t seq, index;
> +	int64_t pmc;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();

Are you sure that rte_compiler_barrier() is enough here?
On some architectures the CPU itself is free to reorder reads.
Or am I missing something obvious here?
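
To illustrate the concern: if the page could be updated from another CPU,
the reads inside the loop would need acquire ordering, not just a compiler
barrier. A minimal sketch of such a sequence-counter reader with C11 atomics,
assuming a simplified page layout (struct demo_page and demo_read are
hypothetical names, not the patch's API):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields the reader looks at. */
struct demo_page {
	_Atomic uint32_t lock;   /* sequence counter bumped by the writer */
	_Atomic uint64_t offset; /* payload guarded by the sequence counter */
};

static uint64_t
demo_read(struct demo_page *pc)
{
	uint32_t seq;
	uint64_t offset;

	for (;;) {
		/* acquire: payload loads cannot be hoisted above this load */
		seq = atomic_load_explicit(&pc->lock, memory_order_acquire);
		offset = atomic_load_explicit(&pc->offset, memory_order_relaxed);

		/* acquire fence: payload loads complete before the re-check */
		atomic_thread_fence(memory_order_acquire);

		/* retry if the writer changed the counter in the meantime */
		if (atomic_load_explicit(&pc->lock, memory_order_relaxed) == seq)
			return offset;
	}
}
```

Whether this stronger ordering is actually required depends on who writes the
page. If it is only ever updated on the CPU that performs the read, a compiler
barrier may well be sufficient.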

> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		/* index set to 0 means that particular counter cannot be used */
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +			offset += pmc;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return offset;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Enable group of events on the calling lcore.
> + *
> + * @warning This should be not called directly.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +__rte_pmu_enable_group(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Initialize PMU library.
> + *
> + * @warning This should be not called directly.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_init(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
> + */
> +__rte_experimental
> +void
> +rte_pmu_fini(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(unsigned int index)
> +{
> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> +	int ret;
> +
> +	if (unlikely(!rte_pmu.initialized))
> +		return 0;
> +
> +	if (unlikely(!group->enabled)) {
> +		ret = __rte_pmu_enable_group();
> +		if (ret)
> +			return 0;
> +	}
> +
> +	if (unlikely(index >= rte_pmu.num_group_events))
> +		return 0;
> +
> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
> new file mode 100644
> index 0000000000..39a4f279c1
> --- /dev/null
> +++ b/lib/pmu/version.map
> @@ -0,0 +1,15 @@
> +DPDK_23 {
> +	local: *;
> +};
> +
> +EXPERIMENTAL {
> +	global:
> +
> +	__rte_pmu_enable_group;
> +	per_lcore__event_group;
> +	rte_pmu;
> +	rte_pmu_add_event;
> +	rte_pmu_fini;
> +	rte_pmu_init;
> +	rte_pmu_read;
> +};


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-16 23:50                       ` Konstantin Ananyev
@ 2023-02-17  8:49                         ` Tomasz Duszynski
  2023-02-17 10:14                           ` Konstantin Ananyev
  2023-02-21 12:15                           ` Konstantin Ananyev
  0 siblings, 2 replies; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-17  8:49 UTC (permalink / raw)
  To: Konstantin Ananyev, dev



>-----Original Message-----
>From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>Sent: Friday, February 17, 2023 12:51 AM
>To: dev@dpdk.org
>Subject: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
>
>16/02/2023 17:54, Tomasz Duszynski wrote:
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated i.e
>> run dedicated tasks. In such cases one cannot use standard perf
>> utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> ---
>>   MAINTAINERS                            |   5 +
>>   app/test/meson.build                   |   2 +
>>   app/test/test_pmu.c                    |  62 ++++
>>   doc/api/doxy-api-index.md              |   3 +-
>>   doc/api/doxy-api.conf.in               |   1 +
>>   doc/guides/prog_guide/profile_app.rst  |  12 +
>>   doc/guides/rel_notes/release_23_03.rst |   7 +
>>   lib/meson.build                        |   1 +
>>   lib/pmu/meson.build                    |  13 +
>>   lib/pmu/pmu_private.h                  |  32 ++
>>   lib/pmu/rte_pmu.c                      | 460 +++++++++++++++++++++++++
>>   lib/pmu/rte_pmu.h                      | 212 ++++++++++++
>>   lib/pmu/version.map                    |  15 +
>>   13 files changed, 824 insertions(+), 1 deletion(-)
>>   create mode 100644 app/test/test_pmu.c
>>   create mode 100644 lib/pmu/meson.build
>>   create mode 100644 lib/pmu/pmu_private.h
>>   create mode 100644 lib/pmu/rte_pmu.c
>>   create mode 100644 lib/pmu/rte_pmu.h
>>   create mode 100644 lib/pmu/version.map
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS index 3495946d0f..d37f242120
>> 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>>   M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>>   F: lib/node/
>>
>> +PMU - EXPERIMENTAL
>> +M: Tomasz Duszynski <tduszynski@marvell.com>
>> +F: lib/pmu/
>> +F: app/test/test_pmu*
>> +
>>
>>   Test Applications
>>   -----------------
>> diff --git a/app/test/meson.build b/app/test/meson.build index
>> f34d19e3c3..6b61b7fc32 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -111,6 +111,7 @@ test_sources = files(
>>           'test_reciprocal_division_perf.c',
>>           'test_red.c',
>>           'test_pie.c',
>> +        'test_pmu.c',
>>           'test_reorder.c',
>>           'test_rib.c',
>>           'test_rib6.c',
>> @@ -239,6 +240,7 @@ fast_tests = [
>>           ['kni_autotest', false, true],
>>           ['kvargs_autotest', true, true],
>>           ['member_autotest', true, true],
>> +        ['pmu_autotest', true, true],
>>           ['power_cpufreq_autotest', false, true],
>>           ['power_autotest', true, true],
>>           ['power_kvm_vm_autotest', false, true], diff --git
>> a/app/test/test_pmu.c b/app/test/test_pmu.c new file mode 100644 index
>> 0000000000..c257638e8b
>> --- /dev/null
>> +++ b/app/test/test_pmu.c
>> @@ -0,0 +1,62 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(C) 2023 Marvell International Ltd.
>> + */
>> +
>> +#include "test.h"
>> +
>> +#ifndef RTE_EXEC_ENV_LINUX
>> +
>> +static int
>> +test_pmu(void)
>> +{
>> +	printf("pmu_autotest only supported on Linux, skipping test\n");
>> +	return TEST_SKIPPED;
>> +}
>> +
>> +#else
>> +
>> +#include <rte_pmu.h>
>> +
>> +static int
>> +test_pmu_read(void)
>> +{
>> +	const char *name = NULL;
>> +	int tries = 10, event;
>> +	uint64_t val = 0;
>> +
>> +	if (name == NULL) {
>> +		printf("PMU not supported on this arch\n");
>> +		return TEST_SKIPPED;
>> +	}
>> +
>> +	if (rte_pmu_init() < 0)
>> +		return TEST_SKIPPED;
>> +
>> +	event = rte_pmu_add_event(name);
>> +	while (tries--)
>> +		val += rte_pmu_read(event);
>> +
>> +	rte_pmu_fini();
>> +
>> +	return val ? TEST_SUCCESS : TEST_FAILED; }
>> +
>> +static struct unit_test_suite pmu_tests = {
>> +	.suite_name = "pmu autotest",
>> +	.setup = NULL,
>> +	.teardown = NULL,
>> +	.unit_test_cases = {
>> +		TEST_CASE(test_pmu_read),
>> +		TEST_CASES_END()
>> +	}
>> +};
>> +
>> +static int
>> +test_pmu(void)
>> +{
>> +	return unit_test_suite_runner(&pmu_tests);
>> +}
>> +
>> +#endif /* RTE_EXEC_ENV_LINUX */
>> +
>> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index 2deec7ea19..a8e04a195d 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -223,7 +223,8 @@ The public API headers are grouped by topics:
>>     [log](@ref rte_log.h),
>>     [errno](@ref rte_errno.h),
>>     [trace](@ref rte_trace.h),
>> -  [trace_point](@ref rte_trace_point.h)
>> +  [trace_point](@ref rte_trace_point.h),  [pmu](@ref rte_pmu.h)
>>
>>   - **misc**:
>>     [EAL config](@ref rte_eal.h),
>> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in index
>> e859426099..350b5a8c94 100644
>> --- a/doc/api/doxy-api.conf.in
>> +++ b/doc/api/doxy-api.conf.in
>> @@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
>>                             @TOPDIR@/lib/pci \
>>                             @TOPDIR@/lib/pdump \
>>                             @TOPDIR@/lib/pipeline \
>> +                          @TOPDIR@/lib/pmu \
>>                             @TOPDIR@/lib/port \
>>                             @TOPDIR@/lib/power \
>>                             @TOPDIR@/lib/rawdev \ diff --git
>> a/doc/guides/prog_guide/profile_app.rst
>> b/doc/guides/prog_guide/profile_app.rst
>> index 14292d4c25..89e38cd301 100644
>> --- a/doc/guides/prog_guide/profile_app.rst
>> +++ b/doc/guides/prog_guide/profile_app.rst
>> @@ -7,6 +7,18 @@ Profile Your Application
>>   The following sections describe methods of profiling DPDK applications on
>>   different architectures.
>>
>> +Performance counter based profiling
>> +-----------------------------------
>> +
>> +Majority of architectures support some performance monitoring unit (PMU).
>> +Such unit provides programmable counters that monitor specific events.
>> +
>> +Different tools gather that information, like for example perf.
>> +However, in some scenarios when CPU cores are isolated and run
>> +dedicated tasks interrupting those tasks with perf may be undesirable.
>> +
>> +In such cases, an application can use the PMU library to read such events via
>``rte_pmu_read()``.
>> +
>>
>>   Profiling on x86
>>   ----------------
>> diff --git a/doc/guides/rel_notes/release_23_03.rst
>> b/doc/guides/rel_notes/release_23_03.rst
>> index ab998a5357..20622efe58 100644
>> --- a/doc/guides/rel_notes/release_23_03.rst
>> +++ b/doc/guides/rel_notes/release_23_03.rst
>> @@ -147,6 +147,13 @@ New Features
>>     * Added support to capture packets at each graph node with packet metadata and
>>       node name.
>>
>> +* **Added PMU library.**
>> +
>> +  Added a new performance monitoring unit (PMU) library which allows
>> + applications  to perform self monitoring activities without depending on external utilities
>like perf.
>> +  After integration with :doc:`../prog_guide/trace_lib` data gathered
>> + from hardware counters  can be stored in CTF format for further analysis.
>> +
>>
>>   Removed Items
>>   -------------
>> diff --git a/lib/meson.build b/lib/meson.build index
>> 450c061d2b..8a42d45d20 100644
>> --- a/lib/meson.build
>> +++ b/lib/meson.build
>> @@ -11,6 +11,7 @@
>>   libraries = [
>>           'kvargs', # eal depends on kvargs
>>           'telemetry', # basic info querying
>> +        'pmu',
>>           'eal', # everything depends on eal
>>           'ring',
>>           'rcu', # rcu depends on ring diff --git
>> a/lib/pmu/meson.build b/lib/pmu/meson.build new file mode 100644 index
>> 0000000000..a4160b494e
>> --- /dev/null
>> +++ b/lib/pmu/meson.build
>> @@ -0,0 +1,13 @@
>> +# SPDX-License-Identifier: BSD-3-Clause # Copyright(C) 2023 Marvell
>> +International Ltd.
>> +
>> +if not is_linux
>> +    build = false
>> +    reason = 'only supported on Linux'
>> +    subdir_done()
>> +endif
>> +
>> +includes = [global_inc]
>> +
>> +sources = files('rte_pmu.c')
>> +headers = files('rte_pmu.h')
>> diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h new file
>> mode 100644 index 0000000000..b9f8c1ddc8
>> --- /dev/null
>> +++ b/lib/pmu/pmu_private.h
>> @@ -0,0 +1,32 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2023 Marvell
>> + */
>> +
>> +#ifndef _PMU_PRIVATE_H_
>> +#define _PMU_PRIVATE_H_
>> +
>> +/**
>> + * Architecture specific PMU init callback.
>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>> +int
>> +pmu_arch_init(void);
>> +
>> +/**
>> + * Architecture specific PMU cleanup callback.
>> + */
>> +void
>> +pmu_arch_fini(void);
>> +
>> +/**
>> + * Apply architecture specific settings to config before passing it to syscall.
>> + *
>> + * @param config
>> + *   Architecture specific event configuration. Consult kernel sources for available options.
>> + */
>> +void
>> +pmu_arch_fixup_config(uint64_t config[3]);
>> +
>> +#endif /* _PMU_PRIVATE_H_ */
>> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c new file mode
>> 100644 index 0000000000..950f999cb7
>> --- /dev/null
>> +++ b/lib/pmu/rte_pmu.c
>> @@ -0,0 +1,460 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(C) 2023 Marvell International Ltd.
>> + */
>> +
>> +#include <ctype.h>
>> +#include <dirent.h>
>> +#include <errno.h>
>> +#include <regex.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <sys/queue.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +#include <rte_atomic.h>
>> +#include <rte_per_lcore.h>
>> +#include <rte_pmu.h>
>> +#include <rte_spinlock.h>
>> +#include <rte_tailq.h>
>> +
>> +#include "pmu_private.h"
>> +
>> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
>
>
>I suppose that path (like the whole implementation) is Linux-specific?
>If so, wouldn't it make sense to have it under a linux subdirectory?
>

There are currently no plans to support it anywhere else, so a flat
directory structure is good enough.

>> +
>> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >>
>> +((64 - 1 - (h))))) #define FIELD_PREP(m, v) (((uint64_t)(v) <<
>> +(__builtin_ffsll(m) - 1)) & (m))
>> +
>> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> +struct rte_pmu rte_pmu;
>
>Do we really need struct declaration here?
>

What precisely is the problem with this placement?

>
>> +/*
>> + * Following __rte_weak functions provide default no-op.
>> +Architectures should override them if
>> + * necessary.
>> + */
>> +
>> +int
>> +__rte_weak pmu_arch_init(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fini(void)
>> +{
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3]) { }
>> +
>> +static int
>> +get_term_format(const char *name, int *num, uint64_t *mask) {
>> +	char path[PATH_MAX];
>> +	char *config = NULL;
>> +	int high, low, ret;
>> +	FILE *fp;
>> +
>> +	*num = *mask = 0;
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
>> +	fp = fopen(path, "r");
>> +	if (fp == NULL)
>> +		return -errno;
>> +
>> +	errno = 0;
>> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
>> +	if (ret < 2) {
>> +		ret = -ENODATA;
>> +		goto out;
>> +	}
>> +	if (errno) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	if (ret == 2)
>> +		high = low;
>> +
>> +	*mask = GENMASK_ULL(high, low);
>> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
>> +	*num = config[strlen(config) - 1];
>> +	*num = isdigit(*num) ? *num - '0' : 0;
>> +
>> +	ret = 0;
>> +out:
>> +	free(config);
>> +	fclose(fp);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +parse_event(char *buf, uint64_t config[3]) {
>> +	char *token, *term;
>> +	int num, ret, val;
>> +	uint64_t mask;
>> +
>> +	config[0] = config[1] = config[2] = 0;
>> +
>> +	token = strtok(buf, ",");
>> +	while (token) {
>> +		errno = 0;
>> +		/* <term>=<value> */
>> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
>> +		if (ret < 1)
>> +			return -ENODATA;
>> +		if (errno)
>> +			return -errno;
>> +		if (ret == 1)
>> +			val = 1;
>> +
>> +		ret = get_term_format(term, &num, &mask);
>> +		free(term);
>> +		if (ret)
>> +			return ret;
>> +
>> +		config[num] |= FIELD_PREP(mask, val);
>> +		token = strtok(NULL, ",");
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +get_event_config(const char *name, uint64_t config[3]) {
>> +	char path[PATH_MAX], buf[BUFSIZ];
>> +	FILE *fp;
>> +	int ret;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> +	fp = fopen(path, "r");
>> +	if (fp == NULL)
>> +		return -errno;
>> +
>> +	ret = fread(buf, 1, sizeof(buf), fp);
>> +	if (ret == 0) {
>> +		fclose(fp);
>> +
>> +		return -EINVAL;
>> +	}
>> +	fclose(fp);
>> +	buf[ret] = '\0';
>> +
>> +	return parse_event(buf, config);
>> +}
>> +
>> +static int
>> +do_perf_event_open(uint64_t config[3], int group_fd) {
>> +	struct perf_event_attr attr = {
>> +		.size = sizeof(struct perf_event_attr),
>> +		.type = PERF_TYPE_RAW,
>> +		.exclude_kernel = 1,
>> +		.exclude_hv = 1,
>> +		.disabled = 1,
>> +	};
>> +
>> +	pmu_arch_fixup_config(config);
>> +
>> +	attr.config = config[0];
>> +	attr.config1 = config[1];
>> +	attr.config2 = config[2];
>> +
>> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0); }
>> +
>> +static int
>> +open_events(struct rte_pmu_event_group *group) {
>> +	struct rte_pmu_event *event;
>> +	uint64_t config[3];
>> +	int num = 0, ret;
>> +
>> +	/* group leader gets created first, with fd = -1 */
>> +	group->fds[0] = -1;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> +		ret = get_event_config(event->name, config);
>> +		if (ret)
>> +			continue;
>> +
>> +		ret = do_perf_event_open(config, group->fds[0]);
>> +		if (ret == -1) {
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->fds[event->index] = ret;
>> +		num++;
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (--num; num >= 0; num--) {
>> +		close(group->fds[num]);
>> +		group->fds[num] = -1;
>> +	}
>> +
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +mmap_events(struct rte_pmu_event_group *group) {
>> +	long page_size = sysconf(_SC_PAGE_SIZE);
>> +	unsigned int i;
>> +	void *addr;
>> +	int ret;
>> +
>> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
>> +		if (addr == MAP_FAILED) {
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->mmap_pages[i] = addr;
>> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
>> +			ret = -EPERM;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (; i; i--) {
>> +		munmap(group->mmap_pages[i - 1], page_size);
>> +		group->mmap_pages[i - 1] = NULL;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static void
>> +cleanup_events(struct rte_pmu_event_group *group) {
>> +	unsigned int i;
>> +
>> +	if (group->fds[0] != -1)
>> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
>> +
>> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> +		if (group->mmap_pages[i]) {
>> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
>> +			group->mmap_pages[i] = NULL;
>> +		}
>> +
>> +		if (group->fds[i] != -1) {
>> +			close(group->fds[i]);
>> +			group->fds[i] = -1;
>> +		}
>> +	}
>> +
>> +	group->enabled = false;
>> +}
>> +
>> +int
>> +__rte_pmu_enable_group(void)
>> +{
>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> +	int ret;
>> +
>> +	if (rte_pmu.num_group_events == 0)
>> +		return -ENODEV;
>> +
>> +	ret = open_events(group);
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = mmap_events(group);
>> +	if (ret)
>> +		goto out;
>> +
>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	rte_spinlock_lock(&rte_pmu.lock);
>> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
>> +	rte_spinlock_unlock(&rte_pmu.lock);
>> +	group->enabled = true;
>> +
>> +	return 0;
>> +
>> +out:
>> +	cleanup_events(group);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +scan_pmus(void)
>> +{
>> +	char path[PATH_MAX];
>> +	struct dirent *dent;
>> +	const char *name;
>> +	DIR *dirp;
>> +
>> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
>> +	if (dirp == NULL)
>> +		return -errno;
>> +
>> +	while ((dent = readdir(dirp))) {
>> +		name = dent->d_name;
>> +		if (name[0] == '.')
>> +			continue;
>> +
>> +		/* sysfs entry should either contain cpus or be a cpu */
>> +		if (!strcmp(name, "cpu"))
>> +			break;
>> +
>> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
>> +		if (access(path, F_OK) == 0)
>> +			break;
>> +	}
>> +
>> +	if (dent) {
>> +		rte_pmu.name = strdup(name);
>> +		if (rte_pmu.name == NULL) {
>> +			closedir(dirp);
>> +
>> +			return -ENOMEM;
>> +		}
>> +	}
>> +
>> +	closedir(dirp);
>> +
>> +	return rte_pmu.name ? 0 : -ENODEV;
>> +}
>> +
>> +static struct rte_pmu_event *
>> +new_event(const char *name)
>> +{
>> +	struct rte_pmu_event *event;
>> +
>> +	event = calloc(1, sizeof(*event));
>> +	if (event == NULL)
>> +		goto out;
>> +
>> +	event->name = strdup(name);
>> +	if (event->name == NULL) {
>> +		free(event);
>> +		event = NULL;
>> +	}
>> +
>> +out:
>> +	return event;
>> +}
>> +
>> +static void
>> +free_event(struct rte_pmu_event *event) {
>> +	free(event->name);
>> +	free(event);
>> +}
>> +
>> +int
>> +rte_pmu_add_event(const char *name)
>> +{
>> +	struct rte_pmu_event *event;
>> +	char path[PATH_MAX];
>> +
>> +	if (rte_pmu.name == NULL)
>> +		return -ENODEV;
>> +
>> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
>> +		return -ENOSPC;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> +	if (access(path, R_OK))
>> +		return -ENODEV;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> +		if (!strcmp(event->name, name))
>> +			return event->index;
>> +		continue;
>> +	}
>> +
>> +	event = new_event(name);
>> +	if (event == NULL)
>> +		return -ENOMEM;
>> +
>> +	event->index = rte_pmu.num_group_events++;
>> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
>> +
>> +	return event->index;
>> +}
>> +
>> +int
>> +rte_pmu_init(void)
>> +{
>> +	int ret;
>> +
>> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
>> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
>> +	 * via command line but application doesn't care enough and performs init/fini again.
>> +	 */
>> +	if (rte_pmu.initialized != 0) {
>> +		rte_pmu.initialized++;
>> +		return 0;
>> +	}
>> +
>> +	ret = scan_pmus();
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = pmu_arch_init();
>> +	if (ret)
>> +		goto out;
>> +
>> +	TAILQ_INIT(&rte_pmu.event_list);
>> +	TAILQ_INIT(&rte_pmu.event_group_list);
>> +	rte_spinlock_init(&rte_pmu.lock);
>> +	rte_pmu.initialized = 1;
>> +
>> +	return 0;
>> +out:
>> +	free(rte_pmu.name);
>> +	rte_pmu.name = NULL;
>> +
>> +	return ret;
>> +}
>> +
>> +void
>> +rte_pmu_fini(void)
>> +{
>> +	struct rte_pmu_event_group *group, *tmp_group;
>> +	struct rte_pmu_event *event, *tmp_event;
>> +
>> +	/* cleanup once init count drops to zero */
>> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
>> +		return;
>> +
>> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
>> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
>> +		free_event(event);
>> +	}
>> +
>> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
>> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
>> +		cleanup_events(group);
>> +	}
>> +
>> +	pmu_arch_fini();
>> +	free(rte_pmu.name);
>> +	rte_pmu.name = NULL;
>> +	rte_pmu.num_group_events = 0;
>> +}
>> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>> new file mode 100644
>> index 0000000000..6b664c3336
>> --- /dev/null
>> +++ b/lib/pmu/rte_pmu.h
>> @@ -0,0 +1,212 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2023 Marvell
>> + */
>> +
>> +#ifndef _RTE_PMU_H_
>> +#define _RTE_PMU_H_
>> +
>> +/**
>> + * @file
>> + *
>> + * PMU event tracing operations
>> + *
>> + * This file defines generic API and types necessary to setup PMU and
>> + * read selected counters in runtime.
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <linux/perf_event.h>
>> +
>> +#include <rte_atomic.h>
>> +#include <rte_branch_prediction.h>
>> +#include <rte_common.h>
>> +#include <rte_compat.h>
>> +#include <rte_spinlock.h>
>> +
>> +/** Maximum number of events in a group */
>> +#define MAX_NUM_GROUP_EVENTS 8
>> +
>> +/**
>> + * A structure describing a group of events.
>> + */
>> +struct rte_pmu_event_group {
>> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> +	bool enabled; /**< true if group was enabled on particular lcore */
>> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>> +} __rte_cache_aligned;
>> +
>> +/**
>> + * A structure describing an event.
>> + */
>> +struct rte_pmu_event {
>> +	char *name; /**< name of an event */
>> +	unsigned int index; /**< event index into fds/mmap_pages */
>> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
>> +};
>> +
>> +/**
>> + * A PMU state container.
>> + */
>> +struct rte_pmu {
>> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>> +	rte_spinlock_t lock; /**< serialize access to event group list */
>> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>> +	unsigned int num_group_events; /**< number of events in a group */
>> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>> +	unsigned int initialized; /**< initialization counter */
>> +};
>> +
>> +/** lcore event group */
>> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> +
>> +/** PMU state container */
>> +extern struct rte_pmu rte_pmu;
>> +
>> +/** Each architecture supporting PMU needs to provide its own version */
>> +#ifndef rte_pmu_pmc_read
>> +#define rte_pmu_pmc_read(index) ({ 0; })
>> +#endif
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Read PMU counter.
>> + *
>> + * @warning This should not be called directly.
>> + *
>> + * @param pc
>> + *   Pointer to the mmapped user page.
>> + * @return
>> + *   Counter value read from hardware.
>> + */
>> +static __rte_always_inline uint64_t
>> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
>> +	uint64_t width, offset;
>> +	uint32_t seq, index;
>> +	int64_t pmc;
>> +
>> +	for (;;) {
>> +		seq = pc->lock;
>> +		rte_compiler_barrier();
>
>Are you sure that compiler_barrier() is enough here?
>On some archs the CPU itself is free to re-order reads.
>Or am I missing something obvious here?
>

The barrier is only there to keep the compiler from caching stale
values in registers and to make sure that lock is really read twice.
CPU reordering won't do any harm here.
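
The read loop being discussed follows a sequence-counter pattern. A minimal sketch of it is given below, assuming GCC/Clang inline-asm syntax; struct demo_page, demo_write(), demo_read() and demo_compiler_barrier() are hypothetical names used only for illustration and are not part of the patch. The sketch shows only what a compiler barrier guarantees (fresh loads and two distinct reads of lock). It does not settle whether that is sufficient on weakly ordered CPUs, which is the question raised above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the perf_event_mmap_page fields used by the loop. */
struct demo_page {
	uint32_t lock;   /* sequence counter, bumped before and after each update */
	uint64_t offset; /* payload published by the writer */
};

/* Compiler-only barrier (GCC/Clang syntax): emits no CPU fence instruction. */
#define demo_compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Writer side, modelled on how a sequence counter is normally updated. */
static void
demo_write(struct demo_page *pc, uint64_t value)
{
	pc->lock++;
	demo_compiler_barrier();
	pc->offset = value;
	demo_compiler_barrier();
	pc->lock++;
}

/* Reader side: retry until lock is unchanged across the payload read. */
static uint64_t
demo_read(const struct demo_page *pc)
{
	uint64_t offset;
	uint32_t seq;

	for (;;) {
		seq = pc->lock;
		demo_compiler_barrier(); /* forces a fresh load of offset below */
		offset = pc->offset;
		demo_compiler_barrier(); /* forces lock to be re-read, not reused */
		if (pc->lock == seq)
			return offset;
	}
}
```

Without the two barriers the compiler would be allowed to load lock once and fold the comparison away, which is the "old stuff cached in registers" case.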

>> +		index = pc->index;
>> +		offset = pc->offset;
>> +		width = pc->pmc_width;
>> +
>> +		/* index set to 0 means that particular counter cannot be used */
>> +		if (likely(pc->cap_user_rdpmc && index)) {
>> +			pmc = rte_pmu_pmc_read(index - 1);
>> +			pmc <<= 64 - width;
>> +			pmc >>= 64 - width;
>> +			offset += pmc;
>> +		}
>> +
>> +		rte_compiler_barrier();
>> +
>> +		if (likely(pc->lock == seq))
>> +			return offset;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Enable group of events on the calling lcore.
>> + *
>> + * @warning This should not be called directly.
>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>> +__rte_experimental
>> +int
>> +__rte_pmu_enable_group(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Initialize PMU library.
>> + *
>> + * @warning This should not be called directly.
>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>> +__rte_experimental
>> +int
>> +rte_pmu_init(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>> + */
>> +__rte_experimental
>> +void
>> +rte_pmu_fini(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Add event to the group of enabled events.
>> + *
>> + * @param name
>> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>> + * @return
>> + *   Event index in case of success, negative value otherwise.
>> + */
>> +__rte_experimental
>> +int
>> +rte_pmu_add_event(const char *name);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Read hardware counter configured to count occurrences of an event.
>> + *
>> + * @param index
>> + *   Index of an event to be read.
>> + * @return
>> + *   Event value read from register. In case of errors or lack of support
>> + *   0 is returned. In other words, stream of zeros in a trace file
>> + *   indicates problem with reading particular PMU event register.
>> + */
>> +__rte_experimental
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read(unsigned int index)
>> +{
>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> +	int ret;
>> +
>> +	if (unlikely(!rte_pmu.initialized))
>> +		return 0;
>> +
>> +	if (unlikely(!group->enabled)) {
>> +		ret = __rte_pmu_enable_group();
>> +		if (ret)
>> +			return 0;
>> +	}
>> +
>> +	if (unlikely(index >= rte_pmu.num_group_events))
>> +		return 0;
>> +
>> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
>> +}
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_PMU_H_ */
>> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
>> new file mode 100644
>> index 0000000000..39a4f279c1
>> --- /dev/null
>> +++ b/lib/pmu/version.map
>> @@ -0,0 +1,15 @@
>> +DPDK_23 {
>> +	local: *;
>> +};
>> +
>> +EXPERIMENTAL {
>> +	global:
>> +
>> +	__rte_pmu_enable_group;
>> +	per_lcore__event_group;
>> +	rte_pmu;
>> +	rte_pmu_add_event;
>> +	rte_pmu_fini;
>> +	rte_pmu_init;
>> +	rte_pmu_read;
>> +};



* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-17  8:49                         ` [EXT] " Tomasz Duszynski
@ 2023-02-17 10:14                           ` Konstantin Ananyev
  2023-02-19 14:23                             ` Tomasz Duszynski
  2023-02-21 12:15                           ` Konstantin Ananyev
  1 sibling, 1 reply; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-17 10:14 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev



> >>
> >> This is especially useful in cases where CPU cores are isolated i.e
> >> run dedicated tasks. In such cases one cannot use standard perf
> >> utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >> ---
> >>   MAINTAINERS                            |   5 +
> >>   app/test/meson.build                   |   2 +
> >>   app/test/test_pmu.c                    |  62 ++++
> >>   doc/api/doxy-api-index.md              |   3 +-
> >>   doc/api/doxy-api.conf.in               |   1 +
> >>   doc/guides/prog_guide/profile_app.rst  |  12 +
> >>   doc/guides/rel_notes/release_23_03.rst |   7 +
> >>   lib/meson.build                        |   1 +
> >>   lib/pmu/meson.build                    |  13 +
> >>   lib/pmu/pmu_private.h                  |  32 ++
> >>   lib/pmu/rte_pmu.c                      | 460 +++++++++++++++++++++++++
> >>   lib/pmu/rte_pmu.h                      | 212 ++++++++++++
> >>   lib/pmu/version.map                    |  15 +
> >>   13 files changed, 824 insertions(+), 1 deletion(-)
> >>   create mode 100644 app/test/test_pmu.c
> >>   create mode 100644 lib/pmu/meson.build
> >>   create mode 100644 lib/pmu/pmu_private.h
> >>   create mode 100644 lib/pmu/rte_pmu.c
> >>   create mode 100644 lib/pmu/rte_pmu.h
> >>   create mode 100644 lib/pmu/version.map
> >>
> >> diff --git a/MAINTAINERS b/MAINTAINERS
> >> index 3495946d0f..d37f242120 100644
> >> --- a/MAINTAINERS
> >> +++ b/MAINTAINERS
> >> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
> >>   M: Pavan Nikhilesh <pbhagavatula@marvell.com>
> >>   F: lib/node/
> >>
> >> +PMU - EXPERIMENTAL
> >> +M: Tomasz Duszynski <tduszynski@marvell.com>
> >> +F: lib/pmu/
> >> +F: app/test/test_pmu*
> >> +
> >>
> >>   Test Applications
> >>   -----------------
> >> diff --git a/app/test/meson.build b/app/test/meson.build
> >> index f34d19e3c3..6b61b7fc32 100644
> >> --- a/app/test/meson.build
> >> +++ b/app/test/meson.build
> >> @@ -111,6 +111,7 @@ test_sources = files(
> >>           'test_reciprocal_division_perf.c',
> >>           'test_red.c',
> >>           'test_pie.c',
> >> +        'test_pmu.c',
> >>           'test_reorder.c',
> >>           'test_rib.c',
> >>           'test_rib6.c',
> >> @@ -239,6 +240,7 @@ fast_tests = [
> >>           ['kni_autotest', false, true],
> >>           ['kvargs_autotest', true, true],
> >>           ['member_autotest', true, true],
> >> +        ['pmu_autotest', true, true],
> >>           ['power_cpufreq_autotest', false, true],
> >>           ['power_autotest', true, true],
> >>           ['power_kvm_vm_autotest', false, true],
> >> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> >> new file mode 100644
> >> index 0000000000..c257638e8b
> >> --- /dev/null
> >> +++ b/app/test/test_pmu.c
> >> @@ -0,0 +1,62 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(C) 2023 Marvell International Ltd.
> >> + */
> >> +
> >> +#include "test.h"
> >> +
> >> +#ifndef RTE_EXEC_ENV_LINUX
> >> +
> >> +static int
> >> +test_pmu(void)
> >> +{
> >> +	printf("pmu_autotest only supported on Linux, skipping test\n");
> >> +	return TEST_SKIPPED;
> >> +}
> >> +
> >> +#else
> >> +
> >> +#include <rte_pmu.h>
> >> +
> >> +static int
> >> +test_pmu_read(void)
> >> +{
> >> +	const char *name = NULL;
> >> +	int tries = 10, event;
> >> +	uint64_t val = 0;
> >> +
> >> +	if (name == NULL) {
> >> +		printf("PMU not supported on this arch\n");
> >> +		return TEST_SKIPPED;
> >> +	}
> >> +
> >> +	if (rte_pmu_init() < 0)
> >> +		return TEST_SKIPPED;
> >> +
> >> +	event = rte_pmu_add_event(name);
> >> +	while (tries--)
> >> +		val += rte_pmu_read(event);
> >> +
> >> +	rte_pmu_fini();
> >> +
> >> +	return val ? TEST_SUCCESS : TEST_FAILED;
> >> +}
> >> +
> >> +static struct unit_test_suite pmu_tests = {
> >> +	.suite_name = "pmu autotest",
> >> +	.setup = NULL,
> >> +	.teardown = NULL,
> >> +	.unit_test_cases = {
> >> +		TEST_CASE(test_pmu_read),
> >> +		TEST_CASES_END()
> >> +	}
> >> +};
> >> +
> >> +static int
> >> +test_pmu(void)
> >> +{
> >> +	return unit_test_suite_runner(&pmu_tests);
> >> +}
> >> +
> >> +#endif /* RTE_EXEC_ENV_LINUX */
> >> +
> >> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> >> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> >> index 2deec7ea19..a8e04a195d 100644
> >> --- a/doc/api/doxy-api-index.md
> >> +++ b/doc/api/doxy-api-index.md
> >> @@ -223,7 +223,8 @@ The public API headers are grouped by topics:
> >>     [log](@ref rte_log.h),
> >>     [errno](@ref rte_errno.h),
> >>     [trace](@ref rte_trace.h),
> >> -  [trace_point](@ref rte_trace_point.h)
> >> +  [trace_point](@ref rte_trace_point.h),
> >> +  [pmu](@ref rte_pmu.h)
> >>
> >>   - **misc**:
> >>     [EAL config](@ref rte_eal.h),
> >> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
> >> index e859426099..350b5a8c94 100644
> >> --- a/doc/api/doxy-api.conf.in
> >> +++ b/doc/api/doxy-api.conf.in
> >> @@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
> >>                             @TOPDIR@/lib/pci \
> >>                             @TOPDIR@/lib/pdump \
> >>                             @TOPDIR@/lib/pipeline \
> >> +                          @TOPDIR@/lib/pmu \
> >>                             @TOPDIR@/lib/port \
> >>                             @TOPDIR@/lib/power \
> >>                             @TOPDIR@/lib/rawdev \
> >> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> >> index 14292d4c25..89e38cd301 100644
> >> --- a/doc/guides/prog_guide/profile_app.rst
> >> +++ b/doc/guides/prog_guide/profile_app.rst
> >> @@ -7,6 +7,18 @@ Profile Your Application
> >>   The following sections describe methods of profiling DPDK applications on
> >>   different architectures.
> >>
> >> +Performance counter based profiling
> >> +-----------------------------------
> >> +
> >> +Majority of architectures support some performance monitoring unit (PMU).
> >> +Such unit provides programmable counters that monitor specific events.
> >> +
> >> +Different tools gather that information, like for example perf.
> >> +However, in some scenarios when CPU cores are isolated and run
> >> +dedicated tasks interrupting those tasks with perf may be undesirable.
> >> +
> >> +In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
> >> +
> >>
> >>   Profiling on x86
> >>   ----------------
> >> diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
> >> index ab998a5357..20622efe58 100644
> >> --- a/doc/guides/rel_notes/release_23_03.rst
> >> +++ b/doc/guides/rel_notes/release_23_03.rst
> >> @@ -147,6 +147,13 @@ New Features
> >>     * Added support to capture packets at each graph node with packet metadata and
> >>       node name.
> >>
> >> +* **Added PMU library.**
> >> +
> >> +  Added a new performance monitoring unit (PMU) library which allows applications
> >> +  to perform self monitoring activities without depending on external utilities like perf.
> >> +  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
> >> +  can be stored in CTF format for further analysis.
> >> +
> >>
> >>   Removed Items
> >>   -------------
> >> diff --git a/lib/meson.build b/lib/meson.build
> >> index 450c061d2b..8a42d45d20 100644
> >> --- a/lib/meson.build
> >> +++ b/lib/meson.build
> >> @@ -11,6 +11,7 @@
> >>   libraries = [
> >>           'kvargs', # eal depends on kvargs
> >>           'telemetry', # basic info querying
> >> +        'pmu',
> >>           'eal', # everything depends on eal
> >>           'ring',
> >>           'rcu', # rcu depends on ring
> >> diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
> >> new file mode 100644
> >> index 0000000000..a4160b494e
> >> --- /dev/null
> >> +++ b/lib/pmu/meson.build
> >> @@ -0,0 +1,13 @@
> >> +# SPDX-License-Identifier: BSD-3-Clause
> >> +# Copyright(C) 2023 Marvell International Ltd.
> >> +
> >> +if not is_linux
> >> +    build = false
> >> +    reason = 'only supported on Linux'
> >> +    subdir_done()
> >> +endif
> >> +
> >> +includes = [global_inc]
> >> +
> >> +sources = files('rte_pmu.c')
> >> +headers = files('rte_pmu.h')
> >> diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
> >> new file mode 100644
> >> index 0000000000..b9f8c1ddc8
> >> --- /dev/null
> >> +++ b/lib/pmu/pmu_private.h
> >> @@ -0,0 +1,32 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2023 Marvell
> >> + */
> >> +
> >> +#ifndef _PMU_PRIVATE_H_
> >> +#define _PMU_PRIVATE_H_
> >> +
> >> +/**
> >> + * Architecture specific PMU init callback.
> >> + *
> >> + * @return
> >> + *   0 in case of success, negative value otherwise.
> >> + */
> >> +int
> >> +pmu_arch_init(void);
> >> +
> >> +/**
> >> + * Architecture specific PMU cleanup callback.
> >> + */
> >> +void
> >> +pmu_arch_fini(void);
> >> +
> >> +/**
> >> + * Apply architecture specific settings to config before passing it to syscall.
> >> + *
> >> + * @param config
> >> + *   Architecture specific event configuration. Consult kernel sources for available options.
> >> + */
> >> +void
> >> +pmu_arch_fixup_config(uint64_t config[3]);
> >> +
> >> +#endif /* _PMU_PRIVATE_H_ */
> >> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
> >> new file mode 100644
> >> index 0000000000..950f999cb7
> >> --- /dev/null
> >> +++ b/lib/pmu/rte_pmu.c
> >> @@ -0,0 +1,460 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(C) 2023 Marvell International Ltd.
> >> + */
> >> +
> >> +#include <ctype.h>
> >> +#include <dirent.h>
> >> +#include <errno.h>
> >> +#include <regex.h>
> >> +#include <stdlib.h>
> >> +#include <string.h>
> >> +#include <sys/ioctl.h>
> >> +#include <sys/mman.h>
> >> +#include <sys/queue.h>
> >> +#include <sys/syscall.h>
> >> +#include <unistd.h>
> >> +
> >> +#include <rte_atomic.h>
> >> +#include <rte_per_lcore.h>
> >> +#include <rte_pmu.h>
> >> +#include <rte_spinlock.h>
> >> +#include <rte_tailq.h>
> >> +
> >> +#include "pmu_private.h"
> >> +
> >> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> >
> >
> >I suppose that path (like the whole implementation) is Linux specific?
> >If so, wouldn't it make sense to have it under a linux subdir?
> >
> 
> There are currently no plans to support this elsewhere, so a flat
> directory structure is good enough.
> 
> >> +
> >> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> >> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> >> +
> >> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >> +struct rte_pmu rte_pmu;
> >
> >Do we really need struct declaration here?
> >
> 
> What’s the problem with this placement precisely?

Not a big deal, but it seems excessive to me.
As I understand it, you already include the whole .h just above,
which contains the definition of that struct anyway.
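
For reference, the general C distinction behind this exchange can be sketched as below. This is only an illustration with hypothetical names (struct demo_state, demo_init()), collapsed into one translation unit; it does not take a side on how the patch should be laid out. An extern line in a header declares the object without reserving storage, while the plain line in a source file is the definition that provides it.

```c
#include <assert.h>

/* Hypothetical type standing in for struct rte_pmu. */
struct demo_state {
	unsigned int initialized;
};

/* Header side: a declaration. It names the object but reserves no storage. */
extern struct demo_state demo_state;

/* Any code that sees the declaration can already refer to the object. */
static unsigned int
demo_init(void)
{
	return ++demo_state.initialized;
}

/* Source side: the definition. Exactly one translation unit must provide it. */
struct demo_state demo_state;
```

Dropping the last line would leave only the declaration, and linking would then fail with an undefined reference to demo_state.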
 
> 
> >
> >> +/*
> >> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> >> + * necessary.
> >> + */
> >> +
> >> +int
> >> +__rte_weak pmu_arch_init(void)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +void
> >> +__rte_weak pmu_arch_fini(void)
> >> +{
> >> +}
> >> +
> >> +void
> >> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3]) { }
> >> +
> >> +static int
> >> +get_term_format(const char *name, int *num, uint64_t *mask) {
> >> +	char path[PATH_MAX];
> >> +	char *config = NULL;
> >> +	int high, low, ret;
> >> +	FILE *fp;
> >> +
> >> +	*num = *mask = 0;
> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
> >> +	fp = fopen(path, "r");
> >> +	if (fp == NULL)
> >> +		return -errno;
> >> +
> >> +	errno = 0;
> >> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> >> +	if (ret < 2) {
> >> +		ret = -ENODATA;
> >> +		goto out;
> >> +	}
> >> +	if (errno) {
> >> +		ret = -errno;
> >> +		goto out;
> >> +	}
> >> +
> >> +	if (ret == 2)
> >> +		high = low;
> >> +
> >> +	*mask = GENMASK_ULL(high, low);
> >> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> >> +	*num = config[strlen(config) - 1];
> >> +	*num = isdigit(*num) ? *num - '0' : 0;
> >> +
> >> +	ret = 0;
> >> +out:
> >> +	free(config);
> >> +	fclose(fp);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +parse_event(char *buf, uint64_t config[3]) {
> >> +	char *token, *term;
> >> +	int num, ret, val;
> >> +	uint64_t mask;
> >> +
> >> +	config[0] = config[1] = config[2] = 0;
> >> +
> >> +	token = strtok(buf, ",");
> >> +	while (token) {
> >> +		errno = 0;
> >> +		/* <term>=<value> */
> >> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> >> +		if (ret < 1)
> >> +			return -ENODATA;
> >> +		if (errno)
> >> +			return -errno;
> >> +		if (ret == 1)
> >> +			val = 1;
> >> +
> >> +		ret = get_term_format(term, &num, &mask);
> >> +		free(term);
> >> +		if (ret)
> >> +			return ret;
> >> +
> >> +		config[num] |= FIELD_PREP(mask, val);
> >> +		token = strtok(NULL, ",");
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int
> >> +get_event_config(const char *name, uint64_t config[3]) {
> >> +	char path[PATH_MAX], buf[BUFSIZ];
> >> +	FILE *fp;
> >> +	int ret;
> >> +
> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> >> +	fp = fopen(path, "r");
> >> +	if (fp == NULL)
> >> +		return -errno;
> >> +
> >> +	ret = fread(buf, 1, sizeof(buf), fp);
> >> +	if (ret == 0) {
> >> +		fclose(fp);
> >> +
> >> +		return -EINVAL;
> >> +	}
> >> +	fclose(fp);
> >> +	buf[ret] = '\0';
> >> +
> >> +	return parse_event(buf, config);
> >> +}
> >> +
> >> +static int
> >> +do_perf_event_open(uint64_t config[3], int group_fd) {
> >> +	struct perf_event_attr attr = {
> >> +		.size = sizeof(struct perf_event_attr),
> >> +		.type = PERF_TYPE_RAW,
> >> +		.exclude_kernel = 1,
> >> +		.exclude_hv = 1,
> >> +		.disabled = 1,
> >> +	};
> >> +
> >> +	pmu_arch_fixup_config(config);
> >> +
> >> +	attr.config = config[0];
> >> +	attr.config1 = config[1];
> >> +	attr.config2 = config[2];
> >> +
> >> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0); }
> >> +
> >> +static int
> >> +open_events(struct rte_pmu_event_group *group) {
> >> +	struct rte_pmu_event *event;
> >> +	uint64_t config[3];
> >> +	int num = 0, ret;
> >> +
> >> +	/* group leader gets created first, with fd = -1 */
> >> +	group->fds[0] = -1;
> >> +
> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> >> +		ret = get_event_config(event->name, config);
> >> +		if (ret)
> >> +			continue;
> >> +
> >> +		ret = do_perf_event_open(config, group->fds[0]);
> >> +		if (ret == -1) {
> >> +			ret = -errno;
> >> +			goto out;
> >> +		}
> >> +
> >> +		group->fds[event->index] = ret;
> >> +		num++;
> >> +	}
> >> +
> >> +	return 0;
> >> +out:
> >> +	for (--num; num >= 0; num--) {
> >> +		close(group->fds[num]);
> >> +		group->fds[num] = -1;
> >> +	}
> >> +
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +mmap_events(struct rte_pmu_event_group *group) {
> >> +	long page_size = sysconf(_SC_PAGE_SIZE);
> >> +	unsigned int i;
> >> +	void *addr;
> >> +	int ret;
> >> +
> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> >> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
> >> +		if (addr == MAP_FAILED) {
> >> +			ret = -errno;
> >> +			goto out;
> >> +		}
> >> +
> >> +		group->mmap_pages[i] = addr;
> >> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
> >> +			ret = -EPERM;
> >> +			goto out;
> >> +		}
> >> +	}
> >> +
> >> +	return 0;
> >> +out:
> >> +	for (; i; i--) {
> >> +		munmap(group->mmap_pages[i - 1], page_size);
> >> +		group->mmap_pages[i - 1] = NULL;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static void
> >> +cleanup_events(struct rte_pmu_event_group *group) {
> >> +	unsigned int i;
> >> +
> >> +	if (group->fds[0] != -1)
> >> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> >> +
> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> >> +		if (group->mmap_pages[i]) {
> >> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
> >> +			group->mmap_pages[i] = NULL;
> >> +		}
> >> +
> >> +		if (group->fds[i] != -1) {
> >> +			close(group->fds[i]);
> >> +			group->fds[i] = -1;
> >> +		}
> >> +	}
> >> +
> >> +	group->enabled = false;
> >> +}
> >> +
> >> +int
> >> +__rte_pmu_enable_group(void)
> >> +{
> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> >> +	int ret;
> >> +
> >> +	if (rte_pmu.num_group_events == 0)
> >> +		return -ENODEV;
> >> +
> >> +	ret = open_events(group);
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	ret = mmap_events(group);
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> >> +		ret = -errno;
> >> +		goto out;
> >> +	}
> >> +
> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> >> +		ret = -errno;
> >> +		goto out;
> >> +	}
> >> +
> >> +	rte_spinlock_lock(&rte_pmu.lock);
> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> >> +	rte_spinlock_unlock(&rte_pmu.lock);
> >> +	group->enabled = true;
> >> +
> >> +	return 0;
> >> +
> >> +out:
> >> +	cleanup_events(group);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +scan_pmus(void)
> >> +{
> >> +	char path[PATH_MAX];
> >> +	struct dirent *dent;
> >> +	const char *name;
> >> +	DIR *dirp;
> >> +
> >> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> >> +	if (dirp == NULL)
> >> +		return -errno;
> >> +
> >> +	while ((dent = readdir(dirp))) {
> >> +		name = dent->d_name;
> >> +		if (name[0] == '.')
> >> +			continue;
> >> +
> >> +		/* sysfs entry should either contain cpus or be a cpu */
> >> +		if (!strcmp(name, "cpu"))
> >> +			break;
> >> +
> >> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> >> +		if (access(path, F_OK) == 0)
> >> +			break;
> >> +	}
> >> +
> >> +	if (dent) {
> >> +		rte_pmu.name = strdup(name);
> >> +		if (rte_pmu.name == NULL) {
> >> +			closedir(dirp);
> >> +
> >> +			return -ENOMEM;
> >> +		}
> >> +	}
> >> +
> >> +	closedir(dirp);
> >> +
> >> +	return rte_pmu.name ? 0 : -ENODEV;
> >> +}
> >> +
> >> +static struct rte_pmu_event *
> >> +new_event(const char *name)
> >> +{
> >> +	struct rte_pmu_event *event;
> >> +
> >> +	event = calloc(1, sizeof(*event));
> >> +	if (event == NULL)
> >> +		goto out;
> >> +
> >> +	event->name = strdup(name);
> >> +	if (event->name == NULL) {
> >> +		free(event);
> >> +		event = NULL;
> >> +	}
> >> +
> >> +out:
> >> +	return event;
> >> +}
> >> +
> >> +static void
> >> +free_event(struct rte_pmu_event *event) {
> >> +	free(event->name);
> >> +	free(event);
> >> +}
> >> +
> >> +int
> >> +rte_pmu_add_event(const char *name)
> >> +{
> >> +	struct rte_pmu_event *event;
> >> +	char path[PATH_MAX];
> >> +
> >> +	if (rte_pmu.name == NULL)
> >> +		return -ENODEV;
> >> +
> >> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
> >> +		return -ENOSPC;
> >> +
> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> >> +	if (access(path, R_OK))
> >> +		return -ENODEV;
> >> +
> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> >> +		if (!strcmp(event->name, name))
> >> +			return event->index;
> >> +		continue;
> >> +	}
> >> +
> >> +	event = new_event(name);
> >> +	if (event == NULL)
> >> +		return -ENOMEM;
> >> +
> >> +	event->index = rte_pmu.num_group_events++;
> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
> >> +
> >> +	return event->index;
> >> +}
> >> +
> >> +int
> >> +rte_pmu_init(void)
> >> +{
> >> +	int ret;
> >> +
> >> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
> >> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
> >> +	 * via command line but application doesn't care enough and performs init/fini again.
> >> +	 */
> >> +	if (rte_pmu.initialized != 0) {
> >> +		rte_pmu.initialized++;
> >> +		return 0;
> >> +	}
> >> +
> >> +	ret = scan_pmus();
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	ret = pmu_arch_init();
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	TAILQ_INIT(&rte_pmu.event_list);
> >> +	TAILQ_INIT(&rte_pmu.event_group_list);
> >> +	rte_spinlock_init(&rte_pmu.lock);
> >> +	rte_pmu.initialized = 1;
> >> +
> >> +	return 0;
> >> +out:
> >> +	free(rte_pmu.name);
> >> +	rte_pmu.name = NULL;
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +void
> >> +rte_pmu_fini(void)
> >> +{
> >> +	struct rte_pmu_event_group *group, *tmp_group;
> >> +	struct rte_pmu_event *event, *tmp_event;
> >> +
> >> +	/* cleanup once init count drops to zero */
> >> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
> >> +		return;
> >> +
> >> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
> >> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
> >> +		free_event(event);
> >> +	}
> >> +
> >> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
> >> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
> >> +		cleanup_events(group);
> >> +	}
> >> +
> >> +	pmu_arch_fini();
> >> +	free(rte_pmu.name);
> >> +	rte_pmu.name = NULL;
> >> +	rte_pmu.num_group_events = 0;
> >> +}
> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> >> new file mode 100644
> >> index 0000000000..6b664c3336
> >> --- /dev/null
> >> +++ b/lib/pmu/rte_pmu.h
> >> @@ -0,0 +1,212 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2023 Marvell
> >> + */
> >> +
> >> +#ifndef _RTE_PMU_H_
> >> +#define _RTE_PMU_H_
> >> +
> >> +/**
> >> + * @file
> >> + *
> >> + * PMU event tracing operations
> >> + *
> >> + * This file defines generic API and types necessary to setup PMU and
> >> + * read selected counters in runtime.
> >> + */
> >> +
> >> +#ifdef __cplusplus
> >> +extern "C" {
> >> +#endif
> >> +
> >> +#include <linux/perf_event.h>
> >> +
> >> +#include <rte_atomic.h>
> >> +#include <rte_branch_prediction.h>
> >> +#include <rte_common.h>
> >> +#include <rte_compat.h>
> >> +#include <rte_spinlock.h>
> >> +
> >> +/** Maximum number of events in a group */
> >> +#define MAX_NUM_GROUP_EVENTS 8
> >> +
> >> +/**
> >> + * A structure describing a group of events.
> >> + */
> >> +struct rte_pmu_event_group {
> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> >> +} __rte_cache_aligned;
> >> +
> >> +/**
> >> + * A structure describing an event.
> >> + */
> >> +struct rte_pmu_event {
> >> +	char *name; /**< name of an event */
> >> +	unsigned int index; /**< event index into fds/mmap_pages */
> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> >> +};
> >> +
> >> +/**
> >> + * A PMU state container.
> >> + */
> >> +struct rte_pmu {
> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> +	unsigned int initialized; /**< initialization counter */
> >> +};
> >> +
> >> +/** lcore event group */
> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >> +
> >> +/** PMU state container */
> >> +extern struct rte_pmu rte_pmu;
> >> +
> >> +/** Each architecture supporting PMU needs to provide its own version */
> >> +#ifndef rte_pmu_pmc_read
> >> +#define rte_pmu_pmc_read(index) ({ 0; })
> >> +#endif
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read PMU counter.
> >> + *
> >> + * @warning This should be not called directly.
> >> + *
> >> + * @param pc
> >> + *   Pointer to the mmapped user page.
> >> + * @return
> >> + *   Counter value read from hardware.
> >> + */
> >> +static __rte_always_inline uint64_t
> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> >> +	uint64_t width, offset;
> >> +	uint32_t seq, index;
> >> +	int64_t pmc;
> >> +
> >> +	for (;;) {
> >> +		seq = pc->lock;
> >> +		rte_compiler_barrier();
> >
> >Are you sure that compiler_barrier() is enough here?
> >On some archs CPU itself has freedom to re-order reads.
> >Or I am missing something obvious here?
> >
> 
> It's a matter of not keeping old stuff cached in registers
> and making sure that we have two reads of lock. CPU reordering
> won't do any harm here.

Sorry, I didn't get you here:
Suppose the CPU re-orders the reads and reads lock *after* the index or offset value.
Wouldn't that mean index and/or offset can contain old/invalid values?
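
The concern can be made concrete with a small sketch. Assuming, purely for illustration, that the writer of the page may run on a different CPU than the reader, a sequence-counter read would need acquire ordering rather than only a compiler barrier. `demo_page` and `demo_read_offset` are hypothetical names, not part of the patch or of the kernel ABI; `index` would be handled the same way as `offset`:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields of the user page used below. */
struct demo_page {
	_Atomic uint32_t lock;  /* sequence counter, changed around each update */
	_Atomic int64_t offset; /* payload protected by the sequence counter */
};

/* Read offset consistently, retrying if lock changed during the read. */
static int64_t
demo_read_offset(struct demo_page *pc)
{
	uint32_t seq;
	int64_t offset;

	assert(pc != NULL);

	for (;;) {
		/* acquire load: the payload load below cannot be hoisted above it */
		seq = atomic_load_explicit(&pc->lock, memory_order_acquire);
		offset = atomic_load_explicit(&pc->offset, memory_order_relaxed);
		/* acquire fence: the payload load cannot sink below the re-check */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&pc->lock, memory_order_relaxed) == seq)
			return offset;
	}
}
```

If the page is only ever updated from the CPU on which the reading thread runs, the hardware-ordering part is moot and a compiler barrier is enough; which of the two cases applies is what the question above comes down to.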
 
> 
> >> +		index = pc->index;
> >> +		offset = pc->offset;
> >> +		width = pc->pmc_width;
> >> +
> >> +		/* index set to 0 means that particular counter cannot be used */
> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> +			pmc <<= 64 - width;
> >> +			pmc >>= 64 - width;
> >> +			offset += pmc;
> >> +		}
> >> +
> >> +		rte_compiler_barrier();
> >> +
> >> +		if (likely(pc->lock == seq))
> >> +			return offset;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Enable group of events on the calling lcore.
> >> + *
> >> + * @warning This should be not called directly.
> >> + *
> >> + * @return
> >> + *   0 in case of success, negative value otherwise.
> >> + */
> >> +__rte_experimental
> >> +int
> >> +__rte_pmu_enable_group(void);
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Initialize PMU library.
> >> + *
> >> + * @warning This should be not called directly.
> >> + *
> >> + * @return
> >> + *   0 in case of success, negative value otherwise.
> >> + */
> >> +__rte_experimental
> >> +int
> >> +rte_pmu_init(void);
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
> >> + */
> >> +__rte_experimental
> >> +void
> >> +rte_pmu_fini(void);
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Add event to the group of enabled events.
> >> + *
> >> + * @param name
> >> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> >> + * @return
> >> + *   Event index in case of success, negative value otherwise.
> >> + */
> >> +__rte_experimental
> >> +int
> >> +rte_pmu_add_event(const char *name);
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read hardware counter configured to count occurrences of an event.
> >> + *
> >> + * @param index
> >> + *   Index of an event to be read.
> >> + * @return
> >> + *   Event value read from register. In case of errors or lack of support
> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> + *   indicates problem with reading particular PMU event register.
> >> + */
> >> +__rte_experimental
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read(unsigned int index)
> >> +{
> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> >> +	int ret;
> >> +
> >> +	if (unlikely(!rte_pmu.initialized))
> >> +		return 0;
> >> +
> >> +	if (unlikely(!group->enabled)) {
> >> +		ret = __rte_pmu_enable_group();
> >> +		if (ret)
> >> +			return 0;
> >> +	}
> >> +
> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> +		return 0;
> >> +
> >> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
> >> +}
> >> +
> >> +#ifdef __cplusplus
> >> +}
> >> +#endif
> >> +
> >> +#endif /* _RTE_PMU_H_ */
> >> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
> >> new file mode 100644
> >> index 0000000000..39a4f279c1
> >> --- /dev/null
> >> +++ b/lib/pmu/version.map
> >> @@ -0,0 +1,15 @@
> >> +DPDK_23 {
> >> +	local: *;
> >> +};
> >> +
> >> +EXPERIMENTAL {
> >> +	global:
> >> +
> >> +	__rte_pmu_enable_group;
> >> +	per_lcore__event_group;
> >> +	rte_pmu;
> >> +	rte_pmu_add_event;
> >> +	rte_pmu_fini;
> >> +	rte_pmu_init;
> >> +	rte_pmu_read;
> >> +};


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-17 10:14                           ` Konstantin Ananyev
@ 2023-02-19 14:23                             ` Tomasz Duszynski
  2023-02-20 14:31                               ` Konstantin Ananyev
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-19 14:23 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev



>-----Original Message-----
>From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>Sent: Friday, February 17, 2023 11:15 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>;
>dev@dpdk.org
>Subject: RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
>
>
>
>> >>
>> >> This is especially useful in cases where CPU cores are isolated i.e
>> >> run dedicated tasks. In such cases one cannot use standard perf
>> >> utility without sacrificing latency and performance.
>> >>
>> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> >> ---
>> >>   MAINTAINERS                            |   5 +
>> >>   app/test/meson.build                   |   2 +
>> >>   app/test/test_pmu.c                    |  62 ++++
>> >>   doc/api/doxy-api-index.md              |   3 +-
>> >>   doc/api/doxy-api.conf.in               |   1 +
>> >>   doc/guides/prog_guide/profile_app.rst  |  12 +
>> >>   doc/guides/rel_notes/release_23_03.rst |   7 +
>> >>   lib/meson.build                        |   1 +
>> >>   lib/pmu/meson.build                    |  13 +
>> >>   lib/pmu/pmu_private.h                  |  32 ++
>> >>   lib/pmu/rte_pmu.c                      | 460 +++++++++++++++++++++++++
>> >>   lib/pmu/rte_pmu.h                      | 212 ++++++++++++
>> >>   lib/pmu/version.map                    |  15 +
>> >>   13 files changed, 824 insertions(+), 1 deletion(-)
>> >>   create mode 100644 app/test/test_pmu.c
>> >>   create mode 100644 lib/pmu/meson.build
>> >>   create mode 100644 lib/pmu/pmu_private.h
>> >>   create mode 100644 lib/pmu/rte_pmu.c
>> >>   create mode 100644 lib/pmu/rte_pmu.h
>> >>   create mode 100644 lib/pmu/version.map
>> >>
>> >> diff --git a/MAINTAINERS b/MAINTAINERS
>> >> index 3495946d0f..d37f242120 100644
>> >> --- a/MAINTAINERS
>> >> +++ b/MAINTAINERS
>> >> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>> >>   M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>> >>   F: lib/node/
>> >>
>> >> +PMU - EXPERIMENTAL
>> >> +M: Tomasz Duszynski <tduszynski@marvell.com>
>> >> +F: lib/pmu/
>> >> +F: app/test/test_pmu*
>> >> +
>> >>
>> >>   Test Applications
>> >>   -----------------
>> >> diff --git a/app/test/meson.build b/app/test/meson.build
>> >> index f34d19e3c3..6b61b7fc32 100644
>> >> --- a/app/test/meson.build
>> >> +++ b/app/test/meson.build
>> >> @@ -111,6 +111,7 @@ test_sources = files(
>> >>           'test_reciprocal_division_perf.c',
>> >>           'test_red.c',
>> >>           'test_pie.c',
>> >> +        'test_pmu.c',
>> >>           'test_reorder.c',
>> >>           'test_rib.c',
>> >>           'test_rib6.c',
>> >> @@ -239,6 +240,7 @@ fast_tests = [
>> >>           ['kni_autotest', false, true],
>> >>           ['kvargs_autotest', true, true],
>> >>           ['member_autotest', true, true],
>> >> +        ['pmu_autotest', true, true],
>> >>           ['power_cpufreq_autotest', false, true],
>> >>           ['power_autotest', true, true],
>> >>           ['power_kvm_vm_autotest', false, true],
>> >> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
>> >> new file mode 100644
>> >> index 0000000000..c257638e8b
>> >> --- /dev/null
>> >> +++ b/app/test/test_pmu.c
>> >> @@ -0,0 +1,62 @@
>> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> + * Copyright(C) 2023 Marvell International Ltd.
>> >> + */
>> >> +
>> >> +#include "test.h"
>> >> +
>> >> +#ifndef RTE_EXEC_ENV_LINUX
>> >> +
>> >> +static int
>> >> +test_pmu(void)
>> >> +{
>> >> +	printf("pmu_autotest only supported on Linux, skipping test\n");
>> >> +	return TEST_SKIPPED;
>> >> +}
>> >> +
>> >> +#else
>> >> +
>> >> +#include <rte_pmu.h>
>> >> +
>> >> +static int
>> >> +test_pmu_read(void)
>> >> +{
>> >> +	const char *name = NULL;
>> >> +	int tries = 10, event;
>> >> +	uint64_t val = 0;
>> >> +
>> >> +	if (name == NULL) {
>> >> +		printf("PMU not supported on this arch\n");
>> >> +		return TEST_SKIPPED;
>> >> +	}
>> >> +
>> >> +	if (rte_pmu_init() < 0)
>> >> +		return TEST_SKIPPED;
>> >> +
>> >> +	event = rte_pmu_add_event(name);
>> >> +	while (tries--)
>> >> +		val += rte_pmu_read(event);
>> >> +
>> >> +	rte_pmu_fini();
>> >> +
>> >> +	return val ? TEST_SUCCESS : TEST_FAILED;
>> >> +}
>> >> +
>> >> +static struct unit_test_suite pmu_tests = {
>> >> +	.suite_name = "pmu autotest",
>> >> +	.setup = NULL,
>> >> +	.teardown = NULL,
>> >> +	.unit_test_cases = {
>> >> +		TEST_CASE(test_pmu_read),
>> >> +		TEST_CASES_END()
>> >> +	}
>> >> +};
>> >> +
>> >> +static int
>> >> +test_pmu(void)
>> >> +{
>> >> +	return unit_test_suite_runner(&pmu_tests);
>> >> +}
>> >> +
>> >> +#endif /* RTE_EXEC_ENV_LINUX */
>> >> +
>> >> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
>> >> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> >> index 2deec7ea19..a8e04a195d 100644
>> >> --- a/doc/api/doxy-api-index.md
>> >> +++ b/doc/api/doxy-api-index.md
>> >> @@ -223,7 +223,8 @@ The public API headers are grouped by topics:
>> >>     [log](@ref rte_log.h),
>> >>     [errno](@ref rte_errno.h),
>> >>     [trace](@ref rte_trace.h),
>> >> -  [trace_point](@ref rte_trace_point.h)
>> >> +  [trace_point](@ref rte_trace_point.h),
>> >> +  [pmu](@ref rte_pmu.h)
>> >>
>> >>   - **misc**:
>> >>     [EAL config](@ref rte_eal.h),
>> >> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
>> >> index e859426099..350b5a8c94 100644
>> >> --- a/doc/api/doxy-api.conf.in
>> >> +++ b/doc/api/doxy-api.conf.in
>> >> @@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
>> >>                             @TOPDIR@/lib/pci \
>> >>                             @TOPDIR@/lib/pdump \
>> >>                             @TOPDIR@/lib/pipeline \
>> >> +                          @TOPDIR@/lib/pmu \
>> >>                             @TOPDIR@/lib/port \
>> >>                             @TOPDIR@/lib/power \
>> >>                             @TOPDIR@/lib/rawdev \
>> >> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
>> >> index 14292d4c25..89e38cd301 100644
>> >> --- a/doc/guides/prog_guide/profile_app.rst
>> >> +++ b/doc/guides/prog_guide/profile_app.rst
>> >> @@ -7,6 +7,18 @@ Profile Your Application
>> >>   The following sections describe methods of profiling DPDK applications on
>> >>   different architectures.
>> >>
>> >> +Performance counter based profiling
>> >> +-----------------------------------
>> >> +
>> >> +Majority of architectures support some performance monitoring unit (PMU).
>> >> +Such unit provides programmable counters that monitor specific events.
>> >> +
>> >> +Different tools gather that information, like for example perf.
>> >> +However, in some scenarios when CPU cores are isolated and run
>> >> +dedicated tasks interrupting those tasks with perf may be undesirable.
>> >> +
>> >> +In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
>> >> +
>> >>
>> >>   Profiling on x86
>> >>   ----------------
>> >> diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
>> >> index ab998a5357..20622efe58 100644
>> >> --- a/doc/guides/rel_notes/release_23_03.rst
>> >> +++ b/doc/guides/rel_notes/release_23_03.rst
>> >> @@ -147,6 +147,13 @@ New Features
>> >>     * Added support to capture packets at each graph node with packet metadata and
>> >>       node name.
>> >>
>> >> +* **Added PMU library.**
>> >> +
>> >> +  Added a new performance monitoring unit (PMU) library which allows applications
>> >> +  to perform self monitoring activities without depending on external utilities like perf.
>> >> +  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
>> >> +  can be stored in CTF format for further analysis.
>> >> +
>> >>
>> >>   Removed Items
>> >>   -------------
>> >> diff --git a/lib/meson.build b/lib/meson.build
>> >> index 450c061d2b..8a42d45d20 100644
>> >> --- a/lib/meson.build
>> >> +++ b/lib/meson.build
>> >> @@ -11,6 +11,7 @@
>> >>   libraries = [
>> >>           'kvargs', # eal depends on kvargs
>> >>           'telemetry', # basic info querying
>> >> +        'pmu',
>> >>           'eal', # everything depends on eal
>> >>           'ring',
>> >>           'rcu', # rcu depends on ring
>> >> diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
>> >> new file mode 100644
>> >> index 0000000000..a4160b494e
>> >> --- /dev/null
>> >> +++ b/lib/pmu/meson.build
>> >> @@ -0,0 +1,13 @@
>> >> +# SPDX-License-Identifier: BSD-3-Clause
>> >> +# Copyright(C) 2023 Marvell International Ltd.
>> >> +
>> >> +if not is_linux
>> >> +    build = false
>> >> +    reason = 'only supported on Linux'
>> >> +    subdir_done()
>> >> +endif
>> >> +
>> >> +includes = [global_inc]
>> >> +
>> >> +sources = files('rte_pmu.c')
>> >> +headers = files('rte_pmu.h')
>> >> diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
>> >> new file mode 100644
>> >> index 0000000000..b9f8c1ddc8
>> >> --- /dev/null
>> >> +++ b/lib/pmu/pmu_private.h
>> >> @@ -0,0 +1,32 @@
>> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> + * Copyright(c) 2023 Marvell
>> >> + */
>> >> +
>> >> +#ifndef _PMU_PRIVATE_H_
>> >> +#define _PMU_PRIVATE_H_
>> >> +
>> >> +/**
>> >> + * Architecture specific PMU init callback.
>> >> + *
>> >> + * @return
>> >> + *   0 in case of success, negative value otherwise.
>> >> + */
>> >> +int
>> >> +pmu_arch_init(void);
>> >> +
>> >> +/**
>> >> + * Architecture specific PMU cleanup callback.
>> >> + */
>> >> +void
>> >> +pmu_arch_fini(void);
>> >> +
>> >> +/**
>> >> + * Apply architecture specific settings to config before passing it to syscall.
>> >> + *
>> >> + * @param config
>> >> + *   Architecture specific event configuration. Consult kernel sources for available options.
>> >> + */
>> >> +void
>> >> +pmu_arch_fixup_config(uint64_t config[3]);
>> >> +
>> >> +#endif /* _PMU_PRIVATE_H_ */
>> >> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
>> >> new file mode 100644
>> >> index 0000000000..950f999cb7
>> >> --- /dev/null
>> >> +++ b/lib/pmu/rte_pmu.c
>> >> @@ -0,0 +1,460 @@
>> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> + * Copyright(C) 2023 Marvell International Ltd.
>> >> + */
>> >> +
>> >> +#include <ctype.h>
>> >> +#include <dirent.h>
>> >> +#include <errno.h>
>> >> +#include <regex.h>
>> >> +#include <stdlib.h>
>> >> +#include <string.h>
>> >> +#include <sys/ioctl.h>
>> >> +#include <sys/mman.h>
>> >> +#include <sys/queue.h>
>> >> +#include <sys/syscall.h>
>> >> +#include <unistd.h>
>> >> +
>> >> +#include <rte_atomic.h>
>> >> +#include <rte_per_lcore.h>
>> >> +#include <rte_pmu.h>
>> >> +#include <rte_spinlock.h>
>> >> +#include <rte_tailq.h>
>> >> +
>> >> +#include "pmu_private.h"
>> >> +
>> >> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
>> >
>> >
>> >I suppose that path (like the whole implementation) is Linux specific?
>> >If so, wouldn't it make sense to have it under a linux subdir?
>> >
>>
>> There are currently no plans to support it elsewhere, so a flat
>> directory structure is good enough.
>>
>> >> +
>> >> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
>> >> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
>> >> +
>> >> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> >> +struct rte_pmu rte_pmu;
>> >
>> >Do we really need struct declaration here?
>> >
>>
>> What’s the problem with this placement precisely?
>
>Not a big deal, but it seems excessive to me.
>As I understand it, you already include, just above, the whole .h that contains the definition of that
>struct anyway.
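
For readers following this exchange: in C, an `extern` line in a header only declares an object, so exactly one source file still has to define it. The sketch below shows both pieces side by side, under the assumption that this is the pattern in play here; `demo_state` and `demo_state_init` are hypothetical names and not part of the patch:

```c
#include <assert.h>
#include <stddef.h>

/* --- would live in a header (hypothetical demo_state.h) --- */
struct demo_state {
	unsigned int initialized; /* init counter, zero until first init */
};

/* Declaration only: names the object but allocates no storage. */
extern struct demo_state demo_state;

/* --- would live in exactly one .c file --- */
/* Definition: this is the line that actually creates the object. */
struct demo_state demo_state;

/* Bump and return the init counter of the given state object. */
static unsigned int
demo_state_init(struct demo_state *st)
{
	assert(st != NULL);
	return ++st->initialized;
}
```

Dropping the definition while keeping the `extern` declaration would still compile, but any use of `demo_state` would then fail at link time with an undefined reference.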
>
>>
>> >
>> >> +/*
>> >> + * Following __rte_weak functions provide default no-op. Architectures should override them if
>> >> + * necessary.
>> >> + */
>> >> +
>> >> +int
>> >> +__rte_weak pmu_arch_init(void)
>> >> +{
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +void
>> >> +__rte_weak pmu_arch_fini(void)
>> >> +{
>> >> +}
>> >> +
>> >> +void
>> >> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
>> >> +{
>> >> +}
>> >> +
>> >> +static int
>> >> +get_term_format(const char *name, int *num, uint64_t *mask) {
>> >> +	char path[PATH_MAX];
>> >> +	char *config = NULL;
>> >> +	int high, low, ret;
>> >> +	FILE *fp;
>> >> +
>> >> +	*num = *mask = 0;
>> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
>> >> +	fp = fopen(path, "r");
>> >> +	if (fp == NULL)
>> >> +		return -errno;
>> >> +
>> >> +	errno = 0;
>> >> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
>> >> +	if (ret < 2) {
>> >> +		ret = -ENODATA;
>> >> +		goto out;
>> >> +	}
>> >> +	if (errno) {
>> >> +		ret = -errno;
>> >> +		goto out;
>> >> +	}
>> >> +
>> >> +	if (ret == 2)
>> >> +		high = low;
>> >> +
>> >> +	*mask = GENMASK_ULL(high, low);
>> >> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
>> >> +	*num = config[strlen(config) - 1];
>> >> +	*num = isdigit(*num) ? *num - '0' : 0;
>> >> +
>> >> +	ret = 0;
>> >> +out:
>> >> +	free(config);
>> >> +	fclose(fp);
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static int
>> >> +parse_event(char *buf, uint64_t config[3]) {
>> >> +	char *token, *term;
>> >> +	int num, ret, val;
>> >> +	uint64_t mask;
>> >> +
>> >> +	config[0] = config[1] = config[2] = 0;
>> >> +
>> >> +	token = strtok(buf, ",");
>> >> +	while (token) {
>> >> +		errno = 0;
>> >> +		/* <term>=<value> */
>> >> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
>> >> +		if (ret < 1)
>> >> +			return -ENODATA;
>> >> +		if (errno)
>> >> +			return -errno;
>> >> +		if (ret == 1)
>> >> +			val = 1;
>> >> +
>> >> +		ret = get_term_format(term, &num, &mask);
>> >> +		free(term);
>> >> +		if (ret)
>> >> +			return ret;
>> >> +
>> >> +		config[num] |= FIELD_PREP(mask, val);
>> >> +		token = strtok(NULL, ",");
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +static int
>> >> +get_event_config(const char *name, uint64_t config[3]) {
>> >> +	char path[PATH_MAX], buf[BUFSIZ];
>> >> +	FILE *fp;
>> >> +	int ret;
>> >> +
>> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> >> +	fp = fopen(path, "r");
>> >> +	if (fp == NULL)
>> >> +		return -errno;
>> >> +
>> >> +	ret = fread(buf, 1, sizeof(buf), fp);
>> >> +	if (ret == 0) {
>> >> +		fclose(fp);
>> >> +
>> >> +		return -EINVAL;
>> >> +	}
>> >> +	fclose(fp);
>> >> +	buf[ret] = '\0';
>> >> +
>> >> +	return parse_event(buf, config);
>> >> +}
>> >> +
>> >> +static int
>> >> +do_perf_event_open(uint64_t config[3], int group_fd) {
>> >> +	struct perf_event_attr attr = {
>> >> +		.size = sizeof(struct perf_event_attr),
>> >> +		.type = PERF_TYPE_RAW,
>> >> +		.exclude_kernel = 1,
>> >> +		.exclude_hv = 1,
>> >> +		.disabled = 1,
>> >> +	};
>> >> +
>> >> +	pmu_arch_fixup_config(config);
>> >> +
>> >> +	attr.config = config[0];
>> >> +	attr.config1 = config[1];
>> >> +	attr.config2 = config[2];
>> >> +
>> >> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
>> >> +}
>> >> +
>> >> +static int
>> >> +open_events(struct rte_pmu_event_group *group) {
>> >> +	struct rte_pmu_event *event;
>> >> +	uint64_t config[3];
>> >> +	int num = 0, ret;
>> >> +
>> >> +	/* group leader gets created first, with fd = -1 */
>> >> +	group->fds[0] = -1;
>> >> +
>> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> >> +		ret = get_event_config(event->name, config);
>> >> +		if (ret)
>> >> +			continue;
>> >> +
>> >> +		ret = do_perf_event_open(config, group->fds[0]);
>> >> +		if (ret == -1) {
>> >> +			ret = -errno;
>> >> +			goto out;
>> >> +		}
>> >> +
>> >> +		group->fds[event->index] = ret;
>> >> +		num++;
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +out:
>> >> +	for (--num; num >= 0; num--) {
>> >> +		close(group->fds[num]);
>> >> +		group->fds[num] = -1;
>> >> +	}
>> >> +
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static int
>> >> +mmap_events(struct rte_pmu_event_group *group) {
>> >> +	long page_size = sysconf(_SC_PAGE_SIZE);
>> >> +	unsigned int i;
>> >> +	void *addr;
>> >> +	int ret;
>> >> +
>> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> >> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
>> >> +		if (addr == MAP_FAILED) {
>> >> +			ret = -errno;
>> >> +			goto out;
>> >> +		}
>> >> +
>> >> +		group->mmap_pages[i] = addr;
>> >> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
>> >> +			ret = -EPERM;
>> >> +			goto out;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +out:
>> >> +	for (; i; i--) {
>> >> +		munmap(group->mmap_pages[i - 1], page_size);
>> >> +		group->mmap_pages[i - 1] = NULL;
>> >> +	}
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static void
>> >> +cleanup_events(struct rte_pmu_event_group *group) {
>> >> +	unsigned int i;
>> >> +
>> >> +	if (group->fds[0] != -1)
>> >> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
>> >> +
>> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> >> +		if (group->mmap_pages[i]) {
>> >> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
>> >> +			group->mmap_pages[i] = NULL;
>> >> +		}
>> >> +
>> >> +		if (group->fds[i] != -1) {
>> >> +			close(group->fds[i]);
>> >> +			group->fds[i] = -1;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	group->enabled = false;
>> >> +}
>> >> +
>> >> +int
>> >> +__rte_pmu_enable_group(void)
>> >> +{
>> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> >> +	int ret;
>> >> +
>> >> +	if (rte_pmu.num_group_events == 0)
>> >> +		return -ENODEV;
>> >> +
>> >> +	ret = open_events(group);
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	ret = mmap_events(group);
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
>> >> +		ret = -errno;
>> >> +		goto out;
>> >> +	}
>> >> +
>> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>> >> +		ret = -errno;
>> >> +		goto out;
>> >> +	}
>> >> +
>> >> +	rte_spinlock_lock(&rte_pmu.lock);
>> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
>> >> +	rte_spinlock_unlock(&rte_pmu.lock);
>> >> +	group->enabled = true;
>> >> +
>> >> +	return 0;
>> >> +
>> >> +out:
>> >> +	cleanup_events(group);
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static int
>> >> +scan_pmus(void)
>> >> +{
>> >> +	char path[PATH_MAX];
>> >> +	struct dirent *dent;
>> >> +	const char *name;
>> >> +	DIR *dirp;
>> >> +
>> >> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
>> >> +	if (dirp == NULL)
>> >> +		return -errno;
>> >> +
>> >> +	while ((dent = readdir(dirp))) {
>> >> +		name = dent->d_name;
>> >> +		if (name[0] == '.')
>> >> +			continue;
>> >> +
>> >> +		/* sysfs entry should either contain cpus or be a cpu */
>> >> +		if (!strcmp(name, "cpu"))
>> >> +			break;
>> >> +
>> >> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
>> >> +		if (access(path, F_OK) == 0)
>> >> +			break;
>> >> +	}
>> >> +
>> >> +	if (dent) {
>> >> +		rte_pmu.name = strdup(name);
>> >> +		if (rte_pmu.name == NULL) {
>> >> +			closedir(dirp);
>> >> +
>> >> +			return -ENOMEM;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	closedir(dirp);
>> >> +
>> >> +	return rte_pmu.name ? 0 : -ENODEV;
>> >> +}
>> >> +
>> >> +static struct rte_pmu_event *
>> >> +new_event(const char *name)
>> >> +{
>> >> +	struct rte_pmu_event *event;
>> >> +
>> >> +	event = calloc(1, sizeof(*event));
>> >> +	if (event == NULL)
>> >> +		goto out;
>> >> +
>> >> +	event->name = strdup(name);
>> >> +	if (event->name == NULL) {
>> >> +		free(event);
>> >> +		event = NULL;
>> >> +	}
>> >> +
>> >> +out:
>> >> +	return event;
>> >> +}
>> >> +
>> >> +static void
>> >> +free_event(struct rte_pmu_event *event)
>> >> +{
>> >> +	free(event->name);
>> >> +	free(event);
>> >> +}
>> >> +
>> >> +int
>> >> +rte_pmu_add_event(const char *name)
>> >> +{
>> >> +	struct rte_pmu_event *event;
>> >> +	char path[PATH_MAX];
>> >> +
>> >> +	if (rte_pmu.name == NULL)
>> >> +		return -ENODEV;
>> >> +
>> >> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
>> >> +		return -ENOSPC;
>> >> +
>> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> >> +	if (access(path, R_OK))
>> >> +		return -ENODEV;
>> >> +
>> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> >> +		if (!strcmp(event->name, name))
>> >> +			return event->index;
>> >> +		continue;
>> >> +	}
>> >> +
>> >> +	event = new_event(name);
>> >> +	if (event == NULL)
>> >> +		return -ENOMEM;
>> >> +
>> >> +	event->index = rte_pmu.num_group_events++;
>> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
>> >> +
>> >> +	return event->index;
>> >> +}
>> >> +
>> >> +int
>> >> +rte_pmu_init(void)
>> >> +{
>> >> +	int ret;
>> >> +
>> >> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
>> >> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
>> >> +	 * via command line but application doesn't care enough and performs init/fini again.
>> >> +	 */
>> >> +	if (rte_pmu.initialized != 0) {
>> >> +		rte_pmu.initialized++;
>> >> +		return 0;
>> >> +	}
>> >> +
>> >> +	ret = scan_pmus();
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	ret = pmu_arch_init();
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	TAILQ_INIT(&rte_pmu.event_list);
>> >> +	TAILQ_INIT(&rte_pmu.event_group_list);
>> >> +	rte_spinlock_init(&rte_pmu.lock);
>> >> +	rte_pmu.initialized = 1;
>> >> +
>> >> +	return 0;
>> >> +out:
>> >> +	free(rte_pmu.name);
>> >> +	rte_pmu.name = NULL;
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +void
>> >> +rte_pmu_fini(void)
>> >> +{
>> >> +	struct rte_pmu_event_group *group, *tmp_group;
>> >> +	struct rte_pmu_event *event, *tmp_event;
>> >> +
>> >> +	/* cleanup once init count drops to zero */
>> >> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
>> >> +		return;
>> >> +
>> >> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
>> >> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
>> >> +		free_event(event);
>> >> +	}
>> >> +
>> >> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
>> >> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
>> >> +		cleanup_events(group);
>> >> +	}
>> >> +
>> >> +	pmu_arch_fini();
>> >> +	free(rte_pmu.name);
>> >> +	rte_pmu.name = NULL;
>> >> +	rte_pmu.num_group_events = 0;
>> >> +}
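
The init/fini pair above is reference counted: only the first rte_pmu_init() sets things up and only the last rte_pmu_fini() tears them down. A minimal standalone sketch of that pattern follows. demo_init(), demo_fini(), init_count and resources_live are made-up names for illustration, not part of the patch, and like the quoted code the sketch assumes all calls come from a single thread.

```c
#include <assert.h>

/* Hypothetical stand-ins, not taken from the patch. */
static unsigned int init_count;   /* how many init calls are outstanding */
static int resources_live;        /* 1 while the "library" state exists */

static int
demo_init(void)
{
	/* nested init: just take another reference */
	if (init_count != 0) {
		init_count++;
		return 0;
	}

	resources_live = 1;           /* first caller does the real setup */
	init_count = 1;

	return 0;
}

static void
demo_fini(void)
{
	/* tear down only when the last reference is dropped */
	if (init_count == 0 || --init_count != 0)
		return;

	resources_live = 0;
}
```

An unbalanced extra demo_fini() is a no-op here, which matches the early return in the quoted rte_pmu_fini().
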
>> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>> >> new file mode 100644
>> >> index 0000000000..6b664c3336
>> >> --- /dev/null
>> >> +++ b/lib/pmu/rte_pmu.h
>> >> @@ -0,0 +1,212 @@
>> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> + * Copyright(c) 2023 Marvell
>> >> + */
>> >> +
>> >> +#ifndef _RTE_PMU_H_
>> >> +#define _RTE_PMU_H_
>> >> +
>> >> +/**
>> >> + * @file
>> >> + *
>> >> + * PMU event tracing operations
>> >> + *
>> >> + * This file defines generic API and types necessary to setup PMU and
>> >> + * read selected counters in runtime.
>> >> + */
>> >> +
>> >> +#ifdef __cplusplus
>> >> +extern "C" {
>> >> +#endif
>> >> +
>> >> +#include <linux/perf_event.h>
>> >> +
>> >> +#include <rte_atomic.h>
>> >> +#include <rte_branch_prediction.h>
>> >> +#include <rte_common.h>
>> >> +#include <rte_compat.h>
>> >> +#include <rte_spinlock.h>
>> >> +
>> >> +/** Maximum number of events in a group */
>> >> +#define MAX_NUM_GROUP_EVENTS 8
>> >> +
>> >> +/**
>> >> + * A structure describing a group of events.
>> >> + */
>> >> +struct rte_pmu_event_group {
>> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> >> +	bool enabled; /**< true if group was enabled on particular lcore */
>> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>> >> +} __rte_cache_aligned;
>> >> +
>> >> +/**
>> >> + * A structure describing an event.
>> >> + */
>> >> +struct rte_pmu_event {
>> >> +	char *name; /**< name of an event */
>> >> +	unsigned int index; /**< event index into fds/mmap_pages */
>> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
>> >> +};
>> >> +
>> >> +/**
>> >> + * A PMU state container.
>> >> + */
>> >> +struct rte_pmu {
>> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
>> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>> >> +	unsigned int num_group_events; /**< number of events in a group */
>> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>> >> +	unsigned int initialized; /**< initialization counter */
>> >> +};
>> >> +
>> >> +/** lcore event group */
>> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> >> +
>> >> +/** PMU state container */
>> >> +extern struct rte_pmu rte_pmu;
>> >> +
>> >> +/** Each architecture supporting PMU needs to provide its own version */
>> >> +#ifndef rte_pmu_pmc_read
>> >> +#define rte_pmu_pmc_read(index) ({ 0; })
>> >> +#endif
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Read PMU counter.
>> >> + *
>> >> + * @warning This should be not called directly.
>> >> + *
>> >> + * @param pc
>> >> + *   Pointer to the mmapped user page.
>> >> + * @return
>> >> + *   Counter value read from hardware.
>> >> + */
>> >> +static __rte_always_inline uint64_t
>> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> >> +{
>> >> +	uint64_t width, offset;
>> >> +	uint32_t seq, index;
>> >> +	int64_t pmc;
>> >> +
>> >> +	for (;;) {
>> >> +		seq = pc->lock;
>> >> +		rte_compiler_barrier();
>> >
>> >Are you sure that compiler_barrier() is enough here?
>> >On some archs CPU itself has freedom to re-order reads.
>> >Or I am missing something obvious here?
>> >
>>
>> It's a matter of not keeping old stuff cached in registers and making
>> sure that we have two reads of lock. CPU reordering won't do any harm
>> here.
>
>Sorry, I didn't get you here:
>Suppose CPU will re-order reads and will read lock *after* index or offset value.
>Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
>

This number is just an indicator of whether the kernel changed something or not.
If CPU reordering comes into play, it will not change anything from the point of view of this loop.
All we want is fresh data when needed and no compiler involvement when it comes to reordering
code.

>>
>> >> +		index = pc->index;
>> >> +		offset = pc->offset;
>> >> +		width = pc->pmc_width;
>> >> +
>> >> +		/* index set to 0 means that particular counter cannot be used */
>> >> +		if (likely(pc->cap_user_rdpmc && index)) {
>> >> +			pmc = rte_pmu_pmc_read(index - 1);
>> >> +			pmc <<= 64 - width;
>> >> +			pmc >>= 64 - width;
>> >> +			offset += pmc;
>> >> +		}
>> >> +
>> >> +		rte_compiler_barrier();
>> >> +
>> >> +		if (likely(pc->lock == seq))
>> >> +			return offset;
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +}
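
The two shifts above sign-extend the raw counter value from pmc_width bits to 64 bits before it is added to offset. Below is a small standalone sketch of the same arithmetic. apply_counter() is a hypothetical helper, not part of the patch. The left shift is done on an unsigned value here to stay clear of signed-overflow rules, and the right shift assumes the compiler implements an arithmetic shift for signed types, as GCC and Clang do.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: combine the kernel-provided offset with a raw
 * hardware counter that is only 'width' bits wide.
 */
static uint64_t
apply_counter(uint64_t offset, uint64_t raw, unsigned int width)
{
	int64_t pmc;

	assert(width >= 1 && width <= 64);

	/* drop bits above 'width', leaving the counter's top bit in bit 63 */
	pmc = (int64_t)(raw << (64 - width));
	/* arithmetic shift back replicates that bit (sign extension) */
	pmc >>= 64 - width;

	return offset + (uint64_t)pmc;
}
```

With a 48-bit counter, a raw value of all ones is treated as -1, so an offset of 100 yields 99, while stray bits above bit 47 are simply discarded.
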
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Enable group of events on the calling lcore.
>> >> + *
>> >> + * @warning This should be not called directly.
>> >> + *
>> >> + * @return
>> >> + *   0 in case of success, negative value otherwise.
>> >> + */
>> >> +__rte_experimental
>> >> +int
>> >> +__rte_pmu_enable_group(void);
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Initialize PMU library.
>> >> + *
>> >> + * @warning This should be not called directly.
>> >> + *
>> >> + * @return
>> >> + *   0 in case of success, negative value otherwise.
>> >> + */
>> >> +__rte_experimental
>> >> +int
>> >> +rte_pmu_init(void);
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>> >> + */
>> >> +__rte_experimental
>> >> +void
>> >> +rte_pmu_fini(void);
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Add event to the group of enabled events.
>> >> + *
>> >> + * @param name
>> >> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>> >> + * @return
>> >> + *   Event index in case of success, negative value otherwise.
>> >> + */
>> >> +__rte_experimental
>> >> +int
>> >> +rte_pmu_add_event(const char *name);
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Read hardware counter configured to count occurrences of an event.
>> >> + *
>> >> + * @param index
>> >> + *   Index of an event to be read.
>> >> + * @return
>> >> + *   Event value read from register. In case of errors or lack of support
>> >> + *   0 is returned. In other words, stream of zeros in a trace file
>> >> + *   indicates problem with reading particular PMU event register.
>> >> + */
>> >> +__rte_experimental
>> >> +static __rte_always_inline uint64_t
>> >> +rte_pmu_read(unsigned int index)
>> >> +{
>> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> >> +	int ret;
>> >> +
>> >> +	if (unlikely(!rte_pmu.initialized))
>> >> +		return 0;
>> >> +
>> >> +	if (unlikely(!group->enabled)) {
>> >> +		ret = __rte_pmu_enable_group();
>> >> +		if (ret)
>> >> +			return 0;
>> >> +	}
>> >> +
>> >> +	if (unlikely(index >= rte_pmu.num_group_events))
>> >> +		return 0;
>> >> +
>> >> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
>> >> +}
>> >> +
>> >> +#ifdef __cplusplus
>> >> +}
>> >> +#endif
>> >> +
>> >> +#endif /* _RTE_PMU_H_ */
>> >> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
>> >> new file mode 100644
>> >> index 0000000000..39a4f279c1
>> >> --- /dev/null
>> >> +++ b/lib/pmu/version.map
>> >> @@ -0,0 +1,15 @@
>> >> +DPDK_23 {
>> >> +	local: *;
>> >> +};
>> >> +
>> >> +EXPERIMENTAL {
>> >> +	global:
>> >> +
>> >> +	__rte_pmu_enable_group;
>> >> +	per_lcore__event_group;
>> >> +	rte_pmu;
>> >> +	rte_pmu_add_event;
>> >> +	rte_pmu_fini;
>> >> +	rte_pmu_init;
>> >> +	rte_pmu_read;
>> >> +};



* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-19 14:23                             ` Tomasz Duszynski
@ 2023-02-20 14:31                               ` Konstantin Ananyev
  2023-02-20 16:59                                 ` Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-20 14:31 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev

> >> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> >> >> new file mode 100644
> >> >> index 0000000000..6b664c3336
> >> >> --- /dev/null
> >> >> +++ b/lib/pmu/rte_pmu.h
> >> >> @@ -0,0 +1,212 @@
> >> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> >> + * Copyright(c) 2023 Marvell
> >> >> + */
> >> >> +
> >> >> +#ifndef _RTE_PMU_H_
> >> >> +#define _RTE_PMU_H_
> >> >> +
> >> >> +/**
> >> >> + * @file
> >> >> + *
> >> >> + * PMU event tracing operations
> >> >> + *
> >> >> + * This file defines generic API and types necessary to setup PMU and
> >> >> + * read selected counters in runtime.
> >> >> + */
> >> >> +
> >> >> +#ifdef __cplusplus
> >> >> +extern "C" {
> >> >> +#endif
> >> >> +
> >> >> +#include <linux/perf_event.h>
> >> >> +
> >> >> +#include <rte_atomic.h>
> >> >> +#include <rte_branch_prediction.h>
> >> >> +#include <rte_common.h>
> >> >> +#include <rte_compat.h>
> >> >> +#include <rte_spinlock.h>
> >> >> +
> >> >> +/** Maximum number of events in a group */
> >> >> +#define MAX_NUM_GROUP_EVENTS 8
> >> >> +
> >> >> +/**
> >> >> + * A structure describing a group of events.
> >> >> + */
> >> >> +struct rte_pmu_event_group {
> >> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> >> >> +} __rte_cache_aligned;
> >> >> +
> >> >> +/**
> >> >> + * A structure describing an event.
> >> >> + */
> >> >> +struct rte_pmu_event {
> >> >> +	char *name; /**< name of an event */
> >> >> +	unsigned int index; /**< event index into fds/mmap_pages */
> >> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> >> >> +};
> >> >> +
> >> >> +/**
> >> >> + * A PMU state container.
> >> >> + */
> >> >> +struct rte_pmu {
> >> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> >> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
> >> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> >> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> >> +	unsigned int initialized; /**< initialization counter */
> >> >> +};
> >> >> +
> >> >> +/** lcore event group */
> >> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >> >> +
> >> >> +/** PMU state container */
> >> >> +extern struct rte_pmu rte_pmu;
> >> >> +
> >> >> +/** Each architecture supporting PMU needs to provide its own version */
> >> >> +#ifndef rte_pmu_pmc_read
> >> >> +#define rte_pmu_pmc_read(index) ({ 0; })
> >> >> +#endif
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Read PMU counter.
> >> >> + *
> >> >> + * @warning This should be not called directly.
> >> >> + *
> >> >> + * @param pc
> >> >> + *   Pointer to the mmapped user page.
> >> >> + * @return
> >> >> + *   Counter value read from hardware.
> >> >> + */
> >> >> +static __rte_always_inline uint64_t
> >> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> >> +{
> >> >> +	uint64_t width, offset;
> >> >> +	uint32_t seq, index;
> >> >> +	int64_t pmc;
> >> >> +
> >> >> +	for (;;) {
> >> >> +		seq = pc->lock;
> >> >> +		rte_compiler_barrier();
> >> >
> >> >Are you sure that compiler_barrier() is enough here?
> >> >On some archs CPU itself has freedom to re-order reads.
> >> >Or I am missing something obvious here?
> >> >
> >>
> >> It's a matter of not keeping old stuff cached in registers and making
> >> sure that we have two reads of lock. CPU reordering won't do any harm
> >> here.
> >
> >Sorry, I didn't get you here:
> >Suppose CPU will re-order reads and will read lock *after* index or offset value.
> >Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
> >
> 
> This number is just an indicator whether kernel did change something or not.
 
You are talking about pc->lock, right?
Yes, I do understand that it is a sort of seqlock.
That's why I am puzzled that we do not care about possible CPU read reordering.
The manual for perf_event_open() also has a code snippet with a compiler barrier only...

> If cpu reordering will come into play then this will not change anything from pov of this loop.
> All we want is fresh data when needed and no involvement of compiler when it comes to reordering
> code.

Ok, can you explain to me why the following could not happen:
T0:
      pc->lock==0; pc->index==I1; pc->offset==O1;
T1:
      cpu #0 read pmu (due to cpu read reorder, we get the index value before the lock):
      index=pc->index; // index==I1
T2:
      cpu #1 kernel perf_event_update_userpage:
      pc->lock++; // pc->lock==1
      pc->index=I2;
      pc->offset=O2;
      ...
      pc->lock++; // pc->lock==2
T3:
      cpu #0 continue with read pmu:
      seq=pc->lock; // seq==2
      offset=pc->offset; // offset==O2
      ...
      pmc = rte_pmu_pmc_read(index - 1); // note that we read at I1, not I2
      offset += pmc; // offset == O2 + pmcread(I1-1)
      if (pc->lock == seq) // they are equal, return
            return offset;

Or can it happen, but for some reason we don't care much?
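
For comparison, this is roughly what a reader would look like if the writer really could run concurrently on another CPU, which is the situation the scenario above assumes. It is only a sketch in C11 atomics: struct fake_userpage, fake_update() and fake_read() are invented for illustration and do not mirror the kernel or the patch. The acquire fence placed before the second read of lock is what keeps the data loads from being reordered past it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the few user page fields used here. */
struct fake_userpage {
	atomic_uint lock;       /* sequence counter, odd while an update is in flight */
	atomic_uint index;      /* hardware counter index + 1, 0 if unusable */
	atomic_ullong offset;   /* value to add to the raw counter */
};

/* Single writer: bump lock, publish the new fields, bump lock again. */
static void
fake_update(struct fake_userpage *pc, uint32_t index, uint64_t offset)
{
	unsigned int seq = atomic_load_explicit(&pc->lock, memory_order_relaxed);

	atomic_store_explicit(&pc->lock, seq + 1, memory_order_relaxed);
	/* keep the data stores from becoming visible before lock is odd */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&pc->index, index, memory_order_relaxed);
	atomic_store_explicit(&pc->offset, offset, memory_order_relaxed);
	atomic_store_explicit(&pc->lock, seq + 2, memory_order_release);
}

/* Reader: retry until index and offset come from one stable generation. */
static uint64_t
fake_read(struct fake_userpage *pc, uint32_t *index_out)
{
	unsigned int seq;
	uint32_t index;
	uint64_t offset;

	do {
		seq = atomic_load_explicit(&pc->lock, memory_order_acquire);
		index = atomic_load_explicit(&pc->index, memory_order_relaxed);
		offset = atomic_load_explicit(&pc->offset, memory_order_relaxed);
		/* order the data loads before the re-check of lock */
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) ||
		 atomic_load_explicit(&pc->lock, memory_order_relaxed) != seq);

	*index_out = index;
	return offset;
}
```

With these fences the T1..T3 interleaving cannot return a mixed result: an index taken from the old generation forces the final lock check to see a changed value and retry.
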

> >>
> >> >> +		index = pc->index;
> >> >> +		offset = pc->offset;
> >> >> +		width = pc->pmc_width;
> >> >> +
> >> >> +		/* index set to 0 means that particular counter cannot be used */
> >> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> >> +			pmc <<= 64 - width;
> >> >> +			pmc >>= 64 - width;
> >> >> +			offset += pmc;
> >> >> +		}
> >> >> +
> >> >> +		rte_compiler_barrier();
> >> >> +
> >> >> +		if (likely(pc->lock == seq))
> >> >> +			return offset;
> >> >> +	}
> >> >> +
> >> >> +	return 0;
> >> >> +}
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Enable group of events on the calling lcore.
> >> >> + *
> >> >> + * @warning This should be not called directly.
> >> >> + *
> >> >> + * @return
> >> >> + *   0 in case of success, negative value otherwise.
> >> >> + */
> >> >> +__rte_experimental
> >> >> +int
> >> >> +__rte_pmu_enable_group(void);
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Initialize PMU library.
> >> >> + *
> >> >> + * @warning This should be not called directly.
> >> >> + *
> >> >> + * @return
> >> >> + *   0 in case of success, negative value otherwise.
> >> >> + */
> >> >> +__rte_experimental
> >> >> +int
> >> >> +rte_pmu_init(void);
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
> >> >> + */
> >> >> +__rte_experimental
> >> >> +void
> >> >> +rte_pmu_fini(void);
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Add event to the group of enabled events.
> >> >> + *
> >> >> + * @param name
> >> >> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> >> >> + * @return
> >> >> + *   Event index in case of success, negative value otherwise.
> >> >> + */
> >> >> +__rte_experimental
> >> >> +int
> >> >> +rte_pmu_add_event(const char *name);
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Read hardware counter configured to count occurrences of an event.
> >> >> + *
> >> >> + * @param index
> >> >> + *   Index of an event to be read.
> >> >> + * @return
> >> >> + *   Event value read from register. In case of errors or lack of support
> >> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> >> + *   indicates problem with reading particular PMU event register.
> >> >> + */

Another question - do we really need to have __rte_pmu_read_userpage()
and rte_pmu_read() as static inline functions in a public header?
As I understand it, because of that we also have to make the 'struct rte_pmu_*'
definitions public.
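
One way the alternative could look, purely as a sketch: the header exposes only an opaque type and out-of-line accessors, so the layout stays private at the cost of a function call per read. All names here (pmu_group, pmu_group_create() and so on) are hypothetical, and the stored values are placeholders rather than real counters.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Public header part (hypothetical): users see only an incomplete type. */
struct pmu_group;
struct pmu_group *pmu_group_create(unsigned int num_events);
uint64_t pmu_group_read(const struct pmu_group *g, unsigned int index);
void pmu_group_destroy(struct pmu_group *g);

/* Private part: this would live in the library's .c file. */
#define PMU_GROUP_MAX_EVENTS 8

struct pmu_group {
	uint64_t values[PMU_GROUP_MAX_EVENTS]; /* placeholder for fds/user pages */
	unsigned int num_events;
};

struct pmu_group *
pmu_group_create(unsigned int num_events)
{
	struct pmu_group *g;
	unsigned int i;

	if (num_events > PMU_GROUP_MAX_EVENTS)
		return NULL;

	g = calloc(1, sizeof(*g));
	if (g == NULL)
		return NULL;

	g->num_events = num_events;
	for (i = 0; i < num_events; i++)
		g->values[i] = 100 * (uint64_t)(i + 1); /* placeholder data */

	return g;
}

uint64_t
pmu_group_read(const struct pmu_group *g, unsigned int index)
{
	/* same convention as rte_pmu_read(): 0 on any error */
	if (g == NULL || index >= g->num_events)
		return 0;

	return g->values[index];
}

void
pmu_group_destroy(struct pmu_group *g)
{
	free(g);
}
```

The trade-off is the one raised above: the structures can change without touching the ABI, but the read path is no longer inlined into the caller.
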

> >> >> +__rte_experimental
> >> >> +static __rte_always_inline uint64_t
> >> >> +rte_pmu_read(unsigned int index)
> >> >> +{
> >> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> >> >> +	int ret;
> >> >> +
> >> >> +	if (unlikely(!rte_pmu.initialized))
> >> >> +		return 0;
> >> >> +
> >> >> +	if (unlikely(!group->enabled)) {
> >> >> +		ret = __rte_pmu_enable_group();
> >> >> +		if (ret)
> >> >> +			return 0;
> >> >> +	}
> >> >> +
> >> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> >> +		return 0;
> >> >> +
> >> >> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
> >> >> +}
> >> >> +
> >> >> +#ifdef __cplusplus
> >> >> +}
> >> >> +#endif
> >> >> +



* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-20 14:31                               ` Konstantin Ananyev
@ 2023-02-20 16:59                                 ` Tomasz Duszynski
  2023-02-20 17:21                                   ` Konstantin Ananyev
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-20 16:59 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev



>-----Original Message-----
>From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>Sent: Monday, February 20, 2023 3:31 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>;
>dev@dpdk.org
>Subject: RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
>
>> >> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>> >> >> new file mode 100644
>> >> >> index 0000000000..6b664c3336
>> >> >> --- /dev/null
>> >> >> +++ b/lib/pmu/rte_pmu.h
>> >> >> @@ -0,0 +1,212 @@
>> >> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> >> + * Copyright(c) 2023 Marvell
>> >> >> + */
>> >> >> +
>> >> >> +#ifndef _RTE_PMU_H_
>> >> >> +#define _RTE_PMU_H_
>> >> >> +
>> >> >> +/**
>> >> >> + * @file
>> >> >> + *
>> >> >> + * PMU event tracing operations
>> >> >> + *
>> >> >> + * This file defines generic API and types necessary to setup PMU and
>> >> >> + * read selected counters in runtime.
>> >> >> + */
>> >> >> +
>> >> >> +#ifdef __cplusplus
>> >> >> +extern "C" {
>> >> >> +#endif
>> >> >> +
>> >> >> +#include <linux/perf_event.h>
>> >> >> +
>> >> >> +#include <rte_atomic.h>
>> >> >> +#include <rte_branch_prediction.h>
>> >> >> +#include <rte_common.h>
>> >> >> +#include <rte_compat.h>
>> >> >> +#include <rte_spinlock.h>
>> >> >> +
>> >> >> +/** Maximum number of events in a group */
>> >> >> +#define MAX_NUM_GROUP_EVENTS 8
>> >> >> +
>> >> >> +/**
>> >> >> + * A structure describing a group of events.
>> >> >> + */
>> >> >> +struct rte_pmu_event_group {
>> >> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>> >> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> >> >> +	bool enabled; /**< true if group was enabled on particular lcore */
>> >> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>> >> >> +} __rte_cache_aligned;
>> >> >> +
>> >> >> +/**
>> >> >> + * A structure describing an event.
>> >> >> + */
>> >> >> +struct rte_pmu_event {
>> >> >> +	char *name; /**< name of an event */
>> >> >> +	unsigned int index; /**< event index into fds/mmap_pages */
>> >> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
>> >> >> +};
>> >> >> +
>> >> >> +/**
>> >> >> + * A PMU state container.
>> >> >> + */
>> >> >> +struct rte_pmu {
>> >> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>> >> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
>> >> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>> >> >> +	unsigned int num_group_events; /**< number of events in a group */
>> >> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>> >> >> +	unsigned int initialized; /**< initialization counter */
>> >> >> +};
>> >> >> +
>> >> >> +/** lcore event group */
>> >> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> >> >> +
>> >> >> +/** PMU state container */
>> >> >> +extern struct rte_pmu rte_pmu;
>> >> >> +
>> >> >> +/** Each architecture supporting PMU needs to provide its own version */
>> >> >> +#ifndef rte_pmu_pmc_read
>> >> >> +#define rte_pmu_pmc_read(index) ({ 0; })
>> >> >> +#endif
>> >> >> +
>> >> >> +/**
>> >> >> + * @warning
>> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> + *
>> >> >> + * Read PMU counter.
>> >> >> + *
>> >> >> + * @warning This should be not called directly.
>> >> >> + *
>> >> >> + * @param pc
>> >> >> + *   Pointer to the mmapped user page.
>> >> >> + * @return
>> >> >> + *   Counter value read from hardware.
>> >> >> + */
>> >> >> +static __rte_always_inline uint64_t
>> >> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> >> >> +{
>> >> >> +	uint64_t width, offset;
>> >> >> +	uint32_t seq, index;
>> >> >> +	int64_t pmc;
>> >> >> +
>> >> >> +	for (;;) {
>> >> >> +		seq = pc->lock;
>> >> >> +		rte_compiler_barrier();
>> >> >
>> >> >Are you sure that compiler_barrier() is enough here?
>> >> >On some archs CPU itself has freedom to re-order reads.
>> >> >Or I am missing something obvious here?
>> >> >
>> >>
>> >> It's a matter of not keeping old stuff cached in registers and
>> >> making sure that we have two reads of lock. CPU reordering won't do
>> >> any harm here.
>> >
>> >Sorry, I didn't get you here:
>> >Suppose CPU will re-order reads and will read lock *after* index or offset value.
>> >Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
>> >
>>
>> This number is just an indicator whether kernel did change something or not.
>
>You are talking about pc->lock, right?
>Yes, I do understand that it is sort of seqlock.
>That's why I am puzzled why we do not care about possible cpu read-reordering.
>Manual for perf_event_open() also has a code snippet with compiler barrier only...
>
>> If cpu reordering will come into play then this will not change anything from pov of this loop.
>> All we want is fresh data when needed and no involvement of compiler
>> when it comes to reordering code.
>
>Ok, can you probably explain to me why the following could not happen:
>T0:
>pc->lock==0; pc->index==I1; pc->offset==O1;
>T1:
>      cpu #0 read pmu (due to cpu read reorder, we get index value before seqlock):
>       index=pc->index;  //index==I1;
> T2:
>      cpu #1 kernel perf_event_update_userpage:
>      pc->lock++; // pc->lock==1
>      pc->index=I2;
>      pc->offset=O2;
>      ...
>      pc->lock++; //pc->lock==2
>T3:
>      cpu #0 continue with read pmu:
>      seq=pc->lock; //seq == 2
>       offset=pc->offset; // offset == O2
>       ....
>       pmc = rte_pmu_pmc_read(index - 1);  // Note that we read at I1, not I2
>       offset += pmc; //offset == O2 + pmcread(I1-1);
>       if (pc->lock == seq) // they are equal, return
>             return offset;
>
>Or, it can happen, but by some reason we don't care much?
>

This code does self-monitoring, and the user page (the whole group, actually) is per thread
running on the current CPU. Hence I am not sure what you are trying to prove with that example.
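
If the page is only ever updated on the CPU that is also reading it, the update behaves like an interrupt that lands between two of the reader's instructions, so only compiler reordering has to be prevented. Below is a sketch of that case, assuming exactly this premise holds. A signal handler stands in for the kernel, and atomic_signal_fence(), which compilers typically implement as a compiler-only barrier, stands in for rte_compiler_barrier(). The names page, update_handler() and read_offset() are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-thread user page. */
static struct {
	atomic_uint lock;       /* sequence counter */
	atomic_ullong offset;   /* accumulated counter value */
} page;

/*
 * Plays the role of the kernel: runs in the same thread as the reader,
 * in between its instructions, never concurrently with them.
 */
static void
update_handler(int sig)
{
	(void)sig;
	atomic_fetch_add_explicit(&page.lock, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);
	atomic_fetch_add_explicit(&page.offset, 10, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);
	atomic_fetch_add_explicit(&page.lock, 1, memory_order_relaxed);
}

/* Reader: same shape as the quoted loop, with compiler-level fences only. */
static uint64_t
read_offset(void)
{
	unsigned int seq;
	uint64_t offset;

	for (;;) {
		seq = atomic_load_explicit(&page.lock, memory_order_relaxed);
		atomic_signal_fence(memory_order_seq_cst);
		offset = atomic_load_explicit(&page.offset, memory_order_relaxed);
		atomic_signal_fence(memory_order_seq_cst);
		if (atomic_load_explicit(&page.lock, memory_order_relaxed) == seq)
			return offset;
	}
}
```

Whether the kernel's update of the perf user page always fits this model is precisely the point under discussion, so the sketch shows the argument rather than settling it.
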

>> >>
>> >> >> +		index = pc->index;
>> >> >> +		offset = pc->offset;
>> >> >> +		width = pc->pmc_width;
>> >> >> +
>> >> >> +		/* index set to 0 means that particular counter cannot be used */
>> >> >> +		if (likely(pc->cap_user_rdpmc && index)) {
>> >> >> +			pmc = rte_pmu_pmc_read(index - 1);
>> >> >> +			pmc <<= 64 - width;
>> >> >> +			pmc >>= 64 - width;
>> >> >> +			offset += pmc;
>> >> >> +		}
>> >> >> +
>> >> >> +		rte_compiler_barrier();
>> >> >> +
>> >> >> +		if (likely(pc->lock == seq))
>> >> >> +			return offset;
>> >> >> +	}
>> >> >> +
>> >> >> +	return 0;
>> >> >> +}
>> >> >> +
>> >> >> +/**
>> >> >> + * @warning
>> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> + *
>> >> >> + * Enable group of events on the calling lcore.
>> >> >> + *
>> >> >> + * @warning This should be not called directly.
>> >> >> + *
>> >> >> + * @return
>> >> >> + *   0 in case of success, negative value otherwise.
>> >> >> + */
>> >> >> +__rte_experimental
>> >> >> +int
>> >> >> +__rte_pmu_enable_group(void);
>> >> >> +
>> >> >> +/**
>> >> >> + * @warning
>> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> + *
>> >> >> + * Initialize PMU library.
>> >> >> + *
>> >> >> + * @warning This should be not called directly.
>> >> >> + *
>> >> >> + * @return
>> >> >> + *   0 in case of success, negative value otherwise.
>> >> >> + */
>> >> >> +__rte_experimental
>> >> >> +int
>> >> >> +rte_pmu_init(void);
>> >> >> +
>> >> >> +/**
>> >> >> + * @warning
>> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> + *
>> >> >> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>> >> >> + */
>> >> >> +__rte_experimental
>> >> >> +void
>> >> >> +rte_pmu_fini(void);
>> >> >> +
>> >> >> +/**
>> >> >> + * @warning
>> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> + *
>> >> >> + * Add event to the group of enabled events.
>> >> >> + *
>> >> >> + * @param name
>> >> >> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>> >> >> + * @return
>> >> >> + *   Event index in case of success, negative value otherwise.
>> >> >> + */
>> >> >> +__rte_experimental
>> >> >> +int
>> >> >> +rte_pmu_add_event(const char *name);
>> >> >> +
>> >> >> +/**
>> >> >> + * @warning
>> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> + *
>> >> >> + * Read hardware counter configured to count occurrences of an event.
>> >> >> + *
>> >> >> + * @param index
>> >> >> + *   Index of an event to be read.
>> >> >> + * @return
>> >> >> + *   Event value read from register. In case of errors or lack of support
>> >> >> + *   0 is returned. In other words, stream of zeros in a trace file
>> >> >> + *   indicates problem with reading particular PMU event register.
>> >> >> + */
>
>Another question - do we really need  to have __rte_pmu_read_userpage() and rte_pmu_read() as
>static inline functions in public header?
>As I understand, because of that we also have to make 'struct rte_pmu_*'
>definitions also public.
>
>> >> >> +__rte_experimental
>> >> >> +static __rte_always_inline uint64_t
>> >> >> +rte_pmu_read(unsigned int index)
>> >> >> +{
>> >> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> >> >> +	int ret;
>> >> >> +
>> >> >> +	if (unlikely(!rte_pmu.initialized))
>> >> >> +		return 0;
>> >> >> +
>> >> >> +	if (unlikely(!group->enabled)) {
>> >> >> +		ret = __rte_pmu_enable_group();
>> >> >> +		if (ret)
>> >> >> +			return 0;
>> >> >> +	}
>> >> >> +
>> >> >> +	if (unlikely(index >= rte_pmu.num_group_events))
>> >> >> +		return 0;
>> >> >> +
>> >> >> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
>> >> >> +}
>> >> >> +
>> >> >> +#ifdef __cplusplus
>> >> >> +}
>> >> >> +#endif
>> >> >> +



* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-20 16:59                                 ` Tomasz Duszynski
@ 2023-02-20 17:21                                   ` Konstantin Ananyev
  2023-02-20 20:42                                     ` Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-20 17:21 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev


> >> >> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> >> >> >> new file mode 100644
> >> >> >> index 0000000000..6b664c3336
> >> >> >> --- /dev/null
> >> >> >> +++ b/lib/pmu/rte_pmu.h
> >> >> >> @@ -0,0 +1,212 @@
> >> >> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> >> >> + * Copyright(c) 2023 Marvell
> >> >> >> + */
> >> >> >> +
> >> >> >> +#ifndef _RTE_PMU_H_
> >> >> >> +#define _RTE_PMU_H_
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * @file
> >> >> >> + *
> >> >> >> + * PMU event tracing operations
> >> >> >> + *
> >> >> >> + * This file defines generic API and types necessary to setup PMU and
> >> >> >> + * read selected counters in runtime.
> >> >> >> + */
> >> >> >> +
> >> >> >> +#ifdef __cplusplus
> >> >> >> +extern "C" {
> >> >> >> +#endif
> >> >> >> +
> >> >> >> +#include <linux/perf_event.h>
> >> >> >> +
> >> >> >> +#include <rte_atomic.h>
> >> >> >> +#include <rte_branch_prediction.h>
> >> >> >> +#include <rte_common.h>
> >> >> >> +#include <rte_compat.h>
> >> >> >> +#include <rte_spinlock.h>
> >> >> >> +
> >> >> >> +/** Maximum number of events in a group */
> >> >> >> +#define MAX_NUM_GROUP_EVENTS 8
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * A structure describing a group of events.
> >> >> >> + */
> >> >> >> +struct rte_pmu_event_group {
> >> >> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
> >> >> >> +/**< array of user pages
> >> >*/
> >> >> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> >> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> >> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */ }
> >> >> >> +__rte_cache_aligned;
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * A structure describing an event.
> >> >> >> + */
> >> >> >> +struct rte_pmu_event {
> >> >> >> +	char *name; /**< name of an event */
> >> >> >> +	unsigned int index; /**< event index into fds/mmap_pages */
> >> >> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */ };
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * A PMU state container.
> >> >> >> + */
> >> >> >> +struct rte_pmu {
> >> >> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> >> >> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
> >> >> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> >> >> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> >> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> >> >> +	unsigned int initialized; /**< initialization counter */ };
> >> >> >> +
> >> >> >> +/** lcore event group */
> >> >> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group,
> >> >> >> +_event_group);
> >> >> >> +
> >> >> >> +/** PMU state container */
> >> >> >> +extern struct rte_pmu rte_pmu;
> >> >> >> +
> >> >> >> +/** Each architecture supporting PMU needs to provide its own
> >> >> >> +version */ #ifndef rte_pmu_pmc_read #define
> >> >> >> +rte_pmu_pmc_read(index) ({ 0; }) #endif
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * @warning
> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> >> + *
> >> >> >> + * Read PMU counter.
> >> >> >> + *
> >> >> >> + * @warning This should be not called directly.
> >> >> >> + *
> >> >> >> + * @param pc
> >> >> >> + *   Pointer to the mmapped user page.
> >> >> >> + * @return
> >> >> >> + *   Counter value read from hardware.
> >> >> >> + */
> >> >> >> +static __rte_always_inline uint64_t
> >> >> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> >> >> >> +	uint64_t width, offset;
> >> >> >> +	uint32_t seq, index;
> >> >> >> +	int64_t pmc;
> >> >> >> +
> >> >> >> +	for (;;) {
> >> >> >> +		seq = pc->lock;
> >> >> >> +		rte_compiler_barrier();
> >> >> >
> >> >> >Are you sure that compiler_barrier() is enough here?
> >> >> >On some archs CPU itself has freedom to re-order reads.
> >> >> >Or I am missing something obvious here?
> >> >> >
> >> >>
> >> >> It's a matter of not keeping old stuff cached in registers and
> >> >> making sure that we have two reads of lock. CPU reordering won't do
> >> >> any harm here.
> >> >
> >> >Sorry, I didn't get you here:
> >> >Suppose CPU will re-order reads and will read lock *after* index or offset value.
> >> >Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
> >> >
> >>
> >> This number is just an indicator whether kernel did change something or not.
> >
> >You are talking about pc->lock, right?
> >Yes, I do understand that it is sort of seqlock.
> >That's why I am puzzled why we do not care about possible cpu read-reordering.
> >Manual for perf_event_open() also has a code snippet with compiler barrier only...
> >
> >> If cpu reordering will come into play then this will not change anything from pov of this loop.
> >> All we want is fresh data when needed and no involvement of compiler
> >> when it comes to reordering code.
> >
> >Ok, can you probably explain to me why the following could not happen:
> >T0:
> >pc->seqlock==0; pc->index==I1; pc->offset==O1;
> >T1:
> >      cpu #0 read pmu (due to cpu read reorder, we get index value before seqlock):
> >       index=pc->index;  //index==I1;
> > T2:
> >      cpu #1 kernel perf_event_update_userpage:
> >      pc->lock++; // pc->lock==1
> >      pc->index=I2;
> >      pc->offset=O2;
> >      ...
> >      pc->lock++; //pc->lock==2
> >T3:
> >      cpu #0 continue with read pmu:
> >      seq=pc->lock; //seq == 2
> >       offset=pc->offset; // offset == O2
> >       ....
> >       pmc = rte_pmu_pmc_read(index - 1);  // Note that we read at I1, not I2
> >       offset += pmc; //offset == O2 + pmcread(I1-1);
> >       if (pc->lock == seq) // they are equal, return
> >             return offset;
> >
> >Or, it can happen, but by some reason we don't care much?
> >
> 
> This code does self-monitoring and user page (whole group actually) is per thread running on
> current cpu. Hence I am not sure what are you trying to prove with that example.

I am not trying to prove anything so far.
I am asking whether such a situation is possible, and if not, why.
My current understanding (possibly wrong) is that after you have mmapped these pages,
the kernel can still update them asynchronously.
So, when reading data from these pages you have to check the 'lock' value before and
after accessing the other data.
If so, why doesn't possible CPU read reordering matter?
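
To make the question concrete, here is a minimal sketch of a reader that does order its loads against 'lock', written with C11 atomics. It is not the patch's code: struct fake_userpage, fake_update(), fake_read() and fake_counter() are hypothetical stand-ins, the writer only models the "increment lock, update, increment lock" sequence described in this thread, and the retry on an odd sequence value is the textbook seqlock convention rather than something taken from the patch. Whether the real perf user page needs such fences is exactly the open question; the sketch does not settle it.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the perf_event_mmap_page fields used here. */
struct fake_userpage {
	_Atomic uint32_t lock;   /* sequence counter */
	_Atomic uint32_t index;  /* 0 means "counter cannot be read" */
	_Atomic int64_t offset;
};

/* Writer model: bump lock, publish new data, bump lock again. */
static void
fake_update(struct fake_userpage *pc, uint32_t index, int64_t offset)
{
	uint32_t seq = atomic_load_explicit(&pc->lock, memory_order_relaxed);

	atomic_store_explicit(&pc->lock, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&pc->index, index, memory_order_relaxed);
	atomic_store_explicit(&pc->offset, offset, memory_order_relaxed);
	atomic_store_explicit(&pc->lock, seq + 2, memory_order_release);
}

/* Stand-in for the architecture-specific counter read. */
static int64_t
fake_counter(uint32_t idx)
{
	return 100 + idx;
}

/* Reader: data loads cannot move before the first or after the second lock load. */
static int64_t
fake_read(struct fake_userpage *pc, int64_t (*read_counter)(uint32_t))
{
	uint32_t seq, index;
	int64_t offset;

	for (;;) {
		/* acquire: later loads stay after this one */
		seq = atomic_load_explicit(&pc->lock, memory_order_acquire);
		index = atomic_load_explicit(&pc->index, memory_order_relaxed);
		offset = atomic_load_explicit(&pc->offset, memory_order_relaxed);
		if (index)
			offset += read_counter(index - 1);
		/* acquire fence: earlier loads stay before the re-check */
		atomic_thread_fence(memory_order_acquire);
		if (!(seq & 1) &&
		    atomic_load_explicit(&pc->lock, memory_order_relaxed) == seq)
			return offset;
	}
}
```

With only a compiler barrier in place of the acquire load and fence, the T1..T3 interleaving quoted above is what I would like to see ruled out (or accepted) explicitly.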

Also, there was another question below, which you probably missed, so I copied it here:
Another question: do we really need to have __rte_pmu_read_userpage() and rte_pmu_read() as
static inline functions in a public header?
As I understand it, because of that we also have to make the 'struct rte_pmu_*'
definitions public.

> 
> >> >>
> >> >> >> +		index = pc->index;
> >> >> >> +		offset = pc->offset;
> >> >> >> +		width = pc->pmc_width;
> >> >> >> +
> >> >> >> +		/* index set to 0 means that particular counter cannot be used */
> >> >> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> >> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> >> >> +			pmc <<= 64 - width;
> >> >> >> +			pmc >>= 64 - width;
> >> >> >> +			offset += pmc;
> >> >> >> +		}
> >> >> >> +
> >> >> >> +		rte_compiler_barrier();
> >> >> >> +
> >> >> >> +		if (likely(pc->lock == seq))
> >> >> >> +			return offset;
> >> >> >> +	}
> >> >> >> +
> >> >> >> +	return 0;
> >> >> >> +}
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * @warning
> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> >> + *
> >> >> >> + * Enable group of events on the calling lcore.
> >> >> >> + *
> >> >> >> + * @warning This should be not called directly.
> >> >> >> + *
> >> >> >> + * @return
> >> >> >> + *   0 in case of success, negative value otherwise.
> >> >> >> + */
> >> >> >> +__rte_experimental
> >> >> >> +int
> >> >> >> +__rte_pmu_enable_group(void);
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * @warning
> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> >> + *
> >> >> >> + * Initialize PMU library.
> >> >> >> + *
> >> >> >> + * @warning This should be not called directly.
> >> >> >> + *
> >> >> >> + * @return
> >> >> >> + *   0 in case of success, negative value otherwise.
> >> >> >> + */
> >> >> >> +__rte_experimental
> >> >> >> +int
> >> >> >> +rte_pmu_init(void);
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * @warning
> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> >> + *
> >> >> >> + * Finalize PMU library. This should be called after PMU counters are no longer being
> >read.
> >> >> >> + */
> >> >> >> +__rte_experimental
> >> >> >> +void
> >> >> >> +rte_pmu_fini(void);
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * @warning
> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> >> + *
> >> >> >> + * Add event to the group of enabled events.
> >> >> >> + *
> >> >> >> + * @param name
> >> >> >> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> >> >> >> + * @return
> >> >> >> + *   Event index in case of success, negative value otherwise.
> >> >> >> + */
> >> >> >> +__rte_experimental
> >> >> >> +int
> >> >> >> +rte_pmu_add_event(const char *name);
> >> >> >> +
> >> >> >> +/**
> >> >> >> + * @warning
> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> >> + *
> >> >> >> + * Read hardware counter configured to count occurrences of an event.
> >> >> >> + *
> >> >> >> + * @param index
> >> >> >> + *   Index of an event to be read.
> >> >> >> + * @return
> >> >> >> + *   Event value read from register. In case of errors or lack of support
> >> >> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> >> >> + *   indicates problem with reading particular PMU event register.
> >> >> >> + */
> >
> >Another question - do we really need  to have __rte_pmu_read_userpage() and rte_pmu_read() as
> >static inline functions in public header?
> >As I understand, because of that we also have to make 'struct rte_pmu_*'
> >definitions also public.
> >
> >> >> >> +__rte_experimental
> >> >> >> +static __rte_always_inline uint64_t rte_pmu_read(unsigned int
> >> >> >> +index) {
> >> >> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> >> >> >> +	int ret;
> >> >> >> +
> >> >> >> +	if (unlikely(!rte_pmu.initialized))
> >> >> >> +		return 0;
> >> >> >> +
> >> >> >> +	if (unlikely(!group->enabled)) {
> >> >> >> +		ret = __rte_pmu_enable_group();
> >> >> >> +		if (ret)
> >> >> >> +			return 0;
> >> >> >> +	}
> >> >> >> +
> >> >> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> >> >> +		return 0;
> >> >> >> +
> >> >> >> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
> >> >> >> +}
> >> >> >> +
> >> >> >> +#ifdef __cplusplus
> >> >> >> +}
> >> >> >> +#endif
> >> >> >> +



* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-20 17:21                                   ` Konstantin Ananyev
@ 2023-02-20 20:42                                     ` Tomasz Duszynski
  2023-02-21  0:48                                       ` Konstantin Ananyev
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-20 20:42 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev



>-----Original Message-----
>From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>Sent: Monday, February 20, 2023 6:21 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>;
>dev@dpdk.org
>Subject: RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
>
>
>> >> >> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h new file
>> >> >> >> mode
>> >> >> >> 100644 index 0000000000..6b664c3336
>> >> >> >> --- /dev/null
>> >> >> >> +++ b/lib/pmu/rte_pmu.h
>> >> >> >> @@ -0,0 +1,212 @@
>> >> >> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> >> >> + * Copyright(c) 2023 Marvell  */
>> >> >> >> +
>> >> >> >> +#ifndef _RTE_PMU_H_
>> >> >> >> +#define _RTE_PMU_H_
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * @file
>> >> >> >> + *
>> >> >> >> + * PMU event tracing operations
>> >> >> >> + *
>> >> >> >> + * This file defines generic API and types necessary to
>> >> >> >> +setup PMU and
>> >> >> >> + * read selected counters in runtime.
>> >> >> >> + */
>> >> >> >> +
>> >> >> >> +#ifdef __cplusplus
>> >> >> >> +extern "C" {
>> >> >> >> +#endif
>> >> >> >> +
>> >> >> >> +#include <linux/perf_event.h>
>> >> >> >> +
>> >> >> >> +#include <rte_atomic.h>
>> >> >> >> +#include <rte_branch_prediction.h> #include <rte_common.h>
>> >> >> >> +#include <rte_compat.h> #include <rte_spinlock.h>
>> >> >> >> +
>> >> >> >> +/** Maximum number of events in a group */ #define
>> >> >> >> +MAX_NUM_GROUP_EVENTS 8
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * A structure describing a group of events.
>> >> >> >> + */
>> >> >> >> +struct rte_pmu_event_group {
>> >> >> >> +	struct perf_event_mmap_page
>> >> >> >> +*mmap_pages[MAX_NUM_GROUP_EVENTS];
>> >> >> >> +/**< array of user pages
>> >> >*/
>> >> >> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> >> >> >> +	bool enabled; /**< true if group was enabled on particular lcore */
>> >> >> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */ }
>> >> >> >> +__rte_cache_aligned;
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * A structure describing an event.
>> >> >> >> + */
>> >> >> >> +struct rte_pmu_event {
>> >> >> >> +	char *name; /**< name of an event */
>> >> >> >> +	unsigned int index; /**< event index into fds/mmap_pages */
>> >> >> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */ };
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * A PMU state container.
>> >> >> >> + */
>> >> >> >> +struct rte_pmu {
>> >> >> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>> >> >> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
>> >> >> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>> >> >> >> +	unsigned int num_group_events; /**< number of events in a group */
>> >> >> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>> >> >> >> +	unsigned int initialized; /**< initialization counter */ };
>> >> >> >> +
>> >> >> >> +/** lcore event group */
>> >> >> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group,
>> >> >> >> +_event_group);
>> >> >> >> +
>> >> >> >> +/** PMU state container */
>> >> >> >> +extern struct rte_pmu rte_pmu;
>> >> >> >> +
>> >> >> >> +/** Each architecture supporting PMU needs to provide its
>> >> >> >> +own version */ #ifndef rte_pmu_pmc_read #define
>> >> >> >> +rte_pmu_pmc_read(index) ({ 0; }) #endif
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * @warning
>> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> >> + *
>> >> >> >> + * Read PMU counter.
>> >> >> >> + *
>> >> >> >> + * @warning This should be not called directly.
>> >> >> >> + *
>> >> >> >> + * @param pc
>> >> >> >> + *   Pointer to the mmapped user page.
>> >> >> >> + * @return
>> >> >> >> + *   Counter value read from hardware.
>> >> >> >> + */
>> >> >> >> +static __rte_always_inline uint64_t
>> >> >> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
>> >> >> >> +	uint64_t width, offset;
>> >> >> >> +	uint32_t seq, index;
>> >> >> >> +	int64_t pmc;
>> >> >> >> +
>> >> >> >> +	for (;;) {
>> >> >> >> +		seq = pc->lock;
>> >> >> >> +		rte_compiler_barrier();
>> >> >> >
>> >> >> >Are you sure that compiler_barrier() is enough here?
>> >> >> >On some archs CPU itself has freedom to re-order reads.
>> >> >> >Or I am missing something obvious here?
>> >> >> >
>> >> >>
>> >> >> It's a matter of not keeping old stuff cached in registers and
>> >> >> making sure that we have two reads of lock. CPU reordering won't
>> >> >> do any harm here.
>> >> >
>> >> >Sorry, I didn't get you here:
>> >> >Suppose CPU will re-order reads and will read lock *after* index or offset value.
>> >> >Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
>> >> >
>> >>
>> >> This number is just an indicator whether kernel did change something or not.
>> >
>> >You are talking about pc->lock, right?
>> >Yes, I do understand that it is sort of seqlock.
>> >That's why I am puzzled why we do not care about possible cpu read-reordering.
>> >Manual for perf_event_open() also has a code snippet with compiler barrier only...
>> >
>> >> If cpu reordering will come into play then this will not change anything from pov of this
>loop.
>> >> All we want is fresh data when needed and no involvement of
>> >> compiler when it comes to reordering code.
>> >
>> >Ok, can you probably explain to me why the following could not happen:
>> >T0:
>> >pc->seqlock==0; pc->index==I1; pc->offset==O1;
>> >T1:
>> >      cpu #0 read pmu (due to cpu read reorder, we get index value before seqlock):
>> >       index=pc->index;  //index==I1;
>> > T2:
>> >      cpu #1 kernel perf_event_update_userpage:
>> >      pc->lock++; // pc->lock==1
>> >      pc->index=I2;
>> >      pc->offset=O2;
>> >      ...
>> >      pc->lock++; //pc->lock==2
>> >T3:
>> >      cpu #0 continue with read pmu:
>> >      seq=pc->lock; //seq == 2
>> >       offset=pc->offset; // offset == O2
>> >       ....
>> >       pmc = rte_pmu_pmc_read(index - 1);  // Note that we read at I1, not I2
>> >       offset += pmc; //offset == O2 + pmcread(I1-1);
>> >       if (pc->lock == seq) // they are equal, return
>> >             return offset;
>> >
>> >Or, it can happen, but by some reason we don't care much?
>> >
>>
>> This code does self-monitoring and user page (whole group actually) is
>> per thread running on current cpu. Hence I am not sure what are you trying to prove with that
>example.
>
>I am not trying to prove anything so far.
>I am asking is such situation possible or not, and if not, why?
>My current understanding (possibly wrong) is that after you mmaped these pages, kernel still can
>asynchronously update them.
>So, when reading the data from these pages you have to check 'lock' value before and after
>accessing other data.
>If so, why possible cpu read-reordering doesn't matter?
>

Look, let me reiterate.

1. The user page/group/PMU config is per process. Other processes do not access it.
   All this happens on the very same CPU where the current thread is running.
2. Suppose you've already read seq. Now, for some reason, the kernel updates data in the page seq was read from.
3. The kernel enters a critical section during the update. seq changes along with the other data without the app knowing about it.
   If you want the nitty-gritty details, consult the kernel sources.
4. The app resumes and has some stale data but *WILL* read the new seq. The code loops again because the values do not match.
5. Otherwise the seq values match and the data is valid.
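
The steps above can be sketched as a small single-threaded simulation. This is an illustration under stated assumptions, not the library code: struct sim_page, sim_kernel_update() and sim_read() are hypothetical names, and the "preemption" is faked by calling the update from inside the read loop. It models only the same-CPU case described in steps 1 to 5 and says nothing about an update running concurrently on another CPU.

```c
#include <stdint.h>

/* Hypothetical stand-in for the user page: a sequence counter and one value. */
struct sim_page {
	uint32_t lock;
	int64_t offset;
};

/* Step 3: the update brackets its writes with two increments of lock. */
static void
sim_kernel_update(struct sim_page *pc, int64_t offset)
{
	pc->lock++;
	pc->offset = offset;
	pc->lock++;
}

/*
 * Read the page. If preempt_once is non-zero, pretend the thread is
 * interrupted once, right after it has sampled seq and the data, and
 * that the update runs before the thread resumes. The number of loop
 * passes is reported through passes.
 */
static int64_t
sim_read(struct sim_page *pc, int preempt_once, unsigned int *passes)
{
	uint32_t seq;
	int64_t offset;

	*passes = 0;
	for (;;) {
		(*passes)++;
		seq = pc->lock;          /* step 2: seq has been read */
		offset = pc->offset;
		if (preempt_once) {
			preempt_once = 0;
			sim_kernel_update(pc, offset + 500); /* step 3 */
		}
		if (pc->lock == seq)     /* step 5: values match, data is valid */
			return offset;
		/* step 4: stale data but new seq, so loop again */
	}
}
```

Called with preempt_once set to 0 the loop runs once; with preempt_once set to 1 the first pass is discarded and the second pass returns the updated value.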

>Also there was another question below, which you probably  missed, so I copied it here:
>Another question - do we really need  to have __rte_pmu_read_userpage() and rte_pmu_read() as
>static inline functions in public header?
>As I understand, because of that we also have to make 'struct rte_pmu_*'
>definitions also public.
>

These functions need to be inlined; otherwise performance takes a hit.

>>
>> >> >>
>> >> >> >> +		index = pc->index;
>> >> >> >> +		offset = pc->offset;
>> >> >> >> +		width = pc->pmc_width;
>> >> >> >> +
>> >> >> >> +		/* index set to 0 means that particular counter cannot be used */
>> >> >> >> +		if (likely(pc->cap_user_rdpmc && index)) {
>> >> >> >> +			pmc = rte_pmu_pmc_read(index - 1);
>> >> >> >> +			pmc <<= 64 - width;
>> >> >> >> +			pmc >>= 64 - width;
>> >> >> >> +			offset += pmc;
>> >> >> >> +		}
>> >> >> >> +
>> >> >> >> +		rte_compiler_barrier();
>> >> >> >> +
>> >> >> >> +		if (likely(pc->lock == seq))
>> >> >> >> +			return offset;
>> >> >> >> +	}
>> >> >> >> +
>> >> >> >> +	return 0;
>> >> >> >> +}
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * @warning
>> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> >> + *
>> >> >> >> + * Enable group of events on the calling lcore.
>> >> >> >> + *
>> >> >> >> + * @warning This should be not called directly.
>> >> >> >> + *
>> >> >> >> + * @return
>> >> >> >> + *   0 in case of success, negative value otherwise.
>> >> >> >> + */
>> >> >> >> +__rte_experimental
>> >> >> >> +int
>> >> >> >> +__rte_pmu_enable_group(void);
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * @warning
>> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> >> + *
>> >> >> >> + * Initialize PMU library.
>> >> >> >> + *
>> >> >> >> + * @warning This should be not called directly.
>> >> >> >> + *
>> >> >> >> + * @return
>> >> >> >> + *   0 in case of success, negative value otherwise.
>> >> >> >> + */
>> >> >> >> +__rte_experimental
>> >> >> >> +int
>> >> >> >> +rte_pmu_init(void);
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * @warning
>> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> >> + *
>> >> >> >> + * Finalize PMU library. This should be called after PMU
>> >> >> >> +counters are no longer being
>> >read.
>> >> >> >> + */
>> >> >> >> +__rte_experimental
>> >> >> >> +void
>> >> >> >> +rte_pmu_fini(void);
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * @warning
>> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> >> + *
>> >> >> >> + * Add event to the group of enabled events.
>> >> >> >> + *
>> >> >> >> + * @param name
>> >> >> >> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>> >> >> >> + * @return
>> >> >> >> + *   Event index in case of success, negative value otherwise.
>> >> >> >> + */
>> >> >> >> +__rte_experimental
>> >> >> >> +int
>> >> >> >> +rte_pmu_add_event(const char *name);
>> >> >> >> +
>> >> >> >> +/**
>> >> >> >> + * @warning
>> >> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> >> >> + *
>> >> >> >> + * Read hardware counter configured to count occurrences of an event.
>> >> >> >> + *
>> >> >> >> + * @param index
>> >> >> >> + *   Index of an event to be read.
>> >> >> >> + * @return
>> >> >> >> + *   Event value read from register. In case of errors or lack of support
>> >> >> >> + *   0 is returned. In other words, stream of zeros in a trace file
>> >> >> >> + *   indicates problem with reading particular PMU event register.
>> >> >> >> + */
>> >
>> >Another question - do we really need  to have
>> >__rte_pmu_read_userpage() and rte_pmu_read() as static inline functions in public header?
>> >As I understand, because of that we also have to make 'struct rte_pmu_*'
>> >definitions also public.
>> >
>> >> >> >> +__rte_experimental
>> >> >> >> +static __rte_always_inline uint64_t rte_pmu_read(unsigned
>> >> >> >> +int
>> >> >> >> +index) {
>> >> >> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> >> >> >> +	int ret;
>> >> >> >> +
>> >> >> >> +	if (unlikely(!rte_pmu.initialized))
>> >> >> >> +		return 0;
>> >> >> >> +
>> >> >> >> +	if (unlikely(!group->enabled)) {
>> >> >> >> +		ret = __rte_pmu_enable_group();
>> >> >> >> +		if (ret)
>> >> >> >> +			return 0;
>> >> >> >> +	}
>> >> >> >> +
>> >> >> >> +	if (unlikely(index >= rte_pmu.num_group_events))
>> >> >> >> +		return 0;
>> >> >> >> +
>> >> >> >> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
>> >> >> >> +}
>> >> >> >> +
>> >> >> >> +#ifdef __cplusplus
>> >> >> >> +}
>> >> >> >> +#endif
>> >> >> >> +



* Re: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-20 20:42                                     ` Tomasz Duszynski
@ 2023-02-21  0:48                                       ` Konstantin Ananyev
  2023-02-27  8:12                                         ` Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-21  0:48 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev


>>>>>>>>> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h new file
>>>>>>>>> mode
>>>>>>>>> 100644 index 0000000000..6b664c3336
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/lib/pmu/rte_pmu.h
>>>>>>>>> @@ -0,0 +1,212 @@
>>>>>>>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>>>>>>>> + * Copyright(c) 2023 Marvell  */
>>>>>>>>> +
>>>>>>>>> +#ifndef _RTE_PMU_H_
>>>>>>>>> +#define _RTE_PMU_H_
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @file
>>>>>>>>> + *
>>>>>>>>> + * PMU event tracing operations
>>>>>>>>> + *
>>>>>>>>> + * This file defines generic API and types necessary to
>>>>>>>>> +setup PMU and
>>>>>>>>> + * read selected counters in runtime.
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +#ifdef __cplusplus
>>>>>>>>> +extern "C" {
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>> +#include <linux/perf_event.h>
>>>>>>>>> +
>>>>>>>>> +#include <rte_atomic.h>
>>>>>>>>> +#include <rte_branch_prediction.h> #include <rte_common.h>
>>>>>>>>> +#include <rte_compat.h> #include <rte_spinlock.h>
>>>>>>>>> +
>>>>>>>>> +/** Maximum number of events in a group */ #define
>>>>>>>>> +MAX_NUM_GROUP_EVENTS 8
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * A structure describing a group of events.
>>>>>>>>> + */
>>>>>>>>> +struct rte_pmu_event_group {
>>>>>>>>> +	struct perf_event_mmap_page
>>>>>>>>> +*mmap_pages[MAX_NUM_GROUP_EVENTS];
>>>>>>>>> +/**< array of user pages
>>>>>> */
>>>>>>>>> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>>>>>>>>> +	bool enabled; /**< true if group was enabled on particular lcore */
>>>>>>>>> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */ }
>>>>>>>>> +__rte_cache_aligned;
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * A structure describing an event.
>>>>>>>>> + */
>>>>>>>>> +struct rte_pmu_event {
>>>>>>>>> +	char *name; /**< name of an event */
>>>>>>>>> +	unsigned int index; /**< event index into fds/mmap_pages */
>>>>>>>>> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */ };
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * A PMU state container.
>>>>>>>>> + */
>>>>>>>>> +struct rte_pmu {
>>>>>>>>> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>>>>>>>>> +	rte_spinlock_t lock; /**< serialize access to event group list */
>>>>>>>>> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>>>>>>>>> +	unsigned int num_group_events; /**< number of events in a group */
>>>>>>>>> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>>>>>>>>> +	unsigned int initialized; /**< initialization counter */ };
>>>>>>>>> +
>>>>>>>>> +/** lcore event group */
>>>>>>>>> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group,
>>>>>>>>> +_event_group);
>>>>>>>>> +
>>>>>>>>> +/** PMU state container */
>>>>>>>>> +extern struct rte_pmu rte_pmu;
>>>>>>>>> +
>>>>>>>>> +/** Each architecture supporting PMU needs to provide its
>>>>>>>>> +own version */ #ifndef rte_pmu_pmc_read #define
>>>>>>>>> +rte_pmu_pmc_read(index) ({ 0; }) #endif
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Read PMU counter.
>>>>>>>>> + *
>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>> + *
>>>>>>>>> + * @param pc
>>>>>>>>> + *   Pointer to the mmapped user page.
>>>>>>>>> + * @return
>>>>>>>>> + *   Counter value read from hardware.
>>>>>>>>> + */
>>>>>>>>> +static __rte_always_inline uint64_t
>>>>>>>>> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
>>>>>>>>> +	uint64_t width, offset;
>>>>>>>>> +	uint32_t seq, index;
>>>>>>>>> +	int64_t pmc;
>>>>>>>>> +
>>>>>>>>> +	for (;;) {
>>>>>>>>> +		seq = pc->lock;
>>>>>>>>> +		rte_compiler_barrier();
>>>>>>>>
>>>>>>>> Are you sure that compiler_barrier() is enough here?
>>>>>>>> On some archs CPU itself has freedom to re-order reads.
>>>>>>>> Or I am missing something obvious here?
>>>>>>>>
>>>>>>>
>>>>>>> It's a matter of not keeping old stuff cached in registers and
>>>>>>> making sure that we have two reads of lock. CPU reordering won't
>>>>>>> do any harm here.
>>>>>>
>>>>>> Sorry, I didn't get you here:
>>>>>> Suppose CPU will re-order reads and will read lock *after* index or offset value.
>>>>>> Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
>>>>>>
>>>>>
>>>>> This number is just an indicator whether kernel did change something or not.
>>>>
>>>> You are talking about pc->lock, right?
>>>> Yes, I do understand that it is sort of seqlock.
>>>> That's why I am puzzled why we do not care about possible cpu read-reordering.
>>>> Manual for perf_event_open() also has a code snippet with compiler barrier only...
>>>>
>>>>> If cpu reordering will come into play then this will not change anything from pov of this
>> loop.
>>>>> All we want is fresh data when needed and no involvement of
>>>>> compiler when it comes to reordering code.
>>>>
>>>> Ok, can you probably explain to me why the following could not happen:
>>>> T0:
>>>> pc->seqlock==0; pc->index==I1; pc->offset==O1;
>>>> T1:
>>>>       cpu #0 read pmu (due to cpu read reorder, we get index value before seqlock):
>>>>        index=pc->index;  //index==I1;
>>>> T2:
>>>>       cpu #1 kernel perf_event_update_userpage:
>>>>       pc->lock++; // pc->lock==1
>>>>       pc->index=I2;
>>>>       pc->offset=O2;
>>>>       ...
>>>>       pc->lock++; //pc->lock==2
>>>> T3:
>>>>       cpu #0 continue with read pmu:
>>>>       seq=pc->lock; //seq == 2
>>>>        offset=pc->offset; // offset == O2
>>>>        ....
>>>>        pmc = rte_pmu_pmc_read(index - 1);  // Note that we read at I1, not I2
>>>>        offset += pmc; //offset == O2 + pmcread(I1-1);
>>>>        if (pc->lock == seq) // they are equal, return
>>>>              return offset;
>>>>
>>>> Or, it can happen, but by some reason we don't care much?
>>>>
>>>
>>> This code does self-monitoring and user page (whole group actually) is
>>> per thread running on current cpu. Hence I am not sure what are you trying to prove with that
>> example.
>>
>> I am not trying to prove anything so far.
>> I am asking is such situation possible or not, and if not, why?
>> My current understanding (possibly wrong) is that after you mmaped these pages, kernel still can
>> asynchronously update them.
>> So, when reading the data from these pages you have to check 'lock' value before and after
>> accessing other data.
>> If so, why possible cpu read-reordering doesn't matter?
>>
> 
> Look. I'll reiterate that.
> 
> 1. That user page/group/PMU config is per process. Other processes do not access that.

Ok, that's clear.


>     All this happens on the very same CPU where current thread is running.

Ok... but can't this page be updated by a kernel thread running
simultaneously on a different CPU?


> 2. Suppose you've already read seq. Now for some reason kernel updates data in page seq was read from.
> 3. Kernel will enter critical section during update. seq changes along with other data without app knowing about it.
>     If you want nitty gritty details consult kernel sources.

Look, I shouldn't have to beg you to answer these questions.
In fact, I expect the library author to document all such subtle things
clearly, either in the PG or in source code comments (ideally in both).
If not, then from my perspective the patch is not ready and
shouldn't be accepted.
I don't know whether a compiler barrier is enough here or not, but I think it
is definitely worth a clear explanation in the docs.
I suppose I wouldn't be the only one to get confused here.
So please make the effort and document clearly why you believe there
is no race condition.

> 4. app resumes and has some stale data but *WILL* read new seq. Code loops again because values do not match.

If the kernel always executes the update for this page in the same
thread context, then yes, user code will always notice the difference
after it resumes.
But why can't it happen that your user thread reads this page on one
CPU while some kernel code on another CPU updates it simultaneously?


> 5. Otherwise seq values match and data is valid.
> 
>> Also there was another question below, which you probably  missed, so I copied it here:
>> Another question - do we really need  to have __rte_pmu_read_userpage() and rte_pmu_read() as
>> static inline functions in public header?
>> As I understand, because of that we also have to make 'struct rte_pmu_*'
>> definitions also public.
>>
> 
> These functions need to be inlined otherwise performance takes a hit.

I understand that performance might be affected, but how big is the hit?
I expect the actual PMU read will not be free anyway, right?
If the difference is small, it might be worth going for such a change:
removing unneeded structures from public headers would help a lot in
the future in terms of ABI/API stability.
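
Just to illustrate what I mean, a rough sketch of the out-of-line variant is below. All names (pmu_read_outofline(), struct pmu_group, PMU_MAX_EVENTS) are hypothetical and the array of values is only a stand-in for the mmapped user pages; the point is that the public part needs no struct definitions at all, at the price of a function call per read.

```c
#include <stdint.h>

/* ---- all that a public header would need to expose (hypothetical) ---- */
uint64_t pmu_read_outofline(unsigned int index);

/* ---- everything below would stay private to the library's .c file ---- */
#define PMU_MAX_EVENTS 8

struct pmu_group {
	uint64_t values[PMU_MAX_EVENTS]; /* stand-in for mmapped user pages */
	unsigned int num_events;
};

/* Stand-in for the per-lcore event group. */
static struct pmu_group group = {
	.values = { 11, 22, 33 },
	.num_events = 3,
};

/* Same contract as in the patch: 0 is returned for an invalid index. */
uint64_t
pmu_read_outofline(unsigned int index)
{
	if (index >= group.num_events)
		return 0;

	return group.values[index];
}
```

With such a layout the structures can change freely between releases, which is why I'd like to see numbers for the call overhead before deciding.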



>>>
>>>>>>>
>>>>>>>>> +		index = pc->index;
>>>>>>>>> +		offset = pc->offset;
>>>>>>>>> +		width = pc->pmc_width;
>>>>>>>>> +
>>>>>>>>> +		/* index set to 0 means that particular counter cannot be used */
>>>>>>>>> +		if (likely(pc->cap_user_rdpmc && index)) {
>>>>>>>>> +			pmc = rte_pmu_pmc_read(index - 1);
>>>>>>>>> +			pmc <<= 64 - width;
>>>>>>>>> +			pmc >>= 64 - width;
>>>>>>>>> +			offset += pmc;
>>>>>>>>> +		}
>>>>>>>>> +
>>>>>>>>> +		rte_compiler_barrier();
>>>>>>>>> +
>>>>>>>>> +		if (likely(pc->lock == seq))
>>>>>>>>> +			return offset;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Enable group of events on the calling lcore.
>>>>>>>>> + *
>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>> + *
>>>>>>>>> + * @return
>>>>>>>>> + *   0 in case of success, negative value otherwise.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +int
>>>>>>>>> +__rte_pmu_enable_group(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Initialize PMU library.
>>>>>>>>> + *
>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>> + *
>>>>>>>>> + * @return
>>>>>>>>> + *   0 in case of success, negative value otherwise.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +int
>>>>>>>>> +rte_pmu_init(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +void
>>>>>>>>> +rte_pmu_fini(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Add event to the group of enabled events.
>>>>>>>>> + *
>>>>>>>>> + * @param name
>>>>>>>>> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>>>>>>>>> + * @return
>>>>>>>>> + *   Event index in case of success, negative value otherwise.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +int
>>>>>>>>> +rte_pmu_add_event(const char *name);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Read hardware counter configured to count occurrences of an event.
>>>>>>>>> + *
>>>>>>>>> + * @param index
>>>>>>>>> + *   Index of an event to be read.
>>>>>>>>> + * @return
>>>>>>>>> + *   Event value read from register. In case of errors or lack of support
>>>>>>>>> + *   0 is returned. In other words, stream of zeros in a trace file
>>>>>>>>> + *   indicates problem with reading particular PMU event register.
>>>>>>>>> + */
>>>>
>>>> Another question - do we really need  to have
>>>> __rte_pmu_read_userpage() and rte_pmu_read() as static inline functions in public header?
>>>> As I understand, because of that we also have to make 'struct rte_pmu_*'
>>>> definitions also public.
>>>>
>>>>>>>>> +__rte_experimental
>>>>>>>>> +static __rte_always_inline uint64_t
>>>>>>>>> +rte_pmu_read(unsigned int index)
>>>>>>>>> +{
>>>>>>>>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>>>>>>>>> +	int ret;
>>>>>>>>> +
>>>>>>>>> +	if (unlikely(!rte_pmu.initialized))
>>>>>>>>> +		return 0;
>>>>>>>>> +
>>>>>>>>> +	if (unlikely(!group->enabled)) {
>>>>>>>>> +		ret = __rte_pmu_enable_group();
>>>>>>>>> +		if (ret)
>>>>>>>>> +			return 0;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>> +	if (unlikely(index >= rte_pmu.num_group_events))
>>>>>>>>> +		return 0;
>>>>>>>>> +
>>>>>>>>> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +#ifdef __cplusplus
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>> +
> 



* Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-16 17:54                     ` [PATCH v11 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-16 23:50                       ` Konstantin Ananyev
@ 2023-02-21  2:17                       ` Konstantin Ananyev
  2023-02-27  9:19                         ` [EXT] " Tomasz Duszynski
  1 sibling, 1 reply; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-21  2:17 UTC (permalink / raw)
  To: dev


> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> i.e run dedicated tasks. In such cases one cannot use standard
> perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>

A few more comments/questions below.


> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
> new file mode 100644
> index 0000000000..950f999cb7
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.c
> @@ -0,0 +1,460 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2023 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_per_lcore.h>
> +#include <rte_pmu.h>
> +#include <rte_spinlock.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +
> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> +struct rte_pmu rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
> +{
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char path[PATH_MAX];
> +	char *config = NULL;
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	*num = *mask = 0;
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
> +	fp = fopen(path, "r");
> +	if (fp == NULL)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +	fp = fopen(path, "r");
> +	if (fp == NULL)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
> +}
> +
> +static int
> +open_events(struct rte_pmu_event_group *group)
> +{
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret)
> +			continue;
> +
> +		ret = do_perf_event_open(config, group->fds[0]);
> +		if (ret == -1) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(struct rte_pmu_event_group *group)
> +{
> +	long page_size = sysconf(_SC_PAGE_SIZE);
> +	unsigned int i;
> +	void *addr;
> +	int ret;
> +
> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
> +			ret = -EPERM;
> +			goto out;
> +		}
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], page_size);
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(struct rte_pmu_event_group *group)
> +{
> +	unsigned int i;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	group->enabled = false;
> +}
> +
> +int
> +__rte_pmu_enable_group(void)
> +{
> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> +	int ret;
> +
> +	if (rte_pmu.num_group_events == 0)
> +		return -ENODEV;
> +
> +	ret = open_events(group);
> +	if (ret)
> +		goto out;
> +
> +	ret = mmap_events(group);
> +	if (ret)
> +		goto out;
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	rte_spinlock_lock(&rte_pmu.lock);
> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);

Hmm.. so we insert a pointer to a TLS variable into the global list?
I wonder what would happen if that thread gets terminated?
Can memory from its TLS block get re-used (by another thread or for other
purposes)?


> +	rte_spinlock_unlock(&rte_pmu.lock);
> +	group->enabled = true;
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(group);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (dirp == NULL)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	if (dent) {
> +		rte_pmu.name = strdup(name);
> +		if (rte_pmu.name == NULL) {
> +			closedir(dirp);
> +
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	closedir(dirp);
> +
> +	return rte_pmu.name ? 0 : -ENODEV;
> +}
> +
> +static struct rte_pmu_event *
> +new_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +
> +	event = calloc(1, sizeof(*event));
> +	if (event == NULL)
> +		goto out;
> +
> +	event->name = strdup(name);
> +	if (event->name == NULL) {
> +		free(event);
> +		event = NULL;
> +	}
> +
> +out:
> +	return event;
> +}
> +
> +static void
> +free_event(struct rte_pmu_event *event)
> +{
> +	free(event->name);
> +	free(event);
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	if (rte_pmu.name == NULL)
> +		return -ENODEV;
> +
> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
> +		return -ENOSPC;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = new_event(name);
> +	if (event == NULL)
> +		return -ENOMEM;
> +
> +	event->index = rte_pmu.num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
> +
> +	return event->index;
> +}
> +
> +int
> +rte_pmu_init(void)
> +{
> +	int ret;
> +
> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
> +	 * via command line but application doesn't care enough and performs init/fini again.
> +	 */
> +	if (rte_pmu.initialized != 0) {
> +		rte_pmu.initialized++;
> +		return 0;
> +	}
> +
> +	ret = scan_pmus();
> +	if (ret)
> +		goto out;
> +
> +	ret = pmu_arch_init();
> +	if (ret)
> +		goto out;
> +
> +	TAILQ_INIT(&rte_pmu.event_list);
> +	TAILQ_INIT(&rte_pmu.event_group_list);
> +	rte_spinlock_init(&rte_pmu.lock);
> +	rte_pmu.initialized = 1;
> +
> +	return 0;
> +out:
> +	free(rte_pmu.name);
> +	rte_pmu.name = NULL;
> +
> +	return ret;
> +}
> +
> +void
> +rte_pmu_fini(void)
> +{
> +	struct rte_pmu_event_group *group, *tmp_group;
> +	struct rte_pmu_event *event, *tmp_event;
> +
> +	/* cleanup once init count drops to zero */
> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
> +		return;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
> +		free_event(event);
> +	}
> +
> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
> +		cleanup_events(group);
> +	}
> +
> +	pmu_arch_fini();
> +	free(rte_pmu.name);
> +	rte_pmu.name = NULL;
> +	rte_pmu.num_group_events = 0;
> +}
> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> new file mode 100644
> index 0000000000..6b664c3336
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.h
> @@ -0,0 +1,212 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +#include <rte_spinlock.h>
> +
> +/** Maximum number of events in a group */
> +#define MAX_NUM_GROUP_EVENTS 8

The RTE_ prefix is missing.
In fact, do you really need the number of events in a group to be hard-coded?
Couldn't mmap_pages[] and fds[] be allocated dynamically by enable_group()?
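
As a rough sketch of what the dynamic variant could look like (all `demo_*` names are hypothetical and not part of the patch), the two arrays would be sized from the number of registered events at enable time:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical group layout with dynamically sized arrays. */
struct demo_group {
	void **mmap_pages;       /* one user page per event */
	int *fds;                /* one descriptor per event */
	unsigned int num_events; /* number of entries in both arrays */
};

/* Allocate both arrays for num_events events; returns 0 or a negative errno. */
static int
demo_group_alloc(struct demo_group *g, unsigned int num_events)
{
	unsigned int i;

	assert(g != NULL);
	if (num_events == 0)
		return -EINVAL;

	g->mmap_pages = calloc(num_events, sizeof(*g->mmap_pages));
	g->fds = calloc(num_events, sizeof(*g->fds));
	if (g->mmap_pages == NULL || g->fds == NULL) {
		free(g->mmap_pages);
		free(g->fds);
		g->mmap_pages = NULL;
		g->fds = NULL;
		g->num_events = 0;
		return -ENOMEM;
	}

	/* -1 marks a descriptor that has not been opened yet. */
	for (i = 0; i < num_events; i++)
		g->fds[i] = -1;
	g->num_events = num_events;

	return 0;
}

/* Release both arrays; safe to call on an already freed group. */
static void
demo_group_free(struct demo_group *g)
{
	assert(g != NULL);
	free(g->mmap_pages);
	free(g->fds);
	g->mmap_pages = NULL;
	g->fds = NULL;
	g->num_events = 0;
}
```

The cost is one extra pointer dereference on the read path, which would have to be weighed against dropping the hard-coded limit.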

> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> +} __rte_cache_aligned;
> +

Even if we decide to keep rte_pmu_read() as static inline (I'm still not
sure it is a good idea),
why do the two structs below (rte_pmu_event and rte_pmu) have to be public?
I think both can be safely moved out of the public headers.
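
One possible split, sketched here with hypothetical `demo_*` names and under the assumption that the inline fast path only needs the init flag and the event count: keep those two fields in a small public struct and move everything else into a private one.

```c
#include <assert.h>
#include <stdbool.h>

/* --- would live in the public header (hypothetical names) --- */

/* Only the state the inline read path has to look at. */
struct demo_pmu_fastpath {
	unsigned int initialized;
	unsigned int num_group_events;
};

extern struct demo_pmu_fastpath demo_pmu_fastpath;

/* Inline check usable from a static inline read function. */
static inline bool
demo_pmu_index_valid(unsigned int index)
{
	return demo_pmu_fastpath.initialized != 0 &&
		index < demo_pmu_fastpath.num_group_events;
}

/* --- would live in a private header or the .c file --- */

/* Control-path state that applications never need to see. */
struct demo_pmu_private {
	char *name;              /* PMU name found under sysfs */
	unsigned int init_count; /* number of successful init calls */
};

struct demo_pmu_fastpath demo_pmu_fastpath;
static struct demo_pmu_private demo_pmu_private;

/* Control-path helper: publish the event count to the fast path. */
static void
demo_pmu_configure(unsigned int num_events)
{
	assert(num_events > 0);
	demo_pmu_private.init_count++;
	demo_pmu_fastpath.num_group_events = num_events;
	demo_pmu_fastpath.initialized = 1;
}
```

With this layout only `demo_pmu_fastpath` would become part of the ABI, and the event list and PMU name could change freely.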


> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /**< name of an event */
> +	unsigned int index; /**< event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> +};

> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> +	rte_spinlock_t lock; /**< serialize access to event group list */
> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> +	unsigned int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +	unsigned int initialized; /**< initialization counter */
> +};
> +
> +/** lcore event group */
> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> +
> +/** PMU state container */
> +extern struct rte_pmu rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read PMU counter.
> + *
> + * @warning This should be not called directly.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +static __rte_always_inline uint64_t
> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t width, offset;
> +	uint32_t seq, index;
> +	int64_t pmc;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		/* index set to 0 means that particular counter cannot be used */
> +		if (likely(pc->cap_user_rdpmc && index)) {

In mmap_events() you return -EPERM if cap_user_rdpmc is not enabled.
Do you need another check here? Or can this capability be disabled by
the kernel at run time?


> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +			offset += pmc;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return offset;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Enable group of events on the calling lcore.
> + *
> + * @warning This should be not called directly.

__rte_internal ?

> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +__rte_pmu_enable_group(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Initialize PMU library.
> + *
> + * @warning This should be not called directly.

Hmm.. then who should call it?
If it is not supposed to be called directly, why declare it here?

> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */

It is probably worth mentioning that this function is not MT safe.
Same for _fini_ and add_event.
Also worth mentioning that the control-path functions
(init/fini/add_event) and the data-path one (pmu_read) can't be called concurrently.

> +__rte_experimental
> +int
> +rte_pmu_init(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
> + */
> +__rte_experimental
> +void
> +rte_pmu_fini(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(unsigned int index)
> +{
> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> +	int ret;
> +
> +	if (unlikely(!rte_pmu.initialized))
> +		return 0;
> +
> +	if (unlikely(!group->enabled)) {
> +		ret = __rte_pmu_enable_group();
> +		if (ret)
> +			return 0;
> +	}
> +
> +	if (unlikely(index >= rte_pmu.num_group_events))
> +		return 0;
> +
> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
> new file mode 100644
> index 0000000000..39a4f279c1
> --- /dev/null
> +++ b/lib/pmu/version.map
> @@ -0,0 +1,15 @@
> +DPDK_23 {
> +	local: *;
> +};
> +
> +EXPERIMENTAL {
> +	global:
> +
> +	__rte_pmu_enable_group;
> +	per_lcore__event_group;
> +	rte_pmu;
> +	rte_pmu_add_event;
> +	rte_pmu_fini;
> +	rte_pmu_init;
> +	rte_pmu_read;
> +};



* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-17  8:49                         ` [EXT] " Tomasz Duszynski
  2023-02-17 10:14                           ` Konstantin Ananyev
@ 2023-02-21 12:15                           ` Konstantin Ananyev
  1 sibling, 0 replies; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-21 12:15 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev
  Cc: bruce.richardson, dmitry.kozliuk, navasile, dmitrym, pallavi.kadam


> >> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c new file mode
> >> 100644 index 0000000000..950f999cb7
> >> --- /dev/null
> >> +++ b/lib/pmu/rte_pmu.c
> >> @@ -0,0 +1,460 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(C) 2023 Marvell International Ltd.
> >> + */
> >> +
> >> +#include <ctype.h>
> >> +#include <dirent.h>
> >> +#include <errno.h>
> >> +#include <regex.h>
> >> +#include <stdlib.h>
> >> +#include <string.h>
> >> +#include <sys/ioctl.h>
> >> +#include <sys/mman.h>
> >> +#include <sys/queue.h>
> >> +#include <sys/syscall.h>
> >> +#include <unistd.h>
> >> +
> >> +#include <rte_atomic.h>
> >> +#include <rte_per_lcore.h>
> >> +#include <rte_pmu.h>
> >> +#include <rte_spinlock.h>
> >> +#include <rte_tailq.h>
> >> +
> >> +#include "pmu_private.h"
> >> +
> >> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> >
> >
> >I suppose that pass (as the whole implementation) is linux specific?
> >If so, wouldn't it make sense to have it under linux subdir?
> >
> 
> There are not any plans to support that elsewhere currently so flat
> directory structure is good enough.

OK, then I suppose the best choice is to ask the FreeBSD and Windows maintainers.
Guys, any thoughts here?
Thanks
Konstantin


* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-21  0:48                                       ` Konstantin Ananyev
@ 2023-02-27  8:12                                         ` Tomasz Duszynski
  2023-02-28 11:35                                           ` Konstantin Ananyev
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-27  8:12 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev



>-----Original Message-----
>From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>Sent: Tuesday, February 21, 2023 1:48 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; Konstantin Ananyev <konstantin.ananyev@huawei.com>;
>dev@dpdk.org
>Subject: Re: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
>
>
>>>>>>>>>> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>>>>>>>>>> new file mode 100644
>>>>>>>>>> index 0000000000..6b664c3336
>>>>>>>>>> --- /dev/null
>>>>>>>>>> +++ b/lib/pmu/rte_pmu.h
>>>>>>>>>> @@ -0,0 +1,212 @@
>>>>>>>>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>>>>>>>>> + * Copyright(c) 2023 Marvell
>>>>>>>>>> + */
>>>>>>>>>> +
>>>>>>>>>> +#ifndef _RTE_PMU_H_
>>>>>>>>>> +#define _RTE_PMU_H_
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * @file
>>>>>>>>>> + *
>>>>>>>>>> + * PMU event tracing operations
>>>>>>>>>> + *
>>>>>>>>>> + * This file defines generic API and types necessary to setup PMU and
>>>>>>>>>> + * read selected counters in runtime.
>>>>>>>>>> + */
>>>>>>>>>> +
>>>>>>>>>> +#ifdef __cplusplus
>>>>>>>>>> +extern "C" {
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>> +#include <linux/perf_event.h>
>>>>>>>>>> +
>>>>>>>>>> +#include <rte_atomic.h>
>>>>>>>>>> +#include <rte_branch_prediction.h>
>>>>>>>>>> +#include <rte_common.h>
>>>>>>>>>> +#include <rte_compat.h>
>>>>>>>>>> +#include <rte_spinlock.h>
>>>>>>>>>> +
>>>>>>>>>> +/** Maximum number of events in a group */
>>>>>>>>>> +#define MAX_NUM_GROUP_EVENTS 8
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * A structure describing a group of events.
>>>>>>>>>> + */
>>>>>>>>>> +struct rte_pmu_event_group {
>>>>>>>>>> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>>>>>>>>>> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>>>>>>>>>> +	bool enabled; /**< true if group was enabled on particular lcore */
>>>>>>>>>> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>>>>>>>>>> +} __rte_cache_aligned;
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * A structure describing an event.
>>>>>>>>>> + */
>>>>>>>>>> +struct rte_pmu_event {
>>>>>>>>>> +	char *name; /**< name of an event */
>>>>>>>>>> +	unsigned int index; /**< event index into fds/mmap_pages */
>>>>>>>>>> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * A PMU state container.
>>>>>>>>>> + */
>>>>>>>>>> +struct rte_pmu {
>>>>>>>>>> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>>>>>>>>>> +	rte_spinlock_t lock; /**< serialize access to event group list */
>>>>>>>>>> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>>>>>>>>>> +	unsigned int num_group_events; /**< number of events in a group */
>>>>>>>>>> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>>>>>>>>>> +	unsigned int initialized; /**< initialization counter */
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +/** lcore event group */
>>>>>>>>>> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>>>>>>>>>> +
>>>>>>>>>> +/** PMU state container */
>>>>>>>>>> +extern struct rte_pmu rte_pmu;
>>>>>>>>>> +
>>>>>>>>>> +/** Each architecture supporting PMU needs to provide its own version */
>>>>>>>>>> +#ifndef rte_pmu_pmc_read
>>>>>>>>>> +#define rte_pmu_pmc_read(index) ({ 0; })
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * @warning
>>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>>> + *
>>>>>>>>>> + * Read PMU counter.
>>>>>>>>>> + *
>>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>>> + *
>>>>>>>>>> + * @param pc
>>>>>>>>>> + *   Pointer to the mmapped user page.
>>>>>>>>>> + * @return
>>>>>>>>>> + *   Counter value read from hardware.
>>>>>>>>>> + */
>>>>>>>>>> +static __rte_always_inline uint64_t
>>>>>>>>>> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>>>>>>>>>> +{
>>>>>>>>>> +	uint64_t width, offset;
>>>>>>>>>> +	uint32_t seq, index;
>>>>>>>>>> +	int64_t pmc;
>>>>>>>>>> +
>>>>>>>>>> +	for (;;) {
>>>>>>>>>> +		seq = pc->lock;
>>>>>>>>>> +		rte_compiler_barrier();
>>>>>>>>>
>>>>>>>>> Are you sure that compiler_barrier() is enough here?
>>>>>>>>> On some archs CPU itself has freedom to re-order reads.
>>>>>>>>> Or I am missing something obvious here?
>>>>>>>>>
>>>>>>>>
>>>>>>>> It's a matter of not keeping old stuff cached in registers and
>>>>>>>> making sure that we have two reads of lock. CPU reordering won't
>>>>>>>> do any harm here.
>>>>>>>
>>>>>>> Sorry, I didn't get you here:
>>>>>>> Suppose CPU will re-order reads and will read lock *after* index or offset value.
>>>>>>> Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
>>>>>>>
>>>>>>
>>>>>> This number is just an indicator whether kernel did change something or not.
>>>>>
>>>>> You are talking about pc->lock, right?
>>>>> Yes, I do understand that it is sort of seqlock.
>>>>> That's why I am puzzled why we do not care about possible cpu read-reordering.
>>>>> Manual for perf_event_open() also has a code snippet with compiler barrier only...
>>>>>
>>>>>> If cpu reordering will come into play then this will not change
>>>>>> anything from pov of this loop.
>>>>>> All we want is fresh data when needed and no involvement of
>>>>>> compiler when it comes to reordering code.
>>>>>
>>>>> Ok, can you probably explain to me why the following could not happen:
>>>>> T0:
>>>>> pc->seqlock==0; pc->index==I1; pc->offset==O1;
>>>>> T1:
>>>>>       cpu #0 read pmu (due to cpu read reorder, we get index value before seqlock):
>>>>>        index=pc->index;  //index==I1;
>>>>> T2:
>>>>>       cpu #1 kernel vent_update_userpage:
>>>>>       pc->lock++; // pc->lock==1
>>>>>       pc->index=I2;
>>>>>       pc->offset=O2;
>>>>>       ...
>>>>>       pc->lock++; //pc->lock==2
>>>>> T3:
>>>>>       cpu #0 continue with read pmu:
>>>>>       seq=pc->lock; //seq == 2
>>>>>        offset=pc->offset; // offset == O2
>>>>>        ....
>>>>>        pmc = rte_pmu_pmc_read(index - 1);  // Note that we read at I1, not I2
>>>>>        offset += pmc; //offset == O2 + pmcread(I1-1);
>>>>>        if (pc->lock == seq) // they are equal, return
>>>>>              return offset;
>>>>>
>>>>> Or, it can happen, but by some reason we don't care much?
>>>>>
>>>>
>>>> This code does self-monitoring and user page (whole group actually)
>>>> is per thread running on current cpu. Hence I am not sure what are
>>>> you trying to prove with that example.
>>>
>>> I am not trying to prove anything so far.
>>> I am asking is such situation possible or not, and if not, why?
>>> My current understanding (possibly wrong) is that after you mmaped
>>> these pages, kernel still can asynchronously update them.
>>> So, when reading the data from these pages you have to check 'lock'
>>> value before and after accessing other data.
>>> If so, why possible cpu read-reordering doesn't matter?
>>>
>>
>> Look. I'll reiterate that.
>>
>> 1. That user page/group/PMU config is per process. Other processes do not access that.
>
>Ok, that's clear.
>
>
>>     All this happens on the very same CPU where current thread is running.
>
>Ok... but can't this page be updated by kernel thread running simultaneously on different CPU?
>

I already pointed out that the event/counter configuration is bound to the current CPU. How could
another CPU possibly update that configuration? That cannot work.


If you think there is a problem with the code (or it is simply broken on your setup), that the logic
has an obvious flaw, and you can provide meaningful evidence of that, then I'd be more than happy to
apply a fix. Otherwise this discussion will get us nowhere.

>
>> 2. Suppose you've already read seq. Now for some reason kernel updates data in page seq was read from.
>> 3. Kernel will enter critical section during update. seq changes along with other data without app knowing about it.
>>     If you want nitty gritty details consult kernel sources.
>
>Look, I don't have to beg you to answer these questions.
>In fact, I expect library author to document all such narrow things
>clearly either in the PG, or in source code comments (ideally in both).
>If not, then from my perspective the patch is not ready and
>shouldn't be accepted.
>I don't know whether a compiler barrier is enough here or not, but I think it
>is definitely worth a clear explanation in the docs.
>I suppose it wouldn't be only me who will get confused here.
>So please take an effort and document it clearly why you believe there
>is no race-condition.
>
>> 4. app resumes and has some stale data but *WILL* read new seq. Code loops again because values do not match.
>
>If the kernel will always execute update for this page in the same
>thread context, then yes, - user code will always note the difference
>after resume.
>But why it can't happen that your user-thread reads this page on one
>CPU, while some kernel code on other CPU updates it simultaneously?
>
>
>> 5. Otherwise seq values match and data is valid.
>>
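
For reference, steps 2-5 above boil down to a sequence-count retry loop. Below is a
minimal sketch of it, not the patch code itself. It assumes the premise stated above,
namely that the page is only updated on the CPU running the reader, which is why a
compiler-only barrier is used; if an update could come from another CPU, these would
have to become real read barriers. struct fake_userpage, fake_pmc_read() and
read_userpage() are hypothetical stand-ins for struct perf_event_mmap_page, the
arch-specific counter read and __rte_pmu_read_userpage().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct perf_event_mmap_page. */
struct fake_userpage {
	volatile uint32_t lock;   /* sequence count, bumped before and after an update */
	volatile uint32_t index;  /* hardware counter index + 1, 0 if unusable */
	volatile int64_t offset;  /* value to add to the raw counter */
};

/* Compiler-only barrier, same idea as rte_compiler_barrier(). */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

/* Hypothetical counter read; real code would issue rdpmc or an equivalent. */
static uint64_t
fake_pmc_read(uint32_t idx)
{
	return 100 + idx;
}

static uint64_t
read_userpage(const struct fake_userpage *pc)
{
	uint32_t seq, index;
	uint64_t value;

	assert(pc != NULL);

	do {
		seq = pc->lock;          /* step 2: sample the sequence count */
		compiler_barrier();
		index = pc->index;
		value = (uint64_t)pc->offset;
		if (index != 0)
			value += fake_pmc_read(index - 1);
		compiler_barrier();
	} while (pc->lock != seq);       /* step 4: page changed, retry */

	return value;                    /* step 5: counts match, data is valid */
}
```

The loop only returns once the sequence count sampled before the reads equals the one
sampled after them.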
>>> Also there was another question below, which you probably  missed, so I copied it here:
>>> Another question - do we really need  to have __rte_pmu_read_userpage() and rte_pmu_read() as
>>> static inline functions in public header?
>>> As I understand, because of that we also have to make 'struct rte_pmu_*'
>>> definitions also public.
>>>
>>
>> These functions need to be inlined otherwise performance takes a hit.
>
>I understand that performance might be affected, but how big is the hit?
>I expect actual PMU read will not be free anyway, right?
>If the diff is small, might be it is worth to go for such change,
>removing unneeded structures from public headers would help a lot in
>future in terms of ABI/API stability.
>
>
>
>>>>
>>>>>>>>
>>>>>>>>>> +		index = pc->index;
>>>>>>>>>> +		offset = pc->offset;
>>>>>>>>>> +		width = pc->pmc_width;
>>>>>>>>>> +
>>>>>>>>>> +		/* index set to 0 means that particular counter cannot be used */
>>>>>>>>>> +		if (likely(pc->cap_user_rdpmc && index)) {
>>>>>>>>>> +			pmc = rte_pmu_pmc_read(index - 1);
>>>>>>>>>> +			pmc <<= 64 - width;
>>>>>>>>>> +			pmc >>= 64 - width;
>>>>>>>>>> +			offset += pmc;
>>>>>>>>>> +		}
>>>>>>>>>> +
>>>>>>>>>> +		rte_compiler_barrier();
>>>>>>>>>> +
>>>>>>>>>> +		if (likely(pc->lock == seq))
>>>>>>>>>> +			return offset;
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	return 0;
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * @warning
>>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>>> + *
>>>>>>>>>> + * Enable group of events on the calling lcore.
>>>>>>>>>> + *
>>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>>> + *
>>>>>>>>>> + * @return
>>>>>>>>>> + *   0 in case of success, negative value otherwise.
>>>>>>>>>> + */
>>>>>>>>>> +__rte_experimental
>>>>>>>>>> +int
>>>>>>>>>> +__rte_pmu_enable_group(void);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * @warning
>>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>>> + *
>>>>>>>>>> + * Initialize PMU library.
>>>>>>>>>> + *
>>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>>> + *
>>>>>>>>>> + * @return
>>>>>>>>>> + *   0 in case of success, negative value otherwise.
>>>>>>>>>> + */
>>>>>>>>>> +__rte_experimental
>>>>>>>>>> +int
>>>>>>>>>> +rte_pmu_init(void);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * @warning
>>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>>> + *
>>>>>>>>>> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>>>>>>>>>> + */
>>>>>>>>>> +__rte_experimental
>>>>>>>>>> +void
>>>>>>>>>> +rte_pmu_fini(void);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * @warning
>>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>>> + *
>>>>>>>>>> + * Add event to the group of enabled events.
>>>>>>>>>> + *
>>>>>>>>>> + * @param name
>>>>>>>>>> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>>>>>>>>>> + * @return
>>>>>>>>>> + *   Event index in case of success, negative value otherwise.
>>>>>>>>>> + */
>>>>>>>>>> +__rte_experimental
>>>>>>>>>> +int
>>>>>>>>>> +rte_pmu_add_event(const char *name);
>>>>>>>>>> +
>>>>>>>>>> +/**
>>>>>>>>>> + * @warning
>>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>>> + *
>>>>>>>>>> + * Read hardware counter configured to count occurrences of an event.
>>>>>>>>>> + *
>>>>>>>>>> + * @param index
>>>>>>>>>> + *   Index of an event to be read.
>>>>>>>>>> + * @return
>>>>>>>>>> + *   Event value read from register. In case of errors or lack of support
>>>>>>>>>> + *   0 is returned. In other words, stream of zeros in a trace file
>>>>>>>>>> + *   indicates problem with reading particular PMU event register.
>>>>>>>>>> + */
>>>>>
>>>>> Another question - do we really need  to have
>>>>> __rte_pmu_read_userpage() and rte_pmu_read() as static inline functions in public header?
>>>>> As I understand, because of that we also have to make 'struct rte_pmu_*'
>>>>> definitions also public.
>>>>>
>>>>>>>>>> +__rte_experimental
>>>>>>>>>> +static __rte_always_inline uint64_t
>>>>>>>>>> +rte_pmu_read(unsigned int index)
>>>>>>>>>> +{
>>>>>>>>>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>>>>>>>>>> +	int ret;
>>>>>>>>>> +
>>>>>>>>>> +	if (unlikely(!rte_pmu.initialized))
>>>>>>>>>> +		return 0;
>>>>>>>>>> +
>>>>>>>>>> +	if (unlikely(!group->enabled)) {
>>>>>>>>>> +		ret = __rte_pmu_enable_group();
>>>>>>>>>> +		if (ret)
>>>>>>>>>> +			return 0;
>>>>>>>>>> +	}
>>>>>>>>>> +
>>>>>>>>>> +	if (unlikely(index >= rte_pmu.num_group_events))
>>>>>>>>>> +		return 0;
>>>>>>>>>> +
>>>>>>>>>> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +#ifdef __cplusplus
>>>>>>>>>> +}
>>>>>>>>>> +#endif
>>>>>>>>>> +
>>


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-21  2:17                       ` Konstantin Ananyev
@ 2023-02-27  9:19                         ` Tomasz Duszynski
  2023-02-27 20:53                           ` Konstantin Ananyev
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-27  9:19 UTC (permalink / raw)
  To: Konstantin Ananyev, dev



>-----Original Message-----
>From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
>Sent: Tuesday, February 21, 2023 3:17 AM
>To: dev@dpdk.org
>Subject: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated i.e
>> run dedicated tasks. In such cases one cannot use standard perf
>> utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>
>Few more comments/questions below.
>
>
>> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
>> new file mode 100644
>> index 0000000000..950f999cb7
>> --- /dev/null
>> +++ b/lib/pmu/rte_pmu.c
>> @@ -0,0 +1,460 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(C) 2023 Marvell International Ltd.
>> + */
>> +
>> +#include <ctype.h>
>> +#include <dirent.h>
>> +#include <errno.h>
>> +#include <regex.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <sys/queue.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +#include <rte_atomic.h>
>> +#include <rte_per_lcore.h>
>> +#include <rte_pmu.h>
>> +#include <rte_spinlock.h>
>> +#include <rte_tailq.h>
>> +
>> +#include "pmu_private.h"
>> +
>> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
>> +
>> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
>> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
>> +
>> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> +struct rte_pmu rte_pmu;
>> +
>> +/*
>> + * Following __rte_weak functions provide default no-op. Architectures should override them if
>> + * necessary.
>> + */
>> +
>> +int
>> +__rte_weak pmu_arch_init(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fini(void)
>> +{
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3]) { }
>> +
>> +static int
>> +get_term_format(const char *name, int *num, uint64_t *mask) {
>> +	char path[PATH_MAX];
>> +	char *config = NULL;
>> +	int high, low, ret;
>> +	FILE *fp;
>> +
>> +	*num = *mask = 0;
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
>> +	fp = fopen(path, "r");
>> +	if (fp == NULL)
>> +		return -errno;
>> +
>> +	errno = 0;
>> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
>> +	if (ret < 2) {
>> +		ret = -ENODATA;
>> +		goto out;
>> +	}
>> +	if (errno) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	if (ret == 2)
>> +		high = low;
>> +
>> +	*mask = GENMASK_ULL(high, low);
>> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
>> +	*num = config[strlen(config) - 1];
>> +	*num = isdigit(*num) ? *num - '0' : 0;
>> +
>> +	ret = 0;
>> +out:
>> +	free(config);
>> +	fclose(fp);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +parse_event(char *buf, uint64_t config[3]) {
>> +	char *token, *term;
>> +	int num, ret, val;
>> +	uint64_t mask;
>> +
>> +	config[0] = config[1] = config[2] = 0;
>> +
>> +	token = strtok(buf, ",");
>> +	while (token) {
>> +		errno = 0;
>> +		/* <term>=<value> */
>> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
>> +		if (ret < 1)
>> +			return -ENODATA;
>> +		if (errno)
>> +			return -errno;
>> +		if (ret == 1)
>> +			val = 1;
>> +
>> +		ret = get_term_format(term, &num, &mask);
>> +		free(term);
>> +		if (ret)
>> +			return ret;
>> +
>> +		config[num] |= FIELD_PREP(mask, val);
>> +		token = strtok(NULL, ",");
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +get_event_config(const char *name, uint64_t config[3]) {
>> +	char path[PATH_MAX], buf[BUFSIZ];
>> +	FILE *fp;
>> +	int ret;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> +	fp = fopen(path, "r");
>> +	if (fp == NULL)
>> +		return -errno;
>> +
>> +	ret = fread(buf, 1, sizeof(buf), fp);
>> +	if (ret == 0) {
>> +		fclose(fp);
>> +
>> +		return -EINVAL;
>> +	}
>> +	fclose(fp);
>> +	buf[ret] = '\0';
>> +
>> +	return parse_event(buf, config);
>> +}
>> +
>> +static int
>> +do_perf_event_open(uint64_t config[3], int group_fd) {
>> +	struct perf_event_attr attr = {
>> +		.size = sizeof(struct perf_event_attr),
>> +		.type = PERF_TYPE_RAW,
>> +		.exclude_kernel = 1,
>> +		.exclude_hv = 1,
>> +		.disabled = 1,
>> +	};
>> +
>> +	pmu_arch_fixup_config(config);
>> +
>> +	attr.config = config[0];
>> +	attr.config1 = config[1];
>> +	attr.config2 = config[2];
>> +
>> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
>> +}
>> +
>> +static int
>> +open_events(struct rte_pmu_event_group *group) {
>> +	struct rte_pmu_event *event;
>> +	uint64_t config[3];
>> +	int num = 0, ret;
>> +
>> +	/* group leader gets created first, with fd = -1 */
>> +	group->fds[0] = -1;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> +		ret = get_event_config(event->name, config);
>> +		if (ret)
>> +			continue;
>> +
>> +		ret = do_perf_event_open(config, group->fds[0]);
>> +		if (ret == -1) {
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->fds[event->index] = ret;
>> +		num++;
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (--num; num >= 0; num--) {
>> +		close(group->fds[num]);
>> +		group->fds[num] = -1;
>> +	}
>> +
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +mmap_events(struct rte_pmu_event_group *group) {
>> +	long page_size = sysconf(_SC_PAGE_SIZE);
>> +	unsigned int i;
>> +	void *addr;
>> +	int ret;
>> +
>> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
>> +		if (addr == MAP_FAILED) {
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->mmap_pages[i] = addr;
>> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
>> +			ret = -EPERM;
>> +			goto out;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (; i; i--) {
>> +		munmap(group->mmap_pages[i - 1], page_size);
>> +		group->mmap_pages[i - 1] = NULL;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static void
>> +cleanup_events(struct rte_pmu_event_group *group) {
>> +	unsigned int i;
>> +
>> +	if (group->fds[0] != -1)
>> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
>> +
>> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> +		if (group->mmap_pages[i]) {
>> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
>> +			group->mmap_pages[i] = NULL;
>> +		}
>> +
>> +		if (group->fds[i] != -1) {
>> +			close(group->fds[i]);
>> +			group->fds[i] = -1;
>> +		}
>> +	}
>> +
>> +	group->enabled = false;
>> +}
>> +
>> +int
>> +__rte_pmu_enable_group(void)
>> +{
>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> +	int ret;
>> +
>> +	if (rte_pmu.num_group_events == 0)
>> +		return -ENODEV;
>> +
>> +	ret = open_events(group);
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = mmap_events(group);
>> +	if (ret)
>> +		goto out;
>> +
>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	rte_spinlock_lock(&rte_pmu.lock);
>> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
>
>Hmm.. so we insert pointer to TLS variable into the global list?
>Wonder what would happen if that thread get terminated?

Nothing special. Any pointers to that thread's thread-local data are invalidated.

>Can memory from its TLS block get re-used (by other thread or for other purposes)?
>

Why would any other thread reuse that? Eventually the main thread will need that data to do the cleanup.

>
>> +	rte_spinlock_unlock(&rte_pmu.lock);
>> +	group->enabled = true;
>> +
>> +	return 0;
>> +
>> +out:
>> +	cleanup_events(group);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +scan_pmus(void)
>> +{
>> +	char path[PATH_MAX];
>> +	struct dirent *dent;
>> +	const char *name;
>> +	DIR *dirp;
>> +
>> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
>> +	if (dirp == NULL)
>> +		return -errno;
>> +
>> +	while ((dent = readdir(dirp))) {
>> +		name = dent->d_name;
>> +		if (name[0] == '.')
>> +			continue;
>> +
>> +		/* sysfs entry should either contain cpus or be a cpu */
>> +		if (!strcmp(name, "cpu"))
>> +			break;
>> +
>> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
>> +		if (access(path, F_OK) == 0)
>> +			break;
>> +	}
>> +
>> +	if (dent) {
>> +		rte_pmu.name = strdup(name);
>> +		if (rte_pmu.name == NULL) {
>> +			closedir(dirp);
>> +
>> +			return -ENOMEM;
>> +		}
>> +	}
>> +
>> +	closedir(dirp);
>> +
>> +	return rte_pmu.name ? 0 : -ENODEV;
>> +}
>> +
>> +static struct rte_pmu_event *
>> +new_event(const char *name)
>> +{
>> +	struct rte_pmu_event *event;
>> +
>> +	event = calloc(1, sizeof(*event));
>> +	if (event == NULL)
>> +		goto out;
>> +
>> +	event->name = strdup(name);
>> +	if (event->name == NULL) {
>> +		free(event);
>> +		event = NULL;
>> +	}
>> +
>> +out:
>> +	return event;
>> +}
>> +
>> +static void
>> +free_event(struct rte_pmu_event *event)
>> +{
>> +	free(event->name);
>> +	free(event);
>> +}
>> +
>> +int
>> +rte_pmu_add_event(const char *name)
>> +{
>> +	struct rte_pmu_event *event;
>> +	char path[PATH_MAX];
>> +
>> +	if (rte_pmu.name == NULL)
>> +		return -ENODEV;
>> +
>> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
>> +		return -ENOSPC;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> +	if (access(path, R_OK))
>> +		return -ENODEV;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> +		if (!strcmp(event->name, name))
>> +			return event->index;
>> +		continue;
>> +	}
>> +
>> +	event = new_event(name);
>> +	if (event == NULL)
>> +		return -ENOMEM;
>> +
>> +	event->index = rte_pmu.num_group_events++;
>> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
>> +
>> +	return event->index;
>> +}
>> +
>> +int
>> +rte_pmu_init(void)
>> +{
>> +	int ret;
>> +
>> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
>> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
>> +	 * via command line but application doesn't care enough and performs init/fini again.
>> +	 */
>> +	if (rte_pmu.initialized != 0) {
>> +		rte_pmu.initialized++;
>> +		return 0;
>> +	}
>> +
>> +	ret = scan_pmus();
>> +	if (ret)
>> +		goto out;
>> +
>> +	ret = pmu_arch_init();
>> +	if (ret)
>> +		goto out;
>> +
>> +	TAILQ_INIT(&rte_pmu.event_list);
>> +	TAILQ_INIT(&rte_pmu.event_group_list);
>> +	rte_spinlock_init(&rte_pmu.lock);
>> +	rte_pmu.initialized = 1;
>> +
>> +	return 0;
>> +out:
>> +	free(rte_pmu.name);
>> +	rte_pmu.name = NULL;
>> +
>> +	return ret;
>> +}
>> +
>> +void
>> +rte_pmu_fini(void)
>> +{
>> +	struct rte_pmu_event_group *group, *tmp_group;
>> +	struct rte_pmu_event *event, *tmp_event;
>> +
>> +	/* cleanup once init count drops to zero */
>> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
>> +		return;
>> +
>> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
>> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
>> +		free_event(event);
>> +	}
>> +
>> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
>> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
>> +		cleanup_events(group);
>> +	}
>> +
>> +	pmu_arch_fini();
>> +	free(rte_pmu.name);
>> +	rte_pmu.name = NULL;
>> +	rte_pmu.num_group_events = 0;
>> +}
>> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>> new file mode 100644
>> index 0000000000..6b664c3336
>> --- /dev/null
>> +++ b/lib/pmu/rte_pmu.h
>> @@ -0,0 +1,212 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2023 Marvell
>> + */
>> +
>> +#ifndef _RTE_PMU_H_
>> +#define _RTE_PMU_H_
>> +
>> +/**
>> + * @file
>> + *
>> + * PMU event tracing operations
>> + *
>> + * This file defines generic API and types necessary to setup PMU and
>> + * read selected counters in runtime.
>> + */
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +#include <linux/perf_event.h>
>> +
>> +#include <rte_atomic.h>
>> +#include <rte_branch_prediction.h>
>> +#include <rte_common.h>
>> +#include <rte_compat.h>
>> +#include <rte_spinlock.h>
>> +
>> +/** Maximum number of events in a group */
>> +#define MAX_NUM_GROUP_EVENTS 8
>
>forgot RTE_ prefix.
>In fact, do you really need number of events in group to be hard-coded?
>Couldn't mmap_pages[] and fds[] be allocated dynamically by enable_group()?
>

8 is a reasonable number, I think. x86/ARM actually have fewer than that (something like 4?).
Moreover, events are scheduled as a group, so there must be enough hardware counters available
for that to succeed. So this number should cover current needs.
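
For comparison, the dynamic variant suggested above would look roughly like the
sketch below. It is only an illustration: struct dyn_event_group, dyn_group_alloc()
and dyn_group_free() are hypothetical names and not part of the patch. Note that it
adds an allocation to the enable path and one more pointer dereference when indexing
the arrays on the read path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical variant of the event group with arrays sized at enable time. */
struct dyn_event_group {
	void **mmap_pages;  /* one user page per event */
	int *fds;           /* one event descriptor per event */
	unsigned int num;   /* number of events in the group */
};

static int
dyn_group_alloc(struct dyn_event_group *g, unsigned int num)
{
	unsigned int i;

	assert(g != NULL);

	g->mmap_pages = NULL;
	g->fds = NULL;
	g->num = 0;

	if (num == 0)
		return -EINVAL;

	g->mmap_pages = calloc(num, sizeof(*g->mmap_pages));
	g->fds = malloc(num * sizeof(*g->fds));
	if (g->mmap_pages == NULL || g->fds == NULL) {
		free(g->mmap_pages);
		free(g->fds);
		g->mmap_pages = NULL;
		g->fds = NULL;
		return -ENOMEM;
	}

	/* mark all descriptors as not opened yet */
	for (i = 0; i < num; i++)
		g->fds[i] = -1;
	g->num = num;

	return 0;
}

static void
dyn_group_free(struct dyn_event_group *g)
{
	free(g->mmap_pages);
	free(g->fds);
	g->mmap_pages = NULL;
	g->fds = NULL;
	g->num = 0;
}
```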

>> +
>> +/**
>> + * A structure describing a group of events.
>> + */
>> +struct rte_pmu_event_group {
>> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> +	bool enabled; /**< true if group was enabled on particular lcore */
>> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>> +} __rte_cache_aligned;
>> +
>
>Even if we'd decide to keep rte_pmu_read() as static inline (still not
>sure it is a good idea),

We want to save as many CPU cycles as we possibly can, and inlining does help
in that matter.

>why these two struct below (rte_pmu_event and rte_pmu) have to be public?
>I think both can be safely moved away from public headers.
>

struct rte_pmu_event can be hidden I guess. 
struct rte_pmu is used in this header hence cannot be moved elsewhere. 

>
>> +/**
>> + * A structure describing an event.
>> + */
>> +struct rte_pmu_event {
>> +	char *name; /**< name of an event */
>> +	unsigned int index; /**< event index into fds/mmap_pages */
>> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
>> +};
>
>> +
>> +/**
>> + * A PMU state container.
>> + */
>> +struct rte_pmu {
>> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>> +	rte_spinlock_t lock; /**< serialize access to event group list */
>> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>> +	unsigned int num_group_events; /**< number of events in a group */
>> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>> +	unsigned int initialized; /**< initialization counter */
>> +};
>> +
>> +/** lcore event group */
>> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> +
>> +/** PMU state container */
>> +extern struct rte_pmu rte_pmu;
>> +
>> +/** Each architecture supporting PMU needs to provide its own version */
>> +#ifndef rte_pmu_pmc_read
>> +#define rte_pmu_pmc_read(index) ({ 0; })
>> +#endif
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Read PMU counter.
>> + *
>> + * @warning This should be not called directly.
>> + *
>> + * @param pc
>> + *   Pointer to the mmapped user page.
>> + * @return
>> + *   Counter value read from hardware.
>> + */
>> +static __rte_always_inline uint64_t
>> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> +{
>> +	uint64_t width, offset;
>> +	uint32_t seq, index;
>> +	int64_t pmc;
>> +
>> +	for (;;) {
>> +		seq = pc->lock;
>> +		rte_compiler_barrier();
>> +		index = pc->index;
>> +		offset = pc->offset;
>> +		width = pc->pmc_width;
>> +
>> +		/* index set to 0 means that particular counter cannot be used */
>> +		if (likely(pc->cap_user_rdpmc && index)) {
>
>In mmap_events() you return EPERM if cap_user_rdpmc is not enabled.
>Do you need another check here? Or this capability can be disabled by
>kernel at run-time?
>

That extra check in mmap_events() may actually be removed. Some archs allow
disabling rdpmc reads (I think one can do that on x86), so this check needs to stay.

>
>> +			pmc = rte_pmu_pmc_read(index - 1);
>> +			pmc <<= 64 - width;
>> +			pmc >>= 64 - width;
>> +			offset += pmc;
>> +		}
>> +
>> +		rte_compiler_barrier();
>> +
>> +		if (likely(pc->lock == seq))
>> +			return offset;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Enable group of events on the calling lcore.
>> + *
>> + * @warning This should be not called directly.
>
>__rte_internal ?
>

No, this cannot be internal because that would make the functions calling it
internal as well, hence apps wouldn't be able to use them. This has
already been brought up by one of the reviewers.

>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>> +__rte_experimental
>> +int
>> +__rte_pmu_enable_group(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Initialize PMU library.
>> + *
>> + * @warning This should be not called directly.
>
>Hmm.. then who should call it?
>If it not supposed to be called directly, why to declare it here?
>

This is inlined and has one caller, i.e. rte_pmu_read().

>> + *
>> + * @return
>> + *   0 in case of success, negative value otherwise.
>> + */
>
>Probably worth to mention that this function is not MT safe.
>Same for _fini_ and add_event.
>Also worth to mention that all control-path functions
>(init/fini/add_event) and data-path (pmu_read) can't be called concurrently.
>

Yes, they are meant to be called from the main thread.
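
To illustrate why: the init/fini pairing in the patch is a plain, non-atomic
counter, so concurrent callers could race on it. A minimal sketch of that pattern
follows; sketch_init(), sketch_fini(), init_count and resources_ready are
hypothetical names standing in for the real library state.

```c
#include <assert.h>

/* Hypothetical stand-ins for the library state; not thread safe by design. */
static unsigned int init_count;
static int resources_ready;

static int
sketch_init(void)
{
	/* nested init: only bump the count, resources already exist */
	if (init_count != 0) {
		assert(resources_ready);
		init_count++;
		return 0;
	}

	resources_ready = 1; /* scan PMUs, set up lists, ... */
	init_count = 1;

	return 0;
}

static void
sketch_fini(void)
{
	/* clean up only once the init count drops to zero */
	if (init_count == 0 || --init_count != 0)
		return;

	resources_ready = 0;
}
```

Every sketch_init() needs a matching sketch_fini(); only the last one tears the
state down, and an extra fini on an uninitialized library is a no-op.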

>> +__rte_experimental
>> +int
>> +rte_pmu_init(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>> + */
>> +__rte_experimental
>> +void
>> +rte_pmu_fini(void);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Add event to the group of enabled events.
>> + *
>> + * @param name
>> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>> + * @return
>> + *   Event index in case of success, negative value otherwise.
>> + */
>> +__rte_experimental
>> +int
>> +rte_pmu_add_event(const char *name);
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Read hardware counter configured to count occurrences of an event.
>> + *
>> + * @param index
>> + *   Index of an event to be read.
>> + * @return
>> + *   Event value read from register. In case of errors or lack of support
>> + *   0 is returned. In other words, stream of zeros in a trace file
>> + *   indicates problem with reading particular PMU event register.
>> + */
>> +__rte_experimental
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read(unsigned int index)
>> +{
>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> +	int ret;
>> +
>> +	if (unlikely(!rte_pmu.initialized))
>> +		return 0;
>> +
>> +	if (unlikely(!group->enabled)) {
>> +		ret = __rte_pmu_enable_group();
>> +		if (ret)
>> +			return 0;
>> +	}
>> +
>> +	if (unlikely(index >= rte_pmu.num_group_events))
>> +		return 0;
>> +
>> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
>> +}
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_PMU_H_ */
>> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
>> new file mode 100644
>> index 0000000000..39a4f279c1
>> --- /dev/null
>> +++ b/lib/pmu/version.map
>> @@ -0,0 +1,15 @@
>> +DPDK_23 {
>> +	local: *;
>> +};
>> +
>> +EXPERIMENTAL {
>> +	global:
>> +
>> +	__rte_pmu_enable_group;
>> +	per_lcore__event_group;
>> +	rte_pmu;
>> +	rte_pmu_add_event;
>> +	rte_pmu_fini;
>> +	rte_pmu_init;
>> +	rte_pmu_read;
>> +};


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-27  9:19                         ` [EXT] " Tomasz Duszynski
@ 2023-02-27 20:53                           ` Konstantin Ananyev
  2023-02-28  8:25                             ` Morten Brørup
  2023-02-28  9:57                             ` Tomasz Duszynski
  0 siblings, 2 replies; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-27 20:53 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev



> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated i.e
> >> run dedicated tasks. In such cases one cannot use standard perf
> >> utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >
> >Few more comments/questions below.
> >
> >
> >> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
> >> new file mode 100644
> >> index 0000000000..950f999cb7
> >> --- /dev/null
> >> +++ b/lib/pmu/rte_pmu.c
> >> @@ -0,0 +1,460 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(C) 2023 Marvell International Ltd.
> >> + */
> >> +
> >> +#include <ctype.h>
> >> +#include <dirent.h>
> >> +#include <errno.h>
> >> +#include <regex.h>
> >> +#include <stdlib.h>
> >> +#include <string.h>
> >> +#include <sys/ioctl.h>
> >> +#include <sys/mman.h>
> >> +#include <sys/queue.h>
> >> +#include <sys/syscall.h>
> >> +#include <unistd.h>
> >> +
> >> +#include <rte_atomic.h>
> >> +#include <rte_per_lcore.h>
> >> +#include <rte_pmu.h>
> >> +#include <rte_spinlock.h>
> >> +#include <rte_tailq.h>
> >> +
> >> +#include "pmu_private.h"
> >> +
> >> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> >> +
> >> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> >> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> >> +
> >> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >> +struct rte_pmu rte_pmu;
> >> +
> >> +/*
> >> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> >> + * necessary.
> >> + */
> >> +
> >> +int
> >> +__rte_weak pmu_arch_init(void)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +void
> >> +__rte_weak pmu_arch_fini(void)
> >> +{
> >> +}
> >> +
> >> +void
> >> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3]) { }
> >> +
> >> +static int
> >> +get_term_format(const char *name, int *num, uint64_t *mask) {
> >> +	char path[PATH_MAX];
> >> +	char *config = NULL;
> >> +	int high, low, ret;
> >> +	FILE *fp;
> >> +
> >> +	*num = *mask = 0;
> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
> >> +	fp = fopen(path, "r");
> >> +	if (fp == NULL)
> >> +		return -errno;
> >> +
> >> +	errno = 0;
> >> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> >> +	if (ret < 2) {
> >> +		ret = -ENODATA;
> >> +		goto out;
> >> +	}
> >> +	if (errno) {
> >> +		ret = -errno;
> >> +		goto out;
> >> +	}
> >> +
> >> +	if (ret == 2)
> >> +		high = low;
> >> +
> >> +	*mask = GENMASK_ULL(high, low);
> >> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> >> +	*num = config[strlen(config) - 1];
> >> +	*num = isdigit(*num) ? *num - '0' : 0;
> >> +
> >> +	ret = 0;
> >> +out:
> >> +	free(config);
> >> +	fclose(fp);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +parse_event(char *buf, uint64_t config[3]) {
> >> +	char *token, *term;
> >> +	int num, ret, val;
> >> +	uint64_t mask;
> >> +
> >> +	config[0] = config[1] = config[2] = 0;
> >> +
> >> +	token = strtok(buf, ",");
> >> +	while (token) {
> >> +		errno = 0;
> >> +		/* <term>=<value> */
> >> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> >> +		if (ret < 1)
> >> +			return -ENODATA;
> >> +		if (errno)
> >> +			return -errno;
> >> +		if (ret == 1)
> >> +			val = 1;
> >> +
> >> +		ret = get_term_format(term, &num, &mask);
> >> +		free(term);
> >> +		if (ret)
> >> +			return ret;
> >> +
> >> +		config[num] |= FIELD_PREP(mask, val);
> >> +		token = strtok(NULL, ",");
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int
> >> +get_event_config(const char *name, uint64_t config[3]) {
> >> +	char path[PATH_MAX], buf[BUFSIZ];
> >> +	FILE *fp;
> >> +	int ret;
> >> +
> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> >> +	fp = fopen(path, "r");
> >> +	if (fp == NULL)
> >> +		return -errno;
> >> +
> >> +	ret = fread(buf, 1, sizeof(buf), fp);
> >> +	if (ret == 0) {
> >> +		fclose(fp);
> >> +
> >> +		return -EINVAL;
> >> +	}
> >> +	fclose(fp);
> >> +	buf[ret] = '\0';
> >> +
> >> +	return parse_event(buf, config);
> >> +}
> >> +
> >> +static int
> >> +do_perf_event_open(uint64_t config[3], int group_fd) {
> >> +	struct perf_event_attr attr = {
> >> +		.size = sizeof(struct perf_event_attr),
> >> +		.type = PERF_TYPE_RAW,
> >> +		.exclude_kernel = 1,
> >> +		.exclude_hv = 1,
> >> +		.disabled = 1,
> >> +	};
> >> +
> >> +	pmu_arch_fixup_config(config);
> >> +
> >> +	attr.config = config[0];
> >> +	attr.config1 = config[1];
> >> +	attr.config2 = config[2];
> >> +
> >> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
> >> +}
> >> +
> >> +static int
> >> +open_events(struct rte_pmu_event_group *group) {
> >> +	struct rte_pmu_event *event;
> >> +	uint64_t config[3];
> >> +	int num = 0, ret;
> >> +
> >> +	/* group leader gets created first, with fd = -1 */
> >> +	group->fds[0] = -1;
> >> +
> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> >> +		ret = get_event_config(event->name, config);
> >> +		if (ret)
> >> +			continue;
> >> +
> >> +		ret = do_perf_event_open(config, group->fds[0]);
> >> +		if (ret == -1) {
> >> +			ret = -errno;
> >> +			goto out;
> >> +		}
> >> +
> >> +		group->fds[event->index] = ret;
> >> +		num++;
> >> +	}
> >> +
> >> +	return 0;
> >> +out:
> >> +	for (--num; num >= 0; num--) {
> >> +		close(group->fds[num]);
> >> +		group->fds[num] = -1;
> >> +	}
> >> +
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +mmap_events(struct rte_pmu_event_group *group) {
> >> +	long page_size = sysconf(_SC_PAGE_SIZE);
> >> +	unsigned int i;
> >> +	void *addr;
> >> +	int ret;
> >> +
> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> >> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
> >> +		if (addr == MAP_FAILED) {
> >> +			ret = -errno;
> >> +			goto out;
> >> +		}
> >> +
> >> +		group->mmap_pages[i] = addr;
> >> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
> >> +			ret = -EPERM;
> >> +			goto out;
> >> +		}
> >> +	}
> >> +
> >> +	return 0;
> >> +out:
> >> +	for (; i; i--) {
> >> +		munmap(group->mmap_pages[i - 1], page_size);
> >> +		group->mmap_pages[i - 1] = NULL;
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static void
> >> +cleanup_events(struct rte_pmu_event_group *group) {
> >> +	unsigned int i;
> >> +
> >> +	if (group->fds[0] != -1)
> >> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> >> +
> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> >> +		if (group->mmap_pages[i]) {
> >> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
> >> +			group->mmap_pages[i] = NULL;
> >> +		}
> >> +
> >> +		if (group->fds[i] != -1) {
> >> +			close(group->fds[i]);
> >> +			group->fds[i] = -1;
> >> +		}
> >> +	}
> >> +
> >> +	group->enabled = false;
> >> +}
> >> +
> >> +int
> >> +__rte_pmu_enable_group(void)
> >> +{
> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> >> +	int ret;
> >> +
> >> +	if (rte_pmu.num_group_events == 0)
> >> +		return -ENODEV;
> >> +
> >> +	ret = open_events(group);
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	ret = mmap_events(group);
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> >> +		ret = -errno;
> >> +		goto out;
> >> +	}
> >> +
> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> >> +		ret = -errno;
> >> +		goto out;
> >> +	}
> >> +
> >> +	rte_spinlock_lock(&rte_pmu.lock);
> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> >
> >Hmm.. so we insert pointer to TLS variable into the global list?
> >Wonder what would happen if that thread get terminated?
> 
> Nothing special. Any pointers to that thread-local in that thread are invalidated.
> 
> >Can memory from its TLS block get re-used (by other thread or for other purposes)?
> >
> 
> Why would any other thread reuse that? 
> Eventually main thread will need that data to do the cleanup.

I understand that the main thread would need to access that data.
I am not sure that it would be able to.
Imagine a thread calls rte_pmu_read(...) and then terminates, while the program continues to run.
As I understand it, the address of its RTE_PER_LCORE(_event_group) will still remain in
rte_pmu.event_group_list, even though it is probably no longer valid.

> >
> >> +	rte_spinlock_unlock(&rte_pmu.lock);
> >> +	group->enabled = true;
> >> +
> >> +	return 0;
> >> +
> >> +out:
> >> +	cleanup_events(group);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int
> >> +scan_pmus(void)
> >> +{
> >> +	char path[PATH_MAX];
> >> +	struct dirent *dent;
> >> +	const char *name;
> >> +	DIR *dirp;
> >> +
> >> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> >> +	if (dirp == NULL)
> >> +		return -errno;
> >> +
> >> +	while ((dent = readdir(dirp))) {
> >> +		name = dent->d_name;
> >> +		if (name[0] == '.')
> >> +			continue;
> >> +
> >> +		/* sysfs entry should either contain cpus or be a cpu */
> >> +		if (!strcmp(name, "cpu"))
> >> +			break;
> >> +
> >> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> >> +		if (access(path, F_OK) == 0)
> >> +			break;
> >> +	}
> >> +
> >> +	if (dent) {
> >> +		rte_pmu.name = strdup(name);
> >> +		if (rte_pmu.name == NULL) {
> >> +			closedir(dirp);
> >> +
> >> +			return -ENOMEM;
> >> +		}
> >> +	}
> >> +
> >> +	closedir(dirp);
> >> +
> >> +	return rte_pmu.name ? 0 : -ENODEV;
> >> +}
> >> +
> >> +static struct rte_pmu_event *
> >> +new_event(const char *name)
> >> +{
> >> +	struct rte_pmu_event *event;
> >> +
> >> +	event = calloc(1, sizeof(*event));
> >> +	if (event == NULL)
> >> +		goto out;
> >> +
> >> +	event->name = strdup(name);
> >> +	if (event->name == NULL) {
> >> +		free(event);
> >> +		event = NULL;
> >> +	}
> >> +
> >> +out:
> >> +	return event;
> >> +}
> >> +
> >> +static void
> >> +free_event(struct rte_pmu_event *event)
> >> +{
> >> +	free(event->name);
> >> +	free(event);
> >> +}
> >> +
> >> +int
> >> +rte_pmu_add_event(const char *name)
> >> +{
> >> +	struct rte_pmu_event *event;
> >> +	char path[PATH_MAX];
> >> +
> >> +	if (rte_pmu.name == NULL)
> >> +		return -ENODEV;
> >> +
> >> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
> >> +		return -ENOSPC;
> >> +
> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> >> +	if (access(path, R_OK))
> >> +		return -ENODEV;
> >> +
> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> >> +		if (!strcmp(event->name, name))
> >> +			return event->index;
> >> +		continue;
> >> +	}
> >> +
> >> +	event = new_event(name);
> >> +	if (event == NULL)
> >> +		return -ENOMEM;
> >> +
> >> +	event->index = rte_pmu.num_group_events++;
> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
> >> +
> >> +	return event->index;
> >> +}
> >> +
> >> +int
> >> +rte_pmu_init(void)
> >> +{
> >> +	int ret;
> >> +
> >> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
> >> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
> >> +	 * via command line but application doesn't care enough and performs init/fini again.
> >> +	 */
> >> +	if (rte_pmu.initialized != 0) {
> >> +		rte_pmu.initialized++;
> >> +		return 0;
> >> +	}
> >> +
> >> +	ret = scan_pmus();
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	ret = pmu_arch_init();
> >> +	if (ret)
> >> +		goto out;
> >> +
> >> +	TAILQ_INIT(&rte_pmu.event_list);
> >> +	TAILQ_INIT(&rte_pmu.event_group_list);
> >> +	rte_spinlock_init(&rte_pmu.lock);
> >> +	rte_pmu.initialized = 1;
> >> +
> >> +	return 0;
> >> +out:
> >> +	free(rte_pmu.name);
> >> +	rte_pmu.name = NULL;
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +void
> >> +rte_pmu_fini(void)
> >> +{
> >> +	struct rte_pmu_event_group *group, *tmp_group;
> >> +	struct rte_pmu_event *event, *tmp_event;
> >> +
> >> +	/* cleanup once init count drops to zero */
> >> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
> >> +		return;
> >> +
> >> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
> >> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
> >> +		free_event(event);
> >> +	}
> >> +
> >> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
> >> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
> >> +		cleanup_events(group);
> >> +	}
> >> +
> >> +	pmu_arch_fini();
> >> +	free(rte_pmu.name);
> >> +	rte_pmu.name = NULL;
> >> +	rte_pmu.num_group_events = 0;
> >> +}
> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> >> new file mode 100644
> >> index 0000000000..6b664c3336
> >> --- /dev/null
> >> +++ b/lib/pmu/rte_pmu.h
> >> @@ -0,0 +1,212 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2023 Marvell
> >> + */
> >> +
> >> +#ifndef _RTE_PMU_H_
> >> +#define _RTE_PMU_H_
> >> +
> >> +/**
> >> + * @file
> >> + *
> >> + * PMU event tracing operations
> >> + *
> >> + * This file defines generic API and types necessary to setup PMU and
> >> + * read selected counters in runtime.
> >> + */
> >> +
> >> +#ifdef __cplusplus
> >> +extern "C" {
> >> +#endif
> >> +
> >> +#include <linux/perf_event.h>
> >> +
> >> +#include <rte_atomic.h>
> >> +#include <rte_branch_prediction.h>
> >> +#include <rte_common.h>
> >> +#include <rte_compat.h>
> >> +#include <rte_spinlock.h>
> >> +
> >> +/** Maximum number of events in a group */
> >> +#define MAX_NUM_GROUP_EVENTS 8
> >
> >forgot RTE_ prefix.
> >In fact, do you really need number of events in group to be hard-coded?
> >Couldn't mmap_pages[] and fds[] be allocated dynamically by enable_group()?
> >
> 
> 8 is a reasonable number I think. X86/ARM actually have less than that (was that something like 4?).
> Moreover events are scheduled as a group so there must be enough hw counters available
> for that to succeed. So this number should cover current needs.

If you think 8 is enough to cover all foreseeable cases, I am OK either way.
It still needs the RTE_ prefix.
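
To make the dynamic alternative concrete, here is a minimal sketch in which the per-group arrays are sized when the group is enabled rather than by a compile-time constant. The names pmu_group, pmu_group_alloc() and pmu_group_free() are hypothetical and not part of the patch, and the page pointers are kept as void * so the sketch builds without linux/perf_event.h.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical group layout: arrays sized at enable time. */
struct pmu_group {
	void **mmap_pages;	/* one user page per event */
	int *fds;		/* one descriptor per event */
	unsigned int num_events;
};

/* Allocate both arrays for num_events events; returns 0 or -1. */
static int
pmu_group_alloc(struct pmu_group *g, unsigned int num_events)
{
	unsigned int i;

	if (num_events == 0)
		return -1;

	g->mmap_pages = calloc(num_events, sizeof(*g->mmap_pages));
	g->fds = malloc(num_events * sizeof(*g->fds));
	if (g->mmap_pages == NULL || g->fds == NULL) {
		free(g->mmap_pages);
		free(g->fds);
		g->mmap_pages = NULL;
		g->fds = NULL;
		return -1;
	}

	/* -1 marks "not opened yet", the same convention the patch uses */
	for (i = 0; i < num_events; i++)
		g->fds[i] = -1;
	g->num_events = num_events;

	return 0;
}

/* Release both arrays and reset the group to its empty state. */
static void
pmu_group_free(struct pmu_group *g)
{
	free(g->mmap_pages);
	free(g->fds);
	g->mmap_pages = NULL;
	g->fds = NULL;
	g->num_events = 0;
}
```

The cost of this variant is an extra allocation failure path in enable_group() and one more pointer dereference on the read path, which may be an argument for keeping the fixed-size arrays.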

> >> +
> >> +/**
> >> + * A structure describing a group of events.
> >> + */
> >> +struct rte_pmu_event_group {
> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> >> +} __rte_cache_aligned;
> >> +
> >
> >Even if we'd decide to keep rte_pmu_read() as static inline (still not
> >sure it is a good idea),
> 
> We want to save as many CPU cycles as we possibly can and inlining does help
> in that matter.

OK, so asking the same question as in the other thread: how many cycles will it save?
What is the difference in performance between having this function
inlined and not inlined?
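
One way to put a number on it is a small micro-benchmark along the following lines. This is only a sketch under stated assumptions: read_inlined() and read_outlined() are hypothetical stand-ins that load a volatile variable, not the patch's counter read, the attributes are GCC/Clang specific, and clock_gettime() is used for timing. A real comparison would call the actual read function on an isolated core.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the counter; not the patch code. */
static volatile uint64_t fake_counter = 1;

static inline __attribute__((always_inline)) uint64_t
read_inlined(void)
{
	return fake_counter;
}

static __attribute__((noinline)) uint64_t
read_outlined(void)
{
	return fake_counter;
}

static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Time n calls of each variant; the sums keep the loops from being dropped. */
static void
bench(uint64_t n, uint64_t *ns_inlined, uint64_t *ns_outlined,
      uint64_t *sum_inlined, uint64_t *sum_outlined)
{
	uint64_t i, start;

	*sum_inlined = *sum_outlined = 0;

	start = now_ns();
	for (i = 0; i < n; i++)
		*sum_inlined += read_inlined();
	*ns_inlined = now_ns() - start;

	start = now_ns();
	for (i = 0; i < n; i++)
		*sum_outlined += read_outlined();
	*ns_outlined = now_ns() - start;
}
```

Dividing each elapsed time by n gives a per-call cost for the two variants. The result depends heavily on the compiler, the call site and the CPU, so it is worth repeating on every target that matters.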
 
> >why these two struct below (rte_pmu_event and rte_pmu) have to be public?
> >I think both can be safely moved away from public headers.
> >
> 
> struct rte_pmu_event can be hidden I guess.
> struct rte_pmu is used in this header hence cannot be moved elsewhere.

Not sure why.
Is that because you use it inside rte_pmu_read()?
I think that check can be safely moved into __rte_pmu_enable_group(),
or probably even into rte_pmu_add_event().
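
Roughly what I have in mind, as a sketch only: the inline read touches nothing but the per-thread group, and the global state stays private to the .c file and is checked in the out-of-line enable function. All names here (pmu_group, pmu_state, pmu_read(), pmu_enable_group()) are hypothetical, the counters[] array stands in for the mmapped user pages, and the int return value differs from the patch's rte_pmu_read().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* ---- what a public header would expose (hypothetical names) ---- */
struct pmu_group {
	bool enabled;
	uint64_t counters[8];	/* stand-in for the mmapped user pages */
};

int pmu_enable_group(struct pmu_group *group);	/* defined out of line */

static inline int
pmu_read(struct pmu_group *group, unsigned int index, uint64_t *val)
{
	int ret;

	if (!group->enabled) {
		/* slow path, taken once; global state is checked in there */
		ret = pmu_enable_group(group);
		if (ret)
			return ret;
	}
	if (index >= 8)
		return -EINVAL;
	*val = group->counters[index];

	return 0;
}

/* ---- what would stay private to the .c file ---- */
struct pmu_state {
	unsigned int num_group_events;
	unsigned int initialized;
};

static struct pmu_state pmu_state;

int
pmu_enable_group(struct pmu_group *group)
{
	if (pmu_state.initialized == 0 || pmu_state.num_group_events == 0)
		return -ENODEV;
	group->enabled = true;

	return 0;
}
```

With this split the fast path costs one flag test on thread-local data, and the layout of the global state can change without affecting the ABI.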

> >
> >> +/**
> >> + * A structure describing an event.
> >> + */
> >> +struct rte_pmu_event {
> >> +	char *name; /**< name of an event */
> >> +	unsigned int index; /**< event index into fds/mmap_pages */
> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> >> +};
> >
> >> +
> >> +/**
> >> + * A PMU state container.
> >> + */
> >> +struct rte_pmu {
> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> +	unsigned int initialized; /**< initialization counter */
> >> +};
> >> +
> >> +/** lcore event group */
> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >> +
> >> +/** PMU state container */
> >> +extern struct rte_pmu rte_pmu;
> >> +
> >> +/** Each architecture supporting PMU needs to provide its own version */
> >> +#ifndef rte_pmu_pmc_read
> >> +#define rte_pmu_pmc_read(index) ({ 0; })
> >> +#endif
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read PMU counter.
> >> + *
> >> + * @warning This should be not called directly.
> >> + *
> >> + * @param pc
> >> + *   Pointer to the mmapped user page.
> >> + * @return
> >> + *   Counter value read from hardware.
> >> + */
> >> +static __rte_always_inline uint64_t
> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> +{
> >> +	uint64_t width, offset;
> >> +	uint32_t seq, index;
> >> +	int64_t pmc;
> >> +
> >> +	for (;;) {
> >> +		seq = pc->lock;
> >> +		rte_compiler_barrier();
> >> +		index = pc->index;
> >> +		offset = pc->offset;
> >> +		width = pc->pmc_width;
> >> +
> >> +		/* index set to 0 means that particular counter cannot be used */
> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >
> >In mmap_events() you return EPERM if cap_user_rdpmc is not enabled.
> >Do you need another check here? Or this capability can be disabled by
> >kernel at run-time?
> >
> 
> That extra check in mmap_event() may be removed actually. Some archs allow
> disabling reading rdpmc (I think that on x86 one can do that) so this check needs to stay.
> 
> >
> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> +			pmc <<= 64 - width;
> >> +			pmc >>= 64 - width;
> >> +			offset += pmc;
> >> +		}
> >> +
> >> +		rte_compiler_barrier();
> >> +
> >> +		if (likely(pc->lock == seq))
> >> +			return offset;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Enable group of events on the calling lcore.
> >> + *
> >> + * @warning This should be not called directly.
> >
> >__rte_internal ?
> >
> 
> No this cannot be internal because that will make functions calling it
> internal as well hence apps won't be able to use that. This has
> already been brought up by one of the reviewers.

OK, then we can probably mark it with the '@internal' tag in
the formal comments?

> 
> >> + *
> >> + * @return
> >> + *   0 in case of success, negative value otherwise.
> >> + */
> >> +__rte_experimental
> >> +int
> >> +__rte_pmu_enable_group(void);
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Initialize PMU library.
> >> + *
> >> + * @warning This should be not called directly.
> >
> >Hmm.. then who should call it?
> >If it not supposed to be called directly, why to declare it here?
> >
> 
> This is inlined and has one caller i.e rte_pmu_read().

I thought we were talking here about rte_pmu_init().
I don't see where it is inlined, and it is still not clear why it can't be called directly.

> >> + *
> >> + * @return
> >> + *   0 in case of success, negative value otherwise.
> >> + */
> >
> >Probably worth to mention that this function is not MT safe.
> >Same for _fini_ and add_event.
> >Also worth to mention that all control-path functions
> >(init/fini/add_event) and data-path (pmu_read) can't be called concurrently.
> >
> 
> Yes they are meant to be called from main thread.

OK, then please add that to the formal API comments.
 
> >> +__rte_experimental
> >> +int
> >> +rte_pmu_init(void);
> >> +

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-27 20:53                           ` Konstantin Ananyev
@ 2023-02-28  8:25                             ` Morten Brørup
  2023-02-28 12:04                               ` Konstantin Ananyev
  2023-02-28  9:57                             ` Tomasz Duszynski
  1 sibling, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2023-02-28  8:25 UTC (permalink / raw)
  To: Konstantin Ananyev, Tomasz Duszynski, Konstantin Ananyev, dev

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Monday, 27 February 2023 21.53
> 
> > >> Add support for programming PMU counters and reading their values in
> > >> runtime bypassing kernel completely.
> > >>
> > >> This is especially useful in cases where CPU cores are isolated i.e
> > >> run dedicated tasks. In such cases one cannot use standard perf
> > >> utility without sacrificing latency and performance.
> > >>
> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > >

[...]

> > >> +int
> > >> +__rte_pmu_enable_group(void)
> > >> +{
> > >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> > >> +	int ret;
> > >> +
> > >> +	if (rte_pmu.num_group_events == 0)
> > >> +		return -ENODEV;
> > >> +
> > >> +	ret = open_events(group);
> > >> +	if (ret)
> > >> +		goto out;
> > >> +
> > >> +	ret = mmap_events(group);
> > >> +	if (ret)
> > >> +		goto out;
> > >> +
> > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> > >> +		ret = -errno;
> > >> +		goto out;
> > >> +	}
> > >> +
> > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> > >> +		ret = -errno;
> > >> +		goto out;
> > >> +	}
> > >> +
> > >> +	rte_spinlock_lock(&rte_pmu.lock);
> > >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> > >
> > >Hmm.. so we insert pointer to TLS variable into the global list?
> > >Wonder what would happen if that thread get terminated?
> >
> > > Nothing special. Any pointers to that thread-local in that thread are invalidated.
> >
> > >Can memory from its TLS block get re-used (by other thread or for other
> purposes)?
> > >
> >
> > Why would any other thread reuse that?
> > Eventually main thread will need that data to do the cleanup.
> 
> I understand that main thread would need to access that data.
> I am not sure that it would be able to.
> Imagine thread calls rte_pmu_read(...) and then terminates, while program
> continues to run.

Is the example you describe here (i.e. a thread terminating in the middle of doing something) really a scenario DPDK is supposed to support?

> As I understand address of its RTE_PER_LCORE(_event_group) will still remain
> in rte_pmu.event_group_list,
> even if it is probably not valid any more.

There should be a "destructor/done/finish" function available to remove this from the list.
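
As a rough sketch of what such a hook could look like with plain pthreads: a thread-specific key whose destructor unlinks the per-thread group from the global list when the thread exits. The names group, group_register() and group_unregister() are hypothetical, a pthread mutex stands in for rte_spinlock, and this is not how the patch or EAL does it today.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

struct group {
	int enabled;
	TAILQ_ENTRY(group) next;
};

static TAILQ_HEAD(, group) group_list = TAILQ_HEAD_INITIALIZER(group_list);
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t group_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static __thread struct group tls_group;

/* Runs in the exiting thread, while its TLS block is still alive. */
static void
group_unregister(void *arg)
{
	struct group *g = arg;

	pthread_mutex_lock(&list_lock);
	TAILQ_REMOVE(&group_list, g, next);
	pthread_mutex_unlock(&list_lock);
	g->enabled = 0;
}

static void
make_key(void)
{
	pthread_key_create(&group_key, group_unregister);
}

/* Called by a thread the first time it uses the library. */
static void
group_register(void)
{
	pthread_once(&key_once, make_key);

	pthread_mutex_lock(&list_lock);
	TAILQ_INSERT_TAIL(&group_list, &tls_group, next);
	pthread_mutex_unlock(&list_lock);
	tls_group.enabled = 1;

	/* a non-NULL value arms the destructor for this thread */
	pthread_setspecific(group_key, &tls_group);
}

static void *
worker(void *arg)
{
	(void)arg;
	group_register();
	return NULL;	/* thread exit triggers group_unregister() */
}

/* Returns 1 if the list is empty after a worker has come and gone. */
static int
demo(void)
{
	pthread_t t;

	if (pthread_create(&t, NULL, worker, NULL) != 0)
		return -1;
	pthread_join(t, NULL);

	return TAILQ_EMPTY(&group_list);
}
```

In the real library the destructor would presumably also unmap the user pages and close the event descriptors, since nothing else could reach them once the entry is unlinked.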

[...]

> > >Even if we'd decide to keep rte_pmu_read() as static inline (still not
> > >sure it is a good idea),
> >
> > > We want to save as many CPU cycles as we possibly can and inlining does help
> > > in that matter.
> 
> Ok, so asking same question from different thread: how many cycles it will
> save?
> What is the difference in terms of performance when you have this function
> inlined vs not inlined?

We expect to use this in our in-house profiler library. For this reason, I have a very strong preference for absolute maximum performance.

Reading PMU events is for performance profiling, so I expect other potential users of the PMU library to share my opinion on this.


^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-27 20:53                           ` Konstantin Ananyev
  2023-02-28  8:25                             ` Morten Brørup
@ 2023-02-28  9:57                             ` Tomasz Duszynski
  2023-02-28 11:58                               ` Konstantin Ananyev
  1 sibling, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-02-28  9:57 UTC (permalink / raw)
  To: Konstantin Ananyev, Konstantin Ananyev, dev



>-----Original Message-----
>From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>Sent: Monday, February 27, 2023 9:53 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>;
>dev@dpdk.org
>Subject: RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
>
>
>
>> >> Add support for programming PMU counters and reading their values
>> >> in runtime bypassing kernel completely.
>> >>
>> >> This is especially useful in cases where CPU cores are isolated i.e
>> >> run dedicated tasks. In such cases one cannot use standard perf
>> >> utility without sacrificing latency and performance.
>> >>
>> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>> >
>> >Few more comments/questions below.
>> >
>> >
>> >> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
>> >> new file mode 100644
>> >> index 0000000000..950f999cb7
>> >> --- /dev/null
>> >> +++ b/lib/pmu/rte_pmu.c
>> >> @@ -0,0 +1,460 @@
>> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> + * Copyright(C) 2023 Marvell International Ltd.
>> >> + */
>> >> +
>> >> +#include <ctype.h>
>> >> +#include <dirent.h>
>> >> +#include <errno.h>
>> >> +#include <regex.h>
>> >> +#include <stdlib.h>
>> >> +#include <string.h>
>> >> +#include <sys/ioctl.h>
>> >> +#include <sys/mman.h>
>> >> +#include <sys/queue.h>
>> >> +#include <sys/syscall.h>
>> >> +#include <unistd.h>
>> >> +
>> >> +#include <rte_atomic.h>
>> >> +#include <rte_per_lcore.h>
>> >> +#include <rte_pmu.h>
>> >> +#include <rte_spinlock.h>
>> >> +#include <rte_tailq.h>
>> >> +
>> >> +#include "pmu_private.h"
>> >> +
>> >> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
>> >> +
>> >> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
>> >> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
>> >> +
>> >> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> >> +struct rte_pmu rte_pmu;
>> >> +
>> >> +/*
>> >> + * Following __rte_weak functions provide default no-op. Architectures should override them if
>> >> + * necessary.
>> >> + */
>> >> +
>> >> +int
>> >> +__rte_weak pmu_arch_init(void)
>> >> +{
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +void
>> >> +__rte_weak pmu_arch_fini(void)
>> >> +{
>> >> +}
>> >> +
>> >> +void
>> >> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
>> >> +{ }
>> >> +
>> >> +static int
>> >> +get_term_format(const char *name, int *num, uint64_t *mask) {
>> >> +	char path[PATH_MAX];
>> >> +	char *config = NULL;
>> >> +	int high, low, ret;
>> >> +	FILE *fp;
>> >> +
>> >> +	*num = *mask = 0;
>> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
>> >> +	fp = fopen(path, "r");
>> >> +	if (fp == NULL)
>> >> +		return -errno;
>> >> +
>> >> +	errno = 0;
>> >> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
>> >> +	if (ret < 2) {
>> >> +		ret = -ENODATA;
>> >> +		goto out;
>> >> +	}
>> >> +	if (errno) {
>> >> +		ret = -errno;
>> >> +		goto out;
>> >> +	}
>> >> +
>> >> +	if (ret == 2)
>> >> +		high = low;
>> >> +
>> >> +	*mask = GENMASK_ULL(high, low);
>> >> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
>> >> +	*num = config[strlen(config) - 1];
>> >> +	*num = isdigit(*num) ? *num - '0' : 0;
>> >> +
>> >> +	ret = 0;
>> >> +out:
>> >> +	free(config);
>> >> +	fclose(fp);
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static int
>> >> +parse_event(char *buf, uint64_t config[3]) {
>> >> +	char *token, *term;
>> >> +	int num, ret, val;
>> >> +	uint64_t mask;
>> >> +
>> >> +	config[0] = config[1] = config[2] = 0;
>> >> +
>> >> +	token = strtok(buf, ",");
>> >> +	while (token) {
>> >> +		errno = 0;
>> >> +		/* <term>=<value> */
>> >> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
>> >> +		if (ret < 1)
>> >> +			return -ENODATA;
>> >> +		if (errno)
>> >> +			return -errno;
>> >> +		if (ret == 1)
>> >> +			val = 1;
>> >> +
>> >> +		ret = get_term_format(term, &num, &mask);
>> >> +		free(term);
>> >> +		if (ret)
>> >> +			return ret;
>> >> +
>> >> +		config[num] |= FIELD_PREP(mask, val);
>> >> +		token = strtok(NULL, ",");
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +static int
>> >> +get_event_config(const char *name, uint64_t config[3]) {
>> >> +	char path[PATH_MAX], buf[BUFSIZ];
>> >> +	FILE *fp;
>> >> +	int ret;
>> >> +
>> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> >> +	fp = fopen(path, "r");
>> >> +	if (fp == NULL)
>> >> +		return -errno;
>> >> +
>> >> +	ret = fread(buf, 1, sizeof(buf), fp);
>> >> +	if (ret == 0) {
>> >> +		fclose(fp);
>> >> +
>> >> +		return -EINVAL;
>> >> +	}
>> >> +	fclose(fp);
>> >> +	buf[ret] = '\0';
>> >> +
>> >> +	return parse_event(buf, config);
>> >> +}
>> >> +
>> >> +static int
>> >> +do_perf_event_open(uint64_t config[3], int group_fd) {
>> >> +	struct perf_event_attr attr = {
>> >> +		.size = sizeof(struct perf_event_attr),
>> >> +		.type = PERF_TYPE_RAW,
>> >> +		.exclude_kernel = 1,
>> >> +		.exclude_hv = 1,
>> >> +		.disabled = 1,
>> >> +	};
>> >> +
>> >> +	pmu_arch_fixup_config(config);
>> >> +
>> >> +	attr.config = config[0];
>> >> +	attr.config1 = config[1];
>> >> +	attr.config2 = config[2];
>> >> +
>> >> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
>> >> +}
>> >> +
>> >> +static int
>> >> +open_events(struct rte_pmu_event_group *group) {
>> >> +	struct rte_pmu_event *event;
>> >> +	uint64_t config[3];
>> >> +	int num = 0, ret;
>> >> +
>> >> +	/* group leader gets created first, with fd = -1 */
>> >> +	group->fds[0] = -1;
>> >> +
>> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> >> +		ret = get_event_config(event->name, config);
>> >> +		if (ret)
>> >> +			continue;
>> >> +
>> >> +		ret = do_perf_event_open(config, group->fds[0]);
>> >> +		if (ret == -1) {
>> >> +			ret = -errno;
>> >> +			goto out;
>> >> +		}
>> >> +
>> >> +		group->fds[event->index] = ret;
>> >> +		num++;
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +out:
>> >> +	for (--num; num >= 0; num--) {
>> >> +		close(group->fds[num]);
>> >> +		group->fds[num] = -1;
>> >> +	}
>> >> +
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static int
>> >> +mmap_events(struct rte_pmu_event_group *group) {
>> >> +	long page_size = sysconf(_SC_PAGE_SIZE);
>> >> +	unsigned int i;
>> >> +	void *addr;
>> >> +	int ret;
>> >> +
>> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> >> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
>> >> +		if (addr == MAP_FAILED) {
>> >> +			ret = -errno;
>> >> +			goto out;
>> >> +		}
>> >> +
>> >> +		group->mmap_pages[i] = addr;
>> >> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
>> >> +			ret = -EPERM;
>> >> +			goto out;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +out:
>> >> +	for (; i; i--) {
>> >> +		munmap(group->mmap_pages[i - 1], page_size);
>> >> +		group->mmap_pages[i - 1] = NULL;
>> >> +	}
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static void
>> >> +cleanup_events(struct rte_pmu_event_group *group) {
>> >> +	unsigned int i;
>> >> +
>> >> +	if (group->fds[0] != -1)
>> >> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
>> >> +
>> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
>> >> +		if (group->mmap_pages[i]) {
>> >> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
>> >> +			group->mmap_pages[i] = NULL;
>> >> +		}
>> >> +
>> >> +		if (group->fds[i] != -1) {
>> >> +			close(group->fds[i]);
>> >> +			group->fds[i] = -1;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	group->enabled = false;
>> >> +}
>> >> +
>> >> +int
>> >> +__rte_pmu_enable_group(void)
>> >> +{
>> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>> >> +	int ret;
>> >> +
>> >> +	if (rte_pmu.num_group_events == 0)
>> >> +		return -ENODEV;
>> >> +
>> >> +	ret = open_events(group);
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	ret = mmap_events(group);
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
>> >> +		ret = -errno;
>> >> +		goto out;
>> >> +	}
>> >> +
>> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>> >> +		ret = -errno;
>> >> +		goto out;
>> >> +	}
>> >> +
>> >> +	rte_spinlock_lock(&rte_pmu.lock);
>> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
>> >
>> >Hmm.. so we insert pointer to TLS variable into the global list?
>> >Wonder what would happen if that thread get terminated?
>>
>> Nothing special. Any pointers to that thread-local in that thread are invalidated.
>>
>> >Can memory from its TLS block get re-used (by other thread or for other purposes)?
>> >
>>
>> Why would any other thread reuse that?
>> Eventually main thread will need that data to do the cleanup.
>
>I understand that main thread would need to access that data.
>I am not sure that it would be able to.
>Imagine thread calls rte_pmu_read(...) and then terminates, while program continues to run.
>As I understand address of its RTE_PER_LCORE(_event_group) will still remain in
>rte_pmu.event_group_list, even if it is probably not valid any more.
>

Okay, got your point. In DPDK that will not happen: we do not spawn or kill lcores at runtime.
In other scenarios such an approach will not work, because once a thread terminates its
per-thread data becomes invalid.

>> >
>> >> +	rte_spinlock_unlock(&rte_pmu.lock);
>> >> +	group->enabled = true;
>> >> +
>> >> +	return 0;
>> >> +
>> >> +out:
>> >> +	cleanup_events(group);
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +static int
>> >> +scan_pmus(void)
>> >> +{
>> >> +	char path[PATH_MAX];
>> >> +	struct dirent *dent;
>> >> +	const char *name;
>> >> +	DIR *dirp;
>> >> +
>> >> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
>> >> +	if (dirp == NULL)
>> >> +		return -errno;
>> >> +
>> >> +	while ((dent = readdir(dirp))) {
>> >> +		name = dent->d_name;
>> >> +		if (name[0] == '.')
>> >> +			continue;
>> >> +
>> >> +		/* sysfs entry should either contain cpus or be a cpu */
>> >> +		if (!strcmp(name, "cpu"))
>> >> +			break;
>> >> +
>> >> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
>> >> +		if (access(path, F_OK) == 0)
>> >> +			break;
>> >> +	}
>> >> +
>> >> +	if (dent) {
>> >> +		rte_pmu.name = strdup(name);
>> >> +		if (rte_pmu.name == NULL) {
>> >> +			closedir(dirp);
>> >> +
>> >> +			return -ENOMEM;
>> >> +		}
>> >> +	}
>> >> +
>> >> +	closedir(dirp);
>> >> +
>> >> +	return rte_pmu.name ? 0 : -ENODEV;
>> >> +}
>> >> +
>> >> +static struct rte_pmu_event *
>> >> +new_event(const char *name)
>> >> +{
>> >> +	struct rte_pmu_event *event;
>> >> +
>> >> +	event = calloc(1, sizeof(*event));
>> >> +	if (event == NULL)
>> >> +		goto out;
>> >> +
>> >> +	event->name = strdup(name);
>> >> +	if (event->name == NULL) {
>> >> +		free(event);
>> >> +		event = NULL;
>> >> +	}
>> >> +
>> >> +out:
>> >> +	return event;
>> >> +}
>> >> +
>> >> +static void
>> >> +free_event(struct rte_pmu_event *event) {
>> >> +	free(event->name);
>> >> +	free(event);
>> >> +}
>> >> +
>> >> +int
>> >> +rte_pmu_add_event(const char *name) {
>> >> +	struct rte_pmu_event *event;
>> >> +	char path[PATH_MAX];
>> >> +
>> >> +	if (rte_pmu.name == NULL)
>> >> +		return -ENODEV;
>> >> +
>> >> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
>> >> +		return -ENOSPC;
>> >> +
>> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
>> >> +	if (access(path, R_OK))
>> >> +		return -ENODEV;
>> >> +
>> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
>> >> +		if (!strcmp(event->name, name))
>> >> +			return event->index;
>> >> +		continue;
>> >> +	}
>> >> +
>> >> +	event = new_event(name);
>> >> +	if (event == NULL)
>> >> +		return -ENOMEM;
>> >> +
>> >> +	event->index = rte_pmu.num_group_events++;
>> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
>> >> +
>> >> +	return event->index;
>> >> +}
>> >> +
>> >> +int
>> >> +rte_pmu_init(void)
>> >> +{
>> >> +	int ret;
>> >> +
>> >> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
>> >> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
>> >> +	 * via command line but application doesn't care enough and performs init/fini again.
>> >> +	 */
>> >> +	if (rte_pmu.initialized != 0) {
>> >> +		rte_pmu.initialized++;
>> >> +		return 0;
>> >> +	}
>> >> +
>> >> +	ret = scan_pmus();
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	ret = pmu_arch_init();
>> >> +	if (ret)
>> >> +		goto out;
>> >> +
>> >> +	TAILQ_INIT(&rte_pmu.event_list);
>> >> +	TAILQ_INIT(&rte_pmu.event_group_list);
>> >> +	rte_spinlock_init(&rte_pmu.lock);
>> >> +	rte_pmu.initialized = 1;
>> >> +
>> >> +	return 0;
>> >> +out:
>> >> +	free(rte_pmu.name);
>> >> +	rte_pmu.name = NULL;
>> >> +
>> >> +	return ret;
>> >> +}
>> >> +
>> >> +void
>> >> +rte_pmu_fini(void)
>> >> +{
>> >> +	struct rte_pmu_event_group *group, *tmp_group;
>> >> +	struct rte_pmu_event *event, *tmp_event;
>> >> +
>> >> +	/* cleanup once init count drops to zero */
>> >> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
>> >> +		return;
>> >> +
>> >> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
>> >> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
>> >> +		free_event(event);
>> >> +	}
>> >> +
>> >> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
>> >> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
>> >> +		cleanup_events(group);
>> >> +	}
>> >> +
>> >> +	pmu_arch_fini();
>> >> +	free(rte_pmu.name);
>> >> +	rte_pmu.name = NULL;
>> >> +	rte_pmu.num_group_events = 0;
>> >> +}
>> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>> >> new file mode 100644
>> >> index 0000000000..6b664c3336
>> >> --- /dev/null
>> >> +++ b/lib/pmu/rte_pmu.h
>> >> @@ -0,0 +1,212 @@
>> >> +/* SPDX-License-Identifier: BSD-3-Clause
>> >> + * Copyright(c) 2023 Marvell
>> >> + */
>> >> +
>> >> +#ifndef _RTE_PMU_H_
>> >> +#define _RTE_PMU_H_
>> >> +
>> >> +/**
>> >> + * @file
>> >> + *
>> >> + * PMU event tracing operations
>> >> + *
>> >> + * This file defines generic API and types necessary to setup PMU and
>> >> + * read selected counters in runtime.
>> >> + */
>> >> +
>> >> +#ifdef __cplusplus
>> >> +extern "C" {
>> >> +#endif
>> >> +
>> >> +#include <linux/perf_event.h>
>> >> +
>> >> +#include <rte_atomic.h>
>> >> +#include <rte_branch_prediction.h>
>> >> +#include <rte_common.h>
>> >> +#include <rte_compat.h>
>> >> +#include <rte_spinlock.h>
>> >> +
>> >> +/** Maximum number of events in a group */
>> >> +#define MAX_NUM_GROUP_EVENTS 8
>> >
>> >forgot RTE_ prefix.
>> >In fact, do you really need number of events in group to be hard-coded?
>> >Couldn't mmap_pages[] and fds[] be allocated dynamically by enable_group()?
>> >
>>
>> 8 is reasonable number I think. X86/ARM have actually less than that (was that something like 4?).
>> Moreover events are scheduled as a group so there must be enough hw
>> counters available for that to succeed. So this number should cover current needs.
>
>If you think 8 will be enough to cover all possible future cases - I am ok either way.
>Still need RTE_ prefix for it.
>

Okay that can be added. 

>> >> +
>> >> +/**
>> >> + * A structure describing a group of events.
>> >> + */
>> >> +struct rte_pmu_event_group {
>> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> >> +	bool enabled; /**< true if group was enabled on particular lcore */
>> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>> >> +} __rte_cache_aligned;
>> >> +
>> >
>> >Even if we'd decide to keep rte_pmu_read() as static inline (still
>> >not sure it is a good idea),
>>
>> We want to save as many cpu cycles as we possibly can and inlining
>> does help in that matter.
>
>Ok, so asking same question from different thread: how many cycles it will save?
>What is the difference in terms of performance when you have this function inlined vs not inlined?
>

On an x86 setup which is not under load, with no cpusets configured, etc., *just* not inlining rte_pmu_read()
decreases performance by roughly 24% (44 cpu cycles inlined vs 58 cpu cycles not inlined). At least that is
what trace_perf_autotest reports.
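
For anyone who wants to try a similar comparison outside of trace_perf_autotest, a minimal sketch follows. Everything in it (demo_counter, demo_read_inline(), demo_read_call() and the demo_time_*() helpers) is hypothetical and only stands in for the real read path. It uses clock() rather than cycle counters, so its numbers will not be comparable to the figures above.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a hardware counter; volatile keeps every read. */
static volatile uint64_t demo_counter = 1;

/* Variant that the compiler must inline into the caller. */
static inline __attribute__((always_inline)) uint64_t
demo_read_inline(void)
{
	return demo_counter;
}

/* Variant that always costs a real function call. */
static __attribute__((noinline)) uint64_t
demo_read_call(void)
{
	return demo_counter;
}

/* Time iters inlined reads; returns elapsed clock() ticks, sum in *sum. */
static clock_t
demo_time_inline(uint64_t iters, uint64_t *sum)
{
	clock_t start = clock();
	uint64_t i, acc = 0;

	for (i = 0; i < iters; i++)
		acc += demo_read_inline();
	*sum = acc;

	return clock() - start;
}

/* Same loop, but going through the out-of-line variant. */
static clock_t
demo_time_call(uint64_t iters, uint64_t *sum)
{
	clock_t start = clock();
	uint64_t i, acc = 0;

	for (i = 0; i < iters; i++)
		acc += demo_read_call();
	*sum = acc;

	return clock() - start;
}
```

Comparing the two returned tick counts over a large iteration count gives a rough idea of the call overhead on a given machine; the sums are returned only so that the loops cannot be optimized away.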


>> >why these two struct below (rte_pmu_event and rte_pmu) have to be public?
>> >I think both can be safely moved away from public headers.
>> >
>>
>> struct rte_pmu_event can be hidden I guess.
>> struct rte_pmu is used in this header hence cannot be moved elsewhere.
>
>Not sure why?
>Is that because you use it inside rte_pmu_read()?
>But that check I think can be safely moved into __rte_pmu_enable_group() or probably even into
>rte_pmu_add_event().

No, we should not do that. Otherwise we'll need to call a function. Even though the check would happen
early on, the function prologue/epilogue would still run. That takes cycles.

>
>> >
>> >> +/**
>> >> + * A structure describing an event.
>> >> + */
>> >> +struct rte_pmu_event {
>> >> +	char *name; /**< name of an event */
>> >> +	unsigned int index; /**< event index into fds/mmap_pages */
>> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */ };
>> >
>> >> +
>> >> +/**
>> >> + * A PMU state container.
>> >> + */
>> >> +struct rte_pmu {
>> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
>> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>> >> +	unsigned int num_group_events; /**< number of events in a group */
>> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>> >> +	unsigned int initialized; /**< initialization counter */ };
>> >> +
>> >> +/** lcore event group */
>> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>> >> +
>> >> +/** PMU state container */
>> >> +extern struct rte_pmu rte_pmu;
>> >> +
>> >> +/** Each architecture supporting PMU needs to provide its own version */
>> >> +#ifndef rte_pmu_pmc_read
>> >> +#define rte_pmu_pmc_read(index) ({ 0; })
>> >> +#endif
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Read PMU counter.
>> >> + *
>> >> + * @warning This should be not called directly.
>> >> + *
>> >> + * @param pc
>> >> + *   Pointer to the mmapped user page.
>> >> + * @return
>> >> + *   Counter value read from hardware.
>> >> + */
>> >> +static __rte_always_inline uint64_t
>> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> >> +{
>> >> +	uint64_t width, offset;
>> >> +	uint32_t seq, index;
>> >> +	int64_t pmc;
>> >> +
>> >> +	for (;;) {
>> >> +		seq = pc->lock;
>> >> +		rte_compiler_barrier();
>> >> +		index = pc->index;
>> >> +		offset = pc->offset;
>> >> +		width = pc->pmc_width;
>> >> +
>> >> +		/* index set to 0 means that particular counter cannot be used */
>> >> +		if (likely(pc->cap_user_rdpmc && index)) {
>> >
>> >In mmap_events() you return EPERM if cap_user_rdpmc is not enabled.
>> >Do you need another check here? Or this capability can be disabled by
>> >kernel at run-time?
>> >
>>
>> That extra check in mmap_event() may be removed actually. Some archs
>> allow disabling reading rdpmc (I think that on x86 one can do that) so this check needs to stay.
>>
>> >
>> >> +			pmc = rte_pmu_pmc_read(index - 1);
>> >> +			pmc <<= 64 - width;
>> >> +			pmc >>= 64 - width;
>> >> +			offset += pmc;
>> >> +		}
>> >> +
>> >> +		rte_compiler_barrier();
>> >> +
>> >> +		if (likely(pc->lock == seq))
>> >> +			return offset;
>> >> +	}
>> >> +
>> >> +	return 0;
>> >> +}
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Enable group of events on the calling lcore.
>> >> + *
>> >> + * @warning This should be not called directly.
>> >
>> >__rte_internal ?
>> >
>>
>> No this cannot be internal because that will make functions calling it
>> internal as well hence apps won't be able to use that. This has
>> already been brought up by one of the reviewers.
>
>Ok, then we probably can mark it with ' @internal' tag in formal comments?
>

I added a warning not to call that directly. Since the function is not internal (in DPDK parlance) per se,
I don't think we should add more confusion with that extra tag.

>>
>> >> + *
>> >> + * @return
>> >> + *   0 in case of success, negative value otherwise.
>> >> + */
>> >> +__rte_experimental
>> >> +int
>> >> +__rte_pmu_enable_group(void);
>> >> +
>> >> +/**
>> >> + * @warning
>> >> + * @b EXPERIMENTAL: this API may change without prior notice
>> >> + *
>> >> + * Initialize PMU library.
>> >> + *
>> >> + * @warning This should be not called directly.
>> >
>> >Hmm.. then who should call it?
>> >If it not supposed to be called directly, why to declare it here?
>> >
>>
>> This is inlined and has one caller i.e rte_pmu_read().
>
>I thought we are talking here about rte_pmu_init().
>I don't see where it is inlined and still not clear why it can't be called directly.
>

No, this cannot be called by init because groups are configured at runtime. That is why
__rte_pmu_enable_group() is called once in rte_pmu_read().

*Other* code should not call that directly. And yes, that is not inlined - my mistake.
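
To illustrate the pattern described here (an inlined read that checks state itself and only calls out of line the first time a thread reads), a minimal sketch follows. It is not the patch code: demo_initialized, demo_tls_group, demo_enable_group() and demo_read() are hypothetical names, and the counter is faked with a plain increment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread state, standing in for the per-lcore event group. */
struct demo_group {
	bool enabled;
	uint64_t fake_counter;
};

static _Thread_local struct demo_group demo_tls_group;
static unsigned int demo_initialized;  /* stands in for the library init count */
static unsigned int demo_enable_calls; /* how often the slow path was taken */

/* Slow path, deliberately kept out of line: runs once per thread. */
static __attribute__((noinline)) int
demo_enable_group(void)
{
	demo_enable_calls++;
	demo_tls_group.enabled = true;

	return 0;
}

/* Fast path: both checks are inlined, so the common case makes no call. */
static inline uint64_t
demo_read(void)
{
	if (__builtin_expect(demo_initialized == 0, 0))
		return 0;

	if (__builtin_expect(!demo_tls_group.enabled, 0)) {
		if (demo_enable_group() != 0)
			return 0;
	}

	return ++demo_tls_group.fake_counter;
}
```

In this sketch, moving either check into demo_enable_group() would force a call on every read, which is the cost being argued about above.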

>> >> + *
>> >> + * @return
>> >> + *   0 in case of success, negative value otherwise.
>> >> + */
>> >
>> >Probably worth to mention that this function is not MT safe.
>> >Same for _fini_ and add_event.
>> >Also worth to mention that all control-path functions
>> >(init/fini/add_event) and data-path (pmu_read) can't be called concurrently.
>> >
>>
>> Yes they are meant to be called from main thread.
>
>Ok, then please add that into formal API comments.
>
>> >> +__rte_experimental
>> >> +int
>> >> +rte_pmu_init(void);
>> >> +


* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-27  8:12                                         ` Tomasz Duszynski
@ 2023-02-28 11:35                                           ` Konstantin Ananyev
  0 siblings, 0 replies; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-28 11:35 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev



> >>>>>>>>>> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> >>>>>>>>>> new file mode 100644
> >>>>>>>>>> index 0000000000..6b664c3336
> >>>>>>>>>> --- /dev/null
> >>>>>>>>>> +++ b/lib/pmu/rte_pmu.h
> >>>>>>>>>> @@ -0,0 +1,212 @@
> >>>>>>>>>> +/* SPDX-License-Identifier: BSD-3-Clause
> >>>>>>>>>> + * Copyright(c) 2023 Marvell
> >>>>>>>>>> + */
> >>>>>>>>>> +
> >>>>>>>>>> +#ifndef _RTE_PMU_H_
> >>>>>>>>>> +#define _RTE_PMU_H_
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * @file
> >>>>>>>>>> + *
> >>>>>>>>>> + * PMU event tracing operations
> >>>>>>>>>> + *
> >>>>>>>>>> + * This file defines generic API and types necessary to setup PMU and
> >>>>>>>>>> + * read selected counters in runtime.
> >>>>>>>>>> + */
> >>>>>>>>>> +
> >>>>>>>>>> +#ifdef __cplusplus
> >>>>>>>>>> +extern "C" {
> >>>>>>>>>> +#endif
> >>>>>>>>>> +
> >>>>>>>>>> +#include <linux/perf_event.h>
> >>>>>>>>>> +
> >>>>>>>>>> +#include <rte_atomic.h>
> >>>>>>>>>> +#include <rte_branch_prediction.h>
> >>>>>>>>>> +#include <rte_common.h>
> >>>>>>>>>> +#include <rte_compat.h>
> >>>>>>>>>> +#include <rte_spinlock.h>
> >>>>>>>>>> +
> >>>>>>>>>> +/** Maximum number of events in a group */
> >>>>>>>>>> +#define MAX_NUM_GROUP_EVENTS 8
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * A structure describing a group of events.
> >>>>>>>>>> + */
> >>>>>>>>>> +struct rte_pmu_event_group {
> >>>>>>>>>> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >>>>>>>>>> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >>>>>>>>>> +	bool enabled; /**< true if group was enabled on particular lcore */
> >>>>>>>>>> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> >>>>>>>>>> +} __rte_cache_aligned;
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * A structure describing an event.
> >>>>>>>>>> + */
> >>>>>>>>>> +struct rte_pmu_event {
> >>>>>>>>>> +	char *name; /**< name of an event */
> >>>>>>>>>> +	unsigned int index; /**< event index into fds/mmap_pages */
> >>>>>>>>>> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */ };
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * A PMU state container.
> >>>>>>>>>> + */
> >>>>>>>>>> +struct rte_pmu {
> >>>>>>>>>> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> >>>>>>>>>> +	rte_spinlock_t lock; /**< serialize access to event group list */
> >>>>>>>>>> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> >>>>>>>>>> +	unsigned int num_group_events; /**< number of events in a group */
> >>>>>>>>>> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >>>>>>>>>> +	unsigned int initialized; /**< initialization counter */ };
> >>>>>>>>>> +
> >>>>>>>>>> +/** lcore event group */
> >>>>>>>>>> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >>>>>>>>>> +
> >>>>>>>>>> +/** PMU state container */
> >>>>>>>>>> +extern struct rte_pmu rte_pmu;
> >>>>>>>>>> +
> >>>>>>>>>> +/** Each architecture supporting PMU needs to provide its own version */
> >>>>>>>>>> +#ifndef rte_pmu_pmc_read
> >>>>>>>>>> +#define rte_pmu_pmc_read(index) ({ 0; })
> >>>>>>>>>> +#endif
> >>>>>>>>>> +
> >>>>>>>>>> +/**
> >>>>>>>>>> + * @warning
> >>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
> >>>>>>>>>> + *
> >>>>>>>>>> + * Read PMU counter.
> >>>>>>>>>> + *
> >>>>>>>>>> + * @warning This should be not called directly.
> >>>>>>>>>> + *
> >>>>>>>>>> + * @param pc
> >>>>>>>>>> + *   Pointer to the mmapped user page.
> >>>>>>>>>> + * @return
> >>>>>>>>>> + *   Counter value read from hardware.
> >>>>>>>>>> + */
> >>>>>>>>>> +static __rte_always_inline uint64_t
> >>>>>>>>>> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >>>>>>>>>> +{
> >>>>>>>>>> +	uint64_t width, offset;
> >>>>>>>>>> +	uint32_t seq, index;
> >>>>>>>>>> +	int64_t pmc;
> >>>>>>>>>> +
> >>>>>>>>>> +	for (;;) {
> >>>>>>>>>> +		seq = pc->lock;
> >>>>>>>>>> +		rte_compiler_barrier();
> >>>>>>>>>
> >>>>>>>>> Are you sure that compiler_barrier() is enough here?
> >>>>>>>>> On some archs CPU itself has freedom to re-order reads.
> >>>>>>>>> Or I am missing something obvious here?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> It's a matter of not keeping old stuff cached in registers and
> >>>>>>>> making sure that we have two reads of lock. CPU reordering won't
> >>>>>>>> do any harm here.
> >>>>>>>
> >>>>>>> Sorry, I didn't get you here:
> >>>>>>> Suppose CPU will re-order reads and will read lock *after* index or offset value.
> >>>>>>> Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
> >>>>>>>
> >>>>>>
> >>>>>> This number is just an indicator whether kernel did change something or not.
> >>>>>
> >>>>> You are talking about pc->lock, right?
> >>>>> Yes, I do understand that it is sort of seqlock.
> >>>>> That's why I am puzzled why we do not care about possible cpu read-reordering.
> >>>>> Manual for perf_event_open() also has a code snippet with compiler barrier only...
> >>>>>
> >>>>>> If cpu reordering will come into play then this will not change
> >>>>>> anything from pov of this loop.
> >>>>>> All we want is fresh data when needed and no involvement of
> >>>>>> compiler when it comes to reordering code.
> >>>>>
> >>>>> Ok, can you probably explain to me why the following could not happen:
> >>>>> T0:
> >>>>> pc->seqlock==0; pc->index==I1; pc->offset==O1;
> >>>>> T1:
> >>>>>       cpu #0 read pmu (due to cpu read reorder, we get index value before seqlock):
> >>>>>        index=pc->index;  //index==I1;
> >>>>> T2:
> >>>>>       cpu #1 kernel perf_event_update_userpage:
> >>>>>       pc->lock++; // pc->lock==1
> >>>>>       pc->index=I2;
> >>>>>       pc->offset=O2;
> >>>>>       ...
> >>>>>       pc->lock++; //pc->lock==2
> >>>>> T3:
> >>>>>       cpu #0 continue with read pmu:
> >>>>>       seq=pc->lock; //seq == 2
> >>>>>        offset=pc->offset; // offset == O2
> >>>>>        ....
> >>>>>        pmc = rte_pmu_pmc_read(index - 1);  // Note that we read at I1, not I2
> >>>>>        offset += pmc; //offset == O2 + pmcread(I1-1);
> >>>>>        if (pc->lock == seq) // they are equal, return
> >>>>>              return offset;
> >>>>>
> >>>>> Or, it can happen, but by some reason we don't care much?
> >>>>>
> >>>>
> >>>> This code does self-monitoring and user page (whole group actually)
> >>>> is per thread running on current cpu. Hence I am not sure what are
> >>>> you trying to prove with that example.
> >>>
> >>> I am not trying to prove anything so far.
> >>> I am asking is such situation possible or not, and if not, why?
> >>> My current understanding (possibly wrong) is that after you mmaped
> >>> these pages, kernel still can asynchronously update them.
> >>> So, when reading the data from these pages you have to check 'lock'
> >>> value before and after accessing other data.
> >>> If so, why possible cpu read-reordering doesn't matter?
> >>>
> >>
> >> Look. I'll reiterate that.
> >>
> >> 1. That user page/group/PMU config is per process. Other processes do not access that.
> >
> >Ok, that's clear.
> >
> >
> >>     All this happens on the very same CPU where current thread is running.
> >
> >Ok... but can't this page be updated by kernel thread running simultaneously on different CPU?
> >
> 
> I already pointed out that event/counter configuration is bound to current cpu. How can possibly
> other cpu update that configuration? This cannot work.
 
Can you elaborate a bit on what you mean by 'event/counter configuration is bound to current cpu'?
If that means it can be updated only by code running on the given CPU - yes, that is clear.
But can this page be read by user-space from a different CPU?
Or do you just assume that your user-space thread will *always* be bound to one
particular CPU and will never switch?
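
To make the concern concrete: a generic seqlock reader that also orders the loads at CPU level (not just at compiler level) could be sketched with C11 atomics as below. This is neither what the patch does nor what the perf_event_open() man page shows; struct demo_page, demo_write() and demo_read() are hypothetical names, and whether such fences are actually needed for the perf user page is exactly the open question here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical page layout, loosely modelled on lock/index/offset. */
struct demo_page {
	_Atomic uint32_t lock;   /* sequence counter, odd while being updated */
	_Atomic uint32_t index;
	_Atomic uint64_t offset;
};

/* Writer side: bump the sequence, publish the data, bump it again. */
static void
demo_write(struct demo_page *p, uint32_t index, uint64_t offset)
{
	uint32_t seq = atomic_load_explicit(&p->lock, memory_order_relaxed);

	atomic_store_explicit(&p->lock, seq + 1, memory_order_relaxed);
	/* keep the data stores below from moving before the first bump */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&p->index, index, memory_order_relaxed);
	atomic_store_explicit(&p->offset, offset, memory_order_relaxed);
	/* release store: data is visible before the sequence becomes even */
	atomic_store_explicit(&p->lock, seq + 2, memory_order_release);
}

/* Reader side: retry until a stable, even sequence brackets the data loads. */
static void
demo_read(struct demo_page *p, uint32_t *index, uint64_t *offset)
{
	uint32_t seq;

	do {
		/* acquire load: data loads cannot be hoisted above it */
		seq = atomic_load_explicit(&p->lock, memory_order_acquire);
		*index = atomic_load_explicit(&p->index, memory_order_relaxed);
		*offset = atomic_load_explicit(&p->offset, memory_order_relaxed);
		/* data loads cannot sink below the second sequence load */
		atomic_thread_fence(memory_order_acquire);
	} while ((seq & 1) ||
		 atomic_load_explicit(&p->lock, memory_order_relaxed) != seq);
}
```

With this ordering, the T1..T3 interleaving above cannot return a mix of old index and new offset: the index load is not allowed to happen before the first sequence load.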

> 
> 
> If you think that there's some problem with the code (or is simply broken on your setup) and logic[] a bit
> has obvious flaw and you can provide meaningful evidence of that then I'd be more than happy to
> apply that fix. Otherwise that discussion will get us nowhere.
> 

Yes, we are going in circles here.
I keep asking you the same questions about library function internals, and you keep refusing to explain things
to me, insisting that it is 'way too obvious'.
Well, sorry, but it is not obvious to me.
So I still insist that a clearly documented internal design and expected usage are required for that patch
before it can be accepted.

Konstantin 


* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-28  9:57                             ` Tomasz Duszynski
@ 2023-02-28 11:58                               ` Konstantin Ananyev
  0 siblings, 0 replies; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-28 11:58 UTC (permalink / raw)
  To: Tomasz Duszynski, Konstantin Ananyev, dev



> >> >> Add support for programming PMU counters and reading their values
> >> >> in runtime bypassing kernel completely.
> >> >>
> >> >> This is especially useful in cases where CPU cores are isolated i.e
> >> >> run dedicated tasks. In such cases one cannot use standard perf
> >> >> utility without sacrificing latency and performance.
> >> >>
> >> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> >> >
> >> >Few more comments/questions below.
> >> >
> >> >
> >> >> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
> >> >> new file mode 100644
> >> >> index 0000000000..950f999cb7
> >> >> --- /dev/null
> >> >> +++ b/lib/pmu/rte_pmu.c
> >> >> @@ -0,0 +1,460 @@
> >> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> >> + * Copyright(C) 2023 Marvell International Ltd.
> >> >> + */
> >> >> +
> >> >> +#include <ctype.h>
> >> >> +#include <dirent.h>
> >> >> +#include <errno.h>
> >> >> +#include <regex.h>
> >> >> +#include <stdlib.h>
> >> >> +#include <string.h>
> >> >> +#include <sys/ioctl.h>
> >> >> +#include <sys/mman.h>
> >> >> +#include <sys/queue.h>
> >> >> +#include <sys/syscall.h>
> >> >> +#include <unistd.h>
> >> >> +
> >> >> +#include <rte_atomic.h>
> >> >> +#include <rte_per_lcore.h>
> >> >> +#include <rte_pmu.h>
> >> >> +#include <rte_spinlock.h>
> >> >> +#include <rte_tailq.h>
> >> >> +
> >> >> +#include "pmu_private.h"
> >> >> +
> >> >> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> >> >> +
> >> >> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> >> >> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> >> >> +
> >> >> +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >> >> +struct rte_pmu rte_pmu;
> >> >> +
> >> >> +/*
> >> >> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> >> >> + * necessary.
> >> >> + */
> >> >> +
> >> >> +int
> >> >> +__rte_weak pmu_arch_init(void)
> >> >> +{
> >> >> +	return 0;
> >> >> +}
> >> >> +
> >> >> +void
> >> >> +__rte_weak pmu_arch_fini(void)
> >> >> +{
> >> >> +}
> >> >> +
> >> >> +void
> >> >> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
> >> >> +{ }
> >> >> +
> >> >> +static int
> >> >> +get_term_format(const char *name, int *num, uint64_t *mask) {
> >> >> +	char path[PATH_MAX];
> >> >> +	char *config = NULL;
> >> >> +	int high, low, ret;
> >> >> +	FILE *fp;
> >> >> +
> >> >> +	*num = *mask = 0;
> >> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
> >> >> +	fp = fopen(path, "r");
> >> >> +	if (fp == NULL)
> >> >> +		return -errno;
> >> >> +
> >> >> +	errno = 0;
> >> >> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> >> >> +	if (ret < 2) {
> >> >> +		ret = -ENODATA;
> >> >> +		goto out;
> >> >> +	}
> >> >> +	if (errno) {
> >> >> +		ret = -errno;
> >> >> +		goto out;
> >> >> +	}
> >> >> +
> >> >> +	if (ret == 2)
> >> >> +		high = low;
> >> >> +
> >> >> +	*mask = GENMASK_ULL(high, low);
> >> >> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> >> >> +	*num = config[strlen(config) - 1];
> >> >> +	*num = isdigit(*num) ? *num - '0' : 0;
> >> >> +
> >> >> +	ret = 0;
> >> >> +out:
> >> >> +	free(config);
> >> >> +	fclose(fp);
> >> >> +
> >> >> +	return ret;
> >> >> +}
> >> >> +
> >> >> +static int
> >> >> +parse_event(char *buf, uint64_t config[3]) {
> >> >> +	char *token, *term;
> >> >> +	int num, ret, val;
> >> >> +	uint64_t mask;
> >> >> +
> >> >> +	config[0] = config[1] = config[2] = 0;
> >> >> +
> >> >> +	token = strtok(buf, ",");
> >> >> +	while (token) {
> >> >> +		errno = 0;
> >> >> +		/* <term>=<value> */
> >> >> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> >> >> +		if (ret < 1)
> >> >> +			return -ENODATA;
> >> >> +		if (errno)
> >> >> +			return -errno;
> >> >> +		if (ret == 1)
> >> >> +			val = 1;
> >> >> +
> >> >> +		ret = get_term_format(term, &num, &mask);
> >> >> +		free(term);
> >> >> +		if (ret)
> >> >> +			return ret;
> >> >> +
> >> >> +		config[num] |= FIELD_PREP(mask, val);
> >> >> +		token = strtok(NULL, ",");
> >> >> +	}
> >> >> +
> >> >> +	return 0;
> >> >> +}
> >> >> +
> >> >> +static int
> >> >> +get_event_config(const char *name, uint64_t config[3]) {
> >> >> +	char path[PATH_MAX], buf[BUFSIZ];
> >> >> +	FILE *fp;
> >> >> +	int ret;
> >> >> +
> >> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> >> >> +	fp = fopen(path, "r");
> >> >> +	if (fp == NULL)
> >> >> +		return -errno;
> >> >> +
> >> >> +	ret = fread(buf, 1, sizeof(buf), fp);
> >> >> +	if (ret == 0) {
> >> >> +		fclose(fp);
> >> >> +
> >> >> +		return -EINVAL;
> >> >> +	}
> >> >> +	fclose(fp);
> >> >> +	buf[ret] = '\0';
> >> >> +
> >> >> +	return parse_event(buf, config);
> >> >> +}
> >> >> +
> >> >> +static int
> >> >> +do_perf_event_open(uint64_t config[3], int group_fd) {
> >> >> +	struct perf_event_attr attr = {
> >> >> +		.size = sizeof(struct perf_event_attr),
> >> >> +		.type = PERF_TYPE_RAW,
> >> >> +		.exclude_kernel = 1,
> >> >> +		.exclude_hv = 1,
> >> >> +		.disabled = 1,
> >> >> +	};
> >> >> +
> >> >> +	pmu_arch_fixup_config(config);
> >> >> +
> >> >> +	attr.config = config[0];
> >> >> +	attr.config1 = config[1];
> >> >> +	attr.config2 = config[2];
> >> >> +
> >> >> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0); }
> >> >> +
> >> >> +static int
> >> >> +open_events(struct rte_pmu_event_group *group) {
> >> >> +	struct rte_pmu_event *event;
> >> >> +	uint64_t config[3];
> >> >> +	int num = 0, ret;
> >> >> +
> >> >> +	/* group leader gets created first, with fd = -1 */
> >> >> +	group->fds[0] = -1;
> >> >> +
> >> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> >> >> +		ret = get_event_config(event->name, config);
> >> >> +		if (ret)
> >> >> +			continue;
> >> >> +
> >> >> +		ret = do_perf_event_open(config, group->fds[0]);
> >> >> +		if (ret == -1) {
> >> >> +			ret = -errno;
> >> >> +			goto out;
> >> >> +		}
> >> >> +
> >> >> +		group->fds[event->index] = ret;
> >> >> +		num++;
> >> >> +	}
> >> >> +
> >> >> +	return 0;
> >> >> +out:
> >> >> +	for (--num; num >= 0; num--) {
> >> >> +		close(group->fds[num]);
> >> >> +		group->fds[num] = -1;
> >> >> +	}
> >> >> +
> >> >> +
> >> >> +	return ret;
> >> >> +}
> >> >> +
> >> >> +static int
> >> >> +mmap_events(struct rte_pmu_event_group *group) {
> >> >> +	long page_size = sysconf(_SC_PAGE_SIZE);
> >> >> +	unsigned int i;
> >> >> +	void *addr;
> >> >> +	int ret;
> >> >> +
> >> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> >> >> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
> >> >> +		if (addr == MAP_FAILED) {
> >> >> +			ret = -errno;
> >> >> +			goto out;
> >> >> +		}
> >> >> +
> >> >> +		group->mmap_pages[i] = addr;
> >> >> +		if (!group->mmap_pages[i]->cap_user_rdpmc) {
> >> >> +			ret = -EPERM;
> >> >> +			goto out;
> >> >> +		}
> >> >> +	}
> >> >> +
> >> >> +	return 0;
> >> >> +out:
> >> >> +	for (; i; i--) {
> >> >> +		munmap(group->mmap_pages[i - 1], page_size);
> >> >> +		group->mmap_pages[i - 1] = NULL;
> >> >> +	}
> >> >> +
> >> >> +	return ret;
> >> >> +}
> >> >> +
> >> >> +static void
> >> >> +cleanup_events(struct rte_pmu_event_group *group) {
> >> >> +	unsigned int i;
> >> >> +
> >> >> +	if (group->fds[0] != -1)
> >> >> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> >> >> +
> >> >> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> >> >> +		if (group->mmap_pages[i]) {
> >> >> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
> >> >> +			group->mmap_pages[i] = NULL;
> >> >> +		}
> >> >> +
> >> >> +		if (group->fds[i] != -1) {
> >> >> +			close(group->fds[i]);
> >> >> +			group->fds[i] = -1;
> >> >> +		}
> >> >> +	}
> >> >> +
> >> >> +	group->enabled = false;
> >> >> +}
> >> >> +
> >> >> +int
> >> >> +__rte_pmu_enable_group(void)
> >> >> +{
> >> >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> >> >> +	int ret;
> >> >> +
> >> >> +	if (rte_pmu.num_group_events == 0)
> >> >> +		return -ENODEV;
> >> >> +
> >> >> +	ret = open_events(group);
> >> >> +	if (ret)
> >> >> +		goto out;
> >> >> +
> >> >> +	ret = mmap_events(group);
> >> >> +	if (ret)
> >> >> +		goto out;
> >> >> +
> >> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> >> >> +		ret = -errno;
> >> >> +		goto out;
> >> >> +	}
> >> >> +
> >> >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> >> >> +		ret = -errno;
> >> >> +		goto out;
> >> >> +	}
> >> >> +
> >> >> +	rte_spinlock_lock(&rte_pmu.lock);
> >> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> >> >
> >> >Hmm.. so we insert pointer to TLS variable into the global list?
> >> >Wonder what would happen if that thread get terminated?
> >>
> >> Nothing special. Any pointers to that thread-local in that thread are invalidated.
> >>
> >> >Can memory from its TLS block get re-used (by other thread or for other purposes)?
> >> >
> >>
> >> Why would any other thread reuse that?
> >> Eventually main thread will need that data to do the cleanup.
> >
> >I understand that main thread would need to access that data.
> >I am not sure that it would be able to.
> >Imagine thread calls rte_pmu_read(...) and then terminates, while program continues to run.
> >As I understand address of its RTE_PER_LCORE(_event_group) will still remain in
> >rte_pmu.event_group_list, even if it is probably not valid any more.
> >
> 
> Okay got your point. In DPDK that will not happen. We do not spawn/kill lcores in runtime.

Well, yes, usually a DPDK app doesn't do that, but in theory there is an API to register/unregister
non-EAL threads as lcores: rte_thread_register()/rte_thread_unregister().
Besides lcores there are also control threads and some house-keeping threads, plus the user is free
to spawn/kill his own threads.
Are you saying that this library doesn't support any of them?
If so, then at least that should be very clearly documented.
Though I think the proper way is to handle this situation somehow -
either return an error from __rte_pmu_enable_group(), or change the code to allow it to work
properly from any thread. I don't think it is that hard.
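
One possible way to make this work from any thread, sketched here purely as an illustration and not as a proposal for the actual patch, is to heap-allocate the per-thread group and let a pthread key destructor unlink it from the global list when the thread exits. All demo_* names below are hypothetical; a real version would also have to close the fds and unmap the pages in the destructor.

```c
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical per-thread group; fds and mmap pages are omitted. */
struct demo_group {
	TAILQ_ENTRY(demo_group) next;
};

static TAILQ_HEAD(, demo_group) demo_groups = TAILQ_HEAD_INITIALIZER(demo_groups);
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static pthread_key_t demo_key;

/* Runs automatically at thread exit for threads that registered a group. */
static void
demo_group_release(void *arg)
{
	struct demo_group *group = arg;

	pthread_mutex_lock(&demo_lock);
	TAILQ_REMOVE(&demo_groups, group, next);
	pthread_mutex_unlock(&demo_lock);
	free(group);
}

static void
demo_key_init(void)
{
	/* return value ignored for brevity in this sketch */
	pthread_key_create(&demo_key, demo_group_release);
}

/* Register the calling thread's group once; safe to call repeatedly. */
static int
demo_group_register(void)
{
	struct demo_group *group;

	pthread_once(&demo_once, demo_key_init);
	if (pthread_getspecific(demo_key) != NULL)
		return 0;

	group = calloc(1, sizeof(*group));
	if (group == NULL)
		return -ENOMEM;
	if (pthread_setspecific(demo_key, group) != 0) {
		free(group);
		return -ENOMEM;
	}

	pthread_mutex_lock(&demo_lock);
	TAILQ_INSERT_TAIL(&demo_groups, group, next);
	pthread_mutex_unlock(&demo_lock);

	return 0;
}

/* Number of groups currently linked into the global list. */
static unsigned int
demo_group_count(void)
{
	struct demo_group *group;
	unsigned int n = 0;

	pthread_mutex_lock(&demo_lock);
	TAILQ_FOREACH(group, &demo_groups, next)
		n++;
	pthread_mutex_unlock(&demo_lock);

	return n;
}

/* Thread body for the usage below: registers, then simply exits. */
static void *
demo_worker(void *arg)
{
	*(int *)arg = demo_group_register();

	return NULL;
}
```

A thread that runs demo_worker() and terminates leaves no stale pointer behind: once pthread_join() returns, its group is gone from demo_groups.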

> In other scenarios such approach will not work because once a thread terminates its per-thread data
> becomes invalid.
> 
> >> >
> >> >> +	rte_spinlock_unlock(&rte_pmu.lock);
> >> >> +	group->enabled = true;
> >> >> +
> >> >> +	return 0;
> >> >> +
> >> >> +out:
> >> >> +	cleanup_events(group);
> >> >> +
> >> >> +	return ret;
> >> >> +}
> >> >> +
> >> >> +static int
> >> >> +scan_pmus(void)
> >> >> +{
> >> >> +	char path[PATH_MAX];
> >> >> +	struct dirent *dent;
> >> >> +	const char *name;
> >> >> +	DIR *dirp;
> >> >> +
> >> >> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> >> >> +	if (dirp == NULL)
> >> >> +		return -errno;
> >> >> +
> >> >> +	while ((dent = readdir(dirp))) {
> >> >> +		name = dent->d_name;
> >> >> +		if (name[0] == '.')
> >> >> +			continue;
> >> >> +
> >> >> +		/* sysfs entry should either contain cpus or be a cpu */
> >> >> +		if (!strcmp(name, "cpu"))
> >> >> +			break;
> >> >> +
> >> >> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> >> >> +		if (access(path, F_OK) == 0)
> >> >> +			break;
> >> >> +	}
> >> >> +
> >> >> +	if (dent) {
> >> >> +		rte_pmu.name = strdup(name);
> >> >> +		if (rte_pmu.name == NULL) {
> >> >> +			closedir(dirp);
> >> >> +
> >> >> +			return -ENOMEM;
> >> >> +		}
> >> >> +	}
> >> >> +
> >> >> +	closedir(dirp);
> >> >> +
> >> >> +	return rte_pmu.name ? 0 : -ENODEV;
> >> >> +}
> >> >> +
> >> >> +static struct rte_pmu_event *
> >> >> +new_event(const char *name)
> >> >> +{
> >> >> +	struct rte_pmu_event *event;
> >> >> +
> >> >> +	event = calloc(1, sizeof(*event));
> >> >> +	if (event == NULL)
> >> >> +		goto out;
> >> >> +
> >> >> +	event->name = strdup(name);
> >> >> +	if (event->name == NULL) {
> >> >> +		free(event);
> >> >> +		event = NULL;
> >> >> +	}
> >> >> +
> >> >> +out:
> >> >> +	return event;
> >> >> +}
> >> >> +
> >> >> +static void
> >> >> +free_event(struct rte_pmu_event *event) {
> >> >> +	free(event->name);
> >> >> +	free(event);
> >> >> +}
> >> >> +
> >> >> +int
> >> >> +rte_pmu_add_event(const char *name) {
> >> >> +	struct rte_pmu_event *event;
> >> >> +	char path[PATH_MAX];
> >> >> +
> >> >> +	if (rte_pmu.name == NULL)
> >> >> +		return -ENODEV;
> >> >> +
> >> >> +	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
> >> >> +		return -ENOSPC;
> >> >> +
> >> >> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> >> >> +	if (access(path, R_OK))
> >> >> +		return -ENODEV;
> >> >> +
> >> >> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> >> >> +		if (!strcmp(event->name, name))
> >> >> +			return event->index;
> >> >> +		continue;
> >> >> +	}
> >> >> +
> >> >> +	event = new_event(name);
> >> >> +	if (event == NULL)
> >> >> +		return -ENOMEM;
> >> >> +
> >> >> +	event->index = rte_pmu.num_group_events++;
> >> >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
> >> >> +
> >> >> +	return event->index;
> >> >> +}
> >> >> +
> >> >> +int
> >> >> +rte_pmu_init(void)
> >> >> +{
> >> >> +	int ret;
> >> >> +
> >> >> +	/* Allow calling init from multiple contexts within a single thread. This simplifies
> >> >> +	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
> >> >> +	 * via command line but application doesn't care enough and performs init/fini again.
> >> >> +	 */
> >> >> +	if (rte_pmu.initialized != 0) {
> >> >> +		rte_pmu.initialized++;
> >> >> +		return 0;
> >> >> +	}
> >> >> +
> >> >> +	ret = scan_pmus();
> >> >> +	if (ret)
> >> >> +		goto out;
> >> >> +
> >> >> +	ret = pmu_arch_init();
> >> >> +	if (ret)
> >> >> +		goto out;
> >> >> +
> >> >> +	TAILQ_INIT(&rte_pmu.event_list);
> >> >> +	TAILQ_INIT(&rte_pmu.event_group_list);
> >> >> +	rte_spinlock_init(&rte_pmu.lock);
> >> >> +	rte_pmu.initialized = 1;
> >> >> +
> >> >> +	return 0;
> >> >> +out:
> >> >> +	free(rte_pmu.name);
> >> >> +	rte_pmu.name = NULL;
> >> >> +
> >> >> +	return ret;
> >> >> +}
> >> >> +
> >> >> +void
> >> >> +rte_pmu_fini(void)
> >> >> +{
> >> >> +	struct rte_pmu_event_group *group, *tmp_group;
> >> >> +	struct rte_pmu_event *event, *tmp_event;
> >> >> +
> >> >> +	/* cleanup once init count drops to zero */
> >> >> +	if (rte_pmu.initialized == 0 || --rte_pmu.initialized != 0)
> >> >> +		return;
> >> >> +
> >> >> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
> >> >> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
> >> >> +		free_event(event);
> >> >> +	}
> >> >> +
> >> >> +	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
> >> >> +		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
> >> >> +		cleanup_events(group);
> >> >> +	}
> >> >> +
> >> >> +	pmu_arch_fini();
> >> >> +	free(rte_pmu.name);
> >> >> +	rte_pmu.name = NULL;
> >> >> +	rte_pmu.num_group_events = 0;
> >> >> +}
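For readers following along: the init/fini pairing quoted above is a plain reference count. Below is a minimal sketch of that pattern. The demo_* names are hypothetical and not part of the patch, and like the patch the sketch makes no attempt at thread safety.

```c
#include <assert.h>

/* Hypothetical counters, used only to make the behaviour observable. */
static unsigned int demo_init_count;
static int demo_setup_calls;
static int demo_teardown_calls;

/* First caller performs the real setup, later callers only bump the count. */
static int
demo_init(void)
{
	if (demo_init_count != 0) {
		demo_init_count++;
		return 0;
	}

	demo_setup_calls++;	/* stands in for scan_pmus()/pmu_arch_init() */
	demo_init_count = 1;

	return 0;
}

/* Teardown runs only when the last user calls fini. */
static void
demo_fini(void)
{
	if (demo_init_count == 0 || --demo_init_count != 0)
		return;

	demo_teardown_calls++;	/* stands in for freeing events and groups */
	assert(demo_init_count == 0);
}
```

With this shape an extra fini on an uninitialized library is a no-op, and unbalanced init calls simply keep the resources alive.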
> >> >> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> >> >> new file mode 100644
> >> >> index 0000000000..6b664c3336
> >> >> --- /dev/null
> >> >> +++ b/lib/pmu/rte_pmu.h
> >> >> @@ -0,0 +1,212 @@
> >> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> >> + * Copyright(c) 2023 Marvell
> >> >> + */
> >> >> +
> >> >> +#ifndef _RTE_PMU_H_
> >> >> +#define _RTE_PMU_H_
> >> >> +
> >> >> +/**
> >> >> + * @file
> >> >> + *
> >> >> + * PMU event tracing operations
> >> >> + *
> >> >> + * This file defines generic API and types necessary to setup PMU and
> >> >> + * read selected counters in runtime.
> >> >> + */
> >> >> +
> >> >> +#ifdef __cplusplus
> >> >> +extern "C" {
> >> >> +#endif
> >> >> +
> >> >> +#include <linux/perf_event.h>
> >> >> +
> >> >> +#include <rte_atomic.h>
> >> >> +#include <rte_branch_prediction.h>
> >> >> +#include <rte_common.h>
> >> >> +#include <rte_compat.h>
> >> >> +#include <rte_spinlock.h>
> >> >> +
> >> >> +/** Maximum number of events in a group */
> >> >> +#define MAX_NUM_GROUP_EVENTS 8
> >> >
> >> >forgot RTE_ prefix.
> >> >In fact, do you really need number of events in group to be hard-coded?
> >> >Couldn't mmap_pages[] and fds[] be allocated dynamically by enable_group()?
> >> >
> >>
> >> 8 is reasonable number I think. X86/ARM have actually less than that (was that something like 4?).
> >> Moreover events are scheduled as a group so there must be enough hw
> >> counters available for that to succeed. So this number should cover current needs.
> >
> >If you think 8 will be enough to cover all possible future cases - I am ok either way.
> >Still need RTE_ prefix for it.
> >
> 
> Okay that can be added.
> 
> >> >> +
> >> >> +/**
> >> >> + * A structure describing a group of events.
> >> >> + */
> >> >> +struct rte_pmu_event_group {
> >> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> >> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> >> >> +} __rte_cache_aligned;
> >> >> +
> >> >
> >> >Even if we'd decide to keep rte_pmu_read() as static inline (still
> >> >not sure it is a good idea),
> >>
> >> We want to save as much cpu cycles as we possibly can and inlining
> >> does helps in that matter.
> >
> >Ok, so asking same question from different thread: how many cycles it will save?
> >What is the difference in terms of performance when you have this function inlined vs not inlined?
> >
> 
> On x86 setup which is not under load, no cpusets configured, etc *just* not inlining rte_pmu_read()
> decreases performance by roughly 24% (44 vs 58 cpu cycles). At least that is reported by
> trace_perf_autotest.

From my perspective 14 cycles is not that much...
Consider that the user will probably not call it very often, and that by enabling measurements
he is probably already prepared to take some hit.

> 
> >> >why these two struct below (rte_pmu_event and rte_pmu) have to be public?
> >> >I think both can be safely moved away from public headers.
> >> >
> >>
> >> struct rte_pmu_event can be hidden I guess.
> >> struct rte_pmu is used in this header hence cannot be moved elsewhere.
> >
> >Not sure why?
> >Is that because you use it inside rte_pmu_read()?
> >But that check I think can be safely moved into __rte_pmu_enable_group() or probably even into
> >rte_pmu_add_event().
> 
> No, we should not do that. Otherwise we'll need to call function. Even though check will happen
> early on still function prologue/epilogue will happen. This takes cycles.

Not necessarily. You can store this value in pmu_group itself,
and use it to decide whether the pmu and the group are initialized, etc.
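To make the idea concrete, here is a rough sketch under the assumption that a single per-thread flag is enough for the fast path. All demo_* names are hypothetical stand-ins, not the patch's API. The inline reader touches only the thread-local group, and the global state is consulted only on the out-of-line slow path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_EVENTS 8

/* Hypothetical per-thread group; 'enabled' is the only state the fast path checks. */
struct demo_group {
	uint64_t counters[DEMO_MAX_EVENTS];
	bool enabled;
};

/* Global library state, visible to the slow path only. */
static bool demo_initialized;
static int demo_enable_calls;

static _Thread_local struct demo_group demo_group;

/* Slow path: refuses to enable the group unless the library was initialized. */
static int
demo_enable_group(struct demo_group *g)
{
	unsigned int i;

	demo_enable_calls++;
	if (!demo_initialized)
		return -1;

	for (i = 0; i < DEMO_MAX_EVENTS; i++)
		g->counters[i] = 100 + i;	/* placeholder counter values */
	g->enabled = true;

	return 0;
}

/* Fast path: one thread-local load and branch once the group is enabled. */
static inline uint64_t
demo_read(unsigned int index)
{
	struct demo_group *g = &demo_group;

	assert(index < DEMO_MAX_EVENTS);
	if (!g->enabled && demo_enable_group(g) != 0)
		return 0;

	return g->counters[index];
}
```

In this arrangement the global container does not need to appear in the public header; whether that is acceptable for the real library depends on what else the inline path has to check.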
 
> >
> >> >
> >> >> +/**
> >> >> + * A structure describing an event.
> >> >> + */
> >> >> +struct rte_pmu_event {
> >> >> +	char *name; /**< name of an event */
> >> >> +	unsigned int index; /**< event index into fds/mmap_pages */
> >> >> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> >> >> +};
> >> >
> >> >> +
> >> >> +/**
> >> >> + * A PMU state container.
> >> >> + */
> >> >> +struct rte_pmu {
> >> >> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> >> >> +	rte_spinlock_t lock; /**< serialize access to event group list */
> >> >> +	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
> >> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> >> +	unsigned int initialized; /**< initialization counter */
> >> >> +};
> >> >> +
> >> >> +/** lcore event group */
> >> >> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
> >> >> +
> >> >> +/** PMU state container */
> >> >> +extern struct rte_pmu rte_pmu;
> >> >> +
> >> >> +/** Each architecture supporting PMU needs to provide its own version */
> >> >> +#ifndef rte_pmu_pmc_read
> >> >> +#define rte_pmu_pmc_read(index) ({ 0; })
> >> >> +#endif
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Read PMU counter.
> >> >> + *
> >> >> + * @warning This should be not called directly.
> >> >> + *
> >> >> + * @param pc
> >> >> + *   Pointer to the mmapped user page.
> >> >> + * @return
> >> >> + *   Counter value read from hardware.
> >> >> + */
> >> >> +static __rte_always_inline uint64_t
> >> >> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> >> +{
> >> >> +	uint64_t width, offset;
> >> >> +	uint32_t seq, index;
> >> >> +	int64_t pmc;
> >> >> +
> >> >> +	for (;;) {
> >> >> +		seq = pc->lock;
> >> >> +		rte_compiler_barrier();
> >> >> +		index = pc->index;
> >> >> +		offset = pc->offset;
> >> >> +		width = pc->pmc_width;
> >> >> +
> >> >> +		/* index set to 0 means that particular counter cannot be used */
> >> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> >
> >> >In mmap_events() you return EPERM if cap_user_rdpmc is not enabled.
> >> >Do you need another check here? Or this capability can be disabled by
> >> >kernel at run-time?
> >> >
> >>
> >> That extra check in mmap_event() may be removed actually. Some archs
> >> allow disabling reading rdpmc (I think that on x86 one can do that) so this check needs to stay.
> >>
> >> >
> >> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> >> +			pmc <<= 64 - width;
> >> >> +			pmc >>= 64 - width;
> >> >> +			offset += pmc;
> >> >> +		}
> >> >> +
> >> >> +		rte_compiler_barrier();
> >> >> +
> >> >> +		if (likely(pc->lock == seq))
> >> >> +			return offset;
> >> >> +	}
> >> >> +
> >> >> +	return 0;
> >> >> +}
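A side note on the two shifts in the loop above: they sign-extend the raw counter from pmc_width bits to 64 bits before it is added to the offset. Below is a standalone sketch of just that step. The function name is hypothetical, and the right shift of a negative value is assumed to be arithmetic, which is what GCC and Clang do although the C standard leaves it implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Combine the kernel-provided offset with a raw hardware counter that is
 * only 'width' bits wide. Hypothetical helper, not part of the patch.
 */
static inline uint64_t
demo_counter_value(uint64_t offset, uint64_t raw, unsigned int width)
{
	int64_t pmc;

	assert(width >= 1 && width <= 64);

	/* Move the counter's top bit into bit 63; done unsigned to avoid overflow. */
	pmc = (int64_t)(raw << (64 - width));
	/* Arithmetic shift back replicates that bit (assumed compiler behaviour). */
	pmc >>= 64 - width;

	return offset + (uint64_t)pmc;
}
```

For example, a 48-bit counter reading of all ones is treated as -1, so an offset of 100 yields 99 rather than a huge positive number.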
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Enable group of events on the calling lcore.
> >> >> + *
> >> >> + * @warning This should be not called directly.
> >> >
> >> >__rte_internal ?
> >> >
> >>
> >> No this cannot be internal because that will make functions calling it
> >> internal as well hence apps won't be able to use that. This has
> >> already been brought up by one of the reviewers.
> >
> >Ok, then we probably can mark it with ' @internal' tag in formal comments?
> >
> 
> I added a warning not to call that directly. Since function is not internal (in DPDK parlance) per se
> I don't think we should add more confusion with that extra tag.

We are doing it in other places, why not add it here?
 
> >>
> >> >> + *
> >> >> + * @return
> >> >> + *   0 in case of success, negative value otherwise.
> >> >> + */
> >> >> +__rte_experimental
> >> >> +int
> >> >> +__rte_pmu_enable_group(void);
> >> >> +
> >> >> +/**
> >> >> + * @warning
> >> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> >> + *
> >> >> + * Initialize PMU library.
> >> >> + *
> >> >> + * @warning This should be not called directly.
> >> >
> >> >Hmm.. then who should call it?
> >> >If it not supposed to be called directly, why to declare it here?
> >> >
> >>
> >> This is inlined and has one caller i.e rte_pmu_read().
> >
> >I thought we are talking here about rte_pmu_init().
> >I don't see where it is inlined and still not clear why it can't be called directly.
> >
> 
> No this cannot be called by init because groups are configured in runtime. That is why
> __rte_pmu_enable_group() is called once in rte_pmu_read().
> 
> *Other* code should not call that directly. And yes, that is not inlined - my mistake.

Once again: we are discussing the comments for the rte_pmu_init() function.
Why can't it be called directly?
In test_pmu_read() you do call it directly.
 
> >> >> + *
> >> >> + * @return
> >> >> + *   0 in case of success, negative value otherwise.
> >> >> + */
> >> >
> >> >Probably worth to mention that this function is not MT safe.
> >> >Same for _fini_ and add_event.
> >> >Also worth to mention that all control-path functions
> >> >(init/fini/add_event) and data-path (pmu_read) can't be called concurrently.
> >> >
> >>
> >> Yes they are meant to be called from main thread.
> >
> >Ok, then please add that into formal API comments.
> >
> >> >> +__rte_experimental
> >> >> +int
> >> >> +rte_pmu_init(void);
> >> >> +

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-28  8:25                             ` Morten Brørup
@ 2023-02-28 12:04                               ` Konstantin Ananyev
  2023-02-28 13:15                                 ` Morten Brørup
  2023-02-28 16:22                                 ` Morten Brørup
  0 siblings, 2 replies; 139+ messages in thread
From: Konstantin Ananyev @ 2023-02-28 12:04 UTC (permalink / raw)
  To: Morten Brørup, Tomasz Duszynski, Konstantin Ananyev, dev


> > > >> Add support for programming PMU counters and reading their values in
> > > >> runtime bypassing kernel completely.
> > > >>
> > > >> This is especially useful in cases where CPU cores are isolated i.e
> > > >> run dedicated tasks. In such cases one cannot use standard perf
> > > >> utility without sacrificing latency and performance.
> > > >>
> > > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > > >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > > >
> 
> [...]
> 
> > > >> +int
> > > >> +__rte_pmu_enable_group(void)
> > > >> +{
> > > >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> > > >> +	int ret;
> > > >> +
> > > >> +	if (rte_pmu.num_group_events == 0)
> > > >> +		return -ENODEV;
> > > >> +
> > > >> +	ret = open_events(group);
> > > >> +	if (ret)
> > > >> +		goto out;
> > > >> +
> > > >> +	ret = mmap_events(group);
> > > >> +	if (ret)
> > > >> +		goto out;
> > > >> +
> > > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> > > >> +		ret = -errno;
> > > >> +		goto out;
> > > >> +	}
> > > >> +
> > > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> > > >> +		ret = -errno;
> > > >> +		goto out;
> > > >> +	}
> > > >> +
> > > >> +	rte_spinlock_lock(&rte_pmu.lock);
> > > >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> > > >
> > > >Hmm.. so we insert pointer to TLS variable into the global list?
> > > >Wonder what would happen if that thread get terminated?
> > >
> > > Nothing special. Any pointers to that thread-local in that thread are
> > invalided.
> > >
> > > >Can memory from its TLS block get re-used (by other thread or for other
> > purposes)?
> > > >
> > >
> > > Why would any other thread reuse that?
> > > Eventually main thread will need that data to do the cleanup.
> >
> > I understand that main thread would need to access that data.
> > I am not sure that it would be able to.
> > Imagine thread calls rte_pmu_read(...) and then terminates, while program
> > continues to run.
> 
> Is the example you describe here (i.e. a thread terminating in the middle of doing something) really a scenario DPDK is supposed to
> support?

I am not talking about some abnormal termination.
We do have the ability to spawn control threads, and a user can spawn their own threads;
all these threads can have a limited lifetime.
Not to mention rte_thread_register()/rte_thread_unregister().
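To illustrate the concern and one possible remedy, here is a sketch that unlinks a thread's group from the global list when the thread exits, using a pthread key destructor. This is an assumption about how it could be done, not what the patch does. All demo_* names are hypothetical, and a pthread mutex stands in for rte_spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct rte_pmu_event_group. */
struct demo_group {
	TAILQ_ENTRY(demo_group) next;
	int registered;
};

TAILQ_HEAD(demo_group_list, demo_group);
static struct demo_group_list demo_list = TAILQ_HEAD_INITIALIZER(demo_list);
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_key_t demo_key;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static _Thread_local struct demo_group demo_tls_group;

/* Runs at thread exit: drop the soon-to-be-invalid pointer from the list. */
static void
demo_unregister(void *arg)
{
	struct demo_group *g = arg;

	pthread_mutex_lock(&demo_lock);
	if (g->registered) {
		TAILQ_REMOVE(&demo_list, g, next);
		g->registered = 0;
	}
	pthread_mutex_unlock(&demo_lock);
}

static void
demo_make_key(void)
{
	int rc = pthread_key_create(&demo_key, demo_unregister);

	assert(rc == 0);
	(void)rc;
}

/* Called by a thread the first time it enables its group. */
static void
demo_register(void)
{
	struct demo_group *g = &demo_tls_group;

	pthread_once(&demo_once, demo_make_key);
	pthread_mutex_lock(&demo_lock);
	if (!g->registered) {
		TAILQ_INSERT_TAIL(&demo_list, g, next);
		g->registered = 1;
	}
	pthread_mutex_unlock(&demo_lock);
	/* A non-NULL value is what arms the destructor for this thread. */
	pthread_setspecific(demo_key, g);
}

static int
demo_list_len(void)
{
	struct demo_group *g;
	int n = 0;

	pthread_mutex_lock(&demo_lock);
	TAILQ_FOREACH(g, &demo_list, next)
		n++;
	pthread_mutex_unlock(&demo_lock);

	return n;
}

/* Short-lived thread: registers its group and then terminates normally. */
static void *
demo_worker(void *arg)
{
	(void)arg;
	demo_register();
	return NULL;
}
```

Without the destructor, the list would keep a pointer into the terminated thread's TLS block after pthread_join() returns, which is exactly the case I am worried about.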
 
> > As I understand address of its RTE_PER_LCORE(_event_group) will still remain
> > in rte_pmu.event_group_list,
> > even if it is probably not valid any more.
> 
> There should be a "destructor/done/finish" function available to remove this from the list.
> 
> [...]
> 
> > > >Even if we'd decide to keep rte_pmu_read() as static inline (still not
> > > >sure it is a good idea),
> > >
> > > We want to save as much cpu cycles as we possibly can and inlining does
> > helps
> > > in that matter.
> >
> > Ok, so asking same question from different thread: how many cycles it will
> > save?
> > What is the difference in terms of performance when you have this function
> > inlined vs not inlined?
> 
> We expect to use this in our in-house profiler library. For this reason, I have a very strong preference for absolute maximum
> performance.
> 
> Reading PMU events is for performance profiling, so I expect other potential users of the PMU library to share my opinion on this.

Well, from my perspective 14 cycles are not that much...
Though yes, it would be good to hear more opinions here.

 



^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-28 12:04                               ` Konstantin Ananyev
@ 2023-02-28 13:15                                 ` Morten Brørup
  2023-02-28 16:22                                 ` Morten Brørup
  1 sibling, 0 replies; 139+ messages in thread
From: Morten Brørup @ 2023-02-28 13:15 UTC (permalink / raw)
  To: Konstantin Ananyev, Tomasz Duszynski, Konstantin Ananyev, dev

> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Tuesday, 28 February 2023 13.05
> 
> > > > >> Add support for programming PMU counters and reading their values in
> > > > >> runtime bypassing kernel completely.
> > > > >>
> > > > >> This is especially useful in cases where CPU cores are isolated i.e
> > > > >> run dedicated tasks. In such cases one cannot use standard perf
> > > > >> utility without sacrificing latency and performance.
> > > > >>
> > > > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > > > >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > > > >
> >
> > [...]
> >
> > > > >> +int
> > > > >> +__rte_pmu_enable_group(void)
> > > > >> +{
> > > > >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> > > > >> +	int ret;
> > > > >> +
> > > > >> +	if (rte_pmu.num_group_events == 0)
> > > > >> +		return -ENODEV;
> > > > >> +
> > > > >> +	ret = open_events(group);
> > > > >> +	if (ret)
> > > > >> +		goto out;
> > > > >> +
> > > > >> +	ret = mmap_events(group);
> > > > >> +	if (ret)
> > > > >> +		goto out;
> > > > >> +
> > > > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> > > > >> +		ret = -errno;
> > > > >> +		goto out;
> > > > >> +	}
> > > > >> +
> > > > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> > > > >> +		ret = -errno;
> > > > >> +		goto out;
> > > > >> +	}
> > > > >> +
> > > > >> +	rte_spinlock_lock(&rte_pmu.lock);
> > > > >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> > > > >
> > > > >Hmm.. so we insert pointer to TLS variable into the global list?
> > > > >Wonder what would happen if that thread get terminated?
> > > >
> > > > Nothing special. Any pointers to that thread-local in that thread are
> > > invalided.
> > > >
> > > > >Can memory from its TLS block get re-used (by other thread or for other
> > > purposes)?
> > > > >
> > > >
> > > > Why would any other thread reuse that?
> > > > Eventually main thread will need that data to do the cleanup.
> > >
> > > I understand that main thread would need to access that data.
> > > I am not sure that it would be able to.
> > > Imagine thread calls rte_pmu_read(...) and then terminates, while program
> > > continues to run.
> >
> > Is the example you describe here (i.e. a thread terminating in the middle of
> doing something) really a scenario DPDK is supposed to
> > support?
> 
> I am not talking about some abnormal termination.

Then I misunderstood your example; I thought you meant the thread was terminated while inside the rte_pmu_read() function.

> We do have ability to spawn control threads, user can spawn his own thread,
> all these
> threads can have limited life-time.
> Not to mention about  rte_thread_register()/rte_thread_unregister().
> 

I agree that normal thread termination should be supported.

> > > As I understand address of its RTE_PER_LCORE(_event_group) will still
> remain
> > > in rte_pmu.event_group_list,
> > > even if it is probably not valid any more.
> >
> > There should be a "destructor/done/finish" function available to remove this
> from the list.
> >
> > [...]
> >
> > > > >Even if we'd decide to keep rte_pmu_read() as static inline (still not
> > > > >sure it is a good idea),
> > > >
> > > > We want to save as much cpu cycles as we possibly can and inlining does
> > > helps
> > > > in that matter.
> > >
> > > Ok, so asking same question from different thread: how many cycles it will
> > > save?
> > > What is the difference in terms of performance when you have this function
> > > inlined vs not inlined?
> >
> > We expect to use this in our in-house profiler library. For this reason, I
> have a very strong preference for absolute maximum
> > performance.
> >
> > Reading PMU events is for performance profiling, so I expect other potential
> users of the PMU library to share my opinion on this.
> 
> Well, from my perspective 14 cycles are not that much...

For reference, the i40e testpmd per-core performance report shows that it uses 36 cycles per packet.

This is a total of 1152 cycles per burst of 32 packets. 14 cycles overhead per burst / 1152 cycles per burst = 1.2 % overhead.

But that is not all: If the application's pipeline has three stages, where the PMU counters are read for each stage, the per-invocation overhead of 14 cycles adds up, and the overhead per burst is now 3 * 14 / 1152 = 3.6 %.
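The arithmetic above can be reproduced with a few lines of C. This is only a back-of-the-envelope helper with a hypothetical name. The inputs are the numbers quoted in this thread (14 extra cycles per read, 36 cycles per packet, bursts of 32 packets), not new measurements.

```c
#include <assert.h>

/*
 * Share of a burst's cycle budget spent on the extra cost of non-inlined
 * reads, in percent. Hypothetical helper for illustration only.
 */
static double
demo_overhead_percent(unsigned int reads_per_burst, unsigned int extra_cycles,
		      unsigned int cycles_per_packet, unsigned int burst_size)
{
	unsigned int burst_cycles = cycles_per_packet * burst_size;

	assert(burst_cycles != 0);

	return 100.0 * reads_per_burst * extra_cycles / burst_cycles;
}
```

Calling demo_overhead_percent(1, 14, 36, 32) gives roughly 1.2, and demo_overhead_percent(3, 14, 36, 32) gives roughly 3.6, matching the two figures above.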

Generalizing...

In my example here, the same function with 14 wasted cycles is called three times. It might as well be three individual libraries each wasting 14 cycles in its individual fast path processing function, due to a similarly relaxed attitude regarding wasting 14 cycles.

My point is:

Real applications do much more work than testpmd, so all this "insignificant" extra overhead in the libraries adds up!

Generally, I would like the DPDK Project to remain loyal to its original philosophy, where performance is considered a Key Performance Indicator, and overhead in the fast path is kept at an absolute minimum.

> Though yes, it would be good to hear more opinions here.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* RE: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-28 12:04                               ` Konstantin Ananyev
  2023-02-28 13:15                                 ` Morten Brørup
@ 2023-02-28 16:22                                 ` Morten Brørup
  2023-03-05 16:30                                   ` Konstantin Ananyev
  1 sibling, 1 reply; 139+ messages in thread
From: Morten Brørup @ 2023-02-28 16:22 UTC (permalink / raw)
  To: Konstantin Ananyev, Tomasz Duszynski, Konstantin Ananyev, dev
  Cc: bruce.richardson

> From: Morten Brørup
> Sent: Tuesday, 28 February 2023 14.16
> 
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Tuesday, 28 February 2023 13.05
> >
> > > > > >> Add support for programming PMU counters and reading their values
> in
> > > > > >> runtime bypassing kernel completely.
> > > > > >>
> > > > > >> This is especially useful in cases where CPU cores are isolated i.e
> > > > > >> run dedicated tasks. In such cases one cannot use standard perf
> > > > > >> utility without sacrificing latency and performance.
> > > > > >>
> > > > > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > > > > >> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> > > > > >
> > >
> > > [...]
> > >
> > > > > >> +int
> > > > > >> +__rte_pmu_enable_group(void)
> > > > > >> +{
> > > > > >> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
> > > > > >> +	int ret;
> > > > > >> +
> > > > > >> +	if (rte_pmu.num_group_events == 0)
> > > > > >> +		return -ENODEV;
> > > > > >> +
> > > > > >> +	ret = open_events(group);
> > > > > >> +	if (ret)
> > > > > >> +		goto out;
> > > > > >> +
> > > > > >> +	ret = mmap_events(group);
> > > > > >> +	if (ret)
> > > > > >> +		goto out;
> > > > > >> +
> > > > > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> > > > > >> +		ret = -errno;
> > > > > >> +		goto out;
> > > > > >> +	}
> > > > > >> +
> > > > > >> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> > > > > >> +		ret = -errno;
> > > > > >> +		goto out;
> > > > > >> +	}
> > > > > >> +
> > > > > >> +	rte_spinlock_lock(&rte_pmu.lock);
> > > > > >> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
> > > > > >
> > > > > >Hmm.. so we insert pointer to TLS variable into the global list?
> > > > > >Wonder what would happen if that thread get terminated?
> > > > >
> > > > > Nothing special. Any pointers to that thread-local in that thread are
> > > > invalided.
> > > > >
> > > > > >Can memory from its TLS block get re-used (by other thread or for
> other
> > > > purposes)?
> > > > > >
> > > > >
> > > > > Why would any other thread reuse that?
> > > > > Eventually main thread will need that data to do the cleanup.
> > > >
> > > > I understand that main thread would need to access that data.
> > > > I am not sure that it would be able to.
> > > > Imagine thread calls rte_pmu_read(...) and then terminates, while
> program
> > > > continues to run.
> > >
> > > Is the example you describe here (i.e. a thread terminating in the middle
> of
> > doing something) really a scenario DPDK is supposed to
> > > support?
> >
> > I am not talking about some abnormal termination.
> 
> Then I misunderstood your example; I thought you meant the tread was
> terminated while inside the rte_pmu_read() function.
> 
> > We do have ability to spawn control threads, user can spawn his own thread,
> > all these
> > threads can have limited life-time.
> > Not to mention about  rte_thread_register()/rte_thread_unregister().
> >
> 
> I agree that normal thread termination should be supported.
> 
> > > > As I understand address of its RTE_PER_LCORE(_event_group) will still
> > remain
> > > > in rte_pmu.event_group_list,
> > > > even if it is probably not valid any more.
> > >
> > > There should be a "destructor/done/finish" function available to remove
> this
> > from the list.
> > >
> > > [...]
> > >
> > > > > >Even if we'd decide to keep rte_pmu_read() as static inline (still
> not
> > > > > >sure it is a good idea),
> > > > >
> > > > > We want to save as much cpu cycles as we possibly can and inlining
> does
> > > > helps
> > > > > in that matter.
> > > >
> > > > Ok, so asking same question from different thread: how many cycles it
> will
> > > > save?
> > > > What is the difference in terms of performance when you have this
> function
> > > > inlined vs not inlined?
> > >
> > > We expect to use this in our in-house profiler library. For this reason, I
> > have a very strong preference for absolute maximum
> > > performance.
> > >
> > > Reading PMU events is for performance profiling, so I expect other
> potential
> > users of the PMU library to share my opinion on this.
> >
> > Well, from my perspective 14 cycles are not that much...
> 
> For reference, the i40e testpmd per-core performance report shows that it uses
> 36 cycles per packet.
> 
> This is a total of 1152 cycles per burst of 32 packets. 14 cycles overhead per
> burst / 1152 cycles per burst = 1.2 % overhead.
> 
> But that is not all: If the application's pipeline has three stages, where the
> PMU counters are read for each stage, the per-invocation overhead of 14 cycles
> adds up, and the overhead per burst is now 3 * 14 / 1152 = 3.6 %.

I was too fast on the keyboard here... If the application does more work than testpmd, it certainly also uses more than 1152 cycles to do that work. So please ignore the 3.6 % as a wild exaggeration from an invalid example, and just stick with the 1.2 % overhead - which I still consider significant, and thus worth avoiding.

> 
> Generalizing...
> 
> In my example here, the same function with 14 wasted cycles is called three
> times. It might as well be three individual libraries each wasting 14 cycles
> in its individual fast path processing function, due to a similarly relaxed
> attitude regarding wasting 14 cycles.
> 
> My point is:
> 
> Real applications do much more work than testpmd, so all this "insignificant"
> extra overhead in the libraries adds up!
> 
> Generally, I would like the DPDK Project to remain loyal to its original
> philosophy, where performance is considered a Key Performance Indicator, and
> overhead in the fast path is kept at an absolute minimum.
> 
> > Though yes, it would be good to hear more opinions here.

^ permalink raw reply	[flat|nested] 139+ messages in thread

* Re: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
  2023-02-28 16:22                                 ` Morten Brørup
@ 2023-03-05 16:30                                   ` Konstantin Ananyev
  0 siblings, 0 replies; 139+ messages in thread
From: Konstantin Ananyev @ 2023-03-05 16:30 UTC (permalink / raw)
  To: Morten Brørup, Konstantin Ananyev, Tomasz Duszynski, dev
  Cc: bruce.richardson


>>>>>>>> Add support for programming PMU counters and reading their values
>> in
>>>>>>>> runtime bypassing kernel completely.
>>>>>>>>
>>>>>>>> This is especially useful in cases where CPU cores are isolated i.e
>>>>>>>> run dedicated tasks. In such cases one cannot use standard perf
>>>>>>>> utility without sacrificing latency and performance.
>>>>>>>>
>>>>>>>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>>>>>>>> Acked-by: Morten Brørup <mb@smartsharesystems.com>
>>>>>>>
>>>>
>>>> [...]
>>>>
>>>>>>>> +int
>>>>>>>> +__rte_pmu_enable_group(void)
>>>>>>>> +{
>>>>>>>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>>>>>>>> +	int ret;
>>>>>>>> +
>>>>>>>> +	if (rte_pmu.num_group_events == 0)
>>>>>>>> +		return -ENODEV;
>>>>>>>> +
>>>>>>>> +	ret = open_events(group);
>>>>>>>> +	if (ret)
>>>>>>>> +		goto out;
>>>>>>>> +
>>>>>>>> +	ret = mmap_events(group);
>>>>>>>> +	if (ret)
>>>>>>>> +		goto out;
>>>>>>>> +
>>>>>>>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
>>>>>>>> +		ret = -errno;
>>>>>>>> +		goto out;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>>>>>>>> +		ret = -errno;
>>>>>>>> +		goto out;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	rte_spinlock_lock(&rte_pmu.lock);
>>>>>>>> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
>>>>>>>
>>>>>>> Hmm.. so we insert pointer to TLS variable into the global list?
>>>>>>> Wonder what would happen if that thread get terminated?
>>>>>>
>>>>>> Nothing special. Any pointers to that thread-local in that thread are
>>>>> invalided.
>>>>>>
>>>>>>> Can memory from its TLS block get re-used (by other thread or for
>> other
>>>>> purposes)?
>>>>>>>
>>>>>>
>>>>>> Why would any other thread reuse that?
>>>>>> Eventually main thread will need that data to do the cleanup.
>>>>>
>>>>> I understand that main thread would need to access that data.
>>>>> I am not sure that it would be able to.
>>>>> Imagine thread calls rte_pmu_read(...) and then terminates, while
>> program
>>>>> continues to run.
>>>>
>>>> Is the example you describe here (i.e. a thread terminating in the middle
>> of
>>> doing something) really a scenario DPDK is supposed to
>>>> support?
>>>
>>> I am not talking about some abnormal termination.
>>
>> Then I misunderstood your example; I thought you meant the tread was
>> terminated while inside the rte_pmu_read() function.
>>
>>> We do have ability to spawn control threads, user can spawn his own thread,
>>> all these
>>> threads can have limited life-time.
>>> Not to mention about  rte_thread_register()/rte_thread_unregister().
>>>
>>
>> I agree that normal thread termination should be supported.
>>
>>>>> As I understand address of its RTE_PER_LCORE(_event_group) will still
>>> remain
>>>>> in rte_pmu.event_group_list,
>>>>> even if it is probably not valid any more.
>>>>
>>>> There should be a "destructor/done/finish" function available to remove
>> this
>>> from the list.
>>>>
>>>> [...]
>>>>
>>>>>>> Even if we'd decide to keep rte_pmu_read() as static inline (still
>> not
>>>>>>> sure it is a good idea),
>>>>>>
>>>>>> We want to save as much cpu cycles as we possibly can and inlining
>> does
>>>>> helps
>>>>>> in that matter.
>>>>>
>>>>> Ok, so asking same question from different thread: how many cycles it
>> will
>>>>> save?
>>>>> What is the difference in terms of performance when you have this
>> function
>>>>> inlined vs not inlined?
>>>>
>>>> We expect to use this in our in-house profiler library. For this reason, I
>>> have a very strong preference for absolute maximum
>>>> performance.
>>>>
>>>> Reading PMU events is for performance profiling, so I expect other
>> potential
>>> users of the PMU library to share my opinion on this.
>>>
>>> Well, from my perspective 14 cycles are not that much...
>>
>> For reference, the i40e testpmd per-core performance report shows that it uses
>> 36 cycles per packet.
>>
>> This is a total of 1152 cycles per burst of 32 packets. 14 cycles overhead per
>> burst / 1152 cycles per burst = 1.2 % overhead.
>>
>> But that is not all: If the application's pipeline has three stages, where the
>> PMU counters are read for each stage, the per-invocation overhead of 14 cycles
>> adds up, and the overhead per burst is now 3 * 14 / 1152 = 3.6 %.
> 
> I was too fast on the keyboard here... If the application does more work than testpmd, it certainly also uses more than 1152 cycles to do that work. So please ignore the 3.6 % as a wild exaggeration from an invalid example, and just stick with the 1.2 % overhead - which I still consider significant, and thus worth avoiding.

I wonder whether we can do both: hide struct rte_pmu_event_group from the
public API and still have an inline function to read PMU stats.
If we add a separate function that lets the user get the
struct perf_event_mmap_page * for a given event index (or event name),
the user can later call
__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
directly.

> 
>>
>> Generalizing...
>>
>> In my example here, the same function with 14 wasted cycles is called three
>> times. It might as well be three individual libraries each wasting 14 cycles
>> in its individual fast path processing function, due to a similarly relaxed
>> attitude regarding wasting 14 cycles.
>>
>> My point is:
>>
>> Real applications do much more work than testpmd, so all this "insignificant"
>> extra overhead in the libraries adds up!
>>
>> Generally, I would like the DPDK Project to remain loyal to its original
>> philosophy, where performance is considered a Key Performance Indicator, and
>> overhead in the fast path is kept at an absolute minimum.
>>
>>> Though yes, it would be good to hear more opinions here.



* Re: [PATCH v11 0/4] add support for self monitoring
  2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
                                       ` (4 preceding siblings ...)
  2023-02-16 18:03                     ` [PATCH v11 0/4] add support for self monitoring Ruifeng Wang
@ 2023-05-04  8:02                     ` David Marchand
  2023-07-31 12:33                       ` Thomas Monjalon
  5 siblings, 1 reply; 139+ messages in thread
From: David Marchand @ 2023-05-04  8:02 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, Konstantin Ananyev

Hello Tomasz,

On Thu, Feb 16, 2023 at 6:55 PM Tomasz Duszynski <tduszynski@marvell.com> wrote:
>
> This series adds self monitoring support i.e allows to configure and
> read performance measurement unit (PMU) counters in runtime without
> using perf utility. This has certain advantages when application runs on
> isolated cores running dedicated tasks.
>
> Events can be read directly using rte_pmu_read() or using dedicated
> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> stored inside CTF file.
>
> By design, all enabled events are grouped together and the same group
> is attached to lcores that use self monitoring functionality.
>
> Events are enabled by names, which need to be read from standard
> location under sysfs i.e
>
> /sys/bus/event_source/devices/PMU/events
>
> where PMU is a core pmu i.e one measuring cpu events. As of today
> raw events are not supported.
>
> Tomasz Duszynski (4):
>   lib: add generic support for reading PMU events
>   pmu: support reading ARM PMU events in runtime
>   pmu: support reading Intel x86_64 PMU events in runtime
>   eal: add PMU support to tracing library

There are still some pending comments on this series and it can't be
merged until they get sorted out.

I noted two points:
- Konstantin asked for better explanations in the implementation.
- He also raised the question of using this feature with non-EAL lcores.

Could you work on this so we can consider this series for v23.07?

Thank you.


-- 
David Marchand



* Re: [PATCH v11 0/4] add support for self monitoring
  2023-05-04  8:02                     ` David Marchand
@ 2023-07-31 12:33                       ` Thomas Monjalon
  2023-08-07  8:11                         ` [EXT] " Tomasz Duszynski
  0 siblings, 1 reply; 139+ messages in thread
From: Thomas Monjalon @ 2023-07-31 12:33 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin, Konstantin Ananyev,
	David Marchand

Ping for update
What is the status of this feature?


04/05/2023 10:02, David Marchand:
> Hello Tomasz,
> 
> On Thu, Feb 16, 2023 at 6:55 PM Tomasz Duszynski <tduszynski@marvell.com> wrote:
> >
> > This series adds self monitoring support i.e allows to configure and
> > read performance measurement unit (PMU) counters in runtime without
> > using perf utility. This has certain advantages when application runs on
> > isolated cores running dedicated tasks.
> >
> > Events can be read directly using rte_pmu_read() or using dedicated
> > tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> > stored inside CTF file.
> >
> > By design, all enabled events are grouped together and the same group
> > is attached to lcores that use self monitoring functionality.
> >
> > Events are enabled by names, which need to be read from standard
> > location under sysfs i.e
> >
> > /sys/bus/event_source/devices/PMU/events
> >
> > where PMU is a core pmu i.e one measuring cpu events. As of today
> > raw events are not supported.
> >
> > Tomasz Duszynski (4):
> >   lib: add generic support for reading PMU events
> >   pmu: support reading ARM PMU events in runtime
> >   pmu: support reading Intel x86_64 PMU events in runtime
> >   eal: add PMU support to tracing library
> 
> There are still some pending comments on this series and it can't be
> merged until they get sorted out.
> 
> I noted two points:
> - Konstantin asked for better explanations in the implementation.
> - He also raised the question of using this feature with non-EAL lcores.
> 
> Could you work on this so we can consider this series for v23.07?



* RE: [EXT] Re: [PATCH v11 0/4] add support for self monitoring
  2023-07-31 12:33                       ` Thomas Monjalon
@ 2023-08-07  8:11                         ` Tomasz Duszynski
  2023-09-21  8:26                           ` David Marchand
  0 siblings, 1 reply; 139+ messages in thread
From: Tomasz Duszynski @ 2023-08-07  8:11 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, roretzla, Ruifeng.Wang, bruce.richardson,
	Jerin Jacob Kollanukkaran, mattias.ronnblom, mb, zhoumin,
	Konstantin Ananyev, David Marchand



>-----Original Message-----
>From: Thomas Monjalon <thomas@monjalon.net>
>Sent: Monday, July 31, 2023 2:33 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; roretzla@linux.microsoft.com; Ruifeng.Wang@arm.com; bruce.richardson@intel.com;
>Jerin Jacob Kollanukkaran <jerinj@marvell.com>; mattias.ronnblom@ericsson.com;
>mb@smartsharesystems.com; zhoumin@loongson.cn; Konstantin Ananyev
><konstantin.ananyev@huawei.com>; David Marchand <david.marchand@redhat.com>
>Subject: [EXT] Re: [PATCH v11 0/4] add support for self monitoring
>
>External Email
>
>----------------------------------------------------------------------
>Ping for update
>What is the status of this feature?
>
>

Hi Thomas, 

I'll re-spin the series soon. 

>04/05/2023 10:02, David Marchand:
>> Hello Tomasz,
>>
>> On Thu, Feb 16, 2023 at 6:55 PM Tomasz Duszynski <tduszynski@marvell.com> wrote:
>> >
>> > This series adds self monitoring support i.e allows to configure and
>> > read performance measurement unit (PMU) counters in runtime without
>> > using perf utility. This has certain advantages when application
>> > runs on isolated cores running dedicated tasks.
>> >
>> > Events can be read directly using rte_pmu_read() or using dedicated
>> > tracepoint rte_eal_trace_pmu_read(). The latter will cause events to
>> > be stored inside CTF file.
>> >
>> > By design, all enabled events are grouped together and the same
>> > group is attached to lcores that use self monitoring functionality.
>> >
>> > Events are enabled by names, which need to be read from standard
>> > location under sysfs i.e
>> >
>> > /sys/bus/event_source/devices/PMU/events
>> >
>> > where PMU is a core pmu i.e one measuring cpu events. As of today
>> > raw events are not supported.
>> >
>> > Tomasz Duszynski (4):
>> >   lib: add generic support for reading PMU events
>> >   pmu: support reading ARM PMU events in runtime
>> >   pmu: support reading Intel x86_64 PMU events in runtime
>> >   eal: add PMU support to tracing library
>>
>> There are still some pending comments on this series and it can't be
>> merged until they get sorted out.
>>
>> I noted two points:
>> - Konstantin asked for better explanations in the implementation.
>> - He also raised the question of using this feature with non-EAL lcores.
>>
>> Could you work on this so we can consider this series for v23.07?



* Re: [EXT] Re: [PATCH v11 0/4] add support for self monitoring
  2023-08-07  8:11                         ` [EXT] " Tomasz Duszynski
@ 2023-09-21  8:26                           ` David Marchand
  0 siblings, 0 replies; 139+ messages in thread
From: David Marchand @ 2023-09-21  8:26 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: Thomas Monjalon, dev, roretzla, Ruifeng.Wang, bruce.richardson,
	Jerin Jacob Kollanukkaran, mattias.ronnblom, mb, zhoumin,
	Konstantin Ananyev

Hello,

On Mon, Aug 7, 2023 at 10:12 AM Tomasz Duszynski <tduszynski@marvell.com> wrote:
> >Ping for update
> >What is the status of this feature?
>
> I'll re-spin the series soon.

-rc1 is getting closer.
Any update?


-- 
David Marchand


Thread overview: 139+ messages
2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
2022-12-15  8:33   ` Mattias Rönnblom
2022-11-11  9:43 ` [PATCH 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
2022-11-11  9:43 ` [PATCH 3/4] eal/x86: support reading Intel " Tomasz Duszynski
2022-11-11  9:43 ` [PATCH 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
2022-11-21 12:11   ` [PATCH v2 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
2022-11-21 12:11   ` [PATCH v2 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
2022-11-21 12:11   ` [PATCH v2 3/4] eal/x86: support reading Intel " Tomasz Duszynski
2022-11-21 12:11   ` [PATCH v2 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
2022-11-30  8:32       ` zhoumin
2022-12-13  8:05         ` [EXT] " Tomasz Duszynski
2022-11-29  9:28     ` [PATCH v3 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
2022-11-29  9:28     ` [PATCH v3 3/4] eal/x86: support reading Intel " Tomasz Duszynski
2022-11-29  9:28     ` [PATCH v3 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2022-11-29 10:42     ` [PATCH v3 0/4] add support for self monitoring Morten Brørup
2022-12-13  8:23       ` Tomasz Duszynski
2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
2022-12-13 11:52         ` Morten Brørup
2022-12-14  9:38           ` Tomasz Duszynski
2022-12-14 10:41             ` Morten Brørup
2022-12-15  8:22               ` Morten Brørup
2022-12-16  7:33                 ` Morten Brørup
2023-01-05 21:14               ` Tomasz Duszynski
2023-01-05 22:07                 ` Morten Brørup
2023-01-08 15:41                   ` Tomasz Duszynski
2023-01-08 16:30                     ` Morten Brørup
2022-12-15  8:46         ` Mattias Rönnblom
2023-01-04 15:47           ` Tomasz Duszynski
2023-01-09  7:37         ` Ruifeng Wang
2023-01-09 15:40           ` Tomasz Duszynski
2022-12-13 10:43       ` [PATCH v4 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
2022-12-13 10:43       ` [PATCH v4 3/4] eal/x86: support reading Intel " Tomasz Duszynski
2022-12-13 10:43       ` [PATCH v4 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
2023-01-11  9:05           ` Morten Brørup
2023-01-11 16:20             ` Tomasz Duszynski
2023-01-11 16:54               ` Morten Brørup
2023-01-10 23:46         ` [PATCH v5 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
2023-01-10 23:46         ` [PATCH v5 3/4] eal/x86: support reading Intel " Tomasz Duszynski
2023-01-10 23:46         ` [PATCH v5 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
2023-01-11  9:31           ` Morten Brørup
2023-01-11 14:24             ` Tomasz Duszynski
2023-01-11 14:32               ` Bruce Richardson
2023-01-11  9:39           ` [EXT] " Tomasz Duszynski
2023-01-11 21:05             ` Tyler Retzlaff
2023-01-13  7:44               ` Tomasz Duszynski
2023-01-13 19:22                 ` Tyler Retzlaff
2023-01-14  9:53                   ` Morten Brørup
2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
2023-01-20  9:46             ` Morten Brørup
2023-01-26  9:40               ` Tomasz Duszynski
2023-01-26 12:29                 ` Morten Brørup
2023-01-26 12:59                   ` Bruce Richardson
2023-01-26 15:28                     ` [EXT] " Tomasz Duszynski
2023-02-02 14:27                       ` Morten Brørup
2023-01-26 15:17                   ` Tomasz Duszynski
2023-01-20 18:29             ` Tyler Retzlaff
2023-01-26  9:05               ` [EXT] " Tomasz Duszynski
2023-01-19 23:39           ` [PATCH v6 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
2023-01-19 23:39           ` [PATCH v6 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
2023-01-19 23:39           ` [PATCH v6 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
2023-02-01 13:17             ` [PATCH v7 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
2023-02-01 13:17             ` [PATCH v7 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
2023-02-01 13:17             ` [PATCH v7 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
2023-02-01 13:17             ` [PATCH v7 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-02-01 13:51             ` [PATCH v7 0/4] add support for self monitoring Morten Brørup
2023-02-02  7:54               ` Tomasz Duszynski
2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
2023-02-02 10:32                 ` Ruifeng Wang
2023-02-02  9:43               ` [PATCH v8 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
2023-02-02  9:43               ` [PATCH v8 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
2023-02-02  9:43               ` [PATCH v8 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
2023-02-06 11:02                   ` David Marchand
2023-02-09 11:09                     ` [EXT] " Tomasz Duszynski
2023-02-02 12:49                 ` [PATCH v9 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
2023-02-02 12:49                 ` [PATCH v9 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
2023-02-02 12:49                 ` [PATCH v9 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
2023-02-13 11:31                   ` [PATCH v10 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
2023-02-16  7:39                     ` Ruifeng Wang
2023-02-16 14:44                       ` Tomasz Duszynski
2023-02-13 11:31                   ` [PATCH v10 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
2023-02-16  7:41                     ` Ruifeng Wang
2023-02-13 11:31                   ` [PATCH v10 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
2023-02-13 11:31                   ` [PATCH v10 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-02-16 17:54                   ` [PATCH v11 0/4] add support for self monitoring Tomasz Duszynski
2023-02-16 17:54                     ` [PATCH v11 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
2023-02-16 23:50                       ` Konstantin Ananyev
2023-02-17  8:49                         ` [EXT] " Tomasz Duszynski
2023-02-17 10:14                           ` Konstantin Ananyev
2023-02-19 14:23                             ` Tomasz Duszynski
2023-02-20 14:31                               ` Konstantin Ananyev
2023-02-20 16:59                                 ` Tomasz Duszynski
2023-02-20 17:21                                   ` Konstantin Ananyev
2023-02-20 20:42                                     ` Tomasz Duszynski
2023-02-21  0:48                                       ` Konstantin Ananyev
2023-02-27  8:12                                         ` Tomasz Duszynski
2023-02-28 11:35                                           ` Konstantin Ananyev
2023-02-21 12:15                           ` Konstantin Ananyev
2023-02-21  2:17                       ` Konstantin Ananyev
2023-02-27  9:19                         ` [EXT] " Tomasz Duszynski
2023-02-27 20:53                           ` Konstantin Ananyev
2023-02-28  8:25                             ` Morten Brørup
2023-02-28 12:04                               ` Konstantin Ananyev
2023-02-28 13:15                                 ` Morten Brørup
2023-02-28 16:22                                 ` Morten Brørup
2023-03-05 16:30                                   ` Konstantin Ananyev
2023-02-28  9:57                             ` Tomasz Duszynski
2023-02-28 11:58                               ` Konstantin Ananyev
2023-02-16 17:55                     ` [PATCH v11 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
2023-02-16 17:55                     ` [PATCH v11 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
2023-02-16 17:55                     ` [PATCH v11 4/4] eal: add PMU support to tracing library Tomasz Duszynski
2023-02-16 18:03                     ` [PATCH v11 0/4] add support for self monitoring Ruifeng Wang
2023-05-04  8:02                     ` David Marchand
2023-07-31 12:33                       ` Thomas Monjalon
2023-08-07  8:11                         ` [EXT] " Tomasz Duszynski
2023-09-21  8:26                           ` David Marchand
2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
2023-01-25 10:39             ` Thomas Monjalon
2023-01-25 16:16               ` Tyler Retzlaff
2023-01-26  8:30                 ` [EXT] " Tomasz Duszynski
2023-01-26 17:21                   ` Tyler Retzlaff
2023-01-26  8:35               ` Tomasz Duszynski
2023-01-25 10:33           ` [PATCH 2/2] bus: add platform bus Tomasz Duszynski
2023-01-25 10:41           ` [PATCH 0/2] " Tomasz Duszynski
2023-02-16 20:56         ` [PATCH v5 0/4] add support for self monitoring Liang Ma
