DPDK patches and discussions
* [PATCH 0/4] add support for self monitoring
@ 2022-11-11  9:43 Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                   ` (4 more replies)
  0 siblings, 5 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski; +Cc: thomas, jerinj

This series adds self-monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs
on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored in a CTF file.
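
To make the kernel-bypass read path more concrete, here is a minimal sketch
of the sequence-count/retry pattern used when reading a counter through a
user-mapped page. It is not the code from this series: struct fake_userpage
and read_counter_stub() are hypothetical stand-ins for the kernel's
perf_event_mmap_page and for the architecture-specific counter read, and the
constant returned by the stub is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few perf_event_mmap_page fields used here. */
struct fake_userpage {
	volatile uint32_t lock;   /* sequence count, bumped by the writer */
	uint32_t index;           /* hardware counter index + 1, 0 if unavailable */
	int64_t offset;           /* value accumulated by the kernel */
	uint16_t pmc_width;       /* number of valid bits in the raw counter */
};

/* Hypothetical stub; a real build would issue rdpmc or an mrs instruction. */
static uint64_t
read_counter_stub(uint32_t index)
{
	(void)index;
	return 0xFFFF000000001234ULL; /* upper bits simulate garbage above pmc_width */
}

static uint64_t
read_userpage_sketch(const struct fake_userpage *pc)
{
	uint64_t pmc = 0;
	uint32_t seq, index;
	int64_t offset;
	int tries = 100;

	assert(pc != NULL);

	while (tries--) {
		seq = pc->lock;
		__asm__ volatile("" ::: "memory"); /* compiler barrier */
		index = pc->index;
		offset = pc->offset;

		if (index != 0 && pc->pmc_width > 0 && pc->pmc_width <= 64) {
			pmc = read_counter_stub(index - 1);
			/* keep only the low pmc_width bits */
			pmc <<= 64 - pc->pmc_width;
			pmc >>= 64 - pc->pmc_width;
		}

		__asm__ volatile("" ::: "memory");
		/* unchanged sequence count means the snapshot is consistent */
		if (pc->lock == seq)
			return pmc + (uint64_t)offset;
	}

	return 0; /* gave up: the page kept changing */
}
```

The retry bound keeps the read from spinning forever if the page is updated
continuously; returning 0 in that case mirrors the "0 means failure"
convention described for rte_pmu_read().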

By design, all enabled events are grouped together and the same group
is attached to lcores that use self-monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
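
As an illustration of how a named event might be turned into raw config
values: each file under events/ holds comma-separated <term>=<value> pairs,
and the matching file under format/ says which config word and bit range a
term occupies (a string of the form config:0-7, for instance). The sketch
below parses such strings from memory rather than from sysfs. The helper
names and the sample strings in the usage note are assumptions made for
illustration, not output from a particular machine.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mask with bits l..h set (0 <= l <= h <= 63). */
static uint64_t
genmask_ull(unsigned int h, unsigned int l)
{
	assert(l <= h && h <= 63);
	return (~0ULL << l) & (~0ULL >> (63 - h));
}

/* Shift val to the position of the lowest set bit in mask and clamp it. */
static uint64_t
field_prep(uint64_t mask, uint64_t val)
{
	assert(mask != 0);
	return (val << (__builtin_ffsll((long long)mask) - 1)) & mask;
}

/*
 * Parse a format string such as "config1:8-15" or "config:3".
 * On success returns 0 and stores the config word number (0, 1 or 2)
 * and the bit mask; returns -1 on malformed input.
 */
static int
parse_format(const char *fmt, int *num, uint64_t *mask)
{
	char word[16];
	int low, high, ret;
	size_t len;

	ret = sscanf(fmt, "%15[^:]:%d-%d", word, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single-bit field */
	if (low < 0 || high < low || high > 63)
		return -1;

	/* trailing digit selects config1/config2; no digit means config */
	len = strlen(word);
	*num = isdigit((unsigned char)word[len - 1]) ? word[len - 1] - '0' : 0;
	if (*num > 2)
		return -1;

	*mask = genmask_ull((unsigned int)high, (unsigned int)low);
	return 0;
}
```

With a format string "config:0-7" and a term "event=0x3c", parse_format()
yields config word 0 and mask 0xff, and field_prep(0xff, 0x3c) gives the
value to OR into that word.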

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  37 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 103 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 518 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   4 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  32 ++
 20 files changed, 1067 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1



* [PATCH 1/4] eal: add generic support for reading PMU events
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-12-15  8:33   ` Mattias Rönnblom
  2022-11-11  9:43 ` [PATCH 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski; +Cc: thomas, jerinj

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 455 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   3 +
 10 files changed, 761 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index bd6700ef85..8fc1b20cab 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+Most architectures provide some sort of hardware measurement unit that exposes a set of
+programmable counters which monitor specific events. Different tools can gather that
+information, perf being one example. In some scenarios, though, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..7d3bd57d1d
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,455 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_malloc.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *pmu;
+
+/*
+ * The following __rte_weak functions provide default no-ops. Architectures should override them
+ * if necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	rte_free(group->mmap_pages);
+	rte_free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int ret;
+
+	if (pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->fds), 0);
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->mmap_pages), 0);
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		pmu->name = strdup(name);
+		if (!pmu->name)
+			return -ENOMEM;
+	}
+
+	return pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = rte_zmalloc(NULL, sizeof(*event), 0);
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		rte_free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
+	if (!pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(pmu->name);
+	rte_free(pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&pmu->event_list, event, next);
+		free(event->name);
+		rte_free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(pmu->name);
+	rte_free(pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..5955c22779
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines the generic API and types necessary to set up the PMU and
+ * read selected counters at runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from the register. In case of errors or lack of support
+ *   0 is returned. In other words, a stream of zeros in a trace file
+ *   indicates a problem with reading a particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!pmu)
+		return 0;
+
+	group = &pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..e870c87493 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -432,6 +432,8 @@ EXPERIMENTAL {
 	rte_thread_set_priority;
 
 	# added in 22.11
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 	rte_thread_attr_get_affinity;
 	rte_thread_attr_init;
 	rte_thread_attr_set_affinity;
@@ -483,4 +485,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski, Ruifeng Wang; +Cc: thomas, jerinj

Add support for reading ARM PMU events at runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  37 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 103 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 152 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..5efc851cb8
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..6c50a1b3c4
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 5955c22779..67b1194a2a 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-11-11  9:43 ` [PATCH 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski, Bruce Richardson, Konstantin Ananyev; +Cc: thomas, jerinj

Add support for reading Intel PMU events at runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 32 +++++++++++++++++++++++++++++++
 4 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 67b1194a2a..bbe12d100d 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..6ecb27a1eb
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,32 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH 4/4] eal: add PMU support to tracing library
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
                   ` (2 preceding siblings ...)
  2022-11-11  9:43 ` [PATCH 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-11-11  9:43 ` Tomasz Duszynski
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-11  9:43 UTC (permalink / raw)
  To: dev, tduszynski, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.
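
For illustration only, the way the event list is pulled out of a --trace
argument can be sketched outside of DPDK. The regular expression is the one
used by add_events_by_pattern() in this patch; the helper name
extract_event_list() and its return convention are assumptions made for this
sketch and are not part of the patch.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the event list of the first "e=..." expression
 * found in arg into out, without the "e=" prefix.
 * Returns 0 on success, -1 if nothing matched or out is too small.
 */
static int
extract_event_list(const char *arg, char *out, size_t len)
{
	regmatch_t m;
	regex_t reg;
	size_t num;
	int ret = -1;

	/* same expression as in add_events_by_pattern() */
	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
		return -1;

	if (regexec(&reg, arg, 1, &m, 0) == 0) {
		/* length of the match without the "e=" prefix */
		num = (size_t)(m.rm_eo - m.rm_so) - 2;
		if (num < len) {
			memcpy(out, arg + m.rm_so + 2, num);
			out[num] = '\0';
			ret = 0;
		}
	}

	regfree(&reg);

	return ret;
}
```

The sketch refuses to copy when the destination cannot hold the list plus the
terminating NUL, rather than truncating silently. The resulting string can
then be split on ',' to obtain individual event names.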

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 8fc1b20cab..977800ea01 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 7d3bd57d1d..40c454f92a 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -18,6 +18,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -402,11 +403,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		/* skip e= pattern prefix and leave room for the terminating NUL */
+		num = rmatch.rm_eo - rmatch.rm_so - 2;
+		if (num > sizeof(buf) - 1)
+			num = sizeof(buf) - 1;
+
+		memcpy(buf, pattern + rmatch.rm_so + 2, num);
+		buf[num] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
 	if (!pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -428,6 +488,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index e870c87493..d6ec3f3b0e 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -432,6 +432,7 @@ EXPERIMENTAL {
 	rte_thread_set_priority;
 
 	# added in 22.11
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
 	rte_thread_attr_get_affinity;
-- 
2.25.1



* [PATCH v2 0/4] add support for self monitoring
  2022-11-11  9:43 [PATCH 0/4] add support for self monitoring Tomasz Duszynski
                   ` (3 preceding siblings ...)
  2022-11-11  9:43 ` [PATCH 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2022-11-21 12:11 ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                     ` (4 more replies)
  4 siblings, 5 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

This series adds self-monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs on
isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self-monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
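
As a rough illustration of what enabling an event by name involves: an event
file under sysfs typically holds terms such as event=0x3c, and each term's
placement is described by a file under the PMU's format directory, e.g.
config:0-7. The sketch below shows how one such term could be turned into
bits of a config word. The helper names bit_range_mask() and encode_term()
are assumptions made for this sketch; the series itself does this in
get_term_format() and parse_event(). Formats listing several bit ranges are
not handled here.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Build a mask covering bits low..high (inclusive), 0 <= low <= high <= 63. */
static uint64_t
bit_range_mask(int high, int low)
{
	return (~0ULL << low) & (~0ULL >> (63 - high));
}

/*
 * Hypothetical helper: place val into the bit range described by a sysfs
 * format string such as "config:0-7" or "config1:32".
 * On success stores the config word number (0, 1 or 2) in *num, the shifted
 * and masked value in *bits and returns 0. Returns -1 on malformed input.
 */
static int
encode_term(const char *format, uint64_t val, int *num, uint64_t *bits)
{
	char name[16];
	int low, high, last, ret;

	ret = sscanf(format, "%15[^:]:%d-%d", name, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single-bit field */
	if (low < 0 || high > 63 || low > high)
		return -1;

	/* last digit selects the config word, a missing digit implies 0 */
	last = (unsigned char)name[strlen(name) - 1];
	*num = isdigit(last) ? last - '0' : 0;
	if (*num > 2)
		return -1;

	*bits = (val << low) & bit_range_mask(high, low);

	return 0;
}
```

With a format of config:8-15 and a value of 0x01, the sketch yields 0x100 for
config word 0. Values wider than the field are truncated to the field width.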

v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 103 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 519 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   6 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1073 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1



* [PATCH v2 1/4] eal: add generic support for reading PMU events
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.
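
To make the read side concrete, the arithmetic applied to a raw counter value
can be sketched in isolation. It mirrors what rte_pmu_read_userpage() in this
patch does with pmc_width and offset from the mmapped page: keep the low
"width" bits of the raw register value, then add the offset. The helper name
counter_value() is an assumption made for this sketch, and the handling of
widths outside 1..63 is a choice of the sketch rather than something the
patch defines.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine a raw hardware counter value with the
 * counter width and offset published in the mmapped page.
 */
static uint64_t
counter_value(uint64_t raw, unsigned int width, uint64_t offset)
{
	/* widths of 0 or 64 and above mean "no truncation" in this sketch */
	if (width > 0 && width < 64) {
		/* unsigned shifts clear everything above the low width bits */
		raw <<= 64 - width;
		raw >>= 64 - width;
	}

	return raw + offset;
}
```

For a 48-bit counter, any garbage in the upper 16 bits of the raw value is
dropped before the offset is added. In the patch this computation sits inside
a retry loop that re-reads the page's lock field to detect concurrent updates.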

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   5 +
 10 files changed, 764 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index bd6700ef85..8fc1b20cab 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..dc169fb2cf
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,456 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_malloc.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *pmu;
+
+/*
+ * The following __rte_weak functions provide default no-ops. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	rte_free(group->mmap_pages);
+	rte_free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
+	int ret;
+
+	if (pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->fds), 0);
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = rte_calloc(NULL, pmu->num_group_events, sizeof(*group->mmap_pages), 0);
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		pmu->name = strdup(name);
+		if (!pmu->name)
+			return -ENOMEM;
+	}
+
+	return pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = rte_calloc(NULL, 1, sizeof(*event), 0);
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		rte_free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
+	if (!pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(pmu->name);
+	rte_free(pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&pmu->event_list, event, next);
+		free(event->name);
+		rte_free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(pmu->name);
+	rte_free(pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..5955c22779
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!pmu)
+		return 0;
+
+	group = &pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..1ebd842f34 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,10 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +487,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH v2 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev, Ruifeng Wang; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading ARM PMU events in runtime.
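
For illustration, a minimal sketch of reading an integer attribute such as
/proc/sys/kernel/perf_user_access. It is not the helper from this patch: the
name read_attr_int_sketch() is hypothetical, and unlike the patch's helper it
NUL-terminates the buffer before parsing and saves errno before close() can
overwrite it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: read a decimal integer from a sysfs/procfs style
 * attribute file. Returns 0 on success, negative errno otherwise.
 */
static int
read_attr_int_sketch(const char *path, int *val)
{
	char buf[32];
	ssize_t n;
	int fd, err;

	assert(path != NULL && val != NULL);

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return -errno;

	/* leave room for the terminator added below */
	n = read(fd, buf, sizeof(buf) - 1);
	err = errno; /* save before close() can overwrite it */
	close(fd);
	if (n == -1)
		return -err;

	buf[n] = '\0'; /* read() does not terminate the buffer */
	*val = (int)strtol(buf, NULL, 10);

	return 0;
}
```

A caller would keep the value it read so that it can be written back on
cleanup, which is what the restore_uaccess variable in this patch is for.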

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 103 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 154 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..10e2984813
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..6c50a1b3c4
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 5955c22779..67b1194a2a 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH v2 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-21 12:11   ` [PATCH v2 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 67b1194a2a..bbe12d100d 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..a2cd849fb1
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH v2 4/4] eal: add PMU support to tracing library
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
                     ` (2 preceding siblings ...)
  2022-11-21 12:11   ` [PATCH v2 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-11-21 12:11   ` Tomasz Duszynski
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-21 12:11 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas, Tomasz Duszynski

In order to profile an app one needs to store a significant number of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.
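
For illustration, a minimal sketch of how event names could be pulled out of
a --trace argument such as '*pmu.read\|e=cpu_cycles,l1d_cache'. The patch
itself matches the e=ev1[,ev2,...] expression with a POSIX regex; the sketch
below swaps that for plain string scanning, and split_trace_events(),
is_event_char() and EV_NAME_MAX are hypothetical names used only here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define EV_NAME_MAX 64

/* Hypothetical helper: true for characters allowed in an event name. */
static int
is_event_char(int c)
{
	return isalnum(c) || c == '_' || c == '-';
}

/*
 * Hypothetical helper: copy up to max event names found after the first
 * "e=" in pattern into names[]. Returns the number of names stored.
 */
static int
split_trace_events(const char *pattern, char names[][EV_NAME_MAX], int max)
{
	const char *p;
	int count = 0;

	assert(pattern != NULL);

	p = strstr(pattern, "e=");
	if (p == NULL)
		return 0;
	p += 2; /* skip the "e=" prefix */

	while (count < max && is_event_char((unsigned char)*p)) {
		size_t len = 0;

		while (is_event_char((unsigned char)p[len]))
			len++;
		if (len >= EV_NAME_MAX)
			len = EV_NAME_MAX - 1; /* truncate overlong names */
		memcpy(names[count], p, len);
		names[count][len] = '\0';
		count++;

		/* advance past the full name, then an optional comma */
		while (is_event_char((unsigned char)*p))
			p++;
		if (*p == ',')
			p++;
	}

	return count;
}
```

The position of a name in the resulting array corresponds to the index later
passed to rte_eal_trace_pmu_read(), assuming events are added in the order
they are listed.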

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 8fc1b20cab..977800ea01 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers the dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index dc169fb2cf..6a417f74a9 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -19,6 +19,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -403,11 +404,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
 	if (!pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -429,6 +489,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1ebd842f34..b49a430c84 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
-- 
2.25.1



* [PATCH v3 0/4] add support for self monitoring
  2022-11-21 12:11 ` [PATCH v2 0/4] add support for self monitoring Tomasz Duszynski
                     ` (3 preceding siblings ...)
  2022-11-21 12:11   ` [PATCH v2 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2022-11-29  9:28   ` Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                       ` (5 more replies)
  4 siblings, 6 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

This series adds self monitoring support, i.e. allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs on
isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by names, which need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
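
For illustration, a minimal sketch of how a named event ends up as raw
config bits. Under sysfs an event file holds terms such as event=0x11, and
the matching format file holds a bit range such as config:0-7; the series
combines the two with GENMASK_ULL/FIELD_PREP style macros. The functions
field_mask(), field_prep() and apply_term() below are hypothetical
equivalents written only for this example, and the values are assumed
samples rather than output from a real system.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask covering bits low..high inclusive. */
static uint64_t
field_mask(unsigned int low, unsigned int high)
{
	assert(low <= high && high < 64);

	return (~0ULL << low) & (~0ULL >> (63 - high));
}

/* Shift val to the lowest set bit of mask and drop anything outside mask. */
static uint64_t
field_prep(uint64_t mask, uint64_t val)
{
	assert(mask != 0);

	return (val << (__builtin_ffsll((long long)mask) - 1)) & mask;
}

/*
 * Hypothetical helper: merge one "<term>=<value>" pair into the config word
 * selected by the term's format ("config", "config1" or "config2").
 */
static void
apply_term(uint64_t config[3], unsigned int word, unsigned int low,
	   unsigned int high, uint64_t val)
{
	assert(word < 3);

	config[word] |= field_prep(field_mask(low, high), val);
}
```

With a format of config:0-7 and a term event=0x11, calling
apply_term(config, 0, 0, 7, 0x11) sets the low byte of config[0] to 0x11.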

v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 104 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 520 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   7 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1076 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1



* [PATCH v3 1/4] eal: add generic support for reading PMU events
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-30  8:32       ` zhoumin
  2022-11-29  9:28     ` [PATCH v3 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 457 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   6 +
 10 files changed, 766 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index bd6700ef85..8fc1b20cab 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. However, in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..6763005903
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,457 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_malloc.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf), fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	rte_free(group->mmap_pages);
+	rte_free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int ret;
+
+	if (rte_pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->fds), 0);
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->mmap_pages), 0);
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu->name = strdup(name);
+		if (!rte_pmu->name)
+			return -ENOMEM;
+	}
+
+	return rte_pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = rte_calloc(NULL, 1, sizeof(*event), 0);
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		rte_free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	rte_pmu = rte_calloc(NULL, 1, sizeof(*rte_pmu), RTE_CACHE_LINE_SIZE);
+	if (!rte_pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&rte_pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(rte_pmu->name);
+	rte_free(rte_pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
+		free(event->name);
+		rte_free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(rte_pmu->name);
+	rte_free(rte_pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..e4b4f6b052
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!rte_pmu)
+		return 0;
+
+	group = &rte_pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= rte_pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..9225f46f67 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,11 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu; # WINDOWS_NO_EXPORT
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +488,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH v3 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev, Ruifeng Wang; +Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 104 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 155 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..10e2984813
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..10ec770ead
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ] = { 0 };
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index e4b4f6b052..158a616b83 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH v3 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-29  9:28     ` [PATCH v3 4/4] eal: add PMU support to tracing library Tomasz Duszynski
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, Tomasz Duszynski

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 158a616b83..3d90f4baf7 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..a2cd849fb1
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH v3 4/4] eal: add PMU support to tracing library
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
                       ` (2 preceding siblings ...)
  2022-11-29  9:28     ` [PATCH v3 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-11-29  9:28     ` Tomasz Duszynski
  2022-11-29 10:42     ` [PATCH v3 0/4] add support for self monitoring Morten Brørup
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  5 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-11-29  9:28 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas, Tomasz Duszynski

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 8fc1b20cab..977800ea01 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers the dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 6763005903..db8f6f43c3 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -20,6 +20,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -404,11 +405,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	rte_pmu = rte_calloc(NULL, 1, sizeof(*rte_pmu), RTE_CACHE_LINE_SIZE);
 	if (!rte_pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -430,6 +490,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(rte_pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 9225f46f67..73803f9601 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
-- 
2.25.1



* RE: [PATCH v3 0/4] add support for self monitoring
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
                       ` (3 preceding siblings ...)
  2022-11-29  9:28     ` [PATCH v3 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2022-11-29 10:42     ` Morten Brørup
  2022-12-13  8:23       ` Tomasz Duszynski
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  5 siblings, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2022-11-29 10:42 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Tuesday, 29 November 2022 10.28
> 
> This series adds self monitoring support i.e allows to configure and
> read performance measurement unit (PMU) counters in runtime without
> using perf utility. This has certain adventages when application runs
> on
> isolated cores with nohz_full kernel parameter.
> 
> Events can be read directly using rte_pmu_read() or using dedicated
> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> stored inside CTF file.
> 
> By design, all enabled events are grouped together and the same group
> is attached to lcores that use self monitoring funtionality.
> 
> Events are enabled by names, which need to be read from standard
> location under sysfs i.e
> 
> /sys/bus/event_source/devices/PMU/events
> 
> where PMU is a core pmu i.e one measuring cpu events. As of today
> raw events are not supported.

Hi Tomasz,

I am very interested in this patch series for fast path profiling purposes. (Not using EAL trace, but our proprietary profiler.)

However, it seems that rte_pmu_read() is quite long-winded compared to rte_pmu_pmc_read().

But perhaps I am just worrying too much, so I will ask: What is the performance cost of using rte_pmu_read() - compared to rte_pmu_pmc_read() - in the fast path?
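
For context, most of what rte_pmu_read() does on top of the raw counter read is the userpage sequence loop from rte_pmu_read_userpage() in patch 1/4. A minimal standalone sketch of that control flow follows. It assumes simplified stand-ins: struct fake_userpage and fake_pmc_read() are hypothetical replacements for struct perf_event_mmap_page and rte_pmu_pmc_read(), so it illustrates the steps only and says nothing about actual cost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the perf_event_mmap_page fields that are used. */
struct fake_userpage {
	volatile uint32_t lock; /* sequence counter, changes while the kernel updates */
	uint32_t index;         /* hardware counter index + 1, 0 if unavailable */
	uint64_t offset;        /* count accumulated by the kernel */
	uint16_t pmc_width;     /* number of valid bits in the hardware counter */
};

/* Hypothetical raw read; real code would execute rdpmc or mrs here. */
static uint64_t
fake_pmc_read(uint32_t index)
{
	(void)index;
	return 0xFFFF000000001234ULL; /* upper 16 bits are junk on purpose */
}

static uint64_t
read_userpage_sketch(const struct fake_userpage *pc)
{
	int tries = 100;

	assert(pc->pmc_width >= 1 && pc->pmc_width <= 64);

	while (tries-- > 0) {
		uint32_t seq = pc->lock;
		uint64_t pmc = 0, value;

		atomic_signal_fence(memory_order_seq_cst); /* compiler barrier */
		if (pc->index != 0) {
			pmc = fake_pmc_read(pc->index - 1);
			/* keep only the low pmc_width bits */
			pmc <<= 64 - pc->pmc_width;
			pmc >>= 64 - pc->pmc_width;
		}
		value = pmc + pc->offset;
		atomic_signal_fence(memory_order_seq_cst);

		/* unchanged sequence means the fields were read consistently */
		if (pc->lock == seq)
			return value;
	}

	return 0; /* same "0 on failure" convention as the patch */
}
```

In the common case the loop body runs once, so the extra work is a handful of loads, two shifts and one compare around the raw read.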

If there is a non-negligible difference, could you please provide an example of how to configure PMU events and use rte_pmu_pmc_read() in an application?

I would primarily be interested in data cache misses and branch mispredictions. But feel free to make your own choices for the example.
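
As a rough illustration of how the API proposed in patch 1/4 might be used around a code section, here is a sketch that samples one event before and after a workload. It uses rte_pmu_read() rather than rte_pmu_pmc_read(), because the latter needs the hardware counter index, which the series only obtains from the mmapped userpage. The two rte_pmu_* functions below are stand-in stubs with the signatures from the patch so that the sketch builds without DPDK; measure_event() and workload() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Stand-in stubs so the sketch builds without DPDK. Signatures follow
 * patch 1/4; a real application would include <rte_pmu.h> instead.
 */
static uint64_t stub_counter; /* pretend hardware counter */
static int stub_num_events;

static int
rte_pmu_add_event(const char *name)
{
	if (name == NULL)
		return -1;

	return stub_num_events++; /* index of the newly added event */
}

static uint64_t
rte_pmu_read(int index)
{
	if (index < 0 || index >= stub_num_events)
		return 0; /* same "0 on error" convention as the patch */

	return stub_counter;
}

/* Hypothetical workload; bumps the pretend counter by a known amount. */
static void
workload(void)
{
	stub_counter += 1000;
}

/* Count occurrences of one event across fn() by sampling before and after. */
static uint64_t
measure_event(int event, void (*fn)(void))
{
	uint64_t before;

	assert(fn != NULL);

	before = rte_pmu_read(event);
	fn();

	return rte_pmu_read(event) - before;
}
```

With the real API the event name passed to rte_pmu_add_event() has to be one listed under /sys/bus/event_source/devices/PMU/events, for example "cpu_cycles" as used by test_pmu.c on ARM64; names for cache misses or branch mispredictions depend on the PMU. Since rte_pmu_read() looks up the group by rte_lcore_id(), both samples need to be taken on the lcore running the code being measured.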



* Re: [PATCH v3 1/4] eal: add generic support for reading PMU events
  2022-11-29  9:28     ` [PATCH v3 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-11-30  8:32       ` zhoumin
  2022-12-13  8:05         ` [EXT] " Tomasz Duszynski
  0 siblings, 1 reply; 205+ messages in thread
From: zhoumin @ 2022-11-30  8:32 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj

Hi Tomasz,

On Tue, Nov 29, 2022 at 5:28 PM, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
>
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
>
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>   app/test/meson.build                  |   1 +
>   app/test/test_pmu.c                   |  41 +++
>   doc/guides/prog_guide/profile_app.rst |   8 +
>   lib/eal/common/meson.build            |   3 +
>   lib/eal/common/pmu_private.h          |  41 +++
>   lib/eal/common/rte_pmu.c              | 457 ++++++++++++++++++++++++++
>   lib/eal/include/meson.build           |   1 +
>   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>   lib/eal/linux/eal.c                   |   4 +
>   lib/eal/version.map                   |   6 +
>   10 files changed, 766 insertions(+)
>   create mode 100644 app/test/test_pmu.c
>   create mode 100644 lib/eal/common/pmu_private.h
>   create mode 100644 lib/eal/common/rte_pmu.c
>   create mode 100644 lib/eal/include/rte_pmu.h
>
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..93b3300309 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -143,6 +143,7 @@ test_sources = files(
>           'test_timer_racecond.c',
>           'test_timer_secondary.c',
>           'test_ticketlock.c',
> +        'test_pmu.c',
>           'test_trace.c',
>           'test_trace_register.c',
>           'test_trace_perf.c',
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..fd331af9ee
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <rte_pmu.h>
> +
> +#include "test.h"
> +
> +static int
> +test_pmu_read(void)
> +{
> +	uint64_t val = 0;
> +	int tries = 10;
> +	int event = -1;
> +
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	if (val == 0)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index bd6700ef85..8fc1b20cab 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,14 @@ Profile Your Application
>   The following sections describe methods of profiling DPDK applications on
>   different architectures.
>   
> +Performance counter based profiling
> +-----------------------------------
> +
> +Majority of architectures support some sort hardware measurement unit which provides a set of
> +programmable counters that monitor specific events. There are different tools which can gather
> +that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
> +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> +read specific events directly from application via ``rte_pmu_read()``.
>   
>   Profiling on x86
>   ----------------
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..d6d05b56f3 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -38,6 +38,9 @@ sources += files(
>           'rte_service.c',
>           'rte_version.c',
>   )
> +if is_linux
> +    sources += files('rte_pmu.c')
> +endif
>   if is_linux or is_windows
>       sources += files('eal_common_dynmem.c')
>   endif
> diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> new file mode 100644
> index 0000000000..cade4245e6
> --- /dev/null
> +++ b/lib/eal/common/pmu_private.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +/**
> + * Initialize PMU tracing internals.
> + */
> +void
> +eal_pmu_init(void);
> +
> +/**
> + * Cleanup PMU internals.
> + */
> +void
> +eal_pmu_fini(void);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..6763005903
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,457 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_malloc.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	rte_free(group->mmap_pages);
> +	rte_free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->fds), 0);
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = rte_calloc(NULL, rte_pmu->num_group_events, sizeof(*group->mmap_pages), 0);
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = rte_calloc(NULL, 1, sizeof(*event), 0);
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		rte_free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = rte_calloc(NULL, 1, sizeof(*rte_pmu), RTE_CACHE_LINE_SIZE);
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	rte_free(rte_pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		rte_free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	rte_free(rte_pmu);
> +}

There may be a problem with the implementation of eal_pmu_fini(), but I'm
not sure.

I checked some test reports for this series. It seems that the
`debug_autotest` test case in the DPDK unit tests has an issue when the
child process in this test case calls rte_exit().

The call chain is as follows:

     test_debug() -> test_exit() -> test_exit_val() -> rte_exit() ->
     rte_eal_cleanup() -> eal_pmu_fini().

Judging from the following error messages, the issue may be related to
freeing memory:

test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: 1
   Cause: test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: 2
   Cause: test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: 255
   Cause: test_exit_valEAL: Error: Invalid memory
EAL: Error - exiting with code: -1
   Cause: test_exit_valEAL: Error: Invalid memory

The above error messages disappear when I comment out the call to
eal_pmu_fini() in rte_eal_cleanup().

> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>           'rte_pci_dev_features.h',
>           'rte_per_lcore.h',
>           'rte_pflock.h',
> +        'rte_pmu.h',
>           'rte_random.h',
>           'rte_reciprocal.h',
>           'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..e4b4f6b052
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /**< name of an event */
> +	int index; /**< event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!rte_pmu)
> +		return 0;
> +
> +	group = &rte_pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}
> +
> +	if (index < 0 || index >= rte_pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> +}
> +
> +#else /* !RTE_EXEC_ENV_LINUX */
> +
> +__rte_experimental
> +static int __rte_unused
> +rte_pmu_add_event(__rte_unused const char *name)
> +{
> +	return -1;
> +}
> +
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(__rte_unused int index)
> +{
> +	return 0;
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 8c118d0d9f..751a13b597 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -53,6 +53,7 @@
>   #include "eal_options.h"
>   #include "eal_vfio.h"
>   #include "hotplug_mp.h"
> +#include "pmu_private.h"
>   
>   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
>   
> @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
>   		return -1;
>   	}
>   
> +	eal_pmu_init();
> +
>   	if (rte_eal_tailqs_init() < 0) {
>   		rte_eal_init_alert("Cannot init tail queues for objects");
>   		rte_errno = EFAULT;
> @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
>   	eal_bus_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_pmu_fini();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..9225f46f67 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -440,6 +440,11 @@ EXPERIMENTAL {
>   	rte_thread_detach;
>   	rte_thread_equal;
>   	rte_thread_join;
> +
> +	# added in 23.03
> +	rte_pmu; # WINDOWS_NO_EXPORT
> +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> +	rte_pmu_read; # WINDOWS_NO_EXPORT
>   };
>   
>   INTERNAL {
> @@ -483,4 +488,5 @@ INTERNAL {
>   	rte_mem_map;
>   	rte_mem_page_size;
>   	rte_mem_unmap;
> +	rte_pmu_enable_group;
>   };

Best regards,

Min Zhou


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH v3 1/4] eal: add generic support for reading PMU events
  2022-11-30  8:32       ` zhoumin
@ 2022-12-13  8:05         ` Tomasz Duszynski
  0 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-13  8:05 UTC (permalink / raw)
  To: zhoumin, dev; +Cc: thomas, Jerin Jacob Kollanukkaran

Hello Min, 

> -----Original Message-----
> From: zhoumin <zhoumin@loongson.cn>
> Sent: Wednesday, November 30, 2022 9:33 AM
> To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Subject: [EXT] Re: [PATCH v3 1/4] eal: add generic support for reading PMU events
> 
> External Email
> 
> ----------------------------------------------------------------------
> Hi Tomasz,
> 

[...] 

> > +void
> > +eal_pmu_fini(void)
> > +{
> > +	struct rte_pmu_event *event, *tmp;
> > +	int lcore_id;
> > +
> > +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> > +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> > +		free(event->name);
> > +		rte_free(event);
> > +	}
> > +
> > +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> > +		cleanup_events(lcore_id);
> > +
> > +	pmu_arch_fini();
> > +	free(rte_pmu->name);
> > +	rte_free(rte_pmu);
> > +}
> 
> There may be some problems with the implementation of eal_pmu_fini(), but I'm not sure.
> 
> I checked some test reports for this series. It seems that the test case of `debug_autotest` in
> 
> the DPDK unit test has an issue when the child process in this test case calls the function
> of  rte_exit().
> 
> The call chain is as follows:
> 
>      test_debug() -> test_exit() -> test_exit_val() -> rte_exit() ->
> rte_eal_cleanup() -> eal_pmu_fini().
> 
> The issue may be related to memory free from the error message as follows:
> 
> test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: 1
>    Cause: test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: 2
>    Cause: test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: 255
>    Cause: test_exit_valEAL: Error: Invalid memory
> EAL: Error - exiting with code: -1
>    Cause: test_exit_valEAL: Error: Invalid memory
> 
> The above error message will disappear when I comment out the calling to the eal_pmu_fini() in
> 
> the rte_eal_cleanup().
> 

Thanks for pointing this out. This was apparently caused by the same hugepage memory being freed multiple times by forked processes.
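
To illustrate the failure mode, here is a minimal sketch of one way to keep a forked child from releasing state it did not allocate. It is not the fix that went into v4. All names (pmu_state, pmu_state_init, pmu_state_fini) are hypothetical, and plain calloc()/free() stand in for the hugepage allocator.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the global PMU state; not the DPDK structure. */
struct pmu_state {
	pid_t owner; /* process that allocated the state */
	char *name;
};

static struct pmu_state *state;

static int
pmu_state_init(void)
{
	state = calloc(1, sizeof(*state));
	if (state == NULL)
		return -1;

	state->owner = getpid();

	return 0;
}

/* Returns 1 if the state was released, 0 if the call was a no-op. */
static int
pmu_state_fini(void)
{
	/* Nothing to do if init failed or fini already ran. */
	if (state == NULL)
		return 0;

	/* A forked child inherits the pointer but must not release it. */
	if (state->owner != getpid())
		return 0;

	free(state->name);
	free(state);
	state = NULL;

	return 1;
}
```

Clearing the pointer after the free also makes a second call from the same process harmless.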

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v3 0/4] add support for self monitoring
  2022-11-29 10:42     ` [PATCH v3 0/4] add support for self monitoring Morten Brørup
@ 2022-12-13  8:23       ` Tomasz Duszynski
  0 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-13  8:23 UTC (permalink / raw)
  To: Morten Brørup, dev; +Cc: thomas, Jerin Jacob Kollanukkaran

Hi Morten,

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Tuesday, November 29, 2022 11:43 AM
> To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>
> Subject: [EXT] RE: [PATCH v3 0/4] add support for self monitoring
> 
> External Email
> 
> ----------------------------------------------------------------------
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Tuesday, 29 November 2022 10.28
> >
> > This series adds self monitoring support i.e allows to configure and
> > read performance measurement unit (PMU) counters in runtime without
> > using perf utility. This has certain adventages when application runs
> > on isolated cores with nohz_full kernel parameter.
> >
> > Events can be read directly using rte_pmu_read() or using dedicated
> > tracepoint rte_eal_trace_pmu_read(). The latter will cause events to
> > be stored inside CTF file.
> >
> > By design, all enabled events are grouped together and the same group
> > is attached to lcores that use self monitoring funtionality.
> >
> > Events are enabled by names, which need to be read from standard
> > location under sysfs i.e
> >
> > /sys/bus/event_source/devices/PMU/events
> >
> > where PMU is a core pmu i.e one measuring cpu events. As of today raw
> > events are not supported.
> 
> Hi Thomasz,
> 
> I am very interested in this patch series for fast path profiling purposes. (Not using EAL trace,
> but our proprietary profiler.)
> 
> However, it seems that rte_pmu_read() is quite longwinded, compared to rte_pmu_pmc_read().
> 

We need a bit of extra logic to set things up before reading the actual counter, but in reality
cycles are mostly consumed by rte_pmu_pmc_read(). This obviously differs among platforms, so if you
want precise measurements you need to get your hands dirty.

That said, below are results from dpdk-test after running trace_perf_autotest, just to give you some idea.

X86-64

RTE>>trace_perf_autotest
Timer running at 3000.00MHz
            void: cycles=17.739375 ns=5.913125
             u64: cycles=17.348296 ns=5.782765
             int: cycles=17.098724 ns=5.699575
           float: cycles=17.099946 ns=5.699982
          double: cycles=17.229702 ns=5.743234
          string: cycles=31.159907 ns=10.386636
         void_fp: cycles=0.679842 ns=0.226614
        read_pmu: cycles=49.325117 ns=16.441706

ARM64 with RTE_ARM_EAL_RDTSC_USE_PMU

RTE>>trace_perf_autotest
Timer running at 2480.00MHz
            void: cycles=9.413568 ns=3.795793
             u64: cycles=9.386003 ns=3.784678
             int: cycles=9.438701 ns=3.805928
           float: cycles=9.359377 ns=3.773942
          double: cycles=9.372279 ns=3.779145
          string: cycles=24.474899 ns=9.868911
         void_fp: cycles=0.505513 ns=0.203836
        read_pmu: cycles=17.442853 ns=7.033409
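
For reference, the ns column in both runs follows from the cycles column and the reported timer frequency. A small sketch of that conversion (the helper names cycles_to_ns and print_read_pmu_cost are made up for illustration):

```c
#include <stdio.h>

/* Convert a cycle count to nanoseconds given the timer frequency in MHz. */
static double
cycles_to_ns(double cycles, double timer_mhz)
{
	/* timer_mhz cycles elapse per microsecond, so scale by 1000 to get ns. */
	return cycles * 1000.0 / timer_mhz;
}

/* Print the read_pmu cost for the two runs quoted above. */
static void
print_read_pmu_cost(void)
{
	printf("x86-64: %f ns\n", cycles_to_ns(49.325117, 3000.0));
	printf("arm64:  %f ns\n", cycles_to_ns(17.442853, 2480.0));
}
```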

> But perhaps I am just worrying too much, so I will ask: What is the performance cost of using
> rte_pmu_read() - compared to rte_pmu_pmc_read() - in the fast path?
> 
> If there is a non-negligible difference, could you please provide an example of how to configure
> PMU events and use rte_pmu_pmc_read() in an application?
> 

The series comes with some docs, so you can check there how to run it.

> I would primarily be interested in data cache misses and branch mispredictions. But feel free to
> make your own choices for the example.

Raw events are not supported right now, which means you don't have fine-grained control over all events.
You can only use events from the CPU PMU (/sys/bus/event_source/devices/<PMU>/events).



^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v4 0/4] add support for self monitoring
  2022-11-29  9:28   ` [PATCH v3 0/4] add support for self monitoring Tomasz Duszynski
                       ` (4 preceding siblings ...)
  2022-11-29 10:42     ` [PATCH v3 0/4] add support for self monitoring Morten Brørup
@ 2022-12-13 10:43     ` Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                         ` (4 more replies)
  5 siblings, 5 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance monitoring unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
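
For context, each term of a named event is mapped onto one of the perf_event_attr config fields through a file under the PMU's format directory, holding a specification such as config:0-7. Below is a simplified sketch of parsing such a specification, loosely modelled on get_term_format() from patch 1/4. The helper names are made up, the input comes from a string rather than sysfs, and only a single bit range is handled.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Build a mask with bits low..high set (0 <= low <= high <= 63). */
static uint64_t
bit_range_mask(int high, int low)
{
	return (~0ULL << low) & (~0ULL >> (63 - high));
}

/*
 * Parse a format specification such as "config:0-7" or "config1:24".
 * On success store the config index (0, 1 or 2) in *num and the bit mask in *mask.
 */
static int
parse_term_format(const char *spec, int *num, uint64_t *mask)
{
	int high, low, idx = 0, ret;
	char name[16];

	ret = sscanf(spec, "%15[^:]:%d-%d", name, &low, &high);
	if (ret < 2)
		return -ENODATA;
	if (ret == 2)
		high = low; /* single bit field */
	if (low < 0 || high > 63 || low > high)
		return -EINVAL;

	if (strncmp(name, "config", 6) != 0)
		return -EINVAL;
	/* "config" implies index 0; "config1" and "config2" carry the index. */
	if (sscanf(name, "config%d", &idx) < 1)
		idx = 0;
	if (idx < 0 || idx > 2)
		return -EINVAL;

	*num = idx;
	*mask = bit_range_mask(high, low);

	return 0;
}
```

A value taken from the event file is then shifted into position under the returned mask and OR-ed into the selected config word.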

v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 ++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 104 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 519 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  11 +
 lib/eal/include/rte_pmu.h                | 207 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   7 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1075 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.25.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2022-12-13 11:52         ` Morten Brørup
                           ` (2 more replies)
  2022-12-13 10:43       ` [PATCH v4 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                         ` (3 subsequent siblings)
  4 siblings, 3 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 204 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   6 +
 10 files changed, 765 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..fd331af9ee
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+Most architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..049fe19fe3
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,456 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu *rte_pmu;
+
+/*
+ * The following __rte_weak functions provide default no-ops. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	fp = fopen(path, "r");
+	if (!fp)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
+		       group_fd, 0);
+}
+
+static int
+open_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	void *addr;
+	int ret, i;
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int i;
+
+	if (!group->fds)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu->num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	free(group->mmap_pages);
+	free(group->fds);
+
+	group->mmap_pages = NULL;
+	group->fds = NULL;
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
+	int ret;
+
+	if (rte_pmu->num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
+	if (!group->fds) {
+		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
+
+		return -ENOMEM;
+	}
+
+	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
+	if (!group->mmap_pages) {
+		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
+
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (!dirp)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu->name = strdup(name);
+		if (!rte_pmu->name)
+			return -ENOMEM;
+	}
+
+	return rte_pmu->name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = calloc(1, sizeof(*event));
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu->num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	rte_pmu = calloc(1, sizeof(*rte_pmu));
+	if (!rte_pmu) {
+		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
+
+		return;
+	}
+
+	TAILQ_INIT(&rte_pmu->event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(rte_pmu->name);
+	free(rte_pmu);
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
+		free(event->name);
+		free(event);
+	}
+
+	RTE_LCORE_FOREACH_WORKER(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(rte_pmu->name);
+	free(rte_pmu);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..e4b4f6b052
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,204 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int *fds; /**< array of event descriptors */
+	void **mmap_pages; /**< array of pointers to mmapped perf_event_mmap_page structures */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** Pointer to the PMU state container */
+extern struct rte_pmu *rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t offset, width, pmc = 0;
+	uint32_t seq, index;
+	int tries = 100;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return pmc + offset;
+
+		if (--tries == 0) {
+			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
+			break;
+		}
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(int index)
+{
+	int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (!rte_pmu)
+		return 0;
+
+	group = &rte_pmu->group[lcore_id];
+	if (!group->enabled) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (index < 0 || index >= rte_pmu->num_group_events)
+		return 0;
+
+	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..9225f46f67 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,11 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu; # WINDOWS_NO_EXPORT
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +488,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group;
 };
-- 
2.25.1



* [PATCH v4 2/4] eal/arm: support reading ARM PMU events in runtime
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                         ` (2 subsequent siblings)
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev, Ruifeng Wang; +Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

Add support for reading ARM PMU events at runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 104 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 155 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
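
Note for readers: the sketch below is not part of the patch. It shows the
save/enable/restore pattern that pmu_arch_init() and pmu_arch_fini() apply to
/proc/sys/kernel/perf_user_access, using stdio instead of raw read()/write()
so that parsing never depends on an unterminated buffer. The helper names
attr_read_int(), attr_write_int() and attr_save_and_set() are hypothetical.
Unlike open(path, O_WRONLY) in the diff, fopen(path, "w") creates the file if
it is missing, which does not matter for procfs entries but makes the sketch
easy to try against a scratch file.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: read a decimal integer from a sysctl-style file. */
static int
attr_read_int(const char *path, int *val)
{
	FILE *fp = fopen(path, "r");
	int ret;

	if (fp == NULL)
		return -errno;

	ret = fscanf(fp, "%d", val) == 1 ? 0 : -EINVAL;
	fclose(fp);

	return ret;
}

/* Hypothetical helper: write a decimal integer to a sysctl-style file. */
static int
attr_write_int(const char *path, int val)
{
	FILE *fp = fopen(path, "w");
	int ret;

	if (fp == NULL)
		return -errno;

	ret = fprintf(fp, "%d\n", val) > 0 ? 0 : -EIO;
	/* Buffered data is flushed on close, so check that result as well. */
	if (fclose(fp) != 0 && ret == 0)
		ret = -errno;

	return ret;
}

/* Remember the current value in *saved, then switch the attribute to `enabled`. */
static int
attr_save_and_set(const char *path, int *saved, int enabled)
{
	int ret = attr_read_int(path, saved);

	if (ret)
		return ret;

	return attr_write_int(path, enabled);
}
```

At teardown the saved value would be written back with attr_write_int(), the
same way pmu_arch_fini() restores restore_uaccess.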

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index fd331af9ee..f94866dff9 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..10e2984813
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..10ec770ead
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2022 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ] = { 0 };
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index e4b4f6b052..158a616b83 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.25.1



* [PATCH v4 3/4] eal/x86: support reading Intel PMU events in runtime
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2022-12-13 10:43       ` [PATCH v4 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, mb, zhoumin, Tomasz Duszynski

Add support for reading Intel PMU events at runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h
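
Note for readers: the sketch below is not part of the patch. It shows, in
portable C, the two arithmetic steps applied to a raw counter value: joining
the EDX:EAX halves returned by rdpmc, and keeping only the low pmc_width bits
as rte_pmu_read_userpage() does with its pair of shifts. The function names
pmc_combine() and pmc_truncate() are hypothetical, and the width is assumed
to be in the range 1..64 (a shift by 64 would be undefined).

```c
#include <stdint.h>

/* Join the halves the way rdpmc returns them: EDX is high, EAX is low. */
static inline uint64_t
pmc_combine(uint32_t high, uint32_t low)
{
	return ((uint64_t)high << 32) | (uint64_t)low;
}

/* Keep only the low `width` bits of a raw counter value (1 <= width <= 64). */
static inline uint64_t
pmc_truncate(uint64_t raw, unsigned int width)
{
	raw <<= 64 - width;
	raw >>= 64 - width;

	return raw;
}
```

For a 48-bit counter, any bits above bit 47 in the raw value are discarded
before the kernel-provided offset is added.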

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index f94866dff9..016204c083 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 158a616b83..3d90f4baf7 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..a2cd849fb1
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint32_t high, low;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return ((uint64_t)high << 32) | (uint64_t)low;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.25.1



* [PATCH v4 4/4] eal: add PMU support to tracing library
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
                         ` (2 preceding siblings ...)
  2022-12-13 10:43       ` [PATCH v4 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2022-12-13 10:43       ` Tomasz Duszynski
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-13 10:43 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori; +Cc: thomas, mb, zhoumin, Tomasz Duszynski

In order to profile an app one needs to store a significant amount of samples
somewhere for later analysis. Since the trace library supports
storing data in the CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 ++++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 ++
 lib/eal/common/rte_pmu.c                 | 63 ++++++++++++++++++++++++
 lib/eal/include/rte_eal_trace.h          | 11 +++++
 lib/eal/version.map                      |  1 +
 7 files changed, 119 insertions(+)
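
Note for readers: the sketch below is not part of the patch. It illustrates
why the index passed to rte_eal_trace_pmu_read() follows the order of names in
the e=ev1[,ev2,...] expression: the list is split on commas and each name gets
the next free slot. The helper split_events() and the MAX_EVENTS and MAX_NAME
limits are hypothetical; the patch itself uses strtok() and registers each
name through rte_pmu_add_event().

```c
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 8
#define MAX_NAME 32

/*
 * Hypothetical helper: split "ev1,ev2,..." and store the names in order.
 * Empty or over-long names are skipped. Returns the number of names stored.
 */
static int
split_events(const char *list, char names[MAX_EVENTS][MAX_NAME])
{
	int num = 0;

	while (*list != '\0' && num < MAX_EVENTS) {
		size_t len = strcspn(list, ",");

		if (len > 0 && len < MAX_NAME) {
			memcpy(names[num], list, len);
			names[num][len] = '\0';
			num++;
		}

		list += len;
		if (*list == ',')
			list++;
	}

	return num;
}
```

With the list "cpu_cycles,l1d_cache" the first name lands in slot 0 and the
second in slot 1, matching the indices used in the documentation example.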

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers the dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..9a845fd86f 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 049fe19fe3..be105493ea 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -19,6 +19,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -403,11 +404,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (!copy)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (!trace)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	rte_pmu = calloc(1, sizeof(*rte_pmu));
 	if (!rte_pmu) {
 		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
@@ -429,6 +489,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(rte_pmu->name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..2a10f63e97 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(int index),
+	uint64_t val;
+	rte_trace_point_emit_int(index);
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 9225f46f67..73803f9601 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
-- 
2.25.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-12-13 11:52         ` Morten Brørup
  2022-12-14  9:38           ` Tomasz Duszynski
  2022-12-15  8:46         ` Mattias Rönnblom
  2023-01-09  7:37         ` Ruifeng Wang
  2 siblings, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2022-12-13 11:52 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Tuesday, 13 December 2022 11.44
> 
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---


> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,456 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >>
> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1))
> & (m))
> +#endif
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures
> should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/format/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is
> implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(),
> rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n",
> event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not
> supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED,
> group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE,
> PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	free(group->mmap_pages);
> +	free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group-
> >fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = calloc(rte_pmu->num_group_events,
> sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-
> %d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-
> %d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE,
> PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-
> %d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH
> "/%s/events/%s", rte_pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = calloc(1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name, event-
> >index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>          'rte_pci_dev_features.h',
>          'rte_per_lcore.h',
>          'rte_pflock.h',
> +        'rte_pmu.h',
>          'rte_random.h',
>          'rte_reciprocal.h',
>          'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..e4b4f6b052
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped
> perf_event_attr structures */

There seems to be a lot of indirection involved here. Why are these arrays not statically sized, instead of dynamically allocated?

Also, what is the reason for making the type struct perf_event_mmap_page **mmap_pages opaque by using void **mmap_pages instead?

> +	bool enabled; /**< true if group was enabled on particular lcore
> */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under
> /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *rte_pmu;

Again, why not just extern struct rte_pmu, instead of dynamic allocation?

> +
> +/** Each architecture supporting PMU needs to provide its own version
> */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get
> perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under
> /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of
> support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!rte_pmu)
> +		return 0;
> +
> +	group = &rte_pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}

Why is the group not enabled in the setup function, rte_pmu_add_event(), instead of here, in the hot path?

> +
> +	if (index < 0 || index >= rte_pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage((struct perf_event_mmap_page
> *)group->mmap_pages[index]);

Using fixed size arrays instead of multiple indirections via pointers is faster. It could be:

return rte_pmu_read_userpage((struct perf_event_mmap_page *)rte_pmu.group[lcore_id].mmap_pages[index]);

With or without suggested performance improvements...

Series-acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 11:52         ` Morten Brørup
@ 2022-12-14  9:38           ` Tomasz Duszynski
  2022-12-14 10:41             ` Morten Brørup
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2022-12-14  9:38 UTC (permalink / raw)
  To: Morten Brørup, dev; +Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin

Hello Morten, 

Thanks for review. Answers inline. 

[...]

> > +/**
> > + * @file
> > + *
> > + * PMU event tracing operations
> > + *
> > + * This file defines generic API and types necessary to setup PMU and
> > + * read selected counters in runtime.
> > + */
> > +
> > +/**
> > + * A structure describing a group of events.
> > + */
> > +struct rte_pmu_event_group {
> > +	int *fds; /**< array of event descriptors */
> > +	void **mmap_pages; /**< array of pointers to mmapped
> > perf_event_attr structures */
> 
> There seems to be a lot of indirection involved here. Why are these arrays not statically sized,
> instead of dynamically allocated?
> 

Different architectures/PMUs impose limits on the number of simultaneously enabled counters. So, to
relieve the pain of thinking about it and adding macros for each and every arch, I decided to
dynamically allocate as many as the user wants. The assumption also holds that the user knows about
the tradeoffs of using too many counters and hence will not enable too many events at once.

> Also, what is the reason for making the type struct perf_event_mmap_page **mmap_pages opaque by
> using void **mmap_pages instead?

I think the part doing mmap/munmap was written first, hence void ** was chosen in the first place.

> 
> > +	bool enabled; /**< true if group was enabled on particular lcore
> > */
> > +};
> > +
> > +/**
> > + * A structure describing an event.
> > + */
> > +struct rte_pmu_event {
> > +	char *name; /** name of an event */
> > +	int index; /** event index into fds/mmap_pages */
> > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> > +
> > +/**
> > + * A PMU state container.
> > + */
> > +struct rte_pmu {
> > +	char *name; /** name of core PMU listed under
> > /sys/bus/event_source/devices */
> > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> > event group data */
> > +	int num_group_events; /**< number of events in a group */
> > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> > events */
> > +};
> > +
> > +/** Pointer to the PMU state container */ extern struct rte_pmu
> > +*rte_pmu;
> 
> Again, why not just extern struct rte_pmu, instead of dynamic allocation?
> 

No strong opinions here since this is a matter of personal preference. Can be removed
in the next version. 

> > +
> > +/** Each architecture supporting PMU needs to provide its own version
> > */
> > +#ifndef rte_pmu_pmc_read
> > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
> > +
> > +/**
> > + * @internal
> > + *
> > + * Read PMU counter.
> > + *
> > + * @param pc
> > + *   Pointer to the mmapped user page.
> > + * @return
> > + *   Counter value read from hardware.
> > + */
> > +__rte_internal
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> > +	uint64_t offset, width, pmc = 0;
> > +	uint32_t seq, index;
> > +	int tries = 100;
> > +
> > +	for (;;) {
> > +		seq = pc->lock;
> > +		rte_compiler_barrier();
> > +		index = pc->index;
> > +		offset = pc->offset;
> > +		width = pc->pmc_width;
> > +
> > +		if (likely(pc->cap_user_rdpmc && index)) {
> > +			pmc = rte_pmu_pmc_read(index - 1);
> > +			pmc <<= 64 - width;
> > +			pmc >>= 64 - width;
> > +		}
> > +
> > +		rte_compiler_barrier();
> > +
> > +		if (likely(pc->lock == seq))
> > +			return pmc + offset;
> > +
> > +		if (--tries == 0) {
> > +			RTE_LOG(DEBUG, EAL, "failed to get
> > perf_event_mmap_page lock\n");
> > +			break;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * @internal
> > + *
> > + * Enable group of events for a given lcore.
> > + *
> > + * @param lcore_id
> > + *   The identifier of the lcore.
> > + * @return
> > + *   0 in case of success, negative value otherwise.
> > + */
> > +__rte_internal
> > +int
> > +rte_pmu_enable_group(int lcore_id);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Add event to the group of enabled events.
> > + *
> > + * @param name
> > + *   Name of an event listed under
> > /sys/bus/event_source/devices/pmu/events.
> > + * @return
> > + *   Event index in case of success, negative value otherwise.
> > + */
> > +__rte_experimental
> > +int
> > +rte_pmu_add_event(const char *name);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Read hardware counter configured to count occurrences of an event.
> > + *
> > + * @param index
> > + *   Index of an event to be read.
> > + * @return
> > + *   Event value read from register. In case of errors or lack of
> > support
> > + *   0 is returned. In other words, stream of zeros in a trace file
> > + *   indicates problem with reading particular PMU event register.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read(int index)
> > +{
> > +	int lcore_id = rte_lcore_id();
> > +	struct rte_pmu_event_group *group;
> > +	int ret;
> > +
> > +	if (!rte_pmu)
> > +		return 0;
> > +
> > +	group = &rte_pmu->group[lcore_id];
> > +	if (!group->enabled) {
> > +		ret = rte_pmu_enable_group(lcore_id);
> > +		if (ret)
> > +			return 0;
> > +
> > +		group->enabled = true;
> > +	}
> 
> Why is the group not enabled in the setup function, rte_pmu_add_event(), instead of here, in the
> hot path?
> 

When this is executed for the very first time, the CPU will obviously have more work to do,
but afterwards the setup path is not taken, hence far fewer CPU cycles are required.

Setup is executed solely by the main lcore, before the lcores are launched, hence some info passed to
the perf_event_open() syscall is missing, pid (via rte_gettid()) being an example here.
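
A minimal sketch of the lazy-enable pattern described above, not the patch
code itself. The names groups, enable_group() and read_counter() are
hypothetical stand-ins for rte_pmu->group[], rte_pmu_enable_group() and
rte_pmu_read(); MAX_THREADS stands in for RTE_MAX_LCORE. It only illustrates
that the slow path runs once per thread, on the thread that does the reading:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_THREADS 4 /* hypothetical stand-in for RTE_MAX_LCORE */

struct group {
	bool enabled;     /* set once the slow path has completed */
	int setup_calls;  /* counts slow-path invocations, for illustration only */
	uint64_t counter; /* stand-in for a hardware counter value */
};

static struct group groups[MAX_THREADS];

/*
 * Slow path: stand-in for rte_pmu_enable_group(). In the real code this is
 * where perf_event_open() needs the tid of the calling thread.
 */
static int
enable_group(int id)
{
	groups[id].setup_calls++;
	return 0;
}

/* Fast path with a one-time lazy setup, mirroring the shape of rte_pmu_read(). */
static uint64_t
read_counter(int id)
{
	struct group *g = &groups[id];

	if (!g->enabled) {
		if (enable_group(id))
			return 0;
		g->enabled = true;
	}

	return ++g->counter;
}
```

Calling read_counter(1) several times in a row invokes enable_group(1) only
on the first call; later calls pay just for the enabled check.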

> > +
> > +	if (index < 0 || index >= rte_pmu->num_group_events)
> > +		return 0;
> > +
> > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
> > *)group->mmap_pages[index]);
> 
> Using fixed size arrays instead of multiple indirections via pointers is faster. It could be:
> 
> return rte_pmu_read_userpage((struct perf_event_mmap_page
> *)rte_pmu.group[lcore_id].mmap_pages[index]);
> 
> With or without suggested performance improvements...
> 
> Series-acked-by: Morten Brørup <mb@smartsharesystems.com>


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-14  9:38           ` Tomasz Duszynski
@ 2022-12-14 10:41             ` Morten Brørup
  2022-12-15  8:22               ` Morten Brørup
  2023-01-05 21:14               ` Tomasz Duszynski
  0 siblings, 2 replies; 205+ messages in thread
From: Morten Brørup @ 2022-12-14 10:41 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

+CC: Mattias, see my comment below about per-thread constructor for this

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 14 December 2022 10.39
> 
> Hello Morten,
> 
> Thanks for review. Answers inline.
> 
> [...]
> 
> > > +/**
> > > + * @file
> > > + *
> > > + * PMU event tracing operations
> > > + *
> > > + * This file defines generic API and types necessary to setup PMU
> and
> > > + * read selected counters in runtime.
> > > + */
> > > +
> > > +/**
> > > + * A structure describing a group of events.
> > > + */
> > > +struct rte_pmu_event_group {
> > > +	int *fds; /**< array of event descriptors */
> > > +	void **mmap_pages; /**< array of pointers to mmapped
> > > perf_event_attr structures */
> >
> > There seems to be a lot of indirection involved here. Why are these
> arrays not statically sized,
> > instead of dynamically allocated?
> >
> 
> Different architectures/PMUs impose limits on the number of simultaneously
> enabled counters. So, to relieve the pain of thinking about it and adding
> macros for each and every arch, I decided to dynamically allocate as many
> as the user wants. The assumption also holds that the user knows about the
> tradeoffs of using too many counters and hence will not enable too many
> events at once.

The DPDK convention is to use fixed size arrays (with a maximum size, e.g. RTE_MAX_ETHPORTS) in the fast path, for performance reasons.

Please use fixed size arrays instead of dynamically allocated arrays.
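
A rough sketch of what the fixed-size layout could look like, under stated
assumptions: MAX_NUM_GROUP_EVENTS is a hypothetical compile-time cap that does
not exist in the patch, MAX_LCORES stands in for RTE_MAX_LCORE, and void * is
used only to keep the sketch free of Linux headers (the real member would be
struct perf_event_mmap_page *):

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_LCORES 128         /* hypothetical stand-in for RTE_MAX_LCORE */
#define MAX_NUM_GROUP_EVENTS 8 /* hypothetical compile-time cap on events */

struct pmu_event_group {
	int fds[MAX_NUM_GROUP_EVENTS];          /* event descriptors */
	void *mmap_pages[MAX_NUM_GROUP_EVENTS]; /* mmapped user pages */
	bool enabled;
};

struct pmu {
	struct pmu_event_group group[MAX_LCORES];
	int num_group_events;
};

/* Statically allocated state: no pointer to chase before reaching a group. */
static struct pmu pmu;

/* Slow path: hand out the next event index, or -1 once the cap is reached. */
static int
pmu_reserve_event(void)
{
	if (pmu.num_group_events >= MAX_NUM_GROUP_EVENTS)
		return -1;
	return pmu.num_group_events++;
}

/* Fast path: a bounds check followed by a single indexed load. */
static inline void *
pmu_page(int lcore_id, int index)
{
	if (index < 0 || index >= pmu.num_group_events)
		return NULL;
	return pmu.group[lcore_id].mmap_pages[index];
}
```

The cost of this layout is that the cap must be chosen up front; adding an
event beyond it has to fail in the slow path, as pmu_reserve_event() does.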

> 
> > Also, what is the reason for making the type struct perf_event_mmap_page
> > **mmap_pages opaque by using void **mmap_pages instead?
> 
> I think the part doing mmap/munmap was written first, hence void **
> was chosen in the first place.

Please update it, so the actual type is reflected here.

> 
> >
> > > +	bool enabled; /**< true if group was enabled on particular lcore
> > > */
> > > +};
> > > +
> > > +/**
> > > + * A structure describing an event.
> > > + */
> > > +struct rte_pmu_event {
> > > +	char *name; /** name of an event */
> > > +	int index; /** event index into fds/mmap_pages */
> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> > > +
> > > +/**
> > > + * A PMU state container.
> > > + */
> > > +struct rte_pmu {
> > > +	char *name; /** name of core PMU listed under
> > > /sys/bus/event_source/devices */
> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> > > event group data */
> > > +	int num_group_events; /**< number of events in a group */
> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> > > events */

The event_list is used in slow path only, so it can remain a list - i.e. no change requested here. :-)

> > > +};
> > > +
> > > +/** Pointer to the PMU state container */ extern struct rte_pmu
> > > +*rte_pmu;
> >
> > Again, why not just extern struct rte_pmu, instead of dynamic
> allocation?
> >
> 
> No strong opinions here since this is a matter of personal preference.
> Can be removed
> in the next version.

Yes, please.

> 
> > > +
> > > +/** Each architecture supporting PMU needs to provide its own
> version
> > > */
> > > +#ifndef rte_pmu_pmc_read
> > > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Read PMU counter.
> > > + *
> > > + * @param pc
> > > + *   Pointer to the mmapped user page.
> > > + * @return
> > > + *   Counter value read from hardware.
> > > + */
> > > +__rte_internal
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> > > +	uint64_t offset, width, pmc = 0;
> > > +	uint32_t seq, index;
> > > +	int tries = 100;
> > > +
> > > +	for (;;) {

As a matter of personal preference, I would write this loop differently:

+ for (tries = 100; tries != 0; tries--) {

> > > +		seq = pc->lock;
> > > +		rte_compiler_barrier();
> > > +		index = pc->index;
> > > +		offset = pc->offset;
> > > +		width = pc->pmc_width;
> > > +
> > > +		if (likely(pc->cap_user_rdpmc && index)) {

Why "&& index"? The way I read [man perf_event_open], index 0 is perfectly valid.

[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html

> > > +			pmc = rte_pmu_pmc_read(index - 1);
> > > +			pmc <<= 64 - width;
> > > +			pmc >>= 64 - width;
> > > +		}
> > > +
> > > +		rte_compiler_barrier();
> > > +
> > > +		if (likely(pc->lock == seq))
> > > +			return pmc + offset;
> > > +
> > > +		if (--tries == 0) {
> > > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> > > +			break;
> > > +		}

- Remove the 4 above lines of code, and move the debug log message to the end of the function instead.

> > > +	}

+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
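
For illustration, a minimal self-contained sketch of the loop shape suggested above: a bounded for loop that returns on a consistent snapshot, with the log message emitted once after the loop. The struct fake_page type, the writer_hook_t test hook and the function names are hypothetical stand-ins and are not part of the patch; the inline asm is only meant to play the role of rte_compiler_barrier().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the perf_event_mmap_page fields used here. */
struct fake_page {
	volatile uint32_t lock;   /* sequence counter bumped by the writer */
	volatile uint64_t offset; /* value protected by the counter */
};

/* Hypothetical hook, called between the two reads of lock to emulate a writer. */
typedef void (*writer_hook_t)(struct fake_page *pc);

static uint64_t
read_page_bounded(struct fake_page *pc, writer_hook_t hook)
{
	uint64_t offset;
	uint32_t seq;
	int tries;

	for (tries = 100; tries != 0; tries--) {
		seq = pc->lock;
		__asm__ __volatile__("" ::: "memory"); /* compiler barrier */
		offset = pc->offset;
		if (hook != NULL)
			hook(pc);
		__asm__ __volatile__("" ::: "memory"); /* compiler barrier */

		/* Snapshot is consistent if the counter did not move. */
		if (pc->lock == seq)
			return offset;
	}

	/* Reached only after all retries failed. */
	fprintf(stderr, "failed to get consistent snapshot\n");
	return 0;
}

/* Emulates a writer that updates the page on every read attempt. */
static void
always_writing(struct fake_page *pc)
{
	pc->lock++;
}
```

With no hook the first attempt succeeds; with always_writing as the hook every attempt is rejected and the function falls through to the single log statement and returns 0.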

> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Enable group of events for a given lcore.
> > > + *
> > > + * @param lcore_id
> > > + *   The identifier of the lcore.
> > > + * @return
> > > + *   0 in case of success, negative value otherwise.
> > > + */
> > > +__rte_internal
> > > +int
> > > +rte_pmu_enable_group(int lcore_id);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Add event to the group of enabled events.
> > > + *
> > > + * @param name
> > > + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> > > + * @return
> > > + *   Event index in case of success, negative value otherwise.
> > > + */
> > > +__rte_experimental
> > > +int
> > > +rte_pmu_add_event(const char *name);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Read hardware counter configured to count occurrences of an event.
> > > + *
> > > + * @param index
> > > + *   Index of an event to be read.
> > > + * @return
> > > + *   Event value read from register. In case of errors or lack of support
> > > + *   0 is returned. In other words, stream of zeros in a trace file
> > > + *   indicates problem with reading particular PMU event register.
> > > + */
> > > +__rte_experimental
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read(int index)

The index type can be changed from int to uint32_t. This also eliminates the "(index < 0" part of the comparison further below in this function.
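
A small sketch of why the unsigned type makes the lower-bound test redundant. NUM_EVENTS and both helper names are hypothetical; in the patch the limit is the num_group_events field, which is an int there and would presumably need to become unsigned as well to avoid a signed/unsigned comparison.

```c
#include <stdint.h>

#define NUM_EVENTS 4u /* hypothetical number of configured events */

/* Signed index: both bounds have to be checked explicitly. */
static int
index_valid_signed(int index)
{
	return index >= 0 && index < (int)NUM_EVENTS;
}

/*
 * Unsigned index: a negative value passed by the caller converts to a
 * very large number, so the single upper-bound check rejects it too.
 */
static int
index_valid_unsigned(uint32_t index)
{
	return index < NUM_EVENTS;
}
```

Both helpers accept 0..3 and reject everything else, but the unsigned variant needs one comparison instead of two.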

> > > +{
> > > +	int lcore_id = rte_lcore_id();
> > > +	struct rte_pmu_event_group *group;
> > > +	int ret;
> > > +
> > > +	if (!rte_pmu)
> > > +		return 0;
> > > +
> > > +	group = &rte_pmu->group[lcore_id];
> > > +	if (!group->enabled) {

Optimized: if (unlikely(!group->enabled)) {

> > > +		ret = rte_pmu_enable_group(lcore_id);
> > > +		if (ret)
> > > +			return 0;
> > > +
> > > +		group->enabled = true;
> > > +	}
> >
> > Why is the group not enabled in the setup function, rte_pmu_add_event(), instead of here, in the
> > hot path?
> >
> 
> When this is executed for the very first time then cpu will have obviously more work to do
> but afterwards setup path is not taken hence much less cpu cycles are required.
> 
> Setup is executed by main lcore solely, before lcores are executed hence some info passed to
> SYS_perf_event_open ioctl() is missing, pid (via rte_gettid()) being an example here.

OK. Thank you for the explanation. Since this is impossible at setup time, it has to be done at runtime.

@Mattias: Another good example of something that would belong in per-thread constructors, as my suggested feature creep in [1].

[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/

> 
> > > +
> > > +	if (index < 0 || index >= rte_pmu->num_group_events)

Optimized: if (unlikely(index >= rte_pmu.num_group_events))

> > > +		return 0;
> > > +
> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> >
> > Using fixed size arrays instead of multiple indirections via pointers is faster. It could be:
> >
> > return rte_pmu_read_userpage((struct perf_event_mmap_page *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >
> > With or without suggested performance improvements...
> >
> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
> 


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-14 10:41             ` Morten Brørup
@ 2022-12-15  8:22               ` Morten Brørup
  2022-12-16  7:33                 ` Morten Brørup
  2023-01-05 21:14               ` Tomasz Duszynski
  1 sibling, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2022-12-15  8:22 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 14 December 2022 11.41
> 
> +CC: Mattias, see my comment below about per-thread constructor for this
> 
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Wednesday, 14 December 2022 10.39
> >
> > Hello Morten,
> >
> > Thanks for review. Answers inline.
> >
> > [...]
> >
> > > > +__rte_experimental
> > > > +static __rte_always_inline uint64_t
> > > > +rte_pmu_read(int index)
> 
> The index type can be changed from int to uint32_t. This also
> eliminates the "(index < 0" part of the comparison further below in
> this function.
> 
> > > > +{
> > > > +	int lcore_id = rte_lcore_id();
> > > > +	struct rte_pmu_event_group *group;
> > > > +	int ret;
> > > > +
> > > > +	if (!rte_pmu)
> > > > +		return 0;
> > > > +
> > > > +	group = &rte_pmu->group[lcore_id];
> > > > +	if (!group->enabled) {
> 
> Optimized: if (unlikely(!group->enabled)) {
> 
> > > > +		ret = rte_pmu_enable_group(lcore_id);
> > > > +		if (ret)
> > > > +			return 0;
> > > > +
> > > > +		group->enabled = true;
> > > > +	}
> > >
> > > Why is the group not enabled in the setup function, rte_pmu_add_event(), instead of here, in the
> > > hot path?
> > >
> >
> > When this is executed for the very first time then cpu will have obviously more work to do
> > but afterwards setup path is not taken hence much less cpu cycles are required.
> >
> > Setup is executed by main lcore solely, before lcores are executed hence some info passed to
> > SYS_perf_event_open ioctl() is missing, pid (via rte_gettid()) being an example here.
> 
> OK. Thank you for the explanation. Since impossible at setup, it has to
> be done at runtime.
> 
> @Mattias: Another good example of something that would belong in
> per-thread constructors, as my suggested feature creep in [1].
> 
> [1]:
> http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/

I just realized that this initialization is per-lcore (not per thread), so you can use rte_lcore_callback_register() to register a per-lcore initialization function, and move rte_pmu_enable_group(lcore_id) there.

-Morten


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 1/4] eal: add generic support for reading PMU events
  2022-11-11  9:43 ` [PATCH 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2022-12-15  8:33   ` Mattias Rönnblom
  0 siblings, 0 replies; 205+ messages in thread
From: Mattias Rönnblom @ 2022-12-15  8:33 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj

On 2022-11-11 10:43, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>   app/test/meson.build                  |   1 +
>   app/test/test_pmu.c                   |  41 +++
>   doc/guides/prog_guide/profile_app.rst |   8 +
>   lib/eal/common/meson.build            |   3 +
>   lib/eal/common/pmu_private.h          |  41 +++
>   lib/eal/common/rte_pmu.c              | 455 ++++++++++++++++++++++++++
>   lib/eal/include/meson.build           |   1 +
>   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>   lib/eal/linux/eal.c                   |   4 +
>   lib/eal/version.map                   |   3 +
>   10 files changed, 761 insertions(+)
>   create mode 100644 app/test/test_pmu.c
>   create mode 100644 lib/eal/common/pmu_private.h
>   create mode 100644 lib/eal/common/rte_pmu.c
>   create mode 100644 lib/eal/include/rte_pmu.h
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..93b3300309 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -143,6 +143,7 @@ test_sources = files(
>           'test_timer_racecond.c',
>           'test_timer_secondary.c',
>           'test_ticketlock.c',
> +        'test_pmu.c',
>           'test_trace.c',
>           'test_trace_register.c',
>           'test_trace_perf.c',
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..fd331af9ee
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <rte_pmu.h>
> +
> +#include "test.h"
> +
> +static int
> +test_pmu_read(void)
> +{
> +	uint64_t val = 0;
> +	int tries = 10;
> +	int event = -1;
> +
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	if (val == 0)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index bd6700ef85..8fc1b20cab 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,14 @@ Profile Your Application
>   The following sections describe methods of profiling DPDK applications on
>   different architectures.
>   
> +Performance counter based profiling
> +-----------------------------------
> +
> +The majority of architectures support some sort of hardware measurement unit which provides a set of
> +programmable counters that monitor specific events. There are different tools which can gather
> +that information, perf being an example here. Though in some scenarios, e.g. when CPU cores are
> +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> +read specific events directly from application via ``rte_pmu_read()``.
>   
>   Profiling on x86
>   ----------------
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..d6d05b56f3 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -38,6 +38,9 @@ sources += files(
>           'rte_service.c',
>           'rte_version.c',
>   )
> +if is_linux
> +    sources += files('rte_pmu.c')
> +endif
>   if is_linux or is_windows
>       sources += files('eal_common_dynmem.c')
>   endif
> diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> new file mode 100644
> index 0000000000..cade4245e6
> --- /dev/null
> +++ b/lib/eal/common/pmu_private.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +/**
> + * Initialize PMU tracing internals.
> + */
> +void
> +eal_pmu_init(void);
> +
> +/**
> + * Cleanup PMU internals.
> + */
> +void
> +eal_pmu_fini(void);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..7d3bd57d1d
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,455 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_malloc.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
> +
> +struct rte_pmu *pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	rte_free(group->mmap_pages);
> +	rte_free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &pmu->group[lcore_id];
> +	int ret;
> +
> +	if (pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = rte_zmalloc(NULL, pmu->num_group_events, sizeof(*group->fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = rte_zmalloc(NULL, pmu->num_group_events, sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		pmu->name = strdup(name);
> +		if (!pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = rte_zmalloc(NULL, 1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		rte_free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	pmu = rte_calloc(NULL, 1, sizeof(*pmu), RTE_CACHE_LINE_SIZE);
> +	if (!pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(pmu->name);
> +	rte_free(pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&pmu->event_list, event, next);
> +		free(event->name);
> +		rte_free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(pmu->name);
> +	rte_free(pmu);
> +}
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>           'rte_pci_dev_features.h',
>           'rte_per_lcore.h',
>           'rte_pflock.h',
> +        'rte_pmu.h',
>           'rte_random.h',
>           'rte_reciprocal.h',
>           'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..5955c22779
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();

I'm guessing this should be a load-acquire instead. Less heavy-handed 
than a compiler barrier on TSO CPUs, and works on weakly ordered systems 
as well (unlike the compiler barrier).

This looks like an open-coded sequence lock, so take a look in 
rte_seqcount.h for inspiration.
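
To make the suggestion concrete, here is a rough sketch of a sequence-lock reader written with plain C11 atomics rather than the DPDK helpers. Everything in it (struct seq_record, seq_write(), seq_read(), the convention that an odd sequence number means a write is in progress) is an assumption made for illustration; it is not the perf mmap page layout and not the rte_seqcount API.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sequence-protected record. */
struct seq_record {
	_Atomic uint32_t seq;   /* odd while a write is in progress (assumed) */
	_Atomic uint64_t value; /* payload, accessed with relaxed atomics */
};

static void
seq_write(struct seq_record *r, uint64_t v)
{
	uint32_t s = atomic_load_explicit(&r->seq, memory_order_relaxed);

	/* Mark the record as being modified. */
	atomic_store_explicit(&r->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->value, v, memory_order_relaxed);
	/* Publish: the payload store is ordered before this store. */
	atomic_store_explicit(&r->seq, s + 2, memory_order_release);
}

/* Returns 0 and fills *out on success, -1 if no stable snapshot was seen. */
static int
seq_read(struct seq_record *r, uint64_t *out)
{
	int tries;

	for (tries = 100; tries != 0; tries--) {
		/* Load-acquire: later loads cannot be reordered before it. */
		uint32_t begin = atomic_load_explicit(&r->seq, memory_order_acquire);
		uint64_t v = atomic_load_explicit(&r->value, memory_order_relaxed);

		/* Keep the payload load ahead of the re-check of seq. */
		atomic_thread_fence(memory_order_acquire);
		if ((begin & 1) == 0 &&
		    atomic_load_explicit(&r->seq, memory_order_relaxed) == begin) {
			*out = v;
			return 0;
		}
	}

	return -1;
}
```

The acquire load at the start and the acquire fence before the re-check take the place of the two compiler barriers in the quoted loop; whether the perf user page needs the stronger ordering is the open question raised above.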

> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!pmu)
> +		return 0;
> +
> +	group = &pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}
> +
> +	if (index < 0 || index >= pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
> +
> +#else /* !RTE_EXEC_ENV_LINUX */
> +
> +__rte_experimental
> +static int __rte_unused
> +rte_pmu_add_event(__rte_unused const char *name)
> +{
> +	return -1;
> +}
> +
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(__rte_unused int index)
> +{
> +	return 0;
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 8c118d0d9f..751a13b597 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -53,6 +53,7 @@
>   #include "eal_options.h"
>   #include "eal_vfio.h"
>   #include "hotplug_mp.h"
> +#include "pmu_private.h"
>   
>   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
>   
> @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
>   		return -1;
>   	}
>   
> +	eal_pmu_init();
> +
>   	if (rte_eal_tailqs_init() < 0) {
>   		rte_eal_init_alert("Cannot init tail queues for objects");
>   		rte_errno = EFAULT;
> @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
>   	eal_bus_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_pmu_fini();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..e870c87493 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -432,6 +432,8 @@ EXPERIMENTAL {
>   	rte_thread_set_priority;
>   
>   	# added in 22.11
> +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> +	rte_pmu_read; # WINDOWS_NO_EXPORT
>   	rte_thread_attr_get_affinity;
>   	rte_thread_attr_init;
>   	rte_thread_attr_set_affinity;
> @@ -483,4 +485,5 @@ INTERNAL {
>   	rte_mem_map;
>   	rte_mem_page_size;
>   	rte_mem_unmap;
> +	rte_pmu_enable_group;
>   };


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-12-13 11:52         ` Morten Brørup
@ 2022-12-15  8:46         ` Mattias Rönnblom
  2023-01-04 15:47           ` Tomasz Duszynski
  2023-01-09  7:37         ` Ruifeng Wang
  2 siblings, 1 reply; 205+ messages in thread
From: Mattias Rönnblom @ 2022-12-15  8:46 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj, mb, zhoumin

On 2022-12-13 11:43, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>   app/test/meson.build                  |   1 +
>   app/test/test_pmu.c                   |  41 +++
>   doc/guides/prog_guide/profile_app.rst |   8 +
>   lib/eal/common/meson.build            |   3 +
>   lib/eal/common/pmu_private.h          |  41 +++
>   lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
>   lib/eal/include/meson.build           |   1 +
>   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>   lib/eal/linux/eal.c                   |   4 +
>   lib/eal/version.map                   |   6 +
>   10 files changed, 765 insertions(+)
>   create mode 100644 app/test/test_pmu.c
>   create mode 100644 lib/eal/common/pmu_private.h
>   create mode 100644 lib/eal/common/rte_pmu.c
>   create mode 100644 lib/eal/include/rte_pmu.h
> 
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..93b3300309 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -143,6 +143,7 @@ test_sources = files(
>           'test_timer_racecond.c',
>           'test_timer_secondary.c',
>           'test_ticketlock.c',
> +        'test_pmu.c',
>           'test_trace.c',
>           'test_trace_register.c',
>           'test_trace_perf.c',
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..fd331af9ee
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <rte_pmu.h>
> +
> +#include "test.h"
> +
> +static int
> +test_pmu_read(void)
> +{
> +	uint64_t val = 0;
> +	int tries = 10;
> +	int event = -1;
> +
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	if (val == 0)
> +		return TEST_FAILED;
> +
> +	return TEST_SUCCESS;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index 14292d4c25..a8b501fe0c 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,14 @@ Profile Your Application
>   The following sections describe methods of profiling DPDK applications on
>   different architectures.
>   
> +Performance counter based profiling
> +-----------------------------------
> +
> +Majority of architectures support some sort hardware measurement unit which provides a set of
> +programmable counters that monitor specific events. There are different tools which can gather
> +that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
> +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> +read specific events directly from application via ``rte_pmu_read()``.
>   
>   Profiling on x86
>   ----------------
> diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> index 917758cc65..d6d05b56f3 100644
> --- a/lib/eal/common/meson.build
> +++ b/lib/eal/common/meson.build
> @@ -38,6 +38,9 @@ sources += files(
>           'rte_service.c',
>           'rte_version.c',
>   )
> +if is_linux
> +    sources += files('rte_pmu.c')
> +endif
>   if is_linux or is_windows
>       sources += files('eal_common_dynmem.c')
>   endif
> diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> new file mode 100644
> index 0000000000..cade4245e6
> --- /dev/null
> +++ b/lib/eal/common/pmu_private.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +/**
> + * Initialize PMU tracing internals.
> + */
> +void
> +eal_pmu_init(void);
> +
> +/**
> + * Cleanup PMU internals.
> + */
> +void
> +eal_pmu_fini(void);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..049fe19fe3
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,456 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);

This code might crash in case a long name is supplied, which is maybe
not what you want. Truncation and a "file not found" is probably better.
I believe there is a snprintf lookalike with these properties in DPDK.

> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)

lcore_id is an unsigned int.
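
To illustrate why the signedness matters: rte_lcore_id() returns an
unsigned value, and LCORE_ID_ANY (UINT32_MAX) is what a non-EAL thread
gets. Stored in an int, that value typically becomes -1 and would index
before the start of the group array. A minimal sketch of an unsigned
range check follows; MAX_LCORE, ANY_LCORE and lcore_id_valid() are
hypothetical stand-ins, not DPDK definitions:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_LCORE 128U       /* stand-in for RTE_MAX_LCORE */
#define ANY_LCORE UINT32_MAX /* stand-in for LCORE_ID_ANY */

/*
 * Hypothetical helper: with an unsigned id a single comparison rejects
 * both too-large ids and the "any lcore" marker.
 */
static bool
lcore_id_valid(unsigned int lcore_id)
{
	return lcore_id < MAX_LCORE;
}
```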

> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;

group->fds == NULL

This coding style violation appears throughout the patch set.

> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	free(group->mmap_pages);
> +	free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)

unsigned

> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = calloc(1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);

Why is the main lcore left out?

> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
> diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> index cfcd40aaed..3bf830adee 100644
> --- a/lib/eal/include/meson.build
> +++ b/lib/eal/include/meson.build
> @@ -36,6 +36,7 @@ headers += files(
>           'rte_pci_dev_features.h',
>           'rte_per_lcore.h',
>           'rte_pflock.h',
> +        'rte_pmu.h',
>           'rte_random.h',
>           'rte_reciprocal.h',
>           'rte_seqcount.h',
> diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> new file mode 100644
> index 0000000000..e4b4f6b052
> --- /dev/null
> +++ b/lib/eal/include/rte_pmu.h
> @@ -0,0 +1,204 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2022 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <rte_common.h>
> +#include <rte_compat.h>
> +
> +#ifdef RTE_EXEC_ENV_LINUX
> +
> +#include <linux/perf_event.h>
> +
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_lcore.h>
> +#include <rte_log.h>
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime.
> + */
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int *fds; /**< array of event descriptors */
> +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> +	int num_group_events; /**< number of events in a group */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu *rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ 0; })
> +#endif
> +
> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t offset, width, pmc = 0;
> +	uint32_t seq, index;
> +	int tries = 100;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return pmc + offset;
> +
> +		if (--tries == 0) {
> +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> +			break;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @internal
> + *
> + * Enable group of events for a given lcore.
> + *
> + * @param lcore_id
> + *   The identifier of the lcore.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_internal
> +int
> +rte_pmu_enable_group(int lcore_id);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(int index)
> +{
> +	int lcore_id = rte_lcore_id();
> +	struct rte_pmu_event_group *group;
> +	int ret;
> +
> +	if (!rte_pmu)
> +		return 0;
> +
> +	group = &rte_pmu->group[lcore_id];
> +	if (!group->enabled) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;
> +	}
> +
> +	if (index < 0 || index >= rte_pmu->num_group_events)
> +		return 0;
> +
> +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> +}
> +
> +#else /* !RTE_EXEC_ENV_LINUX */
> +
> +__rte_experimental
> +static int __rte_unused
> +rte_pmu_add_event(__rte_unused const char *name)
> +{
> +	return -1;
> +}
> +
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(__rte_unused int index)
> +{
> +	return 0;
> +}
> +
> +#endif /* RTE_EXEC_ENV_LINUX */
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> index 8c118d0d9f..751a13b597 100644
> --- a/lib/eal/linux/eal.c
> +++ b/lib/eal/linux/eal.c
> @@ -53,6 +53,7 @@
>   #include "eal_options.h"
>   #include "eal_vfio.h"
>   #include "hotplug_mp.h"
> +#include "pmu_private.h"
>   
>   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
>   
> @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
>   		return -1;
>   	}
>   
> +	eal_pmu_init();
> +
>   	if (rte_eal_tailqs_init() < 0) {
>   		rte_eal_init_alert("Cannot init tail queues for objects");
>   		rte_errno = EFAULT;
> @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
>   	eal_bus_cleanup();
>   	rte_trace_save();
>   	eal_trace_fini();
> +	eal_pmu_fini();
>   	/* after this point, any DPDK pointers will become dangling */
>   	rte_eal_memory_detach();
>   	eal_mp_dev_hotplug_cleanup();
> diff --git a/lib/eal/version.map b/lib/eal/version.map
> index 7ad12a7dc9..9225f46f67 100644
> --- a/lib/eal/version.map
> +++ b/lib/eal/version.map
> @@ -440,6 +440,11 @@ EXPERIMENTAL {
>   	rte_thread_detach;
>   	rte_thread_equal;
>   	rte_thread_join;
> +
> +	# added in 23.03
> +	rte_pmu; # WINDOWS_NO_EXPORT
> +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> +	rte_pmu_read; # WINDOWS_NO_EXPORT
>   };
>   
>   INTERNAL {
> @@ -483,4 +488,5 @@ INTERNAL {
>   	rte_mem_map;
>   	rte_mem_page_size;
>   	rte_mem_unmap;
> +	rte_pmu_enable_group;
>   };


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-15  8:22               ` Morten Brørup
@ 2022-12-16  7:33                 ` Morten Brørup
  0 siblings, 0 replies; 205+ messages in thread
From: Morten Brørup @ 2022-12-16  7:33 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom,
	david.marchand

> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Thursday, 15 December 2022 09.22
> 
> > From: Morten Brørup [mailto:mb@smartsharesystems.com]
> > Sent: Wednesday, 14 December 2022 11.41
> >
> > +CC: Mattias, see my comment below about per-thread constructor for
> > this
> >
> > > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > > Sent: Wednesday, 14 December 2022 10.39
> > >
> > > Hello Morten,
> > >
> > > Thanks for review. Answers inline.
> > >
> > > [...]
> > >
> > > > > +__rte_experimental
> > > > > +static __rte_always_inline uint64_t
> > > > > +rte_pmu_read(int index)
> >
> > The index type can be changed from int to uint32_t. This also
> > eliminates the "(index < 0" part of the comparison further below in
> > this function.
> >
> > > > > +{
> > > > > +	int lcore_id = rte_lcore_id();
> > > > > +	struct rte_pmu_event_group *group;
> > > > > +	int ret;
> > > > > +
> > > > > +	if (!rte_pmu)
> > > > > +		return 0;
> > > > > +
> > > > > +	group = &rte_pmu->group[lcore_id];
> > > > > +	if (!group->enabled) {
> >
> > Optimized: if (unlikely(!group->enabled)) {
> >
> > > > > +		ret = rte_pmu_enable_group(lcore_id);
> > > > > +		if (ret)
> > > > > +			return 0;
> > > > > +
> > > > > +		group->enabled = true;
> > > > > +	}
> > > >
> > > > Why is the group not enabled in the setup function,
> > > > rte_pmu_add_event(), instead of here, in the hot path?
> > > >
> > >
> > > When this is executed for the very first time then cpu will have
> > > obviously more work to do but afterwards setup path is not taken
> > > hence much less cpu cycles are required.
> > >
> > > Setup is executed by main lcore solely, before lcores are executed
> > > hence some info passed to SYS_perf_event_open ioctl() is missing,
> > > pid (via rte_gettid()) being an example here.
> >
> > OK. Thank you for the explanation. Since impossible at setup, it has
> > to be done at runtime.
> >
> > @Mattias: Another good example of something that would belong in per-
> > thread constructors, as my suggested feature creep in [1].
> >
> > [1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
> 
> I just realized that this initialization is per-lcore (not per thread),
> so you can use rte_lcore_callback_register() to register a per-lcore
> initialization function, and move rte_pmu_enable_group(lcore_id) there.

Sorry, Tomasz!

You can't use rte_lcore_callback_register()... it doesn't provide per-lcore thread constructors/destructors the way I thought. :-(
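
For illustration, the pattern discussed in this thread (per-lcore setup
deferred to the first read, with the check kept cheap through a branch
hint) can be sketched in plain C. Everything below (pmu_group,
enable_group(), pmu_read(), MAX_LCORE, setup_calls) is a hypothetical
stand-in and not the API from the patch; the unlikely() macro assumes a
GCC or Clang compatible compiler:

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_LCORE 4
/* Branch hint; relies on a GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

struct pmu_group {
	bool enabled;     /* set once the slow-path setup succeeded */
	uint64_t counter; /* stand-in for the mmapped counter page */
};

static struct pmu_group groups[MAX_LCORE];
static unsigned int setup_calls; /* counts slow-path invocations */

/* Slow path: stand-in for opening and mapping events from the calling thread. */
static int
enable_group(unsigned int lcore_id)
{
	setup_calls++;
	groups[lcore_id].counter = 100;

	return 0;
}

/* Fast path: the setup branch is taken only on the first call per lcore. */
static inline uint64_t
pmu_read(unsigned int lcore_id)
{
	struct pmu_group *group;

	if (lcore_id >= MAX_LCORE)
		return 0;

	group = &groups[lcore_id];
	if (unlikely(!group->enabled)) {
		if (enable_group(lcore_id) != 0)
			return 0;

		group->enabled = true;
	}

	return group->counter++;
}
```

After the first call on a given lcore only the flag test remains in the
read path, which is the cost argued about above.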



^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-15  8:46         ` Mattias Rönnblom
@ 2023-01-04 15:47           ` Tomasz Duszynski
  0 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-04 15:47 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: thomas, Jerin Jacob Kollanukkaran, mb, zhoumin

> -----Original Message-----
> From: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> Sent: Thursday, December 15, 2022 9:46 AM
> To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
> Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; mb@smartsharesystems.com;
> zhoumin@loongson.cn
> Subject: [EXT] Re: [PATCH v4 1/4] eal: add generic support for reading PMU events
> 
> On 2022-12-13 11:43, Tomasz Duszynski wrote:
> > Add support for programming PMU counters and reading their values in
> > runtime bypassing kernel completely.
> >
> > This is especially useful in cases where CPU cores are isolated
> > (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> > standard perf utility without sacrificing latency and performance.
> >
> > Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > ---
> >   app/test/meson.build                  |   1 +
> >   app/test/test_pmu.c                   |  41 +++
> >   doc/guides/prog_guide/profile_app.rst |   8 +
> >   lib/eal/common/meson.build            |   3 +
> >   lib/eal/common/pmu_private.h          |  41 +++
> >   lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
> >   lib/eal/include/meson.build           |   1 +
> >   lib/eal/include/rte_pmu.h             | 204 ++++++++++++
> >   lib/eal/linux/eal.c                   |   4 +
> >   lib/eal/version.map                   |   6 +
> >   10 files changed, 765 insertions(+)
> >   create mode 100644 app/test/test_pmu.c
> >   create mode 100644 lib/eal/common/pmu_private.h
> >   create mode 100644 lib/eal/common/rte_pmu.c
> >   create mode 100644 lib/eal/include/rte_pmu.h
> >
> > diff --git a/app/test/meson.build b/app/test/meson.build
> > index f34d19e3c3..93b3300309 100644
> > --- a/app/test/meson.build
> > +++ b/app/test/meson.build
> > @@ -143,6 +143,7 @@ test_sources = files(
> >           'test_timer_racecond.c',
> >           'test_timer_secondary.c',
> >           'test_ticketlock.c',
> > +        'test_pmu.c',
> >           'test_trace.c',
> >           'test_trace_register.c',
> >           'test_trace_perf.c',
> > diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> > new file mode 100644
> > index 0000000000..fd331af9ee
> > --- /dev/null
> > +++ b/app/test/test_pmu.c
> > @@ -0,0 +1,41 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(C) 2022 Marvell International Ltd.
> > + */
> > +
> > +#include <rte_pmu.h>
> > +
> > +#include "test.h"
> > +
> > +static int
> > +test_pmu_read(void)
> > +{
> > +	uint64_t val = 0;
> > +	int tries = 10;
> > +	int event = -1;
> > +
> > +	while (tries--)
> > +		val += rte_pmu_read(event);
> > +
> > +	if (val == 0)
> > +		return TEST_FAILED;
> > +
> > +	return TEST_SUCCESS;
> > +}
> > +
> > +static struct unit_test_suite pmu_tests = {
> > +	.suite_name = "pmu autotest",
> > +	.setup = NULL,
> > +	.teardown = NULL,
> > +	.unit_test_cases = {
> > +		TEST_CASE(test_pmu_read),
> > +		TEST_CASES_END()
> > +	}
> > +};
> > +
> > +static int
> > +test_pmu(void)
> > +{
> > +	return unit_test_suite_runner(&pmu_tests);
> > +}
> > +
> > +REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
> > diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> > index 14292d4c25..a8b501fe0c 100644
> > --- a/doc/guides/prog_guide/profile_app.rst
> > +++ b/doc/guides/prog_guide/profile_app.rst
> > @@ -7,6 +7,14 @@ Profile Your Application
> >   The following sections describe methods of profiling DPDK applications on
> >   different architectures.
> >
> > +Performance counter based profiling
> > +-----------------------------------
> > +
> > +Majority of architectures support some sort hardware measurement unit which provides a set of
> > +programmable counters that monitor specific events. There are different tools which can gather
> > +that information, perf being an example here. Though in some scenarios, eg. when CPU cores are
> > +isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
> > +read specific events directly from application via ``rte_pmu_read()``.
> >
> >   Profiling on x86
> >   ----------------
> > diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
> > index 917758cc65..d6d05b56f3 100644
> > --- a/lib/eal/common/meson.build
> > +++ b/lib/eal/common/meson.build
> > @@ -38,6 +38,9 @@ sources += files(
> >           'rte_service.c',
> >           'rte_version.c',
> >   )
> > +if is_linux
> > +    sources += files('rte_pmu.c')
> > +endif
> >   if is_linux or is_windows
> >       sources += files('eal_common_dynmem.c')
> >   endif
> > diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
> > new file mode 100644
> > index 0000000000..cade4245e6
> > --- /dev/null
> > +++ b/lib/eal/common/pmu_private.h
> > @@ -0,0 +1,41 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2022 Marvell
> > + */
> > +
> > +#ifndef _PMU_PRIVATE_H_
> > +#define _PMU_PRIVATE_H_
> > +
> > +/**
> > + * Architecture specific PMU init callback.
> > + *
> > + * @return
> > + *   0 in case of success, negative value otherwise.
> > + */
> > +int
> > +pmu_arch_init(void);
> > +
> > +/**
> > + * Architecture specific PMU cleanup callback.
> > + */
> > +void
> > +pmu_arch_fini(void);
> > +
> > +/**
> > + * Apply architecture specific settings to config before passing it to syscall.
> > + */
> > +void
> > +pmu_arch_fixup_config(uint64_t config[3]);
> > +
> > +/**
> > + * Initialize PMU tracing internals.
> > + */
> > +void
> > +eal_pmu_init(void);
> > +
> > +/**
> > + * Cleanup PMU internals.
> > + */
> > +void
> > +eal_pmu_fini(void);
> > +
> > +#endif /* _PMU_PRIVATE_H_ */
> > diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> > new file mode 100644
> > index 0000000000..049fe19fe3
> > --- /dev/null
> > +++ b/lib/eal/common/rte_pmu.c
> > @@ -0,0 +1,456 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(C) 2022 Marvell International Ltd.
> > + */
> > +
> > +#include <ctype.h>
> > +#include <dirent.h>
> > +#include <errno.h>
> > +#include <regex.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <sys/queue.h>
> > +#include <sys/syscall.h>
> > +#include <unistd.h>
> > +
> > +#include <rte_eal_paging.h>
> > +#include <rte_pmu.h>
> > +#include <rte_tailq.h>
> > +
> > +#include "pmu_private.h"
> > +
> > +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> > +
> > +#ifndef GENMASK_ULL
> > +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> > +#endif
> > +
> > +#ifndef FIELD_PREP
> > +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> > +#endif
> > +
> > +struct rte_pmu *rte_pmu;
> > +
> > +/*
> > + * Following __rte_weak functions provide default no-op. Architectures should override them if
> > + * necessary.
> > + */
> > +
> > +int
> > +__rte_weak pmu_arch_init(void)
> > +{
> > +	return 0;
> > +}
> > +
> > +void
> > +__rte_weak pmu_arch_fini(void)
> > +{
> > +}
> > +
> > +void
> > +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> > +{
> > +	RTE_SET_USED(config);
> > +}
> > +
> > +static int
> > +get_term_format(const char *name, int *num, uint64_t *mask)
> > +{
> > +	char *config = NULL;
> > +	char path[PATH_MAX];
> > +	int high, low, ret;
> > +	FILE *fp;
> > +
> > +	/* quiesce -Wmaybe-uninitialized warning */
> > +	*num = 0;
> > +	*mask = 0;
> > +
> > +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
> 
> This code might crash in case a long name is supplied, which is maybe not what you want.
> Truncation and a "file not found" is probably better.
> I believe there is a snprintf lookalike with these properties in DPDK.
> 

In the scenario where 'path' cannot accommodate the whole string, which is pretty unlikely
because sysfs files have sane names, snprintf will truncate the output and still append the
terminating NUL. Hence fopen will fail. Not sure how this may go wrong.
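
For illustration only, a minimal sketch of an explicit truncation check. The helper
build_sysfs_path() and its parameters are hypothetical and not part of the patch. It relies only
on the standard snprintf() contract: the return value is the length the complete string would
have had, so a value greater than or equal to the buffer size means the output was cut short.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"

/*
 * Hypothetical helper, not part of the patch: build the path to a format
 * file and report truncation explicitly instead of relying on a later
 * fopen() failure.
 */
static int
build_sysfs_path(char *path, size_t size, const char *pmu, const char *name)
{
	int len;

	len = snprintf(path, size, EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", pmu, name);
	if (len < 0)
		return -EINVAL; /* output error reported by snprintf */
	if ((size_t)len >= size)
		return -ENAMETOOLONG; /* string did not fit, path is truncated */

	return 0;
}
```

With such a check the caller can tell "name too long" apart from "file not found", and a
truncated path is never passed to fopen() at all.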

> > +	fp = fopen(path, "r");
> > +	if (!fp)
> > +		return -errno;
> > +
> > +	errno = 0;
> > +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> > +	if (ret < 2) {
> > +		ret = -ENODATA;
> > +		goto out;
> > +	}
> > +	if (errno) {
> > +		ret = -errno;
> > +		goto out;
> > +	}
> > +
> > +	if (ret == 2)
> > +		high = low;
> > +
> > +	*mask = GENMASK_ULL(high, low);
> > +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> > +	*num = config[strlen(config) - 1];
> > +	*num = isdigit(*num) ? *num - '0' : 0;
> > +
> > +	ret = 0;
> > +out:
> > +	free(config);
> > +	fclose(fp);
> > +
> > +	return ret;
> > +}
> > +
> > +static int
> > +parse_event(char *buf, uint64_t config[3]) {
> > +	char *token, *term;
> > +	int num, ret, val;
> > +	uint64_t mask;
> > +
> > +	config[0] = config[1] = config[2] = 0;
> > +
> > +	token = strtok(buf, ",");
> > +	while (token) {
> > +		errno = 0;
> > +		/* <term>=<value> */
> > +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> > +		if (ret < 1)
> > +			return -ENODATA;
> > +		if (errno)
> > +			return -errno;
> > +		if (ret == 1)
> > +			val = 1;
> > +
> > +		ret = get_term_format(term, &num, &mask);
> > +		free(term);
> > +		if (ret)
> > +			return ret;
> > +
> > +		config[num] |= FIELD_PREP(mask, val);
> > +		token = strtok(NULL, ",");
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int
> > +get_event_config(const char *name, uint64_t config[3]) {
> > +	char path[PATH_MAX], buf[BUFSIZ];
> > +	FILE *fp;
> > +	int ret;
> > +
> > +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> > +	fp = fopen(path, "r");
> > +	if (!fp)
> > +		return -errno;
> > +
> > +	ret = fread(buf, 1, sizeof(buf), fp);
> > +	if (ret == 0) {
> > +		fclose(fp);
> > +
> > +		return -EINVAL;
> > +	}
> > +	fclose(fp);
> > +	buf[ret] = '\0';
> > +
> > +	return parse_event(buf, config);
> > +}
> > +
> > +static int
> > +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd) {
> > +	struct perf_event_attr attr = {
> > +		.size = sizeof(struct perf_event_attr),
> > +		.type = PERF_TYPE_RAW,
> > +		.exclude_kernel = 1,
> > +		.exclude_hv = 1,
> > +		.disabled = 1,
> > +	};
> > +
> > +	pmu_arch_fixup_config(config);
> > +
> > +	attr.config = config[0];
> > +	attr.config1 = config[1];
> > +	attr.config2 = config[2];
> > +
> > +	return syscall(SYS_perf_event_open, &attr, rte_gettid(), rte_lcore_to_cpu_id(lcore_id),
> > +		       group_fd, 0);
> > +}
> > +
> > +static int
> > +open_events(int lcore_id)
> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	struct rte_pmu_event *event;
> > +	uint64_t config[3];
> > +	int num = 0, ret;
> > +
> > +	/* group leader gets created first, with fd = -1 */
> > +	group->fds[0] = -1;
> > +
> > +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> > +		ret = get_event_config(event->name, config);
> > +		if (ret) {
> > +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> > +			continue;
> > +		}
> > +
> > +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> > +		if (ret == -1) {
> > +			if (errno == EOPNOTSUPP)
> > +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> > +
> > +			ret = -errno;
> > +			goto out;
> > +		}
> > +
> > +		group->fds[event->index] = ret;
> > +		num++;
> > +	}
> > +
> > +	return 0;
> > +out:
> > +	for (--num; num >= 0; num--) {
> > +		close(group->fds[num]);
> > +		group->fds[num] = -1;
> > +	}
> > +
> > +
> > +	return ret;
> > +}
> > +
> > +static int
> > +mmap_events(int lcore_id)
> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	void *addr;
> > +	int ret, i;
> > +
> > +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> > +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> > +		if (addr == MAP_FAILED) {
> > +			ret = -errno;
> > +			goto out;
> > +		}
> > +
> > +		group->mmap_pages[i] = addr;
> > +	}
> > +
> > +	return 0;
> > +out:
> > +	for (; i; i--) {
> > +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> > +		group->mmap_pages[i - 1] = NULL;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static void
> > +cleanup_events(int lcore_id)
> 
> lcore_id is an unsigned int.
> 

True, unsigned seems to be more common. 

> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	int i;
> > +
> > +	if (!group->fds)
> > +		return;
> 
> group->fds == NULL
> 
> This coding style violation appears throughout the patch set.
> 

Good point. 

> > +
> > +	if (group->fds[0] != -1)
> > +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> > +
> > +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> > +		if (group->mmap_pages[i]) {
> > +			munmap(group->mmap_pages[i], rte_mem_page_size());
> > +			group->mmap_pages[i] = NULL;
> > +		}
> > +
> > +		if (group->fds[i] != -1) {
> > +			close(group->fds[i]);
> > +			group->fds[i] = -1;
> > +		}
> > +	}
> > +
> > +	free(group->mmap_pages);
> > +	free(group->fds);
> > +
> > +	group->mmap_pages = NULL;
> > +	group->fds = NULL;
> > +	group->enabled = false;
> > +}
> > +
> > +int __rte_noinline
> > +rte_pmu_enable_group(int lcore_id)
> 
> unsigned
> 
> > +{
> > +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> > +	int ret;
> > +
> > +	if (rte_pmu->num_group_events == 0) {
> > +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> > +
> > +		return 0;
> > +	}
> > +
> > +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
> > +	if (!group->fds) {
> > +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> > +
> > +		return -ENOMEM;
> > +	}
> > +
> > +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
> > +	if (!group->mmap_pages) {
> > +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> > +
> > +		ret = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	ret = open_events(lcore_id);
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> > +		goto out;
> > +	}
> > +
> > +	ret = mmap_events(lcore_id);
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> > +		goto out;
> > +	}
> > +
> > +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> > +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> > +
> > +		ret = -errno;
> > +		goto out;
> > +	}
> > +
> > +	return 0;
> > +
> > +out:
> > +	cleanup_events(lcore_id);
> > +
> > +	return ret;
> > +}
> > +
> > +static int
> > +scan_pmus(void)
> > +{
> > +	char path[PATH_MAX];
> > +	struct dirent *dent;
> > +	const char *name;
> > +	DIR *dirp;
> > +
> > +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> > +	if (!dirp)
> > +		return -errno;
> > +
> > +	while ((dent = readdir(dirp))) {
> > +		name = dent->d_name;
> > +		if (name[0] == '.')
> > +			continue;
> > +
> > +		/* sysfs entry should either contain cpus or be a cpu */
> > +		if (!strcmp(name, "cpu"))
> > +			break;
> > +
> > +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> > +		if (access(path, F_OK) == 0)
> > +			break;
> > +	}
> > +
> > +	closedir(dirp);
> > +
> > +	if (dent) {
> > +		rte_pmu->name = strdup(name);
> > +		if (!rte_pmu->name)
> > +			return -ENOMEM;
> > +	}
> > +
> > +	return rte_pmu->name ? 0 : -ENODEV;
> > +}
> > +
> > +int
> > +rte_pmu_add_event(const char *name)
> > +{
> > +	struct rte_pmu_event *event;
> > +	char path[PATH_MAX];
> > +
> > +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> > +	if (access(path, R_OK))
> > +		return -ENODEV;
> > +
> > +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> > +		if (!strcmp(event->name, name))
> > +			return event->index;
> > +		continue;
> > +	}
> > +
> > +	event = calloc(1, sizeof(*event));
> > +	if (!event)
> > +		return -ENOMEM;
> > +
> > +	event->name = strdup(name);
> > +	if (!event->name) {
> > +		free(event);
> > +
> > +		return -ENOMEM;
> > +	}
> > +
> > +	event->index = rte_pmu->num_group_events++;
> > +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> > +
> > +	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
> > +
> > +	return event->index;
> > +}
> > +
> > +void
> > +eal_pmu_init(void)
> > +{
> > +	int ret;
> > +
> > +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> > +	if (!rte_pmu) {
> > +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> > +
> > +		return;
> > +	}
> > +
> > +	TAILQ_INIT(&rte_pmu->event_list);
> > +
> > +	ret = scan_pmus();
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> > +		goto out;
> > +	}
> > +
> > +	ret = pmu_arch_init();
> > +	if (ret) {
> > +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> > +		goto out;
> > +	}
> > +
> > +	return;
> > +out:
> > +	free(rte_pmu->name);
> > +	free(rte_pmu);
> > +}
> > +
> > +void
> > +eal_pmu_fini(void)
> > +{
> > +	struct rte_pmu_event *event, *tmp;
> > +	int lcore_id;
> > +
> > +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
> > +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> > +		free(event->name);
> > +		free(event);
> > +	}
> > +
> > +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> > +		cleanup_events(lcore_id);
> 
> Why is the main lcore left out?
> 

The main lcore was omitted because it rarely does any heavy lifting, so the usefulness of
reading counters there is questionable. It can be added for completeness though.

Do you have any specific use case in mind?

> > +
> > +	pmu_arch_fini();
> > +	free(rte_pmu->name);
> > +	free(rte_pmu);
> > +}
> > diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
> > index cfcd40aaed..3bf830adee 100644
> > --- a/lib/eal/include/meson.build
> > +++ b/lib/eal/include/meson.build
> > @@ -36,6 +36,7 @@ headers += files(
> >           'rte_pci_dev_features.h',
> >           'rte_per_lcore.h',
> >           'rte_pflock.h',
> > +        'rte_pmu.h',
> >           'rte_random.h',
> >           'rte_reciprocal.h',
> >           'rte_seqcount.h',
> > diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
> > new file mode 100644
> > index 0000000000..e4b4f6b052
> > --- /dev/null
> > +++ b/lib/eal/include/rte_pmu.h
> > @@ -0,0 +1,204 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2022 Marvell
> > + */
> > +
> > +#ifndef _RTE_PMU_H_
> > +#define _RTE_PMU_H_
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#include <rte_common.h>
> > +#include <rte_compat.h>
> > +
> > +#ifdef RTE_EXEC_ENV_LINUX
> > +
> > +#include <linux/perf_event.h>
> > +
> > +#include <rte_atomic.h>
> > +#include <rte_branch_prediction.h>
> > +#include <rte_lcore.h>
> > +#include <rte_log.h>
> > +
> > +/**
> > + * @file
> > + *
> > + * PMU event tracing operations
> > + *
> > + * This file defines generic API and types necessary to setup PMU and
> > + * read selected counters in runtime.
> > + */
> > +
> > +/**
> > + * A structure describing a group of events.
> > + */
> > +struct rte_pmu_event_group {
> > +	int *fds; /**< array of event descriptors */
> > +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> > +	bool enabled; /**< true if group was enabled on particular lcore */
> > +};
> > +
> > +/**
> > + * A structure describing an event.
> > + */
> > +struct rte_pmu_event {
> > +	char *name; /** name of an event */
> > +	int index; /** event index into fds/mmap_pages */
> > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> > +};
> > +
> > +/**
> > + * A PMU state container.
> > + */
> > +struct rte_pmu {
> > +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> > +	int num_group_events; /**< number of events in a group */
> > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> > +};
> > +
> > +/** Pointer to the PMU state container */
> > +extern struct rte_pmu *rte_pmu;
> > +
> > +/** Each architecture supporting PMU needs to provide its own version */
> > +#ifndef rte_pmu_pmc_read
> > +#define rte_pmu_pmc_read(index) ({ 0; })
> > +#endif
> > +
> > +/**
> > + * @internal
> > + *
> > + * Read PMU counter.
> > + *
> > + * @param pc
> > + *   Pointer to the mmapped user page.
> > + * @return
> > + *   Counter value read from hardware.
> > + */
> > +__rte_internal
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> > +	uint64_t offset, width, pmc = 0;
> > +	uint32_t seq, index;
> > +	int tries = 100;
> > +
> > +	for (;;) {
> > +		seq = pc->lock;
> > +		rte_compiler_barrier();
> > +		index = pc->index;
> > +		offset = pc->offset;
> > +		width = pc->pmc_width;
> > +
> > +		if (likely(pc->cap_user_rdpmc && index)) {
> > +			pmc = rte_pmu_pmc_read(index - 1);
> > +			pmc <<= 64 - width;
> > +			pmc >>= 64 - width;
> > +		}
> > +
> > +		rte_compiler_barrier();
> > +
> > +		if (likely(pc->lock == seq))
> > +			return pmc + offset;
> > +
> > +		if (--tries == 0) {
> > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> > +			break;
> > +		}
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/**
> > + * @internal
> > + *
> > + * Enable group of events for a given lcore.
> > + *
> > + * @param lcore_id
> > + *   The identifier of the lcore.
> > + * @return
> > + *   0 in case of success, negative value otherwise.
> > + */
> > +__rte_internal
> > +int
> > +rte_pmu_enable_group(int lcore_id);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Add event to the group of enabled events.
> > + *
> > + * @param name
> > + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> > + * @return
> > + *   Event index in case of success, negative value otherwise.
> > + */
> > +__rte_experimental
> > +int
> > +rte_pmu_add_event(const char *name);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Read hardware counter configured to count occurrences of an event.
> > + *
> > + * @param index
> > + *   Index of an event to be read.
> > + * @return
> > + *   Event value read from register. In case of errors or lack of support
> > + *   0 is returned. In other words, stream of zeros in a trace file
> > + *   indicates problem with reading particular PMU event register.
> > + */
> > +__rte_experimental
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read(int index)
> > +{
> > +	int lcore_id = rte_lcore_id();
> > +	struct rte_pmu_event_group *group;
> > +	int ret;
> > +
> > +	if (!rte_pmu)
> > +		return 0;
> > +
> > +	group = &rte_pmu->group[lcore_id];
> > +	if (!group->enabled) {
> > +		ret = rte_pmu_enable_group(lcore_id);
> > +		if (ret)
> > +			return 0;
> > +
> > +		group->enabled = true;
> > +	}
> > +
> > +	if (index < 0 || index >= rte_pmu->num_group_events)
> > +		return 0;
> > +
> > +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> > +}
> > +
> > +#else /* !RTE_EXEC_ENV_LINUX */
> > +
> > +__rte_experimental
> > +static int __rte_unused
> > +rte_pmu_add_event(__rte_unused const char *name) {
> > +	return -1;
> > +}
> > +
> > +__rte_experimental
> > +static __rte_always_inline uint64_t
> > +rte_pmu_read(__rte_unused int index)
> > +{
> > +	return 0;
> > +}
> > +
> > +#endif /* RTE_EXEC_ENV_LINUX */
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif /* _RTE_PMU_H_ */
> > diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
> > index 8c118d0d9f..751a13b597 100644
> > --- a/lib/eal/linux/eal.c
> > +++ b/lib/eal/linux/eal.c
> > @@ -53,6 +53,7 @@
> >   #include "eal_options.h"
> >   #include "eal_vfio.h"
> >   #include "hotplug_mp.h"
> > +#include "pmu_private.h"
> >
> >   #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
> >
> > @@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
> >   		return -1;
> >   	}
> >
> > +	eal_pmu_init();
> > +
> >   	if (rte_eal_tailqs_init() < 0) {
> >   		rte_eal_init_alert("Cannot init tail queues for objects");
> >   		rte_errno = EFAULT;
> > @@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
> >   	eal_bus_cleanup();
> >   	rte_trace_save();
> >   	eal_trace_fini();
> > +	eal_pmu_fini();
> >   	/* after this point, any DPDK pointers will become dangling */
> >   	rte_eal_memory_detach();
> >   	eal_mp_dev_hotplug_cleanup();
> > diff --git a/lib/eal/version.map b/lib/eal/version.map
> > index 7ad12a7dc9..9225f46f67 100644
> > --- a/lib/eal/version.map
> > +++ b/lib/eal/version.map
> > @@ -440,6 +440,11 @@ EXPERIMENTAL {
> >   	rte_thread_detach;
> >   	rte_thread_equal;
> >   	rte_thread_join;
> > +
> > +	# added in 23.03
> > +	rte_pmu; # WINDOWS_NO_EXPORT
> > +	rte_pmu_add_event; # WINDOWS_NO_EXPORT
> > +	rte_pmu_read; # WINDOWS_NO_EXPORT
> >   };
> >
> >   INTERNAL {
> > @@ -483,4 +488,5 @@ INTERNAL {
> >   	rte_mem_map;
> >   	rte_mem_page_size;
> >   	rte_mem_unmap;
> > +	rte_pmu_enable_group;
> >   };



* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-14 10:41             ` Morten Brørup
  2022-12-15  8:22               ` Morten Brørup
@ 2023-01-05 21:14               ` Tomasz Duszynski
  2023-01-05 22:07                 ` Morten Brørup
  1 sibling, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-05 21:14 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

Hi Morten, 

A few comments inline. 

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, December 14, 2022 11:41 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; zhoumin@loongson.cn;
>mattias.ronnblom@ericsson.com
>Subject: [EXT] RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
>
>+CC: Mattias, see my comment below about per-thread constructor for this
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Wednesday, 14 December 2022 10.39
>>
>> Hello Morten,
>>
>> Thanks for review. Answers inline.
>>
>> [...]
>>
>> > > +/**
>> > > + * @file
>> > > + *
>> > > + * PMU event tracing operations
>> > > + *
>> > > + * This file defines generic API and types necessary to setup PMU
>> and
>> > > + * read selected counters in runtime.
>> > > + */
>> > > +
>> > > +/**
>> > > + * A structure describing a group of events.
>> > > + */
>> > > +struct rte_pmu_event_group {
>> > > +	int *fds; /**< array of event descriptors */
>> > > +	void **mmap_pages; /**< array of pointers to mmapped
>> > > perf_event_attr structures */
>> >
>> > There seems to be a lot of indirection involved here. Why are these
>> arrays not statically sized,
>> > instead of dynamically allocated?
>> >
>>
>> Different architectures/pmus impose limits on number of simultaneously
>> enabled counters. So in order relief the pain of thinking about it and
>> adding macros for each and every arch I decided to allocate the number
>> user wants dynamically. Also assumption holds that user knows about
>> tradeoffs of using too many counters hence will not enable too many
>> events at once.
>
>The DPDK convention is to use fixed size arrays (with a maximum size, e.g. RTE_MAX_ETHPORTS) in the
>fast path, for performance reasons.
>
>Please use fixed size arrays instead of dynamically allocated arrays.
>

I do agree that from a maintenance angle fixed arrays are much more convenient,
but once optimization kicks in the performance argument does not necessarily
hold true anymore.

For example, in this case performance dropped by ~0.3% with fixed arrays, which is
insignificant imo. So given the simpler code, the next patchset will use fixed arrays.
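
As a rough sketch of what the fixed-array layout could look like (illustration only, not the
code of the next patchset): the sketch_ names and both limits below are assumptions made up for
this example. The point is that the address of a page slot is computed from the base of one
static object and two indexes, without loading intermediate pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed limits, for illustration only; real values are a design choice. */
#define SKETCH_MAX_LCORE 128
#define SKETCH_MAX_GROUP_EVENTS 8

struct sketch_event_group {
	int fds[SKETCH_MAX_GROUP_EVENTS];          /* event descriptors */
	void *mmap_pages[SKETCH_MAX_GROUP_EVENTS]; /* mmapped user pages */
};

struct sketch_pmu {
	struct sketch_event_group group[SKETCH_MAX_LCORE]; /* per lcore data */
	unsigned int num_group_events;                     /* events in a group */
};

/* Statically allocated state, no calloc() and no pointer to the container. */
static struct sketch_pmu sketch_pmu;

/* Return the page slot for (lcore_id, index) or NULL if out of range. */
static void *
sketch_get_page(unsigned int lcore_id, unsigned int index)
{
	if (lcore_id >= SKETCH_MAX_LCORE ||
	    index >= SKETCH_MAX_GROUP_EVENTS ||
	    index >= sketch_pmu.num_group_events)
		return NULL;

	return sketch_pmu.group[lcore_id].mmap_pages[index];
}
```

The price is that the maximum number of events per group has to be fixed at build time.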

>>
>> > Also, what is the reason for hiding the type struct
>> perf_event_mmap_page **mmap_pages opaque by
>> > using void **mmap_pages instead?
>>
>> I think, that part doing mmap/munmap was written first hence void **
>> was chosen in the first place.
>
>Please update it, so the actual type is reflected here.
>
>>
>> >
>> > > +	bool enabled; /**< true if group was enabled on particular lcore
>> > > */
>> > > +};
>> > > +
>> > > +/**
>> > > + * A structure describing an event.
>> > > + */
>> > > +struct rte_pmu_event {
>> > > +	char *name; /** name of an event */
>> > > +	int index; /** event index into fds/mmap_pages */
>> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
>> > > +
>> > > +/**
>> > > + * A PMU state container.
>> > > + */
>> > > +struct rte_pmu {
>> > > +	char *name; /** name of core PMU listed under
>> > > /sys/bus/event_source/devices */
>> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
>> > > event group data */
>> > > +	int num_group_events; /**< number of events in a group */
>> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
>> > > events */
>
>The event_list is used in slow path only, so it can remain a list - i.e. no change requested here.
>:-)
>
>> > > +};
>> > > +
>> > > +/** Pointer to the PMU state container */ extern struct rte_pmu
>> > > +*rte_pmu;
>> >
>> > Again, why not just extern struct rte_pmu, instead of dynamic
>> allocation?
>> >
>>
>> No strong opinions here since this is a matter of personal preference.
>> Can be removed
>> in the next version.
>
>Yes, please.
>
>>
>> > > +
>> > > +/** Each architecture supporting PMU needs to provide its own
>> version
>> > > */
>> > > +#ifndef rte_pmu_pmc_read
>> > > +#define rte_pmu_pmc_read(index) ({ 0; }) #endif
>> > > +
>> > > +/**
>> > > + * @internal
>> > > + *
>> > > + * Read PMU counter.
>> > > + *
>> > > + * @param pc
>> > > + *   Pointer to the mmapped user page.
>> > > + * @return
>> > > + *   Counter value read from hardware.
>> > > + */
>> > > +__rte_internal
>> > > +static __rte_always_inline uint64_t rte_pmu_read_userpage(struct
>> > > +perf_event_mmap_page *pc) {
>> > > +	uint64_t offset, width, pmc = 0;
>> > > +	uint32_t seq, index;
>> > > +	int tries = 100;
>> > > +
>> > > +	for (;;) {
>
>As a matter of personal preference, I would write this loop differently:
>
>+ for (tries = 100; tries != 0; tries--) {
>
>> > > +		seq = pc->lock;
>> > > +		rte_compiler_barrier();
>> > > +		index = pc->index;
>> > > +		offset = pc->offset;
>> > > +		width = pc->pmc_width;
>> > > +
>> > > +		if (likely(pc->cap_user_rdpmc && index)) {
>
>Why "&& index"? The way I read [man perf_event_open], index 0 is perfectly valid.
>

Valid indexes start at 1. An index of 0 means that the hw counter is either stopped or not
active. This may not be obvious from the man page at first, but it contains an example later on
showing how to get the actual counter number.
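
To illustrate the index semantics, here is a self-contained sketch of the read loop. It does not
use the real struct perf_event_mmap_page; struct sketch_userpage is a stand-in carrying only the
fields the loop touches, and the hw read is passed in as a hypothetical callback so the example
runs anywhere. The compiler barrier uses GCC/Clang inline asm syntax, and the retry loop is
written in the for (tries = 100; ...) form suggested in the review.

```c
#include <stdint.h>

/* Stand-in for the few struct perf_event_mmap_page fields used here. */
struct sketch_userpage {
	volatile uint32_t lock; /* sequence counter, changes while the page is updated */
	uint32_t index;         /* 0: no active hw counter, else hw counter number + 1 */
	int64_t offset;         /* value to add to the hw counter */
	uint16_t pmc_width;     /* width of the hw counter in bits */
	uint8_t cap_user_rdpmc; /* non-zero if user space may read the counter */
};

/* Hypothetical hw read; a real one would use rdpmc or a system register. */
typedef uint64_t (*sketch_pmc_read_t)(uint32_t counter);

/* Fake reader used only to exercise the sketch. */
static uint64_t
sketch_fake_pmc_read(uint32_t counter)
{
	(void)counter;
	return 0x1234;
}

static uint64_t
sketch_read_userpage(const struct sketch_userpage *pc, sketch_pmc_read_t pmc_read)
{
	uint64_t offset, width, pmc;
	uint32_t seq, index;
	int tries;

	for (tries = 100; tries != 0; tries--) {
		seq = pc->lock;
		__asm__ volatile("" ::: "memory"); /* compiler barrier */
		index = pc->index;
		offset = (uint64_t)pc->offset;
		width = pc->pmc_width;
		pmc = 0;

		/* index 0 means there is no hw counter to read, only the offset counts */
		if (pc->cap_user_rdpmc && index != 0 && width != 0 && width <= 64) {
			pmc = pmc_read(index - 1); /* hw counter number is index - 1 */
			pmc <<= 64 - width;        /* drop bits above the counter width */
			pmc >>= 64 - width;
		}

		__asm__ volatile("" ::: "memory");
		if (pc->lock == seq)
			return pmc + offset; /* page did not change while reading */
	}

	return 0; /* page kept changing, give up */
}
```

The extra width checks are an addition of this sketch; they only keep the shift amount within
0..63.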

>[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
>
>> > > +			pmc = rte_pmu_pmc_read(index - 1);
>> > > +			pmc <<= 64 - width;
>> > > +			pmc >>= 64 - width;
>> > > +		}
>> > > +
>> > > +		rte_compiler_barrier();
>> > > +
>> > > +		if (likely(pc->lock == seq))
>> > > +			return pmc + offset;
>> > > +
>> > > +		if (--tries == 0) {
>> > > +			RTE_LOG(DEBUG, EAL, "failed to get
>> > > perf_event_mmap_page lock\n");
>> > > +			break;
>> > > +		}
>
>- Remove the 4 above lines of code, and move the debug log message to the end of the function
>instead.
>
>> > > +	}
>
>+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
>
>> > > +
>> > > +	return 0;
>> > > +}
>> > > +
>> > > +/**
>> > > + * @internal
>> > > + *
>> > > + * Enable group of events for a given lcore.
>> > > + *
>> > > + * @param lcore_id
>> > > + *   The identifier of the lcore.
>> > > + * @return
>> > > + *   0 in case of success, negative value otherwise.
>> > > + */
>> > > +__rte_internal
>> > > +int
>> > > +rte_pmu_enable_group(int lcore_id);
>> > > +
>> > > +/**
>> > > + * @warning
>> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> > > + *
>> > > + * Add event to the group of enabled events.
>> > > + *
>> > > + * @param name
>> > > + *   Name of an event listed under
>> > > /sys/bus/event_source/devices/pmu/events.
>> > > + * @return
>> > > + *   Event index in case of success, negative value otherwise.
>> > > + */
>> > > +__rte_experimental
>> > > +int
>> > > +rte_pmu_add_event(const char *name);
>> > > +
>> > > +/**
>> > > + * @warning
>> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> > > + *
>> > > + * Read hardware counter configured to count occurrences of an
>> event.
>> > > + *
>> > > + * @param index
>> > > + *   Index of an event to be read.
>> > > + * @return
>> > > + *   Event value read from register. In case of errors or lack of
>> > > support
>> > > + *   0 is returned. In other words, stream of zeros in a trace
>> file
>> > > + *   indicates problem with reading particular PMU event register.
>> > > + */
>> > > +__rte_experimental
>> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
>
>The index type can be changed from int to uint32_t. This also eliminates the "(index < 0" part of
>the comparison further below in this function.
>

That's true. 
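
A small sketch of why the unsigned type removes one comparison (helper names are hypothetical
and used only for illustration): a negative value converted to uint32_t wraps to a very large
number, so the single upper-bound test rejects it as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Signed index: both ends of the range have to be tested. */
static bool
sketch_index_valid_signed(int index, int num_events)
{
	return index >= 0 && index < num_events;
}

/*
 * Unsigned index: a negative value passed by mistake wraps to a huge
 * number and fails the single upper-bound test.
 */
static bool
sketch_index_valid_unsigned(uint32_t index, uint32_t num_events)
{
	return index < num_events;
}
```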

>> > > +{
>> > > +	int lcore_id = rte_lcore_id();
>> > > +	struct rte_pmu_event_group *group;
>> > > +	int ret;
>> > > +
>> > > +	if (!rte_pmu)
>> > > +		return 0;
>> > > +
>> > > +	group = &rte_pmu->group[lcore_id];
>> > > +	if (!group->enabled) {
>
>Optimized: if (unlikely(!group->enabled)) {
>

The compiler will optimize the branch correctly by itself. An extra hint is not required.

>> > > +		ret = rte_pmu_enable_group(lcore_id);
>> > > +		if (ret)
>> > > +			return 0;
>> > > +
>> > > +		group->enabled = true;
>> > > +	}
>> >
>> > Why is the group not enabled in the setup function,
>> rte_pmu_add_event(), instead of here, in the
>> > hot path?
>> >
>>
>> When this is executed for the very first time then cpu will have
>> obviously more work to do but afterwards setup path is not taken hence
>> much less cpu cycles are required.
>>
>> Setup is executed by main lcore solely, before lcores are executed
>> hence some info passed to SYS_perf_event_open ioctl() is missing, pid
>> (via rte_gettid()) being an example here.
>
>OK. Thank you for the explanation. Since impossible at setup, it has to be done at runtime.
>
>@Mattias: Another good example of something that would belong in per-thread constructors, as my
>suggested feature creep in [1].
>
>[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
>
>>
>> > > +
>> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
>
>Optimized: if (unlikely(index >= rte_pmu.num_group_events))
>
>> > > +		return 0;
>> > > +
>> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
>> > > *)group->mmap_pages[index]);
>> >
>> > Using fixed size arrays instead of multiple indirections via
>> > pointers
>> is faster. It could be:
>> >
>> > return rte_pmu_read_userpage((struct perf_event_mmap_page
>> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
>> >
>> > With or without suggested performance improvements...
>> >
>> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
>>



* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-05 21:14               ` Tomasz Duszynski
@ 2023-01-05 22:07                 ` Morten Brørup
  2023-01-08 15:41                   ` Tomasz Duszynski
  0 siblings, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2023-01-05 22:07 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 5 January 2023 22.14
> 
> Hi Morten,
> 
> A few comments inline.
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Wednesday, December 14, 2022 11:41 AM
> >
> >+CC: Mattias, see my comment below about per-thread constructor for
> this
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Wednesday, 14 December 2022 10.39
> >>
> >> Hello Morten,
> >>
> >> Thanks for review. Answers inline.
> >>
> >> [...]
> >>
> >> > > +/**
> >> > > + * @file
> >> > > + *
> >> > > + * PMU event tracing operations
> >> > > + *
> >> > > + * This file defines generic API and types necessary to setup
> PMU
> >> and
> >> > > + * read selected counters in runtime.
> >> > > + */
> >> > > +
> >> > > +/**
> >> > > + * A structure describing a group of events.
> >> > > + */
> >> > > +struct rte_pmu_event_group {
> >> > > +	int *fds; /**< array of event descriptors */
> >> > > +	void **mmap_pages; /**< array of pointers to mmapped
> >> > > perf_event_attr structures */
> >> >
> >> > There seems to be a lot of indirection involved here. Why are
> these
> >> arrays not statically sized,
> >> > instead of dynamically allocated?
> >> >
> >>
> >> Different architectures/pmus impose limits on number of
> simultaneously
> >> enabled counters. So in order relief the pain of thinking about it
> and
> >> adding macros for each and every arch I decided to allocate the
> number
> >> user wants dynamically. Also assumption holds that user knows about
> >> tradeoffs of using too many counters hence will not enable too many
> >> events at once.
> >
> >The DPDK convention is to use fixed size arrays (with a maximum size,
> e.g. RTE_MAX_ETHPORTS) in the
> >fast path, for performance reasons.
> >
> >Please use fixed size arrays instead of dynamically allocated arrays.
> >
> 
> I do agree that from maintenance angle fixed arrays are much more
> convenient
> but when optimization kicks in then that statement does not necessarily
> hold true anymore.
> 
> For example, in this case performance dropped by ~0.3% which is
> insignificant imo. So
> given simpler code, next patchset will use fixed arrays.

I fail to understand how pointer chasing can perform better than obtaining an address by multiplying by a constant. Modern CPUs work in mysterious ways, and you obviously tested this, so I believe your test results. But in theory, pointer chasing touches more cache lines, and should perform worse in a loaded system where pointers in the chain have been evicted from the cache.

Anyway, you agreed to use fixed arrays, so I am happy. :-)

> 
> >>
> >> > Also, what is the reason for hiding the type struct
> >> perf_event_mmap_page **mmap_pages opaque by
> >> > using void **mmap_pages instead?
> >>
> >> I think, that part doing mmap/munmap was written first hence void **
> >> was chosen in the first place.
> >
> >Please update it, so the actual type is reflected here.
> >
> >>
> >> >
> >> > > +	bool enabled; /**< true if group was enabled on particular
> lcore
> >> > > */
> >> > > +};
> >> > > +
> >> > > +/**
> >> > > + * A structure describing an event.
> >> > > + */
> >> > > +struct rte_pmu_event {
> >> > > +	char *name; /** name of an event */
> >> > > +	int index; /** event index into fds/mmap_pages */
> >> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> >> > > +
> >> > > +/**
> >> > > + * A PMU state container.
> >> > > + */
> >> > > +struct rte_pmu {
> >> > > +	char *name; /** name of core PMU listed under
> >> > > /sys/bus/event_source/devices */
> >> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per
> lcore
> >> > > event group data */
> >> > > +	int num_group_events; /**< number of events in a group */
> >> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of
> matching
> >> > > events */
> >
> >The event_list is used in slow path only, so it can remain a list -
> i.e. no change requested here.
> >:-)
> >
> >> > > +};
> >> > > +
> >> > > +/** Pointer to the PMU state container */
> >> > > +extern struct rte_pmu *rte_pmu;
> >> >
> >> > Again, why not just extern struct rte_pmu, instead of dynamic
> >> allocation?
> >> >
> >>
> >> No strong opinions here since this is a matter of personal
> preference.
> >> Can be removed
> >> in the next version.
> >
> >Yes, please.
> >
> >>
> >> > > +
> >> > > +/** Each architecture supporting PMU needs to provide its own
> >> version
> >> > > */
> >> > > +#ifndef rte_pmu_pmc_read
> >> > > +#define rte_pmu_pmc_read(index) ({ 0; })
> >> > > +#endif
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Read PMU counter.
> >> > > + *
> >> > > + * @param pc
> >> > > + *   Pointer to the mmapped user page.
> >> > > + * @return
> >> > > + *   Counter value read from hardware.
> >> > > + */
> >> > > +__rte_internal
> >> > > +static __rte_always_inline uint64_t
> >> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> > > +{
> >> > > +	uint64_t offset, width, pmc = 0;
> >> > > +	uint32_t seq, index;
> >> > > +	int tries = 100;
> >> > > +
> >> > > +	for (;;) {
> >
> >As a matter of personal preference, I would write this loop
> differently:
> >
> >+ for (tries = 100; tries != 0; tries--) {
> >
> >> > > +		seq = pc->lock;
> >> > > +		rte_compiler_barrier();
> >> > > +		index = pc->index;
> >> > > +		offset = pc->offset;
> >> > > +		width = pc->pmc_width;
> >> > > +
> >> > > +		if (likely(pc->cap_user_rdpmc && index)) {
> >
> >Why "&& index"? The way I read [man perf_event_open], index 0 is
> perfectly valid.
> >
> 
> Valid index starts at 1. 0 means that either hw counter is stopped or
> isn't active. Maybe this is not
> initially clear from man but there's example later on how to get actual
> number.

OK. Thanks for the reference.

Please add a comment about the special meaning of index 0 in the code.
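
Something along these lines is what I mean. It is only a sketch under assumptions: struct sketch_page is a hypothetical stand-in for the few perf_event_mmap_page fields used here, sketch_pmc_read() is a stub replacing the architecture-specific counter read, the compiler barrier uses GCC/Clang inline asm, and the width check is an addition of this sketch to keep the shifts well defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the perf_event_mmap_page fields used below. */
struct sketch_page {
	volatile uint32_t lock; /* sequence counter, changes when the page is updated */
	uint32_t index;         /* 0: counter stopped or inactive, else hw index + 1 */
	int64_t offset;         /* value to add to the raw counter */
	uint16_t pmc_width;     /* number of valid bits in the raw counter */
	uint8_t cap_user_rdpmc; /* nonzero if user-space counter reads are allowed */
};

/* Stub for the architecture-specific read; returns a fixed bit pattern. */
static uint64_t
sketch_pmc_read(uint32_t hw_index)
{
	return 0xFFFFFFFFFFFF0000ULL | hw_index;
}

static uint64_t
sketch_read_userpage(const struct sketch_page *pc)
{
	int tries;

	assert(pc != NULL);

	for (tries = 100; tries != 0; tries--) {
		uint32_t seq = pc->lock;
		uint64_t pmc = 0;

		__asm__ volatile("" ::: "memory"); /* compiler barrier */
		uint32_t index = pc->index;
		uint64_t offset = (uint64_t)pc->offset;
		uint64_t width = pc->pmc_width;

		/*
		 * index == 0 is special: the counter is stopped or inactive,
		 * so there is nothing to read from hardware. Valid indices
		 * start at 1, hence the "index - 1" below.
		 */
		if (pc->cap_user_rdpmc && index != 0 && width != 0 && width <= 64) {
			pmc = sketch_pmc_read(index - 1);
			/* Keep only the low "width" bits of the raw value. */
			pmc <<= 64 - width;
			pmc >>= 64 - width;
		}

		__asm__ volatile("" ::: "memory"); /* compiler barrier */

		/* Unchanged sequence counter: the snapshot is consistent. */
		if (pc->lock == seq)
			return pmc + offset;
	}

	return 0; /* gave up after 100 inconsistent snapshots */
}
```

With index 0 the function returns just the offset, which matches the behaviour of the loop in the patch.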

> 
> >[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
> >
> >> > > +			pmc = rte_pmu_pmc_read(index - 1);
> >> > > +			pmc <<= 64 - width;
> >> > > +			pmc >>= 64 - width;
> >> > > +		}
> >> > > +
> >> > > +		rte_compiler_barrier();
> >> > > +
> >> > > +		if (likely(pc->lock == seq))
> >> > > +			return pmc + offset;
> >> > > +
> >> > > +		if (--tries == 0) {
> >> > > +			RTE_LOG(DEBUG, EAL, "failed to get
> >> > > perf_event_mmap_page lock\n");
> >> > > +			break;
> >> > > +		}
> >
> >- Remove the 4 above lines of code, and move the debug log message to
> the end of the function
> >instead.
> >
> >> > > +	}
> >
> >+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> >
> >> > > +
> >> > > +	return 0;
> >> > > +}
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Enable group of events for a given lcore.
> >> > > + *
> >> > > + * @param lcore_id
> >> > > + *   The identifier of the lcore.
> >> > > + * @return
> >> > > + *   0 in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_internal
> >> > > +int
> >> > > +rte_pmu_enable_group(int lcore_id);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Add event to the group of enabled events.
> >> > > + *
> >> > > + * @param name
> >> > > + *   Name of an event listed under
> >> > > /sys/bus/event_source/devices/pmu/events.
> >> > > + * @return
> >> > > + *   Event index in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +int
> >> > > +rte_pmu_add_event(const char *name);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Read hardware counter configured to count occurrences of an
> >> event.
> >> > > + *
> >> > > + * @param index
> >> > > + *   Index of an event to be read.
> >> > > + * @return
> >> > > + *   Event value read from register. In case of errors or lack
> of
> >> > > support
> >> > > + *   0 is returned. In other words, stream of zeros in a trace
> >> file
> >> > > + *   indicates problem with reading particular PMU event
> register.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
> >
> >The index type can be changed from int to uint32_t. This also
> eliminates the "(index < 0" part of
> >the comparison further below in this function.
> >
> 
> That's true.
> 
> >> > > +{
> >> > > +	int lcore_id = rte_lcore_id();
> >> > > +	struct rte_pmu_event_group *group;
> >> > > +	int ret;
> >> > > +
> >> > > +	if (!rte_pmu)
> >> > > +		return 0;
> >> > > +
> >> > > +	group = &rte_pmu->group[lcore_id];
> >> > > +	if (!group->enabled) {
> >
> >Optimized: if (unlikely(!group->enabled)) {
> >
> 
> Compiler will optimize the branch itself correctly. Extra hint is not
> required.

I haven't reviewed the output from this, so I'll take your word for it. I suggested the unlikely() because I previously tested some very simple code, and it optimized for taking the "if":

void testb(bool b)
{
    if (!b)
        exit(1);
    
    exit(99);
}

I guess I should experiment with more realistic code, and update my optimization notes!

You could add the unlikely() for readability purposes. ;-)

> 
> >> > > +		ret = rte_pmu_enable_group(lcore_id);
> >> > > +		if (ret)
> >> > > +			return 0;
> >> > > +
> >> > > +		group->enabled = true;
> >> > > +	}
> >> >
> >> > Why is the group not enabled in the setup function,
> >> rte_pmu_add_event(), instead of here, in the
> >> > hot path?
> >> >
> >>
> >> When this is executed for the very first time then cpu will have
> >> obviously more work to do but afterwards setup path is not taken
> hence
> >> much less cpu cycles are required.
> >>
> >> Setup is executed by main lcore solely, before lcores are executed
> >> hence some info passed to SYS_perf_event_open ioctl() is missing,
> pid
> >> (via rte_gettid()) being an example here.
> >
> >OK. Thank you for the explanation. Since impossible at setup, it has
> to be done at runtime.
> >
> >@Mattias: Another good example of something that would belong in per-
> thread constructors, as my
> >suggested feature creep in [1].
> >
> >[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
> >
> >>
> >> > > +
> >> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
> >
> >Optimized: if (unlikely(index >= rte_pmu.num_group_events))
> >
> >> > > +		return 0;
> >> > > +
> >> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
> >> > > *)group->mmap_pages[index]);
> >> >
> >> > Using fixed size arrays instead of multiple indirections via
> >> > pointers
> >> is faster. It could be:
> >> >
> >> > return rte_pmu_read_userpage((struct perf_event_mmap_page
> >> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >> >
> >> > With or without suggested performance improvements...
> >> >
> >> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
> >>
> 


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-05 22:07                 ` Morten Brørup
@ 2023-01-08 15:41                   ` Tomasz Duszynski
  2023-01-08 16:30                     ` Morten Brørup
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-08 15:41 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Thursday, January 5, 2023 11:08 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; zhoumin@loongson.cn;
>mattias.ronnblom@ericsson.com
>Subject: [EXT] RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Thursday, 5 January 2023 22.14
>>
>> Hi Morten,
>>
>> A few comments inline.
>>
>> >From: Morten Brørup <mb@smartsharesystems.com>
>> >Sent: Wednesday, December 14, 2022 11:41 AM
>> >
>> >+CC: Mattias, see my comment below about per-thread constructor for
>> this
>> >
>> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> >> Sent: Wednesday, 14 December 2022 10.39
>> >>
>> >> Hello Morten,
>> >>
>> >> Thanks for review. Answers inline.
>> >>
>> >> [...]
>> >>
>> >> > > +/**
>> >> > > + * @file
>> >> > > + *
>> >> > > + * PMU event tracing operations
>> >> > > + *
>> >> > > + * This file defines generic API and types necessary to setup
>> PMU
>> >> and
>> >> > > + * read selected counters in runtime.
>> >> > > + */
>> >> > > +
>> >> > > +/**
>> >> > > + * A structure describing a group of events.
>> >> > > + */
>> >> > > +struct rte_pmu_event_group {
>> >> > > +	int *fds; /**< array of event descriptors */
>> >> > > +	void **mmap_pages; /**< array of pointers to mmapped
>> >> > > perf_event_attr structures */
>> >> >
>> >> > There seems to be a lot of indirection involved here. Why are
>> these
>> >> arrays not statically sized,
>> >> > instead of dynamically allocated?
>> >> >
>> >>
>> >> Different architectures/pmus impose limits on number of
>> simultaneously
>> >> enabled counters. So in order to relieve the pain of thinking about it
>> and
>> >> adding macros for each and every arch I decided to allocate the
>> number
>> >> user wants dynamically. Also assumption holds that user knows about
>> >> tradeoffs of using too many counters hence will not enable too many
>> >> events at once.
>> >
>> >The DPDK convention is to use fixed size arrays (with a maximum size,
>> e.g. RTE_MAX_ETHPORTS) in the
>> >fast path, for performance reasons.
>> >
>> >Please use fixed size arrays instead of dynamically allocated arrays.
>> >
>>
>> I do agree that from maintenance angle fixed arrays are much more
>> convenient but when optimization kicks in then that statement does not
>> necessarily hold true anymore.
>>
>> For example, in this case performance dropped by ~0.3% which is
>> insignificant imo. So given simpler code, next patchset will use fixed
>> arrays.
>
>I fail to understand how pointer chasing can perform better than obtaining an address by
>multiplying by a constant. Modern CPUs work in mysterious ways, and you obviously tested this, so I
>believe your test results. But in theory, pointer chasing touches more cache lines, and should
>perform worse in a loaded system where pointers in the chain have been evicted from the cache.
>
>Anyway, you agreed to use fixed arrays, so I am happy. :-)
>
>>
>> >>
>> >> > Also, what is the reason for hiding the type struct
>> >> perf_event_mmap_page **mmap_pages opaque by
>> >> > using void **mmap_pages instead?
>> >>
>> >> I think, that part doing mmap/munmap was written first hence void
>> >> ** was chosen in the first place.
>> >
>> >Please update it, so the actual type is reflected here.
>> >
>> >>
>> >> >
>> >> > > +	bool enabled; /**< true if group was enabled on particular
>> lcore
>> >> > > */
>> >> > > +};
>> >> > > +
>> >> > > +/**
>> >> > > + * A structure describing an event.
>> >> > > + */
>> >> > > +struct rte_pmu_event {
>> >> > > +	char *name; /** name of an event */
>> >> > > +	int index; /** event index into fds/mmap_pages */
>> >> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
>> >> > > +};
>> >> > > +
>> >> > > +/**
>> >> > > + * A PMU state container.
>> >> > > + */
>> >> > > +struct rte_pmu {
>> >> > > +	char *name; /** name of core PMU listed under
>> >> > > /sys/bus/event_source/devices */
>> >> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per
>> lcore
>> >> > > event group data */
>> >> > > +	int num_group_events; /**< number of events in a group */
>> >> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of
>> matching
>> >> > > events */
>> >
>> >The event_list is used in slow path only, so it can remain a list -
>> i.e. no change requested here.
>> >:-)
>> >
>> >> > > +};
>> >> > > +
>> >> > > +/** Pointer to the PMU state container */
>> >> > > +extern struct rte_pmu *rte_pmu;
>> >> >
>> >> > Again, why not just extern struct rte_pmu, instead of dynamic
>> >> allocation?
>> >> >
>> >>
>> >> No strong opinions here since this is a matter of personal
>> preference.
>> >> Can be removed
>> >> in the next version.
>> >
>> >Yes, please.
>> >
>> >>
>> >> > > +
>> >> > > +/** Each architecture supporting PMU needs to provide its own
>> >> version
>> >> > > */
>> >> > > +#ifndef rte_pmu_pmc_read
>> >> > > +#define rte_pmu_pmc_read(index) ({ 0; })
>> >> > > +#endif
>> >> > > +
>> >> > > +/**
>> >> > > + * @internal
>> >> > > + *
>> >> > > + * Read PMU counter.
>> >> > > + *
>> >> > > + * @param pc
>> >> > > + *   Pointer to the mmapped user page.
>> >> > > + * @return
>> >> > > + *   Counter value read from hardware.
>> >> > > + */
>> >> > > +__rte_internal
>> >> > > +static __rte_always_inline uint64_t
>> >> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> >> > > +{
>> >> > > +	uint64_t offset, width, pmc = 0;
>> >> > > +	uint32_t seq, index;
>> >> > > +	int tries = 100;
>> >> > > +
>> >> > > +	for (;;) {
>> >
>> >As a matter of personal preference, I would write this loop
>> differently:
>> >
>> >+ for (tries = 100; tries != 0; tries--) {
>> >
>> >> > > +		seq = pc->lock;
>> >> > > +		rte_compiler_barrier();
>> >> > > +		index = pc->index;
>> >> > > +		offset = pc->offset;
>> >> > > +		width = pc->pmc_width;
>> >> > > +
>> >> > > +		if (likely(pc->cap_user_rdpmc && index)) {
>> >
>> >Why "&& index"? The way I read [man perf_event_open], index 0 is
>> perfectly valid.
>> >
>>
>> Valid index starts at 1. 0 means that either hw counter is stopped or
>> isn't active. Maybe this is not initially clear from man but there's
>> example later on how to get actual number.
>
>OK. Thanks for the reference.
>
>Please add a comment about the special meaning of index 0 in the code.
>
>>
>> >[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
>> >
>> >> > > +			pmc = rte_pmu_pmc_read(index - 1);
>> >> > > +			pmc <<= 64 - width;
>> >> > > +			pmc >>= 64 - width;
>> >> > > +		}
>> >> > > +
>> >> > > +		rte_compiler_barrier();
>> >> > > +
>> >> > > +		if (likely(pc->lock == seq))
>> >> > > +			return pmc + offset;
>> >> > > +
>> >> > > +		if (--tries == 0) {
>> >> > > +			RTE_LOG(DEBUG, EAL, "failed to get
>> >> > > perf_event_mmap_page lock\n");
>> >> > > +			break;
>> >> > > +		}
>> >
>> >- Remove the 4 above lines of code, and move the debug log message to
>> the end of the function
>> >instead.
>> >
>> >> > > +	}
>> >
>> >+ RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
>> >
>> >> > > +
>> >> > > +	return 0;
>> >> > > +}
>> >> > > +
>> >> > > +/**
>> >> > > + * @internal
>> >> > > + *
>> >> > > + * Enable group of events for a given lcore.
>> >> > > + *
>> >> > > + * @param lcore_id
>> >> > > + *   The identifier of the lcore.
>> >> > > + * @return
>> >> > > + *   0 in case of success, negative value otherwise.
>> >> > > + */
>> >> > > +__rte_internal
>> >> > > +int
>> >> > > +rte_pmu_enable_group(int lcore_id);
>> >> > > +
>> >> > > +/**
>> >> > > + * @warning
>> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> >> > > + *
>> >> > > + * Add event to the group of enabled events.
>> >> > > + *
>> >> > > + * @param name
>> >> > > + *   Name of an event listed under
>> >> > > /sys/bus/event_source/devices/pmu/events.
>> >> > > + * @return
>> >> > > + *   Event index in case of success, negative value otherwise.
>> >> > > + */
>> >> > > +__rte_experimental
>> >> > > +int
>> >> > > +rte_pmu_add_event(const char *name);
>> >> > > +
>> >> > > +/**
>> >> > > + * @warning
>> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
>> >> > > + *
>> >> > > + * Read hardware counter configured to count occurrences of an
>> >> event.
>> >> > > + *
>> >> > > + * @param index
>> >> > > + *   Index of an event to be read.
>> >> > > + * @return
>> >> > > + *   Event value read from register. In case of errors or lack
>> of
>> >> > > support
>> >> > > + *   0 is returned. In other words, stream of zeros in a trace
>> >> file
>> >> > > + *   indicates problem with reading particular PMU event
>> register.
>> >> > > + */
>> >> > > +__rte_experimental
>> >> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
>> >
>> >The index type can be changed from int to uint32_t. This also
>> eliminates the "(index < 0" part of
>> >the comparison further below in this function.
>> >
>>
>> That's true.
>>
>> >> > > +{
>> >> > > +	int lcore_id = rte_lcore_id();
>> >> > > +	struct rte_pmu_event_group *group;
>> >> > > +	int ret;
>> >> > > +
>> >> > > +	if (!rte_pmu)
>> >> > > +		return 0;
>> >> > > +
>> >> > > +	group = &rte_pmu->group[lcore_id];
>> >> > > +	if (!group->enabled) {
>> >
>> >Optimized: if (unlikely(!group->enabled)) {
>> >
>>
>> Compiler will optimize the branch itself correctly. Extra hint is not
>> required.
>
>I haven't reviewed the output from this, so I'll take your word for it. I suggested the unlikely()
>because I previously tested some very simple code, and it optimized for taking the "if":
>
>void testb(bool b)
>{
>    if (!b)
>        exit(1);
>
>    exit(99);
>}
>
>I guess I should experiment with more realistic code, and update my optimization notes!
>

I think this example may be too simple to draw far-reaching conclusions from. The compiler will
make the fall-through path the more likely one. If I recall correctly, the Intel Optimization
Reference Manual has some more info on this.

Let's take a different example.

int main(int argc, char *argv[])
{
        int *p;

        p = malloc(sizeof(*p));
        if (!p)
                return 1;
        *p = atoi(argv[1]);
        if (*p < 0)
                return 2;
        free(p);

        return 0;
}

Compiled with -O3 and disassembled, this gives the following:

00000000000010a0 <main>:
    10a0:       f3 0f 1e fa             endbr64
    10a4:       55                      push   %rbp
    10a5:       bf 04 00 00 00          mov    $0x4,%edi
    10aa:       53                      push   %rbx
    10ab:       48 89 f3                mov    %rsi,%rbx
    10ae:       48 83 ec 08             sub    $0x8,%rsp
    10b2:       e8 d9 ff ff ff          call   1090 <malloc@plt>
    10b7:       48 85 c0                test   %rax,%rax
    10ba:       74 31                   je     10ed <main+0x4d>
    10bc:       48 8b 7b 08             mov    0x8(%rbx),%rdi
    10c0:       ba 0a 00 00 00          mov    $0xa,%edx
    10c5:       31 f6                   xor    %esi,%esi
    10c7:       48 89 c5                mov    %rax,%rbp
    10ca:       e8 b1 ff ff ff          call   1080 <strtol@plt>
    10cf:       49 89 c0                mov    %rax,%r8
    10d2:       b8 02 00 00 00          mov    $0x2,%eax
    10d7:       45 85 c0                test   %r8d,%r8d
    10da:       78 0a                   js     10e6 <main+0x46>
    10dc:       48 89 ef                mov    %rbp,%rdi
    10df:       e8 8c ff ff ff          call   1070 <free@plt>
    10e4:       31 c0                   xor    %eax,%eax
    10e6:       48 83 c4 08             add    $0x8,%rsp
    10ea:       5b                      pop    %rbx
    10eb:       5d                      pop    %rbp
    10ec:       c3                      ret
    10ed:       b8 01 00 00 00          mov    $0x1,%eax
    10f2:       eb f2                   jmp    10e6 <main+0x46>

Looking at both 10ba and 10da suggests that the code was laid out so that the expected path
falls through without taking a jump. Also, the potentially less frequently executed code (at 10ed)
is pushed further down in memory, which optimizes cache line usage.

That said, each and every scenario needs its own analysis.

>You could add the unlikely() for readability purposes. ;-)
>

Sure. That won't hurt performance.   
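
For illustration, the lazily enabled branch with the hint could look roughly like the sketch below. It is not the patch code: sketch_likely()/sketch_unlikely() are local stand-ins for DPDK's likely()/unlikely() built on the GCC/Clang __builtin_expect(), and the group structure and enable function are hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for DPDK's likely()/unlikely(), GCC/Clang only. */
#define sketch_likely(x)   __builtin_expect(!!(x), 1)
#define sketch_unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical, simplified per-lcore group state. */
struct sketch_group {
	bool enabled;
	uint64_t value;
};

static int sketch_enable_calls; /* counts how often the slow path ran */

/* Stand-in for the one-time setup done on first use. */
static int
sketch_enable_group(struct sketch_group *g)
{
	sketch_enable_calls++;
	g->value = 0;
	return 0;
}

static uint64_t
sketch_read(struct sketch_group *g)
{
	assert(g != NULL);

	/* Setup runs once per group; mark it as the cold path. */
	if (sketch_unlikely(!g->enabled)) {
		if (sketch_enable_group(g) != 0)
			return 0;
		g->enabled = true;
	}

	/* Hot path: stands in for the actual counter read. */
	return ++g->value;
}
```

Whether the hint changes the generated code depends on the compiler and flags, so the main benefit here is that the cold path is documented for the reader.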

>>
>> >> > > +		ret = rte_pmu_enable_group(lcore_id);
>> >> > > +		if (ret)
>> >> > > +			return 0;
>> >> > > +
>> >> > > +		group->enabled = true;
>> >> > > +	}
>> >> >
>> >> > Why is the group not enabled in the setup function,
>> >> rte_pmu_add_event(), instead of here, in the
>> >> > hot path?
>> >> >
>> >>
>> >> When this is executed for the very first time then cpu will have
>> >> obviously more work to do but afterwards setup path is not taken
>> hence
>> >> much less cpu cycles are required.
>> >>
>> >> Setup is executed by main lcore solely, before lcores are executed
>> >> hence some info passed to SYS_perf_event_open ioctl() is missing,
>> pid
>> >> (via rte_gettid()) being an example here.
>> >
>> >OK. Thank you for the explanation. Since impossible at setup, it has
>> to be done at runtime.
>> >
>> >@Mattias: Another good example of something that would belong in per-
>> thread constructors, as my
>> >suggested feature creep in [1].
>> >
>> >[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
>> >
>> >>
>> >> > > +
>> >> > > +	if (index < 0 || index >= rte_pmu->num_group_events)
>> >
>> >Optimized: if (unlikely(index >= rte_pmu.num_group_events))
>> >
>> >> > > +		return 0;
>> >> > > +
>> >> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page
>> >> > > *)group->mmap_pages[index]);
>> >> >
>> >> > Using fixed size arrays instead of multiple indirections via
>> >> > pointers
>> >> is faster. It could be:
>> >> >
>> >> > return rte_pmu_read_userpage((struct perf_event_mmap_page
>> >> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
>> >> >
>> >> > With or without suggested performance improvements...
>> >> >
>> >> > Series-acked-by: Morten Brørup <mb@smartsharesystems.com>
>> >>
>>


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-08 15:41                   ` Tomasz Duszynski
@ 2023-01-08 16:30                     ` Morten Brørup
  0 siblings, 0 replies; 205+ messages in thread
From: Morten Brørup @ 2023-01-08 16:30 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, zhoumin, mattias.ronnblom

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Sunday, 8 January 2023 16.41
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Thursday, January 5, 2023 11:08 PM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Thursday, 5 January 2023 22.14
> >>
> >> Hi Morten,
> >>
> >> A few comments inline.
> >>
> >> >From: Morten Brørup <mb@smartsharesystems.com>
> >> >Sent: Wednesday, December 14, 2022 11:41 AM
> >> >
> >> >+CC: Mattias, see my comment below about per-thread constructor for
> >> this
> >> >
> >> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> >> Sent: Wednesday, 14 December 2022 10.39
> >> >>
> >> >> Hello Morten,
> >> >>
> >> >> Thanks for review. Answers inline.
> >> >>

[...]

> >> >> > > +{
> >> >> > > +	int lcore_id = rte_lcore_id();
> >> >> > > +	struct rte_pmu_event_group *group;
> >> >> > > +	int ret;
> >> >> > > +
> >> >> > > +	if (!rte_pmu)
> >> >> > > +		return 0;
> >> >> > > +
> >> >> > > +	group = &rte_pmu->group[lcore_id];
> >> >> > > +	if (!group->enabled) {
> >> >
> >> >Optimized: if (unlikely(!group->enabled)) {
> >> >
> >>
> >> Compiler will optimize the branch itself correctly. Extra hint is
> not
> >> required.
> >
> >I haven't reviewed the output from this, so I'll take your word for
> it. I suggested the unlikely()
> >because I previously tested some very simple code, and it optimized
> for taking the "if":
> >
> >void testb(bool b)
> >{
> >    if (!b)
> >        exit(1);
> >
> >    exit(99);
> >}
> >
> >I guess I should experiment with more realistic code, and update my
> optimization notes!
> >
> 
> I think this may be too simple to draw far-reaching conclusions from
> it. Compiler will make the
> fall-through path more likely. If I recall Intel Optimization Reference
> Manual has some more
> info on this.

IIRC, the Intel Optimization Reference Manual discusses branch optimization for assembler, not C.

> 
> Lets take a different example.
> 
> int main(int argc, char *argv[])
> {
>         int *p;
> 
>         p = malloc(sizeof(*p));
>         if (!p)
>                 return 1;
>         *p = atoi(argv[1]);
>         if (*p < 0)
>                 return 2;
>         free(p);
> 
>         return 0;
> }
> 
> If compiled with -O3 and disassembled I got below.
> 
> 00000000000010a0 <main>:
>     10a0:       f3 0f 1e fa             endbr64
>     10a4:       55                      push   %rbp
>     10a5:       bf 04 00 00 00          mov    $0x4,%edi
>     10aa:       53                      push   %rbx
>     10ab:       48 89 f3                mov    %rsi,%rbx
>     10ae:       48 83 ec 08             sub    $0x8,%rsp
>     10b2:       e8 d9 ff ff ff          call   1090 <malloc@plt>
>     10b7:       48 85 c0                test   %rax,%rax
>     10ba:       74 31                   je     10ed <main+0x4d>
>     10bc:       48 8b 7b 08             mov    0x8(%rbx),%rdi
>     10c0:       ba 0a 00 00 00          mov    $0xa,%edx
>     10c5:       31 f6                   xor    %esi,%esi
>     10c7:       48 89 c5                mov    %rax,%rbp
>     10ca:       e8 b1 ff ff ff          call   1080 <strtol@plt>
>     10cf:       49 89 c0                mov    %rax,%r8
>     10d2:       b8 02 00 00 00          mov    $0x2,%eax
>     10d7:       45 85 c0                test   %r8d,%r8d
>     10da:       78 0a                   js     10e6 <main+0x46>
>     10dc:       48 89 ef                mov    %rbp,%rdi
>     10df:       e8 8c ff ff ff          call   1070 <free@plt>
>     10e4:       31 c0                   xor    %eax,%eax
>     10e6:       48 83 c4 08             add    $0x8,%rsp
>     10ea:       5b                      pop    %rbx
>     10eb:       5d                      pop    %rbp
>     10ec:       c3                      ret
>     10ed:       b8 01 00 00 00          mov    $0x1,%eax
>     10f2:       eb f2                   jmp    10e6 <main+0x46>
> 
> Looking at both 10ba and 10da suggests that code was laid out in a way
> that jumping is frowned upon. Also
> potentially less frequently executed code (at 10ed) is pushed further
> down the memory hence optimizing cache line usage.

In my notes, I have (ptr == NULL) marked as considered unlikely, but (int == 0) marked as considered likely. Since group->enabled is bool, I guessed the compiler would treat it like int and consider (!group->enabled) as likely.

Like in your example here, I also have (int < 0) marked as considered unlikely.

> 
> That said, each and every scenario needs analysis on its own.

Agree. Theory is good, validation is better. ;-)

> 
> >You could add the unlikely() for readability purposes. ;-)
> >
> 
> Sure. That won't hurt performance.

I think we are both in agreement about the intentions here, so I won't hold you back with further academic discussions at this point. I might resume the discussion with your next patch version, though. ;-)



^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2022-12-13 10:43       ` [PATCH v4 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2022-12-13 11:52         ` Morten Brørup
  2022-12-15  8:46         ` Mattias Rönnblom
@ 2023-01-09  7:37         ` Ruifeng Wang
  2023-01-09 15:40           ` Tomasz Duszynski
  2 siblings, 1 reply; 205+ messages in thread
From: Ruifeng Wang @ 2023-01-09  7:37 UTC (permalink / raw)
  To: Tomasz Duszynski, dev; +Cc: thomas, jerinj, mb, zhoumin, nd

> -----Original Message-----
> From: Tomasz Duszynski <tduszynski@marvell.com>
> Sent: Tuesday, December 13, 2022 6:44 PM
> To: dev@dpdk.org
> Cc: thomas@monjalon.net; jerinj@marvell.com; mb@smartsharesystems.com; zhoumin@loongson.cn;
> Tomasz Duszynski <tduszynski@marvell.com>
> Subject: [PATCH v4 1/4] eal: add generic support for reading PMU events
> 
> Add support for programming PMU counters and reading their values in runtime bypassing
> kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use standard perf utility
> without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>  app/test/meson.build                  |   1 +
>  app/test/test_pmu.c                   |  41 +++
>  doc/guides/prog_guide/profile_app.rst |   8 +
>  lib/eal/common/meson.build            |   3 +
>  lib/eal/common/pmu_private.h          |  41 +++
>  lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
>  lib/eal/include/meson.build           |   1 +
>  lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>  lib/eal/linux/eal.c                   |   4 +
>  lib/eal/version.map                   |   6 +
>  10 files changed, 765 insertions(+)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/eal/common/pmu_private.h
>  create mode 100644 lib/eal/common/rte_pmu.c
>  create mode 100644 lib/eal/include/rte_pmu.h
> 
<snip>
> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
> new file mode 100644
> index 0000000000..049fe19fe3
> --- /dev/null
> +++ b/lib/eal/common/rte_pmu.c
> @@ -0,0 +1,456 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2022 Marvell International Ltd.
> + */
> +
> +#include <ctype.h>
> +#include <dirent.h>
> +#include <errno.h>
> +#include <regex.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/queue.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#include <rte_eal_paging.h>
> +#include <rte_pmu.h>
> +#include <rte_tailq.h>
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#ifndef GENMASK_ULL
> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
> +#endif
> +
> +#ifndef FIELD_PREP
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +#endif
> +
> +struct rte_pmu *rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
> +{
> +	RTE_SET_USED(config);
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char *config = NULL;
> +	char path[PATH_MAX];
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	/* quiesce -Wmaybe-uninitialized warning */
> +	*num = 0;
> +	*mask = 0;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = GENMASK_ULL(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* <term>=<value> */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
> +	fp = fopen(path, "r");
> +	if (!fp)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(),

It looks like passing '0' instead of rte_gettid() has the same effect. A small optimization.
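
For illustration, a minimal standalone sketch of the pid == 0 form follows. It
is not taken from the patch: the helper name open_cycles_for_self() and the use
of the generic PERF_COUNT_HW_CPU_CYCLES event are assumptions made only for
this example. Per perf_event_open(2), pid == 0 selects the calling thread, so
no gettid() call should be needed.

```c
#include <assert.h>
#include <errno.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a user-space cycle counter for the calling thread.
 * pid == 0 means "the calling thread", so gettid() is not required.
 * cpu == -1 follows the thread on any CPU; cpu >= 0 counts only while the
 * thread runs on that CPU.
 * Returns a file descriptor on success or -errno on failure.
 */
static int
open_cycles_for_self(int cpu)
{
	struct perf_event_attr attr;
	long fd;

	assert(cpu >= -1);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.exclude_kernel = 1;
	attr.exclude_hv = 1;
	attr.disabled = 1;

	/* arguments: attr, pid, cpu, group_fd, flags */
	fd = syscall(SYS_perf_event_open, &attr, 0, cpu, -1, 0UL);

	return fd == -1 ? -errno : (int)fd;
}
```

The call can still fail, for example when kernel.perf_event_paranoid forbids
unprivileged access, so callers need to check for a negative return value.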

> rte_lcore_to_cpu_id(lcore_id),
> +		       group_fd, 0);
> +}
> +
> +static int
> +open_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret) {
> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
> +			continue;
> +		}
> +
> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
> +		if (ret == -1) {
> +			if (errno == EOPNOTSUPP)
> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
> +
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	void *addr;
> +	int ret, i;
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int i;
> +
> +	if (!group->fds)
> +		return;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], rte_mem_page_size());
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	free(group->mmap_pages);
> +	free(group->fds);
> +
> +	group->mmap_pages = NULL;
> +	group->fds = NULL;
> +	group->enabled = false;
> +}
> +
> +int __rte_noinline
> +rte_pmu_enable_group(int lcore_id)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
> +	int ret;
> +
> +	if (rte_pmu->num_group_events == 0) {
> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
> +
> +		return 0;
> +	}
> +
> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
> +	if (!group->fds) {
> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
> +
> +		return -ENOMEM;
> +	}
> +
> +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
> +	if (!group->mmap_pages) {
> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
> +
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = open_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	ret = mmap_events(lcore_id);
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
> +
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	return 0;
> +
> +out:
> +	cleanup_events(lcore_id);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (!dirp)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	closedir(dirp);
> +
> +	if (dent) {
> +		rte_pmu->name = strdup(name);
> +		if (!rte_pmu->name)
> +			return -ENOMEM;
> +	}
> +
> +	return rte_pmu->name ? 0 : -ENODEV;
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);

Better to check if rte_pmu is available.
See below.

> +	if (access(path, R_OK))
> +		return -ENODEV;
> +
> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
> +		if (!strcmp(event->name, name))
> +			return event->index;
> +		continue;
> +	}
> +
> +	event = calloc(1, sizeof(*event));
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->name = strdup(name);
> +	if (!event->name) {
> +		free(event);
> +
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu->num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
> +
> +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name, event->index);
> +
> +	return event->index;
> +}
> +
> +void
> +eal_pmu_init(void)
> +{
> +	int ret;
> +
> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
> +	if (!rte_pmu) {
> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
> +
> +		return;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu->event_list);
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
> +		goto out;
> +	}
> +
> +	return;
> +out:
> +	free(rte_pmu->name);
> +	free(rte_pmu);

Set rte_pmu to NULL to prevent unintentional use?

> +}
> +
> +void
> +eal_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp;
> +	int lcore_id;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {

rte_pmu can be unavailable if init fails. Better to check before accessing.
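
The two suggestions (reset the pointer after freeing it, and check it before
use) could look roughly like the sketch below. It is only an illustration:
struct pmu_ctx, pmu_ctx_init(), pmu_ctx_fini() and the simulate_failure knob
are hypothetical stand-ins, not names from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_pmu; only the name field is modelled. */
struct pmu_ctx {
	char *name;
};

static struct pmu_ctx *pmu;

/* simulate_failure is an illustration-only knob standing in for a failed PMU scan. */
static int
pmu_ctx_init(int simulate_failure)
{
	static const char core_pmu[] = "cpu";

	assert(pmu == NULL); /* init must not run twice without fini */

	pmu = calloc(1, sizeof(*pmu));
	if (pmu == NULL)
		return -1;

	if (!simulate_failure) {
		pmu->name = malloc(sizeof(core_pmu));
		if (pmu->name != NULL)
			memcpy(pmu->name, core_pmu, sizeof(core_pmu));
	}

	if (pmu->name == NULL) {
		free(pmu);
		pmu = NULL; /* leave no dangling pointer behind */
		return -1;
	}

	return 0;
}

static void
pmu_ctx_fini(void)
{
	if (pmu == NULL) /* init failed or never ran */
		return;

	free(pmu->name);
	free(pmu);
	pmu = NULL;
}
```

With the pointer reset on every failure path, the cleanup routine becomes safe
to call unconditionally, including after a failed init.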

> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
> +		free(event->name);
> +		free(event);
> +	}
> +
> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
> +		cleanup_events(lcore_id);
> +
> +	pmu_arch_fini();
> +	free(rte_pmu->name);
> +	free(rte_pmu);
> +}
<snip>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
  2023-01-09  7:37         ` Ruifeng Wang
@ 2023-01-09 15:40           ` Tomasz Duszynski
  0 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-09 15:40 UTC (permalink / raw)
  To: Ruifeng Wang, dev; +Cc: thomas, Jerin Jacob Kollanukkaran, mb, zhoumin, nd

Hi Ruifeng, 

>-----Original Message-----
>From: Ruifeng Wang <Ruifeng.Wang@arm.com>
>Sent: Monday, January 9, 2023 8:37 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; mb@smartsharesystems.com;
>zhoumin@loongson.cn; nd <nd@arm.com>
>Subject: [EXT] RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>> -----Original Message-----
>> From: Tomasz Duszynski <tduszynski@marvell.com>
>> Sent: Tuesday, December 13, 2022 6:44 PM
>> To: dev@dpdk.org
>> Cc: thomas@monjalon.net; jerinj@marvell.com; mb@smartsharesystems.com;
>> zhoumin@loongson.cn; Tomasz Duszynski <tduszynski@marvell.com>
>> Subject: [PATCH v4 1/4] eal: add generic support for reading PMU
>> events
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>>  app/test/meson.build                  |   1 +
>>  app/test/test_pmu.c                   |  41 +++
>>  doc/guides/prog_guide/profile_app.rst |   8 +
>>  lib/eal/common/meson.build            |   3 +
>>  lib/eal/common/pmu_private.h          |  41 +++
>>  lib/eal/common/rte_pmu.c              | 456 ++++++++++++++++++++++++++
>>  lib/eal/include/meson.build           |   1 +
>>  lib/eal/include/rte_pmu.h             | 204 ++++++++++++
>>  lib/eal/linux/eal.c                   |   4 +
>>  lib/eal/version.map                   |   6 +
>>  10 files changed, 765 insertions(+)
>>  create mode 100644 app/test/test_pmu.c
>>  create mode 100644 lib/eal/common/pmu_private.h
>>  create mode 100644 lib/eal/common/rte_pmu.c
>>  create mode 100644 lib/eal/include/rte_pmu.h
>>
><snip>
>> diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
>> new file mode 100644
>> index 0000000000..049fe19fe3
>> --- /dev/null
>> +++ b/lib/eal/common/rte_pmu.c
>> @@ -0,0 +1,456 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(C) 2022 Marvell International Ltd.
>> + */
>> +
>> +#include <ctype.h>
>> +#include <dirent.h>
>> +#include <errno.h>
>> +#include <regex.h>
>> +#include <stdlib.h>
>> +#include <string.h>
>> +#include <sys/ioctl.h>
>> +#include <sys/mman.h>
>> +#include <sys/queue.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +#include <rte_eal_paging.h>
>> +#include <rte_pmu.h>
>> +#include <rte_tailq.h>
>> +
>> +#include "pmu_private.h"
>> +
>> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
>> +
>> +#ifndef GENMASK_ULL
>> +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
>> +#endif
>> +
>> +#ifndef FIELD_PREP
>> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
>> +#endif
>> +
>> +struct rte_pmu *rte_pmu;
>> +
>> +/*
>> + * Following __rte_weak functions provide default no-op. Architectures should override them if
>> + * necessary.
>> + */
>> +
>> +int
>> +__rte_weak pmu_arch_init(void)
>> +{
>> +	return 0;
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fini(void)
>> +{
>> +}
>> +
>> +void
>> +__rte_weak pmu_arch_fixup_config(uint64_t config[3])
>> +{
>> +	RTE_SET_USED(config);
>> +}
>> +
>> +static int
>> +get_term_format(const char *name, int *num, uint64_t *mask)
>> +{
>> +	char *config = NULL;
>> +	char path[PATH_MAX];
>> +	int high, low, ret;
>> +	FILE *fp;
>> +
>> +	/* quiesce -Wmaybe-uninitialized warning */
>> +	*num = 0;
>> +	*mask = 0;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu->name, name);
>> +	fp = fopen(path, "r");
>> +	if (!fp)
>> +		return -errno;
>> +
>> +	errno = 0;
>> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
>> +	if (ret < 2) {
>> +		ret = -ENODATA;
>> +		goto out;
>> +	}
>> +	if (errno) {
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	if (ret == 2)
>> +		high = low;
>> +
>> +	*mask = GENMASK_ULL(high, low);
>> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
>> +	*num = config[strlen(config) - 1];
>> +	*num = isdigit(*num) ? *num - '0' : 0;
>> +
>> +	ret = 0;
>> +out:
>> +	free(config);
>> +	fclose(fp);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +parse_event(char *buf, uint64_t config[3])
>> +{
>> +	char *token, *term;
>> +	int num, ret, val;
>> +	uint64_t mask;
>> +
>> +	config[0] = config[1] = config[2] = 0;
>> +
>> +	token = strtok(buf, ",");
>> +	while (token) {
>> +		errno = 0;
>> +		/* <term>=<value> */
>> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
>> +		if (ret < 1)
>> +			return -ENODATA;
>> +		if (errno)
>> +			return -errno;
>> +		if (ret == 1)
>> +			val = 1;
>> +
>> +		ret = get_term_format(term, &num, &mask);
>> +		free(term);
>> +		if (ret)
>> +			return ret;
>> +
>> +		config[num] |= FIELD_PREP(mask, val);
>> +		token = strtok(NULL, ",");
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +get_event_config(const char *name, uint64_t config[3])
>> +{
>> +	char path[PATH_MAX], buf[BUFSIZ];
>> +	FILE *fp;
>> +	int ret;
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
>> +	fp = fopen(path, "r");
>> +	if (!fp)
>> +		return -errno;
>> +
>> +	ret = fread(buf, 1, sizeof(buf), fp);
>> +	if (ret == 0) {
>> +		fclose(fp);
>> +
>> +		return -EINVAL;
>> +	}
>> +	fclose(fp);
>> +	buf[ret] = '\0';
>> +
>> +	return parse_event(buf, config);
>> +}
>> +
>> +static int
>> +do_perf_event_open(uint64_t config[3], int lcore_id, int group_fd)
>> +{
>> +	struct perf_event_attr attr = {
>> +		.size = sizeof(struct perf_event_attr),
>> +		.type = PERF_TYPE_RAW,
>> +		.exclude_kernel = 1,
>> +		.exclude_hv = 1,
>> +		.disabled = 1,
>> +	};
>> +
>> +	pmu_arch_fixup_config(config);
>> +
>> +	attr.config = config[0];
>> +	attr.config1 = config[1];
>> +	attr.config2 = config[2];
>> +
>> +	return syscall(SYS_perf_event_open, &attr, rte_gettid(),
>
>It looks like passing '0' instead of rte_gettid() has the same effect. A small optimization.
>
>> rte_lcore_to_cpu_id(lcore_id),
>> +		       group_fd, 0);
>> +}
>> +
>> +static int
>> +open_events(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	struct rte_pmu_event *event;
>> +	uint64_t config[3];
>> +	int num = 0, ret;
>> +
>> +	/* group leader gets created first, with fd = -1 */
>> +	group->fds[0] = -1;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
>> +		ret = get_event_config(event->name, config);
>> +		if (ret) {
>> +			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
>> +			continue;
>> +		}
>> +
>> +		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
>> +		if (ret == -1) {
>> +			if (errno == EOPNOTSUPP)
>> +				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
>> +
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->fds[event->index] = ret;
>> +		num++;
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (--num; num >= 0; num--) {
>> +		close(group->fds[num]);
>> +		group->fds[num] = -1;
>> +	}
>> +
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +mmap_events(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	void *addr;
>> +	int ret, i;
>> +
>> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
>> +		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
>> +		if (addr == MAP_FAILED) {
>> +			ret = -errno;
>> +			goto out;
>> +		}
>> +
>> +		group->mmap_pages[i] = addr;
>> +	}
>> +
>> +	return 0;
>> +out:
>> +	for (; i; i--) {
>> +		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
>> +		group->mmap_pages[i - 1] = NULL;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static void
>> +cleanup_events(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	int i;
>> +
>> +	if (!group->fds)
>> +		return;
>> +
>> +	if (group->fds[0] != -1)
>> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
>> +
>> +	for (i = 0; i < rte_pmu->num_group_events; i++) {
>> +		if (group->mmap_pages[i]) {
>> +			munmap(group->mmap_pages[i], rte_mem_page_size());
>> +			group->mmap_pages[i] = NULL;
>> +		}
>> +
>> +		if (group->fds[i] != -1) {
>> +			close(group->fds[i]);
>> +			group->fds[i] = -1;
>> +		}
>> +	}
>> +
>> +	free(group->mmap_pages);
>> +	free(group->fds);
>> +
>> +	group->mmap_pages = NULL;
>> +	group->fds = NULL;
>> +	group->enabled = false;
>> +}
>> +
>> +int __rte_noinline
>> +rte_pmu_enable_group(int lcore_id)
>> +{
>> +	struct rte_pmu_event_group *group = &rte_pmu->group[lcore_id];
>> +	int ret;
>> +
>> +	if (rte_pmu->num_group_events == 0) {
>> +		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
>> +
>> +		return 0;
>> +	}
>> +
>> +	group->fds = calloc(rte_pmu->num_group_events, sizeof(*group->fds));
>> +	if (!group->fds) {
>> +		RTE_LOG(ERR, EAL, "failed to alloc descriptor memory\n");
>> +
>> +		return -ENOMEM;
>> +	}
>> +
>> +	group->mmap_pages = calloc(rte_pmu->num_group_events, sizeof(*group->mmap_pages));
>> +	if (!group->mmap_pages) {
>> +		RTE_LOG(ERR, EAL, "failed to alloc userpage memory\n");
>> +
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	ret = open_events(lcore_id);
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
>> +		goto out;
>> +	}
>> +
>> +	ret = mmap_events(lcore_id);
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
>> +		goto out;
>> +	}
>> +
>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>> +		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
>> +
>> +		ret = -errno;
>> +		goto out;
>> +	}
>> +
>> +	return 0;
>> +
>> +out:
>> +	cleanup_events(lcore_id);
>> +
>> +	return ret;
>> +}
>> +
>> +static int
>> +scan_pmus(void)
>> +{
>> +	char path[PATH_MAX];
>> +	struct dirent *dent;
>> +	const char *name;
>> +	DIR *dirp;
>> +
>> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
>> +	if (!dirp)
>> +		return -errno;
>> +
>> +	while ((dent = readdir(dirp))) {
>> +		name = dent->d_name;
>> +		if (name[0] == '.')
>> +			continue;
>> +
>> +		/* sysfs entry should either contain cpus or be a cpu */
>> +		if (!strcmp(name, "cpu"))
>> +			break;
>> +
>> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
>> +		if (access(path, F_OK) == 0)
>> +			break;
>> +	}
>> +
>> +	closedir(dirp);
>> +
>> +	if (dent) {
>> +		rte_pmu->name = strdup(name);
>> +		if (!rte_pmu->name)
>> +			return -ENOMEM;
>> +	}
>> +
>> +	return rte_pmu->name ? 0 : -ENODEV;
>> +}
>> +
>> +int
>> +rte_pmu_add_event(const char *name)
>> +{
>> +	struct rte_pmu_event *event;
>> +	char path[PATH_MAX];
>> +
>> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu->name, name);
>
>Better to check if rte_pmu is available.
>See below.
>
>> +	if (access(path, R_OK))
>> +		return -ENODEV;
>> +
>> +	TAILQ_FOREACH(event, &rte_pmu->event_list, next) {
>> +		if (!strcmp(event->name, name))
>> +			return event->index;
>> +		continue;
>> +	}
>> +
>> +	event = calloc(1, sizeof(*event));
>> +	if (!event)
>> +		return -ENOMEM;
>> +
>> +	event->name = strdup(name);
>> +	if (!event->name) {
>> +		free(event);
>> +
>> +		return -ENOMEM;
>> +	}
>> +
>> +	event->index = rte_pmu->num_group_events++;
>> +	TAILQ_INSERT_TAIL(&rte_pmu->event_list, event, next);
>> +
>> +	RTE_LOG(DEBUG, EAL, "%s even added at index %d\n", name, event->index);
>> +
>> +	return event->index;
>> +}
>> +
>> +void
>> +eal_pmu_init(void)
>> +{
>> +	int ret;
>> +
>> +	rte_pmu = calloc(1, sizeof(*rte_pmu));
>> +	if (!rte_pmu) {
>> +		RTE_LOG(ERR, EAL, "failed to alloc PMU\n");
>> +
>> +		return;
>> +	}
>> +
>> +	TAILQ_INIT(&rte_pmu->event_list);
>> +
>> +	ret = scan_pmus();
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
>> +		goto out;
>> +	}
>> +
>> +	ret = pmu_arch_init();
>> +	if (ret) {
>> +		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
>> +		goto out;
>> +	}
>> +
>> +	return;
>> +out:
>> +	free(rte_pmu->name);
>> +	free(rte_pmu);
>
>Set rte_pmu to NULL to prevent unintentional use?
>

The next series will use a global pmu instance, so this will no longer be
required, though your suggestion may be applied to other pointers around.

>> +}
>> +
>> +void
>> +eal_pmu_fini(void)
>> +{
>> +	struct rte_pmu_event *event, *tmp;
>> +	int lcore_id;
>> +
>> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu->event_list, next, tmp) {
>
>rte_pmu can be unavailable if init fails. Better to check before accessing.
>

Yep. 

>> +		TAILQ_REMOVE(&rte_pmu->event_list, event, next);
>> +		free(event->name);
>> +		free(event);
>> +	}
>> +
>> +	RTE_LCORE_FOREACH_WORKER(lcore_id)
>> +		cleanup_events(lcore_id);
>> +
>> +	pmu_arch_fini();
>> +	free(rte_pmu->name);
>> +	free(rte_pmu);
>> +}
><snip>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v5 0/4] add support for self monitoring
  2022-12-13 10:43     ` [PATCH v4 " Tomasz Duszynski
                         ` (3 preceding siblings ...)
  2022-12-13 10:43       ` [PATCH v4 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-01-10 23:46       ` Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
                           ` (7 more replies)
  4 siblings, 8 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

This series adds self-monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use the self-monitoring functionality.

Events are enabled by name; the names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today,
raw events are not supported.
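
As a rough illustration of how a named event is turned into perf config
values, the sketch below combines an event description (the contents of a file
under events/) with per-term bit ranges (the contents of files under format/).
It is a simplified standalone sketch, not the code from the patch: the sysfs
contents are passed in as strings, and struct term_format, parse_format() and
encode_event() are hypothetical names. The sample ranges used in the test
(event at config:0-7, umask at config:8-15) are assumptions that match what
x86 core PMUs typically export.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mask with bits l..h (inclusive) set. */
#define GENMASK_ULL(h, l) ((~0ULL << (l)) & (~0ULL >> (63 - (h))))

/* Hypothetical stand-in for one file under format/, e.g. "umask" -> "config:8-15". */
struct term_format {
	const char *term;
	const char *spec;
};

/* Parse "config[N]:low[-high]" into a config index and a bit mask. */
static int
parse_format(const char *spec, int *num, uint64_t *mask)
{
	char reg[16];
	int low, high, ret;

	ret = sscanf(spec, "%15[^:]:%d-%d", reg, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low;
	if (low < 0 || high > 63 || low > high)
		return -1;

	*mask = GENMASK_ULL(high, low);
	/* A trailing 1 or 2 selects config1/config2; plain "config" means index 0. */
	*num = reg[strlen(reg) - 1];
	*num = (*num == '1' || *num == '2') ? *num - '0' : 0;

	return 0;
}

/* Encode e.g. "event=0x3c,umask=0x01" into config[0..2]. */
static int
encode_event(const char *desc, const struct term_format *fmts, size_t nfmts,
	     uint64_t config[3])
{
	char buf[128], term[32];
	char *token;
	uint64_t mask;
	size_t i;
	int num, val, ret;

	if (strlen(desc) >= sizeof(buf))
		return -1;
	strcpy(buf, desc);
	config[0] = config[1] = config[2] = 0;

	for (token = strtok(buf, ",\n"); token != NULL; token = strtok(NULL, ",\n")) {
		/* <term>=<value>; a bare <term> means value 1 */
		ret = sscanf(token, "%31[^=]=%i", term, &val);
		if (ret < 1)
			return -1;
		if (ret == 1)
			val = 1;

		for (i = 0; i < nfmts; i++)
			if (strcmp(fmts[i].term, term) == 0)
				break;
		if (i == nfmts || parse_format(fmts[i].spec, &num, &mask) != 0)
			return -1;

		assert(mask != 0);
		/* shift the value to the field's lowest bit, then clip it to the field */
		config[num] |= ((uint64_t)val << __builtin_ctzll(mask)) & mask;
	}

	return 0;
}
```

The resulting config[0], config[1] and config[2] correspond to the config,
config1 and config2 fields of struct perf_event_attr.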

v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  eal: add generic support for reading PMU events
  eal/arm: support reading ARM PMU events in runtime
  eal/x86: support reading Intel PMU events in runtime
  eal: add PMU support to tracing library

 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  47 +++
 app/test/test_trace_perf.c               |   4 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 lib/eal/arm/include/meson.build          |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
 lib/eal/arm/meson.build                  |   4 +
 lib/eal/arm/rte_pmu.c                    | 104 +++++
 lib/eal/common/eal_common_trace_points.c |   3 +
 lib/eal/common/meson.build               |   3 +
 lib/eal/common/pmu_private.h             |  41 ++
 lib/eal/common/rte_pmu.c                 | 504 +++++++++++++++++++++++
 lib/eal/include/meson.build              |   1 +
 lib/eal/include/rte_eal_trace.h          |  10 +
 lib/eal/include/rte_pmu.h                | 202 +++++++++
 lib/eal/linux/eal.c                      |   4 +
 lib/eal/version.map                      |   7 +
 lib/eal/x86/include/meson.build          |   1 +
 lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
 20 files changed, 1054 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

--
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-11  9:05           ` Morten Brørup
  2023-01-10 23:46         ` [PATCH v5 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
                           ` (6 subsequent siblings)
  7 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/meson.build                  |   1 +
 app/test/test_pmu.c                   |  41 +++
 doc/guides/prog_guide/profile_app.rst |   8 +
 lib/eal/common/meson.build            |   3 +
 lib/eal/common/pmu_private.h          |  41 +++
 lib/eal/common/rte_pmu.c              | 435 ++++++++++++++++++++++++++
 lib/eal/include/meson.build           |   1 +
 lib/eal/include/rte_pmu.h             | 199 ++++++++++++
 lib/eal/linux/eal.c                   |   4 +
 lib/eal/version.map                   |   6 +
 10 files changed, 739 insertions(+)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/eal/common/pmu_private.h
 create mode 100644 lib/eal/common/rte_pmu.c
 create mode 100644 lib/eal/include/rte_pmu.h

diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..93b3300309 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,7 @@ test_sources = files(
         'test_timer_racecond.c',
         'test_timer_secondary.c',
         'test_ticketlock.c',
+        'test_pmu.c',
         'test_trace.c',
         'test_trace_register.c',
         'test_trace_perf.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..9a90aaffdb
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	uint64_t val = 0;
+	int tries = 10;
+	int event = -1;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	if (val == 0)
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set
+of programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. However, in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..d6d05b56f3 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -38,6 +38,9 @@ sources += files(
         'rte_service.c',
         'rte_version.c',
 )
+if is_linux
+    sources += files('rte_pmu.c')
+endif
 if is_linux or is_windows
     sources += files('eal_common_dynmem.c')
 endif
diff --git a/lib/eal/common/pmu_private.h b/lib/eal/common/pmu_private.h
new file mode 100644
index 0000000000..cade4245e6
--- /dev/null
+++ b/lib/eal/common/pmu_private.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2022 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+/**
+ * Initialize PMU tracing internals.
+ */
+void
+eal_pmu_init(void);
+
+/**
+ * Cleanup PMU internals.
+ */
+void
+eal_pmu_fini(void);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
new file mode 100644
index 0000000000..67e8ffefb2
--- /dev/null
+++ b/lib/eal/common/rte_pmu.c
@@ -0,0 +1,435 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_eal_paging.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu rte_pmu;
+
+/*
+ * The following __rte_weak functions provide default no-ops. Architectures should override them
+ * if necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t config[3])
+{
+	RTE_SET_USED(config);
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
+}
+
+static int
+open_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret) {
+			RTE_LOG(ERR, EAL, "failed to get %s event config\n", event->name);
+			continue;
+		}
+
+		ret = do_perf_event_open(config, lcore_id, group->fds[0]);
+		if (ret == -1) {
+			if (errno == EOPNOTSUPP)
+				RTE_LOG(ERR, EAL, "64 bit counters not supported\n");
+
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, rte_mem_page_size(), PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], rte_mem_page_size());
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	unsigned int i;
+
+	if (group->fds == NULL)
+		return;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], rte_mem_page_size());
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	int ret;
+
+	if (rte_pmu.num_group_events == 0) {
+		RTE_LOG(DEBUG, EAL, "no matching PMU events\n");
+
+		return 0;
+	}
+
+	ret = open_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to open events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	ret = mmap_events(lcore_id);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to map events on lcore-worker-%d\n", lcore_id);
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to reset events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		RTE_LOG(ERR, EAL, "failed to enable events on lcore-worker-%d\n", lcore_id);
+
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL)
+			return -ENOMEM;
+	}
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = calloc(1, sizeof(*event));
+	if (!event)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (!event->name) {
+		free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	RTE_LOG(DEBUG, EAL, "%s event added at index %d\n", name, event->index);
+
+	return event->index;
+}
+
+void
+eal_pmu_init(void)
+{
+	int ret;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+
+	ret = scan_pmus();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to find core pmu\n");
+		goto out;
+	}
+
+	ret = pmu_arch_init();
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to setup arch for PMU\n");
+		goto out;
+	}
+
+	return;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+}
+
+void
+eal_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	unsigned int lcore_id;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free(event->name);
+		free(event);
+	}
+
+	RTE_LCORE_FOREACH(lcore_id)
+		cleanup_events(lcore_id);
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index cfcd40aaed..3bf830adee 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
         'rte_pci_dev_features.h',
         'rte_per_lcore.h',
         'rte_pflock.h',
+        'rte_pmu.h',
         'rte_random.h',
         'rte_reciprocal.h',
         'rte_seqcount.h',
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
new file mode 100644
index 0000000000..6968b35545
--- /dev/null
+++ b/lib/eal/include/rte_pmu.h
@@ -0,0 +1,199 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+#include <rte_compat.h>
+
+#ifdef RTE_EXEC_ENV_LINUX
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_lcore.h>
+#include <rte_log.h>
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 16
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	bool enabled; /**< true if group was enabled on particular lcore */
+};
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+};
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events for a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group;
+	int ret, lcore_id = rte_lcore_id();
+
+	group = &rte_pmu.group[lcore_id];
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#else /* !RTE_EXEC_ENV_LINUX */
+
+__rte_experimental
+static int __rte_unused
+rte_pmu_add_event(__rte_unused const char *name)
+{
+	return -1;
+}
+
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(__rte_unused unsigned int index)
+{
+	return 0;
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/eal/linux/eal.c b/lib/eal/linux/eal.c
index 8c118d0d9f..751a13b597 100644
--- a/lib/eal/linux/eal.c
+++ b/lib/eal/linux/eal.c
@@ -53,6 +53,7 @@
 #include "eal_options.h"
 #include "eal_vfio.h"
 #include "hotplug_mp.h"
+#include "pmu_private.h"
 
 #define MEMSIZE_IF_NO_HUGE_PAGE (64ULL * 1024ULL * 1024ULL)
 
@@ -1206,6 +1207,8 @@ rte_eal_init(int argc, char **argv)
 		return -1;
 	}
 
+	eal_pmu_init();
+
 	if (rte_eal_tailqs_init() < 0) {
 		rte_eal_init_alert("Cannot init tail queues for objects");
 		rte_errno = EFAULT;
@@ -1372,6 +1375,7 @@ rte_eal_cleanup(void)
 	eal_bus_cleanup();
 	rte_trace_save();
 	eal_trace_fini();
+	eal_pmu_fini();
 	/* after this point, any DPDK pointers will become dangling */
 	rte_eal_memory_detach();
 	eal_mp_dev_hotplug_cleanup();
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..1717b221b4 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,11 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	rte_pmu; # WINDOWS_NO_EXPORT
+	rte_pmu_add_event; # WINDOWS_NO_EXPORT
+	rte_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
@@ -483,4 +488,5 @@ INTERNAL {
 	rte_mem_map;
 	rte_mem_page_size;
 	rte_mem_unmap;
+	rte_pmu_enable_group; # WINDOWS_NO_EXPORT
 };
-- 
2.34.1
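
Editorial note on the patch above: get_term_format() and parse_event() turn a sysfs format entry such as "config:0-7" into a bit mask, then place each term value under that mask in one of the three config words. The block below is a minimal standalone sketch of that idea, not the code from the patch. All helper names (genmask_ull, field_prep, parse_format, config_index, encode) are hypothetical, and the format strings used with them are made-up examples rather than entries read from a real PMU.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Mask with bits low..high set, both inclusive (0 <= low <= high <= 63). */
static uint64_t
genmask_ull(unsigned int high, unsigned int low)
{
	return (~0ULL << low) & (~0ULL >> (63 - high));
}

/* Shift val up to the lowest set bit of mask and drop bits outside mask. */
static uint64_t
field_prep(uint64_t mask, uint64_t val)
{
	unsigned int shift = 0;

	if (mask == 0)
		return 0;

	while (((mask >> shift) & 1) == 0)
		shift++;

	return (val << shift) & mask;
}

/*
 * Parse "<name>:<low>[-<high>]". A trailing digit in <name> selects the
 * config word (config, config1, config2); no digit implies word 0.
 * Returns 0 on success, -1 on malformed input.
 */
static int
parse_format(const char *text, int *num, uint64_t *mask)
{
	char name[32];
	int low, high, ret, last;

	ret = sscanf(text, "%31[^:]:%d-%d", name, &low, &high);
	if (ret < 2)
		return -1;
	if (ret == 2)
		high = low; /* single bit field */
	if (low < 0 || high > 63 || low > high)
		return -1;

	last = (unsigned char)name[strlen(name) - 1];
	*num = isdigit(last) ? last - '0' : 0;
	if (*num > 2)
		return -1;

	*mask = genmask_ull(high, low);

	return 0;
}

/* Config word index selected by a format string, or -1 on error. */
static int
config_index(const char *format)
{
	uint64_t mask;
	int num;

	return parse_format(format, &num, &mask) ? -1 : num;
}

/* Value positioned according to a format string, or 0 on error. */
static uint64_t
encode(const char *format, uint64_t val)
{
	uint64_t mask;
	int num;

	if (parse_format(format, &num, &mask))
		return 0;

	return field_prep(mask, val);
}
```

Under these assumptions, a term whose format is "config:8-15" and whose value is 0x11 would contribute 0x1100 to config word 0, which is the kind of value the patch ORs into config[num].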


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v5 2/4] eal/arm: support reading ARM PMU events in runtime
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 3/4] eal/x86: support reading Intel " Tomasz Duszynski
                           ` (5 subsequent siblings)
  7 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev, Ruifeng Wang
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |   4 ++
 lib/eal/arm/include/meson.build   |   1 +
 lib/eal/arm/include/rte_pmu_pmc.h |  39 +++++++++++
 lib/eal/arm/meson.build           |   4 ++
 lib/eal/arm/rte_pmu.c             | 104 ++++++++++++++++++++++++++++++
 lib/eal/include/rte_pmu.h         |   3 +
 6 files changed, 155 insertions(+)
 create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
 create mode 100644 lib/eal/arm/rte_pmu.c

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 9a90aaffdb..e19819c31a 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -13,6 +13,10 @@ test_pmu_read(void)
 	int tries = 10;
 	int event = -1;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/eal/arm/include/meson.build b/lib/eal/arm/include/meson.build
index 657bf58569..ab13b0220a 100644
--- a/lib/eal/arm/include/meson.build
+++ b/lib/eal/arm/include/meson.build
@@ -20,6 +20,7 @@ arch_headers = files(
         'rte_pause_32.h',
         'rte_pause_64.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch_32.h',
         'rte_prefetch_64.h',
diff --git a/lib/eal/arm/include/rte_pmu_pmc.h b/lib/eal/arm/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..729f3d4dfe
--- /dev/null
+++ b/lib/eal/arm/include/rte_pmu_pmc.h
@@ -0,0 +1,39 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_ARM_H_
+#define _RTE_PMU_PMC_ARM_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_ARM_H_ */
diff --git a/lib/eal/arm/meson.build b/lib/eal/arm/meson.build
index dca1106aae..0c5575b197 100644
--- a/lib/eal/arm/meson.build
+++ b/lib/eal/arm/meson.build
@@ -9,3 +9,7 @@ sources += files(
         'rte_hypervisor.c',
         'rte_power_intrinsics.c',
 )
+
+if is_linux
+    sources += files('rte_pmu.c')
+endif
diff --git a/lib/eal/arm/rte_pmu.c b/lib/eal/arm/rte_pmu.c
new file mode 100644
index 0000000000..4cbbe6f31d
--- /dev/null
+++ b/lib/eal/arm/rte_pmu.c
@@ -0,0 +1,104 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_pmu.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+	buf[ret] = '\0';
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to read %s\n", PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	ret = write_attr_int(PERF_USER_ACCESS_PATH, 1);
+	if (ret) {
+		RTE_LOG(ERR, EAL, "failed to enable perf user access\n"
+			"try enabling manually 'echo 1 > %s'\n",
+			PERF_USER_ACCESS_PATH);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 6968b35545..9185d05ca3 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,6 +20,9 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
+#if defined(RTE_ARCH_ARM64)
+#include <rte_pmu_pmc.h>
+#endif
 
 /**
  * @file
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v5 3/4] eal/x86: support reading Intel PMU events in runtime
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 2/4] eal/arm: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-10 23:46         ` [PATCH v5 4/4] eal: add PMU support to tracing library Tomasz Duszynski
                           ` (4 subsequent siblings)
  7 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev, Bruce Richardson, Konstantin Ananyev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	Tomasz Duszynski

Add support for reading Intel PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c               |  2 ++
 lib/eal/include/rte_pmu.h         |  2 +-
 lib/eal/x86/include/meson.build   |  1 +
 lib/eal/x86/include/rte_pmu_pmc.h | 33 +++++++++++++++++++++++++++++++
 4 files changed, 37 insertions(+), 1 deletion(-)
 create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index e19819c31a..79f83a1925 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/eal/include/rte_pmu.h b/lib/eal/include/rte_pmu.h
index 9185d05ca3..0345746940 100644
--- a/lib/eal/include/rte_pmu.h
+++ b/lib/eal/include/rte_pmu.h
@@ -20,7 +20,7 @@ extern "C" {
 #include <rte_branch_prediction.h>
 #include <rte_lcore.h>
 #include <rte_log.h>
-#if defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_ARM64) || defined(RTE_ARCH_X86_64)
 #include <rte_pmu_pmc.h>
 #endif
 
diff --git a/lib/eal/x86/include/meson.build b/lib/eal/x86/include/meson.build
index 52d2f8e969..03d286ed25 100644
--- a/lib/eal/x86/include/meson.build
+++ b/lib/eal/x86/include/meson.build
@@ -9,6 +9,7 @@ arch_headers = files(
         'rte_io.h',
         'rte_memcpy.h',
         'rte_pause.h',
+        'rte_pmu_pmc.h',
         'rte_power_intrinsics.h',
         'rte_prefetch.h',
         'rte_rtm.h',
diff --git a/lib/eal/x86/include/rte_pmu_pmc.h b/lib/eal/x86/include/rte_pmu_pmc.h
new file mode 100644
index 0000000000..f241b80bc9
--- /dev/null
+++ b/lib/eal/x86/include/rte_pmu_pmc.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+
+#ifndef _RTE_PMU_PMC_X86_H_
+#define _RTE_PMU_PMC_X86_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_PMC_X86_H_ */
-- 
2.34.1
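
Editorial note on the patch above: rdpmc returns the counter in two 32-bit halves, and rte_pmu_read_userpage() from patch 1/4 then sign-extends the raw value from pmc_width bits before adding it to the offset. The block below is a minimal standalone sketch of those two steps in portable C, without inline assembly. The helper names combine_halves and sign_extend_counter are hypothetical, and the sketch assumes the compiler implements right shift of a negative signed value as an arithmetic shift, which GCC and Clang do.

```c
#include <stdint.h>

/* Join the two 32-bit halves of a counter into one 64-bit value. */
static uint64_t
combine_halves(uint32_t low, uint32_t high)
{
	return (uint64_t)low | ((uint64_t)high << 32);
}

/*
 * Interpret the low 'width' bits of raw as a signed quantity.
 * Bits above 'width' are discarded by the left shift; the signed right
 * shift then replicates bit (width - 1) into the upper bits.
 */
static int64_t
sign_extend_counter(uint64_t raw, unsigned int width)
{
	unsigned int shift;

	/* Nothing to extend, and shifting by 64 would be undefined. */
	if (width == 0 || width >= 64)
		return (int64_t)raw;

	shift = 64 - width;

	return (int64_t)(raw << shift) >> shift;
}
```

The left shift is done on the unsigned value on purpose, since shifting a negative signed value left is undefined in C.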


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v5 4/4] eal: add PMU support to tracing library
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (2 preceding siblings ...)
  2023-01-10 23:46         ` [PATCH v5 3/4] eal/x86: support reading Intel " Tomasz Duszynski
@ 2023-01-10 23:46         ` Tomasz Duszynski
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
                           ` (3 subsequent siblings)
  7 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-10 23:46 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori
  Cc: thomas, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin, Tomasz Duszynski

In order to profile an application, one needs to store a significant number
of samples somewhere for later analysis. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               |  4 ++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++
 lib/eal/common/eal_common_trace_points.c |  3 +
 lib/eal/common/rte_pmu.c                 | 73 +++++++++++++++++++++++-
 lib/eal/include/rte_eal_trace.h          | 10 ++++
 lib/eal/version.map                      |  1 +
 7 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..4851b6852f 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,8 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +124,7 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+WORKER_DEFINE(READ_PMU)
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +177,7 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..10d5b99084 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..de918ca618 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,6 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
diff --git a/lib/eal/common/rte_pmu.c b/lib/eal/common/rte_pmu.c
index 67e8ffefb2..fd0df3b756 100644
--- a/lib/eal/common/rte_pmu.c
+++ b/lib/eal/common/rte_pmu.c
@@ -19,6 +19,7 @@
 #include <rte_tailq.h>
 
 #include "pmu_private.h"
+#include "eal_trace.h"
 
 #define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
 
@@ -361,6 +362,12 @@ rte_pmu_add_event(const char *name)
 	struct rte_pmu_event *event;
 	char path[PATH_MAX];
 
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
 	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
 	if (access(path, R_OK))
 		return -ENODEV;
@@ -372,11 +379,11 @@ rte_pmu_add_event(const char *name)
 	}
 
 	event = calloc(1, sizeof(*event));
-	if (!event)
+	if (event == NULL)
 		return -ENOMEM;
 
 	event->name = strdup(name);
-	if (!event->name) {
+	if (event->name == NULL) {
 		free(event);
 
 		return -ENOMEM;
@@ -390,11 +397,70 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static void
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			RTE_LOG(ERR, EAL, "failed to add %s event\n", token);
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+}
+
+static void
+add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED))
+		return;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		add_events(buf);
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+}
+
 void
 eal_pmu_init(void)
 {
+	struct trace_arg *arg;
+	struct trace *trace;
 	int ret;
 
+	trace = trace_obj_get();
+	if (trace == NULL)
+		RTE_LOG(WARNING, EAL, "tracing not initialized\n");
+
 	TAILQ_INIT(&rte_pmu.event_list);
 
 	ret = scan_pmus();
@@ -409,6 +475,9 @@ eal_pmu_init(void)
 		goto out;
 	}
 
+	STAILQ_FOREACH(arg, &trace->args, next)
+		add_events_by_pattern(arg->val);
+
 	return;
 out:
 	free(rte_pmu.name);
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..9b35af75d5 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,7 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#include <rte_pmu.h>
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +280,15 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+/* PMU */
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 1717b221b4..d87a867e5b 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -442,6 +442,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_pmu; # WINDOWS_NO_EXPORT
 	rte_pmu_add_event; # WINDOWS_NO_EXPORT
 	rte_pmu_read; # WINDOWS_NO_EXPORT
-- 
2.34.1
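
For readers following the parsing logic in add_events_by_pattern(), here is a
minimal standalone sketch of the same idea using only POSIX regex. It is not
part of the patch: for_each_event_list(), event_list_cb, struct collected,
remember_last() and MAX_LIST_LEN are hypothetical names chosen for
illustration. The sketch terminates the buffer right after the copied bytes
and always leaves room for the NUL byte.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_LIST_LEN 256 /* assumption: upper bound for one event list */

/* Hypothetical callback: receives one comma-separated event list. */
typedef void (*event_list_cb)(const char *list, void *ctx);

/* Hypothetical collector used only to make the behaviour observable. */
struct collected {
	char last[MAX_LIST_LEN];
	int calls;
};

static void
remember_last(const char *list, void *ctx)
{
	struct collected *c = ctx;

	strncpy(c->last, list, sizeof(c->last) - 1);
	c->last[sizeof(c->last) - 1] = '\0';
	c->calls++;
}

/*
 * Call cb for every e=ev1[,ev2,..] occurrence found in pattern.
 * Returns the number of occurrences, or -1 if the regex does not compile.
 */
static int
for_each_event_list(const char *pattern, event_list_cb cb, void *ctx)
{
	char buf[MAX_LIST_LEN];
	regmatch_t m;
	regex_t reg;
	size_t len;
	int count = 0;

	if (regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED) != 0)
		return -1;

	while (regexec(&reg, pattern, 1, &m, 0) == 0) {
		/* length of the match without the leading "e=" */
		len = (size_t)(m.rm_eo - m.rm_so) - 2;
		if (len > sizeof(buf) - 1)
			len = sizeof(buf) - 1; /* truncate, keep room for NUL */

		memcpy(buf, pattern + m.rm_so + 2, len);
		buf[len] = '\0';
		cb(buf, ctx);
		count++;

		/* continue searching after the current match */
		pattern += m.rm_eo;
	}

	regfree(&reg);
	return count;
}
```

A caller would pass each trace argument value to for_each_event_list() and
split the received list on commas, as add_events() does in the patch.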


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (3 preceding siblings ...)
  2023-01-10 23:46         ` [PATCH v5 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-01-11  0:32         ` Tyler Retzlaff
  2023-01-11  9:31           ` Morten Brørup
  2023-01-11  9:39           ` [EXT] " Tomasz Duszynski
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
                           ` (2 subsequent siblings)
  7 siblings, 2 replies; 205+ messages in thread
From: Tyler Retzlaff @ 2023-01-11  0:32 UTC (permalink / raw)
  To: Tomasz Duszynski, bruce.richardson, mb
  Cc: dev, thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin

hi,

don't interpret this as an objection to the functionality but this looks
like a clear example of something that doesn't belong in the EAL. has
there been a discussion as to whether or not this should be in a
separate library?

a basic test is whether or not an implementation exists or can be
reasonably provided for all platforms and that isn't strictly evident
here. red flag is to see yet more code being added conditionally
compiled for a single platform.

Morten, Bruce comments?

thanks

On Wed, Jan 11, 2023 at 12:46:38AM +0100, Tomasz Duszynski wrote:
> This series adds self-monitoring support, i.e. it allows configuring and
> reading performance measurement unit (PMU) counters at runtime without
> using the perf utility. This has certain advantages when an application
> runs on isolated cores with the nohz_full kernel parameter.
> 
> Events can be read directly using rte_pmu_read() or using the dedicated
> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> stored inside a CTF file.
> 
> By design, all enabled events are grouped together and the same group
> is attached to lcores that use the self-monitoring functionality.
> 
> Events are enabled by name; names need to be read from the standard
> location under sysfs, i.e.
> 
> /sys/bus/event_source/devices/PMU/events
> 
> where PMU is a core PMU, i.e. one measuring CPU events. As of today
> raw events are not supported.
> 
> v5:
> - address review comments
> - fix sign extension while reading pmu on x86
> - fix regex mentioned in doc
> - various minor changes/improvements here and there
> v4:
> - fix freeing mem detected by debug_autotest
> v3:
> - fix shared build
> v2:
> - fix problems reported by test build infra
> 
> Tomasz Duszynski (4):
>   eal: add generic support for reading PMU events
>   eal/arm: support reading ARM PMU events in runtime
>   eal/x86: support reading Intel PMU events in runtime
>   eal: add PMU support to tracing library
> 
>  app/test/meson.build                     |   1 +
>  app/test/test_pmu.c                      |  47 +++
>  app/test/test_trace_perf.c               |   4 +
>  doc/guides/prog_guide/profile_app.rst    |  13 +
>  doc/guides/prog_guide/trace_lib.rst      |  32 ++
>  lib/eal/arm/include/meson.build          |   1 +
>  lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
>  lib/eal/arm/meson.build                  |   4 +
>  lib/eal/arm/rte_pmu.c                    | 104 +++++
>  lib/eal/common/eal_common_trace_points.c |   3 +
>  lib/eal/common/meson.build               |   3 +
>  lib/eal/common/pmu_private.h             |  41 ++
>  lib/eal/common/rte_pmu.c                 | 504 +++++++++++++++++++++++
>  lib/eal/include/meson.build              |   1 +
>  lib/eal/include/rte_eal_trace.h          |  10 +
>  lib/eal/include/rte_pmu.h                | 202 +++++++++
>  lib/eal/linux/eal.c                      |   4 +
>  lib/eal/version.map                      |   7 +
>  lib/eal/x86/include/meson.build          |   1 +
>  lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
>  20 files changed, 1054 insertions(+)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/eal/arm/include/rte_pmu_pmc.h
>  create mode 100644 lib/eal/arm/rte_pmu.c
>  create mode 100644 lib/eal/common/pmu_private.h
>  create mode 100644 lib/eal/common/rte_pmu.c
>  create mode 100644 lib/eal/include/rte_pmu.h
>  create mode 100644 lib/eal/x86/include/rte_pmu_pmc.h
> 
> --
> 2.34.1

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-10 23:46         ` [PATCH v5 1/4] eal: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-11  9:05           ` Morten Brørup
  2023-01-11 16:20             ` Tomasz Duszynski
  0 siblings, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2023-01-11  9:05 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, jerinj, Ruifeng.Wang, mattias.ronnblom, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 11 January 2023 00.47
> 
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---

[...]

> +static int
> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int
> group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, 0,
> rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
> +}

If SYS_perf_event_open() must be called from the worker thread itself, then lcore_id must not be passed as a parameter to do_perf_event_open(). Otherwise, I would expect to be able to call do_perf_event_open() from the main thread and pass any lcore_id of a worker thread.
This comment applies to all functions that must be called from the worker thread itself. It also applies to the functions that call such functions.

[...]

> +/**
> + * A structure describing a group of events.
> + */
> +struct rte_pmu_event_group {
> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
> /**< array of user pages */
> +	bool enabled; /**< true if group was enabled on particular lcore
> */
> +};
> +
> +/**
> + * A structure describing an event.
> + */
> +struct rte_pmu_event {
> +	char *name; /** name of an event */
> +	unsigned int index; /** event index into fds/mmap_pages */
> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> +};

Move the "enabled" field up, making it the first field in this structure. This might reduce the number of instructions required to check (!group->enabled) in rte_pmu_read().

Also, each instance of the structure is used individually per lcore, so the structure should be cache line aligned to avoid unnecessarily crossing cache lines.

I.e.:

struct rte_pmu_event_group {
	bool enabled; /**< true if group was enabled on particular lcore */
	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
} __rte_cache_aligned;
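
As a rough illustration of what the suggested layout buys, the sketch below
uses plain C11 alignas as a stand-in for __rte_cache_aligned. The names
struct event_group and CACHE_LINE_SIZE, the 64-byte line size, and the value
of MAX_NUM_GROUP_EVENTS are assumptions made for this example only.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64      /* assumption: stands in for RTE_CACHE_LINE_SIZE */
#define MAX_NUM_GROUP_EVENTS 8  /* assumption: value picked for the example */

struct event_group {
	/* aligning the first member raises the alignment of the whole struct */
	alignas(CACHE_LINE_SIZE) bool enabled;
	int fds[MAX_NUM_GROUP_EVENTS];
	void *mmap_pages[MAX_NUM_GROUP_EVENTS];
};

/* the flag checked on every read sits at offset 0 of the cache line */
_Static_assert(offsetof(struct event_group, enabled) == 0,
	       "enabled must be the first field");
/* consecutive per-lcore array elements never share a cache line */
_Static_assert(sizeof(struct event_group) % CACHE_LINE_SIZE == 0,
	       "size must be a multiple of the cache line size");
```

With this layout, each element of a per-lcore array of such structures starts
on its own cache line, so two lcores do not touch the same line when they
check their own enabled flag.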

> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	char *name; /** name of core PMU listed under
> /sys/bus/event_source/devices */
> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> event group data */
> +	unsigned int num_group_events; /**< number of events in a group
> */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> events */
> +};
> +
> +/** Pointer to the PMU state container */
> +extern struct rte_pmu rte_pmu;

Just "The PMU state container". It is not a pointer anymore. :-)

[...]

> +/**
> + * @internal
> + *
> + * Read PMU counter.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_internal
> +static __rte_always_inline uint64_t
> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +	uint64_t width, offset;
> +	uint32_t seq, index;
> +	int64_t pmc;
> +
> +	for (;;) {
> +		seq = pc->lock;
> +		rte_compiler_barrier();
> +		index = pc->index;
> +		offset = pc->offset;
> +		width = pc->pmc_width;
> +

Please add a comment here about the special meaning of index == 0.

> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +			offset += pmc;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(pc->lock == seq))
> +			return offset;
> +	}
> +
> +	return 0;
> +}
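
To make the read loop easier to follow, here is a simplified standalone sketch
of the two techniques it combines: a sequence-counter retry loop and sign
extension of a counter narrower than 64 bits. struct fake_user_page,
read_counter(), sign_extend() and fake_hw() are hypothetical stand-ins, not
the kernel or DPDK definitions, and the cap_user_rdpmc check is left out. As
far as I understand the perf mmap page convention, index == 0 means no
hardware counter can be read from user space at that moment, so only the
offset is returned. The sketch shifts left on an unsigned value and relies on
arithmetic right shift of signed values, which mainstream compilers provide.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the mmapped user page. */
struct fake_user_page {
	uint32_t lock;      /* sequence counter, bumped by the writer */
	uint32_t index;     /* 0: nothing to read from hardware; else counter id + 1 */
	int64_t offset;     /* value accumulated by the writer */
	uint16_t pmc_width; /* hardware counter width in bits, 1..64 */
};

/* Sign-extend the low `width` bits of raw (width must be in 1..64). */
static inline int64_t
sign_extend(uint64_t raw, unsigned int width)
{
	unsigned int shift = 64 - width;

	return (int64_t)(raw << shift) >> shift;
}

/* Hypothetical hardware read: a 48-bit counter holding all ones. */
static uint64_t
fake_hw(unsigned int counter)
{
	(void)counter;
	return 0xFFFFFFFFFFFFULL;
}

static uint64_t
read_counter(const volatile struct fake_user_page *pc,
	     uint64_t (*read_hw)(unsigned int))
{
	uint32_t seq, index;
	int64_t offset;

	for (;;) {
		seq = pc->lock;
		atomic_signal_fence(memory_order_seq_cst); /* compiler barrier */
		index = pc->index;
		offset = pc->offset;

		/* index == 0: skip the hardware read and return offset only */
		if (index != 0)
			offset += sign_extend(read_hw(index - 1), pc->pmc_width);

		atomic_signal_fence(memory_order_seq_cst); /* compiler barrier */

		/* retry if the writer updated the page in the meantime */
		if (pc->lock == seq)
			return (uint64_t)offset;
	}
}
```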

[...]

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of
> support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(unsigned int index)
> +{
> +	struct rte_pmu_event_group *group;
> +	int ret, lcore_id = rte_lcore_id();
> +
> +	group = &rte_pmu.group[lcore_id];
> +	if (unlikely(!group->enabled)) {
> +		ret = rte_pmu_enable_group(lcore_id);
> +		if (ret)
> +			return 0;
> +
> +		group->enabled = true;

group->enabled should be set inside rte_pmu_enable_group(), not here.

> +	}
> +
> +	if (unlikely(index >= rte_pmu.num_group_events))
> +		return 0;
> +
> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
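
A minimal sketch of the arrangement suggested above, where the enable function
owns the flag and the read path only tests it. All names here (struct group,
enable_group(), read_event(), the enable_calls counter and the placeholder
value) are hypothetical and exist only to show the control flow.

```c
#include <stdbool.h>
#include <stdint.h>

struct group {
	bool enabled;
	int enable_calls; /* only here to make the behaviour observable */
	uint64_t value;   /* placeholder for the counter value */
};

static int
enable_group(struct group *g)
{
	g->enable_calls++;
	/* real code would open and mmap the events here, returning <0 on failure */
	g->value = 42;
	/* the flag is set where the state actually changes */
	g->enabled = true;
	return 0;
}

static uint64_t
read_event(struct group *g)
{
	/* lazy enable on first use; the caller never touches the flag */
	if (!g->enabled && enable_group(g) != 0)
		return 0;

	return g->value;
}
```

Keeping the assignment inside enable_group() means any other caller of the
enable function gets a consistent flag without having to remember to set it.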



^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
@ 2023-01-11  9:31           ` Morten Brørup
  2023-01-11 14:24             ` Tomasz Duszynski
  2023-01-11  9:39           ` [EXT] " Tomasz Duszynski
  1 sibling, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2023-01-11  9:31 UTC (permalink / raw)
  To: Tyler Retzlaff, Tomasz Duszynski, bruce.richardson
  Cc: dev, thomas, jerinj, Ruifeng.Wang, mattias.ronnblom, zhoumin

> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Wednesday, 11 January 2023 01.32
> 
> On Wed, Jan 11, 2023 at 12:46:38AM +0100, Tomasz Duszynski wrote:
> > [...]

[Moved Tyler's post down here.]

> 
> hi,
> 
> don't interpret this as an objection to the functionality but this
> looks
> like a clear example of something that doesn't belong in the EAL. has
> there been a discussion as to whether or not this should be in a
> separate library?

IIRC, there has been no such discussion.

Although I agree that this doesn't belong in EAL, I would point to the trace library as a reference for allowing it into the EAL.

For the record, I also oppose the trace library being part of the EAL.

On the other hand, it would be interesting to determine whether it is *impossible* to add this functionality as a normal DPDK library, i.e. outside of the EAL, or whether there is an unavoidable tie-in to the EAL.

@Tomasz, if this is impossible, please describe the unavoidable tie-in to the EAL. No need for a long report, just a few words. You (and this functionality) shouldn't suffer from our long term ambition to move stuff out of the EAL.

> 
> a basic test is whether or not an implementation exists or can be
> reasonably provided for all platforms and that isn't strictly evident
> here. red flag is to see yet more code being added conditionally
> compiled for a single platform.

Another basic test: Can DPDK applications run without it? If they can, an Environment Abstraction Layer does not need to have it, and thus it does not need to be part of the EAL.

> 
> Morten, Bruce comments?
> 
> thanks

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
  2023-01-11  9:31           ` Morten Brørup
@ 2023-01-11  9:39           ` Tomasz Duszynski
  2023-01-11 21:05             ` Tyler Retzlaff
  1 sibling, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-11  9:39 UTC (permalink / raw)
  To: Tyler Retzlaff, bruce.richardson, mb
  Cc: dev, thomas, Jerin Jacob Kollanukkaran, mb, Ruifeng.Wang,
	mattias.ronnblom, zhoumin

Hi Tyler,

>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Wednesday, January 11, 2023 1:32 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; bruce.richardson@intel.com; mb@smartsharesystems.com
>Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>mb@smartsharesystems.com; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
>
>External Email
>
>----------------------------------------------------------------------
>hi,
>
>don't interpret this as an objection to the functionality but this looks like a clear example of
>something that doesn't belong in the EAL. has there been a discussion as to whether or not this
>should be in a separate library?

No, I don't recall anybody raising concerns about the code placement. The rationale for
making this part of EAL was that tracing itself is part of EAL, and since this was meant
to be an extension to tracing, the code placement followed naturally.

During development the idea evolved a bit: what was initially supposed to be just
another tracepoint became a generic API to read the PMU, plus a tracepoint built on top
of it. This means both can be used independently.

That said, since this code has both platform-agnostic and platform-specific parts, it can be organized as either:
1. a library + EAL platform code
2. everything under EAL

Either approach seems legit. Thoughts?

>
>a basic test is whether or not an implementation exists or can be reasonably provided for all
>platforms and that isn't strictly evident here. red flag is to see yet more code being added
>conditionally compiled for a single platform.

Even libs are not entirely pristine and have platform-specific ifdefs lurking, so I'm not sure where
this red flag is coming from.

>
>Morten, Bruce comments?
>
>thanks
>
>On Wed, Jan 11, 2023 at 12:46:38AM +0100, Tomasz Duszynski wrote:
>> [...]

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  9:31           ` Morten Brørup
@ 2023-01-11 14:24             ` Tomasz Duszynski
  2023-01-11 14:32               ` Bruce Richardson
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-11 14:24 UTC (permalink / raw)
  To: Morten Brørup, Tyler Retzlaff, bruce.richardson
  Cc: dev, thomas, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin



>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, January 11, 2023 10:31 AM
>To: Tyler Retzlaff <roretzla@linux.microsoft.com>; Tomasz Duszynski <tduszynski@marvell.com>;
>bruce.richardson@intel.com
>Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>Subject: [EXT] RE: [PATCH v5 0/4] add support for self monitoring
>
>External Email
>
>----------------------------------------------------------------------
>> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
>> Sent: Wednesday, 11 January 2023 01.32
>>
>> On Wed, Jan 11, 2023 at 12:46:38AM +0100, Tomasz Duszynski wrote:
>> > [...]
>
>[Moved Tyler's post down here.]
>
>>
>> hi,
>>
>> don't interpret this as an objection to the functionality but this
>> looks like a clear example of something that doesn't belong in the
>> EAL. has there been a discussion as to whether or not this should be
>> in a separate library?
>
>IIRC, there has been no such discussion.
>
>Although I agree that this doesn't belong in EAL, I would point to the trace library as a reference
>for allowing it into the EAL.
>
>For the record, I also oppose the trace library being part of the EAL.
>
>On the other hand, it would be interesting to determine whether it is *impossible* to add this
>functionality as a normal DPDK library, i.e. outside of the EAL, or whether there is an
>unavoidable tie-in to the EAL.
>
>@Tomasz, if this is impossible, please describe the unavoidable tie-in to the EAL. No need for a
>long report, just a few words. You (and this functionality) shouldn't suffer from our long term
>ambition to move stuff out of the EAL.
>

You can read about the rationale here: https://lore.kernel.org/dpdk-dev/DM4PR18MB436872EBC5922084C5DAFC1DD2FC9@DM4PR18MB4368.namprd18.prod.outlook.com/#t

As for the NO-NO, there isn't any in fact. There are some tradeoffs though.

For example, it seems EAL cannot depend on other libs, so if someone needs to
fine-tune some part of EAL for whatever reason, then the relevant part needs to be
modified each and every time, i.e. specific includes and tracepoints need to be added each time.

On the other hand, if this is coupled with EAL then adding tracepoints to some parts
will be easier. Or they can just be added at specific points and live there.

No strong opinions besides that. I'd like to know what others think.

>>
>> a basic test is whether or not an implementation exists or can be
>> reasonably provided for all platforms and that isn't strictly evident
>> here. red flag is to see yet more code being added conditionally
>> compiled for a single platform.
>
>Another basic test: Can DPDK applications run without it? If they can, an Environment Abstraction
>Layer does not need to have it, and thus it does not need to be part of the EAL.
>
>>
>> Morten, Bruce comments?
>>
>> thanks

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11 14:24             ` Tomasz Duszynski
@ 2023-01-11 14:32               ` Bruce Richardson
  0 siblings, 0 replies; 205+ messages in thread
From: Bruce Richardson @ 2023-01-11 14:32 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: Morten Brørup, Tyler Retzlaff, dev, thomas,
	Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin

On Wed, Jan 11, 2023 at 02:24:28PM +0000, Tomasz Duszynski wrote:
> 
> 
> >-----Original Message-----
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Wednesday, January 11, 2023 10:31 AM
> >To: Tyler Retzlaff <roretzla@linux.microsoft.com>; Tomasz Duszynski <tduszynski@marvell.com>;
> >bruce.richardson@intel.com
> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> >Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
> >Subject: [EXT] RE: [PATCH v5 0/4] add support for self monitoring
> >
> >External Email
> >
> >----------------------------------------------------------------------
> >> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> >> Sent: Wednesday, 11 January 2023 01.32
> >>
> >> On Wed, Jan 11, 2023 at 12:46:38AM +0100, Tomasz Duszynski wrote:
> >> > This series adds self monitoring support i.e allows to configure and
> >> > read performance measurement unit (PMU) counters in runtime without
> >> > using perf utility. This has certain adventages when application
> >> > runs
> >> on
> >> > isolated cores with nohz_full kernel parameter.
> >> >
> >> > Events can be read directly using rte_pmu_read() or using dedicated
> >> > tracepoint rte_eal_trace_pmu_read(). The latter will cause events to
> >> be
> >> > stored inside CTF file.
> >> >
> >> > By design, all enabled events are grouped together and the same
> >> > group is attached to lcores that use self monitoring funtionality.
> >> >
> >> > Events are enabled by names, which need to be read from standard
> >> > location under sysfs i.e
> >> >
> >> > /sys/bus/event_source/devices/PMU/events
> >> >
> >> > where PMU is a core pmu i.e one measuring cpu events. As of today
> >> > raw events are not supported.
> >> >
> >> > v5:
> >> > - address review comments
> >> > - fix sign extension while reading pmu on x86
> >> > - fix regex mentioned in doc
> >> > - various minor changes/improvements here and there
> >> > v4:
> >> > - fix freeing mem detected by debug_autotest
> >> > v3:
> >> > - fix shared build
> >> > v2:
> >> > - fix problems reported by test build infra
> >> >
> >> > Tomasz Duszynski (4):
> >> >   eal: add generic support for reading PMU events
> >> >   eal/arm: support reading ARM PMU events in runtime
> >> >   eal/x86: support reading Intel PMU events in runtime
> >> >   eal: add PMU support to tracing library
> >> >
> >> >  app/test/meson.build                     |   1 +
> >> >  app/test/test_pmu.c                      |  47 +++
> >> >  app/test/test_trace_perf.c               |   4 +
> >> >  doc/guides/prog_guide/profile_app.rst    |  13 +
> >> >  doc/guides/prog_guide/trace_lib.rst      |  32 ++
> >> >  lib/eal/arm/include/meson.build          |   1 +
> >> >  lib/eal/arm/include/rte_pmu_pmc.h        |  39 ++
> >> >  lib/eal/arm/meson.build                  |   4 +
> >> >  lib/eal/arm/rte_pmu.c                    | 104 +++++
> >> >  lib/eal/common/eal_common_trace_points.c |   3 +
> >> >  lib/eal/common/meson.build               |   3 +
> >> >  lib/eal/common/pmu_private.h             |  41 ++
> >> >  lib/eal/common/rte_pmu.c                 | 504
> >> +++++++++++++++++++++++
> >> >  lib/eal/include/meson.build              |   1 +
> >> >  lib/eal/include/rte_eal_trace.h          |  10 +
> >> >  lib/eal/include/rte_pmu.h                | 202 +++++++++
> >> >  lib/eal/linux/eal.c                      |   4 +
> >> >  lib/eal/version.map                      |   7 +
> >> >  lib/eal/x86/include/meson.build          |   1 +
> >> >  lib/eal/x86/include/rte_pmu_pmc.h        |  33 ++
> >> >  20 files changed, 1054 insertions(+)  create mode 100644
> >> > app/test/test_pmu.c  create mode 100644
> >> > lib/eal/arm/include/rte_pmu_pmc.h  create mode 100644
> >> > lib/eal/arm/rte_pmu.c  create mode 100644
> >> > lib/eal/common/pmu_private.h  create mode 100644
> >> > lib/eal/common/rte_pmu.c  create mode 100644
> >> > lib/eal/include/rte_pmu.h  create mode 100644
> >> > lib/eal/x86/include/rte_pmu_pmc.h
> >> >
> >> > --
> >> > 2.34.1
> >
> >[Moved Tyler's post down here.]
> >
> >>
> >> hi,
> >>
> >> don't interpret this as an objection to the functionality but this
> >> looks like a clear example of something that doesn't belong in the
> >> EAL. has there been a discussion as to whether or not this should be
> >> in a separate library?
> >
> >IIRC, there has been no such discussion.
> >
> >Although I agree that this doesn't belong in EAL, I would point to the trace library as a reference
> >for allowing it into the EAL.
> >
> >For the record, I also oppose the trace library being part of the EAL.
> >
> >On the other hand, it would be interesting to determine whether it is *impossible* to add this
> >functionality as a normal DPDK library, i.e. outside of the EAL, or whether there is an
> >unavoidable tie-in to the EAL.
> >
> >@Tomasz, if this is impossible, please describe the unavoidable tie-in to the EAL. No need for a
> >long report, just a few words. You (and this functionality) shouldn't suffer from our long term
> >ambition to move stuff out of the EAL.
> >
> 
> You can read about rationale here https://lore.kernel.org/dpdk-dev/DM4PR18MB436872EBC5922084C5DAFC1DD2FC9@DM4PR18MB4368.namprd18.prod.outlook.com/#t
> 
> As for the NO-NO, there isn't any in fact. There are some tradeoffs though.
> 
> For example, it seems EAL cannot depend on other libs, so if someone needs to
> fine-tune some part of EAL for whatever reason, then the relevant part needs to
> be modified each and every time, i.e. specific includes and tracepoints need to be added each time.
>
Well, EAL can depend on other libs, but then those libs cannot in turn
directly depend upon DPDK. This is where breaking out first some of the
smaller, widely used parts of DPDK, e.g. logging, would be good, as it would
then in turn allow other, potentially bigger parts of EAL to be taken out.

See [1] for a rough first attempt at this, which allows simplification of
telemetry as it no longer needs a "dependency injection" style to have
logging. Moving out logging would also allow logging from kvargs library
too - another lib which is used by EAL rather than depending on it.
Similarly for tracing functionality - if that were pulled out of EAL, it
could be used by telemetry, kvargs and any other parts removed from EAL.

/Bruce

[1] http://patches.dpdk.org/project/dpdk/list/?series=24453

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-11  9:05           ` Morten Brørup
@ 2023-01-11 16:20             ` Tomasz Duszynski
  2023-01-11 16:54               ` Morten Brørup
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-11 16:20 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin



>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, January 11, 2023 10:06 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>Subject: [EXT] RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Wednesday, 11 January 2023 00.47
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>
>[...]
>
>> +static int
>> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int
>> group_fd)
>> +{
>> +	struct perf_event_attr attr = {
>> +		.size = sizeof(struct perf_event_attr),
>> +		.type = PERF_TYPE_RAW,
>> +		.exclude_kernel = 1,
>> +		.exclude_hv = 1,
>> +		.disabled = 1,
>> +	};
>> +
>> +	pmu_arch_fixup_config(config);
>> +
>> +	attr.config = config[0];
>> +	attr.config1 = config[1];
>> +	attr.config2 = config[2];
>> +
>> +	return syscall(SYS_perf_event_open, &attr, 0,
>> rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
>> +}
>
>If SYS_perf_event_open() must be called from the worker thread itself, then lcore_id must not be
>passed as a parameter to do_perf_event_open(). Otherwise, I would expect to be able to call
>do_perf_event_open() from the main thread and pass any lcore_id of a worker thread.
>This comment applies to all functions that must be called from the worker thread itself. It also
>applies to the functions that call such functions.
>

Lcore_id is being passed around so that we don't need to call rte_lcore_id() each and every time. 

>[...]
>
>> +/**
>> + * A structure describing a group of events.
>> + */
>> +struct rte_pmu_event_group {
>> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
>> /**< array of user pages */
>> +	bool enabled; /**< true if group was enabled on particular lcore
>> */
>> +};
>> +
>> +/**
>> + * A structure describing an event.
>> + */
>> +struct rte_pmu_event {
>> +	char *name; /** name of an event */
>> +	unsigned int index; /** event index into fds/mmap_pages */
>> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
>> +};
>
>Move the "enabled" field up, making it the first field in this structure. This might reduce the
>number of instructions required to check (!group->enabled) in rte_pmu_read().
>

This will be called once and, no, this will not produce more instructions. Why should it?
In both cases the compiler will need to load data at some offset, and archs do have instructions for that.

>Also, each instance of the structure is used individually per lcore, so the structure should be
>cache line aligned to avoid unnecessarily crossing cache lines.
>
>I.e.:
>
>struct rte_pmu_event_group {
>	bool enabled; /**< true if group was enabled on particular lcore */
>	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */ }
>__rte_cache_aligned;

Yes, this can be aligned. While at it, I'd be more inclined to move mmap_pages up instead of enabled.

>
>> +
>> +/**
>> + * A PMU state container.
>> + */
>> +struct rte_pmu {
>> +	char *name; /** name of core PMU listed under
>> /sys/bus/event_source/devices */
>> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
>> event group data */
>> +	unsigned int num_group_events; /**< number of events in a group
>> */
>> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
>> events */
>> +};
>> +
>> +/** Pointer to the PMU state container */
>> +extern struct rte_pmu rte_pmu;
>
>Just "The PMU state container". It is not a pointer anymore. :-)
>

Good catch.

>[...]
>
>> +/**
>> + * @internal
>> + *
>> + * Read PMU counter.
>> + *
>> + * @param pc
>> + *   Pointer to the mmapped user page.
>> + * @return
>> + *   Counter value read from hardware.
>> + */
>> +__rte_internal
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>> +{
>> +	uint64_t width, offset;
>> +	uint32_t seq, index;
>> +	int64_t pmc;
>> +
>> +	for (;;) {
>> +		seq = pc->lock;
>> +		rte_compiler_barrier();
>> +		index = pc->index;
>> +		offset = pc->offset;
>> +		width = pc->pmc_width;
>> +
>
>Please add a comment here about the special meaning of index == 0.

Okay. 
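
For reference, a minimal sketch of what such a comment would need to explain. It assumes the usual perf user-page convention that index == 0 means no hardware counter is currently readable from user space, so only the kernel-maintained offset is valid. The helper names sign_extend() and counter_value() are hypothetical and not part of the patch; raw stands in for the value returned by the arch-specific counter read.

```c
#include <stdint.h>

/*
 * Sign-extend a raw counter value that is only `width` bits wide.
 * width must be in the range 1..64. Converting to int64_t and
 * right-shifting a negative value are implementation-defined in C;
 * GCC and Clang wrap and shift arithmetically, which is assumed here.
 */
static inline int64_t
sign_extend(uint64_t raw, unsigned int width)
{
	unsigned int shift = 64 - width;

	return (int64_t)(raw << shift) >> shift;
}

/*
 * Combine the offset from the user page with a raw hardware value.
 * index == 0: counter not readable from user space, return offset only.
 * index != 0: raw is the value of hardware counter (index - 1).
 */
static inline uint64_t
counter_value(uint64_t offset, uint32_t index, unsigned int width,
	      uint64_t raw)
{
	if (index == 0)
		return offset;

	return offset + (uint64_t)sign_extend(raw, width);
}
```

The shift is done on the unsigned value first to avoid left-shifting a signed integer, which is the same trick as the pmc <<= / pmc >>= pair in the quoted code.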

>
>> +		if (likely(pc->cap_user_rdpmc && index)) {
>> +			pmc = rte_pmu_pmc_read(index - 1);
>> +			pmc <<= 64 - width;
>> +			pmc >>= 64 - width;
>> +			offset += pmc;
>> +		}
>> +
>> +		rte_compiler_barrier();
>> +
>> +		if (likely(pc->lock == seq))
>> +			return offset;
>> +	}
>> +
>> +	return 0;
>> +}
>
>[...]
>
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice
>> + *
>> + * Read hardware counter configured to count occurrences of an event.
>> + *
>> + * @param index
>> + *   Index of an event to be read.
>> + * @return
>> + *   Event value read from register. In case of errors or lack of
>> support
>> + *   0 is returned. In other words, stream of zeros in a trace file
>> + *   indicates problem with reading particular PMU event register.
>> + */
>> +__rte_experimental
>> +static __rte_always_inline uint64_t
>> +rte_pmu_read(unsigned int index)
>> +{
>> +	struct rte_pmu_event_group *group;
>> +	int ret, lcore_id = rte_lcore_id();
>> +
>> +	group = &rte_pmu.group[lcore_id];
>> +	if (unlikely(!group->enabled)) {
>> +		ret = rte_pmu_enable_group(lcore_id);
>> +		if (ret)
>> +			return 0;
>> +
>> +		group->enabled = true;
>
>Group->enabled should be set inside rte_pmu_enable_group(), not here.
>

This is easier to follow imo and not against coding guidelines so I prefer to leave it as is.  

>> +	}
>> +
>> +	if (unlikely(index >= rte_pmu.num_group_events))
>> +		return 0;
>> +
>> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
>> +}
>


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
  2023-01-11 16:20             ` Tomasz Duszynski
@ 2023-01-11 16:54               ` Morten Brørup
  0 siblings, 0 replies; 205+ messages in thread
From: Morten Brørup @ 2023-01-11 16:54 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 11 January 2023 17.21
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Wednesday, January 11, 2023 10:06 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Wednesday, 11 January 2023 00.47
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >[...]
> >
> >> +static int
> >> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int
> >> group_fd)
> >> +{
> >> +	struct perf_event_attr attr = {
> >> +		.size = sizeof(struct perf_event_attr),
> >> +		.type = PERF_TYPE_RAW,
> >> +		.exclude_kernel = 1,
> >> +		.exclude_hv = 1,
> >> +		.disabled = 1,
> >> +	};
> >> +
> >> +	pmu_arch_fixup_config(config);
> >> +
> >> +	attr.config = config[0];
> >> +	attr.config1 = config[1];
> >> +	attr.config2 = config[2];
> >> +
> >> +	return syscall(SYS_perf_event_open, &attr, 0,
> >> rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
> >> +}
> >
> >If SYS_perf_event_open() must be called from the worker thread itself,
> then lcore_id must not be
> >passed as a parameter to do_perf_event_open(). Otherwise, I would
> expect to be able to call
> >do_perf_event_open() from the main thread and pass any lcore_id of a
> worker thread.
> >This comment applies to all functions that must be called from the
> worker thread itself. It also
> >applies to the functions that call such functions.
> >
> 
> Lcore_id is being passed around so that we don't need to call
> rte_lcore_id() each and every time.

Please take a look at the rte_lcore_id() implementation. :-)

Regardless, my argument still stands: If a function cannot be called with the lcore_id parameter set to any valid lcore id, it should not be a parameter to the function.
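
A minimal sketch of that API shape, assuming (as I recall) that rte_lcore_id() is little more than a read of a per-thread variable. All names below (current_lcore, self_lcore_id(), enable_group_self(), MAX_LCORE) are hypothetical stand-ins, not DPDK or patch identifiers.

```c
#include <stdbool.h>

#define MAX_LCORE 128 /* assumption; stands in for RTE_MAX_LCORE */

/* Stand-in for the per-thread lcore id that EAL sets up for each thread. */
static _Thread_local unsigned int current_lcore;

/* Per-lcore "group enabled" state, reduced to a flag for the sketch. */
static bool group_enabled[MAX_LCORE];

/* Cheap: just a thread-local read, so calling it repeatedly costs little. */
static inline unsigned int
self_lcore_id(void)
{
	return current_lcore;
}

/*
 * No lcore_id parameter: the function can only act on the calling
 * thread's own group, so the signature cannot be misused from the
 * main thread with some other lcore's id.
 */
static int
enable_group_self(void)
{
	unsigned int id = self_lcore_id();

	if (id >= MAX_LCORE)
		return -1;

	group_enabled[id] = true;
	return 0;
}
```

With this shape the restriction "must be called from the worker thread itself" is expressed by the signature rather than by documentation.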

> 
> >[...]
> >
> >> +/**
> >> + * A structure describing a group of events.
> >> + */
> >> +struct rte_pmu_event_group {
> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
> >> /**< array of user pages */
> >> +	bool enabled; /**< true if group was enabled on particular lcore
> >> */
> >> +};
> >> +
> >> +/**
> >> + * A structure describing an event.
> >> + */
> >> +struct rte_pmu_event {
> >> +	char *name; /** name of an event */
> >> +	unsigned int index; /** event index into fds/mmap_pages */
> >> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */ };
> >
> >Move the "enabled" field up, making it the first field in this
> structure. This might reduce the
> >number of instructions required to check (!group->enabled) in
> rte_pmu_read().
> >
> 
> This will be called once and no this will not produce more
> instructions. Why should it?

It seems I was not clearly describing my intention here. rte_pmu_read() is a hot function, where the comparison "if (!group->enabled)" itself will be executed many times.

> In both cases compiler will need to load data at some offset and archs
> do have instructions for that.

Yes, the instructions are: address = BASE + sizeof(struct rte_pmu_event_group) * lcore_id + offsetof(struct rte_pmu_event_group, enabled).

I meant you could avoid the extra instructions stemming from the addition: "+ offsetof()". But you are right... Both BASE and offsetof(struct rte_pmu_event_group, enabled) are known in advance, and can be merged at compile time to avoid the addition.
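
To make the arithmetic concrete, here is a small sketch with stand-in definitions. The names event_group, groups and MAX_LCORE and the array sizes are assumptions for illustration, not the definitions from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NUM_GROUP_EVENTS 8 /* assumption */
#define MAX_LCORE 128          /* assumption; stands in for RTE_MAX_LCORE */

struct event_group {
	int fds[MAX_NUM_GROUP_EVENTS];
	void *mmap_pages[MAX_NUM_GROUP_EVENTS];
	bool enabled;
};

static struct event_group groups[MAX_LCORE];

/* Address of groups[lcore_id].enabled, spelled out term by term. */
static inline uintptr_t
enabled_addr(unsigned int lcore_id)
{
	return (uintptr_t)groups                         /* BASE */
		+ sizeof(struct event_group) * lcore_id  /* scaled index */
		+ offsetof(struct event_group, enabled); /* constant offset */
}
```

Because groups is a static object, the first and last terms are both constants that the toolchain can fold together, which leaves only the scaled index to compute at run time regardless of where enabled sits in the structure.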

> 
> >Also, each instance of the structure is used individually per lcore,
> so the structure should be
> >cache line aligned to avoid unnecessarily crossing cache lines.
> >
> >I.e.:
> >
> >struct rte_pmu_event_group {
> >	bool enabled; /**< true if group was enabled on particular lcore
> */
> >	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
> /**< array of user pages */ }
> >__rte_cache_aligned;
> 
> Yes, this can be aligned. While at it, I'd be more inclined to move
> mmap_pages up instead of enable.

Yes, moving up mmap_pages is better.
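
A sketch of the resulting layout, using C11 alignas in place of DPDK's __rte_cache_aligned so it builds standalone. The 64-byte line size, the array size and the struct name are assumptions for illustration only.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64     /* assumption; DPDK uses RTE_CACHE_LINE_SIZE */
#define MAX_NUM_GROUP_EVENTS 8 /* assumption */

struct perf_event_mmap_page; /* only pointers are stored, no definition needed */

struct event_group {
	/* Hot field first: read on every counter read. Aligning the first
	 * member raises the alignment of the whole structure. */
	alignas(CACHE_LINE_SIZE)
	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS];
	int fds[MAX_NUM_GROUP_EVENTS]; /* cold: used at setup and teardown */
	bool enabled;                  /* checked once per read */
};

/* Each per-lcore instance starts on its own cache line... */
static_assert(offsetof(struct event_group, mmap_pages) == 0,
	      "mmap_pages must be the first field");
/* ...and array elements never share a line with a neighbouring lcore. */
static_assert(sizeof(struct event_group) % CACHE_LINE_SIZE == 0,
	      "size must be a multiple of the cache line size");
```

The static assertions document the two properties under discussion: the hot array sits at offset zero, and the padded size keeps adjacent lcores' entries on separate cache lines.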

> 
> >
> >> +
> >> +/**
> >> + * A PMU state container.
> >> + */
> >> +struct rte_pmu {
> >> +	char *name; /** name of core PMU listed under
> >> /sys/bus/event_source/devices */
> >> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore
> >> event group data */
> >> +	unsigned int num_group_events; /**< number of events in a group
> >> */
> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching
> >> events */
> >> +};
> >> +
> >> +/** Pointer to the PMU state container */ extern struct rte_pmu
> >> +rte_pmu;
> >
> >Just "The PMU state container". It is not a pointer anymore. :-)
> >
> 
> Good catch.
> 
> >[...]
> >
> >> +/**
> >> + * @internal
> >> + *
> >> + * Read PMU counter.
> >> + *
> >> + * @param pc
> >> + *   Pointer to the mmapped user page.
> >> + * @return
> >> + *   Counter value read from hardware.
> >> + */
> >> +__rte_internal
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc) {
> >> +	uint64_t width, offset;
> >> +	uint32_t seq, index;
> >> +	int64_t pmc;
> >> +
> >> +	for (;;) {
> >> +		seq = pc->lock;
> >> +		rte_compiler_barrier();
> >> +		index = pc->index;
> >> +		offset = pc->offset;
> >> +		width = pc->pmc_width;
> >> +
> >
> >Please add a comment here about the special meaning of index == 0.
> 
> Okay.
> 
> >
> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> +			pmc <<= 64 - width;
> >> +			pmc >>= 64 - width;
> >> +			offset += pmc;
> >> +		}
> >> +
> >> +		rte_compiler_barrier();
> >> +
> >> +		if (likely(pc->lock == seq))
> >> +			return offset;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >
> >[...]
> >
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read hardware counter configured to count occurrences of an
> event.
> >> + *
> >> + * @param index
> >> + *   Index of an event to be read.
> >> + * @return
> >> + *   Event value read from register. In case of errors or lack of
> >> support
> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> + *   indicates problem with reading particular PMU event register.
> >> + */
> >> +__rte_experimental
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read(unsigned int index)
> >> +{
> >> +	struct rte_pmu_event_group *group;
> >> +	int ret, lcore_id = rte_lcore_id();
> >> +
> >> +	group = &rte_pmu.group[lcore_id];
> >> +	if (unlikely(!group->enabled)) {
> >> +		ret = rte_pmu_enable_group(lcore_id);
> >> +		if (ret)
> >> +			return 0;
> >> +
> >> +		group->enabled = true;
> >
> >Group->enabled should be set inside rte_pmu_enable_group(), not here.
> >
> 
> This is easier to follow imo and not against coding guidelines so I
> prefer to leave it as is.

OK. It makes the rte_pmu_read() source code slightly shorter, but probably has zero effect on the generated code. No strong preference - feel free to follow your personal preference on this.

> 
> >> +	}
> >> +
> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> +		return 0;
> >> +
> >> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> >> +}
> >
> 


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11  9:39           ` [EXT] " Tomasz Duszynski
@ 2023-01-11 21:05             ` Tyler Retzlaff
  2023-01-13  7:44               ` Tomasz Duszynski
  0 siblings, 1 reply; 205+ messages in thread
From: Tyler Retzlaff @ 2023-01-11 21:05 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: bruce.richardson, mb, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin

On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
> Hi Tyler,
> 
> >-----Original Message-----
> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >Sent: Wednesday, January 11, 2023 1:32 AM
> >To: Tomasz Duszynski <tduszynski@marvell.com>; bruce.richardson@intel.com; mb@smartsharesystems.com
> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
> >mb@smartsharesystems.com; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
> >Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
> >
> >External Email
> >
> >----------------------------------------------------------------------
> >hi,
> >
> >don't interpret this as an objection to the functionality but this looks like a clear example of
> >something that doesn't belong in the EAL. has there been a discussion as to whether or not this
> >should be in a separate library?
> 
> No, I don't recall anybody having any concerns about the code placement. Rationale behind 
> making this part of eal was based on the fact that tracing itself is a part of eal and
> since this was meant to be extension to tracing, code placement decision came out naturally. 
> 
> During the development phase the idea evolved a bit, and what initially was supposed to be solely
> yet another tracepoint became a generic API to read the PMU and a tracepoint based on that. Which
> means both can be used independently.
> 
> That said, since this code has both platform agnostic and platform specific parts this can either be split into: 
> 1. library + eal platform code
> 2. all under eal 
> 
> Either approach seems legit. Thoughts?
> 
> >
> >a basic test is whether or not an implementation exists or can be reasonably provided for all
> >platforms and that isn't strictly evident here. red flag is to see yet more code being added
> >conditionally compiled for a single platform.
> 
> Even libs are not entirely pristine and have platform specific ifdefs lurking so not sure where
> this red flag is coming from. 

i think red flag was probably the wrong term to use sorry for that.
rather i should say it is an indicator that the api probably doesn't
belong in the eal.

fundamentally the purpose of the abstraction library is to relieve the
application from having to do conditional compilation and/or execution for
the subject apis coming from eal. including and exporting apis that work
for only one platform is in direct contradiction.

please explore adding this as a separate library, it is understood that
there are tradeoffs involved.

thanks!


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-11 21:05             ` Tyler Retzlaff
@ 2023-01-13  7:44               ` Tomasz Duszynski
  2023-01-13 19:22                 ` Tyler Retzlaff
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-13  7:44 UTC (permalink / raw)
  To: Tyler Retzlaff
  Cc: bruce.richardson, mb, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin



>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Wednesday, January 11, 2023 10:06 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: bruce.richardson@intel.com; mb@smartsharesystems.com; dev@dpdk.org; thomas@monjalon.net; Jerin
>Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com;
>zhoumin@loongson.cn
>Subject: Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
>
>On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
>> Hi Tyler,
>>
>> >-----Original Message-----
>> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>> >Sent: Wednesday, January 11, 2023 1:32 AM
>> >To: Tomasz Duszynski <tduszynski@marvell.com>;
>> >bruce.richardson@intel.com; mb@smartsharesystems.com
>> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran
>> ><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
>> >mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
>> >Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
>> >
>> >External Email
>> >
>> >---------------------------------------------------------------------
>> >-
>> >hi,
>> >
>> >don't interpret this as an objection to the functionality but this
>> >looks like a clear example of something that doesn't belong in the
>> >EAL. has there been a discussion as to whether or not this should be in a separate library?
>>
>> No, I don't recall anybody having any concerns about the code
>> placement. Rationale behind making this part of eal was based on the
>> fact that tracing itself is a part of eal and since this was meant to be extension to tracing,
>code placement decision came out naturally.
>>
>> During development phase idea evolved a bit and what initially was
>> supposed to be solely yet another tracepoint become generic API to
>> read pmu and tracepoint based on that. Which means both can be used independently.
>>
>> That said, since this code has both platform agnostic and platform specific parts this can either
>be split into:
>> 1. library + eal platform code
>> 2. all under eal
>>
>> Either approach seems legit. Thoughts?
>>
>> >
>> >a basic test is whether or not an implementation exists or can be
>> >reasonably provided for all platforms and that isn't strictly evident
>> >here. red flag is to see yet more code being added conditionally compiled for a single platform.
>>
>> Even libs are not entirely pristine and have platform specific ifdefs
>> lurking so not sure where this red flag is coming from.
>
>i think red flag was probably the wrong term to use sorry for that.
>rather i should say it is an indicator that the api probably doesn't belong in the eal.
>
>fundamentally the purpose of the abstraction library is to relieve the application from having to
>do conditional compilation and/or execution for the subject apis coming from eal. including and
>exporting apis that work for only one platform is in direct contradiction.
>
>please explore adding this as a separate library, it is understood that there are tradeoffs
>involved.
>
>thanks!

Any ideas how to name the library?


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-13  7:44               ` Tomasz Duszynski
@ 2023-01-13 19:22                 ` Tyler Retzlaff
  2023-01-14  9:53                   ` Morten Brørup
  0 siblings, 1 reply; 205+ messages in thread
From: Tyler Retzlaff @ 2023-01-13 19:22 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: bruce.richardson, mb, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin

On Fri, Jan 13, 2023 at 07:44:57AM +0000, Tomasz Duszynski wrote:
> 
> 
> >-----Original Message-----
> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >Sent: Wednesday, January 11, 2023 10:06 PM
> >To: Tomasz Duszynski <tduszynski@marvell.com>
> >Cc: bruce.richardson@intel.com; mb@smartsharesystems.com; dev@dpdk.org; thomas@monjalon.net; Jerin
> >Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com; mattias.ronnblom@ericsson.com;
> >zhoumin@loongson.cn
> >Subject: Re: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
> >
> >On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
> >> Hi Tyler,
> >>
> >> >-----Original Message-----
> >> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >> >Sent: Wednesday, January 11, 2023 1:32 AM
> >> >To: Tomasz Duszynski <tduszynski@marvell.com>;
> >> >bruce.richardson@intel.com; mb@smartsharesystems.com
> >> >Cc: dev@dpdk.org; thomas@monjalon.net; Jerin Jacob Kollanukkaran
> >> ><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
> >> >mattias.ronnblom@ericsson.com; zhoumin@loongson.cn
> >> >Subject: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
> >> >
> >> >External Email
> >> >
> >> >---------------------------------------------------------------------
> >> >-
> >> >hi,
> >> >
> >> >don't interpret this as an objection to the functionality but this
> >> >looks like a clear example of something that doesn't belong in the
> >> >EAL. has there been a discussion as to whether or not this should be in a separate library?
> >>
> >> No, I don't recall anybody having any concerns about the code
> >> placement. Rationale behind making this part of eal was based on the
> >> fact that tracing itself is a part of eal and since this was meant to be extension to tracing,
> >code placement decision came out naturally.
> >>
> >> During development phase idea evolved a bit and what initially was
> >> supposed to be solely yet another tracepoint become generic API to
> >> read pmu and tracepoint based on that. Which means both can be used independently.
> >>
> >> That said, since this code has both platform agnostic and platform specific parts this can either
> >be split into:
> >> 1. library + eal platform code
> >> 2. all under eal
> >>
> >> Either approach seems legit. Thoughts?
> >>
> >> >
> >> >a basic test is whether or not an implementation exists or can be
> >> >reasonably provided for all platforms and that isn't strictly evident
> >> >here. red flag is to see yet more code being added conditionally compiled for a single platform.
> >>
> >> Even libs are not entirely pristine and have platform specific ifdefs
> >> lurking so not sure where this red flag is coming from.
> >
> >i think red flag was probably the wrong term to use sorry for that.
> >rather i should say it is an indicator that the api probably doesn't belong in the eal.
> >
> >fundamentally the purpose of the abstraction library is to relieve the application from having to
> >do conditional compilation and/or execution for the subject apis coming from eal. including and
> >exporting apis that work for only one platform is in direct contradiction.
> >
> >please explore adding this as a separate library, it is understood that there are tradeoffs
> >involved.
> >
> >thanks!
> 
> Any ideas how to name the library?

naming is always so hard and i'm definitely not authoritative.

it seems like lib/pmu would be the least churn against your existing
patch series, here are some other suggestions that might work.

lib/pmu (measuring unit)
lib/pmc (measuring counters)
lib/pcq (counter query)
lib/pmq (measuring query)


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH v5 0/4] add support for self monitoring
  2023-01-13 19:22                 ` Tyler Retzlaff
@ 2023-01-14  9:53                   ` Morten Brørup
  0 siblings, 0 replies; 205+ messages in thread
From: Morten Brørup @ 2023-01-14  9:53 UTC (permalink / raw)
  To: Tyler Retzlaff, Tomasz Duszynski
  Cc: bruce.richardson, dev, thomas, Jerin Jacob Kollanukkaran,
	Ruifeng.Wang, mattias.ronnblom, zhoumin


> From: Tyler Retzlaff [mailto:roretzla@linux.microsoft.com]
> Sent: Friday, 13 January 2023 20.22
> 
> On Fri, Jan 13, 2023 at 07:44:57AM +0000, Tomasz Duszynski wrote:
> >
> > >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> > >Sent: Wednesday, January 11, 2023 10:06 PM
> > >
> > >On Wed, Jan 11, 2023 at 09:39:35AM +0000, Tomasz Duszynski wrote:
> > >> Hi Tyler,
> > >>
> > >> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> > >> >Sent: Wednesday, January 11, 2023 1:32 AM
> > >> >
> > >> >hi,
> > >> >
> > >> >don't interpret this as an objection to the functionality but
> this
> > >> >looks like a clear example of something that doesn't belong in
> the
> > >> >EAL. has there been a discussion as to whether or not this should
> be in a separate library?
> > >>
> > >> No, I don't recall anybody having any concerns about the code
> > >> placement. Rationale behind making this part of eal was based on
> the
> > >> fact that tracing itself is a part of eal and since this was meant
> to be extension to tracing,
> > >code placement decision came out naturally.
> > >>
> > >> During development phase idea evolved a bit and what initially was
> > >> supposed to be solely yet another tracepoint become generic API to
> > >> read pmu and tracepoint based on that. Which means both can be
> used independently.
> > >>
> > >> That said, since this code has both platform agnostic and platform
> specific parts this can either
> > >be split into:
> > >> 1. library + eal platform code
> > >> 2. all under eal
> > >>
> > >> Either approach seems legit. Thoughts?
> > >>
> > >> >
> > >> >a basic test is whether or not an implementation exists or can be
> > >> >reasonably provided for all platforms and that isn't strictly
> evident
> > >> >here. red flag is to see yet more code being added conditionally
> compiled for a single platform.
> > >>
> > >> Even libs are not entirely pristine and have platform specific
> ifdefs
> > >> lurking so not sure where this red flag is coming from.
> > >
> > >i think red flag was probably the wrong term to use sorry for that.
> > >rather i should say it is an indicator that the api probably doesn't
> belong in the eal.
> > >
> > >fundamentally the purpose of the abstraction library is to relieve
> the application from having to
> > >do conditional compilation and/or execution for the subject apis
> coming from eal. including and
> > >exporting apis that work for only one platform is in direct
> contradiction.
> > >
> > >please explore adding this as a separate library, it is understood
> that there are tradeoffs
> > >involved.
> > >
> > >thanks!
> >
> > Any ideas how to name the library?
> 
> naming is always so hard and i'm definitely not authoritative.
> 
> it seems like lib/pmu would be the least churn against your existing
> patch series, here are some other suggestions that might work.

+1 to lib/pmu

Less work, as Tyler already mentioned. Furthermore:

Both Intel and ARM use the term Performance Monitoring Unit (abbreviated PMU).

Microsoft does too [1].

[1]: https://learn.microsoft.com/en-us/windows-hardware/test/wpt/recording-pmu-events

RISC-V uses the term Hardware Performance Monitor (abbreviated HPM).
I haven't checked other CPU vendors.


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v6 0/4] add support for self monitoring
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (4 preceding siblings ...)
  2023-01-11  0:32         ` [PATCH v5 0/4] add support for self monitoring Tyler Retzlaff
@ 2023-01-19 23:39         ` Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                             ` (4 more replies)
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
  2023-02-16 20:56         ` [PATCH v5 0/4] add support for self monitoring Liang Ma
  7 siblings, 5 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using the dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside a CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
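
For readers who want to see which names are available before enabling an
event, the sysfs lookup can be sketched roughly as below. This is only an
illustration and not code from the series: list_pmu_events() is a
hypothetical helper, and the directory passed to it (for example
/sys/bus/event_source/devices/cpu/events on many x86 systems) is an
assumption that depends on the platform.

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of this series): print the event names
 * found in a PMU "events" directory under sysfs.
 *
 * Returns the number of entries printed, or a negative errno value if
 * the directory cannot be opened.
 */
static int
list_pmu_events(const char *events_dir)
{
	struct dirent *dent;
	int count = 0;
	DIR *dirp;

	dirp = opendir(events_dir);
	if (dirp == NULL)
		return -errno;

	while ((dent = readdir(dirp)) != NULL) {
		/* skip ".", ".." and any hidden entries */
		if (dent->d_name[0] == '.')
			continue;

		printf("%s\n", dent->d_name);
		count++;
	}

	closedir(dirp);

	return count;
}
```

Any name printed this way is the kind of string one would expect to pass
to rte_pmu_add_event(). A negative return value most likely means the
given PMU directory does not exist on the running system.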

v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra
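
As background for the sign extension item in the v5 notes: a hardware
counter is typically narrower than 64 bits, so the raw value has to be
sign-extended from its reported width before it is added to the offset.
The sketch below shows one way to do that. It is an illustration only;
sign_extend_counter() is a hypothetical name, and the code assumes an
arithmetic right shift for signed values, which GCC and Clang provide.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): sign-extend a raw counter
 * value that is only `width` bits wide (1..64) to a 64-bit signed value.
 */
static inline int64_t
sign_extend_counter(uint64_t raw, unsigned int width)
{
	unsigned int shift = 64 - width;

	/*
	 * Shift left while the value is still unsigned, so no signed
	 * overflow occurs, then shift right as a signed value so the top
	 * bit of the counter is replicated into the upper bits.
	 */
	return (int64_t)(raw << shift) >> shift;
}
```

With a 48-bit counter, a raw value of all ones becomes -1 rather than a
large positive number, which keeps the sum with the offset correct when
the counter has been preloaded with a negative value.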

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   4 +
 app/test/test_pmu.c                      |  48 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   7 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   3 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 +++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 497 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 226 +++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  20 +
 23 files changed, 1100 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1



* [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-01-20  9:46             ` Morten Brørup
  2023-01-20 18:29             ` Tyler Retzlaff
  2023-01-19 23:39           ` [PATCH v6 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                             ` (3 subsequent siblings)
  4 siblings, 2 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   4 +
 app/test/test_pmu.c                    |  42 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   7 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 206 ++++++++++++
 lib/pmu/version.map                    |  19 ++
 13 files changed, 773 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..b2c2a618b1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
     test_sources += ['test_metrics.c']
     fast_tests += [['metrics_autotest', true, true]]
 endif
+if is_linux
+    test_sources += ['test_pmu.c']
+    fast_tests += [['pmu_autotest', true, true]]
+endif
 if not is_windows and dpdk_conf.has('RTE_LIB_TELEMETRY')
     test_sources += ['test_telemetry_json.c', 'test_telemetry_data.c']
     fast_tests += [['telemetry_json_autotest', true, true]]
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..7c3cf18ed9
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <rte_pmu.h>
+
+#include "test.h"
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being one example. However, in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 92ec1e4b88..f43bd62376 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -57,6 +57,13 @@ New Features
 
 * **Added multi-process support for axgbe PMD.**
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 
 Removed Items
 -------------
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..f8369b9dc7
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,436 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_pmu.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(unsigned int lcore_id)
+{
+	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(lcore_id);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(lcore_id);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	cleanup_events(lcore_id);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	closedir(dirp);
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL)
+			return -ENOMEM;
+	}
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+
+		return -ENOMEM;
+	}
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event *event, *tmp;
+	unsigned int i;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free(event->name);
+		free(event);
+	}
+
+	for (i = 0; i < rte_pmu.num_group_events; i++)
+		cleanup_events(i);
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..42c764fa9e
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,206 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_lcore.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on a given lcore.
+ *
+ * @param lcore_id
+ *   The identifier of the lcore.
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(unsigned int lcore_id);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	unsigned int lcore_id = rte_lcore_id();
+	struct rte_pmu_event_group *group;
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	group = &rte_pmu.group[lcore_id];
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group(lcore_id);
+		if (ret)
+			return 0;
+
+		group->enabled = true;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..e15e21156a
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,19 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1



* [PATCH v6 2/4] pmu: support reading ARM PMU events in runtime
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                             ` (2 subsequent siblings)
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 7c3cf18ed9..4cdc71791e 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -15,6 +15,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 42c764fa9e..4808d90eb9 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_lcore.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x1\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1



* [PATCH v6 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-01-19 23:39           ` [PATCH v6 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: thomas, jerinj, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 4cdc71791e..dc7a9cdb27 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -17,6 +17,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 4808d90eb9..617732361c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v6 4/4] eal: add PMU support to tracing library
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
                             ` (2 preceding siblings ...)
  2023-01-19 23:39           ` [PATCH v6 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-01-19 23:39           ` Tomasz Duszynski
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-19 23:39 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: thomas, mb, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  3 ++
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 159 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..a8e97ee1ec 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 5caaac8e59..3631d0032b 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..1e46ce549a 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..afb459b198 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +282,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..eddb45bebf 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,9 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index f8369b9dc7..3241b8c748 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -375,6 +375,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret = 0;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 617732361c..f642b721e8 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -166,6 +166,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index e15e21156a..4646eefd2b 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -7,6 +7,7 @@ EXPERIMENTAL {
 
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-01-20  9:46             ` Morten Brørup
  2023-01-26  9:40               ` Tomasz Duszynski
  2023-01-20 18:29             ` Tyler Retzlaff
  1 sibling, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2023-01-20  9:46 UTC (permalink / raw)
  To: Tomasz Duszynski, dev, Thomas Monjalon
  Cc: jerinj, Ruifeng.Wang, mattias.ronnblom, zhoumin,
	bruce.richardson, roretzla

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Friday, 20 January 2023 00.39
> 
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---

If you insist on passing lcore_id around as a function parameter, the function description must mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a requirement, including all functions that use those functions.

Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use rte_lcore_id() instead.



^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-19 23:39           ` [PATCH v6 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-01-20  9:46             ` Morten Brørup
@ 2023-01-20 18:29             ` Tyler Retzlaff
  2023-01-26  9:05               ` [EXT] " Tomasz Duszynski
  1 sibling, 1 reply; 205+ messages in thread
From: Tyler Retzlaff @ 2023-01-20 18:29 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, Thomas Monjalon, jerinj, mb, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson

On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> standard perf utility without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   4 +
>  app/test/test_pmu.c                    |  42 +++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |   8 +
>  doc/guides/rel_notes/release_23_03.rst |   7 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  29 ++
>  lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 206 ++++++++++++
>  lib/pmu/version.map                    |  19 ++
>  13 files changed, 773 insertions(+), 1 deletion(-)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/pmu/meson.build
>  create mode 100644 lib/pmu/pmu_private.h
>  create mode 100644 lib/pmu/rte_pmu.c
>  create mode 100644 lib/pmu/rte_pmu.h
>  create mode 100644 lib/pmu/version.map
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 9a0f416d2e..9f13eafd95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>  F: lib/node/
>  
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski <tduszynski@marvell.com>
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
>  
>  Test Applications
>  -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index f34d19e3c3..b2c2a618b1 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
>      test_sources += ['test_metrics.c']
>      fast_tests += [['metrics_autotest', true, true]]
>  endif
> +if is_linux
> +    test_sources += ['test_pmu.c']
> +    fast_tests += [['pmu_autotest', true, true]]
> +endif

traditionally we don't conditionally include tests at the meson.build
level, instead we run all tests and have them skip when executed for
unsupported exec environments.

you can take a look at test_eventdev.c as an example for a test that is
skipped on windows, i'm sure it could be adapted to skip on freebsd if
you aren't supporting it.
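
A rough sketch of the run-time skip approach follows. It is standalone, so the
TEST_* values are local stand-ins assumed to match app/test/test.h,
__linux__ stands in for RTE_EXEC_ENV_LINUX, and run_test_case() is a
hypothetical runner rather than the real test harness.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-ins for app/test/test.h values, assumed here so this builds alone. */
#define TEST_SUCCESS 0
#define TEST_FAILED -1
#define TEST_SKIPPED 77

#if defined(__linux__)
static int
test_pmu(void)
{
	/* the real test body would exercise the PMU library here */
	return TEST_SUCCESS;
}
#else
static int
test_pmu(void)
{
	/* always built, but reports "skipped" where PMU is unsupported */
	printf("pmu_autotest not supported on this platform, skipping test\n");
	return TEST_SKIPPED;
}
#endif

/* Hypothetical runner: calls the test and prints how it ended. */
static int
run_test_case(int (*fn)(void))
{
	int ret;

	assert(fn != NULL);

	ret = fn();
	if (ret == TEST_SKIPPED)
		printf("Test Skipped\n");
	else if (ret == TEST_SUCCESS)
		printf("Test OK\n");
	else
		printf("Test Failed\n");

	return ret;
}
```

With this shape the test source is listed unconditionally in meson.build and
the platform check lives next to the test itself.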


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 0/2] add platform bus
  2023-01-10 23:46       ` [PATCH v5 0/4] add support for self monitoring Tomasz Duszynski
                           ` (5 preceding siblings ...)
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
@ 2023-01-25 10:33         ` Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
                             ` (2 more replies)
  2023-02-16 20:56         ` [PATCH v5 0/4] add support for self monitoring Liang Ma
  7 siblings, 3 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:33 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, stephen, chenbo.xia, Tomasz Duszynski

Platform bus is a bus under Linux which manages devices that do not have
any built-in discovery mechanism. Linux learns about platform devices
directly from the device tree during the boot-up phase.

Afterwards, if userspace wants to use some particular device, a driver
that is usually a mixture of vdev/rawdev gets developed.

In order to simplify that, introduce a DPDK platform bus which provides
an auto-probe experience and separates the bus logic from the driver itself.

Currently only devices backed by the vfio-platform kernel driver
are supported, though other options may be added if necessary.

Tomasz Duszynski (2):
  lib: add helper to read strings from sysfs files
  bus: add platform bus

 MAINTAINERS                                |   4 +
 app/test/test_eal_fs.c                     | 108 +++-
 doc/guides/rel_notes/release_23_03.rst     |   5 +
 drivers/bus/meson.build                    |   1 +
 drivers/bus/platform/bus_platform_driver.h | 174 ++++++
 drivers/bus/platform/meson.build           |  16 +
 drivers/bus/platform/platform.c            | 604 +++++++++++++++++++++
 drivers/bus/platform/platform_params.c     |  70 +++
 drivers/bus/platform/private.h             |  48 ++
 drivers/bus/platform/version.map           |  10 +
 lib/eal/common/eal_filesystem.h            |   6 +
 lib/eal/unix/eal_filesystem.c              |  24 +-
 lib/eal/version.map                        |   1 +
 13 files changed, 1053 insertions(+), 18 deletions(-)
 create mode 100644 drivers/bus/platform/bus_platform_driver.h
 create mode 100644 drivers/bus/platform/meson.build
 create mode 100644 drivers/bus/platform/platform.c
 create mode 100644 drivers/bus/platform/platform_params.c
 create mode 100644 drivers/bus/platform/private.h
 create mode 100644 drivers/bus/platform/version.map

--
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
@ 2023-01-25 10:33           ` Tomasz Duszynski
  2023-01-25 10:39             ` Thomas Monjalon
  2023-01-25 10:33           ` [PATCH 2/2] bus: add platform bus Tomasz Duszynski
  2023-01-25 10:41           ` [PATCH 0/2] " Tomasz Duszynski
  2 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:33 UTC (permalink / raw)
  To: dev; +Cc: thomas, jerinj, stephen, chenbo.xia, Tomasz Duszynski

Reading strings from sysfs files is a recurring pattern,
hence add a helper for doing that.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_eal_fs.c          | 108 ++++++++++++++++++++++++++++----
 lib/eal/common/eal_filesystem.h |   6 ++
 lib/eal/unix/eal_filesystem.c   |  24 ++++---
 lib/eal/version.map             |   1 +
 4 files changed, 121 insertions(+), 18 deletions(-)

diff --git a/app/test/test_eal_fs.c b/app/test/test_eal_fs.c
index b3686edcb4..6c373fc7f1 100644
--- a/app/test/test_eal_fs.c
+++ b/app/test/test_eal_fs.c
@@ -20,12 +20,33 @@ test_eal_fs(void)
 
 #else
 
+static int
+temp_create(char *filename, size_t len)
+{
+	char file_template[] = "/tmp/eal_test_XXXXXX";
+	char proc_path[PATH_MAX];
+	int fd;
+
+	fd = mkstemp(file_template);
+	if (fd == -1) {
+		perror("mkstemp() failure");
+		return -1;
+	}
+
+	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
+	if (readlink(proc_path, filename, len) < 0) {
+		perror("readlink() failure");
+		close(fd);
+		return -1;
+	}
+
+	return fd;
+}
+
 static int
 test_parse_sysfs_value(void)
 {
 	char filename[PATH_MAX] = "";
-	char proc_path[PATH_MAX];
-	char file_template[] = "/tmp/eal_test_XXXXXX";
 	int tmp_file_handle = -1;
 	FILE *fd = NULL;
 	unsigned valid_number;
@@ -40,16 +61,10 @@ test_parse_sysfs_value(void)
 
 	/* get a temporary filename to use for all tests - create temp file handle and then
 	 * use /proc to get the actual file that we can open */
-	tmp_file_handle = mkstemp(file_template);
-	if (tmp_file_handle == -1) {
-		perror("mkstemp() failure");
+	tmp_file_handle = temp_create(filename, sizeof(filename));
+	if (tmp_file_handle < 0)
 		goto error;
-	}
-	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", tmp_file_handle);
-	if (readlink(proc_path, filename, sizeof(filename)) < 0) {
-		perror("readlink() failure");
-		goto error;
-	}
+
 	printf("Temporary file is: %s\n", filename);
 
 	/* test we get an error value if we use file before it's created */
@@ -175,11 +190,82 @@ test_parse_sysfs_value(void)
 	return -1;
 }
 
+static int
+test_parse_sysfs_string(void)
+{
+	const char *teststr = "the quick brown dog jumps over the lazy fox\n";
+	char filename[PATH_MAX] = "";
+	char buf[BUFSIZ] = { };
+	int tmp_file_handle;
+	FILE *fd = NULL;
+
+#ifdef RTE_EXEC_ENV_FREEBSD
+	/* BSD doesn't have /proc/pid/fd */
+	return 0;
+#endif
+	printf("Testing function eal_parse_sysfs_string()\n");
+
+	/* get a temporary filename to use for all tests - create temp file handle and then
+	 * use /proc to get the actual file that we can open
+	 */
+	tmp_file_handle = temp_create(filename, sizeof(filename));
+	if (tmp_file_handle < 0)
+		goto error;
+
+	printf("Temporary file is: %s\n", filename);
+
+	/* test we get an error value if the file does not exist */
+	printf("Test reading a missing file ...\n");
+	if (eal_parse_sysfs_string("/dev/not-quite-null", buf, sizeof(buf)) == 0) {
+		printf("Error with eal_parse_sysfs_string() - returned success on reading missing file\n");
+		goto error;
+	}
+	printf("Confirmed return error when reading missing file\n");
+
+	/* test reading a string from file */
+	printf("Test reading string ...\n");
+	fd = fopen(filename, "w");
+	if (fd == NULL) {
+		printf("line %d, Error opening %s: %s\n", __LINE__, filename, strerror(errno));
+		goto error;
+	}
+	fprintf(fd, "%s", teststr);
+	fclose(fd);
+	fd = NULL;
+	if (eal_parse_sysfs_string(filename, buf, sizeof(buf) - 1) < 0) {
+		printf("eal_parse_sysfs_string() returned error - test failed\n");
+		goto error;
+	}
+	if (strcmp(teststr, buf)) {
+		printf("Invalid string read by eal_parse_sysfs_string() - test failed\n");
+		goto error;
+	}
+	/* don't print newline */
+	buf[strlen(buf) - 1] = '\0';
+	printf("Read '%s\\n' ok\n", buf);
+
+	close(tmp_file_handle);
+	unlink(filename);
+	printf("eal_parse_sysfs_string() - OK\n");
+	return 0;
+
+error:
+	if (fd)
+		fclose(fd);
+	if (tmp_file_handle >= 0)
+		close(tmp_file_handle);
+	if (filename[0] != '\0')
+		unlink(filename);
+	return -1;
+}
+
 static int
 test_eal_fs(void)
 {
 	if (test_parse_sysfs_value() < 0)
 		return -1;
+	if (test_parse_sysfs_string() < 0)
+		return -1;
 	return 0;
 }
 
diff --git a/lib/eal/common/eal_filesystem.h b/lib/eal/common/eal_filesystem.h
index 5d21f07c20..ac6449f529 100644
--- a/lib/eal/common/eal_filesystem.h
+++ b/lib/eal/common/eal_filesystem.h
@@ -104,4 +104,10 @@ eal_get_hugefile_path(char *buffer, size_t buflen, const char *hugedir, int f_id
  * Used to read information from files on /sys */
 int eal_parse_sysfs_value(const char *filename, unsigned long *val);
 
+/** Function to read a string from a file on the filesystem.
+ * Used to read information from files on /sys
+ */
+__rte_internal
+int eal_parse_sysfs_string(const char *filename, char *str, size_t size);
+
 #endif /* EAL_FILESYSTEM_H */
diff --git a/lib/eal/unix/eal_filesystem.c b/lib/eal/unix/eal_filesystem.c
index afbab9368a..8ed10094be 100644
--- a/lib/eal/unix/eal_filesystem.c
+++ b/lib/eal/unix/eal_filesystem.c
@@ -76,12 +76,9 @@ int eal_create_runtime_dir(void)
 	return 0;
 }
 
-/* parse a sysfs (or other) file containing one integer value */
-int eal_parse_sysfs_value(const char *filename, unsigned long *val)
+int eal_parse_sysfs_string(const char *filename, char *str, size_t size)
 {
 	FILE *f;
-	char buf[BUFSIZ];
-	char *end = NULL;
 
 	if ((f = fopen(filename, "r")) == NULL) {
 		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs value %s\n",
@@ -89,19 +86,32 @@ int eal_parse_sysfs_value(const char *filename, unsigned long *val)
 		return -1;
 	}
 
-	if (fgets(buf, sizeof(buf), f) == NULL) {
+	if (fgets(str, size, f) == NULL) {
 		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs value %s\n",
 			__func__, filename);
 		fclose(f);
 		return -1;
 	}
+	fclose(f);
+	return 0;
+}
+
+/* parse a sysfs (or other) file containing one integer value */
+int eal_parse_sysfs_value(const char *filename, unsigned long *val)
+{
+	char buf[BUFSIZ];
+	char *end = NULL;
+	int ret;
+
+	ret = eal_parse_sysfs_string(filename, buf, sizeof(buf));
+	if (ret < 0)
+		return ret;
+
 	*val = strtoul(buf, &end, 0);
 	if ((buf[0] == '\0') || (end == NULL) || (*end != '\n')) {
 		RTE_LOG(ERR, EAL, "%s(): cannot parse sysfs value %s\n",
 				__func__, filename);
-		fclose(f);
 		return -1;
 	}
-	fclose(f);
 	return 0;
 }
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..9118bb6228 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -445,6 +445,7 @@ EXPERIMENTAL {
 INTERNAL {
 	global:
 
+	eal_parse_sysfs_string; # WINDOWS_NO_EXPORT
 	rte_bus_register;
 	rte_bus_unregister;
 	rte_eal_get_baseaddr;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH 2/2] bus: add platform bus
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
@ 2023-01-25 10:33           ` Tomasz Duszynski
  2023-01-25 10:41           ` [PATCH 0/2] " Tomasz Duszynski
  2 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:33 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski; +Cc: jerinj, stephen, chenbo.xia

Platform bus is a software bus under Linux that manages devices which
generally do not have built-in discovery mechanisms. Linux normally
learns about platform devices directly from device-tree during
boot-up phase.

Up to this point, whenever some userspace app needed control over
a platform device or a range thereof, some sort of driver being
a mixture of vdev/rawdev was required.

In order to simplify this task, provide an auto-probe
experience and separate bus logic from the driver itself,
add platform bus support.

Currently devices backed by the vfio-platform kernel driver
are supported.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 MAINTAINERS                                |   4 +
 doc/guides/rel_notes/release_23_03.rst     |   5 +
 drivers/bus/meson.build                    |   1 +
 drivers/bus/platform/bus_platform_driver.h | 174 ++++++
 drivers/bus/platform/meson.build           |  16 +
 drivers/bus/platform/platform.c            | 604 +++++++++++++++++++++
 drivers/bus/platform/platform_params.c     |  70 +++
 drivers/bus/platform/private.h             |  48 ++
 drivers/bus/platform/version.map           |  10 +
 9 files changed, 932 insertions(+)
 create mode 100644 drivers/bus/platform/bus_platform_driver.h
 create mode 100644 drivers/bus/platform/meson.build
 create mode 100644 drivers/bus/platform/platform.c
 create mode 100644 drivers/bus/platform/platform_params.c
 create mode 100644 drivers/bus/platform/private.h
 create mode 100644 drivers/bus/platform/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..b02666710c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -581,6 +581,10 @@ VDEV bus driver
 F: drivers/bus/vdev/
 F: app/test/test_vdev.c
 
+Platform bus driver
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: drivers/bus/platform
+
 VMBUS bus driver
 M: Long Li <longli@microsoft.com>
 F: drivers/bus/vmbus/
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 84b112a8b1..74b2b1e3ff 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -57,6 +57,11 @@ New Features
 
 * **Added multi-process support for axgbe PMD.**
 
+* **Added platform bus support.**
+
+  A platform bus provides a way to use Linux platform devices which
+  are compatible with the vfio-platform kernel driver.
+
 * **Updated Corigine nfp driver.**
 
   * Added support for meter options.
diff --git a/drivers/bus/meson.build b/drivers/bus/meson.build
index 45eab5233d..6d2520c543 100644
--- a/drivers/bus/meson.build
+++ b/drivers/bus/meson.build
@@ -7,6 +7,7 @@ drivers = [
         'fslmc',
         'ifpga',
         'pci',
+        'platform',
         'vdev',
         'vmbus',
 ]
diff --git a/drivers/bus/platform/bus_platform_driver.h b/drivers/bus/platform/bus_platform_driver.h
new file mode 100644
index 0000000000..8291c7f3f6
--- /dev/null
+++ b/drivers/bus/platform/bus_platform_driver.h
@@ -0,0 +1,174 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#ifndef _BUS_PLATFORM_DRIVER_H_
+#define _BUS_PLATFORM_DRIVER_H_
+
+/**
+ * @file
+ * Platform bus interface.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stddef.h>
+#include <stdint.h>
+
+#include <dev_driver.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_os.h>
+
+/* Forward declarations */
+struct rte_platform_bus;
+struct rte_platform_device;
+struct rte_platform_driver;
+
+/**
+ * Initialization function for the driver called during platform device probing.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+typedef int (rte_platform_probe_t)(struct rte_platform_device *pdev);
+
+/**
+ * Removal function for the driver called during platform device removal.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @return
+ *   0 on success, negative value otherwise.
+ */
+typedef int (rte_platform_remove_t)(struct rte_platform_device *pdev);
+
+/**
+ * Driver specific DMA mapping.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @param addr
+ *   Starting virtual address of memory to be mapped.
+ * @param iova
+ *   Starting IOVA address of memory to be mapped.
+ * @param len
+ *   Length of memory segment being mapped.
+ * @return
+ *   - 0 on success, negative value and rte_errno is set otherwise.
+ */
+typedef int (rte_platform_dma_map_t)(struct rte_platform_device *pdev, void *addr, uint64_t iova,
+				     size_t len);
+
+/**
+ * Driver specific DMA unmapping.
+ *
+ * @param pdev
+ *   Pointer to the platform device.
+ * @param addr
+ *   Starting virtual address of memory to be unmapped.
+ * @param iova
+ *   Starting IOVA address of memory to be unmapped.
+ * @param len
+ *   Length of memory segment being unmapped.
+ * @return
+ *   - 0 on success, negative value and rte_errno is set otherwise.
+ */
+typedef int (rte_platform_dma_unmap_t)(struct rte_platform_device *pdev, void *addr, uint64_t iova,
+				       size_t len);
+
+/**
+ * A structure describing a platform device resource.
+ */
+struct rte_platform_resource {
+	char *name; /**< Resource name specified via reg-names prop in device-tree */
+	struct rte_mem_resource mem; /**< Memory resource */
+};
+
+/**
+ * A structure describing a platform device.
+ */
+struct rte_platform_device {
+	RTE_TAILQ_ENTRY(rte_platform_device) next; /**< Next attached platform device */
+	struct rte_device device; /**< Core device */
+	struct rte_platform_driver *driver; /**< Matching device driver */
+	char name[RTE_DEV_NAME_MAX_LEN]; /**< Device name */
+	unsigned int num_resource; /**< Number of device resources */
+	struct rte_platform_resource *resource; /**< Device resources */
+	int dev_fd; /**< VFIO device fd */
+};
+
+/**
+ * A structure describing a platform device driver.
+ */
+struct rte_platform_driver {
+	RTE_TAILQ_ENTRY(rte_platform_driver) next; /**< Next available platform driver */
+	struct rte_driver driver; /**< Core driver */
+	rte_platform_probe_t *probe;  /**< Device probe function */
+	rte_platform_remove_t *remove; /**< Device remove function */
+	rte_platform_dma_map_t *dma_map; /**< Device DMA map function */
+	rte_platform_dma_unmap_t *dma_unmap; /**< Device DMA unmap function */
+	uint32_t drv_flags; /**< Driver flags RTE_PLATFORM_DRV_* */
+};
+
+/** Device driver needs IOVA as VA and cannot work with IOVA as PA */
+#define RTE_PLATFORM_DRV_NEED_IOVA_AS_VA 0x0001
+
+/**
+ * @internal
+ * Helper macros used to convert core device to platform device.
+ */
+#define RTE_DEV_TO_PLATFORM_DEV(ptr) \
+	container_of(ptr, struct rte_platform_device, device)
+
+#define RTE_DEV_TO_PLATFORM_DEV_CONST(ptr) \
+	container_of(ptr, const struct rte_platform_device, device)
+
+/**
+ * Register a platform device driver.
+ *
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param pdrv
+ *   A pointer to a rte_platform_driver structure describing driver to be registered.
+ */
+__rte_internal
+void rte_platform_register(struct rte_platform_driver *pdrv);
+
+/** Helper for platform driver registration. */
+#define RTE_PMD_REGISTER_PLATFORM(nm, platform_drv) \
+static const char *pdrvinit_ ## nm ## _alias; \
+RTE_INIT(pdrvinitfn_ ##nm) \
+{ \
+	(platform_drv).driver.name = RTE_STR(nm); \
+	(platform_drv).driver.alias = pdrvinit_ ## nm ## _alias; \
+	rte_platform_register(&(platform_drv)); \
+} \
+RTE_PMD_EXPORT_NAME(nm, __COUNTER__)
+
+/** Helper for setting platform driver alias. */
+#define RTE_PMD_REGISTER_ALIAS(nm, alias) \
+static const char *pdrvinit_ ## nm ## _alias = RTE_STR(alias)
+
+/**
+ * Unregister a platform device driver.
+ *
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * @param pdrv
+ *   A pointer to a rte_platform_driver structure describing driver to be unregistered.
+ */
+__rte_internal
+void rte_platform_unregister(struct rte_platform_driver *pdrv);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _BUS_PLATFORM_DRIVER_H_ */
diff --git a/drivers/bus/platform/meson.build b/drivers/bus/platform/meson.build
new file mode 100644
index 0000000000..417d7b81f8
--- /dev/null
+++ b/drivers/bus/platform/meson.build
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell.
+#
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+deps += ['kvargs']
+sources = files(
+        'platform_params.c',
+        'platform.c',
+)
+driver_sdk_headers += files('bus_platform_driver.h')
diff --git a/drivers/bus/platform/platform.c b/drivers/bus/platform/platform.c
new file mode 100644
index 0000000000..b43a5b9153
--- /dev/null
+++ b/drivers/bus/platform/platform.c
@@ -0,0 +1,604 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#include <dirent.h>
+#include <inttypes.h>
+#include <linux/vfio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <unistd.h>
+
+#include <bus_driver.h>
+#include <bus_platform_driver.h>
+#include <eal_filesystem.h>
+#include <rte_bus.h>
+#include <rte_devargs.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_string_fns.h>
+#include <rte_vfio.h>
+
+#include "private.h"
+
+#define PLATFORM_BUS_DEVICES_PATH "/sys/bus/platform/devices"
+
+void
+rte_platform_register(struct rte_platform_driver *pdrv)
+{
+	TAILQ_INSERT_TAIL(&platform_bus.driver_list, pdrv, next);
+}
+
+void
+rte_platform_unregister(struct rte_platform_driver *pdrv)
+{
+	TAILQ_REMOVE(&platform_bus.driver_list, pdrv, next);
+}
+
+static struct rte_devargs *
+dev_devargs(const char *dev_name)
+{
+	struct rte_devargs *devargs;
+
+	RTE_EAL_DEVARGS_FOREACH("platform", devargs) {
+		if (!strcmp(devargs->name, dev_name))
+			return devargs;
+	}
+
+	return NULL;
+}
+
+static bool
+dev_allowed(const char *dev_name)
+{
+	struct rte_devargs *devargs;
+
+	devargs = dev_devargs(dev_name);
+	if (devargs == NULL)
+		return true;
+
+	switch (platform_bus.bus.conf.scan_mode) {
+	case RTE_BUS_SCAN_UNDEFINED:
+	case RTE_BUS_SCAN_ALLOWLIST:
+		if (devargs->policy == RTE_DEV_ALLOWED)
+			return true;
+		break;
+	case RTE_BUS_SCAN_BLOCKLIST:
+		if (devargs->policy == RTE_DEV_BLOCKED)
+			return false;
+		break;
+	}
+
+	return true;
+}
+
+static int
+dev_add(const char *dev_name)
+{
+	struct rte_platform_device *pdev, *tmp;
+	char path[PATH_MAX];
+	unsigned long val;
+
+	pdev = calloc(1, sizeof(*pdev));
+	if (pdev == NULL)
+		return -ENOMEM;
+
+	rte_strscpy(pdev->name, dev_name, sizeof(pdev->name));
+	pdev->device.name = pdev->name;
+	pdev->device.devargs = dev_devargs(dev_name);
+	pdev->device.bus = &platform_bus.bus;
+	snprintf(path, sizeof(path), PLATFORM_BUS_DEVICES_PATH "/%s/numa_node", dev_name);
+	pdev->device.numa_node = eal_parse_sysfs_value(path, &val) ? rte_socket_id() : val;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(tmp) {
+		if (!strcmp(tmp->name, pdev->name)) {
+			PLATFORM_LOG(INFO, "device %s already added\n", pdev->name);
+			if (tmp->device.devargs != pdev->device.devargs)
+				rte_devargs_remove(pdev->device.devargs);
+			/* device is already on the list, nothing more to do */
+			free(pdev);
+			return 0;
+		}
+	}
+
+	TAILQ_INSERT_HEAD(&platform_bus.device_list, pdev, next);
+
+	PLATFORM_LOG(INFO, "adding device %s to the list\n", dev_name);
+
+	return 0;
+}
+
+static char *
+dev_kernel_driver_name(const char *dev_name)
+{
+	char path[PATH_MAX], buf[BUFSIZ] = { };
+	char *kdrv;
+	int ret;
+
+	snprintf(path, sizeof(path), PLATFORM_BUS_DEVICES_PATH "/%s/driver", dev_name);
+	/* save space for NUL */
+	ret = readlink(path, buf, sizeof(buf) - 1);
+	if (ret <= 0)
+		return NULL;
+
+	/* last token is kernel driver name */
+	kdrv = strrchr(buf, '/');
+	if (kdrv != NULL)
+		return strdup(kdrv + 1);
+
+	return NULL;
+}
+
+static bool
+dev_is_bound_vfio_platform(const char *dev_name)
+{
+	char *kdrv;
+	int ret;
+
+	kdrv = dev_kernel_driver_name(dev_name);
+	if (!kdrv)
+		return false;
+
+	ret = strcmp(kdrv, "vfio-platform");
+	free(kdrv);
+
+	return ret == 0;
+}
+
+static int
+platform_bus_scan(void)
+{
+	const struct dirent *ent;
+	const char *dev_name;
+	int ret = 0;
+	DIR *dp;
+
+	if ((dp = opendir(PLATFORM_BUS_DEVICES_PATH)) == NULL) {
+		PLATFORM_LOG(INFO, "failed to open %s\n", PLATFORM_BUS_DEVICES_PATH);
+		return -errno;
+	}
+
+	while ((ent = readdir(dp))) {
+		dev_name = ent->d_name;
+		if (dev_name[0] == '.')
+			continue;
+
+		if (!dev_allowed(dev_name))
+			continue;
+
+		if (!dev_is_bound_vfio_platform(dev_name))
+			continue;
+
+		ret = dev_add(dev_name);
+		if (ret)
+			break;
+	}
+
+	closedir(dp);
+
+	return ret;
+}
+
+static int
+device_map_resource_offset(struct rte_platform_device *pdev, struct rte_platform_resource *res,
+			   size_t offset)
+{
+	res->mem.addr = mmap(NULL, res->mem.len, PROT_READ | PROT_WRITE, MAP_SHARED, pdev->dev_fd,
+			     offset);
+	if (res->mem.addr == MAP_FAILED)
+		return -errno;
+
+	PLATFORM_LOG(DEBUG, "adding resource va = %p len = %"PRIu64" name = %s\n", res->mem.addr,
+		     res->mem.len, res->name);
+
+	return 0;
+}
+
+static void
+device_unmap_resources(struct rte_platform_device *pdev)
+{
+	struct rte_platform_resource *res;
+	unsigned int i;
+
+	for (i = 0; i < pdev->num_resource; i++) {
+		res = &pdev->resource[i];
+		munmap(res->mem.addr, res->mem.len);
+		free(res->name);
+	}
+
+	free(pdev->resource);
+	pdev->resource = NULL;
+	pdev->num_resource = 0;
+}
+
+static char *
+of_resource_name(const char *dev_name, int index)
+{
+	char path[PATH_MAX], buf[BUFSIZ] = { };
+	int num = 0, ret;
+	char *name;
+
+	snprintf(path, sizeof(path), PLATFORM_BUS_DEVICES_PATH "/%s/of_node/reg-names", dev_name);
+	ret = eal_parse_sysfs_string(path, buf, sizeof(buf) - 1);
+	if (ret)
+		return NULL;
+
+	for (name = buf; name < buf + sizeof(buf) && *name != '\0'; name += strlen(name) + 1) {
+		if (num++ != index)
+			continue;
+		return strdup(name);
+	}
+
+	return NULL;
+}
+
+static int
+device_map_resources(struct rte_platform_device *pdev, unsigned int num)
+{
+	struct rte_platform_resource *res;
+	unsigned int i;
+	int ret;
+
+	if (num == 0)
+		PLATFORM_LOG(WARNING, "device %s has no resources\n", pdev->name);
+
+	pdev->resource = calloc(num, sizeof(*pdev->resource));
+	if (pdev->resource == NULL)
+		return -ENOMEM;
+
+	for (i = 0; i < num; i++) {
+		struct vfio_region_info reg_info = {
+			.argsz = sizeof(reg_info),
+			.index = i,
+		};
+
+		ret = ioctl(pdev->dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+		if (ret) {
+			ret = -errno;
+			PLATFORM_LOG(ERR, "failed to get region info at %u\n", i);
+			goto out;
+		}
+
+		res = &pdev->resource[i];
+		res->name = of_resource_name(pdev->name, reg_info.index);
+		res->mem.len = reg_info.size;
+		ret = device_map_resource_offset(pdev, res, reg_info.offset);
+		if (ret) {
+			PLATFORM_LOG(ERR, "failed to ioremap resource at %d\n", i);
+			goto out;
+		}
+
+		pdev->num_resource++;
+	}
+
+	return 0;
+out:
+	device_unmap_resources(pdev);
+
+	return ret;
+}
+
+static void
+device_cleanup(struct rte_platform_device *pdev)
+{
+	device_unmap_resources(pdev);
+	rte_vfio_release_device(PLATFORM_BUS_DEVICES_PATH, pdev->name, pdev->dev_fd);
+}
+
+static int
+device_setup(struct rte_platform_device *pdev)
+{
+	struct vfio_device_info dev_info = { .argsz = sizeof(dev_info), };
+	const char *name = pdev->name;
+	int ret;
+
+	ret = rte_vfio_setup_device(PLATFORM_BUS_DEVICES_PATH, name, &pdev->dev_fd, &dev_info);
+	if (ret) {
+		PLATFORM_LOG(ERR, "failed to setup %s\n", name);
+		return -ENODEV;
+	}
+
+	if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PLATFORM)) {
+		PLATFORM_LOG(ERR, "device not backed by vfio-platform\n");
+		ret = -ENOTSUP;
+		goto out;
+	}
+
+	ret = device_map_resources(pdev, dev_info.num_regions);
+	if (ret) {
+		PLATFORM_LOG(ERR, "failed to setup platform resources\n");
+		goto out;
+	}
+
+	return 0;
+out:
+	device_cleanup(pdev);
+
+	return ret;
+}
+
+static int
+driver_call_probe(struct rte_platform_driver *pdrv, struct rte_platform_device *pdev)
+{
+	int ret;
+
+	if (rte_dev_is_probed(&pdev->device))
+		return -EBUSY;
+
+	if (pdrv->probe) {
+		pdev->driver = pdrv;
+		ret = pdrv->probe(pdev);
+		if (ret)
+			return ret;
+	}
+
+	pdev->device.driver = &pdrv->driver;
+
+	return 0;
+}
+
+static int
+driver_probe_device(struct rte_platform_driver *pdrv, struct rte_platform_device *pdev)
+{
+	enum rte_iova_mode iova_mode;
+	int ret;
+
+	iova_mode = rte_eal_iova_mode();
+	if (pdrv->drv_flags & RTE_PLATFORM_DRV_NEED_IOVA_AS_VA && iova_mode != RTE_IOVA_VA) {
+		PLATFORM_LOG(ERR, "driver %s expects VA IOVA mode but current mode is PA\n",
+			     pdrv->driver.name);
+		return -EINVAL;
+	}
+
+	ret = device_setup(pdev);
+	if (ret)
+		return ret;
+
+	ret = driver_call_probe(pdrv, pdev);
+	if (ret)
+		device_cleanup(pdev);
+
+	return ret;
+}
+
+static bool
+driver_match_device(struct rte_platform_driver *pdrv, struct rte_platform_device *pdev)
+{
+	bool match = false;
+	char *kdrv;
+
+	kdrv = dev_kernel_driver_name(pdev->name);
+	if (!kdrv)
+		return false;
+
+	/* match by driver name */
+	if (!strcmp(kdrv, pdrv->driver.name)) {
+		match = true;
+		goto out;
+	}
+
+	/* match by driver alias */
+	if (pdrv->driver.alias != NULL && !strcmp(kdrv, pdrv->driver.alias)) {
+		match = true;
+		goto out;
+	}
+
+	/* match by device name */
+	if (!strcmp(pdev->name, pdrv->driver.name))
+		match = true;
+
+out:
+	free(kdrv);
+
+	return match;
+}
+
+
+static int
+device_attach(struct rte_platform_device *pdev)
+{
+	struct rte_platform_driver *pdrv;
+
+	FOREACH_DRIVER_ON_PLATFORM_BUS(pdrv) {
+		if (driver_match_device(pdrv, pdev))
+			break;
+	}
+
+	if (pdrv == NULL)
+		return -ENODEV;
+
+	return driver_probe_device(pdrv, pdev);
+}
+
+static int
+platform_bus_probe(void)
+{
+	struct rte_platform_device *pdev;
+	int ret;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(pdev) {
+		ret = device_attach(pdev);
+		if (ret == -EBUSY) {
+			PLATFORM_LOG(DEBUG, "device %s already probed\n", pdev->name);
+			continue;
+		}
+		if (ret)
+			PLATFORM_LOG(ERR, "failed to probe %s\n", pdev->name);
+	}
+
+	return 0;
+}
+
+static struct rte_device *
+platform_bus_find_device(const struct rte_device *start, rte_dev_cmp_t cmp, const void *data)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = start ? RTE_TAILQ_NEXT(RTE_DEV_TO_PLATFORM_DEV_CONST(start), next) :
+		       RTE_TAILQ_FIRST(&platform_bus.device_list);
+	while (pdev) {
+		if (cmp(&pdev->device, data) == 0)
+			return &pdev->device;
+
+		pdev = RTE_TAILQ_NEXT(pdev, next);
+	}
+
+	return NULL;
+}
+
+static int
+platform_bus_plug(struct rte_device *dev)
+{
+	struct rte_platform_device *pdev;
+
+	if (!dev_allowed(dev->name))
+		return -EPERM;
+
+	if (!dev_is_bound_vfio_platform(dev->name))
+		return -EPERM;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL)
+		return -EINVAL;
+
+	return device_attach(pdev);
+}
+
+static void
+device_release_driver(struct rte_platform_device *pdev)
+{
+	struct rte_platform_driver *pdrv;
+	int ret;
+
+	pdrv = pdev->driver;
+	if (pdrv != NULL && pdrv->remove != NULL) {
+		ret = pdrv->remove(pdev);
+		if (ret)
+			PLATFORM_LOG(WARNING, "failed to remove %s\n", pdev->name);
+	}
+
+	pdev->device.driver = NULL;
+	pdev->driver = NULL;
+}
+
+static int
+platform_bus_unplug(struct rte_device *dev)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL)
+		return -EINVAL;
+
+	device_release_driver(pdev);
+	device_cleanup(pdev);
+	rte_devargs_remove(pdev->device.devargs);
+	free(pdev);
+
+	return 0;
+}
+
+static int
+platform_bus_parse(const char *name, void *addr)
+{
+	struct rte_platform_device *pdev;
+	const char **out = addr;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(pdev) {
+		if (!strcmp(name, pdev->name))
+			break;
+	}
+
+	if (pdev && addr)
+		*out = name;
+
+	return pdev ? 0 : -ENODEV;
+}
+
+static int
+platform_bus_dma_map(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL || pdev->driver == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	if (pdev->driver->dma_map != NULL)
+		return pdev->driver->dma_map(pdev, addr, iova, len);
+
+	return rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, (uint64_t)addr, iova, len);
+}
+
+static int
+platform_bus_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova, size_t len)
+{
+	struct rte_platform_device *pdev;
+
+	pdev = RTE_DEV_TO_PLATFORM_DEV(dev);
+	if (pdev == NULL || pdev->driver == NULL) {
+		rte_errno = EINVAL;
+		return -1;
+	}
+
+	if (pdev->driver->dma_unmap != NULL)
+		return pdev->driver->dma_unmap(pdev, addr, iova, len);
+
+	return rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD, (uint64_t)addr, iova,
+					    len);
+}
+
+static enum rte_iova_mode
+platform_bus_get_iommu_class(void)
+{
+	struct rte_platform_driver *pdrv;
+	struct rte_platform_device *pdev;
+
+	FOREACH_DEVICE_ON_PLATFORM_BUS(pdev) {
+		pdrv = pdev->driver;
+		if (pdrv != NULL && pdrv->drv_flags & RTE_PLATFORM_DRV_NEED_IOVA_AS_VA)
+			return RTE_IOVA_VA;
+	}
+
+	return RTE_IOVA_DC;
+}
+
+static int
+platform_bus_cleanup(void)
+{
+	struct rte_platform_device *pdev, *tmp;
+
+	RTE_TAILQ_FOREACH_SAFE(pdev, &platform_bus.device_list, next, tmp) {
+		TAILQ_REMOVE(&platform_bus.device_list, pdev, next);
+		platform_bus_unplug(&pdev->device);
+	}
+
+	return 0;
+}
+
+struct rte_platform_bus platform_bus = {
+	.bus = {
+		.scan = platform_bus_scan,
+		.probe = platform_bus_probe,
+		.find_device = platform_bus_find_device,
+		.plug = platform_bus_plug,
+		.unplug = platform_bus_unplug,
+		.parse = platform_bus_parse,
+		.dma_map = platform_bus_dma_map,
+		.dma_unmap = platform_bus_dma_unmap,
+		.get_iommu_class = platform_bus_get_iommu_class,
+		.dev_iterate = platform_bus_dev_iterate,
+		.cleanup = platform_bus_cleanup,
+	},
+	.device_list = TAILQ_HEAD_INITIALIZER(platform_bus.device_list),
+	.driver_list = TAILQ_HEAD_INITIALIZER(platform_bus.driver_list),
+};
+
+RTE_REGISTER_BUS(platform_bus, platform_bus.bus);
+RTE_LOG_REGISTER_DEFAULT(platform_bus_logtype, NOTICE);
diff --git a/drivers/bus/platform/platform_params.c b/drivers/bus/platform/platform_params.c
new file mode 100644
index 0000000000..d199c0c586
--- /dev/null
+++ b/drivers/bus/platform/platform_params.c
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#include <string.h>
+#include <errno.h>
+
+#include <rte_bus.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_errno.h>
+#include <rte_kvargs.h>
+
+#include "bus_platform_driver.h"
+#include "private.h"
+
+enum platform_params {
+	RTE_PLATFORM_PARAM_NAME,
+};
+
+static const char * const platform_params_keys[] = {
+	[RTE_PLATFORM_PARAM_NAME] = "name",
+	NULL
+};
+
+static int
+platform_dev_match(const struct rte_device *dev, const void *_kvlist)
+{
+	const char *key = platform_params_keys[RTE_PLATFORM_PARAM_NAME];
+	const struct rte_kvargs *kvlist = _kvlist;
+	const char *name;
+
+	/* no kvlist arg, all devices match */
+	if (kvlist == NULL)
+		return 0;
+
+	/* if key is present in kvlist and does not match, filter device */
+	name = rte_kvargs_get(kvlist, key);
+	if (name != NULL && strcmp(name, dev->name))
+		return -1;
+
+	return 0;
+}
+
+void *
+platform_bus_dev_iterate(const void *start, const char *str,
+			 const struct rte_dev_iterator *it __rte_unused)
+{
+	rte_bus_find_device_t find_device;
+	struct rte_kvargs *kvargs = NULL;
+	struct rte_device *dev;
+
+	if (str != NULL) {
+		kvargs = rte_kvargs_parse(str, platform_params_keys);
+		if (!kvargs) {
+			PLATFORM_LOG(ERR, "cannot parse argument list %s\n", str);
+			rte_errno = EINVAL;
+			return NULL;
+		}
+	}
+
+	find_device = platform_bus.bus.find_device;
+	if (find_device == NULL)
+		return NULL;
+
+	dev = platform_bus.bus.find_device(start, platform_dev_match, kvargs);
+	rte_kvargs_free(kvargs);
+
+	return dev;
+}
diff --git a/drivers/bus/platform/private.h b/drivers/bus/platform/private.h
new file mode 100644
index 0000000000..dcd992f8a7
--- /dev/null
+++ b/drivers/bus/platform/private.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell.
+ */
+
+#ifndef _PLATFORM_PRIVATE_H_
+#define _PLATFORM_PRIVATE_H_
+
+#include <bus_driver.h>
+#include <rte_bus.h>
+#include <rte_common.h>
+#include <rte_dev.h>
+#include <rte_log.h>
+#include <rte_os.h>
+
+#include "bus_platform_driver.h"
+
+extern struct rte_platform_bus platform_bus;
+extern int platform_bus_logtype;
+
+/* Platform bus iterators. */
+#define FOREACH_DEVICE_ON_PLATFORM_BUS(p) \
+	RTE_TAILQ_FOREACH(p, &(platform_bus.device_list), next)
+
+#define FOREACH_DRIVER_ON_PLATFORM_BUS(p) \
+	RTE_TAILQ_FOREACH(p, &(platform_bus.driver_list), next)
+
+/*
+ * Structure describing platform bus.
+ */
+struct rte_platform_bus {
+	struct rte_bus bus; /* Core bus */
+	RTE_TAILQ_HEAD(, rte_platform_device) device_list; /* List of bus devices */
+	RTE_TAILQ_HEAD(, rte_platform_driver) driver_list; /* List of bus drivers */
+};
+
+#define PLATFORM_LOG(level, ...) \
+	rte_log(RTE_LOG_ ## level, platform_bus_logtype, \
+		RTE_FMT("platform bus: " RTE_FMT_HEAD(__VA_ARGS__,), \
+			RTE_FMT_TAIL(__VA_ARGS__,)))
+
+/*
+ * Iterate registered platform devices and find one that matches provided string.
+ */
+void *
+platform_bus_dev_iterate(const void *start, const char *str,
+			 const struct rte_dev_iterator *it __rte_unused);
+
+#endif /* _PLATFORM_PRIVATE_H_ */
diff --git a/drivers/bus/platform/version.map b/drivers/bus/platform/version.map
new file mode 100644
index 0000000000..bacce4da08
--- /dev/null
+++ b/drivers/bus/platform/version.map
@@ -0,0 +1,10 @@
+DPDK_23 {
+	local: *;
+};
+
+INTERNAL {
+	global:
+
+	rte_platform_register;
+	rte_platform_unregister;
+};
-- 
2.34.1
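
The registration helper in bus_platform_driver.h relies on a constructor that appends the driver to the bus driver list before main() runs, and the bus later walks that list to find a match. A minimal self-contained sketch of that mechanism follows. All demo_* names and DEMO_REGISTER_DRIVER are hypothetical stand-ins, not part of the patch, and the constructor attribute assumes GCC or Clang.

```c
#include <stdio.h>
#include <string.h>
#include <sys/queue.h>

/* Stand-in types, much smaller than the real rte_platform_* structures. */
struct demo_device {
	const char *name;
};

struct demo_driver {
	TAILQ_ENTRY(demo_driver) next;
	const char *name;
	int (*probe)(struct demo_device *dev);
};

static TAILQ_HEAD(, demo_driver) demo_driver_list =
	TAILQ_HEAD_INITIALIZER(demo_driver_list);

static void
demo_register(struct demo_driver *drv)
{
	TAILQ_INSERT_TAIL(&demo_driver_list, drv, next);
}

/* Hypothetical counterpart of RTE_PMD_REGISTER_PLATFORM (assumption). */
#define DEMO_REGISTER_DRIVER(nm, drv) \
static void __attribute__((constructor)) demo_initfn_ ## nm(void) \
{ \
	(drv).name = #nm; \
	demo_register(&(drv)); \
}

static int
uart_probe(struct demo_device *dev)
{
	printf("probing %s\n", dev->name);
	return 0;
}

static struct demo_driver uart_driver = {
	.probe = uart_probe,
};
DEMO_REGISTER_DRIVER(demo_uart, uart_driver)

/* Walk registered drivers and probe with the first one whose name matches. */
static int
demo_attach(struct demo_device *dev)
{
	struct demo_driver *drv;

	TAILQ_FOREACH(drv, &demo_driver_list, next) {
		if (strcmp(drv->name, dev->name) == 0)
			return drv->probe(dev);
	}

	return -1;
}
```

The sketch only matches by device name. The patch additionally compares the kernel driver name against the driver name and alias.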


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
@ 2023-01-25 10:39             ` Thomas Monjalon
  2023-01-25 16:16               ` Tyler Retzlaff
  2023-01-26  8:35               ` Tomasz Duszynski
  0 siblings, 2 replies; 205+ messages in thread
From: Thomas Monjalon @ 2023-01-25 10:39 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: dev, jerinj, stephen, chenbo.xia, david.marchand, bruce.richardson

25/01/2023 11:33, Tomasz Duszynski:
> Reading strings from sysfs files is a re-occurring pattern
> hence add helper for doing that.

In general it would be nice to clean up sysfs parsing in libs and drivers,
so they all use some functions from EAL.




^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH 0/2] add platform bus
  2023-01-25 10:33         ` [PATCH 0/2] add platform bus Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 1/2] lib: add helper to read strings from sysfs files Tomasz Duszynski
  2023-01-25 10:33           ` [PATCH 2/2] bus: add platform bus Tomasz Duszynski
@ 2023-01-25 10:41           ` Tomasz Duszynski
  2 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-25 10:41 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: thomas, Jerin Jacob Kollanukkaran, stephen, chenbo.xia

This was mistakenly appended to this thread - ignore it. I've just sent the series again. 

>-----Original Message-----
>From: Tomasz Duszynski <tduszynski@marvell.com>
>Sent: Wednesday, January 25, 2023 11:33 AM
>To: dev@dpdk.org
>Cc: thomas@monjalon.net; Jerin Jacob Kollanukkaran <jerinj@marvell.com>;
>stephen@networkplumber.org; chenbo.xia@intel.com; Tomasz Duszynski <tduszynski@marvell.com>
>Subject: [PATCH 0/2] add platform bus
>
>The platform bus is a Linux bus which manages devices that do not have any built-in discovery
>mechanism. Linux learns about platform devices directly from the device tree during boot.
>
>Afterwards, if userspace wants to use a particular device, a driver, usually a mixture of
>vdev/rawdev, gets developed.
>
>To simplify that, introduce a DPDK platform bus which provides auto-probing and
>separates the bus logic from the driver itself.
>
>For now, only devices backed by the vfio-platform kernel driver are supported, though other
>options may be added if necessary.
>
>Tomasz Duszynski (2):
>  lib: add helper to read strings from sysfs files
>  bus: add platform bus
>
> MAINTAINERS                                |   4 +
> app/test/test_eal_fs.c                     | 108 +++-
> doc/guides/rel_notes/release_23_03.rst     |   5 +
> drivers/bus/meson.build                    |   1 +
> drivers/bus/platform/bus_platform_driver.h | 174 ++++++
> drivers/bus/platform/meson.build           |  16 +
> drivers/bus/platform/platform.c            | 604 +++++++++++++++++++++
> drivers/bus/platform/platform_params.c     |  70 +++
> drivers/bus/platform/private.h             |  48 ++
> drivers/bus/platform/version.map           |  10 +
> lib/eal/common/eal_filesystem.h            |   6 +
> lib/eal/unix/eal_filesystem.c              |  24 +-
> lib/eal/version.map                        |   1 +
> 13 files changed, 1053 insertions(+), 18 deletions(-)
> create mode 100644 drivers/bus/platform/bus_platform_driver.h
> create mode 100644 drivers/bus/platform/meson.build
> create mode 100644 drivers/bus/platform/platform.c
> create mode 100644 drivers/bus/platform/platform_params.c
> create mode 100644 drivers/bus/platform/private.h
> create mode 100644 drivers/bus/platform/version.map
>
>--
>2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:39             ` Thomas Monjalon
@ 2023-01-25 16:16               ` Tyler Retzlaff
  2023-01-26  8:30                 ` [EXT] " Tomasz Duszynski
  2023-01-26  8:35               ` Tomasz Duszynski
  1 sibling, 1 reply; 205+ messages in thread
From: Tyler Retzlaff @ 2023-01-25 16:16 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Tomasz Duszynski, dev, jerinj, stephen, chenbo.xia,
	david.marchand, bruce.richardson

On Wed, Jan 25, 2023 at 11:39:30AM +0100, Thomas Monjalon wrote:
> 25/01/2023 11:33, Tomasz Duszynski:
> > Reading strings from sysfs files is a re-occurring pattern
> > hence add helper for doing that.
> 
> In general it would be nice to clean up sysfs parsing in libs and drivers,
> so they all use some functions from EAL.

maybe there should be a general utility library for dealing with sysfs
separate from the core EAL that drivers / platform specific libs can
share?

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 16:16               ` Tyler Retzlaff
@ 2023-01-26  8:30                 ` Tomasz Duszynski
  2023-01-26 17:21                   ` Tyler Retzlaff
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  8:30 UTC (permalink / raw)
  To: Tyler Retzlaff, Thomas Monjalon
  Cc: dev, Jerin Jacob Kollanukkaran, stephen, chenbo.xia,
	david.marchand, bruce.richardson


>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Wednesday, January 25, 2023 5:16 PM
>To: Thomas Monjalon <thomas@monjalon.net>
>Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; stephen@networkplumber.org; chenbo.xia@intel.com; david.marchand@redhat.com;
>bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
>
>External Email
>
>----------------------------------------------------------------------
>On Wed, Jan 25, 2023 at 11:39:30AM +0100, Thomas Monjalon wrote:
>> 25/01/2023 11:33, Tomasz Duszynski:
>> > Reading strings from sysfs files is a re-occurring pattern hence add
>> > helper for doing that.
>>
>> In general it would be nice to clean up sysfs parsing in libs and
>> drivers, so they all use some functions from EAL.
>
>maybe there should be a general utility library for dealing with sysfs separate from the core EAL
>that drivers / platform specific libs can share?

Reading/writing of sysfs files is scattered around the codebase and this has been piling up
with each and every new pmd/lib that requires it. So generally a few simple utility functions
in one place may be a good idea.

Would the following make sense?

rte_sysfs_write_int()
rte_sysfs_write_string()
rte_sysfs_read_int()
rte_sysfs_read_string()

It also seems that the pattern where a file gets opened once and keeps being written to until
closed is recurring as well. So there might be some utils for that too. Thoughts?
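
As a rough illustration of what such helpers might look like, here is a plain C sketch. The sysfs_* names and signatures below are hypothetical and only loosely follow the names proposed above. They are not an existing DPDK API, and error handling is kept deliberately simple.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: read the first line of a sysfs file into buf. */
static int
sysfs_read_string(const char *path, char *buf, size_t size)
{
	FILE *fp;
	char *ret;

	if (size == 0)
		return -EINVAL;

	fp = fopen(path, "r");
	if (fp == NULL)
		return -errno;

	ret = fgets(buf, (int)size, fp);
	fclose(fp);
	if (ret == NULL)
		return -EIO;

	/* sysfs attributes usually end with a newline, strip it */
	buf[strcspn(buf, "\n")] = '\0';

	return 0;
}

/* Hypothetical helper: read an integer in any base accepted by strtol(). */
static int
sysfs_read_int(const char *path, long *val)
{
	char buf[64], *end;
	int ret;

	ret = sysfs_read_string(path, buf, sizeof(buf));
	if (ret)
		return ret;

	errno = 0;
	*val = strtol(buf, &end, 0);
	if (errno != 0 || end == buf || *end != '\0')
		return -EINVAL;

	return 0;
}

/* Hypothetical helper: write a string to a sysfs file. */
static int
sysfs_write_string(const char *path, const char *str)
{
	FILE *fp;
	int ret = 0;

	fp = fopen(path, "w");
	if (fp == NULL)
		return -errno;

	if (fputs(str, fp) == EOF)
		ret = -EIO;
	if (fclose(fp) == EOF && ret == 0)
		ret = -EIO;

	return ret;
}
```

A write_int variant could format the value with snprintf() and call the string writer. The open-once, write-many pattern would need a handle-based API instead, which this sketch does not cover.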

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-25 10:39             ` Thomas Monjalon
  2023-01-25 16:16               ` Tyler Retzlaff
@ 2023-01-26  8:35               ` Tomasz Duszynski
  1 sibling, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  8:35 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Jerin Jacob Kollanukkaran, stephen, chenbo.xia,
	david.marchand, bruce.richardson



>-----Original Message-----
>From: Thomas Monjalon <thomas@monjalon.net>
>Sent: Wednesday, January 25, 2023 11:40 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; Jerin Jacob Kollanukkaran <jerinj@marvell.com>; stephen@networkplumber.org;
>chenbo.xia@intel.com; david.marchand@redhat.com; bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
>
>External Email
>
>----------------------------------------------------------------------
>25/01/2023 11:33, Tomasz Duszynski:
>> Reading strings from sysfs files is a re-occurring pattern hence add
>> helper for doing that.
>
>In general it would be nice to clean up sysfs parsing in libs and drivers, so they all use some
>functions from EAL.
>

That's generally true. Here I wanted to avoid tree-wide changes caused by unrelated work, i.e. a new bus,
and do the cleanup, i.e. use this read string util where applicable, later on.
 

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-20 18:29             ` Tyler Retzlaff
@ 2023-01-26  9:05               ` Tomasz Duszynski
  0 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  9:05 UTC (permalink / raw)
  To: Tyler Retzlaff
  Cc: dev, Thomas Monjalon, Jerin Jacob Kollanukkaran, mb,
	Ruifeng.Wang, mattias.ronnblom, zhoumin, bruce.richardson



>-----Original Message-----
>From: Tyler Retzlaff <roretzla@linux.microsoft.com>
>Sent: Friday, January 20, 2023 7:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>
>Cc: dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>; Jerin Jacob Kollanukkaran
><jerinj@marvell.com>; mb@smartsharesystems.com; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>External Email
>
>----------------------------------------------------------------------
>On Fri, Jan 20, 2023 at 12:39:12AM +0100, Tomasz Duszynski wrote:
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>>  MAINTAINERS                            |   5 +
>>  app/test/meson.build                   |   4 +
>>  app/test/test_pmu.c                    |  42 +++
>>  doc/api/doxy-api-index.md              |   3 +-
>>  doc/api/doxy-api.conf.in               |   1 +
>>  doc/guides/prog_guide/profile_app.rst  |   8 +
>>  doc/guides/rel_notes/release_23_03.rst |   7 +
>>  lib/meson.build                        |   1 +
>>  lib/pmu/meson.build                    |  13 +
>>  lib/pmu/pmu_private.h                  |  29 ++
>>  lib/pmu/rte_pmu.c                      | 436 +++++++++++++++++++++++++
>>  lib/pmu/rte_pmu.h                      | 206 ++++++++++++
>>  lib/pmu/version.map                    |  19 ++
>>  13 files changed, 773 insertions(+), 1 deletion(-)
>>  create mode 100644 app/test/test_pmu.c
>>  create mode 100644 lib/pmu/meson.build
>>  create mode 100644 lib/pmu/pmu_private.h
>>  create mode 100644 lib/pmu/rte_pmu.c
>>  create mode 100644 lib/pmu/rte_pmu.h
>>  create mode 100644 lib/pmu/version.map
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 9a0f416d2e..9f13eafd95 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
>>  M: Pavan Nikhilesh <pbhagavatula@marvell.com>
>>  F: lib/node/
>>
>> +PMU - EXPERIMENTAL
>> +M: Tomasz Duszynski <tduszynski@marvell.com>
>> +F: lib/pmu/
>> +F: app/test/test_pmu*
>> +
>>
>>  Test Applications
>>  -----------------
>> diff --git a/app/test/meson.build b/app/test/meson.build
>> index f34d19e3c3..b2c2a618b1 100644
>> --- a/app/test/meson.build
>> +++ b/app/test/meson.build
>> @@ -360,6 +360,10 @@ if dpdk_conf.has('RTE_LIB_METRICS')
>>      test_sources += ['test_metrics.c']
>>      fast_tests += [['metrics_autotest', true, true]]
>>  endif
>> +if is_linux
>> +    test_sources += ['test_pmu.c']
>> +    fast_tests += [['pmu_autotest', true, true]]
>> +endif
>
>traditionally we don't conditionally include tests at the meson.build level, instead we run all
>tests and have them skip when executed for unsupported exec environments.
>
>you can take a look at test_eventdev.c as an example for a test that is skipped on windows, i'm
>sure it could be adapted to skip on freebsd if you aren't supporting it.

Right, this looks better. Thanks for pointing this out. 

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-20  9:46             ` Morten Brørup
@ 2023-01-26  9:40               ` Tomasz Duszynski
  2023-01-26 12:29                 ` Morten Brørup
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-26  9:40 UTC (permalink / raw)
  To: Morten Brørup, dev, Thomas Monjalon
  Cc: Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson, roretzla

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Friday, January 20, 2023 10:47 AM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Friday, 20 January 2023 00.39
>>
>> Add support for programming PMU counters and reading their values in
>> runtime bypassing kernel completely.
>>
>> This is especially useful in cases where CPU cores are isolated
>> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> standard perf utility without sacrificing latency and performance.
>>
>> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> ---
>
>If you insist on passing lcore_id around as a function parameter, the function description must
>mention that the lcore_id parameter must be set to rte_lcore_id() for the functions where this is a
>requirement, including all functions that use those functions.
>

Not sure why you are insisting so much on removing that rte_lcore_id(). Yes, that macro evaluates
to an integer, but if you don't think about the internals it resembles a function call.

Then the natural pattern is to call it once and reuse the result if possible. Passing lcore_id around
implies that calls are per l-core; why would that confuse anyone reading that code?

Besides, all functions taking it are internal, hence you cannot call them elsewhere.

>Alternatively, follow my previous suggestion: Omit the lcore_id function parameter, and use
>rte_lcore_id() instead.
>


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26  9:40               ` Tomasz Duszynski
@ 2023-01-26 12:29                 ` Morten Brørup
  2023-01-26 12:59                   ` Bruce Richardson
  2023-01-26 15:17                   ` Tomasz Duszynski
  0 siblings, 2 replies; 205+ messages in thread
From: Morten Brørup @ 2023-01-26 12:29 UTC (permalink / raw)
  To: Tomasz Duszynski, dev, Thomas Monjalon
  Cc: Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson, roretzla

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 26 January 2023 10.40
> 
> >From: Morten Brørup <mb@smartsharesystems.com>
> >Sent: Friday, January 20, 2023 10:47 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Friday, 20 January 2023 00.39
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >If you insist on passing lcore_id around as a function parameter, the
> function description must
> >mention that the lcore_id parameter must be set to rte_lcore_id() for
> the functions where this is a
> >requirement, including all functions that use those functions.

Perhaps I'm stating this wrong, so let me try to rephrase:

As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.

Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.

[1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/

> >
> 
> Not sure why are you insisting so much on removing that rte_lcore_id().
> Yes that macro evaluates
> to integer but if you don't think about internals this resembles a
> function call.

I agree with this argument. And for that reason, passing lcore_id around could be relevant.

I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.

> 
> Then natural pattern is to call it once and reuse results if possible.

Yes, and I would usually agree to using this pattern.

> Passing lcore_id around
> implies that calls are per l-core, why would that confuse anyone
> reading that code?

This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!

E.g. rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().

[2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315

> 
> Besides, all functions taking it are internal stuff hence you cannot
> call it elsewhere.

OK. I agree that this reduces the risk of incorrect use.

Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.

And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.

It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).
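
To make the two options under discussion concrete, here is a minimal sketch in plain C. It is not taken from the patch: sketch_lcore_id() is a hypothetical stand-in for rte_lcore_id(), and enable_group_param(), enable_group_self() and the group_enabled table are illustrative names only.

```c
#include <assert.h>

#define SKETCH_MAX_LCORE 4

/* Hypothetical stand-in for rte_lcore_id(): the id of the calling thread. */
static _Thread_local unsigned int current_lcore_id;

/* Illustrative per-lcore state, not the library's real data structure. */
static int group_enabled[SKETCH_MAX_LCORE];

static unsigned int
sketch_lcore_id(void)
{
	return current_lcore_id;
}

/*
 * Variant 1: explicit parameter. The signature suggests any lcore id is
 * acceptable, so the hidden requirement has to be documented and checked.
 */
static int
enable_group_param(unsigned int lcore_id)
{
	if (lcore_id >= SKETCH_MAX_LCORE)
		return -1;
	/* Special requirement: must be called on the lcore it configures. */
	if (lcore_id != sketch_lcore_id())
		return -1;

	group_enabled[lcore_id] = 1;
	return 0;
}

/*
 * Variant 2: no parameter. The function fetches the caller's id itself,
 * so it cannot be asked to configure a different lcore.
 */
static int
enable_group_self(void)
{
	unsigned int lcore_id = sketch_lcore_id();

	assert(lcore_id < SKETCH_MAX_LCORE);
	group_enabled[lcore_id] = 1;
	return 0;
}
```

With variant 1 a call such as enable_group_param(2) made from lcore 1 compiles fine and can only be rejected at run time; with variant 2 that call cannot be written at all.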

> 
> >Alternatively, follow my previous suggestion: Omit the lcore_id
> function parameter, and use
> >rte_lcore_id() instead.
> >
> 


^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26 12:29                 ` Morten Brørup
@ 2023-01-26 12:59                   ` Bruce Richardson
  2023-01-26 15:28                     ` [EXT] " Tomasz Duszynski
  2023-01-26 15:17                   ` Tomasz Duszynski
  1 sibling, 1 reply; 205+ messages in thread
From: Bruce Richardson @ 2023-01-26 12:59 UTC (permalink / raw)
  To: Morten Brørup
  Cc: Tomasz Duszynski, dev, Thomas Monjalon,
	Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, roretzla

On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > Sent: Thursday, 26 January 2023 10.40
> > 
> > >From: Morten Brørup <mb@smartsharesystems.com>
> > >Sent: Friday, January 20, 2023 10:47 AM
> > >
> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> > >> Sent: Friday, 20 January 2023 00.39
> > >>
> > >> Add support for programming PMU counters and reading their values in
> > >> runtime bypassing kernel completely.
> > >>
> > >> This is especially useful in cases where CPU cores are isolated
> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> > >> standard perf utility without sacrificing latency and performance.
> > >>
> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> > >> ---
> > >
> > >If you insist on passing lcore_id around as a function parameter, the
> > function description must
> > >mention that the lcore_id parameter must be set to rte_lcore_id() for
> > the functions where this is a
> > >requirement, including all functions that use those functions.
> 
> Perhaps I'm stating this wrong, so let me try to rephrase:
> 
> As I understand it, some of the setup functions must be called from the EAL thread that executes that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread itself.
> 
> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter, which you at the bottom of this email [1] explained is not possible.
> 
> [1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
> 
> > >
> > 
> > Not sure why are you insisting so much on removing that rte_lcore_id().
> > Yes that macro evaluates
> > to integer but if you don't think about internals this resembles a
> > function call.
> 
> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
> 
> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an alternative to passing it as an argument.
> 
> > 
> > Then natural pattern is to call it once and reuse results if possible.
> 
> Yes, and I would usually agree to using this pattern.
> 
> > Passing lcore_id around
> > implies that calls are per l-core, why would that confuse anyone
> > reading that code?
> 
> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the function is running on that lcore!
> 
> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a parameter, and does not assume that lcore_id==rte_lcore_id().
> 
> [2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
> 
> > 
> > Besides, all functions taking it are internal stuff hence you cannot
> > call it elsewhere.
> 
> OK. I agree that this reduces the risk of incorrect use.
> 
> Generally, I think that internal functions should be documented too. Not to the full extent, like public functions, but some documentation is nice.
> 
> And if there are special requirements to a function parameter, it should be documented with that function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special requirement.
> 
> It might just be me worrying too much, so... If nobody else complains about this, I can live with it as is. Assuming that none of the public functions have this special requirement (either directly or indirectly, by calling functions with the special requirement).
> 
I would tend to agree with you Morten. If the lcore_id parameter to the
function must be rte_lcore_id(), then I think it's error prone to have that
as an explicit parameter, and that the function should always get the core
id itself.

Another possible complication: how does this work with threads that are
not pinned to a particular physical core? Do things work as expected in
that case?

/Bruce

^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26 12:29                 ` Morten Brørup
  2023-01-26 12:59                   ` Bruce Richardson
@ 2023-01-26 15:17                   ` Tomasz Duszynski
  1 sibling, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-26 15:17 UTC (permalink / raw)
  To: Morten Brørup, dev, Thomas Monjalon
  Cc: Jerin Jacob Kollanukkaran, Ruifeng.Wang, mattias.ronnblom,
	zhoumin, bruce.richardson, roretzla

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Thursday, January 26, 2023 1:30 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>
>Cc: Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; bruce.richardson@intel.com;
>roretzla@linux.microsoft.com
>Subject: [EXT] RE: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Thursday, 26 January 2023 10.40
>>
>> >From: Morten Brørup <mb@smartsharesystems.com>
>> >Sent: Friday, January 20, 2023 10:47 AM
>> >
>> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> >> Sent: Friday, 20 January 2023 00.39
>> >>
>> >> Add support for programming PMU counters and reading their values
>> >> in runtime bypassing kernel completely.
>> >>
>> >> This is especially useful in cases where CPU cores are isolated
>> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> >> standard perf utility without sacrificing latency and performance.
>> >>
>> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> >> ---
>> >
>> >If you insist on passing lcore_id around as a function parameter, the
>> function description must
>> >mention that the lcore_id parameter must be set to rte_lcore_id() for
>> the functions where this is a
>> >requirement, including all functions that use those functions.
>
>Perhaps I'm stating this wrong, so let me try to rephrase:
>
>As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>
>Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to call
>those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a parameter,
>which you at the bottom of this email [1] explained is not possible.
>
>[1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
>
>> >
>>
>> Not sure why are you insisting so much on removing that rte_lcore_id().
>> Yes that macro evaluates
>> to integer but if you don't think about internals this resembles a
>> function call.
>
>I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>
>I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>
>>
>> Then natural pattern is to call it once and reuse results if possible.
>
>Yes, and I would usually agree to using this pattern.
>
>> Passing lcore_id around
>> implies that calls are per l-core, why would that confuse anyone
>> reading that code?
>
>This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>
>E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>

Oh, now I got your point!

Okay then, if this is going to cause confusion because of misleading
self-documenting code I'll change that.  

>[2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>
>>
>> Besides, all functions taking it are internal stuff hence you cannot
>> call it elsewhere.
>
>OK. I agree that this reduces the risk of incorrect use.
>
>Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>
>And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>
>It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>
>>
>> >Alternatively, follow my previous suggestion: Omit the lcore_id
>> function parameter, and use
>> >rte_lcore_id() instead.
>> >
>>


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
  2023-01-26 12:59                   ` Bruce Richardson
@ 2023-01-26 15:28                     ` Tomasz Duszynski
  2023-02-02 14:27                       ` Morten Brørup
  0 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-01-26 15:28 UTC (permalink / raw)
  To: Bruce Richardson, Morten Brørup
  Cc: dev, Thomas Monjalon, Jerin Jacob Kollanukkaran, Ruifeng.Wang,
	mattias.ronnblom, zhoumin, roretzla



>-----Original Message-----
>From: Bruce Richardson <bruce.richardson@intel.com>
>Sent: Thursday, January 26, 2023 1:59 PM
>To: Morten Brørup <mb@smartsharesystems.com>
>Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Thomas Monjalon <thomas@monjalon.net>;
>Jerin Jacob Kollanukkaran <jerinj@marvell.com>; Ruifeng.Wang@arm.com;
>mattias.ronnblom@ericsson.com; zhoumin@loongson.cn; roretzla@linux.microsoft.com
>Subject: [EXT] Re: [PATCH v6 1/4] lib: add generic support for reading PMU events
>
>On Thu, Jan 26, 2023 at 01:29:36PM +0100, Morten Brørup wrote:
>> > From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > Sent: Thursday, 26 January 2023 10.40
>> >
>> > >From: Morten Brørup <mb@smartsharesystems.com>
>> > >Sent: Friday, January 20, 2023 10:47 AM
>> > >
>> > >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> > >> Sent: Friday, 20 January 2023 00.39
>> > >>
>> > >> Add support for programming PMU counters and reading their values
>> > >> in runtime bypassing kernel completely.
>> > >>
>> > >> This is especially useful in cases where CPU cores are isolated
>> > >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
>> > >> standard perf utility without sacrificing latency and performance.
>> > >>
>> > >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
>> > >> ---
>> > >
>> > >If you insist on passing lcore_id around as a function parameter,
>> > >the
>> > function description must
>> > >mention that the lcore_id parameter must be set to rte_lcore_id()
>> > >for
>> > the functions where this is a
>> > >requirement, including all functions that use those functions.
>>
>> Perhaps I'm stating this wrong, so let me try to rephrase:
>>
>> As I understand it, some of the setup functions must be called from the EAL thread that executes
>that function - due to some syscall (SYS_perf_event_open) needing to be called from the thread
>itself.
>>
>> Those functions should not take an lcore_id parameter. Otherwise, I would expect to be able to
>call those functions from e.g. the main thread and pass the lcore_id of any EAL thread as a
>parameter, which you at the bottom of this email [1] explained is not possible.
>>
>> [1]: http://inbox.dpdk.org/dev/DM4PR18MB4368461EC42603F77A7DC1BCD2E09@DM4PR18MB4368.namprd18.prod.outlook.com/
>>
>> > >
>> >
>> > Not sure why are you insisting so much on removing that rte_lcore_id().
>> > Yes that macro evaluates
>> > to integer but if you don't think about internals this resembles a
>> > function call.
>>
>> I agree with this argument. And for that reason, passing lcore_id around could be relevant.
>>
>> I only wanted to bring your attention to the low cost of fetching it inside the functions, as an
>alternative to passing it as an argument.
>>
>> >
>> > Then natural pattern is to call it once and reuse results if possible.
>>
>> Yes, and I would usually agree to using this pattern.
>>
>> > Passing lcore_id around
>> > implies that calls are per l-core, why would that confuse anyone
>> > reading that code?
>>
>> This is where I disagree: Passing lcore_id as a parameter to a function does NOT imply that the
>function is running on that lcore!
>>
>> E.g rte_mempool_default_cache(struct rte_mempool *mp, unsigned lcore_id) [2] takes lcore_id as a
>parameter, and does not assume that lcore_id==rte_lcore_id().
>>
>> [2]: https://elixir.bootlin.com/dpdk/latest/source/lib/mempool/rte_mempool.h#L1315
>>
>> >
>> > Besides, all functions taking it are internal stuff hence you cannot
>> > call it elsewhere.
>>
>> OK. I agree that this reduces the risk of incorrect use.
>>
>> Generally, I think that internal functions should be documented too. Not to the full extent, like
>public functions, but some documentation is nice.
>>
>> And if there are special requirements to a function parameter, it should be documented with that
>function. Requiring that the lcore_id parameter must be == rte_lcore_id() is certainly a special
>requirement.
>>
>> It might just be me worrying too much, so... If nobody else complains about this, I can live with
>it as is. Assuming that none of the public functions have this special requirement (either directly
>or indirectly, by calling functions with the special requirement).
>>
>I would tend to agree with you Morten. If the lcore_id parameter to the function must be
>rte_lcore_id(), then I think it's error prone to have that as an explicit parameter, and that the
>function should always get the core id itself.
>
>Other possible complication is - how does this work with threads that are not pinned to a
>particular physical core? Do things work as expected in that case?
>

It's assumed that once a set of counters is enabled on a particular l-core, the thread shouldn't be
migrating back and forth, for obvious reasons.

But once scheduled elsewhere, everything should still work as expected.

>/Bruce

^ permalink raw reply	[flat|nested] 205+ messages in thread

* Re: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
  2023-01-26  8:30                 ` [EXT] " Tomasz Duszynski
@ 2023-01-26 17:21                   ` Tyler Retzlaff
  0 siblings, 0 replies; 205+ messages in thread
From: Tyler Retzlaff @ 2023-01-26 17:21 UTC (permalink / raw)
  To: Tomasz Duszynski
  Cc: Thomas Monjalon, dev, Jerin Jacob Kollanukkaran, stephen,
	chenbo.xia, david.marchand, bruce.richardson

On Thu, Jan 26, 2023 at 08:30:01AM +0000, Tomasz Duszynski wrote:
> 
> >-----Original Message-----
> >From: Tyler Retzlaff <roretzla@linux.microsoft.com>
> >Sent: Wednesday, January 25, 2023 5:16 PM
> >To: Thomas Monjalon <thomas@monjalon.net>
> >Cc: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org; Jerin Jacob Kollanukkaran
> ><jerinj@marvell.com>; stephen@networkplumber.org; chenbo.xia@intel.com; david.marchand@redhat.com;
> >bruce.richardson@intel.com
> >Subject: [EXT] Re: [PATCH 1/2] lib: add helper to read strings from sysfs files
> >
> >On Wed, Jan 25, 2023 at 11:39:30AM +0100, Thomas Monjalon wrote:
> >> 25/01/2023 11:33, Tomasz Duszynski:
> >> > Reading strings from sysfs files is a re-occurring pattern hence add
> >> > helper for doing that.
> >>
> >> In general it would be nice to clean up sysfs parsing in libs and
> >> drivers, so they all use some functions from EAL.
> >
> >maybe there should be a general utility library for dealing with sysfs separate from the core EAL
> >that drivers / platform specific libs can share?
> 
> reading/writing of sysfs files is scattered around the codebase and this has been piling up
> with each and every new pmd/lib that requires it. So generally a few simple utility functions
> in one place may be a good idea.

i'm an advocate of smaller libraries that tackle a subject area and do
so well. even better if they can be unit tested without dragging in a
lot of dependencies or bootstrapping other unrelated subsystems.

it is also in alignment with trying to de-bloat eal which i think there
is increasing interest in.

> 
> Would following make sense?
> 
> rte_sysfs_write_int()
> rte_sysfs_write_string()
> rte_sysfs_read_int()
> rte_sysfs_read_string() 
> 
> It also seems that the pattern where a file gets opened once and keeps being written to until
> closed is recurring as well. So there might be some utils for that as well. Thoughts?

i guess the answer here is whatever makes a simple intuitive api for
sysfs access, i don't contribute much on the linux side to dpdk so can't
speak to what makes a good api here, but i imagine others can in review.

thanks
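
For readers following the proposal, the read side of such helpers could look roughly like the sketch below. The names rte_sysfs_read_string() and rte_sysfs_read_int() above are only suggestions from this thread, so the functions here use a sketch_ prefix and hypothetical signatures; they rely on standard C stdio plus the POSIX errno values provided by glibc, and are not an existing DPDK API.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: read the first line of a sysfs-style file into buf
 * and strip the trailing newline. Returns 0 on success, negative on error.
 */
static int
sketch_sysfs_read_string(const char *path, char *buf, size_t len)
{
	FILE *fp;

	assert(path != NULL && buf != NULL && len > 0);

	fp = fopen(path, "r");
	if (fp == NULL)
		return errno != 0 ? -errno : -EIO;

	if (fgets(buf, (int)len, fp) == NULL) {
		/* Empty file or read error. */
		fclose(fp);
		return -EIO;
	}
	fclose(fp);

	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Hypothetical helper built on top of the string reader. Base 0 lets
 * strtol() accept both decimal and 0x-prefixed hexadecimal values.
 */
static int
sketch_sysfs_read_int(const char *path, long *val)
{
	char buf[64], *end;
	int ret;

	ret = sketch_sysfs_read_string(path, buf, sizeof(buf));
	if (ret < 0)
		return ret;

	errno = 0;
	*val = strtol(buf, &end, 0);
	if (errno != 0 || end == buf || *end != '\0')
		return -EINVAL;

	return 0;
}
```

Layering the integer reader on the string reader keeps the file handling in one place, which is the point of collecting these helpers into a single small library.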

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v7 0/4] add support for self monitoring
  2023-01-19 23:39         ` [PATCH v6 " Tomasz Duszynski
                             ` (3 preceding siblings ...)
  2023-01-19 23:39           ` [PATCH v6 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-01 13:17           ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                               ` (5 more replies)
  4 siblings, 6 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, Tomasz Duszynski

This series adds self monitoring support i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU i.e. one measuring CPU events. As of today
raw events are not supported.
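
As an illustration of the name lookup described above, the sketch below builds the sysfs path for a named event and extracts one numeric term from an event description. It is not code from this series: the function names are hypothetical, and the "name=value" comma-separated description format (for example "event=0x3c,umask=0x01") is an assumption based on how the kernel usually exposes these files.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build the sysfs path of a named event. Returns 0, or -1 if buf is too small. */
static int
sketch_event_path(const char *pmu, const char *event, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/bus/event_source/devices/%s/events/%s",
			 pmu, event);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Find the term called name in a comma-separated description and store
 * its numeric value in val. Returns 0 on success, -1 if it is missing.
 */
static int
sketch_parse_term(const char *desc, const char *name, unsigned long *val)
{
	size_t nlen = strlen(name);
	const char *p = desc;

	while (*p != '\0') {
		if (strncmp(p, name, nlen) == 0 && p[nlen] == '=') {
			const char *num = p + nlen + 1;
			char *end;

			/* Base 0 accepts decimal and 0x-prefixed values. */
			*val = strtoul(num, &end, 0);
			return end == num ? -1 : 0;
		}

		/* Skip to the next comma-separated term, if any. */
		p = strchr(p, ',');
		if (p == NULL)
			break;
		p++;
	}

	return -1;
}
```

For example, sketch_event_path("cpu", "cpu-cycles", ...) yields /sys/bus/event_source/devices/cpu/events/cpu-cycles, and parsing the assumed description "event=0x3c,umask=0x01" for the term "event" gives 0x3c.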

v7:
- use per-lcore event group instead of global table indexed by lcore-id
- don't add pmu_autotest to fast tests because of lack of support on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  61 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   7 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   3 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 525 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 225 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  21 +
 23 files changed, 1138 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v7 1/4] lib: add generic support for reading PMU events
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                               ` (4 subsequent siblings)
  5 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full) i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   1 +
 app/test/test_pmu.c                    |  55 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   7 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 205 +++++++++++
 lib/pmu/version.map                    |  20 ++
 13 files changed, 811 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..7b6b69dcf1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..b30db35724
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set
+of programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 84b112a8b1..7e6062022a 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -55,6 +55,13 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 * **Added multi-process support for axgbe PMD.**
 
 * **Updated Corigine nfp driver.**
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..4cf3161155
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,464 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit, e.g. in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..e360375a0c
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..50fb0f354e
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,20 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1



* [PATCH v7 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                               ` (3 subsequent siblings)
  5 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
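Note (not part of the commit): pmu_arch_init() and pmu_arch_fini() in
pmu_arm64.c follow a save, enable, restore pattern around
/proc/sys/kernel/perf_user_access. Below is a minimal standalone sketch
of that pattern. It uses stdio instead of open()/read()/write(), and the
helper names read_int_attr(), write_int_attr(), enable_attr() and
restore_attr() are assumptions made for illustration, not code from the
patch. The path is a parameter so the sketch can be exercised against an
ordinary file.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: read a decimal integer from a file. */
static int
read_int_attr(const char *path, int *val)
{
	FILE *fp;
	int ret = 0;

	assert(path != NULL && val != NULL);

	fp = fopen(path, "r");
	if (fp == NULL)
		return -errno;

	if (fscanf(fp, "%d", val) != 1)
		ret = -EINVAL;
	fclose(fp);

	return ret;
}

/* Hypothetical helper: write a decimal integer to a file. */
static int
write_int_attr(const char *path, int val)
{
	FILE *fp;
	int ret = 0;

	assert(path != NULL);

	fp = fopen(path, "w");
	if (fp == NULL)
		return -errno;

	if (fprintf(fp, "%d", val) < 0)
		ret = -EIO;
	if (fclose(fp) != 0 && ret == 0)
		ret = -errno;

	return ret;
}

/* Remember the current value in *saved, then set the attribute to 1. */
static int
enable_attr(const char *path, int *saved)
{
	int ret;

	ret = read_int_attr(path, saved);
	if (ret)
		return ret;

	/* already enabled, nothing to change */
	if (*saved == 1)
		return 0;

	return write_int_attr(path, 1);
}

/* Put back whatever value was there before enable_attr(). */
static int
restore_attr(const char *path, int saved)
{
	return write_int_attr(path, saved);
}
```

Writing the real sysctl typically requires elevated privileges, so a
caller would be expected to treat a failure from enable_attr() as "user
access to counters is unavailable".
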
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index b30db35724..c53a1bc2f1 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf));
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		close(fd);
+
+		return -errno;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index e360375a0c..b18938dab1 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1



* [PATCH v7 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:17             ` [PATCH v7 4/4] eal: add PMU support to tracing library Tomasz Duszynski
                               ` (2 subsequent siblings)
  5 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
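Note (not part of the commit): rte_pmu_pmc_read() in this patch joins
the two 32-bit halves returned by rdpmc, and rte_pmu_read_userpage()
from patch 1/4 then sign-extends the result to pmc_width bits before
adding it to the offset taken from the mmapped page. Below is a minimal
standalone sketch of that arithmetic, with no inline assembly. The
helper name counter_value() is an assumption made for illustration. The
left shift is done on an unsigned value to avoid signed overflow, and
the right shift relies on arithmetic shifting of negative values, which
GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): combine the low/high halves
 * of a raw counter, sign-extend it from `width` bits and add `offset`.
 */
static uint64_t
counter_value(uint32_t low, uint32_t high, uint64_t offset, unsigned int width)
{
	int64_t pmc;

	assert(width >= 1 && width <= 64);

	/* rdpmc returns the counter split across EAX (low) and EDX (high) */
	pmc = (int64_t)(((uint64_t)high << 32) | low);

	/* drop bits above `width`, then sign-extend the remaining ones */
	pmc = (int64_t)((uint64_t)pmc << (64 - width));
	pmc >>= 64 - width;

	return offset + (uint64_t)pmc;
}
```

For a 48-bit counter, a raw value with all 48 bits set is treated as -1,
so the sum ends up one below the offset.
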
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index c53a1bc2f1..07cdc8f5ec 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -28,6 +28,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b18938dab1..0f7004c31c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1



* [PATCH v7 4/4] eal: add PMU support to tracing library
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
                               ` (2 preceding siblings ...)
  2023-02-01 13:17             ` [PATCH v7 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-01 13:17             ` Tomasz Duszynski
  2023-02-01 13:51             ` [PATCH v7 0/4] add support for self monitoring Morten Brørup
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  5 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-01 13:17 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, mattias.ronnblom, mb,
	thomas, zhoumin

In order to profile an app one needs to store a significant amount of samples
somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  3 ++
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 159 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..a8e97ee1ec 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via the ``--trace`` EAL option by passing both an expression
+matching the PMU tracepoint name, i.e. ``lib.eal.pmu.read``, and an expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 5caaac8e59..3631d0032b 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..1e46ce549a 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..afb459b198 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +282,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 7ad12a7dc9..eddb45bebf 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -440,6 +440,9 @@ EXPERIMENTAL {
 	rte_thread_detach;
 	rte_thread_equal;
 	rte_thread_join;
+
+	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 };
 
 INTERNAL {
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index 4cf3161155..ae880c72b7 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -402,6 +402,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 0f7004c31c..0f6250e81f 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -169,6 +169,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index 50fb0f354e..20a27d085c 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -8,6 +8,7 @@ EXPERIMENTAL {
 	per_lcore__event_group;
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1



* RE: [PATCH v7 0/4] add support for self monitoring
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
                               ` (3 preceding siblings ...)
  2023-02-01 13:17             ` [PATCH v7 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-01 13:51             ` Morten Brørup
  2023-02-02  7:54               ` Tomasz Duszynski
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  5 siblings, 1 reply; 205+ messages in thread
From: Morten Brørup @ 2023-02-01 13:51 UTC (permalink / raw)
  To: Tomasz Duszynski, dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, thomas, zhoumin

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 1 February 2023 14.18
> 
> This series adds self monitoring support i.e allows to configure and
> read performance measurement unit (PMU) counters in runtime without
> using perf utility. This has certain advantages when application runs
> on
> isolated cores with nohz_full kernel parameter.
> 
> Events can be read directly using rte_pmu_read() or using dedicated
> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
> stored inside CTF file.
> 
> By design, all enabled events are grouped together and the same group
> is attached to lcores that use self monitoring functionality.
> 
> Events are enabled by names, which need to be read from standard
> location under sysfs i.e
> 
> /sys/bus/event_source/devices/PMU/events
> 
> where PMU is a core pmu i.e one measuring cpu events. As of today
> raw events are not supported.

I like the modifications in v7.

Series-acked-by: Morten Brørup <mb@smartsharesystems.com>



* RE: [PATCH v7 0/4] add support for self monitoring
  2023-02-01 13:51             ` [PATCH v7 0/4] add support for self monitoring Morten Brørup
@ 2023-02-02  7:54               ` Tomasz Duszynski
  0 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  7:54 UTC (permalink / raw)
  To: Morten Brørup, dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson,
	Jerin Jacob Kollanukkaran, mattias.ronnblom, thomas, zhoumin

Hi Morten,

>-----Original Message-----
>From: Morten Brørup <mb@smartsharesystems.com>
>Sent: Wednesday, February 1, 2023 2:51 PM
>To: Tomasz Duszynski <tduszynski@marvell.com>; dev@dpdk.org
>Cc: roretzla@linux.microsoft.com; Ruifeng.Wang@arm.com; bruce.richardson@intel.com; Jerin Jacob
>Kollanukkaran <jerinj@marvell.com>; mattias.ronnblom@ericsson.com; thomas@monjalon.net;
>zhoumin@loongson.cn
>Subject: [EXT] RE: [PATCH v7 0/4] add support for self monitoring
>
>> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
>> Sent: Wednesday, 1 February 2023 14.18
>>
>> This series adds self monitoring support i.e allows to configure and
>> read performance measurement unit (PMU) counters in runtime without
>> using perf utility. This has certain advantages when application runs
>> on isolated cores with nohz_full kernel parameter.
>>
>> Events can be read directly using rte_pmu_read() or using dedicated
>> tracepoint rte_eal_trace_pmu_read(). The latter will cause events to
>> be stored inside CTF file.
>>
>> By design, all enabled events are grouped together and the same group
>> is attached to lcores that use self monitoring functionality.
>>
>> Events are enabled by names, which need to be read from standard
>> location under sysfs i.e
>>
>> /sys/bus/event_source/devices/PMU/events
>>
>> where PMU is a core pmu i.e one measuring cpu events. As of today raw
>> events are not supported.
>
>I like the modifications in v7.
>
>Series-acked-by: Morten Brørup <mb@smartsharesystems.com>

Thanks. 


* [PATCH v8 0/4] add support for self monitoring
  2023-02-01 13:17           ` [PATCH v7 0/4] add support for self monitoring Tomasz Duszynski
                               ` (4 preceding siblings ...)
  2023-02-01 13:51             ` [PATCH v7 0/4] add support for self monitoring Morten Brørup
@ 2023-02-02  9:43             ` Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                                 ` (4 more replies)
  5 siblings, 5 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application runs on
isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by names, which need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.
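
For readers unfamiliar with the sysfs layout: the library resolves the terms
of an event through the entries in the PMU's format directory, each of which
names a config word and a bit range. Below is a minimal sketch of parsing
such an entry, along the lines of get_term_format() in patch 1/4. The helper
name parse_format(), the fixed-size name buffer and the sample strings are
assumptions made for the example.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of this series): parse a sysfs format entry
 * such as "config:0-7" or "config1:13". On success num holds the config word
 * index while low and high hold the bit range.
 */
static int
parse_format(const char *str, int *num, int *low, int *high)
{
	char name[16];
	int last, ret;

	ret = sscanf(str, "%15[^:]:%d-%d", name, low, high);
	if (ret < 2)
		return -1;
	/* single bit field, upper bound not given */
	if (ret == 2)
		*high = *low;

	/* a trailing digit selects the config word, no digit implies 0 */
	last = (unsigned char)name[strlen(name) - 1];
	*num = isdigit(last) ? last - '0' : 0;

	return 0;
}
```

The patch itself reads the entry from a file and lets fscanf() allocate the
name via %m; the fixed-size buffer here only keeps the sketch short.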

v8:
- just rebase series
v7:
- use per-lcore event group instead of global table index by lcore-id
- don't add pmu_autotest to fast tests due to lack of support on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  61 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   9 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   1 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 525 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 225 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  21 +
 23 files changed, 1138 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1



* [PATCH v8 1/4] lib: add generic support for reading PMU events
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02 10:32                 ` Ruifeng Wang
  2023-02-02  9:43               ` [PATCH v8 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin

Add support for programming PMU counters and reading their values
at runtime, bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.
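
As a rough illustration of how an event description ends up in the config
words handed to perf_event_open(), the sketch below packs a term value into
a config word using the same GENMASK_ULL() and FIELD_PREP() helpers this
patch defines. The function name pack_term() and the bit ranges used in the
usage note are assumptions made for the example only.

```c
#include <stdint.h>

/* helpers copied from lib/pmu/rte_pmu.c in this patch (GCC/Clang builtin) */
#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))

/*
 * Hypothetical helper (not part of this series): place val in bits
 * [high:low] of a config word, as parse_event() does for each term once
 * the bit range is known. Bits of val that do not fit are dropped.
 */
static uint64_t
pack_term(uint64_t config, int low, int high, uint64_t val)
{
	uint64_t mask = GENMASK_ULL(high, low);

	return config | FIELD_PREP(mask, val);
}
```

For example, with one term occupying bits 0-7 and another occupying bits
8-15, pack_term(pack_term(0, 0, 7, 0x3c), 8, 15, 0x01) yields 0x13c.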

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   1 +
 app/test/test_pmu.c                    |  55 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   9 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 205 +++++++++++
 lib/pmu/version.map                    |  20 ++
 13 files changed, 813 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..7b6b69dcf1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..a9bfb1a427
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+Most architectures support some sort of hardware measurement unit which provides a set of
+programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being an example here. Though in some scenarios, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 73f5d94e14..733541d56c 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -55,10 +55,19 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 * **Updated AMD axgbe driver.**
 
   * Added multi-process support.
 
+* **Added multi-process support for axgbe PMD.**
+
 * **Updated Corigine nfp driver.**
 
   * Added support for meter options.
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..4cf3161155
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,464 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit e.g in case fast-path tracepoint has already been enabled
+	 * via command line but application doesn't care enough and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..e360375a0c
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..50fb0f354e
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,20 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v8 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading ARM PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index a9bfb1a427..623e04b691 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+	buf[ret] = '\0';
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index e360375a0c..b18938dab1 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			/* %x1 is the input operand (index); %0 is the output (val) */
+			"msr pmselr_el0, %x1\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v8 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02  9:43               ` [PATCH v8 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading Intel x86_64 PMU events in runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 623e04b691..614395482f 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -28,6 +28,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b18938dab1..0f7004c31c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v8 4/4] eal: add PMU support to tracing library
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
                                 ` (2 preceding siblings ...)
  2023-02-02  9:43               ` [PATCH v8 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-02  9:43               ` Tomasz Duszynski
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02  9:43 UTC (permalink / raw)
  To: dev, Jerin Jacob, Sunil Kumar Kori, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, mattias.ronnblom, mb,
	thomas, zhoumin

In order to profile an app one needs to store a significant amount of
samples somewhere for analysis later on. Since the trace library supports
storing data in CTF format, let's take advantage of that and add a
dedicated PMU tracepoint.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_trace_perf.c               | 10 ++++
 doc/guides/prog_guide/profile_app.rst    |  5 ++
 doc/guides/prog_guide/trace_lib.rst      | 32 +++++++++++++
 lib/eal/common/eal_common_trace.c        | 13 ++++-
 lib/eal/common/eal_common_trace_points.c |  5 ++
 lib/eal/include/rte_eal_trace.h          | 13 +++++
 lib/eal/meson.build                      |  3 ++
 lib/eal/version.map                      |  1 +
 lib/pmu/rte_pmu.c                        | 61 ++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 14 ++++++
 lib/pmu/version.map                      |  1 +
 11 files changed, 157 insertions(+), 1 deletion(-)

diff --git a/app/test/test_trace_perf.c b/app/test/test_trace_perf.c
index 46ae7d8074..f1929f2734 100644
--- a/app/test/test_trace_perf.c
+++ b/app/test/test_trace_perf.c
@@ -114,6 +114,10 @@ worker_fn_##func(void *arg) \
 #define GENERIC_DOUBLE rte_eal_trace_generic_double(3.66666)
 #define GENERIC_STR rte_eal_trace_generic_str("hello world")
 #define VOID_FP app_dpdk_test_fp()
+#ifdef RTE_EXEC_ENV_LINUX
+/* 0 corresponds to the first event passed via --trace= */
+#define READ_PMU rte_eal_trace_pmu_read(0)
+#endif
 
 WORKER_DEFINE(GENERIC_VOID)
 WORKER_DEFINE(GENERIC_U64)
@@ -122,6 +126,9 @@ WORKER_DEFINE(GENERIC_FLOAT)
 WORKER_DEFINE(GENERIC_DOUBLE)
 WORKER_DEFINE(GENERIC_STR)
 WORKER_DEFINE(VOID_FP)
+#ifdef RTE_EXEC_ENV_LINUX
+WORKER_DEFINE(READ_PMU)
+#endif
 
 static void
 run_test(const char *str, lcore_function_t f, struct test_data *data, size_t sz)
@@ -174,6 +181,9 @@ test_trace_perf(void)
 	run_test("double", worker_fn_GENERIC_DOUBLE, data, sz);
 	run_test("string", worker_fn_GENERIC_STR, data, sz);
 	run_test("void_fp", worker_fn_VOID_FP, data, sz);
+#ifdef RTE_EXEC_ENV_LINUX
+	run_test("read_pmu", worker_fn_READ_PMU, data, sz);
+#endif
 
 	rte_free(data);
 	return TEST_SUCCESS;
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index a8b501fe0c..6a53341c6b 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -16,6 +16,11 @@ that information, perf being an example here. Though in some scenarios, eg. when
 isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
 read specific events directly from application via ``rte_pmu_read()``.
 
+Alternatively, the tracing library can be used, which offers a dedicated tracepoint
+``rte_eal_trace_pmu_read()``.
+
+Refer to :doc:`../prog_guide/trace_lib` for more details.
+
 Profiling on x86
 ----------------
 
diff --git a/doc/guides/prog_guide/trace_lib.rst b/doc/guides/prog_guide/trace_lib.rst
index 9a8f38073d..a8e97ee1ec 100644
--- a/doc/guides/prog_guide/trace_lib.rst
+++ b/doc/guides/prog_guide/trace_lib.rst
@@ -46,6 +46,7 @@ DPDK tracing library features
   trace format and is compatible with ``LTTng``.
   For detailed information, refer to
   `Common Trace Format <https://diamon.org/ctf/>`_.
+- Support reading PMU events on ARM64 and x86-64 (Intel)
 
 How to add a tracepoint?
 ------------------------
@@ -137,6 +138,37 @@ the user must use ``RTE_TRACE_POINT_FP`` instead of ``RTE_TRACE_POINT``.
 ``RTE_TRACE_POINT_FP`` is compiled out by default and it can be enabled using
 the ``enable_trace_fp`` option for meson build.
 
+PMU tracepoint
+--------------
+
+Performance measurement unit (PMU) event values can be read from hardware
+registers using the predefined ``rte_eal_trace_pmu_read`` tracepoint.
+
+Tracing is enabled via ``--trace`` EAL option by passing both expression
+matching PMU tracepoint name i.e ``lib.eal.pmu.read`` and expression
+``e=ev1[,ev2,...]`` matching particular events::
+
+    --trace='.*pmu.read\|e=cpu_cycles,l1d_cache'
+
+Event names are available under ``/sys/bus/event_source/devices/PMU/events``
+directory, where ``PMU`` is a placeholder for either a ``cpu`` or a directory
+containing ``cpus``.
+
+Contrary to other tracepoints, this one does not need any extra variables
+added to source files. Instead, the caller passes an index which follows the order of
+events specified via the ``--trace`` parameter. In the following example index ``0``
+corresponds to ``cpu_cycles`` while index ``1`` corresponds to ``l1d_cache``.
+
+.. code-block:: c
+
+ ...
+ rte_eal_trace_pmu_read(0);
+ rte_eal_trace_pmu_read(1);
+ ...
+
+PMU tracing support must be explicitly enabled using the ``enable_trace_fp``
+option for meson build.
+
 Event record mode
 -----------------
 
diff --git a/lib/eal/common/eal_common_trace.c b/lib/eal/common/eal_common_trace.c
index 75162b722d..8796052d0c 100644
--- a/lib/eal/common/eal_common_trace.c
+++ b/lib/eal/common/eal_common_trace.c
@@ -11,6 +11,9 @@
 #include <rte_errno.h>
 #include <rte_lcore.h>
 #include <rte_per_lcore.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_string_fns.h>
 
 #include "eal_trace.h"
@@ -71,8 +74,13 @@ eal_trace_init(void)
 		goto free_meta;
 
 	/* Apply global configurations */
-	STAILQ_FOREACH(arg, &trace.args, next)
+	STAILQ_FOREACH(arg, &trace.args, next) {
 		trace_args_apply(arg->val);
+#ifdef RTE_EXEC_ENV_LINUX
+		if (rte_pmu_init() == 0)
+			rte_pmu_add_events_by_pattern(arg->val);
+#endif
+	}
 
 	rte_trace_mode_set(trace.mode);
 
@@ -88,6 +96,9 @@ eal_trace_init(void)
 void
 eal_trace_fini(void)
 {
+#ifdef RTE_EXEC_ENV_LINUX
+	rte_pmu_fini();
+#endif
 	trace_mem_free();
 	trace_metadata_destroy();
 	eal_trace_args_free();
diff --git a/lib/eal/common/eal_common_trace_points.c b/lib/eal/common/eal_common_trace_points.c
index 0b0b254615..1e46ce549a 100644
--- a/lib/eal/common/eal_common_trace_points.c
+++ b/lib/eal/common/eal_common_trace_points.c
@@ -75,3 +75,8 @@ RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_enable,
 	lib.eal.intr.enable)
 RTE_TRACE_POINT_REGISTER(rte_eal_trace_intr_disable,
 	lib.eal.intr.disable)
+
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_REGISTER(rte_eal_trace_pmu_read,
+	lib.eal.pmu.read)
+#endif
diff --git a/lib/eal/include/rte_eal_trace.h b/lib/eal/include/rte_eal_trace.h
index 5ef4398230..afb459b198 100644
--- a/lib/eal/include/rte_eal_trace.h
+++ b/lib/eal/include/rte_eal_trace.h
@@ -17,6 +17,9 @@ extern "C" {
 
 #include <rte_alarm.h>
 #include <rte_interrupts.h>
+#ifdef RTE_EXEC_ENV_LINUX
+#include <rte_pmu.h>
+#endif
 #include <rte_trace_point.h>
 
 #include "eal_interrupts.h"
@@ -279,6 +282,16 @@ RTE_TRACE_POINT(
 	rte_trace_point_emit_string(cpuset);
 )
 
+#ifdef RTE_EXEC_ENV_LINUX
+RTE_TRACE_POINT_FP(
+	rte_eal_trace_pmu_read,
+	RTE_TRACE_POINT_ARGS(unsigned int index),
+	uint64_t val;
+	val = rte_pmu_read(index);
+	rte_trace_point_emit_u64(val);
+)
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/eal/meson.build b/lib/eal/meson.build
index 056beb9461..f5865dbcd9 100644
--- a/lib/eal/meson.build
+++ b/lib/eal/meson.build
@@ -26,6 +26,9 @@ deps += ['kvargs']
 if not is_windows
     deps += ['telemetry']
 endif
+if is_linux
+    deps += ['pmu']
+endif
 if dpdk_conf.has('RTE_USE_LIBBSD')
     ext_deps += libbsd
 endif
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 6523102157..2f8f66874b 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -441,6 +441,7 @@ EXPERIMENTAL {
 	rte_thread_join;
 
 	# added in 23.03
+	__rte_eal_trace_pmu_read; # WINDOWS_NO_EXPORT
 	rte_thread_set_name;
 };
 
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
index 4cf3161155..ae880c72b7 100644
--- a/lib/pmu/rte_pmu.c
+++ b/lib/pmu/rte_pmu.c
@@ -402,6 +402,67 @@ rte_pmu_add_event(const char *name)
 	return event->index;
 }
 
+static int
+add_events(const char *pattern)
+{
+	char *token, *copy;
+	int ret;
+
+	copy = strdup(pattern);
+	if (copy == NULL)
+		return -ENOMEM;
+
+	token = strtok(copy, ",");
+	while (token) {
+		ret = rte_pmu_add_event(token);
+		if (ret < 0)
+			break;
+
+		token = strtok(NULL, ",");
+	}
+
+	free(copy);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int
+rte_pmu_add_events_by_pattern(const char *pattern)
+{
+	regmatch_t rmatch;
+	char buf[BUFSIZ];
+	unsigned int num;
+	regex_t reg;
+	int ret;
+
+	/* events are matched against occurrences of e=ev1[,ev2,..] pattern */
+	ret = regcomp(&reg, "e=([_[:alnum:]-],?)+", REG_EXTENDED);
+	if (ret)
+		return -EINVAL;
+
+	for (;;) {
+		if (regexec(&reg, pattern, 1, &rmatch, 0))
+			break;
+
+		num = rmatch.rm_eo - rmatch.rm_so;
+		if (num > sizeof(buf))
+			num = sizeof(buf);
+
+		/* skip e= pattern prefix */
+		memcpy(buf, pattern + rmatch.rm_so + 2, num - 2);
+		buf[num - 2] = '\0';
+		ret = add_events(buf);
+		if (ret)
+			break;
+
+		pattern += rmatch.rm_eo;
+	}
+
+	regfree(&reg);
+
+	return ret;
+}
+
 int
 rte_pmu_init(void)
 {
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index 0f7004c31c..0f6250e81f 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -169,6 +169,20 @@ __rte_experimental
 int
 rte_pmu_add_event(const char *name);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add events matching pattern to the group of enabled events.
+ *
+ * @param pattern
+ *   Pattern e=ev1[,ev2,...] matching events, where evX is a placeholder for an event listed under
+ *   /sys/bus/event_source/devices/pmu/events.
+ */
+__rte_experimental
+int
+rte_pmu_add_events_by_pattern(const char *pattern);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
index 50fb0f354e..20a27d085c 100644
--- a/lib/pmu/version.map
+++ b/lib/pmu/version.map
@@ -8,6 +8,7 @@ EXPERIMENTAL {
 	per_lcore__event_group;
 	rte_pmu;
 	rte_pmu_add_event;
+	rte_pmu_add_events_by_pattern;
 	rte_pmu_fini;
 	rte_pmu_init;
 	rte_pmu_read;
-- 
2.34.1


^ permalink raw reply	[flat|nested] 205+ messages in thread

* RE: [PATCH v8 1/4] lib: add generic support for reading PMU events
  2023-02-02  9:43               ` [PATCH v8 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-02 10:32                 ` Ruifeng Wang
  0 siblings, 0 replies; 205+ messages in thread
From: Ruifeng Wang @ 2023-02-02 10:32 UTC (permalink / raw)
  To: Tomasz Duszynski, dev, thomas
  Cc: roretzla, bruce.richardson, jerinj, mattias.ronnblom, mb, zhoumin, nd

> -----Original Message-----
> From: Tomasz Duszynski <tduszynski@marvell.com>
> Sent: Thursday, February 2, 2023 5:44 PM
> To: dev@dpdk.org; thomas@monjalon.net; Tomasz Duszynski <tduszynski@marvell.com>
> Cc: roretzla@linux.microsoft.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> bruce.richardson@intel.com; jerinj@marvell.com; mattias.ronnblom@ericsson.com;
> mb@smartsharesystems.com; zhoumin@loongson.cn
> Subject: [PATCH v8 1/4] lib: add generic support for reading PMU events
> 
> Add support for programming PMU counters and reading their values in runtime bypassing
> kernel completely.
> 
> This is especially useful in cases where CPU cores are isolated
> (nohz_full) i.e run dedicated tasks. In such cases one cannot use standard perf utility
> without sacrificing latency and performance.
> 
> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> Acked-by: Morten Brørup <mb@smartsharesystems.com>
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   1 +
>  app/test/test_pmu.c                    |  55 +++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |   8 +
>  doc/guides/rel_notes/release_23_03.rst |   9 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  29 ++
>  lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 205 +++++++++++
>  lib/pmu/version.map                    |  20 ++
>  13 files changed, 813 insertions(+), 1 deletion(-)  create mode 100644
> app/test/test_pmu.c  create mode 100644 lib/pmu/meson.build  create mode 100644
> lib/pmu/pmu_private.h  create mode 100644 lib/pmu/rte_pmu.c  create mode 100644
> lib/pmu/rte_pmu.h  create mode 100644 lib/pmu/version.map
>
 
<snip>

> diff --git a/doc/guides/rel_notes/release_23_03.rst
> b/doc/guides/rel_notes/release_23_03.rst
> index 73f5d94e14..733541d56c 100644
> --- a/doc/guides/rel_notes/release_23_03.rst
> +++ b/doc/guides/rel_notes/release_23_03.rst
> @@ -55,10 +55,19 @@ New Features
>       Also, make sure to start the actual text at the margin.
>       =======================================================
> 
> +* **Added PMU library.**
> +
> +  Added a new PMU (performance measurement unit) library which allows

Overall looks good to me. Just a minor comment.
Should it be 'performance *monitoring* unit'?
I see the same terminology is used across architectures. It will be better if we align with that.

Thanks.

<snip>

^ permalink raw reply	[flat|nested] 205+ messages in thread

* [PATCH v9 0/4] add support for self monitoring
  2023-02-02  9:43             ` [PATCH v8 " Tomasz Duszynski
                                 ` (3 preceding siblings ...)
  2023-02-02  9:43               ` [PATCH v8 4/4] eal: add PMU support to tracing library Tomasz Duszynski
@ 2023-02-02 12:49               ` Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
                                   ` (4 more replies)
  4 siblings, 5 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin, Tomasz Duszynski

This series adds self monitoring support, i.e. it allows configuring and
reading performance measurement unit (PMU) counters at runtime without
using the perf utility. This has certain advantages when an application
runs on isolated cores with the nohz_full kernel parameter.

Events can be read directly using rte_pmu_read() or using dedicated
tracepoint rte_eal_trace_pmu_read(). The latter will cause events to be
stored inside CTF file.

By design, all enabled events are grouped together and the same group
is attached to lcores that use self monitoring functionality.

Events are enabled by name. Names need to be read from the standard
location under sysfs, i.e.

/sys/bus/event_source/devices/PMU/events

where PMU is a core PMU, i.e. one measuring CPU events. As of today
raw events are not supported.

v9:
- fix 'maybe-uninitialized' warning reported by CI
v8:
- just rebase series
v7:
- use per-lcore event group instead of global table indexed by lcore-id
- don't add pmu_autotest to fast tests due to lack of support on
  every arch
v6:
- move codebase to the separate library
- address review comments
v5:
- address review comments
- fix sign extension while reading pmu on x86
- fix regex mentioned in doc
- various minor changes/improvements here and there
v4:
- fix freeing mem detected by debug_autotest
v3:
- fix shared build
v2:
- fix problems reported by test build infra
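
The sign-extension step mentioned in the v5 notes can be sketched in
isolation as follows. This is a standalone illustration, not the library
code: sign_extend() is a hypothetical name, and width stands for the
counter width that the perf user page reports in pmc_width. It assumes
the usual GCC/Clang behaviour for unsigned-to-signed conversion and for
right shifts of negative values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend a raw counter value that is only
 * `width` bits wide to a full 64-bit signed value. */
static int64_t
sign_extend(uint64_t raw, unsigned int width)
{
	uint64_t shifted;

	assert(width >= 1 && width <= 64);

	/* move the counter's top bit into bit 63; done on an unsigned
	 * value so the left shift cannot overflow a signed type */
	shifted = raw << (64 - width);

	/* assumption: the compiler implements signed right shift as an
	 * arithmetic shift, replicating bit 63 into the upper bits */
	return (int64_t)shifted >> (64 - width);
}
```

With a 48-bit counter, a raw value with all 48 bits set comes out as -1,
so adding it to the offset kept in the user page moves the total
backwards by one instead of adding a large positive number.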

Tomasz Duszynski (4):
  lib: add generic support for reading PMU events
  pmu: support reading ARM PMU events in runtime
  pmu: support reading Intel x86_64 PMU events in runtime
  eal: add PMU support to tracing library

 MAINTAINERS                              |   5 +
 app/test/meson.build                     |   1 +
 app/test/test_pmu.c                      |  61 +++
 app/test/test_trace_perf.c               |  10 +
 doc/api/doxy-api-index.md                |   3 +-
 doc/api/doxy-api.conf.in                 |   1 +
 doc/guides/prog_guide/profile_app.rst    |  13 +
 doc/guides/prog_guide/trace_lib.rst      |  32 ++
 doc/guides/rel_notes/release_23_03.rst   |   9 +
 lib/eal/common/eal_common_trace.c        |  13 +-
 lib/eal/common/eal_common_trace_points.c |   5 +
 lib/eal/include/rte_eal_trace.h          |  13 +
 lib/eal/meson.build                      |   3 +
 lib/eal/version.map                      |   1 +
 lib/meson.build                          |   1 +
 lib/pmu/meson.build                      |  21 +
 lib/pmu/pmu_arm64.c                      |  94 ++++
 lib/pmu/pmu_private.h                    |  29 ++
 lib/pmu/rte_pmu.c                        | 525 +++++++++++++++++++++++
 lib/pmu/rte_pmu.h                        | 225 ++++++++++
 lib/pmu/rte_pmu_pmc_arm64.h              |  30 ++
 lib/pmu/rte_pmu_pmc_x86_64.h             |  24 ++
 lib/pmu/version.map                      |  21 +
 23 files changed, 1138 insertions(+), 2 deletions(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h
 create mode 100644 lib/pmu/version.map

--
2.34.1



* [PATCH v9 1/4] lib: add generic support for reading PMU events
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-06 11:02                   ` David Marchand
  2023-02-02 12:49                 ` [PATCH v9 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev, Thomas Monjalon, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, zhoumin

Add support for programming PMU counters and reading their values
at runtime, with reads bypassing the kernel completely.

This is especially useful in cases where CPU cores are isolated
(nohz_full), i.e. run dedicated tasks. In such cases one cannot use
the standard perf utility without sacrificing latency and performance.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 MAINTAINERS                            |   5 +
 app/test/meson.build                   |   1 +
 app/test/test_pmu.c                    |  55 +++
 doc/api/doxy-api-index.md              |   3 +-
 doc/api/doxy-api.conf.in               |   1 +
 doc/guides/prog_guide/profile_app.rst  |   8 +
 doc/guides/rel_notes/release_23_03.rst |   9 +
 lib/meson.build                        |   1 +
 lib/pmu/meson.build                    |  13 +
 lib/pmu/pmu_private.h                  |  29 ++
 lib/pmu/rte_pmu.c                      | 464 +++++++++++++++++++++++++
 lib/pmu/rte_pmu.h                      | 205 +++++++++++
 lib/pmu/version.map                    |  20 ++
 13 files changed, 813 insertions(+), 1 deletion(-)
 create mode 100644 app/test/test_pmu.c
 create mode 100644 lib/pmu/meson.build
 create mode 100644 lib/pmu/pmu_private.h
 create mode 100644 lib/pmu/rte_pmu.c
 create mode 100644 lib/pmu/rte_pmu.h
 create mode 100644 lib/pmu/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a0f416d2e..9f13eafd95 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,11 @@ M: Nithin Dabilpuram <ndabilpuram@marvell.com>
 M: Pavan Nikhilesh <pbhagavatula@marvell.com>
 F: lib/node/
 
+PMU - EXPERIMENTAL
+M: Tomasz Duszynski <tduszynski@marvell.com>
+F: lib/pmu/
+F: app/test/test_pmu*
+
 
 Test Applications
 -----------------
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..7b6b69dcf1 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -111,6 +111,7 @@ test_sources = files(
         'test_reciprocal_division_perf.c',
         'test_red.c',
         'test_pie.c',
+        'test_pmu.c',
         'test_reorder.c',
         'test_rib.c',
         'test_rib6.c',
diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
new file mode 100644
index 0000000000..a9bfb1a427
--- /dev/null
+++ b/app/test/test_pmu.c
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include "test.h"
+
+#ifndef RTE_EXEC_ENV_LINUX
+
+static int
+test_pmu(void)
+{
+	printf("pmu_autotest only supported on Linux, skipping test\n");
+	return TEST_SKIPPED;
+}
+
+#else
+
+#include <rte_pmu.h>
+
+static int
+test_pmu_read(void)
+{
+	int tries = 10, event = -1;
+	uint64_t val = 0;
+
+	if (rte_pmu_init() < 0)
+		return TEST_FAILED;
+
+	while (tries--)
+		val += rte_pmu_read(event);
+
+	rte_pmu_fini();
+
+	return val ? TEST_SUCCESS : TEST_FAILED;
+}
+
+static struct unit_test_suite pmu_tests = {
+	.suite_name = "pmu autotest",
+	.setup = NULL,
+	.teardown = NULL,
+	.unit_test_cases = {
+		TEST_CASE(test_pmu_read),
+		TEST_CASES_END()
+	}
+};
+
+static int
+test_pmu(void)
+{
+	return unit_test_suite_runner(&pmu_tests);
+}
+
+#endif /* RTE_EXEC_ENV_LINUX */
+
+REGISTER_TEST_COMMAND(pmu_autotest, test_pmu);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index de488c7abf..7f1938f92f 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,7 +222,8 @@ The public API headers are grouped by topics:
   [log](@ref rte_log.h),
   [errno](@ref rte_errno.h),
   [trace](@ref rte_trace.h),
-  [trace_point](@ref rte_trace_point.h)
+  [trace_point](@ref rte_trace_point.h),
+  [pmu](@ref rte_pmu.h)
 
 - **misc**:
   [EAL config](@ref rte_eal.h),
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index f0886c3bd1..920e615996 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -63,6 +63,7 @@ INPUT                   = @TOPDIR@/doc/api/doxy-api-index.md \
                           @TOPDIR@/lib/pci \
                           @TOPDIR@/lib/pdump \
                           @TOPDIR@/lib/pipeline \
+                          @TOPDIR@/lib/pmu \
                           @TOPDIR@/lib/port \
                           @TOPDIR@/lib/power \
                           @TOPDIR@/lib/rawdev \
diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
index 14292d4c25..a8b501fe0c 100644
--- a/doc/guides/prog_guide/profile_app.rst
+++ b/doc/guides/prog_guide/profile_app.rst
@@ -7,6 +7,14 @@ Profile Your Application
 The following sections describe methods of profiling DPDK applications on
 different architectures.
 
+Performance counter based profiling
+-----------------------------------
+
+The majority of architectures support some sort of hardware measurement unit which provides a set
+of programmable counters that monitor specific events. There are different tools which can gather
+that information, perf being one example. In some scenarios though, e.g. when CPU cores are
+isolated (nohz_full) and run dedicated tasks, using perf is less than ideal. In such cases one can
+read specific events directly from the application via ``rte_pmu_read()``.
 
 Profiling on x86
 ----------------
diff --git a/doc/guides/rel_notes/release_23_03.rst b/doc/guides/rel_notes/release_23_03.rst
index 73f5d94e14..733541d56c 100644
--- a/doc/guides/rel_notes/release_23_03.rst
+++ b/doc/guides/rel_notes/release_23_03.rst
@@ -55,10 +55,19 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Added PMU library.**
+
+  Added a new PMU (performance measurement unit) library which allows applications
+  to perform self monitoring activities without depending on external utilities like perf.
+  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
+  can be stored in CTF format for further analysis.
+
 * **Updated AMD axgbe driver.**
 
   * Added multi-process support.
 
+* **Added multi-process support for axgbe PMD.**
+
 * **Updated Corigine nfp driver.**
 
   * Added support for meter options.
diff --git a/lib/meson.build b/lib/meson.build
index a90fee31b7..7132131b5c 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -11,6 +11,7 @@
 libraries = [
         'kvargs', # eal depends on kvargs
         'telemetry', # basic info querying
+        'pmu',
         'eal', # everything depends on eal
         'ring',
         'rcu', # rcu depends on ring
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
new file mode 100644
index 0000000000..a4160b494e
--- /dev/null
+++ b/lib/pmu/meson.build
@@ -0,0 +1,13 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(C) 2023 Marvell International Ltd.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+    subdir_done()
+endif
+
+includes = [global_inc]
+
+sources = files('rte_pmu.c')
+headers = files('rte_pmu.h')
diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
new file mode 100644
index 0000000000..849549b125
--- /dev/null
+++ b/lib/pmu/pmu_private.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _PMU_PRIVATE_H_
+#define _PMU_PRIVATE_H_
+
+/**
+ * Architecture specific PMU init callback.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+int
+pmu_arch_init(void);
+
+/**
+ * Architecture specific PMU cleanup callback.
+ */
+void
+pmu_arch_fini(void);
+
+/**
+ * Apply architecture specific settings to config before passing it to syscall.
+ */
+void
+pmu_arch_fixup_config(uint64_t config[3]);
+
+#endif /* _PMU_PRIVATE_H_ */
diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
new file mode 100644
index 0000000000..4cf3161155
--- /dev/null
+++ b/lib/pmu/rte_pmu.c
@@ -0,0 +1,464 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <ctype.h>
+#include <dirent.h>
+#include <errno.h>
+#include <regex.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/queue.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_pmu.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "pmu_private.h"
+
+#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
+
+#ifndef GENMASK_ULL
+#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 - 1 - (h)))))
+#endif
+
+#ifndef FIELD_PREP
+#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
+#endif
+
+RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+struct rte_pmu rte_pmu;
+
+/*
+ * Following __rte_weak functions provide default no-op. Architectures should override them if
+ * necessary.
+ */
+
+int
+__rte_weak pmu_arch_init(void)
+{
+	return 0;
+}
+
+void
+__rte_weak pmu_arch_fini(void)
+{
+}
+
+void
+__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
+{
+}
+
+static int
+get_term_format(const char *name, int *num, uint64_t *mask)
+{
+	char *config = NULL;
+	char path[PATH_MAX];
+	int high, low, ret;
+	FILE *fp;
+
+	/* quiesce -Wmaybe-uninitialized warning */
+	*num = 0;
+	*mask = 0;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	errno = 0;
+	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
+	if (ret < 2) {
+		ret = -ENODATA;
+		goto out;
+	}
+	if (errno) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ret == 2)
+		high = low;
+
+	*mask = GENMASK_ULL(high, low);
+	/* Last digit should be [012]. If last digit is missing 0 is implied. */
+	*num = config[strlen(config) - 1];
+	*num = isdigit(*num) ? *num - '0' : 0;
+
+	ret = 0;
+out:
+	free(config);
+	fclose(fp);
+
+	return ret;
+}
+
+static int
+parse_event(char *buf, uint64_t config[3])
+{
+	char *token, *term;
+	int num, ret, val;
+	uint64_t mask;
+
+	config[0] = config[1] = config[2] = 0;
+
+	token = strtok(buf, ",");
+	while (token) {
+		errno = 0;
+		/* <term>=<value> */
+		ret = sscanf(token, "%m[^=]=%i", &term, &val);
+		if (ret < 1)
+			return -ENODATA;
+		if (errno)
+			return -errno;
+		if (ret == 1)
+			val = 1;
+
+		ret = get_term_format(term, &num, &mask);
+		free(term);
+		if (ret)
+			return ret;
+
+		config[num] |= FIELD_PREP(mask, val);
+		token = strtok(NULL, ",");
+	}
+
+	return 0;
+}
+
+static int
+get_event_config(const char *name, uint64_t config[3])
+{
+	char path[PATH_MAX], buf[BUFSIZ];
+	FILE *fp;
+	int ret;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	fp = fopen(path, "r");
+	if (fp == NULL)
+		return -errno;
+
+	ret = fread(buf, 1, sizeof(buf) - 1, fp);
+	if (ret == 0) {
+		fclose(fp);
+
+		return -EINVAL;
+	}
+	fclose(fp);
+	buf[ret] = '\0';
+
+	return parse_event(buf, config);
+}
+
+static int
+do_perf_event_open(uint64_t config[3], int group_fd)
+{
+	struct perf_event_attr attr = {
+		.size = sizeof(struct perf_event_attr),
+		.type = PERF_TYPE_RAW,
+		.exclude_kernel = 1,
+		.exclude_hv = 1,
+		.disabled = 1,
+	};
+
+	pmu_arch_fixup_config(config);
+
+	attr.config = config[0];
+	attr.config1 = config[1];
+	attr.config2 = config[2];
+
+	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
+}
+
+static int
+open_events(struct rte_pmu_event_group *group)
+{
+	struct rte_pmu_event *event;
+	uint64_t config[3];
+	int num = 0, ret;
+
+	/* group leader gets created first, with fd = -1 */
+	group->fds[0] = -1;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		ret = get_event_config(event->name, config);
+		if (ret)
+			continue;
+
+		ret = do_perf_event_open(config, group->fds[0]);
+		if (ret == -1) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->fds[event->index] = ret;
+		num++;
+	}
+
+	return 0;
+out:
+	for (--num; num >= 0; num--) {
+		close(group->fds[num]);
+		group->fds[num] = -1;
+	}
+
+
+	return ret;
+}
+
+static int
+mmap_events(struct rte_pmu_event_group *group)
+{
+	long page_size = sysconf(_SC_PAGE_SIZE);
+	unsigned int i;
+	void *addr;
+	int ret;
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
+		if (addr == MAP_FAILED) {
+			ret = -errno;
+			goto out;
+		}
+
+		group->mmap_pages[i] = addr;
+	}
+
+	return 0;
+out:
+	for (; i; i--) {
+		munmap(group->mmap_pages[i - 1], page_size);
+		group->mmap_pages[i - 1] = NULL;
+	}
+
+	return ret;
+}
+
+static void
+cleanup_events(struct rte_pmu_event_group *group)
+{
+	unsigned int i;
+
+	if (group->fds[0] != -1)
+		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
+
+	for (i = 0; i < rte_pmu.num_group_events; i++) {
+		if (group->mmap_pages[i]) {
+			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
+			group->mmap_pages[i] = NULL;
+		}
+
+		if (group->fds[i] != -1) {
+			close(group->fds[i]);
+			group->fds[i] = -1;
+		}
+	}
+
+	group->enabled = false;
+}
+
+int __rte_noinline
+rte_pmu_enable_group(void)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (rte_pmu.num_group_events == 0)
+		return -ENODEV;
+
+	ret = open_events(group);
+	if (ret)
+		goto out;
+
+	ret = mmap_events(group);
+	if (ret)
+		goto out;
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
+		ret = -errno;
+		goto out;
+	}
+
+	rte_spinlock_lock(&rte_pmu.lock);
+	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
+	rte_spinlock_unlock(&rte_pmu.lock);
+	group->enabled = true;
+
+	return 0;
+
+out:
+	cleanup_events(group);
+
+	return ret;
+}
+
+static int
+scan_pmus(void)
+{
+	char path[PATH_MAX];
+	struct dirent *dent;
+	const char *name;
+	DIR *dirp;
+
+	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
+	if (dirp == NULL)
+		return -errno;
+
+	while ((dent = readdir(dirp))) {
+		name = dent->d_name;
+		if (name[0] == '.')
+			continue;
+
+		/* sysfs entry should either contain cpus or be a cpu */
+		if (!strcmp(name, "cpu"))
+			break;
+
+		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
+		if (access(path, F_OK) == 0)
+			break;
+	}
+
+	if (dent) {
+		rte_pmu.name = strdup(name);
+		if (rte_pmu.name == NULL) {
+			closedir(dirp);
+
+			return -ENOMEM;
+		}
+	}
+
+	closedir(dirp);
+
+	return rte_pmu.name ? 0 : -ENODEV;
+}
+
+static struct rte_pmu_event *
+new_event(const char *name)
+{
+	struct rte_pmu_event *event;
+
+	event = calloc(1, sizeof(*event));
+	if (event == NULL)
+		goto out;
+
+	event->name = strdup(name);
+	if (event->name == NULL) {
+		free(event);
+		event = NULL;
+	}
+
+out:
+	return event;
+}
+
+static void
+free_event(struct rte_pmu_event *event)
+{
+	free(event->name);
+	free(event);
+}
+
+int
+rte_pmu_add_event(const char *name)
+{
+	struct rte_pmu_event *event;
+	char path[PATH_MAX];
+
+	if (rte_pmu.name == NULL)
+		return -ENODEV;
+
+	if (rte_pmu.num_group_events + 1 >= MAX_NUM_GROUP_EVENTS)
+		return -ENOSPC;
+
+	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
+	if (access(path, R_OK))
+		return -ENODEV;
+
+	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
+		if (!strcmp(event->name, name))
+			return event->index;
+		continue;
+	}
+
+	event = new_event(name);
+	if (event == NULL)
+		return -ENOMEM;
+
+	event->index = rte_pmu.num_group_events++;
+	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
+
+	return event->index;
+}
+
+int
+rte_pmu_init(void)
+{
+	int ret;
+
+	/* Allow calling init from multiple contexts within a single thread. This simplifies
+	 * resource management a bit, e.g. in case a fast-path tracepoint has already been enabled
+	 * via command line but the application doesn't care and performs init/fini again.
+	 */
+	if (rte_pmu.initialized) {
+		rte_pmu.initialized++;
+		return 0;
+	}
+
+	ret = scan_pmus();
+	if (ret)
+		goto out;
+
+	ret = pmu_arch_init();
+	if (ret)
+		goto out;
+
+	TAILQ_INIT(&rte_pmu.event_list);
+	TAILQ_INIT(&rte_pmu.event_group_list);
+	rte_spinlock_init(&rte_pmu.lock);
+	rte_pmu.initialized = 1;
+
+	return 0;
+out:
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+
+	return ret;
+}
+
+void
+rte_pmu_fini(void)
+{
+	struct rte_pmu_event_group *group, *tmp_group;
+	struct rte_pmu_event *event, *tmp_event;
+
+	/* cleanup once init count drops to zero */
+	if (!rte_pmu.initialized || --rte_pmu.initialized)
+		return;
+
+	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
+		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
+		free_event(event);
+	}
+
+	RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_group) {
+		TAILQ_REMOVE(&rte_pmu.event_group_list, group, next);
+		cleanup_events(group);
+	}
+
+	pmu_arch_fini();
+	free(rte_pmu.name);
+	rte_pmu.name = NULL;
+	rte_pmu.num_group_events = 0;
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
new file mode 100644
index 0000000000..e360375a0c
--- /dev/null
+++ b/lib/pmu/rte_pmu.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell
+ */
+
+#ifndef _RTE_PMU_H_
+#define _RTE_PMU_H_
+
+/**
+ * @file
+ *
+ * PMU event tracing operations
+ *
+ * This file defines generic API and types necessary to setup PMU and
+ * read selected counters in runtime.
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <linux/perf_event.h>
+
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_spinlock.h>
+
+/** Maximum number of events in a group */
+#define MAX_NUM_GROUP_EVENTS 8
+
+/**
+ * A structure describing a group of events.
+ */
+struct rte_pmu_event_group {
+	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
+	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
+	bool enabled; /**< true if group was enabled on particular lcore */
+	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
+} __rte_cache_aligned;
+
+/**
+ * A structure describing an event.
+ */
+struct rte_pmu_event {
+	char *name; /**< name of an event */
+	unsigned int index; /**< event index into fds/mmap_pages */
+	TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
+};
+
+/**
+ * A PMU state container.
+ */
+struct rte_pmu {
+	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
+	rte_spinlock_t lock; /**< serialize access to event group list */
+	TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
+	unsigned int num_group_events; /**< number of events in a group */
+	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
+	unsigned int initialized; /**< initialization counter */
+};
+
+/** lcore event group */
+RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
+
+/** PMU state container */
+extern struct rte_pmu rte_pmu;
+
+/** Each architecture supporting PMU needs to provide its own version */
+#ifndef rte_pmu_pmc_read
+#define rte_pmu_pmc_read(index) ({ 0; })
+#endif
+
+/**
+ * @internal
+ *
+ * Read PMU counter.
+ *
+ * @param pc
+ *   Pointer to the mmapped user page.
+ * @return
+ *   Counter value read from hardware.
+ */
+__rte_internal
+static __rte_always_inline uint64_t
+rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
+{
+	uint64_t width, offset;
+	uint32_t seq, index;
+	int64_t pmc;
+
+	for (;;) {
+		seq = pc->lock;
+		rte_compiler_barrier();
+		index = pc->index;
+		offset = pc->offset;
+		width = pc->pmc_width;
+
+		/* index set to 0 means that particular counter cannot be used */
+		if (likely(pc->cap_user_rdpmc && index)) {
+			pmc = rte_pmu_pmc_read(index - 1);
+			pmc <<= 64 - width;
+			pmc >>= 64 - width;
+			offset += pmc;
+		}
+
+		rte_compiler_barrier();
+
+		if (likely(pc->lock == seq))
+			return offset;
+	}
+
+	return 0;
+}
+
+/**
+ * @internal
+ *
+ * Enable group of events on the calling lcore.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_internal
+int
+rte_pmu_enable_group(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Initialize PMU library.
+ *
+ * @return
+ *   0 in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_init(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Finalize PMU library. This should be called after PMU counters are no longer being read.
+ */
+__rte_experimental
+void
+rte_pmu_fini(void);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add event to the group of enabled events.
+ *
+ * @param name
+ *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
+ * @return
+ *   Event index in case of success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_pmu_add_event(const char *name);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Read hardware counter configured to count occurrences of an event.
+ *
+ * @param index
+ *   Index of an event to be read.
+ * @return
+ *   Event value read from register. In case of errors or lack of support
+ *   0 is returned. In other words, stream of zeros in a trace file
+ *   indicates problem with reading particular PMU event register.
+ */
+__rte_experimental
+static __rte_always_inline uint64_t
+rte_pmu_read(unsigned int index)
+{
+	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
+	int ret;
+
+	if (unlikely(!rte_pmu.initialized))
+		return 0;
+
+	if (unlikely(!group->enabled)) {
+		ret = rte_pmu_enable_group();
+		if (ret)
+			return 0;
+	}
+
+	if (unlikely(index >= rte_pmu.num_group_events))
+		return 0;
+
+	return rte_pmu_read_userpage(group->mmap_pages[index]);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_PMU_H_ */
diff --git a/lib/pmu/version.map b/lib/pmu/version.map
new file mode 100644
index 0000000000..50fb0f354e
--- /dev/null
+++ b/lib/pmu/version.map
@@ -0,0 +1,20 @@
+DPDK_23 {
+	local: *;
+};
+
+EXPERIMENTAL {
+	global:
+
+	per_lcore__event_group;
+	rte_pmu;
+	rte_pmu_add_event;
+	rte_pmu_fini;
+	rte_pmu_init;
+	rte_pmu_read;
+};
+
+INTERNAL {
+	global:
+
+	rte_pmu_enable_group;
+};
-- 
2.34.1



* [PATCH v9 2/4] pmu: support reading ARM PMU events in runtime
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
                                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev, Tomasz Duszynski, Ruifeng Wang
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading ARM PMU events at runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c         |  4 ++
 lib/pmu/meson.build         |  7 +++
 lib/pmu/pmu_arm64.c         | 94 +++++++++++++++++++++++++++++++++++++
 lib/pmu/rte_pmu.h           |  4 ++
 lib/pmu/rte_pmu_pmc_arm64.h | 30 ++++++++++++
 5 files changed, 139 insertions(+)
 create mode 100644 lib/pmu/pmu_arm64.c
 create mode 100644 lib/pmu/rte_pmu_pmc_arm64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index a9bfb1a427..623e04b691 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -26,6 +26,10 @@ test_pmu_read(void)
 	if (rte_pmu_init() < 0)
 		return TEST_FAILED;
 
+#if defined(RTE_ARCH_ARM64)
+	event = rte_pmu_add_event("cpu_cycles");
+#endif
+
 	while (tries--)
 		val += rte_pmu_read(event);
 
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index a4160b494e..e857681137 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -11,3 +11,10 @@ includes = [global_inc]
 
 sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
+indirect_headers += files(
+        'rte_pmu_pmc_arm64.h',
+)
+
+if dpdk_conf.has('RTE_ARCH_ARM64')
+    sources += files('pmu_arm64.c')
+endif
diff --git a/lib/pmu/pmu_arm64.c b/lib/pmu/pmu_arm64.c
new file mode 100644
index 0000000000..9e15727948
--- /dev/null
+++ b/lib/pmu/pmu_arm64.c
@@ -0,0 +1,94 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(C) 2023 Marvell International Ltd.
+ */
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include <rte_bitops.h>
+#include <rte_common.h>
+
+#include "pmu_private.h"
+
+#define PERF_USER_ACCESS_PATH "/proc/sys/kernel/perf_user_access"
+
+static int restore_uaccess;
+
+static int
+read_attr_int(const char *path, int *val)
+{
+	char buf[BUFSIZ];
+	int ret, fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd == -1)
+		return -errno;
+
+	ret = read(fd, buf, sizeof(buf) - 1);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+	buf[ret] = '\0';
+	*val = strtol(buf, NULL, 10);
+	close(fd);
+
+	return 0;
+}
+
+static int
+write_attr_int(const char *path, int val)
+{
+	char buf[BUFSIZ];
+	int num, ret, fd;
+
+	fd = open(path, O_WRONLY);
+	if (fd == -1)
+		return -errno;
+
+	num = snprintf(buf, sizeof(buf), "%d", val);
+	ret = write(fd, buf, num);
+	if (ret == -1) {
+		ret = -errno;
+		close(fd);
+		return ret;
+	}
+
+	close(fd);
+
+	return 0;
+}
+
+int
+pmu_arch_init(void)
+{
+	int ret;
+
+	ret = read_attr_int(PERF_USER_ACCESS_PATH, &restore_uaccess);
+	if (ret)
+		return ret;
+
+	/* user access already enabled */
+	if (restore_uaccess == 1)
+		return 0;
+
+	return write_attr_int(PERF_USER_ACCESS_PATH, 1);
+}
+
+void
+pmu_arch_fini(void)
+{
+	write_attr_int(PERF_USER_ACCESS_PATH, restore_uaccess);
+}
+
+void
+pmu_arch_fixup_config(uint64_t config[3])
+{
+	/* select 64 bit counters */
+	config[1] |= RTE_BIT64(0);
+	/* enable userspace access */
+	config[1] |= RTE_BIT64(1);
+}
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index e360375a0c..b18938dab1 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -26,6 +26,10 @@ extern "C" {
 #include <rte_compat.h>
 #include <rte_spinlock.h>
 
+#if defined(RTE_ARCH_ARM64)
+#include "rte_pmu_pmc_arm64.h"
+#endif
+
 /** Maximum number of events in a group */
 #define MAX_NUM_GROUP_EVENTS 8
 
diff --git a/lib/pmu/rte_pmu_pmc_arm64.h b/lib/pmu/rte_pmu_pmc_arm64.h
new file mode 100644
index 0000000000..10648f0c5f
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_arm64.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_ARM64_H_
+#define _RTE_PMU_PMC_ARM64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t val;
+
+	if (index == 31) {
+		/* CPU Cycles (0x11) must be read via pmccntr_el0 */
+		asm volatile("mrs %0, pmccntr_el0" : "=r" (val));
+	} else {
+		asm volatile(
+			"msr pmselr_el0, %x0\n"
+			"mrs %0, pmxevcntr_el0\n"
+			: "=r" (val)
+			: "rZ" (index)
+		);
+	}
+
+	return val;
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_ARM64_H_ */
-- 
2.34.1



* [PATCH v9 3/4] pmu: support reading Intel x86_64 PMU events in runtime
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 1/4] lib: add generic support for reading PMU events Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 2/4] pmu: support reading ARM PMU events in runtime Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-02 12:49                 ` [PATCH v9 4/4] eal: add PMU support to tracing library Tomasz Duszynski
  2023-02-13 11:31                 ` [PATCH v10 0/4] add support for self monitoring Tomasz Duszynski
  4 siblings, 0 replies; 205+ messages in thread
From: Tomasz Duszynski @ 2023-02-02 12:49 UTC (permalink / raw)
  To: dev, Tomasz Duszynski
  Cc: roretzla, Ruifeng.Wang, bruce.richardson, jerinj,
	mattias.ronnblom, mb, thomas, zhoumin

Add support for reading Intel x86_64 PMU events at runtime.

Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
Acked-by: Morten Brørup <mb@smartsharesystems.com>
---
 app/test/test_pmu.c          |  2 ++
 lib/pmu/meson.build          |  1 +
 lib/pmu/rte_pmu.h            |  2 ++
 lib/pmu/rte_pmu_pmc_x86_64.h | 24 ++++++++++++++++++++++++
 4 files changed, 29 insertions(+)
 create mode 100644 lib/pmu/rte_pmu_pmc_x86_64.h

diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
index 623e04b691..614395482f 100644
--- a/app/test/test_pmu.c
+++ b/app/test/test_pmu.c
@@ -28,6 +28,8 @@ test_pmu_read(void)
 
 #if defined(RTE_ARCH_ARM64)
 	event = rte_pmu_add_event("cpu_cycles");
+#elif defined(RTE_ARCH_X86_64)
+	event = rte_pmu_add_event("cpu-cycles");
 #endif
 
 	while (tries--)
diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
index e857681137..5b92e5c4e3 100644
--- a/lib/pmu/meson.build
+++ b/lib/pmu/meson.build
@@ -13,6 +13,7 @@ sources = files('rte_pmu.c')
 headers = files('rte_pmu.h')
 indirect_headers += files(
         'rte_pmu_pmc_arm64.h',
+        'rte_pmu_pmc_x86_64.h',
 )
 
 if dpdk_conf.has('RTE_ARCH_ARM64')
diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
index b18938dab1..0f7004c31c 100644
--- a/lib/pmu/rte_pmu.h
+++ b/lib/pmu/rte_pmu.h
@@ -28,6 +28,8 @@ extern "C" {
 
 #if defined(RTE_ARCH_ARM64)
 #include "rte_pmu_pmc_arm64.h"
+#elif defined(RTE_ARCH_X86_64)
+#include "rte_pmu_pmc_x86_64.h"
 #endif
 
 /** Maximum number of events in a group */
diff --git a/lib/pmu/rte_pmu_pmc_x86_64.h b/lib/pmu/rte_pmu_pmc_x86_64.h
new file mode 100644
index 0000000000..7b67466960
--- /dev/null
+++ b/lib/pmu/rte_pmu_pmc_x86_64.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Marvell.
+ */
+#ifndef _RTE_PMU_PMC_X86_64_H_
+#define _RTE_PMU_PMC_X86_64_H_
+
+#include <rte_common.h>
+
+static __rte_always_inline uint64_t
+rte_pmu_pmc_read(int index)
+{
+	uint64_t low, high;
+
+	asm volatile(
+		"rdpmc\n"
+		: "=a" (low), "=d" (high)
+		: "c" (index)
+	);
+
+	return low | (high << 32);
+}
+#define rte_pmu_pmc_read rte_pmu_pmc_read
+
+#endif /* _RTE_PMU_PMC_X86_64_H_ */
-- 
2.34.1



* [PATCH v9 4/4] eal: add PMU support to tracing library
  2023-02-02 12:49               ` [PATCH v9 0/4] add support for self monitoring Tomasz Duszynski
                                   ` (2 preceding siblings ...)
  2023-02-02 12:49                 ` [PATCH v9 3/4] pmu: support reading Intel x86_64 " Tomasz Duszynski
@ 2023-02-02 12:49                 ` Tomasz Duszynski
  2023-02-13 11:31