From: Konstantin Ananyev
To: Tomasz Duszynski, Thomas Monjalon
CC: "Ruifeng.Wang@arm.com", "bruce.richardson@intel.com",
 "david.marchand@redhat.com", "dev@dpdk.org", "jerinj@marvell.com",
 "konstantin.v.ananyev@yandex.ru", "mattias.ronnblom@ericsson.com",
 "mb@smartsharesystems.com", "roretzla@linux.microsoft.com",
 "stephen@networkplumber.org", "zhoumin@loongson.cn"
Subject: RE: [PATCH v15 1/4] lib: add generic support for reading PMU events
Date: Tue, 5 Nov 2024 10:58:27 +0000
Message-ID: <5f678ae6bb1d4e25bf1f537415799fcc@huawei.com>
References: <20241011094944.3586051-1-tduszynski@marvell.com>
 <20241025085414.3412068-1-tduszynski@marvell.com>
 <20241025085414.3412068-2-tduszynski@marvell.com>
In-Reply-To: <20241025085414.3412068-2-tduszynski@marvell.com>

> Add support for programming PMU counters and reading their values
> in runtime bypassing kernel completely.
>
> This is especially useful in cases where CPU cores are isolated
> i.e run dedicated tasks. In such cases one cannot use standard
> perf utility without sacrificing latency and performance.

LGTM in general, just a few questions and nits, mostly about docs/comments.
> Signed-off-by: Tomasz Duszynski
> ---
>  MAINTAINERS                            |   5 +
>  app/test/meson.build                   |   1 +
>  app/test/test_pmu.c                    |  49 +++
>  doc/api/doxy-api-index.md              |   3 +-
>  doc/api/doxy-api.conf.in               |   1 +
>  doc/guides/prog_guide/profile_app.rst  |  29 ++
>  doc/guides/rel_notes/release_24_11.rst |   7 +
>  lib/eal/meson.build                    |   3 +
>  lib/meson.build                        |   1 +
>  lib/pmu/meson.build                    |  13 +
>  lib/pmu/pmu_private.h                  |  32 ++
>  lib/pmu/rte_pmu.c                      | 473 +++++++++++++++++++++++++
>  lib/pmu/rte_pmu.h                      | 205 +++++++++++
>  lib/pmu/version.map                    |  13 +
>  14 files changed, 834 insertions(+), 1 deletion(-)
>  create mode 100644 app/test/test_pmu.c
>  create mode 100644 lib/pmu/meson.build
>  create mode 100644 lib/pmu/pmu_private.h
>  create mode 100644 lib/pmu/rte_pmu.c
>  create mode 100644 lib/pmu/rte_pmu.h
>  create mode 100644 lib/pmu/version.map
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index cd78bc7db1..077efe41cf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1816,6 +1816,11 @@ M: Nithin Dabilpuram
>  M: Pavan Nikhilesh
>  F: lib/node/
>
> +PMU - EXPERIMENTAL
> +M: Tomasz Duszynski
> +F: lib/pmu/
> +F: app/test/test_pmu*
> +
>
>  Test Applications
>  -----------------
> diff --git a/app/test/meson.build b/app/test/meson.build
> index 0f7e11969a..5f1622ecab 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -141,6 +141,7 @@ source_file_deps = {
>      'test_pmd_perf.c': ['ethdev', 'net'] + packet_burst_generator_deps,
>      'test_pmd_ring.c': ['net_ring', 'ethdev', 'bus_vdev'],
>      'test_pmd_ring_perf.c': ['ethdev', 'net_ring', 'bus_vdev'],
> +    'test_pmu.c': ['pmu'],
>      'test_power.c': ['power'],
>      'test_power_cpufreq.c': ['power'],
>      'test_power_intel_uncore.c': ['power'],
> diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c
> new file mode 100644
> index 0000000000..464e5068ec
> --- /dev/null
> +++ b/app/test/test_pmu.c
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2024 Marvell International Ltd.
> + */
> +
> +#include
> +
> +#include "test.h"
> +
> +static int
> +test_pmu_read(void)
> +{
> +	const char *name = NULL;
> +	int tries = 10, event;
> +	uint64_t val = 0;
> +
> +	if (name == NULL) {
> +		printf("PMU not supported on this arch\n");
> +		return TEST_SKIPPED;
> +	}
> +
> +	if (rte_pmu_init() < 0)
> +		return TEST_FAILED;
> +
> +	event = rte_pmu_add_event(name);
> +	while (tries--)
> +		val += rte_pmu_read(event);
> +
> +	rte_pmu_fini();
> +
> +	return val ? TEST_SUCCESS : TEST_FAILED;
> +}
> +
> +static struct unit_test_suite pmu_tests = {
> +	.suite_name = "pmu autotest",
> +	.setup = NULL,
> +	.teardown = NULL,
> +	.unit_test_cases = {
> +		TEST_CASE(test_pmu_read),
> +		TEST_CASES_END()
> +	}
> +};
> +
> +static int
> +test_pmu(void)
> +{
> +	return unit_test_suite_runner(&pmu_tests);
> +}
> +
> +REGISTER_FAST_TEST(pmu_autotest, true, true, test_pmu);
> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
> index 266c8b90dc..3e6020f367 100644
> --- a/doc/api/doxy-api-index.md
> +++ b/doc/api/doxy-api-index.md
> @@ -240,7 +240,8 @@ The public API headers are grouped by topics:
>    [log](@ref rte_log.h),
>    [errno](@ref rte_errno.h),
>    [trace](@ref rte_trace.h),
> -  [trace_point](@ref rte_trace_point.h)
> +  [trace_point](@ref rte_trace_point.h),
> +  [pmu](@ref rte_pmu.h)
>
>  - **misc**:
>    [EAL config](@ref rte_eal.h),
> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
> index c94f02d411..f4b382e073 100644
> --- a/doc/api/doxy-api.conf.in
> +++ b/doc/api/doxy-api.conf.in
> @@ -69,6 +69,7 @@ INPUT = @TOPDIR@/doc/api/doxy-api-index.md \
>                            @TOPDIR@/lib/pdcp \
>                            @TOPDIR@/lib/pdump \
>                            @TOPDIR@/lib/pipeline \
> +                          @TOPDIR@/lib/pmu \
>                            @TOPDIR@/lib/port \
>                            @TOPDIR@/lib/power \
>                            @TOPDIR@/lib/ptr_compress \
> diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guide/profile_app.rst
> index a6b5fb4d5e..854c515a61 100644
> --- a/doc/guides/prog_guide/profile_app.rst
> +++ b/doc/guides/prog_guide/profile_app.rst
> @@ -7,6 +7,35 @@ Profile Your Application
>  The following sections describe methods of profiling DPDK applications on
>  different architectures.
>
> +Performance counter based profiling
> +-----------------------------------
> +
> +Majority of architectures support some performance monitoring unit (PMU).
> +Such unit provides programmable counters that monitor specific events.
> +
> +Different tools gather that information, like for example perf.
> +However, in some scenarios when CPU cores are isolated and run
> +dedicated tasks interrupting those tasks with perf may be undesirable.
> +
> +In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
> +
> +By default, userspace applications are not allowed to access PMU internals. That can be changed
> +by setting ``/sys/kernel/perf_event_paranoid`` to 2 (that should be a default value anyway) and
> +adding ``CAP_PERFMON`` capability to DPDK application. Please refer to
> +``Documentation/admin-guide/perf-security.rst`` under Linux sources for more information. Fairly
> +recent kernel, i.e >= 5.9, is advised too.
> +
> +As of now implementation imposes certain limitations:
> +
> +* Management APIs that normally return a non-negative value will return error (``-ENOTSUP``) while
> +  ``rte_pmu_read()`` will return ``UINT64_MAX`` if running under unsupported operating system.
> +
> +* Only EAL lcores are supported
> +
> +* EAL lcores must not share a cpu
> +
> +* Each EAL lcore measures same group of events
> +
>
>  Profiling on x86
>  ----------------
> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
> index fa4822d928..d34ecb55e0 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -247,6 +247,13 @@ New Features
>    Added ability for node to advertise and update multiple xstat counters,
>    that can be retrieved using ``rte_graph_cluster_stats_get``.
>
> +* **Added PMU library.**
> +
> +  Added a new performance monitoring unit (PMU) library which allows applications
> +  to perform self monitoring activities without depending on external utilities like perf.
> +  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
> +  can be stored in CTF format for further analysis.
> +
>
>  Removed Items
>  -------------
> diff --git a/lib/eal/meson.build b/lib/eal/meson.build
> index e1d6c4cf17..1349624653 100644
> --- a/lib/eal/meson.build
> +++ b/lib/eal/meson.build
> @@ -18,6 +18,9 @@ deps += ['log', 'kvargs']
>  if not is_windows
>      deps += ['telemetry']
>  endif
> +if dpdk_conf.has('RTE_LIB_PMU')
> +    deps += ['pmu']
> +endif
>  if dpdk_conf.has('RTE_USE_LIBBSD')
>      ext_deps += libbsd
>  endif
> diff --git a/lib/meson.build b/lib/meson.build
> index ce92cb5537..968ad29e8d 100644
> --- a/lib/meson.build
> +++ b/lib/meson.build
> @@ -13,6 +13,7 @@ libraries = [
>          'kvargs', # eal depends on kvargs
>          'argparse',
>          'telemetry', # basic info querying
> +        'pmu',
>          'eal', # everything depends on eal
>          'ptr_compress',
>          'ring',
> diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build
> new file mode 100644
> index 0000000000..f173b6f55c
> --- /dev/null
> +++ b/lib/pmu/meson.build
> @@ -0,0 +1,13 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(C) 2024 Marvell International Ltd.
> +
> +if not is_linux
> +    build = false
> +    reason = 'only supported on Linux'
> +    subdir_done()
> +endif
> +
> +headers = files('rte_pmu.h')
> +sources = files('rte_pmu.c')
> +
> +deps += ['log']
> diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h
> new file mode 100644
> index 0000000000..d2b15615bf
> --- /dev/null
> +++ b/lib/pmu/pmu_private.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Marvell
> + */
> +
> +#ifndef _PMU_PRIVATE_H_
> +#define _PMU_PRIVATE_H_
> +
> +/**
> + * Architecture specific PMU init callback.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +int
> +pmu_arch_init(void);
> +
> +/**
> + * Architecture specific PMU cleanup callback.
> + */
> +void
> +pmu_arch_fini(void);
> +
> +/**
> + * Apply architecture specific settings to config before passing it to syscall.
> + *
> + * @param config
> + *   Architecture specific event configuration. Consult kernel sources for available options.
> + */
> +void
> +pmu_arch_fixup_config(uint64_t config[3]);
> +
> +#endif /* _PMU_PRIVATE_H_ */
> diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c
> new file mode 100644
> index 0000000000..dd57961627
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.c
> @@ -0,0 +1,473 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(C) 2024 Marvell International Ltd.
> + */
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +#include "pmu_private.h"
> +
> +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices"
> +
> +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & (m))
> +
> +RTE_LOG_REGISTER_DEFAULT(rte_pmu_logtype, INFO)
> +#define RTE_LOGTYPE_PMU rte_pmu_logtype
> +
> +#define PMU_LOG(level, ...) \
> +	RTE_LOG_LINE(level, PMU, ## __VA_ARGS__)
> +
> +/* A structure describing an event */
> +struct rte_pmu_event {
> +	char *name;
> +	unsigned int index;
> +	TAILQ_ENTRY(rte_pmu_event) next;
> +};
> +
> +struct rte_pmu rte_pmu;
> +
> +/*
> + * Following __rte_weak functions provide default no-op. Architectures should override them if
> + * necessary.
> + */
> +
> +int
> +__rte_weak pmu_arch_init(void)
> +{
> +	return 0;
> +}
> +
> +void
> +__rte_weak pmu_arch_fini(void)
> +{
> +}
> +
> +void
> +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3])
> +{
> +}
> +
> +static int
> +get_term_format(const char *name, int *num, uint64_t *mask)
> +{
> +	char path[PATH_MAX];
> +	char *config = NULL;
> +	int high, low, ret;
> +	FILE *fp;
> +
> +	*num = *mask = 0;
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s", rte_pmu.name, name);
> +	fp = fopen(path, "r");
> +	if (fp == NULL)
> +		return -errno;
> +
> +	errno = 0;
> +	ret = fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high);
> +	if (ret < 2) {
> +		ret = -ENODATA;
> +		goto out;
> +	}
> +	if (errno) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ret == 2)
> +		high = low;
> +
> +	*mask = RTE_GENMASK64(high, low);
> +	/* Last digit should be [012]. If last digit is missing 0 is implied. */
> +	*num = config[strlen(config) - 1];
> +	*num = isdigit(*num) ? *num - '0' : 0;
> +
> +	ret = 0;
> +out:
> +	free(config);
> +	fclose(fp);
> +
> +	return ret;
> +}
> +
> +static int
> +parse_event(char *buf, uint64_t config[3])
> +{
> +	char *token, *term;
> +	int num, ret, val;
> +	uint64_t mask;
> +
> +	config[0] = config[1] = config[2] = 0;
> +
> +	token = strtok(buf, ",");
> +	while (token) {
> +		errno = 0;
> +		/* = */
> +		ret = sscanf(token, "%m[^=]=%i", &term, &val);
> +		if (ret < 1)
> +			return -ENODATA;
> +		if (errno)
> +			return -errno;
> +		if (ret == 1)
> +			val = 1;
> +
> +		ret = get_term_format(term, &num, &mask);
> +		free(term);
> +		if (ret)
> +			return ret;
> +
> +		config[num] |= FIELD_PREP(mask, val);
> +		token = strtok(NULL, ",");
> +	}
> +
> +	return 0;
> +}
> +
> +static int
> +get_event_config(const char *name, uint64_t config[3])
> +{
> +	char path[PATH_MAX], buf[BUFSIZ];
> +	FILE *fp;
> +	int ret;
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +	fp = fopen(path, "r");
> +	if (fp == NULL)
> +		return -errno;
> +
> +	ret = fread(buf, 1, sizeof(buf), fp);
> +	if (ret == 0) {
> +		fclose(fp);
> +
> +		return -EINVAL;
> +	}
> +	fclose(fp);
> +	buf[ret] = '\0';
> +
> +	return parse_event(buf, config);
> +}
> +
> +static int
> +do_perf_event_open(uint64_t config[3], int group_fd)
> +{
> +	struct perf_event_attr attr = {
> +		.size = sizeof(struct perf_event_attr),
> +		.type = PERF_TYPE_RAW,
> +		.exclude_kernel = 1,
> +		.exclude_hv = 1,
> +		.disabled = 1,
> +		.pinned = group_fd == -1,
> +	};
> +
> +	pmu_arch_fixup_config(config);
> +
> +	attr.config = config[0];
> +	attr.config1 = config[1];
> +	attr.config2 = config[2];
> +
> +	return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0);
> +}
> +
> +static int
> +open_events(struct rte_pmu_event_group *group)
> +{
> +	struct rte_pmu_event *event;
> +	uint64_t config[3];
> +	int num = 0, ret;
> +
> +	/* group leader gets created first, with fd = -1 */
> +	group->fds[0] = -1;
> +
> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +		ret = get_event_config(event->name, config);
> +		if (ret)
> +			continue;
> +
> +		ret = do_perf_event_open(config, group->fds[0]);
> +		if (ret == -1) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->fds[event->index] = ret;
> +		num++;
> +	}
> +
> +	return 0;
> +out:
> +	for (--num; num >= 0; num--) {
> +		close(group->fds[num]);
> +		group->fds[num] = -1;
> +	}
> +
> +	return ret;
> +}
> +
> +static int
> +mmap_events(struct rte_pmu_event_group *group)
> +{
> +	long page_size = sysconf(_SC_PAGE_SIZE);
> +	unsigned int i;
> +	void *addr;
> +	int ret;
> +
> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> +		addr = mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0);
> +		if (addr == MAP_FAILED) {
> +			ret = -errno;
> +			goto out;
> +		}
> +
> +		group->mmap_pages[i] = addr;
> +	}
> +
> +	return 0;
> +out:
> +	for (; i; i--) {
> +		munmap(group->mmap_pages[i - 1], page_size);
> +		group->mmap_pages[i - 1] = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +static void
> +cleanup_events(struct rte_pmu_event_group *group)
> +{
> +	unsigned int i;
> +
> +	if (group->fds[0] != -1)
> +		ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);
> +
> +	for (i = 0; i < rte_pmu.num_group_events; i++) {
> +		if (group->mmap_pages[i]) {
> +			munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE));
> +			group->mmap_pages[i] = NULL;
> +		}
> +
> +		if (group->fds[i] != -1) {
> +			close(group->fds[i]);
> +			group->fds[i] = -1;
> +		}
> +	}
> +
> +	group->enabled = false;
> +}
> +
> +int
> +__rte_pmu_enable_group(struct rte_pmu_event_group *group)
> +{
> +	int ret;
> +
> +	if (rte_pmu.num_group_events == 0)
> +		return -ENODEV;
> +
> +	ret = open_events(group);
> +	if (ret)
> +		goto out;
> +
> +	ret = mmap_events(group);
> +	if (ret)
> +		goto out;
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
> +		ret = -errno;
> +		goto out;
> +	}
> +
> +	group->enabled = true;
> +
> +	return 0;
> +out:
> +	cleanup_events(group);
> +
> +	return ret;
> +}
> +
> +static int
> +scan_pmus(void)
> +{
> +	char path[PATH_MAX];
> +	struct dirent *dent;
> +	const char *name;
> +	DIR *dirp;
> +
> +	dirp = opendir(EVENT_SOURCE_DEVICES_PATH);
> +	if (dirp == NULL)
> +		return -errno;
> +
> +	while ((dent = readdir(dirp))) {
> +		name = dent->d_name;
> +		if (name[0] == '.')
> +			continue;
> +
> +		/* sysfs entry should either contain cpus or be a cpu */
> +		if (!strcmp(name, "cpu"))
> +			break;
> +
> +		snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", name);
> +		if (access(path, F_OK) == 0)
> +			break;
> +	}
> +
> +	if (dent) {
> +		rte_pmu.name = strdup(name);
> +		if (rte_pmu.name == NULL) {
> +			closedir(dirp);
> +
> +			return -ENOMEM;
> +		}
> +	}
> +
> +	closedir(dirp);
> +
> +	return rte_pmu.name ? 0 : -ENODEV;
> +}
> +
> +static struct rte_pmu_event *
> +new_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +
> +	event = calloc(1, sizeof(*event));
> +	if (event == NULL)
> +		goto out;
> +
> +	event->name = strdup(name);
> +	if (event->name == NULL) {
> +		free(event);
> +		event = NULL;
> +	}
> +
> +out:
> +	return event;
> +}
> +
> +static void
> +free_event(struct rte_pmu_event *event)
> +{
> +	free(event->name);
> +	free(event);
> +}
> +
> +int
> +rte_pmu_add_event(const char *name)
> +{
> +	struct rte_pmu_event *event;
> +	char path[PATH_MAX];
> +
> +	if (!rte_pmu.initialized) {
> +		PMU_LOG(ERR, "PMU is not initialized");
> +		return -ENODEV;
> +	}
> +
> +	if (rte_pmu.num_group_events + 1 >= RTE_MAX_NUM_GROUP_EVENTS) {
> +		PMU_LOG(ERR, "Excessive number of events in a group (%d > %d)",
> +			rte_pmu.num_group_events, RTE_MAX_NUM_GROUP_EVENTS);
> +		return -ENOSPC;
> +	}
> +
> +	snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s", rte_pmu.name, name);
> +	if (access(path, R_OK)) {
> +		PMU_LOG(ERR, "Cannot access %s", path);
> +		return -ENODEV;
> +	}
> +
> +	TAILQ_FOREACH(event, &rte_pmu.event_list, next) {
> +		if (strcmp(event->name, name))
> +			continue;
> +
> +		return event->index;
> +	}
> +
> +	event = new_event(name);
> +	if (event == NULL) {
> +		PMU_LOG(ERR, "Failed to create event %s", name);
> +		return -ENOMEM;
> +	}
> +
> +	event->index = rte_pmu.num_group_events++;
> +	TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next);
> +
> +	return event->index;
> +}
> +
> +int
> +rte_pmu_init(void)
> +{
> +	int ret;
> +
> +	if (rte_pmu.initialized)
> +		return 0;
> +
> +	ret = scan_pmus();
> +	if (ret) {
> +		PMU_LOG(ERR, "Failed to scan for event sources");
> +		goto out;
> +	}
> +
> +	ret = pmu_arch_init();
> +	if (ret) {
> +		PMU_LOG(ERR, "Failed to setup arch internals");
> +		goto out;
> +	}
> +
> +	TAILQ_INIT(&rte_pmu.event_list);
> +	rte_pmu.initialized = 1;
> +out:
> +	if (ret) {
> +		free(rte_pmu.name);
> +		rte_pmu.name = NULL;
> +	}
> +
> +	return ret;
> +}
> +
> +void
> +rte_pmu_fini(void)
> +{
> +	struct rte_pmu_event *event, *tmp_event;
> +	struct rte_pmu_event_group *group;
> +	unsigned int i;
> +
> +	if (!rte_pmu.initialized)
> +		return;
> +
> +	RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) {
> +		TAILQ_REMOVE(&rte_pmu.event_list, event, next);
> +		free_event(event);
> +	}
> +
> +	for (i = 0; i < RTE_DIM(rte_pmu.event_groups); i++) {
> +		group = &rte_pmu.event_groups[i];
> +		if (!group->enabled)
> +			continue;
> +
> +		cleanup_events(group);
> +	}
> +
> +	pmu_arch_fini();
> +	free(rte_pmu.name);
> +	rte_pmu.name = NULL;
> +	rte_pmu.num_group_events = 0;
> +}
> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
> new file mode 100644
> index 0000000000..7d10a0dc56
> --- /dev/null
> +++ b/lib/pmu/rte_pmu.h
> @@ -0,0 +1,205 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2024 Marvell
> + */
> +
> +#ifndef _RTE_PMU_H_
> +#define _RTE_PMU_H_
> +
> +/**
> + * @file
> + *
> + * PMU event tracing operations
> + *
> + * This file defines generic API and types necessary to setup PMU and
> + * read selected counters in runtime. Exported APIs are generally not MT-safe.
> + * One exception is rte_pmu_read() which can be called concurrently once
> + * everything has been setup.
> + */
> +

From reading the code, I can see that the PMU API is not MT-safe.
The only function that can run in parallel is rte_pmu_read(), correct?
All other combinations, say rte_pmu_read() together with rte_pmu_add_event(),
are not possible, right?
If so, it is probably worth stating that explicitly in the public API
comments; after all, an extra comment doesn't hurt.
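
To make the question concrete, here is the calling contract as I read it,
written as a small standalone model rather than real DPDK code. Every
sketch_* name is a hypothetical stand-in, and the -EBUSY check is only my
illustration of the rule; the patch does not enforce anything like it.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the calling contract; none of these names exist in DPDK. */
struct sketch_pmu {
	bool initialized;     /* set by the single control thread */
	bool readers_running; /* true once worker lcores may use the read path */
	int num_events;
};

/* Control path: caller serializes these and finishes before readers start. */
static int
sketch_add_event(struct sketch_pmu *pmu)
{
	if (!pmu->initialized)
		return -ENODEV;
	if (pmu->readers_running)
		return -EBUSY; /* would race with concurrent readers */
	return pmu->num_events++;
}

/* Marks the end of the setup phase, e.g. right before launching workers. */
static void
sketch_start_readers(struct sketch_pmu *pmu)
{
	pmu->readers_running = true;
}

/* Data path: the only call expected to run concurrently, one caller per lcore. */
static uint64_t
sketch_read(const struct sketch_pmu *pmu, int index)
{
	if (!pmu->initialized || index < 0 || index >= pmu->num_events)
		return 0; /* mirrors rte_pmu_read() returning 0 on error */
	return (uint64_t)index + 1; /* placeholder value, not a real counter */
}
```

If that matches the intent, a sentence per function in rte_pmu.h saying
which phase it belongs to would be enough.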
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include
> +#include
> +
> +#include
> +#include
> +#include
> +#include
> +#include
> +
> +/** Maximum number of events in a group */
> +#define RTE_MAX_NUM_GROUP_EVENTS 8
> +
> +/**
> + * A structure describing a group of events.
> + */
> +struct __rte_cache_aligned rte_pmu_event_group {
> +	/** array of user pages */
> +	struct perf_event_mmap_page *mmap_pages[RTE_MAX_NUM_GROUP_EVENTS];
> +	int fds[RTE_MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> +	TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
> +	bool enabled; /**< true if group was enabled on particular lcore */
> +};
> +
> +/**
> + * A PMU state container.
> + */
> +struct rte_pmu {
> +	struct rte_pmu_event_group event_groups[RTE_MAX_LCORE]; /**< event groups */
> +	unsigned int num_group_events; /**< number of events in a group */
> +	unsigned int initialized; /**< initialization counter */
> +	char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> +};
> +
> +/** PMU state container */
> +extern struct rte_pmu rte_pmu;
> +
> +/** Each architecture supporting PMU needs to provide its own version */
> +#ifndef rte_pmu_pmc_read
> +#define rte_pmu_pmc_read(index) ({ (void)(index); 0; })
> +#endif
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read PMU counter.
> + *
> + * @warning This should not be called directly.
> + *
> + * @param pc
> + *   Pointer to the mmapped user page.
> + * @return
> + *   Counter value read from hardware.
> + */
> +__rte_experimental
> +static __rte_always_inline uint64_t
> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +#define __RTE_PMU_READ_ONCE(x) (*(const volatile typeof(x) *)&(x))
> +	uint64_t width, offset;
> +	uint32_t seq, index;
> +	int64_t pmc;
> +
> +	for (;;) {
> +		seq = __RTE_PMU_READ_ONCE(pc->lock);
> +		rte_compiler_barrier();
> +		index = __RTE_PMU_READ_ONCE(pc->index);
> +		offset = __RTE_PMU_READ_ONCE(pc->offset);
> +		width = __RTE_PMU_READ_ONCE(pc->pmc_width);
> +
> +		/* index set to 0 means that particular counter cannot be used */
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +			offset += pmc;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(__RTE_PMU_READ_ONCE(pc->lock) == seq))
> +			return offset;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Enable group of events on the calling lcore.
> + *
> + * @warning This should not be called directly.
> + *
> + * @param group
> + *   Pointer to the group which will be enabled.
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +__rte_pmu_enable_group(struct rte_pmu_event_group *group);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Initialize PMU library.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_init(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Finalize PMU library.
> + */
> +__rte_experimental
> +void
> +rte_pmu_fini(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices//events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */

As I remember, this function is meant to be called from a thread bound to
one particular CPU.
If that is still the case, please state it explicitly in the comments above.

> +__rte_experimental
> +static __rte_always_inline uint64_t
> +rte_pmu_read(unsigned int index)
> +{
> +	struct rte_pmu_event_group *group = &rte_pmu.event_groups[rte_lcore_id()];

I think we'd better check the value returned by rte_lcore_id() here before
using it as an index.

> +	int ret;
> +
> +	if (unlikely(!rte_pmu.initialized))
> +		return 0;
> +
> +	if (unlikely(!group->enabled)) {
> +		ret = __rte_pmu_enable_group(group);
> +		if (ret)
> +			return 0;
> +	}
> +
> +	if (unlikely(index >= rte_pmu.num_group_events))
> +		return 0;
> +
> +	return __rte_pmu_read_userpage(group->mmap_pages[index]);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_PMU_H_ */
> diff --git a/lib/pmu/version.map b/lib/pmu/version.map
> new file mode 100644
> index 0000000000..d0f907d13d
> --- /dev/null
> +++ b/lib/pmu/version.map
> @@ -0,0 +1,13 @@
> +EXPERIMENTAL {
> +	global:
> +
> +	# added in 24.11
> +	__rte_pmu_enable_group;
> +	rte_pmu;
> +	rte_pmu_add_event;
> +	rte_pmu_fini;
> +	rte_pmu_init;
> +	rte_pmu_read;
> +
> +	local: *;
> +};
> --
> 2.34.1
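
Regarding the rte_lcore_id() remark above: a minimal sketch of the kind of
bounds check I have in mind, for the case where the caller is a non-EAL
thread and rte_lcore_id() returns LCORE_ID_ANY (UINT32_MAX). To keep it
buildable without DPDK, SKETCH_MAX_LCORE, SKETCH_LCORE_ID_ANY and the
sketch_* functions are stand-ins I made up; only the comparison against the
array size is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins (assumptions) so the sketch builds without DPDK headers. */
#define SKETCH_MAX_LCORE 128u          /* plays the role of RTE_MAX_LCORE */
#define SKETCH_LCORE_ID_ANY UINT32_MAX /* plays the role of LCORE_ID_ANY */

struct sketch_group {
	int enabled;
	uint64_t value; /* placeholder for the counter that would be read */
};

static struct sketch_group sketch_groups[SKETCH_MAX_LCORE];

/* Return the per-lcore group, or NULL when lcore_id cannot index the array. */
static struct sketch_group *
sketch_group_for_lcore(unsigned int lcore_id)
{
	if (lcore_id >= SKETCH_MAX_LCORE)
		return NULL;
	return &sketch_groups[lcore_id];
}

/* Read path with the check in front; 0 on error, as rte_pmu_read() documents. */
static uint64_t
sketch_pmu_read(unsigned int lcore_id)
{
	struct sketch_group *group = sketch_group_for_lcore(lcore_id);

	if (group == NULL)
		return 0;
	return group->value;
}
```

In rte_pmu_read() itself this would presumably be a single unlikely() branch
returning 0 before the array is indexed; whether that cost is acceptable on
the fast path is for the author to judge.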