From: Konstantin Ananyev
To: Tomasz Duszynski , Thomas Monjalon
CC: "Ruifeng.Wang@arm.com" , "bruce.richardson@intel.com" , "david.marchand@redhat.com" , "dev@dpdk.org" , "jerinj@marvell.com" , "konstantin.v.ananyev@yandex.ru" , "mattias.ronnblom@ericsson.com" , "mb@smartsharesystems.com" , "roretzla@linux.microsoft.com" , "zhoumin@loongson.cn" , "stephen@networkplumber.org"
Subject: RE: [PATCH v14 1/4] lib: add generic support for reading PMU events
Date: Fri, 11 Oct 2024 11:56:04 +0000
Message-ID: <660c6a2a24fd4cd785333881208fb659@huawei.com>
References:
<20241009112308.2973903-1-tduszynski@marvell.com> <20241011094944.3586051-1-tduszynski@marvell.com> <20241011094944.3586051-2-tduszynski@marvell.com> In-Reply-To: <20241011094944.3586051-2-tduszynski@marvell.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [10.206.138.42] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org > Add support for programming PMU counters and reading their values > in runtime bypassing kernel completely. >=20 > This is especially useful in cases where CPU cores are isolated > i.e run dedicated tasks. In such cases one cannot use standard > perf utility without sacrificing latency and performance. >=20 > Signed-off-by: Tomasz Duszynski > --- > MAINTAINERS | 5 + > app/test/meson.build | 1 + > app/test/test_pmu.c | 53 +++ > doc/api/doxy-api-index.md | 3 +- > doc/api/doxy-api.conf.in | 1 + > doc/guides/prog_guide/profile_app.rst | 29 ++ > doc/guides/rel_notes/release_24_11.rst | 7 + > lib/meson.build | 1 + > lib/pmu/meson.build | 12 + > lib/pmu/pmu_private.h | 32 ++ > lib/pmu/rte_pmu.c | 463 +++++++++++++++++++++++++ > lib/pmu/rte_pmu.h | 227 ++++++++++++ > lib/pmu/version.map | 14 + > 13 files changed, 847 insertions(+), 1 deletion(-) > create mode 100644 app/test/test_pmu.c > create mode 100644 lib/pmu/meson.build > create mode 100644 lib/pmu/pmu_private.h > create mode 100644 lib/pmu/rte_pmu.c > create mode 100644 lib/pmu/rte_pmu.h > create mode 100644 lib/pmu/version.map >=20 > diff --git a/MAINTAINERS b/MAINTAINERS > index 7d171e3d45..fbd46905b6 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -1811,6 +1811,11 @@ M: Nithin Dabilpuram > M: Pavan Nikhilesh > F: lib/node/ >=20 > +PMU - EXPERIMENTAL > +M: Tomasz 
Duszynski > +F: lib/pmu/ > +F: app/test/test_pmu* > + >=20 > Test Applications > ----------------- > diff --git a/app/test/meson.build b/app/test/meson.build > index e29258e6ec..45f56d8aae 100644 > --- a/app/test/meson.build > +++ b/app/test/meson.build > @@ -139,6 +139,7 @@ source_file_deps =3D { > 'test_pmd_perf.c': ['ethdev', 'net'] + packet_burst_generator_deps, > 'test_pmd_ring.c': ['net_ring', 'ethdev', 'bus_vdev'], > 'test_pmd_ring_perf.c': ['ethdev', 'net_ring', 'bus_vdev'], > + 'test_pmu.c': ['pmu'], > 'test_power.c': ['power'], > 'test_power_cpufreq.c': ['power'], > 'test_power_intel_uncore.c': ['power'], > diff --git a/app/test/test_pmu.c b/app/test/test_pmu.c > new file mode 100644 > index 0000000000..79376ea2e8 > --- /dev/null > +++ b/app/test/test_pmu.c > @@ -0,0 +1,53 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(C) 2024 Marvell International Ltd. > + */ > + > +#include > + > +#include "test.h" > + > +static int > +test_pmu_read(void) > +{ > + const char *name =3D NULL; > + int tries =3D 10, event; > + uint64_t val =3D 0; > + > + if (name =3D=3D NULL) { > + printf("PMU not supported on this arch\n"); > + return TEST_SKIPPED; > + } > + > + if (rte_pmu_init() =3D=3D -ENOTSUP) { > + printf("pmu_autotest only supported on Linux, skipping test\n"); > + return TEST_SKIPPED; > + } > + if (rte_pmu_init() < 0) > + return TEST_SKIPPED; > + > + event =3D rte_pmu_add_event(name); > + while (tries--) > + val +=3D rte_pmu_read(event); > + > + rte_pmu_fini(); > + > + return val ? 
TEST_SUCCESS : TEST_FAILED; > +} > + > +static struct unit_test_suite pmu_tests =3D { > + .suite_name =3D "pmu autotest", > + .setup =3D NULL, > + .teardown =3D NULL, > + .unit_test_cases =3D { > + TEST_CASE(test_pmu_read), > + TEST_CASES_END() > + } > +}; > + > +static int > +test_pmu(void) > +{ > + return unit_test_suite_runner(&pmu_tests); > +} > + > +REGISTER_FAST_TEST(pmu_autotest, true, true, test_pmu); > diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md > index f9f0300126..805efc6520 100644 > --- a/doc/api/doxy-api-index.md > +++ b/doc/api/doxy-api-index.md > @@ -237,7 +237,8 @@ The public API headers are grouped by topics: > [log](@ref rte_log.h), > [errno](@ref rte_errno.h), > [trace](@ref rte_trace.h), > - [trace_point](@ref rte_trace_point.h) > + [trace_point](@ref rte_trace_point.h), > + [pmu](@ref rte_pmu.h) >=20 > - **misc**: > [EAL config](@ref rte_eal.h), > diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in > index a8823c046f..658490b6a2 100644 > --- a/doc/api/doxy-api.conf.in > +++ b/doc/api/doxy-api.conf.in > @@ -69,6 +69,7 @@ INPUT =3D @TOPDIR@/doc/api/doxy-api-i= ndex.md \ > @TOPDIR@/lib/pdcp \ > @TOPDIR@/lib/pdump \ > @TOPDIR@/lib/pipeline \ > + @TOPDIR@/lib/pmu \ > @TOPDIR@/lib/port \ > @TOPDIR@/lib/power \ > @TOPDIR@/lib/ptr_compress \ > diff --git a/doc/guides/prog_guide/profile_app.rst b/doc/guides/prog_guid= e/profile_app.rst > index a6b5fb4d5e..854c515a61 100644 > --- a/doc/guides/prog_guide/profile_app.rst > +++ b/doc/guides/prog_guide/profile_app.rst > @@ -7,6 +7,35 @@ Profile Your Application > The following sections describe methods of profiling DPDK applications o= n > different architectures. >=20 > +Performance counter based profiling > +----------------------------------- > + > +Majority of architectures support some performance monitoring unit (PMU)= . > +Such unit provides programmable counters that monitor specific events. > + > +Different tools gather that information, like for example perf. 
> +However, in some scenarios when CPU cores are isolated and run
> +dedicated tasks interrupting those tasks with perf may be undesirable.
> +
> +In such cases, an application can use the PMU library to read such events via ``rte_pmu_read()``.
> +
> +By default, userspace applications are not allowed to access PMU internals. That can be changed
> +by setting ``/sys/kernel/perf_event_paranoid`` to 2 (that should be a default value anyway) and
> +adding ``CAP_PERFMON`` capability to DPDK application. Please refer to
> +``Documentation/admin-guide/perf-security.rst`` under Linux sources for more information. Fairly
> +recent kernel, i.e >= 5.9, is advised too.
> +
> +As of now implementation imposes certain limitations:
> +
> +* Management APIs that normally return a non-negative value will return error (``-ENOTSUP``) while
> +  ``rte_pmu_read()`` will return ``UINT64_MAX`` if running under unsupported operating system.
> +
> +* Only EAL lcores are supported

Not that I am against that limitation - it is quite common within DPDK...
but reading through the code, I am not sure where it actually comes from?

> +
> +* EAL lcores must not share a cpu
> +
> +* Each EAL lcore measures same group of events
> +
> 
>  Profiling on x86
>  ----------------
> diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
> index 53d8661365..06fa294787 100644
> --- a/doc/guides/rel_notes/release_24_11.rst
> +++ b/doc/guides/rel_notes/release_24_11.rst
> @@ -150,6 +150,13 @@ New Features
> 
>  * Added independent enqueue feature.
> 
> +* **Added PMU library.**
> +
> +  Added a new performance monitoring unit (PMU) library which allows applications
> +  to perform self monitoring activities without depending on external utilities like perf.
> +  After integration with :doc:`../prog_guide/trace_lib` data gathered from hardware counters
> +  can be stored in CTF format for further analysis.
> + >=20 > Removed Items > ------------- > diff --git a/lib/meson.build b/lib/meson.build > index 162287753f..cc7a4bd535 100644 > --- a/lib/meson.build > +++ b/lib/meson.build > @@ -13,6 +13,7 @@ libraries =3D [ > 'kvargs', # eal depends on kvargs > 'argparse', > 'telemetry', # basic info querying > + 'pmu', > 'eal', # everything depends on eal > 'ptr_compress', > 'ring', > diff --git a/lib/pmu/meson.build b/lib/pmu/meson.build > new file mode 100644 > index 0000000000..386232e5c7 > --- /dev/null > +++ b/lib/pmu/meson.build > @@ -0,0 +1,12 @@ > +# SPDX-License-Identifier: BSD-3-Clause > +# Copyright(C) 2024 Marvell International Ltd. > + > +headers =3D files('rte_pmu.h') > + > +if not is_linux > + subdir_done() > +endif > + > +sources =3D files('rte_pmu.c') > + > +deps +=3D ['log'] > diff --git a/lib/pmu/pmu_private.h b/lib/pmu/pmu_private.h > new file mode 100644 > index 0000000000..d2b15615bf > --- /dev/null > +++ b/lib/pmu/pmu_private.h > @@ -0,0 +1,32 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2024 Marvell > + */ > + > +#ifndef _PMU_PRIVATE_H_ > +#define _PMU_PRIVATE_H_ > + > +/** > + * Architecture specific PMU init callback. > + * > + * @return > + * 0 in case of success, negative value otherwise. > + */ > +int > +pmu_arch_init(void); > + > +/** > + * Architecture specific PMU cleanup callback. > + */ > +void > +pmu_arch_fini(void); > + > +/** > + * Apply architecture specific settings to config before passing it to s= yscall. > + * > + * @param config > + * Architecture specific event configuration. Consult kernel sources f= or available options. > + */ > +void > +pmu_arch_fixup_config(uint64_t config[3]); > + > +#endif /* _PMU_PRIVATE_H_ */ > diff --git a/lib/pmu/rte_pmu.c b/lib/pmu/rte_pmu.c > new file mode 100644 > index 0000000000..3a861bca71 > --- /dev/null > +++ b/lib/pmu/rte_pmu.c > @@ -0,0 +1,463 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(C) 2024 Marvell International Ltd. 
> + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "pmu_private.h" > + > +#define EVENT_SOURCE_DEVICES_PATH "/sys/bus/event_source/devices" > + > +#define GENMASK_ULL(h, l) ((~0ULL - (1ULL << (l)) + 1) & (~0ULL >> ((64 = - 1 - (h))))) Would RTE_GENMASK64(high, low) work for you here? =20 > +#define FIELD_PREP(m, v) (((uint64_t)(v) << (__builtin_ffsll(m) - 1)) & = (m)) > + > +/* A structure describing an event */ > +struct rte_pmu_event { > + char *name; > + unsigned int index; > + TAILQ_ENTRY(rte_pmu_event) next; > +}; > + > +RTE_DEFINE_PER_LCORE(struct rte_pmu_event_group, _event_group); > +struct rte_pmu rte_pmu; > + > +/* > + * Following __rte_weak functions provide default no-op. Architectures s= hould override them if > + * necessary. > + */ > + > +int > +__rte_weak pmu_arch_init(void) > +{ > + return 0; > +} > + > +void > +__rte_weak pmu_arch_fini(void) > +{ > +} > + > +void > +__rte_weak pmu_arch_fixup_config(uint64_t __rte_unused config[3]) > +{ > +} > + > +static int > +get_term_format(const char *name, int *num, uint64_t *mask) > +{ > + char path[PATH_MAX]; > + char *config =3D NULL; > + int high, low, ret; > + FILE *fp; > + > + *num =3D *mask =3D 0; > + snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/format/%s",= rte_pmu.name, name); > + fp =3D fopen(path, "r"); > + if (fp =3D=3D NULL) > + return -errno; > + > + errno =3D 0; > + ret =3D fscanf(fp, "%m[^:]:%d-%d", &config, &low, &high); > + if (ret < 2) { > + ret =3D -ENODATA; > + goto out; > + } > + if (errno) { > + ret =3D -errno; > + goto out; > + } > + > + if (ret =3D=3D 2) > + high =3D low; > + > + *mask =3D GENMASK_ULL(high, low); > + /* Last digit should be [012]. If last digit is missing 0 is implied. *= / > + *num =3D config[strlen(config) - 1]; > + *num =3D isdigit(*num) ? 
*num - '0' : 0; > + > + ret =3D 0; > +out: > + free(config); > + fclose(fp); > + > + return ret; > +} > + > +static int > +parse_event(char *buf, uint64_t config[3]) > +{ > + char *token, *term; > + int num, ret, val; > + uint64_t mask; > + > + config[0] =3D config[1] =3D config[2] =3D 0; > + > + token =3D strtok(buf, ","); > + while (token) { > + errno =3D 0; > + /* =3D */ > + ret =3D sscanf(token, "%m[^=3D]=3D%i", &term, &val); > + if (ret < 1) > + return -ENODATA; > + if (errno) > + return -errno; > + if (ret =3D=3D 1) > + val =3D 1; > + > + ret =3D get_term_format(term, &num, &mask); > + free(term); > + if (ret) > + return ret; > + > + config[num] |=3D FIELD_PREP(mask, val); > + token =3D strtok(NULL, ","); > + } > + > + return 0; > +} > + > +static int > +get_event_config(const char *name, uint64_t config[3]) > +{ > + char path[PATH_MAX], buf[BUFSIZ]; > + FILE *fp; > + int ret; > + > + snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s",= rte_pmu.name, name); > + fp =3D fopen(path, "r"); > + if (fp =3D=3D NULL) > + return -errno; > + > + ret =3D fread(buf, 1, sizeof(buf), fp); > + if (ret =3D=3D 0) { > + fclose(fp); > + > + return -EINVAL; > + } > + fclose(fp); > + buf[ret] =3D '\0'; > + > + return parse_event(buf, config); > +} > + > +static int > +do_perf_event_open(uint64_t config[3], int group_fd) > +{ > + struct perf_event_attr attr =3D { > + .size =3D sizeof(struct perf_event_attr), > + .type =3D PERF_TYPE_RAW, > + .exclude_kernel =3D 1, > + .exclude_hv =3D 1, > + .disabled =3D 1, > + .pinned =3D group_fd =3D=3D -1, > + }; > + > + pmu_arch_fixup_config(config); > + > + attr.config =3D config[0]; > + attr.config1 =3D config[1]; > + attr.config2 =3D config[2]; > + > + return syscall(SYS_perf_event_open, &attr, 0, -1, group_fd, 0); > +} > + > +static int > +open_events(struct rte_pmu_event_group *group) > +{ > + struct rte_pmu_event *event; > + uint64_t config[3]; > + int num =3D 0, ret; > + > + /* group leader gets created first, with fd 
=3D -1 */ > + group->fds[0] =3D -1; > + > + TAILQ_FOREACH(event, &rte_pmu.event_list, next) { > + ret =3D get_event_config(event->name, config); > + if (ret) > + continue; > + > + ret =3D do_perf_event_open(config, group->fds[0]); > + if (ret =3D=3D -1) { > + ret =3D -errno; > + goto out; > + } > + > + group->fds[event->index] =3D ret; > + num++; > + } > + > + return 0; > +out: > + for (--num; num >=3D 0; num--) { > + close(group->fds[num]); > + group->fds[num] =3D -1; > + } > + > + > + return ret; > +} > + > +static int > +mmap_events(struct rte_pmu_event_group *group) > +{ > + long page_size =3D sysconf(_SC_PAGE_SIZE); > + unsigned int i; > + void *addr; > + int ret; > + > + for (i =3D 0; i < rte_pmu.num_group_events; i++) { > + addr =3D mmap(0, page_size, PROT_READ, MAP_SHARED, group->fds[i], 0); > + if (addr =3D=3D MAP_FAILED) { > + ret =3D -errno; > + goto out; > + } > + > + group->mmap_pages[i] =3D addr; > + } > + > + return 0; > +out: > + for (; i; i--) { > + munmap(group->mmap_pages[i - 1], page_size); > + group->mmap_pages[i - 1] =3D NULL; > + } > + > + return ret; > +} > + > +static void > +cleanup_events(struct rte_pmu_event_group *group) > +{ > + unsigned int i; > + > + if (group->fds[0] !=3D -1) > + ioctl(group->fds[0], PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP); > + > + for (i =3D 0; i < rte_pmu.num_group_events; i++) { > + if (group->mmap_pages[i]) { > + munmap(group->mmap_pages[i], sysconf(_SC_PAGE_SIZE)); > + group->mmap_pages[i] =3D NULL; > + } > + > + if (group->fds[i] !=3D -1) { > + close(group->fds[i]); > + group->fds[i] =3D -1; > + } > + } > + > + group->enabled =3D false; > +} > + > +int > +__rte_pmu_enable_group(void) > +{ > + struct rte_pmu_event_group *group =3D &RTE_PER_LCORE(_event_group); > + int ret; > + > + if (rte_pmu.num_group_events =3D=3D 0) > + return -ENODEV; > + > + ret =3D open_events(group); > + if (ret) > + goto out; > + > + ret =3D mmap_events(group); > + if (ret) > + goto out; > + > + if (ioctl(group->fds[0], 
PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) =3D= =3D -1) { > + ret =3D -errno; > + goto out; > + } > + > + if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) = =3D=3D -1) { > + ret =3D -errno; > + goto out; > + } > + > + rte_spinlock_lock(&rte_pmu.lock); > + TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next); > + rte_spinlock_unlock(&rte_pmu.lock); I thought that after previous round of reviews, we got a consensus that it is a bad idea to insert pointer of TLS variable into the global lis= t: https://patchwork.dpdk.org/project/dpdk/patch/20230216175502.3164820-2-tdus= zynski@marvell.com/ As I said before - I believe it is error prone, and needs to be addressed. There are plenty of other ways to have one object per lcore (if that's real= ly necessary): array of RTE_MAX_LCORE indexed by lcore_id, or new rte_lcore_var.h API intr= oduced by Mattias recently, or even simple TLS pointer with malloc() behind would = be better. > + group->enabled =3D true; > + > + return 0; > + > +out: > + cleanup_events(group); > + > + return ret; > +} > + > +static int > +scan_pmus(void) > +{ > + char path[PATH_MAX]; > + struct dirent *dent; > + const char *name; > + DIR *dirp; > + > + dirp =3D opendir(EVENT_SOURCE_DEVICES_PATH); > + if (dirp =3D=3D NULL) > + return -errno; > + > + while ((dent =3D readdir(dirp))) { > + name =3D dent->d_name; > + if (name[0] =3D=3D '.') > + continue; > + > + /* sysfs entry should either contain cpus or be a cpu */ > + if (!strcmp(name, "cpu")) > + break; > + > + snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/cpus", nam= e); > + if (access(path, F_OK) =3D=3D 0) > + break; > + } > + > + if (dent) { > + rte_pmu.name =3D strdup(name); > + if (rte_pmu.name =3D=3D NULL) { > + closedir(dirp); > + > + return -ENOMEM; > + } > + } > + > + closedir(dirp); > + > + return rte_pmu.name ? 
0 : -ENODEV; > +} > + > +static struct rte_pmu_event * > +new_event(const char *name) > +{ > + struct rte_pmu_event *event; > + > + event =3D calloc(1, sizeof(*event)); > + if (event =3D=3D NULL) > + goto out; > + > + event->name =3D strdup(name); > + if (event->name =3D=3D NULL) { > + free(event); > + event =3D NULL; > + } > + > +out: > + return event; > +} > + > +static void > +free_event(struct rte_pmu_event *event) > +{ > + free(event->name); > + free(event); > +} > + > +int > +rte_pmu_add_event(const char *name) > +{ > + struct rte_pmu_event *event; > + char path[PATH_MAX]; > + > + if (rte_pmu.name =3D=3D NULL) > + return -ENODEV; > + > + if (rte_pmu.num_group_events + 1 >=3D RTE_MAX_NUM_GROUP_EVENTS) > + return -ENOSPC; > + > + snprintf(path, sizeof(path), EVENT_SOURCE_DEVICES_PATH "/%s/events/%s",= rte_pmu.name, name); > + if (access(path, R_OK)) > + return -ENODEV; > + > + TAILQ_FOREACH(event, &rte_pmu.event_list, next) { > + if (!strcmp(event->name, name)) > + return event->index; > + continue; > + } > + > + event =3D new_event(name); > + if (event =3D=3D NULL) > + return -ENOMEM; > + > + event->index =3D rte_pmu.num_group_events++; > + TAILQ_INSERT_TAIL(&rte_pmu.event_list, event, next); > + > + return event->index; > +} > + > +int > +rte_pmu_init(void) > +{ > + int ret; > + > + /* Allow calling init from multiple contexts within a single thread. Th= is simplifies > + * resource management a bit e.g in case fast-path tracepoint has alrea= dy been enabled > + * via command line but application doesn't care enough and performs in= it/fini again. > + */ > + if (rte_atomic_fetch_add_explicit(&rte_pmu.initialized, 1, rte_memory_o= rder_seq_cst) !=3D 0) > + return 0; Not sure I understand the comment above: rte_pmu_init() supposed to be MT s= afe or not? 
If yes, then that function probably need to wait till initialization finish= es.=20 > + > + ret =3D scan_pmus(); > + if (ret) > + goto out; > + > + ret =3D pmu_arch_init(); > + if (ret) > + goto out; > + > + TAILQ_INIT(&rte_pmu.event_list); > + TAILQ_INIT(&rte_pmu.event_group_list); > + rte_spinlock_init(&rte_pmu.lock); > + > + return 0; > +out: > + free(rte_pmu.name); > + rte_pmu.name =3D NULL; > + > + return ret; > +} > + > +void > +rte_pmu_fini(void) > +{ > + struct rte_pmu_event_group *group, *tmp_group; > + struct rte_pmu_event *event, *tmp_event; > + > + /* cleanup once init count drops to zero */ > + if (rte_atomic_fetch_sub_explicit(&rte_pmu.initialized, 1, > + rte_memory_order_seq_cst) - 1 !=3D 0) > + return; > + > + RTE_TAILQ_FOREACH_SAFE(event, &rte_pmu.event_list, next, tmp_event) { > + TAILQ_REMOVE(&rte_pmu.event_list, event, next); > + free_event(event); > + } > + > + RTE_TAILQ_FOREACH_SAFE(group, &rte_pmu.event_group_list, next, tmp_grou= p) { > + TAILQ_REMOVE(&rte_pmu.event_group_list, group, next); > + cleanup_events(group); > + } > + > + pmu_arch_fini(); > + free(rte_pmu.name); > + rte_pmu.name =3D NULL; > + rte_pmu.num_group_events =3D 0; > +} > diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h > new file mode 100644 > index 0000000000..09238ee33d > --- /dev/null > +++ b/lib/pmu/rte_pmu.h > @@ -0,0 +1,227 @@ > +/* SPDX-License-Identifier: BSD-3-Clause > + * Copyright(c) 2024 Marvell > + */ > + > +#ifndef _RTE_PMU_H_ > +#define _RTE_PMU_H_ > + > +/** > + * @file > + * > + * PMU event tracing operations > + * > + * This file defines generic API and types necessary to setup PMU and > + * read selected counters in runtime. 
> + */ > + > +#ifdef __cplusplus > +extern "C" { > +#endif > + > +#include > + > +#include > +#include > + > +#ifdef RTE_EXEC_ENV_LINUX > + > +#include > + > +#include > +#include > +#include > + > +/** Maximum number of events in a group */ > +#define RTE_MAX_NUM_GROUP_EVENTS 8 > + > +/** > + * A structure describing a group of events. > + */ > +struct __rte_cache_aligned rte_pmu_event_group { > + /** array of user pages */ > + struct perf_event_mmap_page *mmap_pages[RTE_MAX_NUM_GROUP_EVENTS]; > + int fds[RTE_MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */ > + bool enabled; /**< true if group was enabled on particular lcore */ > + TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */ > +}; > + > +/** > + * A PMU state container. > + */ > +struct rte_pmu { > + char *name; /**< name of core PMU listed under /sys/bus/event_source/de= vices */ > + rte_spinlock_t lock; /**< serialize access to event group list */ > + TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event = groups */ > + unsigned int num_group_events; /**< number of events in a group */ > + TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */ > + unsigned int initialized; /**< initialization counter */ > +}; > + > +/** lcore event group */ > +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group); > + > +/** PMU state container */ > +extern struct rte_pmu rte_pmu; > + > +/** Each architecture supporting PMU needs to provide its own version */ > +#ifndef rte_pmu_pmc_read > +#define rte_pmu_pmc_read(index) ({ (void)(index); 0; }) > +#endif > + > +/** > + * @warning > + * @b EXPERIMENTAL: this API may change without prior notice > + * > + * Read PMU counter. > + * > + * @warning This should be not called directly. > + * > + * @param pc > + * Pointer to the mmapped user page. > + * @return > + * Counter value read from hardware. 
+ */

As I already asked you several times in previous review rounds:
https://patchwork.dpdk.org/project/dpdk/patch/20230216175502.3164820-2-tduszynski@marvell.com/
please add a detailed comment here, i.e. that for this function (and rte_pmu_read())
to work properly, it must be used only on a thread that is *always* bound to just one CPU core.

> +__rte_experimental
> +static __rte_always_inline uint64_t
> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> +{
> +#define __RTE_PMU_READ_ONCE(x) (*(const volatile typeof(x) *)&(x))
> +	uint64_t width, offset;
> +	uint32_t seq, index;
> +	int64_t pmc;
> +
> +	for (;;) {
> +		seq = __RTE_PMU_READ_ONCE(pc->lock);
> +		rte_compiler_barrier();
> +		index = __RTE_PMU_READ_ONCE(pc->index);
> +		offset = __RTE_PMU_READ_ONCE(pc->offset);
> +		width = __RTE_PMU_READ_ONCE(pc->pmc_width);
> +
> +		/* index set to 0 means that particular counter cannot be used */
> +		if (likely(pc->cap_user_rdpmc && index)) {
> +			pmc = rte_pmu_pmc_read(index - 1);
> +			pmc <<= 64 - width;
> +			pmc >>= 64 - width;
> +			offset += pmc;
> +		}
> +
> +		rte_compiler_barrier();
> +
> +		if (likely(__RTE_PMU_READ_ONCE(pc->lock) == seq))
> +			return offset;
> +	}
> +
> +	return 0;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Enable group of events on the calling lcore.
> + *
> + * @warning This should be not called directly.
> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +__rte_pmu_enable_group(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Initialize PMU library.
> + *
> + * @warning This should be not called directly.

Same questions as before: if it is internal, then why
a) is it located in a public API .h file, and
b) why is it called directly by user code in app/test/test_pmu.c?

> + *
> + * @return
> + *   0 in case of success, negative value otherwise.
+ */
> +__rte_experimental
> +int
> +rte_pmu_init(void);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
> + */
> +__rte_experimental
> +void
> +rte_pmu_fini(void);

Hmm... why is _fini_() allowed to be called directly while _init_() isn't?

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Add event to the group of enabled events.
> + *
> + * @param name
> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> + * @return
> + *   Event index in case of success, negative value otherwise.
> + */
> +__rte_experimental
> +int
> +rte_pmu_add_event(const char *name);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Read hardware counter configured to count occurrences of an event.
> + *
> + * @param index
> + *   Index of an event to be read.
> + * @return
> + *   Event value read from register. In case of errors or lack of support
> + *   0 is returned. In other words, stream of zeros in a trace file
> + *   indicates problem with reading particular PMU event register.
> + */ > +__rte_experimental > +static __rte_always_inline uint64_t > +rte_pmu_read(unsigned int index) > +{ > + struct rte_pmu_event_group *group =3D &RTE_PER_LCORE(_event_group); > + int ret; > + > + if (unlikely(!rte_pmu.initialized)) > + return 0; > + > + if (unlikely(!group->enabled)) { > + ret =3D __rte_pmu_enable_group(); > + if (ret) > + return 0; > + } > + > + if (unlikely(index >=3D rte_pmu.num_group_events)) > + return 0; > + > + return __rte_pmu_read_userpage(group->mmap_pages[index]); > +} > + > +#else /* !RTE_EXEC_ENV_LINUX */ > + > +__rte_experimental > +static inline int rte_pmu_init(void) { return -ENOTSUP; } > + > +__rte_experimental > +static inline void rte_pmu_fini(void) { } > + > +__rte_experimental > +static inline int rte_pmu_add_event(const char *name __rte_unused) { ret= urn -ENOTSUP; } > + > +__rte_experimental > +static inline uint64_t rte_pmu_read(unsigned int index __rte_unused) { r= eturn UINT64_MAX; } > + > +#endif /* RTE_EXEC_ENV_LINUX */ > + > +#ifdef __cplusplus > +} > +#endif > + > +#endif /* _RTE_PMU_H_ */ > diff --git a/lib/pmu/version.map b/lib/pmu/version.map > new file mode 100644 > index 0000000000..d7c80ce4ce > --- /dev/null > +++ b/lib/pmu/version.map > @@ -0,0 +1,14 @@ > +EXPERIMENTAL { > + global: > + > + # added in 24.11 > + __rte_pmu_enable_group; > + per_lcore__event_group; > + rte_pmu; > + rte_pmu_add_event; > + rte_pmu_fini; > + rte_pmu_init; > + rte_pmu_read; > + > + local: *; > +}; > -- > 2.34.1