Subject: RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
Date: Wed, 14 Dec 2022 11:41:24 +0100
From: Morten Brørup
To: Tomasz Duszynski
Cc: Jerin Jacob Kollanukkaran
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8759B@smartserver.smartshare.dk>
References: <20221129092821.1304853-1-tduszynski@marvell.com>
 <20221213104350.3218167-1-tduszynski@marvell.com>
 <20221213104350.3218167-2-tduszynski@marvell.com>
 <98CBD80474FA8B44BF855DF32C47DC35D8758C@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

+CC: Mattias, see my comment below about per-thread constructor for this

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 14 December 2022 10.39
>
> Hello Morten,
>
> Thanks for the review. Answers inline.
>
> [...]
>
> > > +/**
> > > + * @file
> > > + *
> > > + * PMU event tracing operations
> > > + *
> > > + * This file defines generic API and types necessary to setup PMU and
> > > + * read selected counters in runtime.
> > > + */
> > > +
> > > +/**
> > > + * A structure describing a group of events.
> > > + */
> > > +struct rte_pmu_event_group {
> > > +	int *fds; /**< array of event descriptors */
> > > +	void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> >
> > There seems to be a lot of indirection involved here. Why are these arrays not
> > statically sized, instead of dynamically allocated?
> >
>
> Different architectures/PMUs impose limits on the number of simultaneously enabled
> counters. So, to relieve the pain of thinking about it and adding macros for each and
> every arch, I decided to allocate the number the user wants dynamically. The assumption
> also holds that the user knows about the tradeoffs of using too many counters and hence
> will not enable too many events at once.

The DPDK convention is to use fixed size arrays (with a maximum size, e.g. RTE_MAX_ETHPORTS) in the fast path, for performance reasons.

Please use fixed size arrays instead of dynamically allocated arrays.

>
> > Also, what is the reason for hiding the type, making struct perf_event_mmap_page **mmap_pages
> > opaque by using void **mmap_pages instead?
>
> I think that part doing mmap/munmap was written first, hence void **
> was chosen in the first place.

Please update it, so the actual type is reflected here.
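For illustration, this is roughly what I have in mind - an untested sketch only, with a compile time limit on the number of events per group (RTE_MAX_NUM_GROUP_EVENTS is just a placeholder name I made up here, pick whatever fits) and the concrete perf type instead of void **:

/* Untested sketch. Assumes <stdbool.h> and <linux/perf_event.h> are included. */
#define RTE_MAX_NUM_GROUP_EVENTS 8 /* placeholder name and value */

struct rte_pmu_event_group {
	/* fixed size arrays instead of dynamically allocated ones */
	int fds[RTE_MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
	/* actual type instead of void **, so the cast in rte_pmu_read() goes away */
	struct perf_event_mmap_page *mmap_pages[RTE_MAX_NUM_GROUP_EVENTS];
	bool enabled; /**< true if group was enabled on particular lcore */
};

With fixed size arrays there is one less pointer to chase per counter read, and no allocation to fail at setup time.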
>
> >
> > > +	bool enabled; /**< true if group was enabled on particular lcore */
> > > +};
> > > +
> > > +/**
> > > + * A structure describing an event.
> > > + */
> > > +struct rte_pmu_event {
> > > +	char *name; /** name of an event */
> > > +	int index; /** event index into fds/mmap_pages */
> > > +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> > > +};
> > > +
> > > +/**
> > > + * A PMU state container.
> > > + */
> > > +struct rte_pmu {
> > > +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> > > +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> > > +	int num_group_events; /**< number of events in a group */
> > > +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */

The event_list is used in the slow path only, so it can remain a list - i.e. no change requested here. :-)

> > > +};
> > > +
> > > +/** Pointer to the PMU state container */
> > > +extern struct rte_pmu *rte_pmu;
> >
> > Again, why not just extern struct rte_pmu, instead of dynamic allocation?
> >
>
> No strong opinions here since this is a matter of personal preference. Can be removed
> in the next version.

Yes, please.

>
> > > +
> > > +/** Each architecture supporting PMU needs to provide its own version */
> > > +#ifndef rte_pmu_pmc_read
> > > +#define rte_pmu_pmc_read(index) ({ 0; })
> > > +#endif
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Read PMU counter.
> > > + *
> > > + * @param pc
> > > + *   Pointer to the mmapped user page.
> > > + * @return
> > > + *   Counter value read from hardware.
> > > + */
> > > +__rte_internal
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> > > +{
> > > +	uint64_t offset, width, pmc = 0;
> > > +	uint32_t seq, index;
> > > +	int tries = 100;
> > > +
> > > +	for (;;) {

As a matter of personal preference, I would write this loop differently:

+	for (tries = 100; tries != 0; tries--) {

> > > +		seq = pc->lock;
> > > +		rte_compiler_barrier();
> > > +		index = pc->index;
> > > +		offset = pc->offset;
> > > +		width = pc->pmc_width;
> > > +
> > > +		if (likely(pc->cap_user_rdpmc && index)) {

Why "&& index"? The way I read [man perf_event_open], index 0 is perfectly valid.

[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html

> > > +			pmc = rte_pmu_pmc_read(index - 1);
> > > +			pmc <<= 64 - width;
> > > +			pmc >>= 64 - width;
> > > +		}
> > > +
> > > +		rte_compiler_barrier();
> > > +
> > > +		if (likely(pc->lock == seq))
> > > +			return pmc + offset;
> > > +
> > > +		if (--tries == 0) {
> > > +			RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> > > +			break;
> > > +		}

- Remove the 4 lines of code above, and move the debug log message to the end of the function instead:

> > > +	}

+	RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");

> > > +
> > > +	return 0;
> > > +}
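Putting the two suggestions above together, the whole retry loop could look roughly like this - untested, only to illustrate. I have kept the "&& index" test as in the patch, pending your answer to the question above:

__rte_internal
static __rte_always_inline uint64_t
rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
{
	uint64_t offset, width, pmc = 0;
	uint32_t seq, index;
	int tries;

	/* bounded retry loop instead of for (;;) with a separately maintained counter */
	for (tries = 100; tries != 0; tries--) {
		seq = pc->lock;
		rte_compiler_barrier();
		index = pc->index;
		offset = pc->offset;
		width = pc->pmc_width;

		if (likely(pc->cap_user_rdpmc && index)) {
			pmc = rte_pmu_pmc_read(index - 1);
			pmc <<= 64 - width;
			pmc >>= 64 - width;
		}

		rte_compiler_barrier();

		/* sequence number unchanged, so the snapshot is consistent */
		if (likely(pc->lock == seq))
			return pmc + offset;
	}

	/* log only once, after all retries have been exhausted */
	RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");

	return 0;
}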
> > > +
> > > +/**
> > > + * @internal
> > > + *
> > > + * Enable group of events for a given lcore.
> > > + *
> > > + * @param lcore_id
> > > + *   The identifier of the lcore.
> > > + * @return
> > > + *   0 in case of success, negative value otherwise.
> > > + */
> > > +__rte_internal
> > > +int
> > > +rte_pmu_enable_group(int lcore_id);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Add event to the group of enabled events.
> > > + *
> > > + * @param name
> > > + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> > > + * @return
> > > + *   Event index in case of success, negative value otherwise.
> > > + */
> > > +__rte_experimental
> > > +int
> > > +rte_pmu_add_event(const char *name);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Read hardware counter configured to count occurrences of an event.
> > > + *
> > > + * @param index
> > > + *   Index of an event to be read.
> > > + * @return
> > > + *   Event value read from register. In case of errors or lack of support
> > > + *   0 is returned. In other words, stream of zeros in a trace file
> > > + *   indicates problem with reading particular PMU event register.
> > > + */
> > > +__rte_experimental
> > > +static __rte_always_inline uint64_t
> > > +rte_pmu_read(int index)

The index type can be changed from int to uint32_t. This also eliminates the "index < 0" part of the comparison further below in this function.

> > > +{
> > > +	int lcore_id = rte_lcore_id();
> > > +	struct rte_pmu_event_group *group;
> > > +	int ret;
> > > +
> > > +	if (!rte_pmu)
> > > +		return 0;
> > > +
> > > +	group = &rte_pmu->group[lcore_id];
> > > +	if (!group->enabled) {

Optimized: if (unlikely(!group->enabled)) {

> > > +		ret = rte_pmu_enable_group(lcore_id);
> > > +		if (ret)
> > > +			return 0;
> > > +
> > > +		group->enabled = true;
> > > +	}
> >
> > Why is the group not enabled in the setup function, rte_pmu_add_event(),
> > instead of here, in the hot path?
> >
>
> When this is executed for the very first time, the CPU obviously has more work to do,
> but afterwards the setup path is not taken, hence far fewer CPU cycles are required.
>
> Setup is executed by the main lcore solely, before the other lcores are launched, hence
> some info passed to the SYS_perf_event_open ioctl() is missing, the pid (via rte_gettid())
> being an example here.

OK. Thank you for the explanation. Since it is impossible at setup time, it has to be done at runtime.

@Mattias: Another good example of something that would belong in per-thread constructors, as my suggested feature creep in [1].

[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/

>
> > > +
> > > +	if (index < 0 || index >= rte_pmu->num_group_events)

Optimized: if (unlikely(index >= rte_pmu.num_group_events))

> > > +		return 0;
> > > +
> > > +	return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> >
> > Using fixed size arrays instead of multiple indirections via pointers is faster. It could be:
> >
> > return rte_pmu_read_userpage((struct perf_event_mmap_page *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >
> > With or without suggested performance improvements...
> >
> > Series-acked-by: Morten Brørup
>
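To sum up, with the suggestions above applied (uint32_t index, rte_pmu as a plain global struct, fixed size arrays holding the concrete perf type, and unlikely() on the cold branches), the fast path could end up looking roughly like this - again an untested sketch, not a final implementation:

__rte_experimental
static __rte_always_inline uint64_t
rte_pmu_read(uint32_t index)
{
	int lcore_id = rte_lcore_id();
	struct rte_pmu_event_group *group = &rte_pmu.group[lcore_id];

	/* the NULL check on rte_pmu disappears once it is a plain global struct */
	if (unlikely(!group->enabled)) {
		/* first call on this lcore; perf setup must happen in the thread itself */
		if (rte_pmu_enable_group(lcore_id) != 0)
			return 0;

		group->enabled = true;
	}

	/* assumes num_group_events is made unsigned too, so no sign-compare issue */
	if (unlikely(index >= rte_pmu.num_group_events))
		return 0;

	/* no cast needed when mmap_pages has the actual perf_event_mmap_page type */
	return rte_pmu_read_userpage(group->mmap_pages[index]);
}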