Subject: RE: [PATCH v4 1/4] eal: add generic support for reading PMU events
Date: Thu, 5 Jan 2023 23:07:41 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8762C@smartserver.smartshare.dk>
References: <20221129092821.1304853-1-tduszynski@marvell.com>
 <20221213104350.3218167-1-tduszynski@marvell.com>
 <20221213104350.3218167-2-tduszynski@marvell.com>
 <98CBD80474FA8B44BF855DF32C47DC35D8758C@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D8759B@smartserver.smartshare.dk>
From: Morten Brørup
To: "Tomasz Duszynski"
Cc: "Jerin Jacob Kollanukkaran"

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Thursday, 5 January 2023 22.14
>
> Hi Morten,
>
> A few comments inline.
>
> >From: Morten Brørup
> >Sent: Wednesday, December 14, 2022 11:41 AM
> >
> >+CC: Mattias, see my comment below about per-thread constructor for this
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Wednesday, 14 December 2022 10.39
> >>
> >> Hello Morten,
> >>
> >> Thanks for the review. Answers inline.
> >>
> >> [...]
> >>
> >> > > +/**
> >> > > + * @file
> >> > > + *
> >> > > + * PMU event tracing operations
> >> > > + *
> >> > > + * This file defines generic API and types necessary to setup PMU and
> >> > > + * read selected counters in runtime.
> >> > > + */
> >> > > +
> >> > > +/**
> >> > > + * A structure describing a group of events.
> >> > > + */
> >> > > +struct rte_pmu_event_group {
> >> > > +        int *fds; /**< array of event descriptors */
> >> > > +        void **mmap_pages; /**< array of pointers to mmapped perf_event_attr structures */
> >> >
> >> > There seems to be a lot of indirection involved here. Why are these
> >> > arrays not statically sized, instead of dynamically allocated?
> >> >
> >>
> >> Different architectures/PMUs impose limits on the number of simultaneously
> >> enabled counters. So, to relieve the pain of thinking about it and adding
> >> macros for each and every arch, I decided to dynamically allocate as many
> >> as the user wants. The assumption also holds that the user knows about the
> >> tradeoffs of using too many counters and hence will not enable too many
> >> events at once.
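
To illustrate what I mean by fixed size arrays here, a rough, untested
sketch only; RTE_MAX_NUM_GROUP_EVENTS is a name I made up, and the real
limit would have to be chosen to cover the supported architectures:

        /* Hypothetical compile-time limit on simultaneously enabled events. */
        #define RTE_MAX_NUM_GROUP_EVENTS 8

        struct rte_pmu_event_group {
                /* Fixed size arrays, so the fast path avoids pointer chasing. */
                struct perf_event_mmap_page *mmap_pages[RTE_MAX_NUM_GROUP_EVENTS];
                int fds[RTE_MAX_NUM_GROUP_EVENTS];
                bool enabled; /**< true if group was enabled on particular lcore */
        };

With something like this, the add_event function could simply reject
attempts to add more than RTE_MAX_NUM_GROUP_EVENTS events.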
> >
> >The DPDK convention is to use fixed size arrays (with a maximum size,
> >e.g. RTE_MAX_ETHPORTS) in the fast path, for performance reasons.
> >
> >Please use fixed size arrays instead of dynamically allocated arrays.
> >
>
> I do agree that from a maintenance angle fixed arrays are much more
> convenient, but when optimization kicks in, that statement does not
> necessarily hold true anymore.
>
> For example, in this case performance dropped by ~0.3%, which is
> insignificant imo. So, given the simpler code, the next patchset will use
> fixed arrays.

I fail to understand how pointer chasing can perform better than obtaining
an address by multiplying by a constant. Modern CPUs work in mysterious
ways, and you obviously tested this, so I believe your test results. But in
theory, pointer chasing touches more cache lines, and should perform worse
in a loaded system where pointers in the chain have been evicted from the
cache.

Anyway, you agreed to use fixed arrays, so I am happy. :-)

>
> >>
> >> > Also, what is the reason for hiding the struct perf_event_mmap_page
> >> > **mmap_pages type behind an opaque void **mmap_pages instead?
> >>
> >> I think the part doing mmap/munmap was written first, hence void **
> >> was chosen in the first place.
> >
> >Please update it, so the actual type is reflected here.
> >
> >>
> >> >
> >> > > +        bool enabled; /**< true if group was enabled on particular lcore */
> >> > > +};
> >> > > +
> >> > > +/**
> >> > > + * A structure describing an event.
> >> > > + */
> >> > > +struct rte_pmu_event {
> >> > > +        char *name; /** name of an event */
> >> > > +        int index; /** event index into fds/mmap_pages */
> >> > > +        TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> >> > > +};
> >> > > +
> >> > > +/**
> >> > > + * A PMU state container.
> >> > > + */
> >> > > +struct rte_pmu {
> >> > > +        char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> >> > > +        struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> >> > > +        int num_group_events; /**< number of events in a group */
> >> > > +        TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >
> >The event_list is used in the slow path only, so it can remain a list -
> >i.e. no change requested here. :-)
> >
> >> > > +};
> >> > > +
> >> > > +/** Pointer to the PMU state container */
> >> > > +extern struct rte_pmu *rte_pmu;
> >> >
> >> > Again, why not just extern struct rte_pmu, instead of dynamic allocation?
> >> >
> >>
> >> No strong opinions here since this is a matter of personal preference.
> >> Can be removed in the next version.
> >
> >Yes, please.
> >
> >>
> >> > > +
> >> > > +/** Each architecture supporting PMU needs to provide its own version */
> >> > > +#ifndef rte_pmu_pmc_read
> >> > > +#define rte_pmu_pmc_read(index) ({ 0; })
> >> > > +#endif
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Read PMU counter.
> >> > > + *
> >> > > + * @param pc
> >> > > + *   Pointer to the mmapped user page.
> >> > > + * @return
> >> > > + *   Counter value read from hardware.
> >> > > + */
> >> > > +__rte_internal
> >> > > +static __rte_always_inline uint64_t
> >> > > +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> > > +{
> >> > > +        uint64_t offset, width, pmc = 0;
> >> > > +        uint32_t seq, index;
> >> > > +        int tries = 100;
> >> > > +
> >> > > +        for (;;) {
> >
> >As a matter of personal preference, I would write this loop differently:
> >
> >+        for (tries = 100; tries != 0; tries--) {
> >
> >> > > +                seq = pc->lock;
> >> > > +                rte_compiler_barrier();
> >> > > +                index = pc->index;
> >> > > +                offset = pc->offset;
> >> > > +                width = pc->pmc_width;
> >> > > +
> >> > > +                if (likely(pc->cap_user_rdpmc && index)) {
> >
> >Why "&& index"? The way I read [man perf_event_open], index 0 is
> >perfectly valid.
> >
>
> Valid indexes start at 1. 0 means that the hw counter is either stopped or
> not active. Maybe this is not immediately clear from the man page, but
> there is an example later on showing how to get the actual number.

OK. Thanks for the reference. Please add a comment about the special
meaning of index 0 in the code.

>
> >[man perf_event_open]: https://man7.org/linux/man-pages/man2/perf_event_open.2.html
> >
> >> > > +                        pmc = rte_pmu_pmc_read(index - 1);
> >> > > +                        pmc <<= 64 - width;
> >> > > +                        pmc >>= 64 - width;
> >> > > +                }
> >> > > +
> >> > > +                rte_compiler_barrier();
> >> > > +
> >> > > +                if (likely(pc->lock == seq))
> >> > > +                        return pmc + offset;
> >> > > +
> >> > > +                if (--tries == 0) {
> >> > > +                        RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> >> > > +                        break;
> >> > > +                }
> >
> >- Remove the 4 lines of code above, and move the debug log message to the
> >end of the function instead.
> >
> >> > > +        }
> >
> >+        RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");
> >
> >> > > +
> >> > > +        return 0;
> >> > > +}
> >> > > +
> >> > > +/**
> >> > > + * @internal
> >> > > + *
> >> > > + * Enable group of events for a given lcore.
> >> > > + *
> >> > > + * @param lcore_id
> >> > > + *   The identifier of the lcore.
> >> > > + * @return
> >> > > + *   0 in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_internal
> >> > > +int
> >> > > +rte_pmu_enable_group(int lcore_id);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Add event to the group of enabled events.
> >> > > + *
> >> > > + * @param name
> >> > > + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
> >> > > + * @return
> >> > > + *   Event index in case of success, negative value otherwise.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +int
> >> > > +rte_pmu_add_event(const char *name);
> >> > > +
> >> > > +/**
> >> > > + * @warning
> >> > > + * @b EXPERIMENTAL: this API may change without prior notice
> >> > > + *
> >> > > + * Read hardware counter configured to count occurrences of an event.
> >> > > + *
> >> > > + * @param index
> >> > > + *   Index of an event to be read.
> >> > > + * @return
> >> > > + *   Event value read from register. In case of errors or lack of support
> >> > > + *   0 is returned.
> >> > > + *   In other words, a stream of zeros in a trace file
> >> > > + *   indicates a problem with reading a particular PMU event register.
> >> > > + */
> >> > > +__rte_experimental
> >> > > +static __rte_always_inline uint64_t rte_pmu_read(int index)
> >
> >The index type can be changed from int to uint32_t. This also eliminates
> >the "(index < 0" part of the comparison further below in this function.
> >
>
> That's true.
>
> >> > > +{
> >> > > +        int lcore_id = rte_lcore_id();
> >> > > +        struct rte_pmu_event_group *group;
> >> > > +        int ret;
> >> > > +
> >> > > +        if (!rte_pmu)
> >> > > +                return 0;
> >> > > +
> >> > > +        group = &rte_pmu->group[lcore_id];
> >> > > +        if (!group->enabled) {
> >
> >Optimized: if (unlikely(!group->enabled)) {
> >
>
> The compiler will optimize the branch itself correctly. An extra hint is
> not required.

I haven't reviewed the output from this, so I'll take your word for it. I
suggested the unlikely() because I previously tested some very simple code,
and it optimized for taking the "if":

void testb(bool b)
{
        if (!b)
                exit(1);

        exit(99);
}

I guess I should experiment with more realistic code, and update my
optimization notes!

You could add the unlikely() for readability purposes. ;-)

>
> >> > > +                ret = rte_pmu_enable_group(lcore_id);
> >> > > +                if (ret)
> >> > > +                        return 0;
> >> > > +
> >> > > +                group->enabled = true;
> >> > > +        }
> >> >
> >> > Why is the group not enabled in the setup function,
> >> > rte_pmu_add_event(), instead of here, in the hot path?
> >> >
> >>
> >> When this is executed for the very first time, the cpu obviously has more
> >> work to do, but afterwards the setup path is not taken, hence far fewer
> >> cpu cycles are required.
> >>
> >> Setup is executed by the main lcore only, before the other lcores are
> >> launched, hence some info passed to the SYS_perf_event_open ioctl() is
> >> missing, the pid (via rte_gettid()) being an example here.
> >
> >OK. Thank you for the explanation. Since it is impossible at setup time,
> >it has to be done at runtime.
> >
> >@Mattias: Another good example of something that would belong in
> >per-thread constructors, as my suggested feature creep in [1].
> >
> >[1]: http://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35D87553@smartserver.smartshare.dk/
> >
> >>
> >> > > +
> >> > > +        if (index < 0 || index >= rte_pmu->num_group_events)
> >
> >Optimized: if (unlikely(index >= rte_pmu.num_group_events))
> >
> >> > > +                return 0;
> >> > > +
> >> > > +        return rte_pmu_read_userpage((struct perf_event_mmap_page *)group->mmap_pages[index]);
> >> >
> >> > Using fixed size arrays instead of multiple indirections via pointers
> >> > is faster. It could be:
> >> >
> >> > return rte_pmu_read_userpage((struct perf_event_mmap_page
> >> > *)rte_pmu.group[lcore_id].mmap_pages[index]);
> >> >
> >> > With or without the suggested performance improvements...
> >> >
> >> > Series-acked-by: Morten Brørup
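
PS: For reference, this is roughly how I picture rte_pmu_read_userpage()
with the retry counter folded into the for loop, the single debug log at
the end, and a comment about the special meaning of index 0. An untested
sketch of my suggestions above, not the actual patch:

static __rte_always_inline uint64_t
rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
{
        uint64_t offset, width, pmc = 0;
        uint32_t seq, index;
        int tries;

        for (tries = 100; tries != 0; tries--) {
                seq = pc->lock;
                rte_compiler_barrier();
                /* index 0 means the hw counter is stopped or not active;
                 * valid hardware counter indexes start at 1.
                 */
                index = pc->index;
                offset = pc->offset;
                width = pc->pmc_width;

                if (likely(pc->cap_user_rdpmc && index)) {
                        pmc = rte_pmu_pmc_read(index - 1);
                        pmc <<= 64 - width;
                        pmc >>= 64 - width;
                }

                rte_compiler_barrier();

                /* Retry if the seqlock changed while reading. */
                if (likely(pc->lock == seq))
                        return pmc + offset;
        }

        RTE_LOG(DEBUG, EAL, "failed to get perf_event_mmap_page lock\n");

        return 0;
}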