Subject: RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
Date: Wed, 11 Jan 2023 17:54:14 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8765C@smartserver.smartshare.dk>
References: <20221213104350.3218167-1-tduszynski@marvell.com> <20230110234642.1188550-1-tduszynski@marvell.com> <20230110234642.1188550-2-tduszynski@marvell.com> <98CBD80474FA8B44BF855DF32C47DC35D8764D@smartserver.smartshare.dk>
From: Morten Brørup
To: "Tomasz Duszynski" <tduszynski@marvell.com>, <dev@dpdk.org>
Cc: "Jerin Jacob Kollanukkaran"

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 11 January 2023 17.21
>
> >From: Morten Brørup
> >Sent: Wednesday, January 11, 2023 10:06 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Wednesday, 11 January 2023 00.47
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >[...]
> >
> >> +static int
> >> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int group_fd)
> >> +{
> >> +	struct perf_event_attr attr = {
> >> +		.size = sizeof(struct perf_event_attr),
> >> +		.type = PERF_TYPE_RAW,
> >> +		.exclude_kernel = 1,
> >> +		.exclude_hv = 1,
> >> +		.disabled = 1,
> >> +	};
> >> +
> >> +	pmu_arch_fixup_config(config);
> >> +
> >> +	attr.config = config[0];
> >> +	attr.config1 = config[1];
> >> +	attr.config2 = config[2];
> >> +
> >> +	return syscall(SYS_perf_event_open, &attr, 0, rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
> >> +}
> >
> >If SYS_perf_event_open() must be called from the worker thread itself,
> >then lcore_id must not be passed as a parameter to do_perf_event_open().
> >Otherwise, I would expect to be able to call do_perf_event_open() from
> >the main thread and pass any lcore_id of a worker thread.
> >This comment applies to all functions that must be called from the
> >worker thread itself. It also applies to the functions that call such
> >functions.
> >
>
> Lcore_id is being passed around so that we don't need to call
> rte_lcore_id() each and every time.
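
For context on what that call costs: in DPDK, rte_lcore_id() boils down to a single read of a per-thread (TLS) variable. A simplified, self-contained sketch of the idea - the real definitions live in rte_lcore.h and rte_per_lcore.h, and the names below merely mirror them:

static __thread unsigned int per_lcore__lcore_id; /* set by the EAL when the lcore thread starts */

static inline unsigned int
rte_lcore_id(void)
{
	/* a single TLS load; no syscall, no table lookup */
	return per_lcore__lcore_id;
}
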
Please take a look at the rte_lcore_id() implementation. :-)

Regardless, my argument still stands: if a function cannot be called with the lcore_id parameter set to any valid lcore id, it should not be a parameter to the function.

>
> >[...]
> >
> >> +/**
> >> + * A structure describing a group of events.
> >> + */
> >> +struct rte_pmu_event_group {
> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> +};
> >> +
> >> +/**
> >> + * A structure describing an event.
> >> + */
> >> +struct rte_pmu_event {
> >> +	char *name; /** name of an event */
> >> +	unsigned int index; /** event index into fds/mmap_pages */
> >> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> >> +};
> >
> >Move the "enabled" field up, making it the first field in this
> >structure. This might reduce the number of instructions required to
> >check (!group->enabled) in rte_pmu_read().
> >
>
> This will be called once and no this will not produce more
> instructions. Why should it?

It seems I was not clearly describing my intention here. rte_pmu_read() is a hot function, where the comparison "if (!group->enabled)" itself will be executed many times.

> In both cases compiler will need to load data at some offset and archs
> do have instructions for that.

Yes, the instructions are: address = BASE + sizeof(struct rte_pmu_event_group) * lcore_id + offsetof(struct rte_pmu_event_group, enabled).

I meant you could avoid the extra instructions stemming from the addition: "+ offsetof()". But you are right... Both BASE and offsetof(struct rte_pmu_event_group, enabled) are known in advance, and can be merged at compile time to avoid the addition.

>
> >Also, each instance of the structure is used individually per lcore,
> >so the structure should be cache line aligned to avoid unnecessarily
> >crossing cache lines.
> >
> >I.e.:
> >
> >struct rte_pmu_event_group {
> >	bool enabled; /**< true if group was enabled on particular lcore */
> >	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >} __rte_cache_aligned;
>
> Yes, this can be aligned. While at it, I'd be more inclined to move
> mmap_pages up instead of enable.

Yes, moving up mmap_pages is better.

>
> >
> >> +
> >> +/**
> >> + * A PMU state container.
> >> + */
> >> +struct rte_pmu {
> >> +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> >> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> +};
> >> +
> >> +/** Pointer to the PMU state container */
> >> +extern struct rte_pmu rte_pmu;
> >
> >Just "The PMU state container". It is not a pointer anymore. :-)
> >
>
> Good catch.
>
> >[...]
> >
> >> +/**
> >> + * @internal
> >> + *
> >> + * Read PMU counter.
> >> + *
> >> + * @param pc
> >> + *   Pointer to the mmapped user page.
> >> + * @return
> >> + *   Counter value read from hardware.
> >> + */
> >> +__rte_internal
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> +{
> >> +	uint64_t width, offset;
> >> +	uint32_t seq, index;
> >> +	int64_t pmc;
> >> +
> >> +	for (;;) {
> >> +		seq = pc->lock;
> >> +		rte_compiler_barrier();
> >> +		index = pc->index;
> >> +		offset = pc->offset;
> >> +		width = pc->pmc_width;
> >> +
> >
> >Please add a comment here about the special meaning of index == 0.
>
> Okay.
>
> >
> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> +			pmc <<= 64 - width;
> >> +			pmc >>= 64 - width;
> >> +			offset += pmc;
> >> +		}
> >> +
> >> +		rte_compiler_barrier();
> >> +
> >> +		if (likely(pc->lock == seq))
> >> +			return offset;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >
> >[...]
> >
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read hardware counter configured to count occurrences of an event.
> >> + *
> >> + * @param index
> >> + *   Index of an event to be read.
> >> + * @return
> >> + *   Event value read from register. In case of errors or lack of support
> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> + *   indicates problem with reading particular PMU event register.
> >> + */
> >> +__rte_experimental
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read(unsigned int index)
> >> +{
> >> +	struct rte_pmu_event_group *group;
> >> +	int ret, lcore_id = rte_lcore_id();
> >> +
> >> +	group = &rte_pmu.group[lcore_id];
> >> +	if (unlikely(!group->enabled)) {
> >> +		ret = rte_pmu_enable_group(lcore_id);
> >> +		if (ret)
> >> +			return 0;
> >> +
> >> +		group->enabled = true;
> >
> >Group->enabled should be set inside rte_pmu_enable_group(), not here.
> >
>
> This is easier to follow imo and not against coding guidelines so I
> prefer to leave it as is.

OK. Moving it into rte_pmu_enable_group() would make the rte_pmu_read() source code slightly shorter, but probably has zero effect on the generated code. No strong preference - feel free to follow your personal preference on this.

>
> >> +	}
> >> +
> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> +		return 0;
> >> +
> >> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> >> +}
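
As a follow-up to the layout discussion above, a sketch of what the rearranged, cache-line aligned group could look like, with mmap_pages moved to the front as agreed. Field names follow the patch; MAX_NUM_GROUP_EVENTS is the patch's own constant and is defined here only as a placeholder so the snippet stands alone:

#include <stdbool.h>
#include <linux/perf_event.h>	/* struct perf_event_mmap_page */
#include <rte_common.h>		/* __rte_cache_aligned */

#define MAX_NUM_GROUP_EVENTS 8	/* placeholder; the real value comes from the patch */

/* Per-lcore event group, cache-line aligned so that neighbouring lcores
 * never share a cache line, with the hot-path mmap_pages array up front.
 */
struct rte_pmu_event_group {
	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
	bool enabled; /**< true if group was enabled on particular lcore */
} __rte_cache_aligned;

The alignment matters because rte_pmu.group[] is indexed by lcore_id; without it, the tail of one lcore's group and the head of the next could end up in the same cache line and cause false sharing.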