Subject: RE: [PATCH v5 1/4] eal: add generic support for reading PMU events
Date: Wed, 11 Jan 2023 17:54:14 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D8765C@smartserver.smartshare.dk>
References: <20221213104350.3218167-1-tduszynski@marvell.com> <20230110234642.1188550-1-tduszynski@marvell.com> <20230110234642.1188550-2-tduszynski@marvell.com> <98CBD80474FA8B44BF855DF32C47DC35D8764D@smartserver.smartshare.dk>
From: Morten Brørup
To: "Tomasz Duszynski" <tduszynski@marvell.com>, <dev@dpdk.org>
Cc: "Jerin Jacob Kollanukkaran"

> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> Sent: Wednesday, 11 January 2023 17.21
>
> >From: Morten Brørup
> >Sent: Wednesday, January 11, 2023 10:06 AM
> >
> >> From: Tomasz Duszynski [mailto:tduszynski@marvell.com]
> >> Sent: Wednesday, 11 January 2023 00.47
> >>
> >> Add support for programming PMU counters and reading their values in
> >> runtime bypassing kernel completely.
> >>
> >> This is especially useful in cases where CPU cores are isolated
> >> (nohz_full) i.e run dedicated tasks. In such cases one cannot use
> >> standard perf utility without sacrificing latency and performance.
> >>
> >> Signed-off-by: Tomasz Duszynski <tduszynski@marvell.com>
> >> ---
> >
> >[...]
> >
> >> +static int
> >> +do_perf_event_open(uint64_t config[3], unsigned int lcore_id, int group_fd)
> >> +{
> >> +	struct perf_event_attr attr = {
> >> +		.size = sizeof(struct perf_event_attr),
> >> +		.type = PERF_TYPE_RAW,
> >> +		.exclude_kernel = 1,
> >> +		.exclude_hv = 1,
> >> +		.disabled = 1,
> >> +	};
> >> +
> >> +	pmu_arch_fixup_config(config);
> >> +
> >> +	attr.config = config[0];
> >> +	attr.config1 = config[1];
> >> +	attr.config2 = config[2];
> >> +
> >> +	return syscall(SYS_perf_event_open, &attr, 0, rte_lcore_to_cpu_id(lcore_id), group_fd, 0);
> >> +}
> >
> >If SYS_perf_event_open() must be called from the worker thread itself,
> >then lcore_id must not be passed as a parameter to do_perf_event_open().
> >Otherwise, I would expect to be able to call do_perf_event_open() from
> >the main thread and pass any lcore_id of a worker thread.
> >This comment applies to all functions that must be called from the
> >worker thread itself. It also applies to the functions that call such
> >functions.
> >
>
> Lcore_id is being passed around so that we don't need to call
> rte_lcore_id() each and every time.
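
For context on what that call costs: in DPDK, rte_lcore_id() boils down to a single read of a per-thread (TLS) variable. A simplified, self-contained sketch of the idea - the real definitions live in rte_lcore.h and rte_per_lcore.h, and the names below merely mirror them:

static __thread unsigned int per_lcore__lcore_id; /* set by the EAL when the lcore thread starts */

static inline unsigned int
rte_lcore_id(void)
{
	/* a single TLS load; no syscall, no table lookup */
	return per_lcore__lcore_id;
}
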
Please take a look at the rte_lcore_id() implementation. :-)

Regardless, my argument still stands: if a function cannot be called with the lcore_id parameter set to any valid lcore id, it should not be a parameter to the function.

>
> >[...]
> >
> >> +/**
> >> + * A structure describing a group of events.
> >> + */
> >> +struct rte_pmu_event_group {
> >> +	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >> +	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >> +	bool enabled; /**< true if group was enabled on particular lcore */
> >> +};
> >> +
> >> +/**
> >> + * A structure describing an event.
> >> + */
> >> +struct rte_pmu_event {
> >> +	char *name; /** name of an event */
> >> +	unsigned int index; /** event index into fds/mmap_pages */
> >> +	TAILQ_ENTRY(rte_pmu_event) next; /** list entry */
> >> +};
> >
> >Move the "enabled" field up, making it the first field in this
> >structure. This might reduce the number of instructions required to
> >check (!group->enabled) in rte_pmu_read().
> >
>
> This will be called once and no this will not produce more
> instructions. Why should it?

It seems I was not clearly describing my intention here. rte_pmu_read() is a hot function, where the comparison "if (!group->enabled)" itself will be executed many times.

> In both cases compiler will need to load data at some offset and archs
> do have instructions for that.

Yes, the instructions are: address = BASE + sizeof(struct rte_pmu_event_group) * lcore_id + offsetof(struct rte_pmu_event_group, enabled).

I meant you could avoid the extra instructions stemming from the addition: "+ offsetof()". But you are right... Both BASE and offsetof(struct rte_pmu_event_group, enabled) are known in advance, and can be merged at compile time to avoid the addition.

>
> >Also, each instance of the structure is used individually per lcore,
> >so the structure should be cache line aligned to avoid unnecessarily
> >crossing cache lines.
> >
> >I.e.:
> >
> >struct rte_pmu_event_group {
> >	bool enabled; /**< true if group was enabled on particular lcore */
> >	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
> >	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
> >} __rte_cache_aligned;
>
> Yes, this can be aligned. While at it, I'd be more inclined to move
> mmap_pages up instead of enable.

Yes, moving up mmap_pages is better.

>
> >
> >> +
> >> +/**
> >> + * A PMU state container.
> >> + */
> >> +struct rte_pmu {
> >> +	char *name; /** name of core PMU listed under /sys/bus/event_source/devices */
> >> +	struct rte_pmu_event_group group[RTE_MAX_LCORE]; /**< per lcore event group data */
> >> +	unsigned int num_group_events; /**< number of events in a group */
> >> +	TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
> >> +};
> >> +
> >> +/** Pointer to the PMU state container */
> >> +extern struct rte_pmu rte_pmu;
> >
> >Just "The PMU state container". It is not a pointer anymore. :-)
> >
>
> Good catch.
>
> >[...]
> >
> >> +/**
> >> + * @internal
> >> + *
> >> + * Read PMU counter.
> >> + *
> >> + * @param pc
> >> + *   Pointer to the mmapped user page.
> >> + * @return
> >> + *   Counter value read from hardware.
> >> + */
> >> +__rte_internal
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
> >> +{
> >> +	uint64_t width, offset;
> >> +	uint32_t seq, index;
> >> +	int64_t pmc;
> >> +
> >> +	for (;;) {
> >> +		seq = pc->lock;
> >> +		rte_compiler_barrier();
> >> +		index = pc->index;
> >> +		offset = pc->offset;
> >> +		width = pc->pmc_width;
> >> +
> >
> >Please add a comment here about the special meaning of index == 0.
>
> Okay.
>
> >
> >> +		if (likely(pc->cap_user_rdpmc && index)) {
> >> +			pmc = rte_pmu_pmc_read(index - 1);
> >> +			pmc <<= 64 - width;
> >> +			pmc >>= 64 - width;
> >> +			offset += pmc;
> >> +		}
> >> +
> >> +		rte_compiler_barrier();
> >> +
> >> +		if (likely(pc->lock == seq))
> >> +			return offset;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >
> >[...]
> >
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice
> >> + *
> >> + * Read hardware counter configured to count occurrences of an event.
> >> + *
> >> + * @param index
> >> + *   Index of an event to be read.
> >> + * @return
> >> + *   Event value read from register. In case of errors or lack of support
> >> + *   0 is returned. In other words, stream of zeros in a trace file
> >> + *   indicates problem with reading particular PMU event register.
> >> + */
> >> +__rte_experimental
> >> +static __rte_always_inline uint64_t
> >> +rte_pmu_read(unsigned int index)
> >> +{
> >> +	struct rte_pmu_event_group *group;
> >> +	int ret, lcore_id = rte_lcore_id();
> >> +
> >> +	group = &rte_pmu.group[lcore_id];
> >> +	if (unlikely(!group->enabled)) {
> >> +		ret = rte_pmu_enable_group(lcore_id);
> >> +		if (ret)
> >> +			return 0;
> >> +
> >> +		group->enabled = true;
> >
> >Group->enabled should be set inside rte_pmu_enable_group(), not here.
> >
>
> This is easier to follow imo and not against coding guidelines so I
> prefer to leave it as is.

OK. Moving it into rte_pmu_enable_group() would make the rte_pmu_read() source code slightly shorter, but probably has zero effect on the generated code. No strong preference - feel free to follow your personal preference on this.

>
> >> +	}
> >> +
> >> +	if (unlikely(index >= rte_pmu.num_group_events))
> >> +		return 0;
> >> +
> >> +	return rte_pmu_read_userpage(group->mmap_pages[index]);
> >> +}
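
As a follow-up to the layout discussion above, a sketch of what the rearranged, cache-line aligned group could look like, with mmap_pages moved to the front as agreed. Field names follow the patch; MAX_NUM_GROUP_EVENTS is the patch's own constant and is defined here only as a placeholder so the snippet stands alone:

#include <stdbool.h>
#include <linux/perf_event.h>	/* struct perf_event_mmap_page */
#include <rte_common.h>		/* __rte_cache_aligned */

#define MAX_NUM_GROUP_EVENTS 8	/* placeholder; the real value comes from the patch */

/* Per-lcore event group, cache-line aligned so that neighbouring lcores
 * never share a cache line, with the hot-path mmap_pages array up front.
 */
struct rte_pmu_event_group {
	struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
	int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
	bool enabled; /**< true if group was enabled on particular lcore */
} __rte_cache_aligned;

The alignment matters because rte_pmu.group[] is indexed by lcore_id; without it, the tail of one lcore's group and the head of the next could end up in the same cache line and cause false sharing.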