From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 21 Feb 2023 00:48:00 +0000
Subject: Re: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
To: Tomasz Duszynski, Konstantin Ananyev, "dev@dpdk.org"
References: <20230213113156.385482-1-tduszynski@marvell.com>
 <20230216175502.3164820-1-tduszynski@marvell.com>
 <20230216175502.3164820-2-tduszynski@marvell.com>
 <4b055bbc-36e5-671c-8117-fb0ee7307d55@yandex.ru>
From: Konstantin Ananyev
List-Id: DPDK patches and discussions

>>>>>>>>> diff --git a/lib/pmu/rte_pmu.h b/lib/pmu/rte_pmu.h
>>>>>>>>> new file mode 100644
>>>>>>>>> index 0000000000..6b664c3336
>>>>>>>>> --- /dev/null
>>>>>>>>> +++ b/lib/pmu/rte_pmu.h
>>>>>>>>> @@ -0,0 +1,212 @@
>>>>>>>>> +/* SPDX-License-Identifier: BSD-3-Clause
>>>>>>>>> + * Copyright(c) 2023 Marvell */
>>>>>>>>> +
>>>>>>>>> +#ifndef _RTE_PMU_H_
>>>>>>>>> +#define _RTE_PMU_H_
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @file
>>>>>>>>> + *
>>>>>>>>> + * PMU event tracing operations
>>>>>>>>> + *
>>>>>>>>> + * This file defines generic API and types necessary to setup PMU and
>>>>>>>>> + * read selected counters in runtime.
>>>>>>>>> + */
>>>>>>>>> +
>>>>>>>>> +#ifdef __cplusplus
>>>>>>>>> +extern "C" {
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>> +#include
>>>>>>>>> +
>>>>>>>>> +#include
>>>>>>>>> +#include
>>>>>>>>> +#include
>>>>>>>>> +#include
>>>>>>>>> +#include
>>>>>>>>> +
>>>>>>>>> +/** Maximum number of events in a group */
>>>>>>>>> +#define MAX_NUM_GROUP_EVENTS 8
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * A structure describing a group of events.
>>>>>>>>> + */
>>>>>>>>> +struct rte_pmu_event_group {
>>>>>>>>> +        struct perf_event_mmap_page *mmap_pages[MAX_NUM_GROUP_EVENTS]; /**< array of user pages */
>>>>>>>>> +        int fds[MAX_NUM_GROUP_EVENTS]; /**< array of event descriptors */
>>>>>>>>> +        bool enabled; /**< true if group was enabled on particular lcore */
>>>>>>>>> +        TAILQ_ENTRY(rte_pmu_event_group) next; /**< list entry */
>>>>>>>>> +} __rte_cache_aligned;
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * A structure describing an event.
>>>>>>>>> + */
>>>>>>>>> +struct rte_pmu_event {
>>>>>>>>> +        char *name; /**< name of an event */
>>>>>>>>> +        unsigned int index; /**< event index into fds/mmap_pages */
>>>>>>>>> +        TAILQ_ENTRY(rte_pmu_event) next; /**< list entry */
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * A PMU state container.
>>>>>>>>> + */
>>>>>>>>> +struct rte_pmu {
>>>>>>>>> +        char *name; /**< name of core PMU listed under /sys/bus/event_source/devices */
>>>>>>>>> +        rte_spinlock_t lock; /**< serialize access to event group list */
>>>>>>>>> +        TAILQ_HEAD(, rte_pmu_event_group) event_group_list; /**< list of event groups */
>>>>>>>>> +        unsigned int num_group_events; /**< number of events in a group */
>>>>>>>>> +        TAILQ_HEAD(, rte_pmu_event) event_list; /**< list of matching events */
>>>>>>>>> +        unsigned int initialized; /**< initialization counter */
>>>>>>>>> +};
>>>>>>>>> +
>>>>>>>>> +/** lcore event group */
>>>>>>>>> +RTE_DECLARE_PER_LCORE(struct rte_pmu_event_group, _event_group);
>>>>>>>>> +
>>>>>>>>> +/** PMU state container */
>>>>>>>>> +extern struct rte_pmu rte_pmu;
>>>>>>>>> +
>>>>>>>>> +/** Each architecture supporting PMU needs to provide its own version */
>>>>>>>>> +#ifndef rte_pmu_pmc_read
>>>>>>>>> +#define rte_pmu_pmc_read(index) ({ 0; })
>>>>>>>>> +#endif
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Read PMU counter.
>>>>>>>>> + *
>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>> + *
>>>>>>>>> + * @param pc
>>>>>>>>> + *   Pointer to the mmapped user page.
>>>>>>>>> + * @return
>>>>>>>>> + *   Counter value read from hardware.
>>>>>>>>> + */
>>>>>>>>> +static __rte_always_inline uint64_t
>>>>>>>>> +__rte_pmu_read_userpage(struct perf_event_mmap_page *pc)
>>>>>>>>> +{
>>>>>>>>> +        uint64_t width, offset;
>>>>>>>>> +        uint32_t seq, index;
>>>>>>>>> +        int64_t pmc;
>>>>>>>>> +
>>>>>>>>> +        for (;;) {
>>>>>>>>> +                seq = pc->lock;
>>>>>>>>> +                rte_compiler_barrier();
>>>>>>>>
>>>>>>>> Are you sure that compiler_barrier() is enough here?
>>>>>>>> On some archs CPU itself has freedom to re-order reads.
>>>>>>>> Or I am missing something obvious here?
>>>>>>>>
>>>>>>>
>>>>>>> It's a matter of not keeping old stuff cached in registers and
>>>>>>> making sure that we have two reads of lock. CPU reordering won't
>>>>>>> do any harm here.
>>>>>>
>>>>>> Sorry, I didn't get you here:
>>>>>> Suppose CPU will re-order reads and will read lock *after* index or offset value.
>>>>>> Wouldn't it mean that in that case index and/or offset can contain old/invalid values?
>>>>>>
>>>>>
>>>>> This number is just an indicator whether kernel did change something or not.
>>>>
>>>> You are talking about pc->lock, right?
>>>> Yes, I do understand that it is sort of seqlock.
>>>> That's why I am puzzled why we do not care about possible cpu read-reordering.
>>>> Manual for perf_event_open() also has a code snippet with compiler barrier only...
>>>>
>>>>> If cpu reordering will come into play then this will not change anything from pov of this loop.
>>>>> All we want is fresh data when needed and no involvement of
>>>>> compiler when it comes to reordering code.
>>>>
>>>> Ok, can you probably explain to me why the following could not happen:
>>>> T0:
>>>>   pc->seqlock==0; pc->index==I1; pc->offset==O1;
>>>> T1:
>>>>   cpu #0 read pmu (due to cpu read reorder, we get index value before seqlock):
>>>>   index=pc->index; //index==I1;
>>>> T2:
>>>>   cpu #1 kernel event_update_userpage:
>>>>   pc->lock++; // pc->lock==1
>>>>   pc->index=I2;
>>>>   pc->offset=O2;
>>>>   ...
>>>>   pc->lock++; //pc->lock==2
>>>> T3:
>>>>   cpu #0 continue with read pmu:
>>>>   seq=pc->lock; //seq == 2
>>>>   offset=pc->offset; // offset == O2
>>>>   ....
>>>>   pmc = rte_pmu_pmc_read(index - 1); // Note that we read at I1, not I2
>>>>   offset += pmc; //offset == O2 + pmcread(I1-1);
>>>>   if (pc->lock == seq) // they are equal, return
>>>>      return offset;
>>>>
>>>> Or, it can happen, but by some reason we don't care much?
>>>>
>>>
>>> This code does self-monitoring and user page (whole group actually) is
>>> per thread running on current cpu. Hence I am not sure what you are trying to prove with that
>>> example.
>>
>> I am not trying to prove anything so far.
>> I am asking is such situation possible or not, and if not, why?
>> My current understanding (possibly wrong) is that after you mmaped these pages, kernel still can
>> asynchronously update them.
>> So, when reading the data from these pages you have to check 'lock' value before and after
>> accessing other data.
>> If so, why doesn't possible cpu read-reordering matter?
>>
>
> Look. I'll reiterate that.
>
> 1. That user page/group/PMU config is per process. Other processes do not access that.

Ok, that's clear.

> All this happens on the very same CPU where current thread is running.

Ok... but can't this page be updated by a kernel thread running simultaneously on a different CPU?

> 2. Suppose you've already read seq. Now for some reason kernel updates data in page seq was read from.
> 3. Kernel will enter critical section during update. seq changes along with other data without app knowing about it.
>    If you want nitty gritty details consult kernel sources.

Look, I don't have to beg you to answer these questions.
In fact, I expect the library author to document all such narrow things clearly, either in the PG or in
source code comments (ideally in both).
If not, then from my perspective the patch is not ready and shouldn't be accepted.
I don't know whether a compiler barrier is enough here or not, but I think it is definitely worth a clear
explanation in the docs (see the read-side sketch at the end of this mail for what I have in mind).
I suppose it wouldn't be only me who gets confused here.
So please take the effort and document clearly why you believe there is no race condition.

> 4. app resumes and has some stale data but *WILL* read new seq. Code loops again because values do not match.

If the kernel always executes the update for this page in the same thread context, then yes,
user code will always note the difference after resume.
But why can't it happen that your user thread reads this page on one CPU, while some kernel code on
another CPU updates it simultaneously?

> 5. Otherwise seq values match and data is valid.
>
>> Also there was another question below, which you probably missed, so I copied it here:
>> Another question - do we really need to have __rte_pmu_read_userpage() and rte_pmu_read() as
>> static inline functions in public header?
>> As I understand, because of that we also have to make 'struct rte_pmu_*'
>> definitions also public.
>>
>
> These functions need to be inlined otherwise performance takes a hit.

I understand that performance might be affected, but how big is the hit?
I expect the actual PMU read will not be free anyway, right?
If the diff is small, it might be worth going for such a change:
removing unneeded structures from public headers would help a lot in the future in terms of
ABI/API stability (see the usage sketch at the end of this mail for what the application-facing part
could look like).

>>>
>>>>>>>
>>>>>>>>> +                index = pc->index;
>>>>>>>>> +                offset = pc->offset;
>>>>>>>>> +                width = pc->pmc_width;
>>>>>>>>> +
>>>>>>>>> +                /* index set to 0 means that particular counter cannot be used */
>>>>>>>>> +                if (likely(pc->cap_user_rdpmc && index)) {
>>>>>>>>> +                        pmc = rte_pmu_pmc_read(index - 1);
>>>>>>>>> +                        pmc <<= 64 - width;
>>>>>>>>> +                        pmc >>= 64 - width;
>>>>>>>>> +                        offset += pmc;
>>>>>>>>> +                }
>>>>>>>>> +
>>>>>>>>> +                rte_compiler_barrier();
>>>>>>>>> +
>>>>>>>>> +                if (likely(pc->lock == seq))
>>>>>>>>> +                        return offset;
>>>>>>>>> +        }
>>>>>>>>> +
>>>>>>>>> +        return 0;
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Enable group of events on the calling lcore.
>>>>>>>>> + *
>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>> + *
>>>>>>>>> + * @return
>>>>>>>>> + *   0 in case of success, negative value otherwise.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +int
>>>>>>>>> +__rte_pmu_enable_group(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Initialize PMU library.
>>>>>>>>> + *
>>>>>>>>> + * @warning This should be not called directly.
>>>>>>>>> + *
>>>>>>>>> + * @return
>>>>>>>>> + *   0 in case of success, negative value otherwise.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +int
>>>>>>>>> +rte_pmu_init(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Finalize PMU library. This should be called after PMU counters are no longer being read.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +void
>>>>>>>>> +rte_pmu_fini(void);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Add event to the group of enabled events.
>>>>>>>>> + *
>>>>>>>>> + * @param name
>>>>>>>>> + *   Name of an event listed under /sys/bus/event_source/devices/pmu/events.
>>>>>>>>> + * @return
>>>>>>>>> + *   Event index in case of success, negative value otherwise.
>>>>>>>>> + */
>>>>>>>>> +__rte_experimental
>>>>>>>>> +int
>>>>>>>>> +rte_pmu_add_event(const char *name);
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * @warning
>>>>>>>>> + * @b EXPERIMENTAL: this API may change without prior notice
>>>>>>>>> + *
>>>>>>>>> + * Read hardware counter configured to count occurrences of an event.
>>>>>>>>> + *
>>>>>>>>> + * @param index
>>>>>>>>> + *   Index of an event to be read.
>>>>>>>>> + * @return
>>>>>>>>> + *   Event value read from register. In case of errors or lack of support
>>>>>>>>> + *   0 is returned. In other words, stream of zeros in a trace file
>>>>>>>>> + *   indicates problem with reading particular PMU event register.
>>>>>>>>> + */
>>>>
>>>> Another question - do we really need to have
>>>> __rte_pmu_read_userpage() and rte_pmu_read() as static inline functions in public header?
>>>> As I understand, because of that we also have to make 'struct rte_pmu_*'
>>>> definitions also public.
>>>>
>>>>>>>>> +__rte_experimental
>>>>>>>>> +static __rte_always_inline uint64_t
>>>>>>>>> +rte_pmu_read(unsigned int index)
>>>>>>>>> +{
>>>>>>>>> +        struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>>>>>>>>> +        int ret;
>>>>>>>>> +
>>>>>>>>> +        if (unlikely(!rte_pmu.initialized))
>>>>>>>>> +                return 0;
>>>>>>>>> +
>>>>>>>>> +        if (unlikely(!group->enabled)) {
>>>>>>>>> +                ret = __rte_pmu_enable_group();
>>>>>>>>> +                if (ret)
>>>>>>>>> +                        return 0;
>>>>>>>>> +        }
>>>>>>>>> +
>>>>>>>>> +        if (unlikely(index >= rte_pmu.num_group_events))
>>>>>>>>> +                return 0;
>>>>>>>>> +
>>>>>>>>> +        return __rte_pmu_read_userpage(group->mmap_pages[index]);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +#ifdef __cplusplus
>>>>>>>>> +}
>>>>>>>>> +#endif
>>>>>>>>> +
>
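To make the read-ordering concern above concrete, here is roughly what I have in mind for the
read side. It is only a sketch: the field accesses and names are taken from the patch (the _acq
suffix is mine), it assumes the same includes as rte_pmu.h, and whether rte_smp_rmb() (or a C11
acquire load of pc->lock) is really required on top of the compiler barrier is exactly what I'd
like to see explained in the docs or in a comment.

/* Sketch only: same structure as __rte_pmu_read_userpage() in the patch,
 * but with explicit read barriers around the seqcount-style check of pc->lock.
 */
static __rte_always_inline uint64_t
__rte_pmu_read_userpage_acq(struct perf_event_mmap_page *pc)
{
        uint64_t width, offset;
        uint32_t seq, index;
        int64_t pmc;

        for (;;) {
                seq = pc->lock;
                rte_smp_rmb(); /* read pc->lock before index/offset/width */

                index = pc->index;
                offset = pc->offset;
                width = pc->pmc_width;

                /* index set to 0 means that particular counter cannot be used */
                if (likely(pc->cap_user_rdpmc && index)) {
                        pmc = rte_pmu_pmc_read(index - 1);
                        pmc <<= 64 - width;
                        pmc >>= 64 - width;
                        offset += pmc;
                }

                rte_smp_rmb(); /* finish all data reads before re-checking pc->lock */

                if (likely(pc->lock == seq))
                        return offset;
        }
}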
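And to illustrate the public-header/ABI point: from the application side the interface boils down
to something like the sketch below, based only on the API declared in this patch ("l1d_cache" is
just a placeholder; real event names come from sysfs). Nothing here needs struct rte_pmu or
struct rte_pmu_event_group to be visible, which is why I'd like to know how small the penalty of
a non-inline rte_pmu_read() actually is before keeping all of that in the public header.

/* Minimal usage sketch based only on the API declared in this patch;
 * "l1d_cache" is a placeholder event name.
 */
#include <inttypes.h>
#include <stdio.h>

#include <rte_pmu.h>

static int event_idx;

int
pmu_setup(void)
{
        int ret;

        ret = rte_pmu_init();
        if (ret < 0)
                return ret;

        event_idx = rte_pmu_add_event("l1d_cache");
        if (event_idx < 0) {
                rte_pmu_fini();
                return event_idx;
        }

        return 0;
}

void
process_burst(void)
{
        uint64_t before, after;

        before = rte_pmu_read(event_idx);
        /* ... datapath work to be profiled ... */
        after = rte_pmu_read(event_idx);

        /* a constant stream of zeros would mean the event could not be read */
        printf("events in burst: %" PRIu64 "\n", after - before);
}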