From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <3d3d9fcc-e199-aa12-57db-419ae11cbea7@yandex.ru>
Date: Sun, 5 Mar 2023 16:30:50 +0000
From: Konstantin Ananyev
Subject: Re: [EXT] Re: [PATCH v11 1/4] lib: add generic support for reading PMU events
To: Morten Brørup, Konstantin Ananyev, Tomasz Duszynski, dev@dpdk.org
Cc: bruce.richardson@intel.com
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87799@smartserver.smartshare.dk>
References: <20230213113156.385482-1-tduszynski@marvell.com>
 <20230216175502.3164820-1-tduszynski@marvell.com>
 <20230216175502.3164820-2-tduszynski@marvell.com>
 <5dfb1803-ae7f-5361-6453-70f767041928@yandex.ru>
 <2be9393cdbb94b9dad5bb7340ab9bca0@huawei.com>
 <98CBD80474FA8B44BF855DF32C47DC35D87790@smartserver.smartshare.dk>
 <6bf789b7ba4e4a8e847431a130372a4b@huawei.com>
 <98CBD80474FA8B44BF855DF32C47DC35D87799@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

>>>>>>>> Add support for programming PMU counters and reading their values
>>>>>>>> at runtime, bypassing the kernel completely.
>>>>>>>>
>>>>>>>> This is especially useful in cases where CPU cores are isolated,
>>>>>>>> i.e. run dedicated tasks. In such cases one cannot use the standard
>>>>>>>> perf utility without sacrificing latency and performance.
>>>>>>>>
>>>>>>>> Signed-off-by: Tomasz Duszynski
>>>>>>>> Acked-by: Morten Brørup
>>>>>>>
>>>>
>>>> [...]
>>>>
>>>>>>>> +int
>>>>>>>> +__rte_pmu_enable_group(void)
>>>>>>>> +{
>>>>>>>> +	struct rte_pmu_event_group *group = &RTE_PER_LCORE(_event_group);
>>>>>>>> +	int ret;
>>>>>>>> +
>>>>>>>> +	if (rte_pmu.num_group_events == 0)
>>>>>>>> +		return -ENODEV;
>>>>>>>> +
>>>>>>>> +	ret = open_events(group);
>>>>>>>> +	if (ret)
>>>>>>>> +		goto out;
>>>>>>>> +
>>>>>>>> +	ret = mmap_events(group);
>>>>>>>> +	if (ret)
>>>>>>>> +		goto out;
>>>>>>>> +
>>>>>>>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP) == -1) {
>>>>>>>> +		ret = -errno;
>>>>>>>> +		goto out;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	if (ioctl(group->fds[0], PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP) == -1) {
>>>>>>>> +		ret = -errno;
>>>>>>>> +		goto out;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	rte_spinlock_lock(&rte_pmu.lock);
>>>>>>>> +	TAILQ_INSERT_TAIL(&rte_pmu.event_group_list, group, next);
>>>>>>>
>>>>>>> Hmm.. so we insert a pointer to a TLS variable into the global list?
>>>>>>> Wonder what would happen if that thread gets terminated?
>>>>>>
>>>>>> Nothing special. Any pointers to that thread-local in that thread are
>>>>>> invalidated.
>>>>>>
>>>>>>> Can memory from its TLS block get re-used (by another thread or for
>>>>>>> other purposes)?
>>>>>>
>>>>>> Why would any other thread reuse that?
>>>>>> Eventually the main thread will need that data to do the cleanup.
>>>>>
>>>>> I understand that the main thread would need to access that data.
>>>>> I am not sure that it would be able to.
>>>>> Imagine a thread calls rte_pmu_read(...) and then terminates, while the
>>>>> program continues to run.
>>>>
>>>> Is the example you describe here (i.e. a thread terminating in the
>>>> middle of doing something) really a scenario DPDK is supposed to
>>>> support?
>>>
>>> I am not talking about some abnormal termination.
>>
>> Then I misunderstood your example; I thought you meant the thread was
>> terminated while inside the rte_pmu_read() function.
>>
>>> We do have the ability to spawn control threads, the user can spawn his
>>> own threads, and all these threads can have a limited lifetime.
>>> Not to mention rte_thread_register()/rte_thread_unregister().
>>
>> I agree that normal thread termination should be supported.
>>
>>>>> As I understand, the address of its RTE_PER_LCORE(_event_group) will
>>>>> still remain in rte_pmu.event_group_list,
>>>>> even if it is probably not valid any more.
>>>>
>>>> There should be a "destructor/done/finish" function available to remove
>>>> this from the list.
>>>>
>>>> [...]
>>>>
>>>>>>> Even if we'd decide to keep rte_pmu_read() as static inline (still
>>>>>>> not sure it is a good idea),
>>>>>>
>>>>>> We want to save as many cpu cycles as we possibly can, and inlining
>>>>>> does help in that matter.
>>>>>
>>>>> Ok, so asking the same question from a different thread: how many
>>>>> cycles will it save?
>>>>> What is the difference in terms of performance when you have this
>>>>> function inlined vs not inlined?
>>>>
>>>> We expect to use this in our in-house profiler library. For this
>>>> reason, I have a very strong preference for absolute maximum
>>>> performance.
>>>>
>>>> Reading PMU events is for performance profiling, so I expect other
>>>> potential users of the PMU library to share my opinion on this.
>>>
>>> Well, from my perspective 14 cycles are not that much...
>>
>> For reference, the i40e testpmd per-core performance report shows that it
>> uses 36 cycles per packet.
>>
>> This is a total of 1152 cycles per burst of 32 packets. 14 cycles of
>> overhead per burst / 1152 cycles per burst = 1.2 % overhead.
>>
>> But that is not all: if the application's pipeline has three stages,
>> where the PMU counters are read for each stage, the per-invocation
>> overhead of 14 cycles adds up, and the overhead per burst is now
>> 3 * 14 / 1152 = 3.6 %.
>
> I was too fast on the keyboard here...
> If the application does more work than testpmd, it certainly also uses
> more than 1152 cycles to do that work. So please ignore the 3.6 % as a
> wild exaggeration from an invalid example, and just stick with the 1.2 %
> overhead - which I still consider significant, and thus worth avoiding.

Wonder if we can do both - hide struct rte_pmu_event_group from the public
API and still have an inline function to read PMU stats?
If we can add a separate function that will allow the user to get a
struct perf_event_mmap_page * for a given event index (or event name),
then later the user can use
__rte_pmu_read_userpage(struct perf_event_mmap_page *pc) directly.

>
>>
>> Generalizing...
>>
>> In my example here, the same function with 14 wasted cycles is called
>> three times. It might as well be three individual libraries, each wasting
>> 14 cycles in its individual fast path processing function, due to a
>> similarly relaxed attitude towards wasting 14 cycles.
>>
>> My point is:
>>
>> Real applications do much more work than testpmd, so all this
>> "insignificant" extra overhead in the libraries adds up!
>>
>> Generally, I would like the DPDK Project to remain loyal to its original
>> philosophy, where performance is considered a Key Performance Indicator,
>> and overhead in the fast path is kept at an absolute minimum.
>>
>>> Though yes, it would be good to hear more opinions here.