Subject: Re: Service core statistics MT safety
Date: Wed, 29 Jun 2022 08:34:23 +0200
To: Honnappa Nagarahalli, Mattias Rönnblom, Morten Brørup, dev@dpdk.org
Cc: harry.van.haaren@intel.com, nd
From: Mattias Rönnblom

On 2022-06-28 21:15, Honnappa Nagarahalli wrote:
>>>>>>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>>>>>>>> Sent: Monday, 27 June 2022 13.06
>>>>>>>>>>>
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>> Is it safe to enable stats on MT safe services?
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
>>>>>>>>>>>
>>>>>>>>>>> It seems to me this would have to be an __atomic_add for this
>>>>>>>>>>> code to produce deterministic results.
>>>>>>>>>>
>>>>>>>>>> I agree. The same goes for the 'calls' field.
>>>>>>>>> The calling function does the locking.
>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
>>>>>>>>>
>>>>>>>>> For more information you can look at:
>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
>>>>>>>>>
>>>>>>>> What about the
>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
>>>>>>>> call (for MT safe services)?
>>>>>>>>
>>>>>>>> There's no lock held there.
>>>>>>> Good point.
>>>>>>> This is the case where the service running on the service cores is
>>>>>>> MT safe. However, the stats are incremented outside of the MT safety
>>>>>>> mechanism employed by the service. So, yes, this and the other
>>>>>>> updates in the function 'service_runner_do_callback' need to be done
>>>>>>> atomically.
>>>>>>
>>>>>> Maybe a better solution would be to move this to the core_state
>>>>>> struct (and eliminate the "calls" field, since the same information
>>>>>> is already in the core_state struct). The contention on these cache
>>>>>> lines will be pretty crazy for services with short run time (say a
>>>>>> thousand cycles or less per call), assuming they are mapped to many
>>>>>> cores.
>>>>> That's one option; the structures are internal as well.
>>>>> With this option, stats need to be aggregated, which will not give an
>>>>> accurate view. But that is the nature of statistics.
>>>>>
>>>> Per-core counters is a very common pattern. Used for Linux MIB
>>>> counters, for example. I'm not sure I think it's much less accurate.
>>>> I mean, you just load in quick succession what's globally visible for
>>>> the different per-lcore counters. If you do a relaxed store on an
>>>> ARM, how long does it take until it's seen by someone doing a relaxed
>>>> load on a different core? Roughly.
>>> I think my explanation of the problem is not clear.
>>>
>>> If a service is running on more than one core and the stats are per
>>> core, when we aggregate, the resulting statistics are not atomic. By
>>> making the stats per core, we will be taking out that feature which is
>>> present currently (even though it is implemented incorrectly). As we
>>> agree, the proposed change is a common pattern, and taking away the
>>> atomicity of the stats might not be a problem.
>>>
>> Isn't it just a push model versus a pull one? Both give just an
>> approximation, albeit a very good one, of how many cycles are spent
>> "now" for a particular service. Isn't time a local phenomenon in an SMP
>> system, with no global "now"? Maybe you can achieve such a thing with a
>> transaction or handshake of some sort, but I don't see how an
>> __atomic_add would be enough.
> If we consider a global timeline of events, using atomic operations will
> provide a single 'now' from the reader's perspective (of course, there
> might be writers waiting to update). Without the atomic operations, there
> will not be a single 'now' from the reader's perspective; there will be
> multiple read events on the timeline.

At the time of the read operation (in the global counter solution), there
may well be cycles consumed or calls having been made, but not yet posted.
The window between a call having been made and the global counter having
been incremented (and thus made globally visible) is small, but non-zero.
(The core-local counter solution also uses atomic operations, although not
__atomic_add, but a store for the producer and a load for the consumer.)

>> I was fortunate to get some data from a real-world application, and
>> enabling service core stats resulted in a 7% degradation of overall
>> system capacity. I'm guessing atomic instructions would not make things
>> better.
> Is the service running on multiple cores?

Yes. I think something like 6 cores were used in this case.

The effect will grow with core count, obviously. On a large system, I
don't think you will do much else but fight for this cache line.

If you want to post counter updates to some shared data structure, you
need to batch the updates to achieve reasonable efficiency. That will be,
obviously, at the cost of accuracy, since there will be a significant
delay between the local counter increment and the post in the global data
structure. The system will be much less able to answer how many cycles
have been consumed at a particular point in time.

For really large counter sets, the above batched-update approach may be
required. You simply can't afford the memory required to duplicate the
counter struct across all cores in the system. In my experience, this can
still be made to meet real-world counter accuracy requirements. (Accuracy
in the time dimension.)

>>>>> I am also wondering if these stats are of any use other than for
>>>>> debugging. Adding a capability to disable stats might help as well.
>>>>>
>>>> They could be used as a crude tool to determine service core
>>>> utilization. Comparing utilization between different services running
>>>> on the same core should be straightforward, but lcore utilization is
>>>> harder in absolute terms.
>>>> If you just look at "cycles", a completely idle core would look like
>>>> it's very busy (basically the rdtsc latency added for every loop). I
>>>> assume you'd have to do some heuristic based on both "calls" and
>>>> "cycles" to get an estimate.
>>>>
>>>> I think service core utilization would be very useful, although that
>>>> would require some changes in the service function signature, so the
>>>> service can report back whether it did some useful work for a
>>>> particular call.
>>>>
>>>> That would make for a DPDK 'top'. Just like 'top', it can't impose any
>>>> serious performance degradation when used, to be really useful, I
>>>> think.
>>>>
>>>> Sure, it should be possible to turn it on and off. I thought that was
>>>> the case already?
>>> Thanks, yes, this exists already. Though the 'loops' counter is outside
>>> the stats-enable check; it looks like it is considered an attribute for
>>> some reason.
>>>
>>>>>> Idle service cores will basically do nothing else than stall waiting
>>>>>> for these lines, I suspect, hampering the progress of the more busy
>>>>>> cores.
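To make the original problem concrete: the unprotected counter increment
discussed at the top of the thread, versus the atomic read-modify-write
that would make it deterministic, can be contrasted in a few lines of C.
The names below are illustrative stand-ins, not DPDK's actual fields:

```c
#include <stdint.h>

/* Hypothetical stand-in for a per-service stats counter; the real one
 * lives inside an internal struct in rte_service.c. */
static uint64_t calls;

/* Racy version, analogous to a plain "s->calls++" reached without a
 * lock: two lcores running an MT safe service can both load the same
 * old value, and one increment is then lost. */
static void
count_call_racy(void)
{
	calls++; /* compiles to load + add + store: not atomic */
}

/* Deterministic version: a relaxed atomic RMW makes the whole
 * increment indivisible, so no concurrent update is lost. */
static void
count_call_atomic(void)
{
	__atomic_fetch_add(&calls, 1, __ATOMIC_RELAXED);
}
```

Relaxed ordering is enough here, since only the counter value itself,
not any ordering against other memory accesses, matters.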
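The per-lcore counter scheme floated above (moving the stats into
per-core state, with a pull-style reader) might look roughly like this.
The struct layout, the 64-byte alignment, and MAX_LCORES are assumptions
for illustration, not DPDK definitions:

```c
#include <stdint.h>

#define MAX_LCORES 128

/* One counter line per lcore, cache-line aligned so writers never
 * share (and thus never fight over) a line. */
struct lcore_service_stats {
	uint64_t calls;
	uint64_t cycles;
} __attribute__((aligned(64)));

static struct lcore_service_stats stats[MAX_LCORES];

/* Producer: each lcore touches only its own line, so a relaxed atomic
 * store (no RMW, no locked bus cycle) is sufficient. */
static void
record_call(unsigned int lcore_id, uint64_t cycles)
{
	struct lcore_service_stats *s = &stats[lcore_id];

	__atomic_store_n(&s->calls, s->calls + 1, __ATOMIC_RELAXED);
	__atomic_store_n(&s->cycles, s->cycles + cycles, __ATOMIC_RELAXED);
}

/* Reader: pulls and aggregates. The sum is only an approximation of
 * "now", but each per-lcore load is itself a torn-free value. */
static uint64_t
total_calls(void)
{
	uint64_t sum = 0;
	unsigned int i;

	for (i = 0; i < MAX_LCORES; i++)
		sum += __atomic_load_n(&stats[i].calls, __ATOMIC_RELAXED);
	return sum;
}
```

This is the pull model referred to above: writers stay cheap, and the
reader accepts that the aggregate has no single global 'now'.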
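The batched-update approach suggested for really large counter sets
could be sketched as follows; BATCH_SIZE and all names are hypothetical:

```c
#include <stdint.h>

#define BATCH_SIZE 64

/* Shared counter: contended only once per batch, not once per call. */
static uint64_t global_calls;

/* Private accumulator, one per lcore. */
struct local_batch {
	uint64_t pending;
};

static void
record_call_batched(struct local_batch *b)
{
	b->pending++; /* cheap: private cache line, no atomics */
	if (b->pending == BATCH_SIZE) {
		/* Post the whole batch with a single atomic RMW. */
		__atomic_fetch_add(&global_calls, b->pending,
				   __ATOMIC_RELAXED);
		b->pending = 0;
	}
}
```

The trade-off is exactly the one described above: a read of
global_calls can lag reality by up to BATCH_SIZE - 1 calls per lcore,
so contention is bought down at the cost of accuracy in the time
dimension.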