DPDK patches and discussions
* Service core statistics MT safety
@ 2022-06-27 11:05 Mattias Rönnblom
  2022-06-27 12:31 ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2022-06-27 11:05 UTC (permalink / raw)
  To: dev; +Cc: harry.van.haaren

Hi.

Is it safe to enable stats on MT safe services?

https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366

It seems to me this would have to be an __atomic_add for this code to 
produce deterministic results.
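
To make this concrete, here is a minimal sketch (plain C, outside DPDK; the 'cycles_spent' variable and helper names are stand-ins for the field linked above, not the real code):

```c
#include <pthread.h>
#include <stdint.h>

/* Stand-in for the per-service stats field linked above. */
static uint64_t cycles_spent;

/* What the code does today: a plain read-modify-write. Two lcores
 * running an MT safe service can interleave the load and the store,
 * and one of the two updates is silently lost. */
static void stats_add_racy(uint64_t cycles)
{
	cycles_spent += cycles;
}

/* With the GCC built-in the read-modify-write is indivisible, so no
 * update can be lost, however many lcores call it concurrently. */
static void stats_add_atomic(uint64_t cycles)
{
	__atomic_fetch_add(&cycles_spent, cycles, __ATOMIC_RELAXED);
}

static void *updater(void *arg)
{
	(void)arg;
	for (int i = 0; i < 100000; i++)
		stats_add_atomic(1);
	return NULL;
}

/* Run 'n' (<= 8) concurrent updaters; the atomic add keeps the total exact. */
static uint64_t run_atomic_updaters(int n)
{
	pthread_t t[8];

	cycles_spent = 0;
	for (int i = 0; i < n; i++)
		pthread_create(&t[i], NULL, updater, NULL);
	for (int i = 0; i < n; i++)
		pthread_join(t[i], NULL);
	return cycles_spent;
}
```

With stats_add_racy() in the updater instead, the total typically comes out short on a multi-core machine.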

Best regards,
	Mattias

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-06-27 11:05 Service core statistics MT safety Mattias Rönnblom
@ 2022-06-27 12:31 ` Morten Brørup
  2022-06-27 17:39   ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2022-06-27 12:31 UTC (permalink / raw)
  To: Mattias Rönnblom, dev; +Cc: harry.van.haaren

> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Monday, 27 June 2022 13.06
> 
> Hi.
> 
> Is it safe to enable stats on MT safe services?
> 
> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
> 
> It seems to me this would have to be an __atomic_add for this code to
> produce deterministic results.

I agree. The same goes for the 'calls' field.



* RE: Service core statistics MT safety
  2022-06-27 12:31 ` Morten Brørup
@ 2022-06-27 17:39   ` Honnappa Nagarahalli
  2022-06-27 17:47     ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-27 17:39 UTC (permalink / raw)
  To: Morten Brørup, Mattias Rönnblom, dev
  Cc: harry.van.haaren, nd, Honnappa Nagarahalli, nd

<snip>
> 
> > From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> > Sent: Monday, 27 June 2022 13.06
> >
> > Hi.
> >
> > Is it safe to enable stats on MT safe services?
> >
> > https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
> >
> > It seems to me this would have to be an __atomic_add for this code to
> > produce deterministic results.
> 
> I agree. The same goes for the 'calls' field.
The calling function does the locking. https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398

For more information you can look at: https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120



* Re: Service core statistics MT safety
  2022-06-27 17:39   ` Honnappa Nagarahalli
@ 2022-06-27 17:47     ` Mattias Rönnblom
  2022-06-27 18:19       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2022-06-27 17:47 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Morten Brørup, Mattias Rönnblom, dev
  Cc: harry.van.haaren, nd

On 2022-06-27 19:39, Honnappa Nagarahalli wrote:
> <snip>
>>
>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>> Sent: Monday, 27 June 2022 13.06
>>>
>>> Hi.
>>>
>>> Is it safe to enable stats on MT safe services?
>>>
>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
>>>
>>> It seems to me this would have to be an __atomic_add for this code to
>>> produce deterministic results.
>>
>> I agree. The same goes for the 'calls' field.
> The calling function does the locking. https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
> 
> For more information you can look at: https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
> 

What about the
https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
call (for MT safe services)?

There's no lock held there.


* RE: Service core statistics MT safety
  2022-06-27 17:47     ` Mattias Rönnblom
@ 2022-06-27 18:19       ` Honnappa Nagarahalli
  2022-06-27 20:00         ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-27 18:19 UTC (permalink / raw)
  To: Mattias Rönnblom, Morten Brørup, Mattias Rönnblom, dev
  Cc: harry.van.haaren, nd, Honnappa Nagarahalli, nd

<snip>
> >>
> >>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>> Sent: Monday, 27 June 2022 13.06
> >>>
> >>> Hi.
> >>>
> >>> Is it safe to enable stats on MT safe services?
> >>>
> >>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
> >>>
> >>> It seems to me this would have to be an __atomic_add for this code
> >>> to produce deterministic results.
> >>
> >> I agree. The same goes for the 'calls' field.
> > The calling function does the locking.
> > https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
> >
> > For more information you can look at:
> > https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
> >
> 
> What about the
> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
> call (for MT safe services)?
> 
> There's no lock held there.
Good point.
This is the case where the service running on the service cores is MT safe. However, the stats are incremented outside of the MT safety mechanism employed by the service. So, yes, this and the other updates in the function 'service_runner_do_callback' need to be made atomic.


* Re: Service core statistics MT safety
  2022-06-27 18:19       ` Honnappa Nagarahalli
@ 2022-06-27 20:00         ` Mattias Rönnblom
  2022-06-28  0:14           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2022-06-27 20:00 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd

On 2022-06-27 20:19, Honnappa Nagarahalli wrote:
> <snip>
>>>>
>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>> Sent: Monday, 27 June 2022 13.06
>>>>>
>>>>> Hi.
>>>>>
>>>>> Is it safe to enable stats on MT safe services?
>>>>>
>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
>>>>>
>>>>> It seems to me this would have to be an __atomic_add for this code
>>>>> to produce deterministic results.
>>>>
>>>> I agree. The same goes for the 'calls' field.
>>> The calling function does the locking.
>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
>>>
>>> For more information you can look at:
>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
>>>
>>
>> What about the
>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
>> call (for MT safe services)?
>>
>> There's no lock held there.
> Good point.
> This is the case where the service running in service cores is MT safe. However, the stats are incremented outside of the MT Safety mechanism employed by the service. So, yes, this and other updates in the function 'service_runner_do_callback' need to be updated atomically.

Maybe a better solution would be to move this to the core_state struct 
(and eliminate the "calls" field since the same information is already 
in the core_state struct). The contention on these cache lines will be 
pretty crazy for services with short run time (say a thousand cycles or 
less per call), assuming they are mapped to many cores.

Idle service cores will basically do nothing else than stall waiting for 
these lines, I suspect, hampering the progress of more busy cores.
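
A sketch of what that per-core layout could look like (all names here are illustrative, not the actual DPDK structs; single writer per slot, relaxed atomics on both sides):

```c
#include <stdint.h>

#define MAX_LCORES 128
#define CACHE_LINE 64

/* One cache line per lcore, so a short-running service mapped to many
 * cores never bounces a stats line between them. */
struct lcore_service_stats {
	uint64_t calls;
	uint64_t cycles;
} __attribute__((aligned(CACHE_LINE)));

static struct lcore_service_stats stats[MAX_LCORES];

/* Writer side: only the owning lcore writes its slot, so a relaxed
 * atomic store (no read-modify-write) is enough. */
static void stats_update(unsigned int lcore_id, uint64_t cycles)
{
	__atomic_store_n(&stats[lcore_id].calls,
			 stats[lcore_id].calls + 1, __ATOMIC_RELAXED);
	__atomic_store_n(&stats[lcore_id].cycles,
			 stats[lcore_id].cycles + cycles, __ATOMIC_RELAXED);
}

/* Reader side: aggregate on demand; the total is an approximation,
 * since writers keep going while the individual slots are loaded. */
static uint64_t stats_total_cycles(void)
{
	uint64_t total = 0;

	for (unsigned int i = 0; i < MAX_LCORES; i++)
		total += __atomic_load_n(&stats[i].cycles, __ATOMIC_RELAXED);
	return total;
}
```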


* RE: Service core statistics MT safety
  2022-06-27 20:00         ` Mattias Rönnblom
@ 2022-06-28  0:14           ` Honnappa Nagarahalli
  2022-06-28  6:52             ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-28  0:14 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd, nd

<snip>
> >>>>
> >>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>>> Sent: Monday, 27 June 2022 13.06
> >>>>>
> >>>>> Hi.
> >>>>>
> >>>>> Is it safe to enable stats on MT safe services?
> >>>>>
> >>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
> >>>>>
> >>>>> It seems to me this would have to be an __atomic_add for this code
> >>>>> to produce deterministic results.
> >>>>
> >>>> I agree. The same goes for the 'calls' field.
> >>> The calling function does the locking.
> >>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
> >>>
> >>> For more information you can look at:
> >>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
> >>>
> >>
> >> What about the
> >> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
> >> call (for MT safe services)?
> >>
> >> There's no lock held there.
> > Good point.
> > This is the case where the service running in service cores is MT safe. However,
> the stats are incremented outside of the MT Safety mechanism employed by the
> service. So, yes, this and other updates in the function
> 'service_runner_do_callback' need to be updated atomically.
> 
> Maybe a better solution would be to move this to the core_state struct (and
> eliminate the "calls" field since the same information is already in the core_state
> struct). The contention on these cache lines will be pretty crazy for services with
> short run time (say a thousand cycles or less per call), assuming they are
> mapped to many cores.
That's one option; the structures are internal as well. With this option the stats need to be aggregated, which will not give an accurate view. But that is the nature of statistics.

I am also wondering if these stats are of any use other than for debugging. Adding a capability to disable stats might help as well.

> 
> Idle service cores will basically do nothing else than stall waiting for these lines, I
> suspect, hampering the progress of more busy cores.


* Re: Service core statistics MT safety
  2022-06-28  0:14           ` Honnappa Nagarahalli
@ 2022-06-28  6:52             ` Mattias Rönnblom
  2022-06-28 15:24               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2022-06-28  6:52 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd

On 2022-06-28 02:14, Honnappa Nagarahalli wrote:
> <snip>
>>>>>>
>>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>>>> Sent: Monday, 27 June 2022 13.06
>>>>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> Is it safe to enable stats on MT safe services?
>>>>>>>
>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
>>>>>>>
>>>>>>> It seems to me this would have to be an __atomic_add for this code
>>>>>>> to produce deterministic results.
>>>>>>
>>>>>> I agree. The same goes for the 'calls' field.
>>>>> The calling function does the locking.
>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
>>>>>
>>>>> For more information you can look at:
>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
>>>>>
>>>>
>>>> What about the
>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
>>>> call (for MT safe services)?
>>>>
>>>> There's no lock held there.
>>> Good point.
>>> This is the case where the service running in service cores is MT safe. However,
>> the stats are incremented outside of the MT Safety mechanism employed by the
>> service. So, yes, this and other updates in the function
>> 'service_runner_do_callback' need to be updated atomically.
>>
>> Maybe a better solution would be to move this to the core_state struct (and
>> eliminate the "calls" field since the same information is already in the core_state
>> struct). The contention on these cache lines will be pretty crazy for services with
>> short run time (say a thousand cycles or less per call), assuming they are
>> mapped to many cores.
> That's one option, the structures are internal as well. With this option stats need to be aggregated which will not give an accurate view. But, that is the nature of the statistics.
> 

Per-core counters are a very common pattern, used for the Linux MIB
counters, for example. I'm not sure it's much less accurate. I mean, you
just load in quick succession what's globally visible for the different
per-lcore counters. If you do a relaxed store on an Arm core, how long
does it take until it's seen by someone doing a relaxed load on a
different core? Roughly.

> I am also wondering if these stats are of any use other than for debugging. Adding a capability to disable stats might help as well.
> 

They could be used as a crude tool to determine service core 
utilization. Comparing utilization between different services running on 
the same core should be straightforward, but lcore utilization is 
harder in absolute terms. If you just look at "cycles", a completely 
idle core would look like it's very busy (basically rdtsc latency added 
for every loop). I assume you'd have to do some heuristic based on both 
"calls" and "cycles" to get an estimate.

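One possible shape for such a heuristic, as a sketch (the EMPTY_POLL_CYCLES figure and all names here are made-up assumptions that would need per-platform calibration, not anything in DPDK):

```c
#include <stdint.h>

/* Assumed cost of one empty poll loop: rdtsc plus call overhead.
 * A made-up figure; it would have to be calibrated per platform. */
#define EMPTY_POLL_CYCLES 80

/* Estimate lcore busy-ness in percent over a wall-clock window of
 * 'wall_cycles' TSC cycles: subtract the presumed per-call overhead
 * from the measured cycles, so an idle, fast-spinning core reads ~0%. */
static unsigned int
estimate_utilization(uint64_t calls, uint64_t cycles, uint64_t wall_cycles)
{
	uint64_t overhead = calls * EMPTY_POLL_CYCLES;
	uint64_t busy = cycles > overhead ? cycles - overhead : 0;

	if (wall_cycles == 0)
		return 0;
	if (busy > wall_cycles)
		busy = wall_cycles;
	return (unsigned int)((busy * 100) / wall_cycles);
}
```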
I think service core utilization would be very useful, although that 
would require some changes in the service function signature, so the 
service can report back if it did some useful work for a particular call.

That would make for a DPDK 'top'. Just like 'top', to be really useful 
it can't impose any serious performance degradation when in use, I think.

Sure, it should be possible to turn it on and off. I thought that was 
the case already?

>>
>> Idle service cores will basically do nothing else than stall waiting for these lines, I
>> suspect, hampering the progress of more busy cores.



* RE: Service core statistics MT safety
  2022-06-28  6:52             ` Mattias Rönnblom
@ 2022-06-28 15:24               ` Honnappa Nagarahalli
  2022-06-28 18:30                 ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-28 15:24 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd, nd

<snip>

> >>>>>>
> >>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>>>>> Sent: Monday, 27 June 2022 13.06
> >>>>>>>
> >>>>>>> Hi.
> >>>>>>>
> >>>>>>> Is it safe to enable stats on MT safe services?
> >>>>>>>
> >>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
> >>>>>>>
> >>>>>>> It seems to me this would have to be an __atomic_add for this
> >>>>>>> code to produce deterministic results.
> >>>>>>
> >>>>>> I agree. The same goes for the 'calls' field.
> >>>>> The calling function does the locking.
> >>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
> >>>>>
> >>>>> For more information you can look at:
> >>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
> >>>>>
> >>>>
> >>>> What about the
> >>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
> >>>> call (for MT safe services)?
> >>>>
> >>>> There's no lock held there.
> >>> Good point.
> >>> This is the case where the service running in service cores is MT
> >>> safe. However,
> >> the stats are incremented outside of the MT Safety mechanism employed
> >> by the service. So, yes, this and other updates in the function
> >> 'service_runner_do_callback' need to be updated atomically.
> >>
> >> Maybe a better solution would be to move this to the core_state
> >> struct (and eliminate the "calls" field since the same information is
> >> already in the core_state struct). The contention on these cache
> >> lines will be pretty crazy for services with short run time (say a
> >> thousand cycles or less per call), assuming they are mapped to many cores.
> > That's one option, the structures are internal as well. With this option stats
> need to be aggregated which will not give an accurate view. But, that is the
> nature of the statistics.
> >
> 
> Per-core counters is a very common pattern. Used for Linux MIB counters, for
> example. I'm not sure I think it's much less accurate. I mean, you just load in
> quick succession what's globally visible for the different per-lcore counters. If
> you do a relaxed store on an ARM, how long time does it take until it's seen by
> someone doing a relaxed load on a different core? Roughly.
I think my explanation of the problem is not clear.

If a service is running on more than one core and the stats are kept per core, the aggregated statistics are not atomic. By making the stats per core, we would be taking away a feature that is present currently (even though it is implemented incorrectly). As we agree, the proposed change is a common pattern, and taking away the atomicity of the stats might not be a problem.

> 
> > I am also wondering if these stats are of any use other than for debugging.
> Adding a capability to disable stats might help as well.
> >
> 
> They could be used as a crude tool to determine service core utilization.
> Comparing utilization between different services running on the same core
> should be straight-forward, but lcore utilization is harder in absolute terms. If
> you just look at "cycles", a completely idle core would look like it's very busy
> (basically rdtsc latency added for every loop). I assume you'd have to do some
> heuristic based on both "calls" and "cycles" to get an estimate.
> 
> I think service core utilization would be very useful, although that would require
> some changes in the service function signature, so the service can report back if
> it did some useful work for a particular call.
> 
> That would make for a DPDK 'top'. Just like 'top', it can't impose any serious
> performance degradation when used, to be really useful, I think.
> 
> Sure, it should be possible to turn it on and off. I thought that was the case
> already?
Thanks, yes, this exists already. Though the 'loops' counter is outside the stats enable check; it looks like it is considered an attribute for some reason.

> 
> >>
> >> Idle service cores will basically do nothing else than stall waiting
> >> for these lines, I suspect, hampering the progress of more busy cores.



* Re: Service core statistics MT safety
  2022-06-28 15:24               ` Honnappa Nagarahalli
@ 2022-06-28 18:30                 ` Mattias Rönnblom
  2022-06-28 19:15                   ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2022-06-28 18:30 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd

On 2022-06-28 17:24, Honnappa Nagarahalli wrote:
> <snip>
> 
>>>>>>>>
>>>>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>>>>>> Sent: Monday, 27 June 2022 13.06
>>>>>>>>>
>>>>>>>>> Hi.
>>>>>>>>>
>>>>>>>>> Is it safe to enable stats on MT safe services?
>>>>>>>>>
>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
>>>>>>>>>
>>>>>>>>> It seems to me this would have to be an __atomic_add for this
>>>>>>>>> code to produce deterministic results.
>>>>>>>>
>>>>>>>> I agree. The same goes for the 'calls' field.
>>>>>>> The calling function does the locking.
>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
>>>>>>>
>>>>>>> For more information you can look at:
>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
>>>>>>>
>>>>>>
>>>>>> What about the
>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
>>>>>> call (for MT safe services)?
>>>>>>
>>>>>> There's no lock held there.
>>>>> Good point.
>>>>> This is the case where the service running in service cores is MT
>>>>> safe. However,
>>>> the stats are incremented outside of the MT Safety mechanism employed
>>>> by the service. So, yes, this and other updates in the function
>>>> 'service_runner_do_callback' need to be updated atomically.
>>>>
>>>> Maybe a better solution would be to move this to the core_state
>>>> struct (and eliminate the "calls" field since the same information is
>>>> already in the core_state struct). The contention on these cache
>>>> lines will be pretty crazy for services with short run time (say a
>>>> thousand cycles or less per call), assuming they are mapped to many cores.
>>> That's one option, the structures are internal as well. With this option stats
>> need to be aggregated which will not give an accurate view. But, that is the
>> nature of the statistics.
>>>
>>
>> Per-core counters is a very common pattern. Used for Linux MIB counters, for
>> example. I'm not sure I think it's much less accurate. I mean, you just load in
>> quick succession what's globally visible for the different per-lcore counters. If
>> you do a relaxed store on an ARM, how long time does it take until it's seen by
>> someone doing a relaxed load on a different core? Roughly.
> I think my explanation of the problem is not clear.
> 
> If a service is running on more than one core and the stats are per core, when we aggregate, the resulting statistics is not atomic. By making the stats per core, we will be taking out that feature which is present currently (even though it is implemented incorrectly). As we agree, the proposed change is a common pattern and taking away the atomicity of the stats might not be a problem.
> 

Isn't it just a push model versus a pull one? Both give just an
approximation, albeit a very good one, of how many cycles are spent
"now" for a particular service. Isn't time a local phenomenon in an SMP
system, with no global "now"? Maybe you could achieve one with a
transaction or handshake of some sort, but I don't see how an
__atomic_add alone would be enough.

I was fortunate to get some data from a real-world application, and 
enabling service core stats resulted in a 7% degradation of overall 
system capacity. I'm guessing atomic instructions would not make things 
better.

>>
>>> I am also wondering if these stats are of any use other than for debugging.
>> Adding a capability to disable stats might help as well.
>>>
>>
>> They could be used as a crude tool to determine service core utilization.
>> Comparing utilization between different services running on the same core
>> should be straight-forward, but lcore utilization is harder in absolute terms. If
>> you just look at "cycles", a completely idle core would look like it's very busy
>> (basically rdtsc latency added for every loop). I assume you'd have to do some
>> heuristic based on both "calls" and "cycles" to get an estimate.
>>
>> I think service core utilization would be very useful, although that would require
>> some changes in the service function signature, so the service can report back if
>> it did some useful work for a particular call.
>>
>> That would make for a DPDK 'top'. Just like 'top', it can't impose any serious
>> performance degradation when used, to be really useful, I think.
>>
>> Sure, it should be possible to turn it on and off. I thought that was the case
>> already?
> Thanks, yes, this exists already. Though the 'loops' counter is out of the stats enable check, looks like it is considered as an attribute for some reason.
> 
>>
>>>>
>>>> Idle service cores will basically do nothing else than stall waiting
>>>> for these lines, I suspect, hampering the progress of more busy cores.
> 



* RE: Service core statistics MT safety
  2022-06-28 18:30                 ` Mattias Rönnblom
@ 2022-06-28 19:15                   ` Honnappa Nagarahalli
  2022-06-29  6:34                     ` Mattias Rönnblom
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-28 19:15 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd, nd

<snip>
> >>>>>>>>
> >>>>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>>>>>>> Sent: Monday, 27 June 2022 13.06
> >>>>>>>>>
> >>>>>>>>> Hi.
> >>>>>>>>>
> >>>>>>>>> Is it safe to enable stats on MT safe services?
> >>>>>>>>>
> >>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
> >>>>>>>>>
> >>>>>>>>> It seems to me this would have to be an __atomic_add for this
> >>>>>>>>> code to produce deterministic results.
> >>>>>>>>
> >>>>>>>> I agree. The same goes for the 'calls' field.
> >>>>>>> The calling function does the locking.
> >>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
> >>>>>>>
> >>>>>>> For more information you can look at:
> >>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
> >>>>>>>
> >>>>>>
> >>>>>> What about the
> >>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
> >>>>>> call (for MT safe services)?
> >>>>>>
> >>>>>> There's no lock held there.
> >>>>> Good point.
> >>>>> This is the case where the service running in service cores is MT
> >>>>> safe. However,
> >>>> the stats are incremented outside of the MT Safety mechanism
> >>>> employed by the service. So, yes, this and other updates in the
> >>>> function 'service_runner_do_callback' need to be updated atomically.
> >>>>
> >>>> Maybe a better solution would be to move this to the core_state
> >>>> struct (and eliminate the "calls" field since the same information
> >>>> is already in the core_state struct). The contention on these cache
> >>>> lines will be pretty crazy for services with short run time (say a
> >>>> thousand cycles or less per call), assuming they are mapped to many
> cores.
> >>> That's one option, the structures are internal as well. With this
> >>> option stats
> >> need to be aggregated which will not give an accurate view. But, that
> >> is the nature of the statistics.
> >>>
> >>
> >> Per-core counters is a very common pattern. Used for Linux MIB
> >> counters, for example. I'm not sure I think it's much less accurate.
> >> I mean, you just load in quick succession what's globally visible for
> >> the different per-lcore counters. If you do a relaxed store on an
> >> ARM, how long time does it take until it's seen by someone doing a relaxed
> load on a different core? Roughly.
> > I think my explanation of the problem is not clear.
> >
> > If a service is running on more than one core and the stats are per core, when
> we aggregate, the resulting statistics is not atomic. By making the stats per core,
> we will be taking out that feature which is present currently (even though it is
> implemented incorrectly). As we agree, the proposed change is a common
> pattern and taking away the atomicity of the stats might not be a problem.
> >
> 
> Isn't it just a push model, versus a pull one? Both give just an approximation,
> albeit a very good one, of how many cycles are spent "now" for a particular
> service. Isn't time a local phenomena in a SMP system, and there is no global
> "now"? Maybe you can achieve such with a transaction or handshake of some
> sort, but I don't see how the an __atomic_add would be enough.
If we consider a global timeline of events, using atomic operations will provide a single 'now' from the reader's perspective (of course, there might be writers waiting to update). Without atomic operations, there will not be a single 'now' from the reader's perspective; there will be multiple read events on the timeline.

> 
> I was fortunate to get some data from a real-world application, and enabling
> service core stats resulted in a 7% degradation of overall system capacity. I'm
> guessing atomic instructions would not make things better.
Is the service running on multiple cores?

> 
> >>
> >>> I am also wondering if these stats are of any use other than for debugging.
> >> Adding a capability to disable stats might help as well.
> >>>
> >>
> >> They could be used as a crude tool to determine service core utilization.
> >> Comparing utilization between different services running on the same core
> >> should be straight-forward, but lcore utilization is harder in absolute terms. If
> >> you just look at "cycles", a completely idle core would look like it's very busy
> >> (basically rdtsc latency added for every loop). I assume you'd have to do
> some
> >> heuristic based on both "calls" and "cycles" to get an estimate.
> >>
> >> I think service core utilization would be very useful, although that would
> require
> >> some changes in the service function signature, so the service can report
> back if
> >> it did some useful work for a particular call.
> >>
> >> That would make for a DPDK 'top'. Just like 'top', it can't impose any serious
> >> performance degradation when used, to be really useful, I think.
> >>
> >> Sure, it should be possible to turn it on and off. I thought that was the case
> >> already?
> > Thanks, yes, this exists already. Though the 'loops' counter is out of the stats
> enable check, looks like it is considered as an attribute for some reason.
> >
> >>
> >>>>
> >>>> Idle service cores will basically do nothing else than stall waiting
> >>>> for these lines, I suspect, hampering the progress of more busy cores.
> >


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Service core statistics MT safety
  2022-06-28 19:15                   ` Honnappa Nagarahalli
@ 2022-06-29  6:34                     ` Mattias Rönnblom
  2022-06-29 20:07                       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Mattias Rönnblom @ 2022-06-29  6:34 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd

On 2022-06-28 21:15, Honnappa Nagarahalli wrote:
> <snip>
>>>>>>>>>>
>>>>>>>>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>>>>>>>>> Sent: Monday, 27 June 2022 13.06
>>>>>>>>>>>
>>>>>>>>>>> Hi.
>>>>>>>>>>>
>>>>>>>>>>> Is it safe to enable stats on MT safe services?
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
>>>>>>>>>>>
>>>>>>>>>>> It seems to me this would have to be an __atomic_add for this
>>>>>>>>>>> code to produce deterministic results.
>>>>>>>>>>
>>>>>>>>>> I agree. The same goes for the 'calls' field.
>>>>>>>>> The calling function does the locking.
>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
>>>>>>>>>
>>>>>>>>> For more information you can look at:
>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
>>>>>>>>>
>>>>>>>>
>>>>>>>> What about the
>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
>>>>>>>> call (for MT safe services)?
>>>>>>>>
>>>>>>>> There's no lock held there.
>>>>>>> Good point.
>>>>>>> This is the case where the service running in service cores is MT
>>>>>>> safe. However,
>>>>>> the stats are incremented outside of the MT Safety mechanism
>>>>>> employed by the service. So, yes, this and other updates in the
>>>>>> function 'service_runner_do_callback' need to be updated atomically.
>>>>>>
>>>>>> Maybe a better solution would be to move this to the core_state
>>>>>> struct (and eliminate the "calls" field since the same information
>>>>>> is already in the core_state struct). The contention on these cache
>>>>>> lines will be pretty crazy for services with short run time (say a
>>>>>> thousand cycles or less per call), assuming they are mapped to many
>> cores.
>>>>> That's one option, the structures are internal as well. With this
>>>>> option stats
>>>> need to be aggregated which will not give an accurate view. But, that
>>>> is the nature of the statistics.
>>>>>
>>>>
>>>> Per-core counters is a very common pattern. Used for Linux MIB
>>>> counters, for example. I'm not sure I think it's much less accurate.
>>>> I mean, you just load in quick succession what's globally visible for
>>>> the different per-lcore counters. If you do a relaxed store on an
>>>> ARM, how long time does it take until it's seen by someone doing a relaxed
>> load on a different core? Roughly.
>>> I think my explanation of the problem is not clear.
>>>
>>> If a service is running on more than one core and the stats are per core, when
>> we aggregate, the resulting statistics is not atomic. By making the stats per core,
>> we will be taking out that feature which is present currently (even though it is
>> implemented incorrectly). As we agree, the proposed change is a common
>> pattern and taking away the atomicity of the stats might not be a problem.
>>>
>>
>> Isn't it just a push model, versus a pull one? Both give just an approximation,
>> albeit a very good one, of how many cycles are spent "now" for a particular
>> service. Isn't time a local phenomena in a SMP system, and there is no global
>> "now"? Maybe you can achieve such with a transaction or handshake of some
>> sort, but I don't see how the an __atomic_add would be enough.
> If we consider a global time line of events, using atomic operation will provide a single 'now' from the reader's perspective (of course there might be writers waiting to update). Without the atomic operations, there will not be a single 'now' from reader's perspective, there will be multiple read events on the timeline.
> 

At the time of the read operation (in the global-counter solution), 
cycles may well have been consumed, or calls made, but not yet posted. 
The window between a call having been made and the global counter 
having been incremented (and thus made globally visible) is small, but 
non-zero.

(The core-local counter solution also uses atomic operations, although 
not __atomic_add: an atomic store for the producer and an atomic load for the consumer.)

>>
>> I was fortunate to get some data from a real-world application, and enabling
>> service core stats resulted in a 7% degradation of overall system capacity. I'm
>> guessing atomic instructions would not make things better.
> Is the service running on multiple cores?
> 

Yes. I think something like 6 cores were used in this case. The effect 
will grow with core count, obviously. On a large system, I don't think 
you will do much else but fight for this cache line.

If you want to post counter updates to some shared data structure, you need 
to batch the updates to achieve reasonable efficiency. That will be, 
obviously, at the cost of accuracy, since there will be a significant 
delay between local-counter-increment, and 
post-in-global-data-structure. The system will be much less able to 
answer how many cycles have been consumed at a particular point in time.

For really large counter sets, the above batched-update approach may be 
required. You simply can't afford the memory required to duplicate the 
counter struct across all cores in the system. In my experience, this 
can still be made to meet real-world counter accuracy requirements. 
(Accuracy in the time dimension.)

>>
>>>>
>>>>> I am also wondering if these stats are of any use other than for debugging.
>>>> Adding a capability to disable stats might help as well.
>>>>>
>>>>
>>>> They could be used as a crude tool to determine service core utilization.
>>>> Comparing utilization between different services running on the same core
>>>> should be straight-forward, but lcore utilization is harder in absolute terms. If
>>>> you just look at "cycles", a completely idle core would look like it's very busy
>>>> (basically rdtsc latency added for every loop). I assume you'd have to do
>> some
>>>> heuristic based on both "calls" and "cycles" to get an estimate.
>>>>
>>>> I think service core utilization would be very useful, although that would
>> require
>>>> some changes in the service function signature, so the service can report
>> back if
>>>> it did some useful work for a particular call.
>>>>
>>>> That would make for a DPDK 'top'. Just like 'top', it can't impose any serious
>>>> performance degradation when used, to be really useful, I think.
>>>>
>>>> Sure, it should be possible to turn it on and off. I thought that was the case
>>>> already?
>>> Thanks, yes, this exists already. Though the 'loops' counter is out of the stats
>> enable check, looks like it is considered as an attribute for some reason.
>>>
>>>>
>>>>>>
>>>>>> Idle service cores will basically do nothing else than stall waiting
>>>>>> for these lines, I suspect, hampering the progress of more busy cores.
>>>
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-06-29  6:34                     ` Mattias Rönnblom
@ 2022-06-29 20:07                       ` Honnappa Nagarahalli
  2022-06-30  8:46                         ` Van Haaren, Harry
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-06-29 20:07 UTC (permalink / raw)
  To: Mattias Rönnblom, Mattias Rönnblom, Morten Brørup, dev
  Cc: harry.van.haaren, nd, nd

<snip>
> >>>>>>>>>>
> >>>>>>>>>>> From: Mattias Rönnblom
> >>>>>>>>>>> [mailto:mattias.ronnblom@ericsson.com]
> >>>>>>>>>>> Sent: Monday, 27 June 2022 13.06
> >>>>>>>>>>>
> >>>>>>>>>>> Hi.
> >>>>>>>>>>>
> >>>>>>>>>>> Is it safe to enable stats on MT safe services?
> >>>>>>>>>>>
> >>>>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L366
> >>>>>>>>>>>
> >>>>>>>>>>> It seems to me this would have to be an __atomic_add for
> >>>>>>>>>>> this code to produce deterministic results.
> >>>>>>>>>>
> >>>>>>>>>> I agree. The same goes for the 'calls' field.
> >>>>>>>>> The calling function does the locking.
> >>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L398
> >>>>>>>>>
> >>>>>>>>> For more information you can look at:
> >>>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/include/rte_service.h#L120
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> What about the
> >>>>>>>> https://github.com/DPDK/dpdk/blob/main/lib/eal/common/rte_service.c#L404
> >>>>>>>> call (for MT safe services)?
> >>>>>>>>
> >>>>>>>> There's no lock held there.
> >>>>>>> Good point.
> >>>>>>> This is the case where the service running in service cores is
> >>>>>>> MT safe. However,
> >>>>>> the stats are incremented outside of the MT Safety mechanism
> >>>>>> employed by the service. So, yes, this and other updates in the
> >>>>>> function 'service_runner_do_callback' need to be updated atomically.
> >>>>>>
> >>>>>> Maybe a better solution would be to move this to the core_state
> >>>>>> struct (and eliminate the "calls" field since the same
> >>>>>> information is already in the core_state struct). The contention
> >>>>>> on these cache lines will be pretty crazy for services with short
> >>>>>> run time (say a thousand cycles or less per call), assuming they
> >>>>>> are mapped to many
> >> cores.
> >>>>> That's one option, the structures are internal as well. With this
> >>>>> option stats
> >>>> need to be aggregated which will not give an accurate view. But,
> >>>> that is the nature of the statistics.
> >>>>>
> >>>>
> >>>> Per-core counters is a very common pattern. Used for Linux MIB
> >>>> counters, for example. I'm not sure I think it's much less accurate.
> >>>> I mean, you just load in quick succession what's globally visible
> >>>> for the different per-lcore counters. If you do a relaxed store on
> >>>> an ARM, how long time does it take until it's seen by someone doing
> >>>> a relaxed
> >> load on a different core? Roughly.
> >>> I think my explanation of the problem is not clear.
> >>>
> >>> If a service is running on more than one core and the stats are per
> >>> core, when
> >> we aggregate, the resulting statistics is not atomic. By making the
> >> stats per core, we will be taking out that feature which is present
> >> currently (even though it is implemented incorrectly). As we agree,
> >> the proposed change is a common pattern and taking away the atomicity of
> the stats might not be a problem.
> >>>
> >>
> >> Isn't it just a push model, versus a pull one? Both give just an
> >> approximation, albeit a very good one, of how many cycles are spent
> >> "now" for a particular service. Isn't time a local phenomena in a SMP
> >> system, and there is no global "now"? Maybe you can achieve such with
> >> a transaction or handshake of some sort, but I don't see how the an
> __atomic_add would be enough.
> > If we consider a global time line of events, using atomic operation will provide
> a single 'now' from the reader's perspective (of course there might be writers
> waiting to update). Without the atomic operations, there will not be a single
> 'now' from reader's perspective, there will be multiple read events on the
> timeline.
> >
> 
> At the time of the read operation (in the global counter solution), there may well
> be cycles consumed or calls having been made, but not yet posted. The window
> between call having been made, and global counter having been incremented
> (and thus made globally visible) is small, but non-zero.
Agree. The read value is the atomic state of the system at a given instance (when the read was executed), though that instance happened a few cycles back.
(Just to be clear, I am fine with per-core counters)

> 
> (The core-local counter solution also use atomic operations, although not
> __atomic_add, but store, for the producer, and load, for the consumer.)
Agree we need atomic operations. I am not sure if __atomic_fetch_add or __atomic_store_n would make a large difference; __atomic_fetch_add would result in fewer instructions. I am fine with either.

> 
> >>
> >> I was fortunate to get some data from a real-world application, and
> >> enabling service core stats resulted in a 7% degradation of overall
> >> system capacity. I'm guessing atomic instructions would not make things
> better.
> > Is the service running on multiple cores?
> >
> 
> Yes. I think something like 6 cores were used in this case. The effect will grow
> with core count, obviously. On a large system, I don't think you will do much
> else but fight for this cache line.
> 
> If you want post counter updates to some shared data structure, you need to
> batch the updates to achieve reasonable efficiency. That will be, obviously, at
> the cost of accuracy, since there will be a significant delay between local-
> counter-increment, and post-in-global-data-structure. The system will be much
> less able to answer how many cycles have been consumed at a particular point
> in time.
> 
> For really large counter sets, the above, batched-update approach may be
> required. You simply can't afford the memory required to duplicate the counter
> struct across all cores in the system. In my experience, this still can be made to
> meet real-world counter accuracy requirement.
> (Accuracy in the time dimension.)
> 
> >>
> >>>>
> >>>>> I am also wondering if these stats are of any use other than for
> debugging.
> >>>> Adding a capability to disable stats might help as well.
> >>>>>
> >>>>
> >>>> They could be used as a crude tool to determine service core utilization.
> >>>> Comparing utilization between different services running on the same core
> >>>> should be straight-forward, but lcore utilization is harder in absolute
> terms. If
> >>>> you just look at "cycles", a completely idle core would look like it's very
> busy
> >>>> (basically rdtsc latency added for every loop). I assume you'd have to do
> >> some
> >>>> heuristic based on both "calls" and "cycles" to get an estimate.
> >>>>
> >>>> I think service core utilization would be very useful, although that would
> >> require
> >>>> some changes in the service function signature, so the service can report
> >> back if
> >>>> it did some useful work for a particular call.
> >>>>
> >>>> That would make for a DPDK 'top'. Just like 'top', it can't impose any
> serious
> >>>> performance degradation when used, to be really useful, I think.
> >>>>
> >>>> Sure, it should be possible to turn it on and off. I thought that was the case
> >>>> already?
> >>> Thanks, yes, this exists already. Though the 'loops' counter is out of the stats
> >> enable check, looks like it is considered as an attribute for some reason.
> >>>
> >>>>
> >>>>>>
> >>>>>> Idle service cores will basically do nothing else than stall waiting
> >>>>>> for these lines, I suspect, hampering the progress of more busy cores.
> >>>
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-06-29 20:07                       ` Honnappa Nagarahalli
@ 2022-06-30  8:46                         ` Van Haaren, Harry
  2022-07-01 18:37                           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Van Haaren, Harry @ 2022-06-30  8:46 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Mattias Rönnblom, mattias.ronnblom,
	Morten Brørup, dev
  Cc: nd, nd

Hey All,

(I've been away on Holidays a few days, just catching up!)

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Wednesday, June 29, 2022 9:07 PM
> To: Mattias Rönnblom <hofors@lysator.liu.se>; mattias.ronnblom
> <mattias.ronnblom@ericsson.com>; Morten Brørup
> <mb@smartsharesystems.com>; dev@dpdk.org
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; nd <nd@arm.com>; nd
> <nd@arm.com>
> Subject: RE: Service core statistics MT safety
> 
> <snip>

<big snip of previous discussions>

> > At the time of the read operation (in the global counter solution), there may well
> > be cycles consumed or calls having been made, but not yet posted. The window
> > between call having been made, and global counter having been incremented
> > (and thus made globally visible) is small, but non-zero.
> Agree. The read value is the atomic state of the system at a given instance (when the
> read was executed), though that instance happened few cycles back.
> (Just to be clear, I am fine with per-core counters)

Option 1: "Per core counters"

> Agree we need atomic operations. I am not sure if __atomic_fetch_add or
> __atomic_store_n would have a large difference. __atomic_fetch_add would result
> in less number of instructions. I am fine with either.

Option 2: "Use atomics for counter increments".

> > >> I was fortunate to get some data from a real-world application, and
> > >> enabling service core stats resulted in a 7% degradation of overall
> > >> system capacity. I'm guessing atomic instructions would not make things
> > better.

Agree, atomics are likely to reduce performance, but correctness
is worth more than performance.

<snip>

In my mind, any LTS/backports get the simplest/highest-confidence bugfix: using atomics.
The atomics are behind the "service stats" feature enable, so impact is only when those are
enabled.

If there is still a performance hit, and there are *no* MT services registered, we could check
a static-global flag, and if there are no MT services use the normal adds. Thoughts on such a
solution to reduce atomic perf impact only to apps with MT-services? 

The code changes themselves are OK. I can send a patch with the fix if there's agreement on the approach?


diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c
index ef31b1f63c..a07c8fc2d7 100644
--- a/lib/eal/common/rte_service.c
+++ b/lib/eal/common/rte_service.c
@@ -363,9 +363,9 @@ service_runner_do_callback(struct rte_service_spec_impl *s,
                uint64_t start = rte_rdtsc();
                s->spec.callback(userdata);
                uint64_t end = rte_rdtsc();
-               s->cycles_spent += end - start;
+               __atomic_fetch_add(&s->cycles_spent, (end-start), __ATOMIC_RELAXED);
+               __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
                cs->calls_per_service[service_idx]++;
-               s->calls++;
        } else
                s->spec.callback(userdata);
 }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-06-30  8:46                         ` Van Haaren, Harry
@ 2022-07-01 18:37                           ` Honnappa Nagarahalli
  2022-07-01 20:51                             ` Morten Brørup
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-01 18:37 UTC (permalink / raw)
  To: Van Haaren, Harry, Mattias Rönnblom, mattias.ronnblom,
	Morten Brørup, dev
  Cc: nd, nd, nd

<snip>
> 
> <big snip of previous discussions>
> 
> > > At the time of the read operation (in the global counter solution),
> > > there may well be cycles consumed or calls having been made, but not
> > > yet posted. The window between call having been made, and global
> > > counter having been incremented (and thus made globally visible) is small,
> but non-zero.
> > Agree. The read value is the atomic state of the system at a given
> > instance (when the read was executed), though that instance happened few
> cycles back.
> > (Just to be clear, I am fine with per-core counters)
> 
> Option 1: "Per core counters"
> 
> > Agree we need atomic operations. I am not sure if __atomic_fetch_add
> > or __atomic_store_n would have a large difference. __atomic_fetch_add
> > would result in less number of instructions. I am fine with either.
> 
> Option 2: "Use atomics for counter increments".
> 
> > > >> I was fortunate to get some data from a real-world application,
> > > >> and enabling service core stats resulted in a 7% degradation of
> > > >> overall system capacity. I'm guessing atomic instructions would
> > > >> not make things
> > > better.
> 
> Agree, performance of atomics is likely to reduce performance.. but correctness
> is worth more than performance.
> 
> <snip>
> 
> In my mind, any LTS/backports get the simplest/highest-confidence bugfix: using
> atomics.
> The atomics are behind the "service stats" feature enable, so impact is only
> when those are enabled.
> 
> If there is still a performance hit, and there are *no* MT services registered, we
> could check a static-global flag, and if there are no MT services use the normal
> adds. Thoughts on such a solution to reduce atomic perf impact only to apps
> with MT-services?
The choice is not between atomic stats and non-atomic stats. The stats need to be incremented atomically in both cases (MT or non-MT services), as the reader could be another thread. We need to ensure that the stores and loads are not split. Hence, we need atomic operations; we cannot avoid the performance hit here.

I think the choice is between per-core stats and global stats. IMO, we should go with per-core stats to be aligned with the norm.

> 
> The code changes themselves are OK.. I can send a patch with fix if there's
> agreement on the approach?
> 
> 
> diff --git a/lib/eal/common/rte_service.c b/lib/eal/common/rte_service.c index
> ef31b1f63c..a07c8fc2d7 100644
> --- a/lib/eal/common/rte_service.c
> +++ b/lib/eal/common/rte_service.c
> @@ -363,9 +363,9 @@ service_runner_do_callback(struct
> rte_service_spec_impl *s,
>                 uint64_t start = rte_rdtsc();
>                 s->spec.callback(userdata);
>                 uint64_t end = rte_rdtsc();
> -               s->cycles_spent += end - start;
> +               __atomic_fetch_add(&s->cycles_spent, (end-start),
> __ATOMIC_RELAXED);
> +               __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
>                 cs->calls_per_service[service_idx]++;
> -               s->calls++;
>         } else
>                 s->spec.callback(userdata);  }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-07-01 18:37                           ` Honnappa Nagarahalli
@ 2022-07-01 20:51                             ` Morten Brørup
  2022-07-05 17:09                               ` Van Haaren, Harry
  0 siblings, 1 reply; 19+ messages in thread
From: Morten Brørup @ 2022-07-01 20:51 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Van Haaren, Harry, Mattias Rönnblom,
	mattias.ronnblom, dev
  Cc: nd, nd, nd

> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Friday, 1 July 2022 20.38
> 
> <snip>
> >
> > <big snip of previous discussions>
> >
> > > > At the time of the read operation (in the global counter
> solution),
> > > > there may well be cycles consumed or calls having been made, but
> not
> > > > yet posted. The window between call having been made, and global
> > > > counter having been incremented (and thus made globally visible)
> is small,
> > but non-zero.
> > > Agree. The read value is the atomic state of the system at a given
> > > instance (when the read was executed), though that instance
> happened few
> > cycles back.
> > > (Just to be clear, I am fine with per-core counters)
> >
> > Option 1: "Per core counters"
> >
> > > Agree we need atomic operations. I am not sure if
> __atomic_fetch_add
> > > or __atomic_store_n would have a large difference.
> __atomic_fetch_add
> > > would result in less number of instructions. I am fine with either.
> >
> > Option 2: "Use atomics for counter increments".
> >
> > > > >> I was fortunate to get some data from a real-world
> application,
> > > > >> and enabling service core stats resulted in a 7% degradation
> of
> > > > >> overall system capacity. I'm guessing atomic instructions
> would
> > > > >> not make things
> > > > better.
> >
> > Agree, performance of atomics is likely to reduce performance.. but
> correctness
> > is worth more than performance.
> >
> > <snip>
> >
> > In my mind, any LTS/backports get the simplest/highest-confidence
> bugfix: using
> > atomics.
> > The atomics are behind the "service stats" feature enable, so impact
> is only
> > when those are enabled.
> >
> > If there is still a performance hit, and there are *no* MT services
> registered, we
> > could check a static-global flag, and if there are no MT services use
> the normal
> > adds. Thoughts on such a solution to reduce atomic perf impact only
> to apps
> > with MT-services?
> The choice is not between atomic stats and non-atomic stats. The stats
> need to be incremented atomically in both cases(MT or no MT services)
> as the reader could be another thread. We need to ensure that the
> stores and loads are not split. Hence, we need atomic operations. We
> cannot avoid any performance hit here.
> 
> I think the choice is between per-core stats vs global stats. IMO, we
> should go with per-core stats to be aligned with the norm.
+1

And the per-core counters can be summed when read for statistics, so it still looks like one counter per service. They are only updated per-core to prevent cache thrashing.

> 
> >
> > The code changes themselves are OK.. I can send a patch with fix if
> there's
> > agreement on the approach?
> >
> >
> > diff --git a/lib/eal/common/rte_service.c
> b/lib/eal/common/rte_service.c index
> > ef31b1f63c..a07c8fc2d7 100644
> > --- a/lib/eal/common/rte_service.c
> > +++ b/lib/eal/common/rte_service.c
> > @@ -363,9 +363,9 @@ service_runner_do_callback(struct
> > rte_service_spec_impl *s,
> >                 uint64_t start = rte_rdtsc();
> >                 s->spec.callback(userdata);
> >                 uint64_t end = rte_rdtsc();
> > -               s->cycles_spent += end - start;
> > +               __atomic_fetch_add(&s->cycles_spent, (end-start),
> > __ATOMIC_RELAXED);
> > +               __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> >                 cs->calls_per_service[service_idx]++;
> > -               s->calls++;
> >         } else
> >                 s->spec.callback(userdata);  }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-07-01 20:51                             ` Morten Brørup
@ 2022-07-05 17:09                               ` Van Haaren, Harry
  2022-07-07 17:26                                 ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Van Haaren, Harry @ 2022-07-05 17:09 UTC (permalink / raw)
  To: Morten Brørup, Honnappa Nagarahalli, Mattias Rönnblom,
	mattias.ronnblom, dev
  Cc: nd

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Friday, July 1, 2022 9:52 PM
> To: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; Van Haaren, Harry
> <harry.van.haaren@intel.com>; Mattias Rönnblom <hofors@lysator.liu.se>;
> mattias.ronnblom <mattias.ronnblom@ericsson.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: Service core statistics MT safety
> 
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Friday, 1 July 2022 20.38
> >
> > <snip>

<snip>

> > > In my mind, any LTS/backports get the simplest/highest-confidence
> > bugfix: using
> > > atomics.
> > > The atomics are behind the "service stats" feature enable, so impact
> > is only
> > > when those are enabled.
> > >
> > > If there is still a performance hit, and there are *no* MT services
> > registered, we
> > > could check a static-global flag, and if there are no MT services use
> > the normal
> > > adds. Thoughts on such a solution to reduce atomic perf impact only
> > to apps
> > > with MT-services?
> > The choice is not between atomic stats and non-atomic stats. The stats
> > need to be incremented atomically in both cases(MT or no MT services)
> > as the reader could be another thread. We need to ensure that the
> > stores and loads are not split.
> >
> > Hence, we need atomic operations. We cannot avoid any performance hit here.
> >
> > I think the choice is between per-core stats vs global stats. IMO, we
> > should go with per-core stats to be aligned with the norm.
> +1
> 
> And the per-core counters can be summed when read for statistics, so it still looks
> like one counter per service. They are only updated per-core to prevent cache
> thrashing.

Yes, understood and agree. Per-core counters are preferred, as relaxed-ordered atomic stores are enough
to ensure non-tearing loads. Summation is done in the stats-requesting thread before reporting back.
For backporting, per-core counters would be a significant change. Is using atomics to fix the misbehaviour a better idea?

My question below is still open, is the below enough to fix the *functional* part of MT services?

Code today uses normal ADD/INCQ instructions (and hence the tear/clobber bug on MT increments exists)
   0x000055555649189d <+141>:   add    %rax,0x50(%rbx)
   0x00005555564918a1 <+145>:   incq   0x48(%rbx)

For x86, my disassembly for the RELAXED and SEQ_CST orderings is the same, using the LOCK prefix ("proper atomics")
   0x000055555649189d <+141>:   lock add %rax,0x50(%rbx)
   0x00005555564918a2 <+146>:   lock incq 0x48(%rbx)

My understanding is that since 2x threads can write the same address, SEQ_CST is required.
Can somebody more familiar with C11 confirm that?

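As a self-contained illustration of the tear/clobber concern above: with __atomic_fetch_add, two threads incrementing the same counter are guaranteed to reach the exact total, which is precisely the guarantee plain add/incq lacks. The counter, iteration count and helper names here are made up for the demo, not rte_service code:

```c
/* Hypothetical two-thread demo: a shared counter bumped with
 * __atomic_fetch_add never loses updates under concurrency. */
#include <pthread.h>
#include <stdint.h>

#define ITERS 100000

static uint64_t shared_counter;

static void *bump(void *arg)
{
	int i;

	(void)arg;
	for (i = 0; i < ITERS; i++)
		__atomic_fetch_add(&shared_counter, 1, __ATOMIC_RELAXED);
	return NULL;
}

/* Runs two concurrent incrementing threads and returns the final count;
 * with atomics this is always exactly 2 * ITERS. */
static uint64_t run_two_threads(void)
{
	pthread_t a, b;

	shared_counter = 0;
	pthread_create(&a, NULL, bump, NULL);
	pthread_create(&b, NULL, bump, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return shared_counter;
}
```

Replacing the atomic add with a plain `shared_counter++` makes the final count nondeterministic on MT, since concurrent read-modify-writes can clobber each other.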

> > > The code changes themselves are OK.. I can send a patch with fix if
> > there's
> > > agreement on the approach?
> > >
> > >
> > > diff --git a/lib/eal/common/rte_service.c
> > b/lib/eal/common/rte_service.c index
> > > ef31b1f63c..a07c8fc2d7 100644
> > > --- a/lib/eal/common/rte_service.c
> > > +++ b/lib/eal/common/rte_service.c
> > > @@ -363,9 +363,9 @@ service_runner_do_callback(struct
> > > rte_service_spec_impl *s,
> > >                 uint64_t start = rte_rdtsc();
> > >                 s->spec.callback(userdata);
> > >                 uint64_t end = rte_rdtsc();
> > > -               s->cycles_spent += end - start;
> > > +               __atomic_fetch_add(&s->cycles_spent, (end-start),
> > > __ATOMIC_RELAXED);
> > > +               __atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
> > >                 cs->calls_per_service[service_idx]++;
> > > -               s->calls++;
> > >         } else
> > >                 s->spec.callback(userdata);  }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-07-05 17:09                               ` Van Haaren, Harry
@ 2022-07-07 17:26                                 ` Honnappa Nagarahalli
  2022-07-08 12:58                                   ` Van Haaren, Harry
  0 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2022-07-07 17:26 UTC (permalink / raw)
  To: Van Haaren, Harry, Morten Brørup, Mattias Rönnblom,
	mattias.ronnblom, dev
  Cc: nd, nd

<snip>

> 
> > > > In my mind, any LTS/backports get the simplest/highest-confidence
> > > bugfix: using
> > > > atomics.
> > > > The atomics are behind the "service stats" feature enable, so
> > > > impact
> > > is only
> > > > when those are enabled.
> > > >
> > > > If there is still a performance hit, and there are *no* MT
> > > > services
> > > registered, we
> > > > could check a static-global flag, and if there are no MT services
> > > > use
> > > the normal
> > > > adds. Thoughts on such a solution to reduce atomic perf impact
> > > > only
> > > to apps
> > > > with MT-services?
> > > The choice is not between atomic stats and non-atomic stats. The
> > > stats need to be incremented atomically in both cases(MT or no MT
> > > services) as the reader could be another thread. We need to ensure
> > > that the stores and loads are not split.
> > >
> > > Hence, we need atomic operations. We cannot avoid any performance hit
> here.
> > >
> > > I think the choice is between per-core stats vs global stats. IMO,
> > > we should go with per-core stats to be aligned with the norm.
> > +1
> >
> > And the per-core counters can be summed when read for statistics, so
> > it still looks like one counter per service. They are only updated
> > per-core to prevent cache thrashing.
> 
> Yes, understood and agree. Per-core counters are preferred, as relaxed atomic
> ordered stores are enough to ensure non-tearing loads. Summation before
> reporting back from the stats-requesting thread.
> For backporting, per-core counters is a significant change. Is using atomics to fix
> the misbehaviour a better idea?
Agree, the change will be simple.

This will result in some perf impact which can be mitigated by disabling the stats where possible. 

> 
> My question below is still open, is the below enough to fix the *functional* part
> of MT services?
> 
> Code today uses normal ADD/INCQ (and hence the MT increments bug of
> tear/clobber exists)
>    0x000055555649189d <+141>:   add    %rax,0x50(%rbx)
>    0x00005555564918a1 <+145>:   incq   0x48(%rbx)
> 
> For x86, my disassembly for RELAXED and SEQ_CST orderings are the same,
> using LOCK prefix ("proper atomics")
>    0x000055555649189d <+141>:   lock add %rax,0x50(%rbx)
>    0x00005555564918a2 <+146>:   lock incq 0x48(%rbx)
> 
> My understanding is that since 2x threads can write the same address, SEQ_CST
> is required.
> Can somebody more familiar with C11 confirm that?
We need to use RELAXED. You can think of it as, 'do we need a particular order for memory operations within a single thread?' In this case, we do not need a particular order for incrementing stats in the single thread context. 
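To make the RELAXED reasoning concrete, a self-contained sketch of the fix follows; svc_stats and record_run are illustrative names mirroring the quoted patch, not the actual rte_service.c API:

```c
/* Sketch of the relaxed-increment stats fix with illustrative names. */
#include <stdint.h>

struct svc_stats {
	uint64_t calls;
	uint64_t cycles_spent;
};

static void record_run(struct svc_stats *s, uint64_t start, uint64_t end)
{
	/* __ATOMIC_RELAXED is enough: each counter stands alone and no
	 * other memory is published through it; atomicity only has to
	 * keep concurrent read-modify-writes from tearing or clobbering
	 * each other. No cross-thread ordering is required. */
	__atomic_fetch_add(&s->cycles_spent, end - start, __ATOMIC_RELAXED);
	__atomic_fetch_add(&s->calls, 1, __ATOMIC_RELAXED);
}
```

SEQ_CST would also be correct here, but it buys nothing for independent counters and can cost extra fences on weakly ordered architectures such as arm64.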

> 
> 
> > > > The code changes themselves are OK.. I can send a patch with fix
> > > > if
> > > there's
> > > > agreement on the approach?
> > > >
> > > >
> > > > diff --git a/lib/eal/common/rte_service.c
> > > b/lib/eal/common/rte_service.c index
> > > > ef31b1f63c..a07c8fc2d7 100644
> > > > --- a/lib/eal/common/rte_service.c
> > > > +++ b/lib/eal/common/rte_service.c
> > > > @@ -363,9 +363,9 @@ service_runner_do_callback(struct
> > > > rte_service_spec_impl *s,
> > > >                 uint64_t start = rte_rdtsc();
> > > >                 s->spec.callback(userdata);
> > > >                 uint64_t end = rte_rdtsc();
> > > > -               s->cycles_spent += end - start;
> > > > +               __atomic_fetch_add(&s->cycles_spent, (end-start),
> > > > __ATOMIC_RELAXED);
> > > > +               __atomic_fetch_add(&s->calls, 1,
> > > > + __ATOMIC_RELAXED);
> > > >                 cs->calls_per_service[service_idx]++;
> > > > -               s->calls++;
> > > >         } else
> > > >                 s->spec.callback(userdata);  }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* RE: Service core statistics MT safety
  2022-07-07 17:26                                 ` Honnappa Nagarahalli
@ 2022-07-08 12:58                                   ` Van Haaren, Harry
  0 siblings, 0 replies; 19+ messages in thread
From: Van Haaren, Harry @ 2022-07-08 12:58 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Morten Brørup, Mattias Rönnblom,
	mattias.ronnblom, dev
  Cc: nd, nd

> -----Original Message-----
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Sent: Thursday, July 7, 2022 6:26 PM
> To: Van Haaren, Harry <harry.van.haaren@intel.com>; Morten Brørup
> <mb@smartsharesystems.com>; Mattias Rönnblom <hofors@lysator.liu.se>;
> mattias.ronnblom <mattias.ronnblom@ericsson.com>; dev@dpdk.org
> Cc: nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: Service core statistics MT safety
> 
> <snip>

<snip>

> > Yes, understood and agree. Per-core counters are preferred, as relaxed atomic
> > ordered stores are enough to ensure non-tearing loads. Summation before
> > reporting back from the stats-requesting thread.
> > For backporting, per-core counters is a significant change. Is using atomics to fix
> > the misbehaviour a better idea?
> Agree, the change will be simple.
> 
> This will result in some perf impact which can be mitigated by disabling the stats
> where possible.

Yes, agreed; disabling stats is a workaround if they're not required.
Correctness > performance. We can optimize to per-CPU stats in a future release.


> > My question below is still open, is the below enough to fix the *functional* part
> > of MT services?
> >
> > Code today uses normal ADD/INCQ (and hence the MT increments bug of
> > tear/clobber exists)
> >    0x000055555649189d <+141>:   add    %rax,0x50(%rbx)
> >    0x00005555564918a1 <+145>:   incq   0x48(%rbx)
> >
> > For x86, my disassembly for RELAXED and SEQ_CST orderings are the same,
> > using LOCK prefix ("proper atomics")
> >    0x000055555649189d <+141>:   lock add %rax,0x50(%rbx)
> >    0x00005555564918a2 <+146>:   lock incq 0x48(%rbx)
> >
> > My understanding is that since 2x threads can write the same address, SEQ_CST
> > is required.
> > Can somebody more familiar with C11 confirm that?
> We need to use RELAXED. You can think of it as, 'do we need a particular order for
> memory operations within a single thread?' In this case, we do not need a particular
> order for incrementing stats in the single thread context.

Thanks for confirming; makes sense.

2 patches sent, you're on CC.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-07-08 12:58 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-27 11:05 Service core statistics MT safety Mattias Rönnblom
2022-06-27 12:31 ` Morten Brørup
2022-06-27 17:39   ` Honnappa Nagarahalli
2022-06-27 17:47     ` Mattias Rönnblom
2022-06-27 18:19       ` Honnappa Nagarahalli
2022-06-27 20:00         ` Mattias Rönnblom
2022-06-28  0:14           ` Honnappa Nagarahalli
2022-06-28  6:52             ` Mattias Rönnblom
2022-06-28 15:24               ` Honnappa Nagarahalli
2022-06-28 18:30                 ` Mattias Rönnblom
2022-06-28 19:15                   ` Honnappa Nagarahalli
2022-06-29  6:34                     ` Mattias Rönnblom
2022-06-29 20:07                       ` Honnappa Nagarahalli
2022-06-30  8:46                         ` Van Haaren, Harry
2022-07-01 18:37                           ` Honnappa Nagarahalli
2022-07-01 20:51                             ` Morten Brørup
2022-07-05 17:09                               ` Van Haaren, Harry
2022-07-07 17:26                                 ` Honnappa Nagarahalli
2022-07-08 12:58                                   ` Van Haaren, Harry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).