[RFC] Dynamic log/trace control via telemetry

DPDK patches and discussions
 help / color / mirror / Atom feed

* [RFC] Dynamic log/trace control via telemetry
@ 2022-08-15 23:17 Dmitry Kozlyuk
  2022-08-17  1:25 ` fengchengwen
  2022-08-17  2:08 ` Stephen Hemminger
  0 siblings, 2 replies; 8+ messages in thread
From: Dmitry Kozlyuk @ 2022-08-15 23:17 UTC (permalink / raw)
  To: dev

When debugging a live app, useful info can be obtained from logs or traces
that were not enabled when it was started and it is undesirable to restart.
Furthermore, unless the app authors have considered tracing,
rte_trace_save() is only called on exit, i.e. a shutdown is required again.

What if the telemetry socket gave the missing control?
For example:

--> /eal/log/set_level,debug
{"/eal/log/set_level": {"status": "success"}}
--> /eal/log/set_level,pmd.net.*:debug
{"/eal/log/set_level": {"status": "success"}}
--> /eal/log/set_level,lib.[a-d],debug
{"/eal/log/set_level": {"status": "success"}}
--> /eal/log/set_level,foobar
{"/eal/log/set_level": {"status": "fail", "message": "Invalid log level: foobar"}}

Tracing is more complicated because it requires resource allocation.
It could be:

/eal/trace/set_mode,discard
/eal/trace/set_mode,overwrite
    rte_trace_mode_set()

/eal/trace/enable_pattern,<pattern>
/eal/trace/enable_regex,<regex>
/eal/trace/disable_pattern,<pattern>
/eal/trace/disable_regex,<regex>
    rte_trace_enable_pattern()
    rte_trace_enable_regex()

/eal/trace/rearm
    Clear the trace buffer.
    Apply the new settings accumulated by the previous commands.

/eal/trace/save
    rte_trace_save()

An open question is how to deal with multi-process.
Only the primary process listens to the telemetry socket.
Log and trace settings are not shared between processes.
OTOH, when enabling some log source, it is desirable to do in all processes.

In general, telemetry is not for changing the state of the app,
but logs and traces are diagnostic information that seems to be in-scope.
A similar suggestion was not opposed recently:

    http://inbox.dpdk.org/dev/24c49429394294cfbf0d9c506b205029bac77c8b.1657890378.git.anatoly.burakov@intel.com/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Dynamic log/trace control via telemetry
  2022-08-15 23:17 [RFC] Dynamic log/trace control via telemetry Dmitry Kozlyuk
@ 2022-08-17  1:25 ` fengchengwen
  2022-08-17  2:08 ` Stephen Hemminger
  1 sibling, 0 replies; 8+ messages in thread
From: fengchengwen @ 2022-08-17  1:25 UTC (permalink / raw)
  To: Dmitry Kozlyuk, dev

On 2022/8/16 7:17, Dmitry Kozlyuk wrote:
> When debugging a live app, useful info can be obtained from logs or traces
> that were not enabled when it was started and it is undesirable to restart.
> Furthermore, unless the app authors have considered tracing,
> rte_trace_save() is only called on exit, i.e. a shutdown is required again.

Good idea!


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Dynamic log/trace control via telemetry
  2022-08-15 23:17 [RFC] Dynamic log/trace control via telemetry Dmitry Kozlyuk
  2022-08-17  1:25 ` fengchengwen
@ 2022-08-17  2:08 ` Stephen Hemminger
  2022-08-17 15:15   ` Dmitry Kozlyuk
  1 sibling, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2022-08-17  2:08 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev

On Tue, 16 Aug 2022 02:17:38 +0300
Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:

> When debugging a live app, useful info can be obtained from logs or traces
> that were not enabled when it was started and it is undesirable to restart.
> Furthermore, unless the app authors have considered tracing,
> rte_trace_save() is only called on exit, i.e. a shutdown is required again.
> 
> What if the telemetry socket gave the missing control?
> For example:
> 
> --> /eal/log/set_level,debug  
> {"/eal/log/set_level": {"status": "success"}}
> --> /eal/log/set_level,pmd.net.*:debug  
> {"/eal/log/set_level": {"status": "success"}}
> --> /eal/log/set_level,lib.[a-d],debug  
> {"/eal/log/set_level": {"status": "success"}}
> --> /eal/log/set_level,foobar  
> {"/eal/log/set_level": {"status": "fail", "message": "Invalid log level: foobar"}}
> 
> Tracing is more complicated because it requires resource allocation.
> It could be:
> 
> /eal/trace/set_mode,discard
> /eal/trace/set_mode,overwrite
>     rte_trace_mode_set()
> 
> /eal/trace/enable_pattern,<pattern>
> /eal/trace/enable_regex,<regex>
> /eal/trace/disable_pattern,<pattern>
> /eal/trace/disable_regex,<regex>
>     rte_trace_enable_pattern()
>     rte_trace_enable_regex()
> 
> /eal/trace/rearm
>     Clear the trace buffer.
>     Apply the new settings accumulated by the previous commands.
> 
> /eal/trace/save
>     rte_trace_save()
> 
> An open question is how to deal with multi-process.
> Only the primary process listens to the telemetry socket.
> Log and trace settings are not shared between processes.
> OTOH, when enabling some log source, it is desirable to do in all processes.
> 
> In general, telemetry is not for changing the state of the app,
> but logs and traces are diagnostic information that seems to be in-scope.
> A similar suggestion was not opposed recently:
> 
>     http://inbox.dpdk.org/dev/24c49429394294cfbf0d9c506b205029bac77c8b.1657890378.git.anatoly.burakov@intel.com/

Not sure if turning telemetry into a do all control api makes sense.
This seems like a different API.
Also, the default would have to be disabled for application safety reasons.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Dynamic log/trace control via telemetry
  2022-08-17  2:08 ` Stephen Hemminger
@ 2022-08-17 15:15   ` Dmitry Kozlyuk
  2022-08-17 15:34     ` Stephen Hemminger
  2022-08-17 15:34     ` Morten Brørup
  0 siblings, 2 replies; 8+ messages in thread
From: Dmitry Kozlyuk @ 2022-08-17 15:15 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

2022-08-16 19:08 (UTC-0700), Stephen Hemminger:
> Not sure if turning telemetry into a do all control api makes sense.

I'm sure it doesn't, for "do all".
Controlling diagnostic collection and output, however,
is directly related to the telemetry purpose.

> This seems like a different API.
> Also, the default would have to be disabled for application safety reasons.

This feature would be for collecting additional info
in case the collection was not planned and a restart is not desired.
If it is disabled by default, it is likely to be off when it's needed.

Let's consider how exactly can safety be compromised.

1. Securing telemetry socket access is out of scope for DPDK,
   that is, any successful access is considered trusted.

2. Even read-only telemetry still comes at cost, for example,
   memory telemetry takes a global lock that blocks all allocations,
   so affecting the app performance is already possible.

3. Important logs and traces enabled at startup may be disabled dynamically.
   If it's an issue, the API can refuse to disable them.

4. Bogus logs may flood the output and slow down the app.
   Bogus traces can exhaust disk space.
   Logs should be monitored automatically, so flooding is just an annoyance.
   Disk space can have a quota.
   Since the user is trusted (item 1), even if they do it by mistake,
   they can quickly correct themselves using the same API.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Dynamic log/trace control via telemetry
  2022-08-17 15:15   ` Dmitry Kozlyuk
@ 2022-08-17 15:34     ` Stephen Hemminger
  2022-08-17 15:34     ` Morten Brørup
  1 sibling, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2022-08-17 15:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: dev

On Wed, 17 Aug 2022 18:15:03 +0300
Dmitry Kozlyuk <dmitry.kozliuk@gmail.com> wrote:

> 2022-08-16 19:08 (UTC-0700), Stephen Hemminger:
> > Not sure if turning telemetry into a do all control api makes sense.  
> 
> I'm sure it doesn't, for "do all".
> Controlling diagnostic collection and output, however,
> is directly related to the telemetry purpose.
> 
> > This seems like a different API.
> > Also, the default would have to be disabled for application safety reasons.  
> 
> This feature would be for collecting additional info
> in case the collection was not planned and a restart is not desired.
> If it is disabled by default, it is likely to be off when it's needed.
> 
> Let's consider how exactly can safety be compromised.
> 
> 1. Securing telemetry socket access is out of scope for DPDK,
>    that is, any successful access is considered trusted.
> 
> 2. Even read-only telemetry still comes at cost, for example,
>    memory telemetry takes a global lock that blocks all allocations,
>    so affecting the app performance is already possible.
> 
> 3. Important logs and traces enabled at startup may be disabled dynamically.
>    If it's an issue, the API can refuse to disable them.
> 
> 4. Bogus logs may flood the output and slow down the app.
>    Bogus traces can exhaust disk space.
>    Logs should be monitored automatically, so flooding is just an annoyance.
>    Disk space can have a quota.
>    Since the user is trusted (item 1), even if they do it by mistake,
>    they can quickly correct themselves using the same API.

There can be security impact to telemetry.
There always is some performance cost to telemetry.

My interest is that we run a performance sensitive application and it gets
lots of security review. If a new version of DPDK magically enabled something
that had impact, you would cause extra effort and confusion.

Developers often have the wrong point of view "my feature is great, everyone wants it"
and also "why should I test with this disabled".  New features should be opt-in not
opt-out.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [RFC] Dynamic log/trace control via telemetry
  2022-08-17 15:15   ` Dmitry Kozlyuk
  2022-08-17 15:34     ` Stephen Hemminger
@ 2022-08-17 15:34     ` Morten Brørup
  2022-08-20 15:19       ` Dmitry Kozlyuk
  1 sibling, 1 reply; 8+ messages in thread
From: Morten Brørup @ 2022-08-17 15:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk, Stephen Hemminger; +Cc: dev

> From: Dmitry Kozlyuk [mailto:dmitry.kozliuk@gmail.com]
> Sent: Wednesday, 17 August 2022 17.15
> 
> 2022-08-16 19:08 (UTC-0700), Stephen Hemminger:
> > Not sure if turning telemetry into a do all control api makes sense.
> 
> I'm sure it doesn't, for "do all".
> Controlling diagnostic collection and output, however,
> is directly related to the telemetry purpose.
> 
> > This seems like a different API.

I agree with Stephen regarding not making the telemetry library a "do all" control API. A separate API would be preferable.

And then, a wrapper through the telemetry interface can be provided to that API. Best of both worlds. :-)

> > Also, the default would have to be disabled for application safety
> reasons.
> 
> This feature would be for collecting additional info
> in case the collection was not planned and a restart is not desired.
> If it is disabled by default, it is likely to be off when it's needed.

All tracing, logging etc. MUST be disabled by default. You are suggesting the opposite, which will definitely impact performance.

And performance will become a valid argument for not adding more trace/logging to libraries, if all of it is enabled by default.

And my usual rant: I hope all of this can be disabled at build time - for maximum performance.

> 
> Let's consider how exactly can safety be compromised.
> 
> 1. Securing telemetry socket access is out of scope for DPDK,
>    that is, any successful access is considered trusted.
> 
> 2. Even read-only telemetry still comes at cost, for example,
>    memory telemetry takes a global lock that blocks all allocations,
>    so affecting the app performance is already possible.
> 
> 3. Important logs and traces enabled at startup may be disabled
> dynamically.
>    If it's an issue, the API can refuse to disable them.
> 
> 4. Bogus logs may flood the output and slow down the app.
>    Bogus traces can exhaust disk space.
>    Logs should be monitored automatically, so flooding is just an
> annoyance.
>    Disk space can have a quota.
>    Since the user is trusted (item 1), even if they do it by mistake,
>    they can quickly correct themselves using the same API.
> 

Here's a thought:

Add an API to set an "unlock key", so applications who don't want to allow these features for unauthorized users can prevent them from enabling it. Authorized users can use an API to unlock these features by providing the key.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] Dynamic log/trace control via telemetry
  2022-08-17 15:34     ` Morten Brørup
@ 2022-08-20 15:19       ` Dmitry Kozlyuk
  2022-08-22  6:47         ` Morten Brørup
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Kozlyuk @ 2022-08-20 15:19 UTC (permalink / raw)
  To: Morten Brørup; +Cc: Stephen Hemminger, dev

2022-08-17 17:34 (UTC+0200), Morten Brørup:
> > From: Dmitry Kozlyuk [mailto:dmitry.kozliuk@gmail.com]
> > Sent: Wednesday, 17 August 2022 17.15
> > 
> > 2022-08-16 19:08 (UTC-0700), Stephen Hemminger:  
> > > Not sure if turning telemetry into a do all control api makes sense.  
> > 
> > I'm sure it doesn't, for "do all".
> > Controlling diagnostic collection and output, however,
> > is directly related to the telemetry purpose.
> >   
> > > This seems like a different API.  
> 
> I agree with Stephen regarding not making the telemetry library a "do all" control API. A separate API would be preferable.
> 
> And then, a wrapper through the telemetry interface can be provided to that API. Best of both worlds. :-)

One of the reasons why I considered and suggested using the telemetry socket
was that a new channel, protocol, and API would be an overkill for the task.

It reminds me of an older "IF proxy library" proposal [1].
In the RFC discussion it was suggested that it could be a generic mechanism
to deliver external events to DPDK, although, unlike events, commands need
responses. I also found Thomas' message [2] that suggested log/trace control
(never knew it was proposed already) but discouraged adding more interfaces.
The final version of "IF proxy library" was nevertheless focused on netdev
specifics and AFAICT never accepted because of questions from that side.

So this RFC should be severely re-framed: 1) a generic event/command delivery
mechanism is needed in principle, and 2) log/trace control can be the first
usage of it, which may be easier to pull through than a complex interaction
with OS networking stack.

[1]: https://inbox.dpdk.org/dev/20200114142517.29522-1-aostruszka@marvell.com/

[2]: https://inbox.dpdk.org/dev/1734533.zqhfolzEdB@thomas/

> 
> > > Also, the default would have to be disabled for application safety  
> > reasons.
> > 
> > This feature would be for collecting additional info
> > in case the collection was not planned and a restart is not desired.
> > If it is disabled by default, it is likely to be off when it's needed.  
> 
> All tracing, logging etc. MUST be disabled by default. You are suggesting the opposite, which will definitely impact performance.
> 
> And performance will become a valid argument for not adding more trace/logging to libraries, if all of it is enabled by default.
> 
> And my usual rant: I hope all of this can be disabled at build time - for maximum performance.

The feature is a dynamic equivalent of existing --log-level/--trace options.
It doesn't affect performance until used to configure logs/traces that do.
Other arguments to have it off by default are valid; just to clarify.

> 
> > 
> > Let's consider how exactly can safety be compromised.
> > 
> > 1. Securing telemetry socket access is out of scope for DPDK,
> >    that is, any successful access is considered trusted.
> > 
> > 2. Even read-only telemetry still comes at cost, for example,
> >    memory telemetry takes a global lock that blocks all allocations,
> >    so affecting the app performance is already possible.
> > 
> > 3. Important logs and traces enabled at startup may be disabled
> > dynamically.
> >    If it's an issue, the API can refuse to disable them.
> > 
> > 4. Bogus logs may flood the output and slow down the app.
> >    Bogus traces can exhaust disk space.
> >    Logs should be monitored automatically, so flooding is just an
> > annoyance.
> >    Disk space can have a quota.
> >    Since the user is trusted (item 1), even if they do it by mistake,
> >    they can quickly correct themselves using the same API.
> >   
> 
> Here's a thought:
> 
> Add an API to set an "unlock key", so applications who don't want to allow these features for unauthorized users can prevent them from enabling it. Authorized users can use an API to unlock these features by providing the key.

Let's keep delegating AAA to an external proxy if needed.

Unlock key would not remove Stephen's concerns,
because it would still be another interface to analyze and to protect.
Worse, it's a home-grown security mechanism to consider.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* RE: [RFC] Dynamic log/trace control via telemetry
  2022-08-20 15:19       ` Dmitry Kozlyuk
@ 2022-08-22  6:47         ` Morten Brørup
  0 siblings, 0 replies; 8+ messages in thread
From: Morten Brørup @ 2022-08-22  6:47 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: Stephen Hemminger, dev

> From: Dmitry Kozlyuk [mailto:dmitry.kozliuk@gmail.com]
> Sent: Saturday, 20 August 2022 17.20
> To: Morten Brørup
> Cc: Stephen Hemminger; dev@dpdk.org
> Subject: Re: [RFC] Dynamic log/trace control via telemetry
> 
> 2022-08-17 17:34 (UTC+0200), Morten Brørup:
> > > From: Dmitry Kozlyuk [mailto:dmitry.kozliuk@gmail.com]
> > > Sent: Wednesday, 17 August 2022 17.15
> > >
> > > 2022-08-16 19:08 (UTC-0700), Stephen Hemminger:
> > > > Not sure if turning telemetry into a do all control api makes
> sense.
> > >
> > > I'm sure it doesn't, for "do all".
> > > Controlling diagnostic collection and output, however,
> > > is directly related to the telemetry purpose.
> > >
> > > > This seems like a different API.
> >
> > I agree with Stephen regarding not making the telemetry library a "do
> all" control API. A separate API would be preferable.
> >
> > And then, a wrapper through the telemetry interface can be provided
> to that API. Best of both worlds. :-)
> 
> One of the reasons why I considered and suggested using the telemetry
> socket
> was that a new channel, protocol, and API would be an overkill for the
> task.
> 
> It reminds me of an older "IF proxy library" proposal [1].
> In the RFC discussion it was suggested that it could be a generic
> mechanism
> to deliver external events to DPDK, although, unlike events, commands
> need
> responses. I also found Thomas' message [2] that suggested log/trace
> control
> (never knew it was proposed already) but discouraged adding more
> interfaces.
> The final version of "IF proxy library" was nevertheless focused on
> netdev
> specifics and AFAICT never accepted because of questions from that
> side.
> 
> So this RFC should be severely re-framed: 1) a generic event/command
> delivery
> mechanism is needed in principle, and 2) log/trace control can be the
> first
> usage of it, which may be easier to pull through than a complex
> interaction
> with OS networking stack.
> 
> [1]: https://inbox.dpdk.org/dev/20200114142517.29522-1-
> aostruszka@marvell.com/
> 
> [2]: https://inbox.dpdk.org/dev/1734533.zqhfolzEdB@thomas/
> 

Good arguments, Dmitry.

I 100 % agree that the telemetry API/socket is appropriate for changing log/trace settings.

I'm just not sure the telemetry API/socket is usable for streaming log and trace output, and I got these two things mixed up.

> >
> > > > Also, the default would have to be disabled for application
> safety
> > > reasons.
> > >
> > > This feature would be for collecting additional info
> > > in case the collection was not planned and a restart is not
> desired.
> > > If it is disabled by default, it is likely to be off when it's
> needed.
> >
> > All tracing, logging etc. MUST be disabled by default. You are
> suggesting the opposite, which will definitely impact performance.
> >
> > And performance will become a valid argument for not adding more
> trace/logging to libraries, if all of it is enabled by default.
> >
> > And my usual rant: I hope all of this can be disabled at build time -
> for maximum performance.
> 
> The feature is a dynamic equivalent of existing --log-level/--trace
> options.
> It doesn't affect performance until used to configure logs/traces that
> do.
> Other arguments to have it off by default are valid; just to clarify.

OK. Then I have no further concerns about performance.

> 
> >
> > >
> > > Let's consider how exactly can safety be compromised.
> > >
> > > 1. Securing telemetry socket access is out of scope for DPDK,
> > >    that is, any successful access is considered trusted.
> > >
> > > 2. Even read-only telemetry still comes at cost, for example,
> > >    memory telemetry takes a global lock that blocks all
> allocations,
> > >    so affecting the app performance is already possible.
> > >
> > > 3. Important logs and traces enabled at startup may be disabled
> > > dynamically.
> > >    If it's an issue, the API can refuse to disable them.
> > >
> > > 4. Bogus logs may flood the output and slow down the app.
> > >    Bogus traces can exhaust disk space.
> > >    Logs should be monitored automatically, so flooding is just an
> > > annoyance.
> > >    Disk space can have a quota.
> > >    Since the user is trusted (item 1), even if they do it by
> mistake,
> > >    they can quickly correct themselves using the same API.
> > >
> >
> > Here's a thought:
> >
> > Add an API to set an "unlock key", so applications who don't want to
> allow these features for unauthorized users can prevent them from
> enabling it. Authorized users can use an API to unlock these features
> by providing the key.
> 
> Let's keep delegating AAA to an external proxy if needed.
> 
> Unlock key would not remove Stephen's concerns,
> because it would still be another interface to analyze and to protect.
> Worse, it's a home-grown security mechanism to consider.

Tracing, logging and telemetry has lots of existing security issues.

I agree that this fact should not prevent any new features, such as your proposal here. AAA regarding telemetry should be addressed separately.

The "unlock key" concept was inspired by the silicon world. Some chips require "keys" to unlock access to their firmware, to protect them from reverse engineering. I was thinking we could use a similar approach to protect proprietary DPDK applications from being reverse engineered through telemetry. But again, I agree with your response; let's not blend existing tracing/logging/telemetry security issues into this RFC.



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-08-22  6:48 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-15 23:17 [RFC] Dynamic log/trace control via telemetry Dmitry Kozlyuk
2022-08-17  1:25 ` fengchengwen
2022-08-17  2:08 ` Stephen Hemminger
2022-08-17 15:15   ` Dmitry Kozlyuk
2022-08-17 15:34     ` Stephen Hemminger
2022-08-17 15:34     ` Morten Brørup
2022-08-20 15:19       ` Dmitry Kozlyuk
2022-08-22  6:47         ` Morten Brørup

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).