* rte_eth_stats_get seems slow
From: Morten Brørup @ 2025-04-25 11:52 UTC
To: bruce.richardson; +Cc: Ian Stokes, dev
Bruce,
rte_eth_stats_get() on Intel NICs seems slow to me.
E.g. getting the stats on a single port takes ~132 us (~451,000 CPU cycles) using the igb driver, and ~50 us using the i40e driver.
Referring to the igb driver source code [1], it makes 44 calls to E1000_READ_REG(), so each read takes ~3 us (~10,000 CPU cycles) on average.
Is this expected behavior?
It adds up, e.g. it takes a full millisecond to fetch the stats from eight ports using the igb driver.
[1]: https://elixir.bootlin.com/dpdk/v24.11.1/source/drivers/net/e1000/igb_ethdev.c#L1724
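For anyone who wants to reproduce the numbers, a minimal TSC-based measurement sketch (assuming rte_eal_init() has run and the port is started) could look like this:

#include <stdio.h>
#include <inttypes.h>
#include <rte_cycles.h>
#include <rte_ethdev.h>

/* Time a single rte_eth_stats_get() call on one port. */
static void
time_stats_get(uint16_t port_id)
{
        struct rte_eth_stats stats;
        uint64_t start, cycles;

        start = rte_rdtsc_precise();
        rte_eth_stats_get(port_id, &stats);
        cycles = rte_rdtsc_precise() - start;

        printf("port %u: %" PRIu64 " cycles (%.1f us)\n", port_id,
               cycles, (double)cycles * 1e6 / rte_get_tsc_hz());
}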
Med venlig hilsen / Kind regards,
-Morten Brørup
* Re: rte_eth_stats_get seems slow
From: Stephen Hemminger @ 2025-04-26 15:23 UTC
To: Morten Brørup; +Cc: bruce.richardson, Ian Stokes, dev
On Fri, 25 Apr 2025 13:52:55 +0200
Morten Brørup <mb@smartsharesystems.com> wrote:
> Bruce,
>
> rte_eth_stats_get() on Intel NICs seems slow to me.
>
> E.g. getting the stats on a single port takes ~132 us (~451,000 CPU cycles) using the igb driver, and ~50 us using the i40e driver.
>
> Referring to the igb driver source code [1], it makes 44 calls to E1000_READ_REG(), so each read takes ~3 us (~10,000 CPU cycles) on average.
>
> Is this expected behavior?
>
> It adds up, e.g. it takes a full millisecond to fetch the stats from eight ports using the igb driver.
>
> [1]: https://elixir.bootlin.com/dpdk/v24.11.1/source/drivers/net/e1000/igb_ethdev.c#L1724
>
>
> Med venlig hilsen / Kind regards,
> -Morten Brørup
>
Well, reading each stat requires a PCI access. And PCI accesses are non-cached.
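Roughly speaking (a sketch, not the literal driver macro), each such read boils down to an uncached volatile load from the device BAR, and the CPU stalls for a full PCIe round trip on every call:

#include <stdint.h>
#include <rte_io.h>

/* Sketch of what a stats-register read amounts to: a volatile
 * 32-bit MMIO load. The BAR mapping is uncached, so nothing can
 * be served from the CPU cache. */
static inline uint32_t
stat_reg_read(const volatile void *bar, uint32_t offset)
{
        return rte_read32((const volatile uint8_t *)bar + offset);
}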
* RE: rte_eth_stats_get seems slow
From: Morten Brørup @ 2025-04-27 7:00 UTC
To: Stephen Hemminger, bruce.richardson; +Cc: dev
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Saturday, 26 April 2025 17.24
>
> On Fri, 25 Apr 2025 13:52:55 +0200
> Morten Brørup <mb@smartsharesystems.com> wrote:
>
> > Bruce,
> >
> > rte_eth_stats_get() on Intel NICs seems slow to me.
> >
> > E.g. getting the stats on a single port takes ~132 us (~451,000 CPU cycles) using the igb driver, and ~50 us using the i40e driver.
> >
> > Referring to the igb driver source code [1], it makes 44 calls to E1000_READ_REG(), so each read takes ~3 us (~10,000 CPU cycles) on average.
> >
> > Is this expected behavior?
> >
> > It adds up, e.g. it takes a full millisecond to fetch the stats from eight ports using the igb driver.
> >
> > [1]: https://elixir.bootlin.com/dpdk/v24.11.1/source/drivers/net/e1000/igb_ethdev.c#L1724
> >
> >
> > Med venlig hilsen / Kind regards,
> > -Morten Brørup
> >
>
> Well, reading each stat requires a PCI access. And PCI accesses are non-cached.
You're right, thank you for reminding me. I was caught by surprise that getting 7 counters took so long.
Perhaps reading all 44 NIC registers over the PCI bus really is required to compute those summary counters; or perhaps nobody has bothered to optimize this function to read only the necessary registers.

We periodically poll the ethdev stats in a management thread, and I noticed the duration because of the jitter it caused in that thread.
It's not a real problem. If it were, we could easily move the polling to a separate thread, or poll the counters iteratively, port by port, instead of all ports in one go.
A long-winded way of saying: probably not worth optimizing. ;-)
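For illustration, a sketch of the port-by-port variant (assuming the application sizes last_stats to RTE_MAX_ETHPORTS), which bounds the per-tick cost to a single rte_eth_stats_get() call:

#include <rte_ethdev.h>

/* Poll one port per management-loop tick instead of all ports in
 * one go; ticks on unused port IDs are simply skipped. */
static void
poll_next_port_stats(struct rte_eth_stats *last_stats)
{
        static uint16_t next_port;

        if (rte_eth_dev_is_valid_port(next_port))
                rte_eth_stats_get(next_port, &last_stats[next_port]);

        next_port = (next_port + 1) % RTE_MAX_ETHPORTS;
}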
* Re: rte_eth_stats_get seems slow
From: Bruce Richardson @ 2025-04-29 12:52 UTC
To: Morten Brørup; +Cc: Stephen Hemminger, dev
On Sun, Apr 27, 2025 at 09:00:31AM +0200, Morten Brørup wrote:
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Saturday, 26 April 2025 17.24
> >
> > On Fri, 25 Apr 2025 13:52:55 +0200
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> > > Bruce,
> > >
> > > rte_eth_stats_get() on Intel NICs seems slow to me.
> > >
> > > E.g. getting the stats on a single port takes ~132 us (~451,000 CPU cycles) using the igb driver, and ~50 us using the i40e driver.
> > >
> > > Referring to the igb driver source code [1], it makes 44 calls to E1000_READ_REG(), so each read takes ~3 us (~10,000 CPU cycles) on average.
> > >
> > > Is this expected behavior?
> > >
> > > It adds up, e.g. it takes a full millisecond to fetch the stats from eight ports using the igb driver.
> > >
> > > [1]: https://elixir.bootlin.com/dpdk/v24.11.1/source/drivers/net/e1000/igb_ethdev.c#L1724
> > >
> > >
> > > Med venlig hilsen / Kind regards,
> > > -Morten Brørup
> > >
> >
> > Well, reading each stat requires a PCI access. And PCI accesses are non-cached.
>
> You're right, thank you for reminding me. I was caught by surprise that getting 7 counters took so long.
> Perhaps reading all 44 NIC registers over the PCI bus really is required to compute those summary counters; or perhaps nobody has bothered to optimize this function to read only the necessary registers.
>
> We periodically poll the ethdev stats in a management thread, and I noticed the duration because of the jitter it caused in that thread.
> It's not a real problem. If it were, we could easily move the polling to a separate thread, or poll the counters iteratively, port by port, instead of all ports in one go.
> A long-winded way of saying: probably not worth optimizing. ;-)
>
I actually think it is something that we should consider optimizing, but I
also think it needs to be done at the API level. Even if the user is only
interested in e.g. the Rx bytes counter, the only way to get that is to
retrieve the full stats set and use the counter from it. Therefore, instead
of reading (possibly) just one register, you end up reading 44 as in the
case you describe. Maybe we need to add a stats mask to the get call, to
allow a user to indicate that they only want a subset of the stats, in
order to improve performance.
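Something like the following, purely as a hypothetical sketch (none of these names exist in ethdev today):

#include <stdint.h>
#include <rte_bitops.h>
#include <rte_ethdev.h>

/* Hypothetical per-counter flags -- not part of the current API. */
#define RTE_ETH_STATS_F_IPACKETS RTE_BIT64(0)
#define RTE_ETH_STATS_F_OPACKETS RTE_BIT64(1)
#define RTE_ETH_STATS_F_IBYTES   RTE_BIT64(2)
#define RTE_ETH_STATS_F_OBYTES   RTE_BIT64(3)

/* Hypothetical masked getter: a PMD would only read the registers
 * needed for the requested counters, e.g.
 * rte_eth_stats_get_masked(port_id, RTE_ETH_STATS_F_IBYTES, &stats); */
int rte_eth_stats_get_masked(uint16_t port_id, uint64_t mask,
                             struct rte_eth_stats *stats);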
/Bruce
* RE: rte_eth_stats_get seems slow
From: Morten Brørup @ 2025-04-29 13:34 UTC
To: Bruce Richardson; +Cc: Stephen Hemminger, dev
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Tuesday, 29 April 2025 14.53
>
> On Sun, Apr 27, 2025 at 09:00:31AM +0200, Morten Brørup wrote:
> > > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > > Sent: Saturday, 26 April 2025 17.24
> > >
> > > On Fri, 25 Apr 2025 13:52:55 +0200
> > > Morten Brørup <mb@smartsharesystems.com> wrote:
> > >
> > > > Bruce,
> > > >
> > > > rte_eth_stats_get() on Intel NICs seems slow to me.
> > > >
> > > > E.g. getting the stats on a single port takes ~132 us (~451,000 CPU cycles) using the igb driver, and ~50 us using the i40e driver.
> > > >
> > > > Referring to the igb driver source code [1], it makes 44 calls to E1000_READ_REG(), so each read takes ~3 us (~10,000 CPU cycles) on average.
> > > >
> > > > Is this expected behavior?
> > > >
> > > > It adds up, e.g. it takes a full millisecond to fetch the stats from eight ports using the igb driver.
> > > >
> > > > [1]: https://elixir.bootlin.com/dpdk/v24.11.1/source/drivers/net/e1000/igb_ethdev.c#L1724
> > > >
> > > >
> > > > Med venlig hilsen / Kind regards,
> > > > -Morten Brørup
> > > >
> > >
> > > Well, reading each stat requires a PCI access. And PCI accesses are non-cached.
> >
> > You're right, thank you for reminding me. I was caught by surprise that getting 7 counters took so long.
> > Perhaps reading all 44 NIC registers over the PCI bus really is required to compute those summary counters; or perhaps nobody has bothered to optimize this function to read only the necessary registers.
> >
> > We periodically poll the ethdev stats in a management thread, and I noticed the duration because of the jitter it caused in that thread.
> > It's not a real problem. If it were, we could easily move the polling to a separate thread, or poll the counters iteratively, port by port, instead of all ports in one go.
> > A long-winded way of saying: probably not worth optimizing. ;-)
> >
>
> I actually think it is something that we should consider optimizing, but I
> also think it needs to be done at the API level. Even if the user is only
> interested in e.g. the Rx bytes counter, the only way to get that is to
> retrieve the full stats set and use the counter from it. Therefore, instead
> of reading (possibly) just one register, you end up reading 44 as in the
> case you describe. Maybe we need to add a stats mask to the get call, to
> allow a user to indicate that they only want a subset of the stats, in
> order to improve performance.
I think xstats already serves that purpose.
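E.g. the by-ID xstats API already lets an application ask for a single counter; whether the PMD reads fewer registers underneath is another matter. A sketch:

#include <stdint.h>
#include <rte_ethdev.h>

/* Fetch only the Rx byte counter via xstats. "rx_good_bytes" is the
 * generic ethdev name; driver-specific xstat names vary. */
static int
get_rx_bytes(uint16_t port_id, uint64_t *value)
{
        uint64_t id;

        if (rte_eth_xstats_get_id_by_name(port_id, "rx_good_bytes", &id) != 0)
                return -1;
        if (rte_eth_xstats_get_by_id(port_id, &id, value, 1) != 1)
                return -1;
        return 0;
}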
<feature creep>
We could improve xstats by maintaining a repository of standardized xstats counter names and definitions. That would also make it possible to conformance-test them.
</feature creep>
The basic stats in struct rte_eth_stats [2] are 7 counters plus the rx_nombuf software counter. That should be acceptable, and it makes it very easy for an application to implement basic ethdev stats.
For some reason, struct rte_eth_stats also holds the per-queue counters. They should be moved out of there.
The documentation for rte_eth_stats_get() even says that it only fills in the first part, not the per-queue counters, so it should be possible.
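For reference, abridged from rte_ethdev.h [2]:

struct rte_eth_stats {
        uint64_t ipackets;  /* Successfully received packets. */
        uint64_t opackets;  /* Successfully transmitted packets. */
        uint64_t ibytes;    /* Successfully received bytes. */
        uint64_t obytes;    /* Successfully transmitted bytes. */
        uint64_t imissed;   /* Rx packets dropped by the HW. */
        uint64_t ierrors;   /* Erroneous received packets. */
        uint64_t oerrors;   /* Failed transmitted packets. */
        uint64_t rx_nombuf; /* Rx mbuf allocation failures (software). */
        /* Per-queue counters follow: */
        uint64_t q_ipackets[RTE_ETHDEV_QUEUE_STAT_CNTRS];
        uint64_t q_opackets[RTE_ETHDEV_QUEUE_STAT_CNTRS];
        uint64_t q_ibytes[RTE_ETHDEV_QUEUE_STAT_CNTRS];
        uint64_t q_obytes[RTE_ETHDEV_QUEUE_STAT_CNTRS];
        uint64_t q_errors[RTE_ETHDEV_QUEUE_STAT_CNTRS];
};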
The igb driver fetches many registers that are not required for producing the 7 counters asked for; e.g. the packet-size-range counters are certainly not needed, yet the driver reads them anyway.
If other drivers implement a similar design pattern ("fetch all counters to update some"), the execution time of an API call supposedly fetching 7 counters may come as a surprise to other developers too.
So perhaps a warning about execution time should be added to the rte_eth_stats_get() documentation, to formally accept this behavior by drivers. That would be an easy fix. :-)
[2]: https://elixir.bootlin.com/dpdk/v24.11.1/source/lib/ethdev/rte_ethdev.h#L262