DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Coyle, David" <david.coyle@intel.com>, <dev@dpdk.org>
Cc: <honnappa.nagarahalli@arm.com>, <konstantin.v.ananyev@yandex.ru>,
	"Sexton, Rory" <rory.sexton@intel.com>
Subject: RE: [RFC PATCH] ring: adding TPAUSE instruction to ring dequeue
Date: Wed, 3 May 2023 23:32:21 +0200	[thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D878E1@smartserver.smartshare.dk> (raw)
In-Reply-To: <SJ0PR11MB561534ADDCAA902CEEC6323DE36C9@SJ0PR11MB5615.namprd11.prod.outlook.com>

> From: Coyle, David [mailto:david.coyle@intel.com]
> Sent: Wednesday, 3 May 2023 17.32
> 
> Hi Morten
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> >
> > > From: David Coyle [mailto:david.coyle@intel.com]
> > > Sent: Wednesday, 3 May 2023 13.39
> > >
> > > This is NOT for upstreaming. This is being submitted to allow early
> > > comparison testing with the preferred solution, which will add
> TPAUSE
> > > power management support to the ring library through the addition of
> > > callbacks. Initial stages of the preferred solution are available at
> > > http://dpdk.org/patch/125454.
> > >
> > > This patch adds functionality directly to rte_ring_dequeue functions
> > > to monitor the empty reads of the ring. When a configurable number
> of
> > > empty reads is reached, a TPAUSE instruction is triggered by using
> > > rte_power_pause() on supported architectures. rte_pause() is used on
> > > other architectures. The functionality can be included or excluded
> at
> > > compilation time using the RTE_RING_PMGMT flag. If included, the new
> > > API can be used to enable/disable the feature on a per-ring basis.
> > > Other related settings can also be configured using the API.
> >
> > I don't understand why DPDK developers keep spending time on trying to
> > invent methods to determine application busyness based on entry/exit
> > points in a variety of libraries, when the application is in a much
> better
> > position to determine busyness. All of these "busyness measuring"
> library
> > extensions have their own specific assumptions and weird limitations.
> >
> > I do understand that the goal is power saving, which certainly is
> relevant! I
> > only criticize the measuring methods.
> >
> > For reference, we implemented something very simple in our application
> > framework:
> > 1. When each pipeline stage has completed a burst, it reports if it
> was busy or
> > not.
> > 2. If the pipeline busyness is low, we take a nap to save some power.
> >
> > And here is the magic twist to this simple algorithm:
> > 3. A pipeline stage is not considered busy unless it processed a full
> burst, and
> > is ready to process more packets immediately. This interpretation of
> > busyness has a significant impact on the percentage of time spent
> napping
> > during the low-traffic hours.
> >
> > This algorithm was very quickly implemented. It might not be perfect,
> and we
> > do intend to improve it (also to determine CPU Utilization on a scale
> that the
> > end user can translate to a linear interpretation of how busy the
> system is).
> > But I seriously doubt that any of the proposed "busyness measuring"
> library
> > extensions are any better.
> >
> > So: The application knows better, please spend your precious time on
> > something useful instead.
> >
> > @David, my outburst is not directed at you specifically. Generally, I
> do
> > appreciate experimenting as a good way of obtaining knowledge. So
> thank
> > you for sharing your experiments with this audience!
> >
> > PS: If cruft can be disabled at build time, I generally don't oppose
> to it.
> 
> [DC] Appreciate that feedback, and it is certainly another way of
> looking at
> and tackling the problem that we are ultimately trying to solve (i.e
> power
> saving)
> 
> The problem however is that we work with a large number of ISVs and
> operators,
> each with their own workload architecture and implementation. That means
> we
> would have to work individually with each of these to integrate this
> type of
> pipeline-stage-busyness algorithm into their applications. And as these
> applications are usually commercial, non-open-source applications, that
> could
> prove to be very difficult.
> 
> Also most ISVs and operators don't want to have to worry about changing
> their
> application, especially their fast-path dataplane, in order to get power
> savings. They prefer for it to just happen without them caring about the
> finer
> details.
> 
> For these reasons, consolidating the busyness algorithms down into the
> DPDK
> libraries and PMDs is currently the preferred solution. As you say
> though, the
> libraries and PMDs may not be in the best position to determine the
> busyness
> of the pipeline, but it provides a good balance between achieving power
> savings
> and ease of adoption.

Thank you for describing the business logic driving this technical approach. Now I get it!

Automagic busyness monitoring and power management would be excellent. But what I see on the mailing list is a bunch of incoherent attempts at doing this. (And I don't mean your patches, I mean all the patches for automagic power management.) And the cost is not insignificant: Pollution of DPDK all over the place, in both drivers and libraries.

I would much rather see a top-down approach, so we could all work towards a unified solution.

However, I understand that customers are impatient, so I accept that in reality we have to live with these weird "code injection" based solutions until something sane becomes available. If they were clearly marked as temporary workarounds until a proper solution is provided, I might object less to them. (Again, not just your patches, but all the patches of this sort.)

> 
> It's also worth calling out again that this patch is only to allow early
> testing by some customers of the benefit of adding TPAUSE support to the
> ring
> library. We don't intend on this patch being upstreamed. The preferred
> longer
> term solution is to use callbacks from the ring library to initiate the
> pause
> (either via the DPDK power management API or through functions that an
> ISV
> may write themselves). This is mentioned in the commit message.

Noted!

> 
> Also, the pipeline stage busyness algorithm that you have added to your
> pipeline - have you ever considered implementing this into DPDK as a
> generic
> type library. This could certainly be of benefit to other DPDK
> application
> developers, and having this mechanism in DPDK could again ease the
> adoption
> and realisation of power savings for others. I understand though if this
> is your
> own secret sauce and you want to keep it like that :)

Power saving is important for the environment (to save the planet and all that), so everyone should contribute if they have a good solution. So even if our algorithm had a significant degree of innovation, we would probably choose to make it public anyway. Open sourcing it also makes it possible for chip vendors like Intel to fine-tune it more than we can ourselves, which also comes back to benefit us. All products need some sort of power saving to stay competitive, but power saving algorithms are not an area we want to pursue for competitive purposes in our products.

Our algorithm is too simple to turn into a library at this point, but I have been thinking about how we can make it a generic library once it has matured some more. In this regard, I will take into consideration what you said about many customers needing it to be injected invisibly.

Our current algorithm works like this:

while (running) {
    int more = 0;
    /* Each stage returns 1 only if it expects more work immediately. */
    more += stage1();
    more += stage2();
    more += stage3();
    /* No stage expects more work right now: take a nap to save power. */
    if (!more)
        sleep();
}

Each pipeline stage only returns 1 if it processed a full burst. Furthermore, if a pipeline stage processed a full burst, but happens to know that no more data is readily available for it, it returns 0 instead.
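
As a minimal sketch of what a single pipeline stage could look like (the names stage2_ring, process_burst() and BURST_SIZE are hypothetical, not our actual code), using an rte_ring as the stage's ingress queue:

#include <rte_ring.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32                    /* hypothetical burst size */

static struct rte_ring *stage2_ring;     /* hypothetical ingress ring of this stage */

/* Application-specific processing of the dequeued packets. */
static void process_burst(struct rte_mbuf **pkts, unsigned int n);

static int
stage2(void)
{
    struct rte_mbuf *pkts[BURST_SIZE];
    unsigned int avail = 0;
    unsigned int n;

    n = rte_ring_dequeue_burst(stage2_ring, (void **)pkts, BURST_SIZE, &avail);
    process_burst(pkts, n);

    /* Report "more work expected" only if we processed a full burst AND
     * more data is already waiting; otherwise the main loop may nap. */
    return (n == BURST_SIZE && avail > 0);
}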

Obviously, the sleep() duration must be short enough that the NIC RX descriptor rings do not overflow before the ingress pipeline stage is serviced again.
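
For example (hypothetical numbers): with 512 RX descriptors per queue and 64-byte packets arriving at 10 Gbit/s line rate (~14.88 Mpps), a completely drained ring fills up again in roughly 512 / 14.88e6 ≈ 34 µs, so the sleep duration plus the worst-case time of one pipeline iteration must stay well below that.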

Changing the algorithm to "more" (1 = more work expected by the pipeline stage) from "busy" (1 = some work done by the pipeline stage) has the consequence that sleep() is called more often, which has the follow-on consequence that the ingress stage is called less often, and thus more often has a full burst to process.

We know from our in-house profiler that processing a full burst provides *much* higher execution efficiency (cycles/packet) than processing a few packets. This is public knowledge - after all, this is the whole point of DPDK's vector packet processing design! Nonetheless, it might surprise some people how much the efficiency (cycles/packet) increases when processing a full burst compared to processing just a few packets. I will leave it up to the readers to make their own experiments. :-)

Our initial "busy" algorithm behaved like this:
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
No packets to process (we are lucky this time!), sleep briefly,
Repeat.

So we switched to our "more" algorithm, which behaves like this:
Process a few packets (at low efficiency), sleep briefly,
Process a full burst of packets (at high efficiency), don't sleep,
Repeat.

Instead of processing e.g. 8 small bursts per sleep, we now process only 2 bursts per sleep, and the bigger of the two bursts is processed at higher efficiency.

We can improve this algorithm in some areas...

E.g. some of our pipeline stages also know that they are not going to do any more work for the next X nanoseconds, but we don't use that information in our power management algorithm yet. The sleep duration could depend on this.
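
As an illustrative sketch only (a hypothetical helper, not something we have implemented), the main loop could sleep for the shortest of the stages' "next work expected in X ns" hints, capped by the RX-ring drain time discussed above:

#include <time.h>
#include <stdint.h>
#include <rte_common.h>

#define MAX_SLEEP_NS 20000    /* hypothetical cap derived from the RX-ring drain time */

/* Sleep for the shortest idle hint reported by the stages, if all are idle. */
static void
sleep_for_hint(uint64_t stage1_ns, uint64_t stage2_ns, uint64_t stage3_ns)
{
    uint64_t idle_ns = RTE_MIN(stage1_ns, RTE_MIN(stage2_ns, stage3_ns));

    if (idle_ns == 0)
        return;    /* at least one stage has work right now */

    struct timespec ts = {
        .tv_sec = 0,
        .tv_nsec = (long)RTE_MIN(idle_ns, (uint64_t)MAX_SLEEP_NS),
    };
    nanosleep(&ts, NULL);
}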

Also, we don't use the CPU power management states yet. I assume that doing some work for 20 us at half clock speed is more power-conserving than doing the same work at full speed for 10 us and then sleeping for 10 us. That's another potential improvement.
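
A sketch of what using librte_power's frequency scaling could look like (assuming rte_power_init() has already been called for the lcore; the thresholds below are made up for illustration):

#include <rte_power.h>

/* Scale the lcore frequency based on how often the pipeline reported
 * "more work" over the last total_iters iterations of the main loop. */
static void
adjust_frequency(unsigned int lcore_id, unsigned int busy_iters,
                 unsigned int total_iters)
{
    if (busy_iters * 4 > total_iters * 3)        /* more than 75 % busy */
        rte_power_freq_up(lcore_id);
    else if (busy_iters * 4 < total_iters)       /* less than 25 % busy */
        rte_power_freq_down(lcore_id);
}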


What we need in a generic power management helper library are functions to feed it with the application's perception of how much work is being done, and functions to tell us whether we can sleep and/or whether we should change the power management states of the individual CPU cores.

Such a unified power management helper (or "busyness") library could perhaps also be fed with data directly from the drivers and libraries to support the customer use cases you described.
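
Purely as a strawman (every name below is hypothetical; nothing like this exists in DPDK today), the API of such a helper library could look roughly like this:

#include <stdint.h>
#include <stdbool.h>

/* Feed the library with the application's perception of the work done:
 * work_done items processed out of a possible work_max (the burst size),
 * and whether the caller expects more work immediately. */
void rte_busyness_report(unsigned int lcore_id, unsigned int work_done,
                         unsigned int work_max, bool more);

/* Could also be fed directly by drivers and libraries, e.g. on empty
 * RX polls or empty ring dequeues, to cover the use cases you describe. */
void rte_busyness_report_empty_poll(unsigned int lcore_id);

/* Ask whether, and for how long, this lcore may sleep right now. */
uint64_t rte_busyness_suggest_sleep_ns(unsigned int lcore_id);

/* Ask for a suggested power/frequency state change for this lcore. */
enum rte_busyness_freq_hint {
    RTE_BUSYNESS_FREQ_KEEP,
    RTE_BUSYNESS_FREQ_UP,
    RTE_BUSYNESS_FREQ_DOWN,
};
enum rte_busyness_freq_hint rte_busyness_suggest_freq(unsigned int lcore_id);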

-Morten


Thread overview: 8+ messages
2023-05-03 11:38 David Coyle
2023-05-03 13:32 ` Morten Brørup
2023-05-03 14:51   ` Stephen Hemminger
2023-05-03 15:31   ` Coyle, David
2023-05-03 21:32     ` Morten Brørup [this message]
2023-05-04 16:11       ` Coyle, David
2023-05-04 16:23         ` Stephen Hemminger
2023-05-04 16:58           ` Morten Brørup
