DPDK patches and discussions
From: "Coyle, David" <david.coyle@intel.com>
To: "Morten Brørup" <mb@smartsharesystems.com>,
	"dev@dpdk.org" <dev@dpdk.org>
Cc: "honnappa.nagarahalli@arm.com" <honnappa.nagarahalli@arm.com>,
	"konstantin.v.ananyev@yandex.ru" <konstantin.v.ananyev@yandex.ru>,
	"Sexton, Rory" <rory.sexton@intel.com>
Subject: RE: [RFC PATCH] ring: adding TPAUSE instruction to ring dequeue
Date: Thu, 4 May 2023 16:11:31 +0000	[thread overview]
Message-ID: <CO6PR11MB56047E869A409FF5ADF46764E36D9@CO6PR11MB5604.namprd11.prod.outlook.com> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D878E1@smartserver.smartshare.dk>

Hi Morten

> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> 

<snip>

> Power saving is important for the environment (to save the planet and all
> that), so everyone should contribute, if they have a good solution. So even if
> our algorithm had a significant degree of innovation, we would probably
> choose to make it public anyway. Open sourcing it also makes it possible for
> chip vendors like Intel to fine tune it more than we can ourselves, which also
> comes back to benefit us. All products need some sort of power saving in them
> to stay competitive, but power saving algorithms are not an area we want to
> pursue for competitive purposes in our products.
> 
> Our algorithm is too simple to make a library at this point, but I have been
> thinking about how we can make it a generic library when it has matured
> some more. I will take your information about the many customers' need to
> have it invisibly injected into consideration in this regard.
> 
> Our current algorithm works like this:
> 
> while (running) {
>     int more = 0;
>     more += stage1();
>     more += stage2();
>     more += stage3();
>     if (!more)
>         sleep();
> }
> 
> Each pipeline stage only returns 1 if it processed a full burst. Furthermore, if a
> pipeline stage processed a full burst, but happens to know that no more data
> is readily available for it, it returns 0 instead.
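
[DC] A stage following that convention might look roughly like this
(hypothetical names, just to illustrate the return-value semantics; your
stages may of course not be ring-based at all):

    #include <rte_ring.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32                  /* illustrative burst size */

    extern struct rte_ring *worker_ring;   /* hypothetical input ring */
    extern void process_pkts(struct rte_mbuf **pkts, unsigned int n);

    static int
    stage2(void)
    {
        struct rte_mbuf *pkts[BURST_SIZE];
        unsigned int n;

        n = rte_ring_dequeue_burst(worker_ring, (void **)pkts,
                                   BURST_SIZE, NULL);
        if (n == 0)
            return 0;           /* nothing done, no more work expected */

        process_pkts(pkts, n);

        if (n < BURST_SIZE)
            return 0;           /* partial burst: input drained for now */

        /* Full burst processed; report "more" unless the ring is now empty. */
        return !rte_ring_empty(worker_ring);
    }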
> 
> Obviously, the sleep() duration must be short enough to prevent the NIC
> RX descriptor rings from overflowing before the ingress pipeline stage is
> serviced again.
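
[DC] As a rough illustration of that bound (the ring size and packet rate
below are made-up example numbers, not figures from any real deployment):

    #include <stdint.h>

    /* Worst case the RX ring fills at the full ingress rate, so the sleep
     * must stay comfortably below ring_size / packet_rate.  E.g. a
     * 4096-descriptor ring at 10 Mpps fills in ~410 us. */
    static uint64_t
    max_sleep_ns(unsigned int rx_ring_size, uint64_t pkts_per_sec)
    {
        return (uint64_t)rx_ring_size * 1000000000ULL / pkts_per_sec;
    }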
> 
> Changing the algorithm to "more" (1 = more work expected by the pipeline
> stage) from "busy" (1 = some work done by the pipeline stage) has the
> consequence that sleep() is called more often, which has the follow-on
> consequence that the ingress stage is called less often, and thus more often
> has a full burst to process.
> 
> We know from our in-house profiler that processing a full burst provides
> *much* higher execution efficiency (cycles/packet) than processing a few
> packets. This is public knowledge - after all, this is the whole point of DPDK's
> vector packet processing design! Nonetheless, it might surprise some people
> how much the efficiency (cycles/packet) increases when processing a full
> burst compared to processing just a few packets. I will leave it up to the
> readers to make their own experiments. :-)
> 
> Our initial "busy" algorithm behaved like this:
> Process a few packets (at low efficiency), don't sleep,
> Process a few packets (at low efficiency), don't sleep,
> Process a few packets (at low efficiency), don't sleep,
> Process a few packets (at low efficiency), don't sleep,
> Process a few packets (at low efficiency), don't sleep,
> Process a few packets (at low efficiency), don't sleep,
> Process a few packets (at low efficiency), don't sleep,
> Process a few packets (at low efficiency), don't sleep,
> No packets to process (we are lucky this time!), sleep briefly,
> Repeat.
> 
> So we switched to our "more" algorithm, which behaves like this:
> Process a few packets (at low efficiency), sleep briefly,
> Process a full burst of packets (at high efficiency), don't sleep,
> Repeat.
> 
> Instead of processing e.g. 8 small bursts per sleep, we now process only 2
> bursts per sleep. And the bigger of the two bursts is processed at higher
> efficiency.
> 
> We can improve this algorithm in some areas...
> 
> E.g. some of our pipeline stages also know that they are not going to do
> any more work for the next X nanoseconds; but we don't use that
> information in our power management algorithm yet. The sleep duration
> could depend on this.
> 
> Also, we don't use the CPU power management states yet. I assume that
> doing some work for 20 us at half clock speed is more power conserving than
> doing the same work at full speed for 10 us and then sleeping for 10 us.
> That's another potential improvement.
> 
> 
> What we need in a generic power management helper library are functions
> to feed it with the application's perception of how much work is being done,
> and functions to tell if we can sleep and/or if we should change the power
> management states of the individual CPU cores.
> 
> Such a unified power management helper (or "busyness") library could
> perhaps also be fed with data directly from the drivers and libraries to
> support the customer use cases you described.

[DC] Thank you for that detailed description, very interesting. There may
well be merit in upstreaming such an algorithm as a library once it has
matured as you said.
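For what it's worth, the interface of such a helper library might end up
looking something along these lines (purely a sketch on my part; none of
these functions exist in DPDK today):

    #include <stdint.h>
    #include <stdbool.h>

    /* The application feeds in its perception of the work done by one
     * pipeline stage: more = 1 if a full burst was processed and more work
     * is readily available, 0 otherwise.  idle_hint_ns can carry a stage's
     * knowledge that it has nothing to do for the next N nanoseconds. */
    void pm_report_stage(unsigned int lcore_id, unsigned int stage_id,
                         int more, uint64_t idle_hint_ns);

    /* The library answers whether the lcore may sleep now, and for how long. */
    bool pm_should_sleep(unsigned int lcore_id, uint64_t *sleep_ns);

    /* The library adjusts the power management state (e.g. frequency
     * scaling) of the lcore based on the reported busyness. */
    void pm_update_power_state(unsigned int lcore_id);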

Configuration could include specifying what a "full burst"
actually is. Different stages of a pipeline may also have different definitions
of busyness, so that may also need to be considered:
- Some stages may perform an operation (e.g. running an ACL rule check) on a
burst of packets and are then complete
- Other stages may be more asynchronous in nature, e.g. enqueuing to and
dequeuing from a crypto device or a QoS scheduler. A particular call of the
dequeue API might not return any packets, but there may still be packets
waiting inside the crypto device or scheduler. Those in-flight packets would
also need to be taken into account so as not to sleep for too long (see the
sketch below).
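
A rough sketch of how such an asynchronous stage might report its busyness
(hypothetical names throughout, and assuming the in-flight count is
maintained by the enqueue side):

    #include <rte_cryptodev.h>

    #define BURST_SIZE 32            /* illustrative burst size */
    #define CDEV_ID 0                /* hypothetical crypto device id */
    #define QP_ID 0                  /* hypothetical queue pair id */

    extern unsigned int inflight;    /* ops enqueued but not yet dequeued */
    extern void handle_completed(struct rte_crypto_op **ops, unsigned int n);

    static int
    crypto_dequeue_stage(void)
    {
        struct rte_crypto_op *ops[BURST_SIZE];
        uint16_t n;

        n = rte_cryptodev_dequeue_burst(CDEV_ID, QP_ID, ops, BURST_SIZE);
        inflight -= n;
        handle_completed(ops, n);

        /* A full burst, or anything still inside the device, means the
         * core should not sleep (or should only sleep very briefly). */
        return (n == BURST_SIZE || inflight > 0);
    }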

Using such an API would require a workload developer to update their datapath
to report pipeline stage busyness to the algorithm, but if those calls are
kept to a minimum, that shouldn't be too much of a problem.
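
For example, the per-iteration changes to the datapath could be limited to
a fragment like this (reusing the hypothetical helper functions and stage
names sketched above):

    uint64_t sleep_ns;
    unsigned int lcore_id = rte_lcore_id();   /* this worker's lcore */

    while (running) {
        int more = 0;

        more += rx_stage();                   /* hypothetical stage functions */
        more += crypto_dequeue_stage();
        more += tx_stage();

        /* Only two extra calls per loop iteration; the aggregate is
         * reported here for brevity, per-stage reports are also possible. */
        pm_report_stage(lcore_id, /*stage_id=*/0, more, 0);
        if (pm_should_sleep(lcore_id, &sleep_ns))
            rte_delay_us_sleep(sleep_ns / 1000);
    }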

Thanks,
David

