Subject: RE: [RFC PATCH] ring: adding TPAUSE instruction to ring dequeue
Date: Wed, 3 May 2023 23:32:21 +0200
From: Morten Brørup
To: "Coyle, David"
Cc: "Sexton, Rory"
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D878E1@smartserver.smartshare.dk>
References: <20230503113840.11010-1-david.coyle@intel.com> <98CBD80474FA8B44BF855DF32C47DC35D878DD@smartserver.smartshare.dk>

> From: Coyle, David [mailto:david.coyle@intel.com]
> Sent: Wednesday, 3 May 2023 17.32
>
> Hi Morten
>
> > From: Morten Brørup
> >
> > > From: David Coyle [mailto:david.coyle@intel.com]
> > > Sent: Wednesday, 3 May 2023 13.39
> > >
> > > This is NOT for upstreaming. This is being submitted to allow early
> > > comparison testing with the preferred solution, which will add TPAUSE
> > > power management support to the ring library through the addition of
> > > callbacks. Initial stages of the preferred solution are available at
> > > http://dpdk.org/patch/125454.
> > >
> > > This patch adds functionality directly to the rte_ring_dequeue
> > > functions to monitor empty reads of the ring. When a configurable
> > > number of empty reads is reached, a TPAUSE instruction is triggered by
> > > using rte_power_pause() on supported architectures; rte_pause() is
> > > used on other architectures. The functionality can be included or
> > > excluded at compilation time using the RTE_RING_PMGMT flag. If
> > > included, the new API can be used to enable/disable the feature on a
> > > per-ring basis. Other related settings can also be configured using
> > > the API.
> >
> > I don't understand why DPDK developers keep spending time trying to
> > invent methods to determine application busyness based on entry/exit
> > points in a variety of libraries, when the application is in a much
> > better position to determine busyness. All of these "busyness measuring"
> > library extensions have their own specific assumptions and weird
> > limitations.
> >
> > I do understand that the goal is power saving, which certainly is
> > relevant! I only criticize the measuring methods.
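[Inline aside, to make the discussion concrete: the per-ring empty-poll
counting described in the patch boils down to roughly the sketch below.
This is only an illustration with made-up names and thresholds, not the
actual patch code; rte_power_pause() is an experimental EAL intrinsic and
only has an effect on CPUs with WAITPKG support.]

#include <rte_ring.h>
#include <rte_pause.h>
#include <rte_cycles.h>
#include <rte_power_intrinsics.h>

#define EMPTY_POLL_THRESHOLD 512   /* would be configurable per ring */
#define PAUSE_DURATION_TSC   1024  /* TPAUSE deadline, in TSC cycles */

static unsigned int
dequeue_burst_with_pause(struct rte_ring *r, void **objs, unsigned int n)
{
    static __thread unsigned int empty_polls; /* per-lcore empty-read count */
    unsigned int nb = rte_ring_dequeue_burst(r, objs, n, NULL);

    if (nb != 0) {
        empty_polls = 0;
        return nb;
    }
    if (++empty_polls >= EMPTY_POLL_THRESHOLD) {
        empty_polls = 0;
#ifdef RTE_ARCH_X86
        /* TPAUSE until a TSC deadline; returns -ENOTSUP if unavailable. */
        rte_power_pause(rte_get_tsc_cycles() + PAUSE_DURATION_TSC);
#else
        rte_pause();
#endif
    }
    return 0;
}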
> >
> > For reference, we implemented something very simple in our application
> > framework:
> > 1. When each pipeline stage has completed a burst, it reports whether it
> > was busy or not.
> > 2. If the pipeline busyness is low, we take a nap to save some power.
> >
> > And here is the magic twist to this simple algorithm:
> > 3. A pipeline stage is not considered busy unless it processed a full
> > burst, and is ready to process more packets immediately. This
> > interpretation of busyness has a significant impact on the percentage of
> > time spent napping during the low-traffic hours.
> >
> > This algorithm was very quickly implemented. It might not be perfect,
> > and we do intend to improve it (also to determine CPU utilization on a
> > scale that the end user can translate to a linear interpretation of how
> > busy the system is). But I seriously doubt that any of the proposed
> > "busyness measuring" library extensions are any better.
> >
> > So: The application knows better, please spend your precious time on
> > something useful instead.
> >
> > @David, my outburst is not directed at you specifically. Generally, I do
> > appreciate experimenting as a good way of obtaining knowledge. So thank
> > you for sharing your experiments with this audience!
> >
> > PS: If cruft can be disabled at build time, I generally don't oppose it.
>
> [DC] Appreciate that feedback, and it is certainly another way of looking
> at and tackling the problem that we are ultimately trying to solve (i.e.
> power saving).
>
> The problem, however, is that we work with a large number of ISVs and
> operators, each with their own workload architecture and implementation.
> That means we would have to work individually with each of these to
> integrate this type of pipeline-stage-busyness algorithm into their
> applications. And as these applications are usually commercial,
> non-open-source applications, that could prove to be very difficult.
>
> Also, most ISVs and operators don't want to have to worry about changing
> their application, especially their fast-path dataplane, in order to get
> power savings. They prefer for it to just happen without them caring
> about the finer details.
>
> For these reasons, consolidating the busyness algorithms down into the
> DPDK libraries and PMDs is currently the preferred solution. As you say,
> though, the libraries and PMDs may not be in the best position to
> determine the busyness of the pipeline, but it provides a good balance
> between achieving power savings and ease of adoption.

Thank you for describing the business logic driving this technical
approach. Now I get it!

Automagic busyness monitoring and power management would be excellent. But
what I see on the mailing list is a bunch of incoherent attempts at doing
this. (And I don't mean your patches; I mean all the patches for automagic
power management.) And the cost is not insignificant: pollution of DPDK all
over the place, in both drivers and libraries.

I would much rather see a top-down approach, so we could all work towards a
unified solution.

However, I understand that customers are impatient, so I accept that in
reality we have to live with these weird "code injection" based solutions
until something sane becomes available. If they were clearly marked as
temporary workarounds until a proper solution is provided, I might object
less to them. (Again, not just your patches, but all the patches of this
sort.)
>
> It's also worth calling out again that this patch is only to allow early
> testing by some customers of the benefit of adding TPAUSE support to the
> ring library. We don't intend for this patch to be upstreamed. The
> preferred longer-term solution is to use callbacks from the ring library
> to initiate the pause (either via the DPDK power management API or through
> functions that an ISV may write themselves). This is mentioned in the
> commit message.

Noted!

>
> Also, the pipeline stage busyness algorithm that you have added to your
> pipeline - have you ever considered implementing this in DPDK as a
> generic library? This could certainly be of benefit to other DPDK
> application developers, and having this mechanism in DPDK could again
> ease the adoption and realisation of power savings for others. I
> understand though if this is your own secret sauce and you want to keep
> it like that :)

Power saving is important for the environment (to save the planet and all
that), so everyone should contribute if they have a good solution. So even
if our algorithm had a significant degree of innovation, we would probably
choose to make it public anyway. Open sourcing it also makes it possible
for chip vendors like Intel to fine-tune it more than we can ourselves,
which also comes back to benefit us. All products need some sort of power
saving to stay competitive, but power saving algorithms are not an area we
want to pursue for competitive purposes in our products.

Our algorithm is too simple to make a library of at this point, but I have
been thinking about how we can turn it into a generic library when it has
matured some more. I will take into consideration your point that many
customers need it to be invisibly injected.

Our current algorithm works like this:

while (running) {
    int more = 0;
    more += stage1();
    more += stage2();
    more += stage3();
    if (!more)
        sleep();
}

Each pipeline stage only returns 1 if it processed a full burst.
Furthermore, if a pipeline stage processed a full burst, but happens to
know that no more data is readily available for it, it returns 0 instead.

Obviously, the sleep() duration must be short enough to prevent the NIC RX
descriptor rings from overflowing before the ingress pipeline stage is
serviced again.

Changing the algorithm to "more" (1 = more work expected by the pipeline
stage) from "busy" (1 = some work done by the pipeline stage) has the
consequence that sleep() is called more often, which has the follow-on
consequence that the ingress stage is called less often, and thus more
often has a full burst to process.

We know from our in-house profiler that processing a full burst provides
*much* higher execution efficiency (cycles/packet) than processing a few
packets. This is public knowledge - after all, it is the whole point of
DPDK's vector packet processing design! Nonetheless, it might surprise some
people how much the efficiency (cycles/packet) increases when processing a
full burst compared to processing just a few packets. I will leave it up to
the readers to make their own experiments. :-)
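To illustrate the "more" convention, here is a simplified sketch of what
one of our stages does (not our actual code; BURST_SIZE and the ring are
just examples):

#include <rte_ring.h>

#define BURST_SIZE 32

static struct rte_ring *stage1_ring; /* example input ring, created elsewhere */

static int
stage1(void)
{
    void *objs[BURST_SIZE];
    unsigned int avail = 0;
    unsigned int n = rte_ring_dequeue_burst(stage1_ring, objs, BURST_SIZE,
            &avail);

    /* ... process the n objects here ... */

    /* Report "more" (1) only if we processed a full burst AND more data
     * is already waiting; otherwise report 0, allowing the main loop to
     * take a nap. */
    return (n == BURST_SIZE) && (avail > 0);
}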
Our initial "busy" algorithm behaved like this:

Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
Process a few packets (at low efficiency), don't sleep,
No packets to process (we are lucky this time!), sleep briefly,
Repeat.

So we switched to our "more" algorithm, which behaves like this:

Process a few packets (at low efficiency), sleep briefly,
Process a full burst of packets (at high efficiency), don't sleep,
Repeat.

Instead of processing e.g. 8 small bursts per sleep, we now process only 2
bursts per sleep. And the bigger of the two bursts is processed at higher
efficiency.

We can improve this algorithm in some areas...

E.g. some of our pipeline stages also know that they are not going to do
any more work for the next X nanoseconds; but we don't use that information
in our power management algorithm yet. The sleep duration could depend on
this.

Also, we don't use the CPU power management states yet. I assume that doing
some work for 20 us at half clock speed conserves more power than doing the
same work at full speed for 10 us and then sleeping for 10 us. That's
another potential improvement.

What we need in a generic power management helper library are functions to
feed it with the application's perception of how much work is being done,
and functions to tell us if we can sleep and/or if we should change the
power management states of the individual CPU cores.

Such a unified power management helper (or "busyness") library could
perhaps also be fed with data directly from the drivers and libraries to
support the customer use cases you described.

-Morten
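PS: Purely as a strawman, and only to make the idea concrete - none of the
pm_hint_* names below exist in DPDK today, they are entirely made up - the
interface of such a helper library could look something like this:

#include <stdint.h>

/* The application feeds its perception of busyness after each pipeline
 * pass: how much work was done, and whether more work is expected. */
void pm_hint_report(unsigned int lcore_id, unsigned int work_done,
        int more_expected);

/* The library tells the application whether the lcore may sleep now,
 * and for how long (e.g. bounded by the RX ring drain time). */
int pm_hint_get_sleep(unsigned int lcore_id, uint64_t *sleep_ns);

/* The library may also suggest a frequency or C-state change for the
 * core, which the application could then apply, e.g. via the existing
 * rte_power library. */
int pm_hint_get_power_state(unsigned int lcore_id);

Drivers and libraries could feed the same kind of data behind the
application's back, which would cover the transparent use case you
described.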