From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 4 May 2023 09:23:29 -0700
From: Stephen Hemminger
To: "Coyle, David"
Cc: Morten Brørup, "dev@dpdk.org", "honnappa.nagarahalli@arm.com",
 "konstantin.v.ananyev@yandex.ru", "Sexton, Rory"
Subject: Re: [RFC PATCH] ring: adding TPAUSE instruction to ring dequeue
Message-ID: <20230504092329.457d4f8c@hermes.local>
In-Reply-To:
References: <20230503113840.11010-1-david.coyle@intel.com>
 <98CBD80474FA8B44BF855DF32C47DC35D878DD@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D878E1@smartserver.smartshare.dk>

On Thu, 4 May 2023 16:11:31 +0000
"Coyle, David" wrote:

> Hi Morten
>
> > -----Original Message-----
> > From: Morten Brørup
>
> > Power saving is important for the environment (to save the planet and
> > all that), so everyone should contribute if they have a good solution.
> > So even if our algorithm had a significant degree of innovation, we
> > would probably choose to make it public anyway. Open sourcing it also
> > makes it possible for chip vendors like Intel to fine tune it more than
> > we can ourselves, which also comes back to benefit us. All products
> > need some sort of power saving in order to stay competitive, but power
> > saving algorithms are not an area we want to pursue for competitive
> > purposes in our products.
> >
> > Our algorithm is too simple to make a library at this point, but I have
> > been thinking about how we can make it a generic library when it has
> > matured some more. I will take into consideration your point that many
> > customers need it invisibly injected.
> >
> > Our current algorithm works like this:
> >
> > while (running) {
> >     int more = 0;
> >     more += stage1();
> >     more += stage2();
> >     more += stage3();
> >     if (!more) sleep();
> > }
> >
> > Each pipeline stage only returns 1 if it processed a full burst.
> > Furthermore, if a pipeline stage processed a full burst, but happens to
> > know that no more data is readily available for it, it returns 0
> > instead.
> >
> > Obviously, the sleep() duration must be short enough that the NIC RX
> > descriptor rings do not overflow before the ingress pipeline stage is
> > serviced again.
> >
> > Changing the algorithm to "more" (1 = more work expected by the
> > pipeline stage) from "busy" (1 = some work done by the pipeline stage)
> > has the consequence that sleep() is called more often, which has the
> > follow-on consequence that the ingress stage is called less often, and
> > thus more often has a full burst to process.
> >
> > We know from our in-house profiler that processing a full burst
> > provides *much* higher execution efficiency (cycles/packet) than
> > processing a few packets. This is public knowledge - after all, this is
> > the whole point of DPDK's vector packet processing design! Nonetheless,
> > it might surprise some people how much the efficiency (cycles/packet)
> > increases when processing a full burst compared to processing just a
> > few packets. I will leave it up to the readers to make their own
> > experiments. :-)
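
As a concrete illustration of the stage convention described above, one
such stage might look roughly like the sketch below. This is only a
sketch: the ring_stage name, the rings and BURST_SIZE are invented for
the example and are not taken from the code being discussed.

#include <rte_ring.h>

#define BURST_SIZE 32

/*
 * One pipeline stage following the "more" convention: return 1 only if
 * a full burst was processed and more data appears to be waiting;
 * return 0 otherwise, so the main loop may sleep.
 */
static int
ring_stage(struct rte_ring *in, struct rte_ring *out)
{
        void *pkts[BURST_SIZE];
        unsigned int avail = 0;
        unsigned int n;

        n = rte_ring_dequeue_burst(in, pkts, BURST_SIZE, &avail);
        if (n > 0) {
                /* ... process the packets ... */
                rte_ring_enqueue_burst(out, pkts, n, NULL);
        }

        /* "more": a full burst was taken and the input ring is not empty */
        return (n == BURST_SIZE && avail > 0);
}

Note that the stage returns 0 even after a full burst if the input ring
is now empty, matching the "no more data readily available" rule above.
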
> >
> > Our initial "busy" algorithm behaved like this:
> > Process a few packets (at low efficiency), don't sleep,
> > Process a few packets (at low efficiency), don't sleep,
> > Process a few packets (at low efficiency), don't sleep,
> > Process a few packets (at low efficiency), don't sleep,
> > Process a few packets (at low efficiency), don't sleep,
> > Process a few packets (at low efficiency), don't sleep,
> > Process a few packets (at low efficiency), don't sleep,
> > Process a few packets (at low efficiency), don't sleep,
> > No packets to process (we are lucky this time!), sleep briefly,
> > Repeat.
> >
> > So we switched to our "more" algorithm, which behaves like this:
> > Process a few packets (at low efficiency), sleep briefly,
> > Process a full burst of packets (at high efficiency), don't sleep,
> > Repeat.
> >
> > Instead of processing e.g. 8 small bursts per sleep, we now process
> > only 2 bursts per sleep. And the bigger of the two bursts is processed
> > at higher efficiency.
> >
> > We can improve this algorithm in some areas...
> >
> > E.g. some of our pipeline stages also know that they are not going to
> > do any more work for the next X nanoseconds, but we don't use that
> > information in our power management algorithm yet. The sleep duration
> > could depend on this.
> >
> > Also, we don't use the CPU power management states yet. I assume that
> > doing some work for 20 us at half clock speed conserves more power than
> > doing the same work at full speed for 10 us and then sleeping for
> > 10 us. That's another potential improvement.
> >
> > What we need in a generic power management helper library are functions
> > to feed it with the application's perception of how much work is being
> > done, and functions to tell if we can sleep and/or if we should change
> > the power management states of the individual CPU cores.
> >
> > Such a unified power management helper (or "busyness") library could
> > perhaps also be fed with data directly from the drivers and libraries
> > to support the customer use cases you described.
>
> [DC] Thank you for that detailed description, very interesting. There may
> well be merit in upstreaming such an algorithm as a library once it has
> matured, as you said.
>
> Configuration could include specifying what a "full burst" actually is.
> Different stages of a pipeline may also have different definitions of
> busyness, so that may also need to be considered:
> - Some stages perform an operation (e.g. running an ACL rule check) on a
>   burst of packets and are then complete.
> - Other stages are more asynchronous in nature, e.g. enqueuing to and
>   dequeuing from a crypto device or a QoS scheduler. The dequeue might
>   not dequeue any packets on a particular call of the dequeue API, but
>   there may still be packets waiting inside the crypto device or
>   scheduler. Those waiting packets would also need to be taken into
>   account so as not to sleep for too long.
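
One way the asynchronous case in the second bullet could be reported is
for the stage to track how many operations are still inside the device.
The sketch below is purely illustrative: BURST_SIZE and the inflight
counter are invented for the example; only rte_cryptodev_dequeue_burst()
is an existing API.

#include <rte_cryptodev.h>

#define BURST_SIZE 32

/* Operations enqueued to the crypto device but not yet dequeued;
 * the enqueue side of the stage increments this counter. */
static uint32_t inflight;

static int
crypto_dequeue_stage(uint8_t dev_id, uint16_t qp_id)
{
        struct rte_crypto_op *ops[BURST_SIZE];
        uint16_t n;

        n = rte_cryptodev_dequeue_burst(dev_id, qp_id, ops, BURST_SIZE);
        inflight -= n;
        if (n > 0) {
                /* ... pass the completed operations to the next stage ... */
        }

        /*
         * Report "more" if a full burst came out, or if operations are
         * still pending inside the device, so the main loop does not
         * sleep for too long while they complete.
         */
        return (n == BURST_SIZE || inflight > 0);
}

Whether pending operations should suppress the sleep entirely or merely
shorten it is the kind of policy a generic busyness library would need
to expose.
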
>
> Using such an API would require a workload developer to update their
> datapath to report the pipeline stage busyness to the algorithm, but if
> those calls are kept to a minimum, then that shouldn't be too much of a
> problem.
>
> Thanks,
> David

I see two overlapping discussions here:

The first is using some form of memory wait or timed pause for the cases
where a core is spinning on a contended region such as a lock or ring.
Some of this is already available on Arm64, and having TPAUSE used on
Intel makes sense. Using TPAUSE in rte_usleep is an obviously good idea.

The other is having some overall indication of busyness. This would be
how often things like rx_burst and ring_dequeue get data to work on.
A mechanism for this must be lightweight (i.e. per-core with minimal data
collection) and plumbed into the telemetry system. It makes sense that
this would be a new DPDK EAL call used in place of the sleep that most
applications do in their main loop when not busy. Any solution should be
architecture independent.

None of the designs presented so far seem complete and simple enough to
be part of the main DPDK distribution. Keep working and experimenting.
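
As a rough illustration of the kind of lightweight, per-lcore helper
described above, it might look something like this. None of the names
below exist in EAL today; the telemetry and TPAUSE plumbing are only
hinted at in comments.

#include <stdint.h>
#include <rte_lcore.h>
#include <rte_pause.h>

struct idle_stats {
        uint64_t polls;        /* total main-loop iterations    */
        uint64_t empty_polls;  /* iterations that found no work */
};

static RTE_DEFINE_PER_LCORE(struct idle_stats, idle_stats);

/*
 * Hypothetical replacement for the open-coded sleep in an application's
 * main loop: the caller reports how much work the last iteration found,
 * the helper records per-lcore statistics (which a real library would
 * register with telemetry) and pauses when the core is idle.
 */
static inline void
rte_lcore_idle_hint(unsigned int work_done)
{
        struct idle_stats *s = &RTE_PER_LCORE(idle_stats);

        s->polls++;
        if (work_done == 0) {
                s->empty_polls++;
                /*
                 * Portable fallback; on x86 with WAITPKG this is where a
                 * TPAUSE-based wait (e.g. rte_power_pause()) could be
                 * used, and a WFE-based wait on Arm64.
                 */
                rte_pause();
        }
}

An application's main loop would call it where the sleep() appears in the
pseudo-code earlier in the thread, e.g. rte_lcore_idle_hint(more);
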