From: Mattias Rönnblom
Subject: Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
Date: Mon, 3 Oct 2022 22:02:22 +0200
To: Konstantin Ananyev, Kevin Laatz, Konstantin Ananyev, "dev@dpdk.org"
Cc: "anatoly.burakov@intel.com", Conor Walsh, David Hunt, Bruce Richardson, Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal, Fengchengwen, Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko, Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam, Honnappa Nagarahalli, Mattias Rönnblom

On 2022-10-01 16:17, Konstantin Ananyev wrote:
>
>>> Hi Kevin,
>>>
>>>>>> Currently, there is no way to measure lcore poll busyness in a passive way, without any modifications to the application. This patch adds a new EAL API that will be able to passively track core polling busyness.
>>>>>>
>>>>>> The poll busyness is calculated by relying on the fact that most DPDK APIs will poll for work (packets, completions, eventdev events, etc). Empty polls can be counted as "idle", while non-empty polls can be counted as busy.
>>>>>> To measure lcore poll busyness, we simply call the telemetry timestamping function with the number of polls a particular code section has processed, and count the number of cycles we've spent processing empty bursts. The more empty bursts we encounter, the fewer cycles we spend in the "busy" state, and the less core poll busyness will be reported.
>>>>>>
>>>>>> In order for all of the above to work without modifications to the application, the library code needs to be instrumented with calls to the lcore telemetry busyness timestamping function. The following parts of DPDK are instrumented with lcore poll busyness timestamping calls:
>>>>>>
>>>>>> - All major driver APIs:
>>>>>>     - ethdev
>>>>>>     - cryptodev
>>>>>>     - compressdev
>>>>>>     - regexdev
>>>>>>     - bbdev
>>>>>>     - rawdev
>>>>>>     - eventdev
>>>>>>     - dmadev
>>>>>> - Some additional libraries:
>>>>>>     - ring
>>>>>>     - distributor
>>>>>>
>>>>>> To avoid performance impact from having lcore telemetry support, a global variable is exported by EAL, and a call to the timestamping function is wrapped into a macro, so that whenever telemetry is disabled, it only takes one additional branch and no function calls are performed. It is disabled at compile time by default.
>>>>>>
>>>>>> This patch also adds a telemetry endpoint to report lcore poll busyness, as well as telemetry endpoints to enable/disable lcore telemetry. A documentation entry has been added to the howto guides to explain the usage of the new telemetry endpoints and API.
>>>>>
>>>>> As was already mentioned by other reviewers, it would be much better to let the application itself decide when it is idle and when it is busy. With the current approach, even for a constant-polling run-to-completion model, there are plenty of opportunities to get things wrong and provide misleading statistics.
>>>>> My special concern is inserting it into the ring dequeue code. Rings are used for various different things, not only to pass packets between threads (mempool, etc.). Blindly assuming that an empty ring dequeue means idle cycles seems wrong to me. Which makes me wonder: should we really hard-code these calls into DPDK core functions?
>>>>> If you still like to introduce such stats, it might be better to implement them via a callback mechanism. As I remember, nearly all our drivers (net, crypto, etc.) do support it. That way our generic code will remain unaffected, plus the user will have the ability to enable/disable it on a per-device basis.
>>>>
>>>> Thanks for your feedback, Konstantin.
>>>>
>>>> You are right in saying that this approach won't be 100% suitable for all use cases, but it should be suitable for the majority of applications.
>>>
>>> First of all - could you explain how you measured what the 'majority' of DPDK applications is? And how did you conclude that it definitely works for all the apps in that 'majority'?
>>> Second, what bothers me with that approach is that I don't see a clear and deterministic way for the user to understand whether these stats would work properly for his app or not (except by manually analyzing his app code).
>>
>> All of the DPDK example applications we've tested with (l2fwd, l3fwd + friends, testpmd, distributor, dmafwd) report lcore poll busyness and respond to changing traffic rates, etc. We've also compared the reported busyness to similar metrics reported by other projects such as VPP and OvS, and found the reported busyness matches with a difference of +/- 1%. In addition to the DPDK example applications, we've shared our plans with end customers and they have confirmed that the design should work with their applications.
>
> I am sure l3fwd and testpmd should be ok, I am talking about something more complicated/unusual.
> Below are a few examples off the top of my head where I think your approach will generate invalid stats; feel free to correct me if I am wrong.
>
> 1) App doing some sort of bonding itself, i.e.:
>
> struct rte_mbuf *pkts[N * 2];
> k = rte_eth_rx_burst(p0, q0, pkts, N);
> n = rte_eth_rx_burst(p1, q1, pkts + k, N);
>
> /* process all packets from both ports at once */
> if (n + k != 0)
>     process_pkts(pkts, n + k);
>
> Now, as I understand, if n == 0, then all cycles spent in process_pkts() will be accounted as idle.
>
> 2) App doing something similar to what the pdump library does (creates a copy of a packet and sends it somewhere).
>
> n = rte_eth_rx_burst(p0, q0, &pkt, 1);
> if (n != 0) {
>     dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
>     if (dup_pkt != NULL)
>         process_dup_pkt(dup_pkt);
>     process_pkt(pkt);
> }
>
> That relates to the ring discussion below: if there are no mbufs in dup_mp, then ring_dequeue() will fail and process_pkt() will be accounted as idle.
>
> 3) App dequeues from a ring in a somewhat unusual way:
>
> /* idle spin loop */
> while ((n = rte_ring_count(ring)) == 0)
>     rte_pause();
>
> n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
> if (n != 0)
>     process_pkts(pkts, n);
>
> Here, we can end up accounting cycles spent in the idle spin loop as busy.
>
> 4) Any thread that generates TX traffic on its own (something like testpmd tx_only fwd mode).
>
> 5) Any thread that depends on both dequeue and enqueue:
>
> n = rte_ring_dequeue_burst(in_ring, pkts, n, ..);
> ...
>
> /* loop till all packets are sent out successfully */
> while (rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
>     rte_pause();
>
> Now, if n > 0, all cycles spent in enqueue() will be accounted as 'busy', though from my perspective they probably should be considered as 'idle'.
>
> Also, I expect some problems when packet processing is done inside rx callbacks, but that can probably be easily fixed.
>
>>
>>>> It's worth keeping in mind that this feature is compile-time disabled by default, so there is no impact to any application/user that does not wish to use this, for example applications where this type of busyness is not useful, or applications that already use other mechanisms to report similar telemetry.
>>>
>>> Not sure that adding a new compile-time option disabled by default is a good thing...
>>> For me it would be much more preferable if we go through a more 'standard' way here:
>>> a) define a clear API to enable/disable/collect/report such type of stats.
>>> b) use some of our sample apps to demonstrate how to use it properly with user-specific code.
>>> c) if needed, implement some 'silent' stats collection for a limited scope of apps via callbacks - let's say for run-to-completion apps that use only ether and crypto devs.
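
For what it's worth, a collector along the lines of (c) could stay entirely out of the fast-path code. A minimal sketch, covering ethdev only, much like what latencystats does today; the ethdev calls are the real API, while lcore_busyness_record() is a made-up placeholder for the accounting, and only queue 0 is hooked for brevity:

#include <rte_common.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* hypothetical per-lcore accounting helper, not part of DPDK */
extern void lcore_busyness_record(uint16_t nb_pkts);

static uint16_t
busyness_rx_cb(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
               uint16_t nb_pkts, uint16_t max_pkts, void *user_param)
{
        RTE_SET_USED(port_id);
        RTE_SET_USED(queue);
        RTE_SET_USED(pkts);
        RTE_SET_USED(max_pkts);
        RTE_SET_USED(user_param);

        /* an empty burst counts as idle, a non-empty one as busy */
        lcore_busyness_record(nb_pkts);

        return nb_pkts;
}

static void
busyness_stats_enable(void)
{
        uint16_t port_id;

        /* hook into the rx path of every available port */
        RTE_ETH_FOREACH_DEV(port_id)
                rte_eth_add_rx_callback(port_id, 0, busyness_rx_cb, NULL);
}

No changes to the application, and nothing in the ethdev fast path when the callback is not installed.
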
>>
>> With the compile-time option, it's just one build flag for lots of applications to silently benefit from this.
>
> There could be a lot of useful and helpful stats that a user would like to collect (idle/busy, processing latency, etc.).
> But if for each such case we hard-code new stats collection into our fast data-path code, then very soon it will become completely bloated and unmaintainable.
> I think we need some generic approach for such extra stats collection.
> Callbacks could be one way; Jerin in another mail suggested using existing trace-point hooks - it might be worth exploring that further.
>
>>
>>>> However, the upside for applications that do wish to use this is that there are no code changes required (for the most part); the feature simply needs to be enabled at compile time via the meson option.
>>>>
>>>> In scenarios where contextual awareness of the application is needed in order to report more accurate "busyness", the "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark sections of code as "busy" or "idle". This way, the application can assume control of determining the poll busyness of its lcores while leveraging the telemetry hooks added in this patchset.
>>>>
>>>> We did initially consider implementing this via callbacks; however, we found this approach to have two main drawbacks:
>>>> 1. Application changes are required for all applications wanting to report this telemetry - rather than the majority getting it for free.
>>>
>>> Didn't get it - why would the callbacks approach require user-app changes?
>>> In other situations - rte_power callbacks, pdump, etc. - it works transparently to user-level code.
>>> Why can't it be done here in a similar way?
>>
>> From my understanding, the callbacks would need to be registered by the application at the very least (and the callback would have to be registered per device/pmd/lib).
>
> Callbacks can be registered by the library itself.
> AFAIK, the latencystats, power and pdump libraries all use a similar approach: the user calls something like xxx_stats_enable() and then the library can iterate over all available devices and set up the necessary callbacks. Same for xxx_stats_disable().
>
>>>
>>>> 2. Ring does not have callback support, meaning pipelined applications could not report lcore poll busyness telemetry with this approach.
>>>
>>> That's another big concern that I have:
>>> Why do you consider that all rings will be used for pipelines between threads and should always be accounted by your stats?
>>> They could be used for dozens of different purposes.
>>> What if that ring is used for a mempool, and ring_dequeue() just means we try to allocate an object from the pool? In such a case, why should failing to allocate an object mean the start of a new 'idle cycle'?
>>
>> Another approach could be taken here if the mempool interactions are of concern.
>>
>> From our understanding, mempool operations use the "_bulk" APIs, whereas polling operations use the "_burst" APIs. Would timestamping only on the "_burst" APIs be better here? That way the mempool interactions wouldn't be counted towards the busyness.
>
> Well, it would help to solve one particular case, but in general I still think it is incomplete and error-prone.

I agree. The functionality provided is very useful, and the implementation is clever in the way it doesn't require any application modifications. But a clever, useful, brittle hack is still a brittle hack.
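
To be fair, the patch does provide an escape hatch for cases the implicit hooks cannot see (e.g., the tx-only case above): as mentioned earlier in this thread, the application can call RTE_LCORE_POLL_BUSYNESS_TIMESTAMP() itself. My understanding of how that would look is roughly the below; the exact semantics of the macro are an assumption on my part, and generate_and_tx_pkts() is just a made-up placeholder:

        uint16_t n;

        /* application-generated work, invisible to the implicit hooks */
        n = generate_and_tx_pkts(port_id, queue_id);

        /*
         * report how many items this code section handled; a non-zero
         * count should make the cycles spent here count as busy rather
         * than idle
         */
        RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n);

But at that point, the application is already explicitly reporting what it is doing, which brings me to the next point.
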
What if there was instead a busyness module, where the application would explicitly report what it was up to? The new library would hook up to telemetry just like this patchset does, plus provide an explicit API to retrieve lcore thread load. (A rough sketch of what such an API could look like is at the end of this mail.)

The service cores framework (fancy name for rte_service.c) could also call the lcore load tracking module, provided all services properly reported back on whether or not they were doing anything useful with the cycles they just spent.

The metrics of such a load tracking module could potentially be used by other modules in DPDK, or by the application. It could be used for dynamic load balancing of service core services, for power management (e.g., DVFS), for a potential future deferred-work type mechanism more sophisticated than the current rte_service, or for some green threads/coroutines/fiber thingy. The DSW event device could also use it to replace its current internal load estimation scheme.

I may be repeating myself here, from past threads.

> What if a pipeline app uses ring_count()/ring_dequeue_bulk(), or even the ZC ring API?
> What if an app uses something different from rte_ring to pass packets between threads/processes?
> As I said before, without some clues from the app, it is probably not possible to collect such stats in a proper way.
>
>> Including support for pipelined applications using rings is key for a number of use cases; this was highlighted as part of the customer feedback when we shared the design.
>>
>>>> Eventdev is another driver which would be completely missed with this approach.
>>>
>>> Ok, I see two ways here:
>>> - implement CB support for eventdev.
>>> - meanwhile, clearly document that these stats are not supported for eventdev scenarios (yet).
>
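
Coming back to the load tracking module idea above: to make it slightly more concrete, the application-facing side could be something as small as the below. All names and signatures are hypothetical, only meant to illustrate the shape of such an API:

/*
 * Report the outcome of one unit of work on the current lcore;
 * nb_items == 0 means the cycles since the previous report were idle.
 */
void rte_lcore_load_report(uint16_t nb_items);

/*
 * Retrieve a load estimate (e.g., 0-100) for a particular lcore, for use
 * by the application, the service cores framework, telemetry, or a
 * future load balancer.
 */
int rte_lcore_load_get(unsigned int lcore_id);

The drivers and libraries would then not need to guess what an empty poll means; the thread that knows would be the one doing the reporting.
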