From: Konstantin Ananyev
Date: Sat, 1 Oct 2022 15:17:26 +0100
Subject: Re: [PATCH v7 1/4] eal: add lcore poll busyness telemetry
To: Kevin Laatz, Konstantin Ananyev, dev@dpdk.org
Cc: anatoly.burakov@intel.com, Conor Walsh, David Hunt, Bruce Richardson,
 Nicolas Chautru, Fan Zhang, Ashish Gupta, Akhil Goyal, Fengchengwen,
 Ray Kinsella, Thomas Monjalon, Ferruh Yigit, Andrew Rybchenko,
 Jerin Jacob, Sachin Saxena, Hemant Agrawal, Ori Kam,
 Honnappa Nagarahalli

>> Hi Kevin,
>>
>>>>> Currently, there is no way to measure lcore poll busyness in a
>>>>> passive way, without any modifications to the application. This
>>>>> patch adds a new EAL API that will be able to passively track core
>>>>> polling busyness.
>>>>>
>>>>> The poll busyness is calculated by relying on the fact that most
>>>>> DPDK APIs will poll for work (packets, completions, eventdev
>>>>> events, etc.). Empty polls can be counted as "idle", while
>>>>> non-empty polls can be counted as busy. To measure lcore poll
>>>>> busyness, we simply call the telemetry timestamping function with
>>>>> the number of polls a particular code section has processed, and
>>>>> count the number of cycles we've spent processing empty bursts.
>>>>> The more empty bursts we encounter, the fewer cycles we spend in
>>>>> the "busy" state, and the lower the reported core poll busyness.
>>>>>
>>>>> In order for all of the above to work without modifications to the
>>>>> application, the library code needs to be instrumented with calls
>>>>> to the lcore telemetry busyness timestamping function. The
>>>>> following parts of DPDK are instrumented with lcore poll busyness
>>>>> timestamping calls:
>>>>>
>>>>> - All major driver APIs:
>>>>>     - ethdev
>>>>>     - cryptodev
>>>>>     - compressdev
>>>>>     - regexdev
>>>>>     - bbdev
>>>>>     - rawdev
>>>>>     - eventdev
>>>>>     - dmadev
>>>>> - Some additional libraries:
>>>>>     - ring
>>>>>     - distributor
>>>>>
>>>>> To avoid performance impact from having lcore telemetry support, a
>>>>> global variable is exported by EAL, and the call to the
>>>>> timestamping function is wrapped into a macro, so that whenever
>>>>> telemetry is disabled, it only takes one additional branch and no
>>>>> function calls are performed. It is disabled at compile time by
>>>>> default.
>>>>>
>>>>> This patch also adds a telemetry endpoint to report lcore poll
>>>>> busyness, as well as telemetry endpoints to enable/disable lcore
>>>>> telemetry. A documentation entry has been added to the howto guides
>>>>> to explain the usage of the new telemetry endpoints and API.
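Just to make sure we are talking about the same mechanism: my reading
of the description above is that the instrumentation boils down to
roughly the following (names are made up here, this is not the actual
patch code):

#include <stdbool.h>
#include <stdint.h>
#include <rte_cycles.h>

/* the "global variable exported by EAL" that gates the whole thing */
static bool lcore_telemetry_enabled;

static uint64_t busy_cycles, idle_cycles, last_tsc;

static inline void
lcore_poll_busyness_timestamp(unsigned int nb_polls)
{
	uint64_t now = rte_get_tsc_cycles();

	/* an empty poll means the cycles since the previous poll count
	 * as idle, a non-empty poll means they count as busy */
	if (nb_polls == 0)
		idle_cycles += now - last_tsc;
	else
		busy_cycles += now - last_tsc;
	last_tsc = now;
}

/* call sites in the drivers are wrapped in a macro, so that disabled
 * telemetry costs only one extra branch and no function call */
#define LCORE_POLL_BUSYNESS_TIMESTAMP(n) do {             \
		if (lcore_telemetry_enabled)              \
			lcore_poll_busyness_timestamp(n); \
	} while (0)

/* e.g. at the end of an instrumented rx path:
 *     nb_rx = ...;
 *     LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);
 */

If I got that wrong, please correct me - the comments below assume this
model.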
>>>> As was already mentioned by other reviewers, it would be much better
>>>> to let the application itself decide when it is idle and when it is
>>>> busy.
>>>> With the current approach, even for a constant-polling
>>>> run-to-completion model there are plenty of opportunities to get
>>>> things wrong and provide misleading statistics.
>>>> My special concern - inserting it into the ring dequeue code.
>>>> Ring is used for various different things, not only to pass packets
>>>> between threads (mempool, etc.).
>>>> Blindly assuming that an empty ring dequeue means idle cycles seems
>>>> wrong to me.
>>>> Which makes me wonder: should we really hard-code these calls into
>>>> DPDK core functions?
>>>> If you still would like to introduce such stats, it might be better
>>>> to implement them via a callback mechanism.
>>>> As I remember, nearly all our drivers (net, crypto, etc.) do support
>>>> it.
>>>> That way our generic code will remain unaffected, plus the user will
>>>> have the ability to enable/disable it on a per-device basis.
>>> Thanks for your feedback, Konstantin.
>>>
>>> You are right in saying that this approach won't be 100% suitable for
>>> all use cases, but it should be suitable for the majority of
>>> applications.
>> First of all - could you explain how you measured what the 'majority'
>> of DPDK applications is?
>> And how did you conclude that it definitely works for all the apps in
>> that 'majority'?
>> Second, what bothers me with that approach - I don't see a clear and
>> deterministic way for the user to understand whether these stats would
>> work properly for his app or not
>> (except manually analyzing his app code).
>
> All of the DPDK example applications we've tested with (l2fwd, l3fwd +
> friends, testpmd, distributor, dmafwd) report lcore poll busyness and
> respond to changing traffic rates etc. We've also compared the reported
> busyness to similar metrics reported by other projects such as VPP and
> OvS, and found the reported busyness matches with a difference of +/-
> 1%. In addition to the DPDK example applications, we have shared our
> plans with end customers and they have confirmed that the design should
> work with their applications.

I am sure l3fwd and testpmd should be ok; I am talking about something
more complicated/unusual.
Below are a few examples off the top of my head where I think your
approach will generate invalid stats - feel free to correct me if I am
wrong.

1) App doing some sort of bonding itself, i.e.:

struct rte_mbuf *pkts[N * 2];
...
k = rte_eth_rx_burst(p0, q0, pkts, N);
n = rte_eth_rx_burst(p1, q1, pkts + k, N);

/* process all packets from both ports at once */
if (n + k != 0)
	process_pkts(pkts, n + k);

Now, as I understand it, if n == 0, then all cycles spent in
process_pkts() will be accounted as idle.

2) App doing something similar to what the pdump library does (creates
a copy of a packet and sends it somewhere):

n = rte_eth_rx_burst(p0, q0, &pkt, 1);
if (n != 0) {
	dup_pkt = rte_pktmbuf_copy(pkt, dup_mp, ...);
	if (dup_pkt != NULL)
		process_dup_pkt(dup_pkt);
	process_pkt(pkt);
}

That relates to the ring discussion below: if there are no mbufs in
dup_mp, then ring_dequeue() will fail and process_pkt() will be
accounted as idle.

3) App dequeues from a ring in a bit of an unusual way:

/* idle spin loop */
while ((n = rte_ring_count(ring)) == 0)
	rte_pause();

n = rte_ring_dequeue_bulk(ring, pkts, n, NULL);
if (n != 0)
	process_pkts(pkts, n);

Here, we can end up accounting cycles spent in the idle spin loop as
busy.

4) Any thread that generates TX traffic on its own (something like
testpmd tx_only fwd mode).

5) Any thread that depends on both dequeue and enqueue:

n = rte_ring_dequeue_burst(in_ring, pkts, n, ..);
...
/* loop till all packets are sent out successfully */
while (rte_ring_enqueue_bulk(out_ring, pkts, n, NULL) == 0)
	rte_pause();

Now, if n > 0, all cycles spent in enqueue() will be accounted as
'busy', though from my perspective they should probably be considered
'idle'.

Also, I expect some problems when packet processing is done inside rx
callbacks, but that probably can be easily fixed.
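To illustrate what kind of application involvement would be needed to
get case 1) right: as far as I understand the proposed API, the app
itself would have to report the combined result of both polls,
presumably with the RTE_LCORE_POLL_BUSYNESS_TIMESTAMP() macro mentioned
below (just a sketch, the exact semantics may differ):

k = rte_eth_rx_burst(p0, q0, pkts, N);
n = rte_eth_rx_burst(p1, q1, pkts + k, N);

/* explicitly report the combined result of both polls, so that an
 * empty poll on p1 alone does not mark the processing below as idle */
RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n + k);

if (n + k != 0)
	process_pkts(pkts, n + k);

That is, for such cases the application has to be modified anyway.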
>>> It's worth keeping in mind that this feature is compile-time disabled
>>> by default, so there is no impact to any application/user that does
>>> not wish to use this, for example applications where this type of
>>> busyness is not useful, or applications that already use other
>>> mechanisms to report similar telemetry.
>> Not sure that adding a new compile-time option disabled by default is
>> a good thing...
>> For me it would be much more preferable if we went through a more
>> 'standard' way here:
>> a) define a clear API to enable/disable/collect/report such type of
>> stats.
>> b) use some of our sample apps to demonstrate how to use it properly
>> with user-specific code.
>> c) if needed, implement some 'silent' stats collection for a limited
>> scope of apps via callbacks -
>> let's say for run-to-completion apps that use ether and crypto devs
>> only.
>
> With the compile-time option, it's just one build flag for lots of
> applications to silently benefit from this.

There could be a lot of useful and helpful stats that users would like
to collect (idle/busy, processing latency, etc.).
But if for each such case we hard-code new stats collection into our
fast data-path code, then very soon it will become completely bloated
and unmaintainable.
I think we need some generic approach for such extra stats collection.
Callbacks could be one way; Jerin in another mail suggested using the
existing trace-point hooks - it might be worth exploring that further.
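Just to illustrate what I mean by the callback way, here is a rough
sketch (struct/function names are made up, this is not a concrete
proposal): an opt-in stats library registers an rx callback on every
ethdev port/queue and counts empty vs non-empty polls, in the same
spirit as what latencystats or pdump do today.

#include <rte_common.h>
#include <rte_cycles.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>

/* hypothetical per-lcore counters */
struct poll_stats {
	uint64_t empty_polls;
	uint64_t busy_polls;
	uint64_t idle_cycles;
	uint64_t busy_cycles;
	uint64_t last_tsc;
};

static struct poll_stats poll_stats[RTE_MAX_LCORE];

/* rx callback: classify the cycles since the previous poll on this
 * lcore as idle or busy, depending on whether this poll returned any
 * packets */
static uint16_t
poll_stats_rx_cb(uint16_t port __rte_unused, uint16_t queue __rte_unused,
		struct rte_mbuf *pkts[] __rte_unused, uint16_t nb_rx,
		uint16_t max_pkts __rte_unused, void *arg __rte_unused)
{
	unsigned int lcore = rte_lcore_id();
	struct poll_stats *ps;
	uint64_t now;

	if (lcore >= RTE_MAX_LCORE)
		return nb_rx;

	ps = &poll_stats[lcore];
	now = rte_get_tsc_cycles();

	if (ps->last_tsc != 0) {
		if (nb_rx == 0) {
			ps->empty_polls++;
			ps->idle_cycles += now - ps->last_tsc;
		} else {
			ps->busy_polls++;
			ps->busy_cycles += now - ps->last_tsc;
		}
	}
	ps->last_tsc = now;
	return nb_rx;
}

/* hypothetical enable function: the library walks all ports and rx
 * queues by itself, so no changes in the application fast path are
 * needed */
int
poll_stats_enable(void)
{
	uint16_t port, q;

	RTE_ETH_FOREACH_DEV(port) {
		struct rte_eth_dev_info info;

		if (rte_eth_dev_info_get(port, &info) != 0)
			continue;
		for (q = 0; q < info.nb_rx_queues; q++)
			if (rte_eth_add_rx_callback(port, q,
					poll_stats_rx_cb, NULL) == NULL)
				return -1;
	}
	return 0;
}

The same idea extends to other device classes that already have
callback support, and it leaves rte_ring and the rest of the generic
fast-path code untouched.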
>>> However, the upside for applications that do wish to use this is that
>>> there are no code changes required (for the most part); the feature
>>> simply needs to be enabled at compile time via the meson option.
>>>
>>> In scenarios where contextual awareness of the application is needed
>>> in order to report more accurate "busyness", the
>>> "RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
>>> sections of code as "busy" or "idle". This way, the application can
>>> assume control of determining the poll busyness of its lcores while
>>> leveraging the telemetry hooks added in this patchset.
>>>
>>> We did initially consider implementing this via callbacks, however we
>>> found this approach to have 2 main drawbacks:
>>> 1. Application changes are required for all applications wanting to
>>> report this telemetry - rather than the majority getting it for free.
>> Didn't get it - why would the callback approach require user-app
>> changes?
>> In other situations - rte_power callbacks, pdump, etc. - it works
>> transparently to user-level code.
>> Why can't it be done here in a similar way?
>
> From my understanding, the callbacks would need to be registered by the
> application at the very least (and the callback would have to be
> registered per device/pmd/lib).

Callbacks can be registered by the library itself.
AFAIK, the latencystats, power and pdump libraries all use a similar
approach:
the user calls something like xxx_stats_enable() and then the library
iterates over all available devices and sets up the necessary
callbacks; same for xxx_stats_disable().

>>> 2. Ring does not have callback support, meaning pipelined
>>> applications could not report lcore poll busyness telemetry with this
>>> approach.
>> That's another big concern that I have:
>> why do you consider that all rings will be used for pipelines between
>> threads and should always be accounted by your stats?
>> They could be used for dozens of different purposes.
>> What if that ring is used for a mempool, and ring_dequeue() just means
>> we try to allocate an object from the pool? In such a case, why should
>> failing to allocate an object mean the start of a new 'idle cycle'?
>
> Another approach could be taken here if the mempool interactions are of
> concern.
>
> From our understanding, mempool operations use the "_bulk" APIs,
> whereas polling operations use the "_burst" APIs. Would only
> timestamping on the "_burst" APIs be better here? That way the mempool
> interactions won't be counted towards the busyness.

Well, it would help to solve one particular case, but in general I
still think it is incomplete and error-prone.
What if a pipeline app uses ring_count()/ring_dequeue_bulk(), or even
the ZC ring API?
What if the app uses something different from rte_ring to pass packets
between threads/processes?
As I said before, without some clues from the app, it is probably not
possible to collect such stats in a proper way.

> Including support for pipelined applications using rings is key for a
> number of use cases; this was highlighted as part of the customer
> feedback when we shared the design.
>
>>> Eventdev is another driver which would be completely missed with this
>>> approach.
>> Ok, I see two ways here:
>> - implement CB support for eventdev.
>> - meanwhile, clearly document that these stats are not supported for
>> eventdev scenarios (yet).