From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 267D8440D2; Sun, 26 May 2024 09:10:17 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 9333B402D5; Sun, 26 May 2024 09:10:16 +0200 (CEST) Received: from mail.lysator.liu.se (mail.lysator.liu.se [130.236.254.3]) by mails.dpdk.org (Postfix) with ESMTP id 68579402B2 for ; Sun, 26 May 2024 09:10:15 +0200 (CEST) Received: from mail.lysator.liu.se (localhost [127.0.0.1]) by mail.lysator.liu.se (Postfix) with ESMTP id C28D6194B4 for ; Sun, 26 May 2024 09:10:14 +0200 (CEST) Received: by mail.lysator.liu.se (Postfix, from userid 1004) id B5BDE19377; Sun, 26 May 2024 09:10:14 +0200 (CEST) X-Spam-Checker-Version: SpamAssassin 4.0.0 (2022-12-13) on hermod.lysator.liu.se X-Spam-Level: X-Spam-Status: No, score=-1.3 required=5.0 tests=ALL_TRUSTED,AWL, T_SCC_BODY_TEXT_LINE autolearn=disabled version=4.0.0 X-Spam-Score: -1.3 Received: from [192.168.1.59] (h-62-63-215-114.A163.priv.bahnhof.se [62.63.215.114]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mail.lysator.liu.se (Postfix) with ESMTPSA id A4CEF193E0; Sun, 26 May 2024 09:10:12 +0200 (CEST) Message-ID: <1cc502d9-2afd-47fb-a340-bb3d9ad0d573@lysator.liu.se> Date: Sun, 26 May 2024 09:10:12 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC v3] net/af_packet: make stats reset reliable To: =?UTF-8?Q?Morten_Br=C3=B8rup?= , Stephen Hemminger Cc: Bruce Richardson , Ferruh Yigit , "John W. Linville" , Thomas Monjalon , dev@dpdk.org, =?UTF-8?Q?Mattias_R=C3=B6nnblom?= References: <20240425174617.2126159-1-ferruh.yigit@amd.com> <20240503154547.392069-1-ferruh.yigit@amd.com> <20240503150011.55681b97@hermes.local> <432fb280-96e3-4079-b987-990d509b8c79@lysator.liu.se> <20240508082339.6bf74202@hermes.local> <834102e5-8f4b-4b67-9f62-8e8f61403b39@amd.com> <20240508135418.55505105@hermes.local> <98CBD80474FA8B44BF855DF32C47DC35E9F431@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35E9F432@smartserver.smartshare.dk> <98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk> <20240509215602.7e03f34d@hermes.local> <98CBD80474FA8B44BF855DF32C47DC35E9F434@smartserver.smartshare.dk> Content-Language: en-US From: =?UTF-8?Q?Mattias_R=C3=B6nnblom?= In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35E9F434@smartserver.smartshare.dk> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Virus-Scanned: ClamAV using ClamSMTP X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org On 2024-05-10 11:14, Morten Brørup wrote: >> From: Stephen Hemminger [mailto:stephen@networkplumber.org] >> Sent: Friday, 10 May 2024 06.56 >> >> On Thu, 9 May 2024 16:19:08 +0200 >> Morten Brørup wrote: >> >>>> From: Morten Brørup [mailto:mb@smartsharesystems.com] >>>> Sent: Thursday, 9 May 2024 13.37 >>>> >>>>> From: Bruce Richardson [mailto:bruce.richardson@intel.com] >>>>> Sent: Thursday, 9 May 2024 11.30 >>>>> >>>>> On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote: >>>>>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org] >>>>>>> Sent: Wednesday, 8 May 2024 22.54 >>>>>>> >>>>>>> On Wed, 8 May 2024 20:48:06 +0100 >>>>>>> Ferruh Yigit wrote: >>>>>>> >>>>>>>>> >>>>>>>>> The idea of load tearing is crazy talk of integral types. >> It >>>>> would >>>>>>> break so many things. >>>>>>>>> It is the kind of stupid compiler thing that would send >> Linus >>>> on >>>>> a >>>>>>> rant and get >>>>>>>>> the GCC compiler writers in trouble. >>>>>>>>> >>>>>>>>> The DPDK has always favored performance over strict safety >>>> guard >>>>>>> rails everywhere. >>>>>>>>> Switching to making every statistic an atomic operation is >> not >>>>> in >>>>>>> the spirit of >>>>>>>>> what is required. There is no strict guarantee necessary >> here. >>>>>>>>> >>>>>>>> >>>>>>>> I kind of agree with Stephen. >>>>>>>> >>>>>>>> Thanks Mattias, Morten & Stephen, it was informative >> discussion. >>>>> But >>>>>>> for >>>>>>>> *SW drivers* stats update and reset is not core >> functionality >>>> and >>>>> I >>>>>>>> think we can be OK to get hit on corner cases, instead of >>>>>>>> over-engineering or making code more complex. >>>>>>> >>>>>>> >>>>>>> I forgot the case of 64 bit values on 32 bit platforms! >>>>>>> Mostly because haven't cared about 32 bit for years... >>>>>>> >>>>>>> The Linux kernel uses some wrappers to handle this. >>>>>>> On 64 bit platforms they become noop. >>>>>>> On 32 bit platform, they are protected by a seqlock and >> updates >>>> are >>>>>>> wrapped by the sequence count. >>>>>>> >>>>>>> If we go this way, then doing similar Noop on 64 bit and >> atomic or >>>>>>> seqlock >>>>>>> on 32 bit should be done, but in common helper. >>>>>>> >>>>>>> Looking inside FreeBSD, it looks like that has changed over >> the >>>>> years as >>>>>>> well. >>>>>>> >>>>>>> if_inc_counter >>>>>>> counter_u64_add >>>>>>> atomic_add_64 >>>>>>> But the counters are always per-cpu in this case. So although >> it >>>>> does >>>>>>> use >>>>>>> locked operation, will always be uncontended. >>>>>>> >>>>>>> >>>>>>> PS: Does DPDK still actually support 32 bit on x86? Can it be >>>>> dropped >>>>>>> this cycle? >>>>>> >>>>>> We cannot drop 32 bit architecture support altogether. >>>>>> >>>>>> But, unlike the Linux kernel, DPDK doesn't need to support >> ancient >>>> 32 >>>>> bit architectures. >>>>>> If the few 32 bit architectures supported by DPDK provide non- >>>> tearing >>>>> 64 bit loads/stores, we don't need locks (in the fast path) for 64 >> bit >>>>> counters. >>>>>> >>>>>> In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit >>>>> architecture) and 32 bit ARMv8. >>>>>> I don't think DPDK support any other 32 bit architectures. >>>>>> >>>>>> >>>>>> As Mattias mentioned, 32 bit x86 can use xmm registers to >> provide 64 >>>>> bit non-tearing load/store. >>>>>> >>>>> >>>>> Testing this a little in godbolt, I see gcc using xmm registers on >> 32- >>>>> bit >>>>> when updating 64-bit counters, but clang doesn't seem to do so, >> but >>>>> instead >>>>> does 2 stores when writing back the 64 value. (I tried with both >>>>> volatile >>>>> and non-volatile 64-bit values, just to see if volatile would >>>> encourage >>>>> clang to do a single store). >>>>> >>>>> GCC: https://godbolt.org/z/9eqKfT3hz >>>>> Clang: https://godbolt.org/z/PT5EqKn4c >>>> >>>> Interesting. >>>> I guess this can be fixed by manually implementing what GCC does. >>>> >>>> I'm more concerned about finding a high-performance (in the fast >> path) >>>> 64 bit counter solution for 32 bit ARM. >>> >>> Reading up on the topic, and continuing Bruce's experiment on Godbolt, >> it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store >> Register Dual) instructions, which load/store 64 bit from memory into >> two registers at once. >>> >>> Clang is emits more efficient code without volatile. >>> GCC requires volatile to use STRD. >>> >>> Clang: https://godbolt.org/z/WjdTq6EKh >>> GCC: https://godbolt.org/z/qq9j7d4Ea >>> >>> Summing it up, it is possible to implement non-tearing 64 bit high- >> performance (lockless, barrier-free) counters on the 32 bit >> architectures supported by DPDK. >>> >>> But the implementation is both architecture and compiler specific. >>> So it seems a "64 bit counters" library would be handy. (Or a "non- >> tearing 64 bit integers" library, for support of the signed variant too; >> but I don't think we need that.) >>> We can use uint64_t as the underlying type and type cast in the >> library (where needed by the specific compiler/architecture), or >> introduce a new rte_ctr64_t type to ensure that accessor functions are >> always used and the developer doesn't have to worry about tearing on 32 >> bit architectures. >>> >>> The most simple variant of such a library only provides load and store >> functions. The API would look like this: >>> >>> uint64_t inline >>> rte_ctr64_get(const rte_ctr64_t *const ctr); >>> >>> void inline >>> rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value); >>> >>> And if some CPU offers special instructions for increment or addition, >> faster (regarding performance) and/or more compact (regarding >> instruction memory) than a sequence of load-add-store instructions: >>> >>> void inline >>> rte_ctr64_inc(rte_ctr64_t *const ctr); >>> >>> void inline >>> rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value); > > Note: 32 bit architectures might achieve higher performance if the "value" parameter to rte_ctr64_add() is unsigned long (or unsigned int) instead of uint64_t. > If the rte_ctr64_add() implementation is kept in the header file (or, always in case of LTO), I don't see how that could make a difference, assuming the operand is of a smaller (<64 bits) type. >>> >>> >>> And perhaps atomic variants of all these functions, with explicit >> and/or relaxed memory ordering, for counters shared by multiple threads. >>> >>> >> >> >> This kind of what I am experimenting with but... > > Excellent! > >> Intend to keep the details of the counters read and update in one file >> and not as inlines. > > Please note that traffic management applications maintain many counters (e.g. per-flow, per-subscriber and per-link packet and byte counters, some also per QoS class), so rte_ctr64_add() must have the highest performance technically possible. > > For reference, the packet scheduling code in our application updates 28 statistics counters per burst of packets. (In addition to internal state variables for e.g. queue lenghts.) > > Furthermore, our application processes and displays live statistics with one second granularity, using a separate thread. Although statistics processing is not part of the fast path, the sheer number of counters processed requires high performance read access to those counters. > > > > Some groups of counters are maintained locally in the inner loop, and then added in bulk to the "public" statistics afterwards. Conceptually: > > struct stats_per_prio { > uint64_t forwarded_bytes; > uint64_t forwarded_packets; > uint64_t marked_bytes; > uint64_t marked_packets; > uint64_t dropped_bytes; > uint64_t dropped_packets; > }; > > If this is a common design pattern in DPDK (drivers, libraries and/or applications), perhaps also provide a performance optimized function for architectures offering vector instructions: > > void inline > rte_ctr64_add_bulk( > rte_ctr64_t *const ctrs, > const unsigned long *const values, > const unsigned int n /* compile time constant */); > > This slightly resembles the Linux kernel's design pattern, where counters are updated in bulk, protected by a common lock for the bulk update. (However, DPDK has no lock, so the motivation for optimizing for this design pattern is only "nice to have".) > > PS: > Many DPDK applications are 64 bit only, including the SmartShare appliances, and can easily manage 64 bit counters without all this. > However, if the DPDK project still considers 32 bit architectures first class citizens, 64 bit counters should have high performance on 32 bit architectures too. >