Subject: Re: [RFC v2] net/af_packet: make stats reset reliable
From: Mattias Rönnblom
To: Ferruh Yigit, "John W. Linville"
Cc: Thomas Monjalon, dev@dpdk.org, Mattias Rönnblom,
 Stephen Hemminger, Morten Brørup
Date: Thu, 2 May 2024 19:37:28 +0200
Message-ID: <758132fe-fc44-4c06-8e54-c0439f3a693f@lysator.liu.se>
In-Reply-To: <422dd5a6-4ed6-4ae5-b613-b22a7cf6f77d@amd.com>

On 2024-05-02 16:22, Ferruh Yigit wrote:
> On 5/2/2024 6:51 AM, Mattias Rönnblom wrote:
>> On 2024-05-01 18:19, Ferruh Yigit wrote:
>>> On 4/28/2024 4:11 PM, Mattias Rönnblom wrote:
>>>> On 2024-04-26 16:38, Ferruh Yigit wrote:
>>>>> For stats reset, use an offset instead of zeroing out the actual
>>>>> stats values; get_stats() displays the diff between the stats and
>>>>> the offset. This way, stats are only updated in the datapath and
>>>>> the offset is only updated in the stats reset function. This makes
>>>>> the stats reset function more reliable.
>>>>>
>>>>> As stats are only written by a single thread, we can remove the
>>>>> 'volatile' qualifier, which should improve performance in the
>>>>> datapath.
>>>>>
>>>>
>>>> volatile wouldn't help you if you had multiple writers, so that can't
>>>> be the reason for its removal. It would be more accurate to say it
>>>> should be replaced with atomic updates.
>>>> If you don't use volatile and don't use atomics, you have to consider
>>>> whether the compiler can reach the conclusion that it does not need
>>>> to store the counter value for future use *for that thread*; if so,
>>>> the store doesn't actually need to occur. Since DPDK statistics tend
>>>> to work, it's pretty obvious that current compilers tend not to reach
>>>> this conclusion.
>>>>
>>>
>>> Thanks Mattias for clarifying why we need volatile or atomics even
>>> with a single writer.
>>>
>>>> If this should be done 100% properly, the update operation should be
>>>> a non-atomic load, a non-atomic add, and an atomic store. Similarly,
>>>> for the reset, the offset store should be atomic.
>>>>
>>>
>>> ack
>>>
>>>> Considering the state of the rest of the DPDK code base, I think a
>>>> non-atomic, non-volatile solution is also fine.
>>>>
>>>
>>> Yes, this seems to work in practice, but I guess it is better to
>>> follow the above suggestion.
>>>
>>>> (That said, I think we're better off just deprecating stats reset
>>>> altogether, and returning -ENOTSUP here meanwhile.)
>>>>
>>>
>>> As long as reset is reliable (here I mean it resets the stats on every
>>> call) and doesn't impact datapath performance, I am in favor of
>>> continuing with it. Returning "not supported" won't bring more benefit
>>> to users.
>>>
>>>>> Signed-off-by: Ferruh Yigit
>>>>> ---
>>>>> Cc: Mattias Rönnblom
>>>>> Cc: Stephen Hemminger
>>>>> Cc: Morten Brørup
>>>>>
>>>>> This update was triggered by the mailing list discussion [1].
>>>>>
>>>>> [1]
>>>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969f99@lysator.liu.se/
>>>>>
>>>>> v2:
>>>>> * Remove wrapping check for stats
>>>>> ---
>>>>>    drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
>>>>>    1 file changed, 41 insertions(+), 25 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> index 397a32db5886..10c8e1e50139 100644
>>>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>>>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>>>>        uint16_t in_port;
>>>>>        uint8_t vlan_strip;
>>>>>
>>>>> -    volatile unsigned long rx_pkts;
>>>>> -    volatile unsigned long rx_bytes;
>>>>> +    uint64_t rx_pkts;
>>>>> +    uint64_t rx_bytes;
>>>>> +    uint64_t rx_pkts_offset;
>>>>> +    uint64_t rx_bytes_offset;
>>>>
>>>> I suggest you introduce a separate struct for reset-able counters.
>>>> It'll make things cleaner, and you can sneak in atomics without too
>>>> much atomics-related bloat.
>>>>
>>>> struct counter
>>>> {
>>>>      uint64_t count;
>>>>      uint64_t offset;
>>>> };
>>>>
>>>> /../
>>>>      struct counter rx_pkts;
>>>>      struct counter rx_bytes;
>>>> /../
>>>>
>>>> static uint64_t
>>>> counter_value(struct counter *counter)
>>>> {
>>>>      uint64_t count = __atomic_load_n(&counter->count,
>>>>          __ATOMIC_RELAXED);
>>>>      uint64_t offset = __atomic_load_n(&counter->offset,
>>>>          __ATOMIC_RELAXED);
>>>>
>>>>      return count - offset;
>>>> }
>>>>
>>>> static void
>>>> counter_reset(struct counter *counter)
>>>> {
>>>>      uint64_t count = __atomic_load_n(&counter->count,
>>>>          __ATOMIC_RELAXED);
>>>>
>>>>      __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
>>>> }
>>>>
>>>> static void
>>>> counter_add(struct counter *counter, uint64_t operand)
>>>> {
>>>>      __atomic_store_n(&counter->count, counter->count + operand,
>>>>          __ATOMIC_RELAXED);
>>>> }
>>>>
>>>
>>> Ack for a separate struct for reset-able counters.
>>>
>>>> You'd have to port this to <rte_stdatomic.h> calls, which prevents
>>>> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must
>>>> then be replaced with explicit relaxed atomic loads. Otherwise, if
>>>> you just use "counter->count", that would be an atomic load with
>>>> sequential consistency memory order on C11 atomics-based builds,
>>>> which would result in a barrier, at least on weakly ordered machines
>>>> (e.g., ARM).
>>>>
>>>
>>> I am not sure I understand the above.
>>> As the load and add will be non-atomic, why not access them directly,
>>> like: `uint64_t count = counter->count;`?
>>>
>>
>> In case count is _Atomic (i.e., on enable_stdatomic=true builds),
>> "count = counter->count" will imply a memory barrier. On x86_64, I
>> think it will "only" be a compiler barrier (i.e., preventing
>> optimization). On weakly ordered machines, it will result in a barrier
>> instruction (or an instruction-which-is-also-a-barrier, like in the
>> example below).
>>
>> #include <stdatomic.h>
>>
>> int relaxed_load(_Atomic int *p)
>> {
>>     return atomic_load_explicit(p, memory_order_relaxed);
>> }
>>
>> int direct_load(_Atomic int *p)
>> {
>>     return *p;
>> }
>>
>> GCC 13.2, ARM64 ->
>>
>> relaxed_load:
>>         ldr     w0, [x0]
>>         ret
>> direct_load:
>>         ldar    w0, [x0]
>>         ret
>>
>
> Do we need to declare count as '_Atomic'? I wasn't planning to make the
> variable _Atomic, so a plain assignment won't introduce any memory
> barrier.
>

To use atomics in DPDK, the current requirement seems to be to use
RTE_ATOMIC(). That macro expands to _Atomic in enable_stdatomic=true
builds, and to nothing otherwise.

Carefully crafted code using atomics will achieve the same performance
as, and be more correct than, the non-atomic variant. However, in
practice, I think the non-atomic variant is very likely to produce the
desired results.

>
>>> So my understanding is: remove `volatile`, load and add without
>>> atomics, and only use a relaxed-ordered atomic store (to ensure the
>>> value in the register is stored to memory).
>>>
>>
>> Yes, that would be the best option, if the DPDK atomics API allowed
>> its implementation - but it doesn't. At least not if you care about
>> what happens in enable_stdatomic=true builds.
>>
>> The second-best option is to use a rte_memory_order_relaxed atomic
>> load, a regular non-atomic add, and a relaxed atomic store.
>>
>>> I will send a new version of the RFC with the above understanding.
>>>
>>>> I would still use a struct and some helper functions even for the
>>>> less ambitious, non-atomic variant.
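>>>>
>>>> Just to illustrate, a sketch of what that non-atomic variant could
>>>> look like (same struct and names as above; plain accesses, so no
>>>> guarantees against torn loads/stores implied):
>>>>
>>>> static uint64_t
>>>> counter_value(const struct counter *counter)
>>>> {
>>>>      /* plain reads; the value is the diff since the last reset */
>>>>      return counter->count - counter->offset;
>>>> }
>>>>
>>>> static void
>>>> counter_reset(struct counter *counter)
>>>> {
>>>>      counter->offset = counter->count;
>>>> }
>>>>
>>>> static void
>>>> counter_add(struct counter *counter, uint64_t operand)
>>>> {
>>>>      /* plain read-modify-write; single writer assumed */
>>>>      counter->count += operand;
>>>> }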
>>>>
>>>> The only drawback of using GCC built-in atomics here, versus an
>>>> atomic- and volatile-free approach, is that current compilers seem to
>>>> refuse to merge atomic stores. It's beyond me why this is the case.
>>>> If you store to a variable twice in quick succession, it'll be two
>>>> store machine instructions, even in cases where the compiler *knows*
>>>> the value is identical. So you get volatile-like behavior, even
>>>> though you didn't ask for it. Weird.
>>>>
>>>> So if you have a loop, you may want to do a single counter_add() at
>>>> the end, from a temporary, to get the final 0.001% of performance.
>>>>
>>>
>>> ack
>>>
>>> I can't really say which one of the following is better (because of
>>> the store on an empty poll), but I will keep it as it is (b.):
>>>
>>> a.
>>> for (i = 0; i < nb_pkt; i++)
>>>     stats += 1;
>>>
>>> b.
>>> for (i = 0; i < nb_pkt; i++)
>>>     tmp += 1;
>>> stats += tmp;
>>>
>>> c.
>>> for (i = 0; i < nb_pkt; i++)
>>>     tmp += 1;
>>> if (tmp)
>>>     stats += tmp;
>>>
>>>> If the tech board thinks MT-safe reset-able software-managed
>>>> statistics are the future (as opposed to dropping reset support, for
>>>> example), I think this stuff should go into a separate header file,
>>>> so other PMDs can reuse it. Maybe out of scope for this patch.
>>>>
>>>
>>> I don't think we need an MT-safe reset; the patch documenting the
>>> current status is already out.
>>
>> Well, what you are working on is an MT-safe reset, in the sense that
>> it allows one (1) resetting thread to properly synchronize with
>> multiple concurrent counter-updating threads.
>>
>> It's not going to be completely MT-safe, since you can't have two
>> threads calling the reset function in parallel.
>>
>
> This is what I meant by "MT-safe reset": multiple threads are not
> allowed to call stats reset in parallel.
>
> And for the case of multiple concurrent counter-updating threads, the
> suggestion is to stop forwarding.
>
> The above two are the updates I added to the 'rte_eth_stats_reset()'
> API documentation, and I believe we can continue with this restriction
> for the API.
>
>> Any change to the API should make this clear.
>>
>>> For HW, stats reset is already reliable, and for SW drivers the
>>> offset-based approach can make it reliable.
>>>
>>> Unless you explicitly asked for it, I don't think this is on the
>>> agenda of the techboard.
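
For completeness, a sketch of what b. above could look like with the
counter helpers in the receive path (the actual ring handling in
eth_af_packet_rx() is omitted here, and the pkt_rx_queue stats fields
are assumed to have been converted to struct counter):

static uint16_t
eth_af_packet_rx(void *queue, struct rte_mbuf **bufs, uint16_t nb_pkts)
{
    struct pkt_rx_queue *pkt_q = queue;
    uint64_t num_rx_bytes = 0;
    uint16_t num_rx = 0;
    uint16_t i;

    for (i = 0; i < nb_pkts; i++) {
        /* ... dequeue one frame from the ring into bufs[i] ... */
        num_rx++;
        num_rx_bytes += rte_pktmbuf_pkt_len(bufs[i]);
    }

    /* one counter store per burst, rather than one per packet */
    counter_add(&pkt_q->rx_pkts, num_rx);
    counter_add(&pkt_q->rx_bytes, num_rx_bytes);

    return num_rx;
}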