From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 80DCDA0542; Fri, 4 Nov 2022 09:58:42 +0100 (CET) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 5AF7842D10; Fri, 4 Nov 2022 09:58:42 +0100 (CET) Received: from mail.lysator.liu.se (mail.lysator.liu.se [130.236.254.3]) by mails.dpdk.org (Postfix) with ESMTP id 8852A42D0E for ; Fri, 4 Nov 2022 09:58:41 +0100 (CET) Received: from mail.lysator.liu.se (localhost [127.0.0.1]) by mail.lysator.liu.se (Postfix) with ESMTP id 2AF8FFDBF for ; Fri, 4 Nov 2022 09:58:41 +0100 (CET) Received: by mail.lysator.liu.se (Postfix, from userid 1004) id 29696FB7F; Fri, 4 Nov 2022 09:58:41 +0100 (CET) X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on hermod.lysator.liu.se X-Spam-Level: X-Spam-Status: No, score=-1.5 required=5.0 tests=ALL_TRUSTED, AWL, NICE_REPLY_A autolearn=disabled version=3.4.6 X-Spam-Score: -1.5 Received: from [192.168.1.59] (h-62-63-215-114.A163.priv.bahnhof.se [62.63.215.114]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-256) server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mail.lysator.liu.se (Postfix) with ESMTPSA id 823AEFB7E; Fri, 4 Nov 2022 09:58:33 +0100 (CET) Message-ID: Date: Fri, 4 Nov 2022 09:58:33 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.2.2 From: =?UTF-8?Q?Mattias_R=c3=b6nnblom?= Subject: Re: [PATCH v2 2/3] mempool: include non-DPDK threads in statistics To: =?UTF-8?Q?Morten_Br=c3=b8rup?= , olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru, stephen@networkplumber.org, jerinj@marvell.com, bruce.richardson@intel.com Cc: thomas@monjalon.net, dev@dpdk.org References: <20221030115445.2115-1-mb@smartsharesystems.com> <20221031112634.18329-1-mb@smartsharesystems.com> <20221031112634.18329-2-mb@smartsharesystems.com> <6c37816f-4136-772a-1ff2-501eb1d3a244@lysator.liu.se> <98CBD80474FA8B44BF855DF32C47DC35D87473@smartserver.smartshare.dk> <22a82237-3e7c-70bc-c98c-b9fd4a088513@lysator.liu.se> <98CBD80474FA8B44BF855DF32C47DC35D8747D@smartserver.smartshare.dk> Content-Language: en-US In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D8747D@smartserver.smartshare.dk> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Virus-Scanned: ClamAV using ClamSMTP X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org On 2022-11-03 09:59, Morten Brørup wrote: >> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se] >> Sent: Wednesday, 2 November 2022 18.53 >> >> On 2022-11-02 10:09, Morten Brørup wrote: >>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se] >>>> Sent: Wednesday, 2 November 2022 08.53 >>>> >>>> On 2022-10-31 12:26, Morten Brørup wrote: >>>>> Offset the stats array index by one, and count non-DPDK threads at >>>> index >>>>> zero. >>>>> >>>>> This patch provides two benefits: >>>>> * Non-DPDK threads are also included in the statistics. >>>>> * A conditional in the fast path is removed. Static branch >> prediction >>>> was >>>>> correct, so the performance improvement is negligible. >>>>> >>>>> v2: >>>>> * New. No v1 of this patch in the series. >>>>> >>>>> Suggested-by: Stephen Hemminger >>>>> Signed-off-by: Morten Brørup >>>>> --- >>>>> lib/mempool/rte_mempool.c | 2 +- >>>>> lib/mempool/rte_mempool.h | 12 ++++++------ >>>>> 2 files changed, 7 insertions(+), 7 deletions(-) >>>>> >>>>> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c >>>>> index 62d1ce764e..e6208125e0 100644 >>>>> --- a/lib/mempool/rte_mempool.c >>>>> +++ b/lib/mempool/rte_mempool.c >>>>> @@ -1272,7 +1272,7 @@ rte_mempool_dump(FILE *f, struct rte_mempool >>>> *mp) >>>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS >>>>> rte_mempool_ops_get_info(mp, &info); >>>>> memset(&sum, 0, sizeof(sum)); >>>>> - for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) { >>>>> + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE + 1; lcore_id++) { >>>>> sum.put_bulk += mp->stats[lcore_id].put_bulk; >>>>> sum.put_objs += mp->stats[lcore_id].put_objs; >>>>> sum.put_common_pool_bulk += mp- >>>>> stats[lcore_id].put_common_pool_bulk; >>>>> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h >>>>> index 9c4bf5549f..16e7e62e3c 100644 >>>>> --- a/lib/mempool/rte_mempool.h >>>>> +++ b/lib/mempool/rte_mempool.h >>>>> @@ -238,8 +238,11 @@ struct rte_mempool { >>>>> struct rte_mempool_memhdr_list mem_list; /**< List of >> memory >>>> chunks */ >>>>> >>>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS >>>>> - /** Per-lcore statistics. */ >>>>> - struct rte_mempool_debug_stats stats[RTE_MAX_LCORE]; >>>>> + /** Per-lcore statistics. >>>>> + * >>>>> + * Offset by one, to include non-DPDK threads. >>>>> + */ >>>>> + struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1]; >>>>> #endif >>>>> } __rte_cache_aligned; >>>>> >>>>> @@ -304,10 +307,7 @@ struct rte_mempool { >>>>> */ >>>>> #ifdef RTE_LIBRTE_MEMPOOL_STATS >>>>> #define RTE_MEMPOOL_STAT_ADD(mp, name, n) do { >> \ >>>>> - unsigned __lcore_id = rte_lcore_id(); \ >>>>> - if (__lcore_id < RTE_MAX_LCORE) { \ >>>>> - mp->stats[__lcore_id].name += n; \ >>>>> - } \ >>>>> + (mp)->stats[rte_lcore_id() + 1].name += n; \ >>>> >>>> This relies on LCORE_ID_ANY being UINT32_MAX, and a wrap to 0, for >> an >>>> unregistered non-EAL thread? Might be worth a comment, or better a >>>> rewrite with an explicit LCORE_ID_ANY comparison. >>> >>> The purpose of this patch is to avoid the comparison here. >>> >>> Yes, it relies on the wrap to zero, and these conditions: >>> 1. LCORE_ID_ANY being UINT32_MAX, and >>> 2. the return type of rte_lcore_id() being unsigned int, and >>> 3. unsigned int being uint32_t. >>> >>> When I wrote this, I considered it safe to assume that LCORE_ID_ANY >> will remain the unsigned equivalent of -1 using the return type of >> rte_lcore_id(). In other words: If the return type of rte_lcore_id() >> should change from 32 to 64 bit, LCORE_ID_ANY would be updated >> accordingly to UINT64_MAX. >>> >>> Because of this assumption, I didn't use [(rte_lcore_id() + 1) & >> UINT32_MAX], but just [rte_lcore_id() + 1]. >>> >>> I struggled writing an appropriate comment without making it >> unacceptably long, but eventually gave up, and settled for the one-line >> comment in the structure only. >>> >>>> >>>> You anyways need a conditional. An atomic add must be used in the >>>> unregistered EAL thread case. >>> >>> Good point: The "+= n" must be atomic for non-isolated threads. >>> >> >> If the various unregistered non-EAL threads are run on isolated cores >> or not does not matter. > > Agree: They all share the same index, and thus may race, regardless which cores they are using. > > Rephrasing: The "+= n" must be atomic for the unregistered non-EAL threads. > >> >>> I just took a look at how software maintained stats are handled >> elsewhere, and the first thing I found, is the IOAT DMA driver, which >> also seems to be using non-atomic increment [1] regardless if used by a >> DPDK thread or not. >>> >>> [1]: https://elixir.bootlin.com/dpdk/v22.11- >> rc2/source/drivers/dma/ioat/ioat_dmadev.c#L228 >>> >>> However, doing it wrong elsewhere doesn't make it correct doing it >> wrong here too. :-) >>> >>> Atomic increments are costly, so I would rather follow your >> suggestion and reintroduce the comparison. How about: >>> >>> #define RTE_MEMPOOL_STAT_ADD(mp, name, n) do { \ >>> unsigned int __lcore_id = rte_lcore_id(); \ >>> if (likely(__lcore_id < RTE_MAX_LCORE)) { \ >>> (mp)->stats[__lcore_id].name += n; \ >>> } else { >>> rte_atomic64_add( \ >>> (rte_atomic64_t*)&((mp)->stats[RTE_MAX_LCORE].name), >> n);\ >>> } \ >>> } >> You are supposed to use GCC C11 intrinsics (e.g., >> __atomic_fetch_add()). > > Ups. I forgot! > > This should be emphasized everywhere in the rte_atomic library, to prevent such mistakes. > >> >> In addition: technically, you must use an atomic store for the EAL >> thread case (and an atomic load on the reader side), although there are >> tons of examples in DPDK where tearing is ignored. (The generated code >> will likely look the same.) > > The counters are 64 bit aligned, but tearing could happen on 32 bit architectures. > The compiler is free to do it on any architecture, but I'm not sure if it happens much in practice. > My initial reaction to your comment was to do it correctly on the EAL threads too, to avoid tearing there too. However, there might be a performance cost for 32 bit architectures, so I will consider that these are only debug counters, and accept the risk of tearing. > What would that cost consist of? In theory C11 atomics could be implemented "in software" (i.e., without the proper ISA-level guarantees, with locks), but does any of the DPDK-supported compiler/32-bit ISAs actually do so? It's also not obvious that if there, for a certain compiler/ISA-combination, is a performance impact to pay for correctness, correctness would have to give way. >> >> DPDK coding conventions require there be no braces for a single >> statement. > > Will fix. > >> >> Other than that, it looks good. >> >>> >>> And the structure comment could be: >>> * Plus one, for non-DPDK threads. >>> >> >> "Unregistered non-EAL threads". This is the term the EAL documentation >> uses. > > Thank you for clarifying. I didn't follow that discussion, so the new terminology for threads hasn't stuck with me yet. :-) >