Subject: Re: [PATCH v2 2/3] mempool: include non-DPDK threads in statistics
Date: Mon, 7 Nov 2022 08:26:56 +0100
Message-ID: <2a9c1a31-ef82-3542-d5a1-cf73dea73140@lysator.liu.se>
To: Morten Brørup, olivier.matz@6wind.com, andrew.rybchenko@oktetlabs.ru, stephen@networkplumber.org, jerinj@marvell.com, bruce.richardson@intel.com
Cc: thomas@monjalon.net, dev@dpdk.org
References: <20221030115445.2115-1-mb@smartsharesystems.com>
 <20221031112634.18329-1-mb@smartsharesystems.com>
 <20221031112634.18329-2-mb@smartsharesystems.com>
 <6c37816f-4136-772a-1ff2-501eb1d3a244@lysator.liu.se>
 <98CBD80474FA8B44BF855DF32C47DC35D87473@smartserver.smartshare.dk>
 <22a82237-3e7c-70bc-c98c-b9fd4a088513@lysator.liu.se>
 <98CBD80474FA8B44BF855DF32C47DC35D8747D@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D87483@smartserver.smartshare.dk>
From: Mattias Rönnblom
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87483@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

On 2022-11-04 11:01, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Friday, 4 November 2022 09.59
>>
>> On 2022-11-03 09:59, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>> Sent: Wednesday, 2 November 2022 18.53
>>>>
>>>> On 2022-11-02 10:09, Morten Brørup wrote:
>>>>>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>>>>>> Sent: Wednesday, 2 November 2022 08.53
>>>>>>
>>>>>> On 2022-10-31 12:26, Morten Brørup wrote:
>>>>>>> Offset the stats array index by one, and count non-DPDK threads at index zero.
>>>>>>>
>>>>>>> This patch provides two benefits:
>>>>>>> * Non-DPDK threads are also included in the statistics.
>>>>>>> * A conditional in the fast path is removed. Static branch prediction was correct, so the performance improvement is negligible.
>>>>>>>
>>>>>>> v2:
>>>>>>> * New. No v1 of this patch in the series.
>>>>>>>
>>>>>>> Suggested-by: Stephen Hemminger
>>>>>>> Signed-off-by: Morten Brørup
>>>>>>> ---
>>>>>>>  lib/mempool/rte_mempool.c |  2 +-
>>>>>>>  lib/mempool/rte_mempool.h | 12 ++++++------
>>>>>>>  2 files changed, 7 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
>>>>>>> index 62d1ce764e..e6208125e0 100644
>>>>>>> --- a/lib/mempool/rte_mempool.c
>>>>>>> +++ b/lib/mempool/rte_mempool.c
>>>>>>> @@ -1272,7 +1272,7 @@ rte_mempool_dump(FILE *f, struct rte_mempool *mp)
>>>>>>>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>>>>>>  	rte_mempool_ops_get_info(mp, &info);
>>>>>>>  	memset(&sum, 0, sizeof(sum));
>>>>>>> -	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
>>>>>>> +	for (lcore_id = 0; lcore_id < RTE_MAX_LCORE + 1; lcore_id++) {
>>>>>>>  		sum.put_bulk += mp->stats[lcore_id].put_bulk;
>>>>>>>  		sum.put_objs += mp->stats[lcore_id].put_objs;
>>>>>>>  		sum.put_common_pool_bulk += mp->stats[lcore_id].put_common_pool_bulk;
>>>>>>> diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
>>>>>>> index 9c4bf5549f..16e7e62e3c 100644
>>>>>>> --- a/lib/mempool/rte_mempool.h
>>>>>>> +++ b/lib/mempool/rte_mempool.h
>>>>>>> @@ -238,8 +238,11 @@ struct rte_mempool {
>>>>>>>  	struct rte_mempool_memhdr_list mem_list; /**< List of memory chunks */
>>>>>>>
>>>>>>>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>>>>>> -	/** Per-lcore statistics. */
>>>>>>> -	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE];
>>>>>>> +	/** Per-lcore statistics.
>>>>>>> +	 *
>>>>>>> +	 * Offset by one, to include non-DPDK threads.
>>>>>>> +	 */
>>>>>>> +	struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
>>>>>>>  #endif
>>>>>>>  } __rte_cache_aligned;
>>>>>>>
>>>>>>> @@ -304,10 +307,7 @@ struct rte_mempool {
>>>>>>>  	 */
>>>>>>>  #ifdef RTE_LIBRTE_MEMPOOL_STATS
>>>>>>>  #define RTE_MEMPOOL_STAT_ADD(mp, name, n) do { \
>>>>>>> -	unsigned __lcore_id = rte_lcore_id(); \
>>>>>>> -	if (__lcore_id < RTE_MAX_LCORE) { \
>>>>>>> -		mp->stats[__lcore_id].name += n; \
>>>>>>> -	} \
>>>>>>> +	(mp)->stats[rte_lcore_id() + 1].name += n; \
>>>>>>
>>>>>> This relies on LCORE_ID_ANY being UINT32_MAX, and a wrap to 0, for an unregistered non-EAL thread? Might be worth a comment, or better a rewrite with an explicit LCORE_ID_ANY comparison.
>>>>>
>>>>> The purpose of this patch is to avoid the comparison here.
>>>>>
>>>>> Yes, it relies on the wrap to zero, and these conditions:
>>>>> 1. LCORE_ID_ANY being UINT32_MAX, and
>>>>> 2. the return type of rte_lcore_id() being unsigned int, and
>>>>> 3. unsigned int being uint32_t.
>>>>>
>>>>> When I wrote this, I considered it safe to assume that LCORE_ID_ANY will remain the unsigned equivalent of -1 using the return type of rte_lcore_id(). In other words: If the return type of rte_lcore_id() should change from 32 to 64 bit, LCORE_ID_ANY would be updated accordingly to UINT64_MAX.
>>>>>
>>>>> Because of this assumption, I didn't use [(rte_lcore_id() + 1) & UINT32_MAX], but just [rte_lcore_id() + 1].
>>>>>
>>>>> I struggled writing an appropriate comment without making it unacceptably long, but eventually gave up, and settled for the one-line comment in the structure only.
>>>>>
>>>>>>
>>>>>> You anyway need a conditional. An atomic add must be used in the unregistered non-EAL thread case.
>>>>>
>>>>> Good point: The "+= n" must be atomic for non-isolated threads.
>>>>>
>>>>
>>>> Whether the various unregistered non-EAL threads run on isolated cores or not does not matter.
>>>
>>> Agree: They all share the same index, and thus may race, regardless of which cores they are using.
>>>
>>> Rephrasing: The "+= n" must be atomic for the unregistered non-EAL threads.
>>>
>>>>
>>>>> I just took a look at how software-maintained stats are handled elsewhere, and the first thing I found is the IOAT DMA driver, which also seems to be using a non-atomic increment [1], regardless of whether it is used by a DPDK thread or not.
>>>>>
>>>>> [1]: https://elixir.bootlin.com/dpdk/v22.11-rc2/source/drivers/dma/ioat/ioat_dmadev.c#L228
>>>>>
>>>>> However, doing it wrong elsewhere doesn't make it correct to do it wrong here too. :-)
>>>>>
>>>>> Atomic increments are costly, so I would rather follow your suggestion and reintroduce the comparison. How about:
>>>>>
>>>>> #define RTE_MEMPOOL_STAT_ADD(mp, name, n) do { \
>>>>> 	unsigned int __lcore_id = rte_lcore_id(); \
>>>>> 	if (likely(__lcore_id < RTE_MAX_LCORE)) { \
>>>>> 		(mp)->stats[__lcore_id].name += n; \
>>>>> 	} else { \
>>>>> 		rte_atomic64_add( \
>>>>> 			(rte_atomic64_t *)&((mp)->stats[RTE_MAX_LCORE].name), n); \
>>>>> 	} \
>>>>> } while (0)
>>>>
>>>> You are supposed to use GCC C11 intrinsics (e.g., __atomic_fetch_add()).
>>>
>>> Oops. I forgot!
>>>
>>> This should be emphasized everywhere in the rte_atomic library, to prevent such mistakes.
>>>
>>>>
>>>> In addition: technically, you must use an atomic store for the EAL thread case (and an atomic load on the reader side), although there are tons of examples in DPDK where tearing is ignored. (The generated code will likely look the same.)
>>>
>>> The counters are 64 bit aligned, but tearing could happen on 32 bit architectures.
>>>
>>
>> The compiler is free to do it on any architecture, but I'm not sure if it happens much in practice.
>>
>>> My initial reaction to your comment was to do it correctly on the EAL threads too, to avoid tearing there too.
However, there might be a performance cost for 32 bit architectures, so I will consider that these are only debug counters, and accept the risk of tearing.
>>>
>>
>> What would that cost consist of?
>
> I have tested it this morning on Godbolt: https://godbolt.org/z/fz7f3cv8b
>
> ctr += n becomes:
>
>         add     QWORD PTR ctr[rip], rdi
>
> Whereas __atomic_fetch_add(&ctr, n, __ATOMIC_RELAXED) becomes:
>
>         lock add QWORD PTR ctr[rip], rdi
>

Since there is a single writer/producer, only the store needs to be atomic, not the complete load+add+store sequence.

__atomic_store_n(&ctr, ctr + 1, __ATOMIC_RELAXED);

Such a construct will not result in a lock add instruction.

I happen to have just finished a draft version of a chapter on this topic, in the data plane book I'm writing:
https://ericsson.github.io/dataplanebook/stats/stats.html#per-core-counters

I will add a section on the "add an extra instance to deal with unregistered non-EAL threads" approach taken in your patch.

>>
>> In theory C11 atomics could be implemented "in software" (i.e., without the proper ISA-level guarantees, with locks), but does any DPDK-supported compiler/32-bit ISA combination actually do so?
>
> Adding -m32 to the compiler options, ctr += n becomes:
>
>         xor     edx, edx
>         mov     eax, DWORD PTR [esp+4]
>         add     DWORD PTR ctr, eax
>         adc     DWORD PTR ctr+4, edx
>
> And __atomic_fetch_add(&ctr, n, __ATOMIC_RELAXED) becomes:
>
>         push    edi
>         xor     edi, edi
>         push    esi
>         push    ebx
>         sub     esp, 8
>         mov     esi, DWORD PTR [esp+24]
>         mov     eax, DWORD PTR ctr
>         mov     edx, DWORD PTR ctr+4
> .L4:
>         mov     ecx, eax
>         mov     ebx, edx
>         add     ecx, esi
>         adc     ebx, edi
>         mov     DWORD PTR [esp], ecx
>         mov     DWORD PTR [esp+4], ebx
>         mov     ebx, DWORD PTR [esp]
>         mov     ecx, DWORD PTR [esp+4]
>         lock    cmpxchg8b QWORD PTR ctr
>         jne     .L4
>         add     esp, 8
>         pop     ebx
>         pop     esi
>         pop     edi
>
>>
>> It's also not obvious that, if there is a performance cost to pay for correctness on a certain compiler/ISA combination, correctness would have to give way.
>
> In this particular case, and because they are only debug counters, I will argue that DPDK generally uses non-atomic access to 64 bit statistics counters. This was also the case for these counters before this patch.
>
> I am planning to finish v3 of this patch series today, so I hope you can agree to close the atomicity discussion with another argument: Making statistics counters atomic should be elevated to a general patch across DPDK, and not be part of this mempool patch series.
>

As I think I indicated, this is a minor issue. I'm OK with the code as written.

> The v3 patch (which I am working on right now) counts atomically for the unregistered non-EAL threads:
>
> #define RTE_MEMPOOL_STAT_ADD(mp, name, n) do { \
> 	unsigned int __lcore_id = rte_lcore_id(); \
> 	if (likely(__lcore_id < RTE_MAX_LCORE)) \
> 		(mp)->stats[__lcore_id].name += n; \
> 	else \
> 		__atomic_fetch_add(&((mp)->stats[RTE_MAX_LCORE].name), \
> 			n, __ATOMIC_RELAXED); \
> } while (0)
>
> PS: Excellent discussion - thank you for getting involved, Mattias.