Date: Fri, 29 Jul 2022 10:21:41 +0100
Subject: Re: [RFC v2] non-temporal memcpy
To: Morten Brørup, Stanisław Kardach, Honnappa Nagarahalli
Cc: dev, Bruce Richardson, Jan Viktorin, Ruifeng Wang, David Christensen, nd
References: <98CBD80474FA8B44BF855DF32C47DC35D871D4@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D871DB@smartserver.smartshare.dk>
 <262c214b-7870-a221-2621-6684dce42823@yandex.ru>
 <98CBD80474FA8B44BF855DF32C47DC35D871F7@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D871FD@smartserver.smartshare.dk>
 <98CBD80474FA8B44BF855DF32C47DC35D87201@smartserver.smartshare.dk>
From: Konstantin Ananyev
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87201@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

28/07/2022 11:51, Morten Brørup wrote:
> From: Stanisław Kardach [mailto:kda@semihalf.com]
> Sent: Thursday, 28 July 2022 00.02
>> On Wed, 27 Jul 2022, 21:53 Honnappa Nagarahalli, wrote:
>>
>>>>>>> Yes, x86 needs 16B alignment for NT load/stores But that's supposed
>>>>>>> to be arch specific limitation, that we probably want to hide, no?
>>>>>
>>>>> Correct. However, optional hints for optimization purposes will be available.
>>>>> And it is up to the architecture specific implementation to make the best use
>>>>> of these hints, or just ignore them.
>>>>>
>>>>>>> Inside the function can check alignment of both src and dst and decide should it
>>>>>>> use NT load/store instructions or just do normal copy.
>>>>>> IMO, the normal copy should not be done by this API under any
>>>>>> conditions. Why not let the application call memcpy/rte_memcpy
>>>>>> when the NT copy is not applicable? It helps the programmer to understand
>>>>>> and debug the issues much easier.
>>>>>
>>>>> Yes, the programmer must choose between normal memcpy() and non-temporal
>>>>> rte_memcpy_nt(). I am offering new functions, not modifying
>>>>> memcpy() or rte_memcpy().
>>>>>
>>>>> And rte_memcpy_nt() will silently fall back to normal memcpy() if non-temporal
>>>>> copying is unavailable, e.g. on POWER and RISC-V architectures,
>>>>> which don't have NT load/store instructions.
>>>> I am talking about a scenario where the application is being ported
>>>> between architectures. Not everyone knows about the capabilities of
>>>> the architecture. It is better to indicate upfront (ex: compilation
>>>> failures) that a certain feature is not supported on the target
>>>> architecture rather than the user having to discover through painful
>>>> debugging.
>>>
>>> I'm considering rte_memcpy_nt() a performance optimized variant of
>>> memcpy(), where the performance gain is less cache pollution. Thus, silent
>>> fallback to memcpy() should suffice.
>>>
>>> Other architecture differences also affect DPDK performance; the inability to
>>> perform non-temporal load/store just one more to the (undocumented) list.
>>>
>>> Failing at build time if NT load/store is unavailable by the architecture would
>>> prevent the function from being used by other DPDK libraries, e.g. by the
>>> rte_pktmbuf_copy() function used by the pdump library.
>> The other libraries in DPDK need to provide NT versions as the libraries need to cater for not-NT use cases as well. i.e. we cannot hide a NT copy under rte_pktmbuf_copy() API, we need to have rte_pktmbuf_copy_nt()
>
> Yes, it was my intention to provide rte_pktmbuf_copy_nt() as a new function. Some uses of rte_pktmbuf_copy() may benefit from having the copied data in cache.
>
> But there is a ripple effect:
>
> It is also my intention to improve the pdump and pcapng libraries by using rte_pktmbuf_copy_nt() instead of rte_pktmbuf_copy(). These would normally benefit from not polluting the cache.
>
> So the underlying rte_memcpy_nt() function needs a fallback if the architecture doesn't support non-temporal memory copy, now that the pdump and pcapng libraries depend on it.
>
> Alternatively, if rte_memcpy_nt() has no fallback to standard memcpy(), but an application fails to build if the application developer tries to use rte_memcpy_nt(), we would have to modify e.g. pdump_copy() like this:
>
> + #ifdef RTE_CPUFLAG_xxx
> p = rte_pktmbuf_copy_nt(pkts[i], mp, 0, cbs->snaplen);
> + #else
> p = rte_pktmbuf_copy(pkts[i], mp, 0, cbs->snaplen);
> + #endif
>
> Personally, I prefer the fallback inside rte_memcpy_nt(), rather than having to check for it everywhere.

+1 here.
If we are going to introduce rte_memcpy_nt(), I think it had better be
a 'best effort' approach: if it can do NT, great; if not, just fall back
to a normal copy.

>
> The developer using the pdump library will not know if the fallback is inside rte_memcpy_nt() or outside using #ifdef. It is still hidden inside pdump_copy().
>
>>
>>>
>>> I don't oppose to your idea, I just don't have any idea how to reasonably
>>> implement it. So I'm trying to defend why it is not important.
>> I am suggesting that the applications could implement #ifdef depending on the architecture.
>> I assume that it would be a pre-processor flag defined (or not) on DPDK side and application doing #ifdef based on it?
>>
>> Another way to achieve this would be to use #warning directive (see [1]) inside DPDK when the generic fallback is taken.
>>
>> Also isn't the argument on memcpy_nt capability query not a more general one, that is how would/should application query DPDK's capabilities when run or compiled?
>
> Good point! You just solved this part of the puzzle, Stanislaw:
>
> The ability to perform non-temporal memory load/store is a CPU feature.
>
> Applications that need to know if non-temporal memory access is available should check for the appropriate CPU feature flag, e.g. RTE_CPUFLAG_SSE4_1 on x86 architecture. This works both at runtime and at compile time.
>
>>
>> [1] https://gcc.gnu.org/onlinedocs/cpp/Diagnostics.html
>
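
To make the 'best effort' behaviour discussed above concrete, here is a
minimal sketch in C. It is not the RFC's implementation: the function name
memcpy_nt_best_effort() and the NT_COPY_AVAILABLE macro are hypothetical,
and the compile-time test uses the compiler's predefined __SSE2__ macro as
an assumption rather than any DPDK flag. On x86 it uses non-temporal
stores only (loads are ordinary unaligned loads) when the destination is
16-byte aligned; in every other case, including other architectures, it
silently falls back to memcpy().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: __SSE2__ is the compiler's predefined macro, not a DPDK flag. */
#if defined(__SSE2__)
#include <emmintrin.h>
#define NT_COPY_AVAILABLE 1   /* hypothetical macro for this sketch */
#else
#define NT_COPY_AVAILABLE 0
#endif

/*
 * Hypothetical best-effort non-temporal copy (not the DPDK API).
 * Uses NT stores when the build target and the destination alignment
 * allow it, otherwise behaves exactly like memcpy().
 */
static inline void *
memcpy_nt_best_effort(void *dst, const void *src, size_t len)
{
#if NT_COPY_AVAILABLE
	/* x86 NT stores require a 16-byte aligned destination. */
	if (((uintptr_t)dst & 15) == 0) {
		uint8_t *d = dst;
		const uint8_t *s = src;
		size_t n = len & ~(size_t)15;   /* bytes covered by 16B chunks */
		size_t i;

		for (i = 0; i < n; i += 16) {
			/* Ordinary unaligned load, non-temporal store. */
			__m128i v = _mm_loadu_si128((const __m128i *)(s + i));
			_mm_stream_si128((__m128i *)(d + i), v);
		}
		/* Make the NT stores globally visible before returning. */
		_mm_sfence();
		/* Copy the remaining tail (fewer than 16 bytes) normally. */
		memcpy(d + n, s + n, len - n);
		return dst;
	}
#endif
	/* Silent fallback: no NT support, or destination not aligned. */
	return memcpy(dst, src, len);
}
```

With the fallback inside the function, a caller such as pdump_copy() would
need no #ifdef at the call site; the price is that the caller cannot tell
which path was taken.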