From: Mattias Rönnblom
Date: Thu, 11 Aug 2022 13:50:34 +0200
Subject: Re: [RFC v2] non-temporal memcpy
To: Honnappa Nagarahalli, Morten Brørup, Konstantin Ananyev, "dev@dpdk.org", Bruce Richardson
Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach, nd
Message-ID: <04a9ad56-7a47-8dc2-b3b2-e677139dc28d@lysator.liu.se>

On 2022-08-10 23:05, Honnappa Nagarahalli wrote:
>
>
>>
>> +TO: @Honnappa, we need input from ARM
>>
>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>>> Sent: Friday, 29 July 2022 21.49
>>>>
>>>>> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
>>>>> Sent: Friday, 29 July 2022 14.14
>>>>>
>>>>>
>>>>> Sorry, missed that part.
>>>>>
>>>>>>
>>>>>>> Another question - who will do 'sfence' after the copying?
>>>>>>> Would it be inside memcpy_nt (seems quite costly), or would it
>>>>>>> be another API function for that: memcpy_nt_flush() or so?
>>>>>>
>>>>>> Outside. Only the developer knows when it is required, so it
>>>>>> wouldn't make any sense to add the cost inside memcpy_nt().
>>>>>>
>>>>>> I don't think we should add a flush function; it would just be
>>>>>> another name for an already existing function. Referring to the
>>>>>> required operation in the memcpy_nt() function documentation
>>>>>> should suffice.
>>>>>>
>>>>>
>>>>> Ok, but again wouldn't it be arch specific?
>>>>> AFAIK for x86 it needs to boil down to sfence, for other
>>>>> architectures - I don't know.
>>>>> If you think there already is some generic one (rte_wmb?) that
>>>>> would always produce correct instructions - sure let's use it.
>>>>>
>>>>
>>>> DPDK has generic functions to wrap architecture specific stuff like
>>>> memory barriers.
>>>>
>>>> Because they are non-temporal stores, I suspect that rte_mb() is
>>>> required before reading the data from the location it was copied to.
>>>> Ensuring that STORE operations are ordered (rte_wmb) might not
>>>> suffice. However, I'm not a CPU expert, so I will seek advice from
>>>> more qualified people in the community on this.
>>>
>>> I think for IA sfence is enough, see citation below, for other
>>> architectures - no idea.
>>> What I am trying to say - it needs to be the *same* function on all
>>> archs we support.
>>
>> Now I get it: rte_wmb() might be appropriate on x86, but if any other
>> architecture requires something else, we should add a new common
>> function for flushing, e.g. rte_memcpy_nt_flush().
>>
>>>
>>> IA SW optimization manual:
>>> 9.4.2 Streaming Store Usage Models
>>> The two primary usage domains for streaming store are coherent
>>> requests and non-coherent requests.
>>> 9.4.2.1 Coherent Requests
>>> Coherent requests are normal loads and stores to system memory, which
>>> may also hit cache lines present in another processor in a
>>> multiprocessor environment. With coherent requests, a streaming store
>>> can be used in the same way as a regular store that has been mapped
>>> with a WC memory type (PAT or MTRR). An SFENCE instruction must be
>>> used within a producer-consumer usage model in order to ensure
>>> coherency and visibility of data between processors.
>>> Within a single-processor system, the CPU can also re-read the same
>>> memory location and be assured of coherence (that is, a single,
>>> consistent view of this memory location).
>>> The same is true for a multiprocessor
>>> (MP) system, assuming an accepted MP software producer-consumer
>>> synchronization policy is employed.
>>>
>>
>> With this reference, I am convinced that you are right about the
>> SFENCE. This puts a checkmark on this item on my TODO list for the
>> patch. Thank you, Konstantin!
>>
>> Any ARM CPU experts on the mailing list seeing this, not on vacation?
>> @Honnappa, I'm looking at you. :-)
>>
>> Summing up, the question is:
>>
>> After a bunch of *non-temporal* stores (STNP instruction) on ARM
>> architecture, does calling rte_wmb() suffice to ensure the data is
>> visible across the system?
> Apologies for the late response, the docs did not have enough
> information. The internal dialogue is still going on, but I have some
> information now. There is some information in ArmV8 programmer's
> guide [1], though it is not complete.
> In summary, rte_wmb()/rte_mb() would not suffice, we need new APIs.
>
> From my perspective, I see several scenarios:
> 1) Need for ordering before the memcpy_nt. Here there are several cases:
>    a. LD – LDNP/STNP – DMB NSHLD
>    b. ST – LDNP/STNP – DMB NSH
> 2) Need for ordering after the memcpy. Again, we have the similar use cases:
>    a. LDNP/STNP – LD – DMB NSH
>    b. LDNP/STNP – ST – DMB NSH
>
> The 'ST - STNP' and 'STNP - ST' do not apply here, but good to add an
> API for completion.
>
> So, may be we could have rte_[r|w]mb_nt() APIs.
>

Is rte_smp_rmb()/rte_smp_wmb() also not enough on ARM?

> [1] https://developer.arm.com/documentation/den0024/a/The-A64-instruction-set/Memory-access-instructions/Non-temporal-load-and-store-pair
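
To make the barrier placement under discussion concrete, here is a minimal sketch in C of the producer-consumer pattern from the quoted Intel text: a non-temporal copy, then a barrier, then a flag store. All of nt_wmb(), nt_copy(), nt_publish() and nt_consume() are hypothetical names for illustration, not existing DPDK APIs. The x86 branch uses SSE2 intrinsics. The AArch64 branch issues DMB NSH only because that is what the list above suggests for the 'LDNP/STNP - ST' case; whether that is sufficient is exactly the open question. On non-x86 targets the copy is a plain memcpy() placeholder where an STNP-based loop would go.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__)
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical barrier (not a DPDK API): order earlier non-temporal
 * stores before any later store. */
static inline void nt_wmb(void)
{
#if defined(__x86_64__)
    _mm_sfence(); /* per the quoted Intel optimization manual text */
#elif defined(__aarch64__)
    /* Assumption taken from the list in this thread; unconfirmed. */
    __asm__ volatile("dmb nsh" ::: "memory");
#else
    atomic_thread_fence(memory_order_seq_cst); /* conservative fallback */
#endif
}

/* Hypothetical copy: dst must be 16-byte aligned and len a multiple
 * of 16 (simplifications to keep the sketch short). */
static inline void nt_copy(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && (len & 15) == 0);
#if defined(__x86_64__)
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)((const char *)src + i));
        _mm_stream_si128((__m128i *)((char *)dst + i), v); /* streaming store */
    }
#else
    memcpy(dst, src, len); /* placeholder: ordinary stores, not STNP */
#endif
}

/* Producer: copy, fence outside the copy routine, then publish a flag. */
static inline void nt_publish(void *dst, const void *src, size_t len,
                              atomic_int *ready)
{
    nt_copy(dst, src, len);
    nt_wmb();
    atomic_store_explicit(ready, 1, memory_order_release);
}

/* Consumer: read the buffer only after observing the flag.
 * Returns 1 if data was copied to out, 0 if nothing was published yet. */
static inline int nt_consume(void *out, const void *buf, size_t len,
                             atomic_int *ready)
{
    if (!atomic_load_explicit(ready, memory_order_acquire))
        return 0;
    memcpy(out, buf, len);
    return 1;
}
```

The fence sits in nt_publish() rather than in nt_copy(), matching the earlier point in the thread that only the caller knows when it is required.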