Message-ID: <730193b1-9574-ff59-28be-c1449cba0ffc@lysator.liu.se>
Date: Mon, 10 Oct 2022 10:58:57 +0200
Subject: Re: [PATCH] eal: non-temporal memcpy
From: Mattias Rönnblom
To: Morten Brørup, konstantin.v.ananyev@yandex.ru, Honnappa.Nagarahalli@arm.com, stephen@networkplumber.org
Cc: mattias.ronnblom@ericsson.com, bruce.richardson@intel.com, kda@semihalf.com, drc@linux.vnet.ibm.com, dev@dpdk.org
References: <98CBD80474FA8B44BF855DF32C47DC35D8728A@smartserver.smartshare.dk> <20221006203426.78743-1-mb@smartsharesystems.com> <98CBD80474FA8B44BF855DF32C47DC35D873BC@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D873BC@smartserver.smartshare.dk>

On 2022-10-10 09:35, Morten Brørup wrote:
> Mattias, Konstantin, Honnappa, Stephen,
>
> In my patch for non-temporal memcpy, I have been aiming for using as much non-temporal store as possible. E.g. copying 16 byte to a 16 byte aligned address will be done using non-temporal store instructions.
>
> Now, I am seriously considering this alternative:
>
> Only using non-temporal stores for complete cache lines, and using normal stores for partial cache lines.
>

This is how I've done it in the past, in DPDK applications. That was both to simplify (and potentially optimize) the code somewhat, and because I doubted there was any actual benefit from using non-temporal stores for the beginning or the end of the memory block.

That latter reason, however, was pure conjecture. I think it would be great if Intel, ARM, AMD, IBM etc. DPDK developers could dig into the manuals, or go find the appropriate CPU expert, to find out if that is true.

More specifically, my question is:

A) Consider a scenario where a core does a regular store against some cache line, and then pretty much immediately does a non-temporal store against a different address in the same cache line. How will this cache line be treated?

B) Consider the same scenario, but where no regular stores preceded (or followed) the non-temporal store, and the non-temporal stores performed did not cover the entirety of the cache line.
Scenario A) would be common at the beginning of the copy, in case there's a header preceding the data, and writing that header non-temporally might be cumbersome.

Scenario B) would be common at the end of the copy.

Both scenarios assume copies of memory blocks which are not cache-line aligned.

> I think it will make things simpler when an application mixes normal and non-temporal stores. E.g. an application writing metadata (a pcap header) followed by packet data.
>

The application *could* use NT stores for the pcap header as well. I haven't reviewed v3 of your patch, but in some earlier patch you did not use the movnti instruction to make smaller (< 16 bytes) stores.

> The disadvantage is that copying a burst of 32 packets, will - in the worst case - pollute 64 cache lines (one at the start plus one at the end of the copied data), i.e. 4 KiB of data cache. If copying to a consecutive memory area, e.g. a packet capture buffer, it will pollute 33 cache lines (because the start of packet #2 is in the same cache line as the end of packet #1, etc.).
>
> What do you think?
>

For large copies, which I'm guessing is what non-temporal stores are usually used for, this is hair splitting. For DPDK applications, it might well be at least somewhat relevant, because such an application may make an enormous number of copies, each roughly the size of a packet.

If we had a rte_memcpy_ex() that only cared about copying whole cache lines in an NT manner, the application could add a clflushopt (or the equivalent) after the copy, flushing the beginning and end cache lines of the destination buffer.

>
> PS: Non-temporal loads are easy to work with, so don't worry about that.
>
>
> Med venlig hilsen / Kind regards,
> -Morten Brørup
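
For illustration, here is a minimal sketch of the alternative discussed above: regular stores for the partial cache lines at the start and end of the destination, and non-temporal stores only for whole cache lines. It is not the code from the patch. The names copy_line_nt() and copy_nt_whole_lines() are hypothetical, the 64-byte cache line size is an assumption, and the non-temporal path uses SSE2 intrinsics where available, falling back to a plain memcpy() elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CACHE_LINE 64u /* assumption: 64-byte cache lines */

/* Copy one whole cache line. dst must be cache-line aligned.
 * Uses SSE2 non-temporal stores where available, else a regular copy. */
static void copy_line_nt(void *dst, const void *src)
{
    assert((uintptr_t)dst % CACHE_LINE == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < CACHE_LINE; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)((const char *)src + i));
        _mm_stream_si128((__m128i *)((char *)dst + i), v);
    }
#else
    memcpy(dst, src, CACHE_LINE);
#endif
}

/* Hypothetical helper: regular stores for the partial head and tail
 * cache lines, non-temporal stores for the whole lines in between.
 * Returns how many destination cache lines were written with regular
 * stores (0, 1 or 2). */
static unsigned copy_nt_whole_lines(void *dst, const void *src, size_t len)
{
    char *d = dst;
    const char *s = src;
    unsigned partial = 0;

    /* Head: bytes up to the next cache line boundary of dst. */
    size_t head = (CACHE_LINE - (uintptr_t)d % CACHE_LINE) % CACHE_LINE;
    if (head > len)
        head = len;
    if (head > 0) {
        memcpy(d, s, head);
        d += head;
        s += head;
        len -= head;
        partial++;
    }

    /* Body: whole, aligned destination cache lines. */
    for (; len >= CACHE_LINE; len -= CACHE_LINE) {
        copy_line_nt(d, s);
        d += CACHE_LINE;
        s += CACHE_LINE;
    }

    /* Tail: whatever is left in the last, partially covered line. */
    if (len > 0) {
        memcpy(d, s, len);
        partial++;
    }

#if defined(__SSE2__)
    _mm_sfence(); /* order the NT stores before subsequent stores */
#endif
    return partial;
}
```

Each call writes at most two destination cache lines with regular stores, which is where the worst case of 64 lines (4 KiB) for a burst of 32 separately placed packets comes from. How the hardware treats those lines in scenarios A) and B) remains the open question; the sketch only shows the split.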