Message-ID: <44172db2-5f03-e58e-f72c-76eac1cd192c@lysator.liu.se>
Date: Tue, 9 Aug 2022 14:05:12 +0200
Subject: Re: [RFC v2] non-temporal memcpy
From: Mattias Rönnblom
To: Morten Brørup, dev@dpdk.org, Bruce Richardson, Konstantin Ananyev
Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach
References: <98CBD80474FA8B44BF855DF32C47DC35D871D4@smartserver.smartshare.dk>
 <9ac934d2-ad05-6ec9-3bb6-63986d68d5d3@lysator.liu.se>
 <98CBD80474FA8B44BF855DF32C47DC35D87247@smartserver.smartshare.dk>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D87247@smartserver.smartshare.dk>

On 2022-08-09 11:46, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 7 August 2022 22.25
>>
>> On 2022-07-19 17:26, Morten Brørup wrote:
>>> This RFC proposes a set of functions optimized for non-temporal
>>> memory copy.
>>>
>>> At this stage, I am asking for feedback on the concept.
>>>
>>> Applications sometimes copy data to another memory location, which
>>> is only used much later.
>>> In this case, it is inefficient to pollute the data cache with the
>>> copied data.
>>>
>>> An example use case (originating from a real life application):
>>> Copying filtered packets, or the first part of them, into a capture
>>> buffer for offline analysis.
>>>
>>> The purpose of these functions is to achieve a performance gain by
>>> not polluting the cache when copying data.
>>> Although the throughput may be improved by further optimization, I
>>> do not consider throughput optimization relevant initially.
>>>
>>> The x86 non-temporal load instructions have 16 byte alignment
>>> requirements [1], while ARM non-temporal load instructions are
>>> available with 4 byte alignment requirements [2].
>>> Both platforms offer non-temporal store instructions with 4 byte
>>> alignment requirements.
>>>
>>
>> I don't think memcpy() functions should have alignment requirements.
>> That's not very practical, and violates the principle of least
>> surprise.
>
> I didn't make the CPUs with these alignment requirements.
>
> However, I will offer optimized performance in a generic NT memcpy() function in the cases where the individual alignment requirements of various CPUs happen to be met.
>
>>
>> Use normal memcpy() for the unaligned parts, and for the whole thing
>> for small sizes (at least on x86).
>>
>
> I'm not going to plunge into some advanced vector programming, so I'm working on an implementation where misalignment is handled by using a bounce buffer (allocated on the stack, which is probably cache hot anyway).
>

I don't know about the NT load + NT store case, but for regular load + NT store, this is trivial. The implementation I've used is 36 straightforward lines of code.
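
To make the regular load + NT store idea concrete, here is a minimal sketch of one way it could look on x86 with SSE2. It is not the 36-line implementation mentioned above, which is not shown in this thread. The function name memcpy_nt_store() and the NT_MIN_SIZE threshold are hypothetical and untuned, and the sketch uses 16-byte NT stores (which need a 16-byte aligned destination) rather than the 4-byte variant. Unaligned head and tail bytes and small copies go through normal memcpy(), as suggested in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical, untuned threshold: smaller copies use plain memcpy(). */
#define NT_MIN_SIZE 64

/* Hypothetical helper for illustration; not a DPDK API. */
static void *
memcpy_nt_store(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(dst != NULL && src != NULL);

#if defined(__SSE2__)
	if (n >= NT_MIN_SIZE) {
		/* Head: regular copy up to the next 16-byte boundary of dst. */
		size_t head = (size_t)(-(uintptr_t)d & 15);

		memcpy(d, s, head);
		d += head;
		s += head;
		n -= head;

		/* Body: unaligned regular loads, aligned non-temporal stores. */
		while (n >= 16) {
			__m128i v = _mm_loadu_si128((const __m128i *)s);

			_mm_stream_si128((__m128i *)d, v);
			d += 16;
			s += 16;
			n -= 16;
		}

		/* Order the NT stores before any stores that follow. */
		_mm_sfence();
	}
#endif
	/* Tail, small copies, and non-SSE2 builds: regular memcpy(). */
	memcpy(d, s, n);

	return dst;
}
```

The caller sees ordinary memcpy() semantics with no alignment requirements; only the bulk of a large copy bypasses the cache, and only the destination needs aligning because the loads are regular unaligned loads.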