From: Mattias Rönnblom
Date: Sun, 7 Aug 2022 22:25:20 +0200
Subject: Re: [RFC v2] non-temporal memcpy
To: Morten Brørup, dev@dpdk.org, Bruce Richardson, Konstantin Ananyev
Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach
Message-ID: <9ac934d2-ad05-6ec9-3bb6-63986d68d5d3@lysator.liu.se>
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35D871D4@smartserver.smartshare.dk>
References: <98CBD80474FA8B44BF855DF32C47DC35D871D4@smartserver.smartshare.dk>

On 2022-07-19 17:26, Morten Brørup wrote:
> This RFC proposes a set of functions optimized for non-temporal memory copy.
>
> At this stage, I am asking for feedback on the concept.
>
> Applications sometimes copy data to another memory location, which is only used
> much later.
> In this case, it is inefficient to pollute the data cache with the copied
> data.
>
> An example use case (originating from a real life application):
> Copying filtered packets, or the first part of them, into a capture buffer
> for offline analysis.
>
> The purpose of these functions is to achieve a performance gain by not
> polluting the cache when copying data.
> Although the throughput may be improved by further optimization, I do not
> consider throughput optimization relevant initially.
>
> The x86 non-temporal load instructions have 16 byte alignment
> requirements [1], while ARM non-temporal load instructions are available with
> 4 byte alignment requirements [2].
> Both platforms offer non-temporal store instructions with 4 byte alignment
> requirements.
>

I don't think memcpy() functions should have alignment requirements.
That's not very practical, and it violates the principle of least surprise.

Use normal memcpy() for the unaligned parts, and for the entire copy
when the size is small (at least on x86).

> In addition to the primary function without any alignment requirements, we
> also provide functions for respectively 16 and 4 byte aligned access for
> performance purposes.
>
> The function names resemble standard C library function names, but their
> signatures are intentionally different. No need to drag legacy into it.
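
To make the memcpy() fallback suggested above concrete, here is a minimal sketch, not a proposed implementation. The function name `memcpy_nt_sketch` and the cut-off `NT_SMALL_COPY_MAX` are hypothetical and untuned. It assumes SSE2 when `__SSE2__` is defined and degrades to a plain memcpy() otherwise. As a simplification, it uses ordinary unaligned loads and only non-temporal stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical cut-off below which plain memcpy() is used; not tuned. */
#define NT_SMALL_COPY_MAX 64

/* Hypothetical name; not part of the RFC or of DPDK. */
static inline void
memcpy_nt_sketch(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t head;

	/* Small copies: the whole thing goes through memcpy(). */
	if (len < NT_SMALL_COPY_MAX) {
		memcpy(dst, src, len);
		return;
	}

	/* Unaligned head: memcpy() up to the next 16-byte aligned dst. */
	head = (size_t)(-(uintptr_t)d & 15);
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	/* Middle: unaligned loads, non-temporal stores to aligned dst. */
	while (len >= 16) {
		__m128i v = _mm_loadu_si128((const __m128i *)s);

		_mm_stream_si128((__m128i *)d, v);
		d += 16;
		s += 16;
		len -= 16;
	}
	/* Order the streaming stores before subsequent stores. */
	_mm_sfence();

	/* Unaligned tail: fewer than 16 bytes remain. */
	memcpy(d, s, len);
#else
	/* No SSE2 available: fall back to a regular copy. */
	memcpy(dst, src, len);
#endif
}
```

With this shape the caller sees no alignment requirement at all. Alignment only decides how much of the copy takes the streaming path.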
>
> NB: Don't comment on spaces for indentation; a patch will follow DPDK coding
> style and use TAB.
>
> [1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
> [2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-
>
> V2:
> - Only copy from non-temporal source to non-temporal destination.
>   I.e. remove the two variants with only source and/or destination being
>   non-temporal.
> - Do not require alignment.
>   Instead, offer additional 4 and 16 byte aligned functions for performance
>   purposes.
> - Implemented two of the functions for x86.
> - Remove memset function.
>
> Signed-off-by: Morten Brørup
> ---
>
> /**
>  * @warning
>  * @b EXPERIMENTAL: this API may change without prior notice.
>  *
>  * Copy data from non-temporal source to non-temporal destination.
>  *
>  * @param dst
>  *   Pointer to the non-temporal destination of the data.
>  *   Should be 4 byte aligned, for optimal performance.
>  * @param src
>  *   Pointer to the non-temporal source data.
>  *   No alignment requirements.
>  * @param len
>  *   Number of bytes to copy.
>  *   Should be divisible by 4, for optimal performance.
>  */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> /* Implementation T.B.D. */
>
> /**
>  * @warning
>  * @b EXPERIMENTAL: this API may change without prior notice.
>  *
>  * Copy data in blocks of 16 byte from aligned non-temporal source
>  * to aligned non-temporal destination.
>  *
>  * @param dst
>  *   Pointer to the non-temporal destination of the data.
>  *   Must be 16 byte aligned.
>  * @param src
>  *   Pointer to the non-temporal source data.
>  *   Must be 16 byte aligned.
>  * @param len
>  *   Number of bytes to copy.
>  *   Must be divisible by 16.
>  */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>     const void * const end = RTE_PTR_ADD(src, len);
>
>     RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
>     RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
>     RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));
>
>     /* Copy large portion of data. */
>     while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
>         register __m128i xmm0, xmm1, xmm2, xmm3;
>
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
>         xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
>         xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
>         xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
> #pragma GCC diagnostic pop
>         _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
>         _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
>         _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
>         _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
>         src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
>         dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
>     }
>
>     /* Copy remaining data. */
>     while (src != end) {
>         register __m128i xmm;
>
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm = _mm_stream_load_si128(src);
> #pragma GCC diagnostic pop
>         _mm_stream_si128(dst, xmm);
>         src = RTE_PTR_ADD(src, sizeof(__m128i));
>         dst = RTE_PTR_ADD(dst, sizeof(__m128i));
>     }
> }
>
> /**
>  * @warning
>  * @b EXPERIMENTAL: this API may change without prior notice.
>  *
>  * Copy data in blocks of 4 byte from aligned non-temporal source
>  * to aligned non-temporal destination.
>  *
>  * @param dst
>  *   Pointer to the non-temporal destination of the data.
>  *   Must be 4 byte aligned.
>  * @param src
>  *   Pointer to the non-temporal source data.
>  *   Must be 4 byte aligned.
>  * @param len
>  *   Number of bytes to copy.
>  *   Must be divisible by 4.
>  */
> __rte_experimental
> static __rte_always_inline
> __attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
> void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
> {
>     int32_t buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
>     /** Address of source data, rounded down to achieve alignment. */
>     const void * srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
>     /** Address of end of source data, rounded down to achieve alignment. */
>     const void * const srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
>     const int offset = RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
>     register __m128i xmm0;
>
>     RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
>     RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
>     RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));
>
>     if (unlikely(len == 0)) return;
>
>     /* Copy first, non-__m128i aligned, part of source data. */
>     if (offset) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(srca);
>         _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>         switch (offset) {
>             case 1:
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
>                 if (unlikely(len == 1 * sizeof(int32_t))) return;
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
>                 if (unlikely(len == 2 * sizeof(int32_t))) return;
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
>                 break;
>             case 2:
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
>                 if (unlikely(len == 1 * sizeof(int32_t))) return;
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
>                 break;
>             case 3:
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
>                 break;
>         }
>         srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
>         dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
>     }
>
>     /* Copy middle, __m128i aligned, part of source data. */
>     while (srca != srcenda) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(srca);
> #pragma GCC diagnostic pop
>         _mm_store_si128((void *)buf, xmm0);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>         _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
>         srca = RTE_PTR_ADD(srca, sizeof(__m128i));
>         dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
>     }
>
>     /* Copy last, non-__m128i aligned, part of source data. */
>     if (RTE_PTR_DIFF(srca, src) != 4) {
> /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
> #pragma GCC diagnostic push
> #pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
>         xmm0 = _mm_stream_load_si128(srca);
>         _mm_store_si128((void *)buf, xmm0);
> #pragma GCC diagnostic pop
>         switch (offset) {
>             case 1:
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                 break;
>             case 2:
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                 if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>                 break;
>             case 3:
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
>                 if (unlikely(RTE_PTR_DIFF(srca, src) == 1 * sizeof(int32_t))) return;
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
>                 if (unlikely(RTE_PTR_DIFF(srca, src) == 2 * sizeof(int32_t))) return;
>                 _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
>                 break;
>         }
>     }
> }
>
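
The head/middle/tail split in rte_memcpy_nt4a() above is easier to check against a small reference. The sketch below shows how I read the intended decomposition, assuming src and len are both multiples of 4 as the asserts require. The struct and function name `nt4a_split` are hypothetical and not part of the RFC. It only computes how many 4-byte words and 16-byte blocks each phase should handle and copies nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type; not part of the RFC or of DPDK. */
struct nt4a_split {
	size_t head_words; /* 4-byte words before the first 16-byte boundary */
	size_t mid_blocks; /* whole 16-byte blocks in the middle */
	size_t tail_words; /* 4-byte words after the last whole block */
};

/* Compute the expected work per phase for a source address and length. */
static inline struct nt4a_split
nt4a_split(uintptr_t src, size_t len)
{
	struct nt4a_split s;
	/* Bytes from src up to the next 16-byte boundary (0 if aligned). */
	size_t head_bytes = (size_t)(-src & 15);

	/* Same preconditions as the RTE_ASSERT() calls in the RFC code. */
	assert(src % 4 == 0 && len % 4 == 0);

	/* A short copy may end before the boundary is reached. */
	if (head_bytes > len)
		head_bytes = len;

	s.head_words = head_bytes / 4;
	s.mid_blocks = (len - head_bytes) / 16;
	s.tail_words = ((len - head_bytes) % 16) / 4;
	return s;
}
```

For example, a source address ending in 0x4 with len 40 gives 3 head words, 1 middle block and 3 tail words. The tail count depends on len as well as on the source offset, which is worth comparing with the switch (offset) in the last phase above.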