From: Morten Brørup
To: dev@dpdk.org, Bruce Richardson, Konstantin Ananyev
Cc: Jan Viktorin, Ruifeng Wang, David Christensen, Stanislaw Kardach
Subject: [RFC v2] non-temporal memcpy
Date: Tue, 19 Jul 2022 17:26:40 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D871D4@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

This RFC proposes a set of functions optimized for non-temporal memory copy.

At this stage, I am asking for feedback on the concept.

Applications sometimes copy data to another memory location, which is only used much later. In this case, it is inefficient to pollute the data cache with the copied data.

An example use case (originating from a real-life application): copying filtered packets, or the first part of them, into a capture buffer for offline analysis.
The purpose of these functions is to achieve a performance gain by not polluting the cache when copying data.

Although the throughput may be improved by further optimization, I do not consider throughput optimization relevant initially.

The x86 non-temporal load instructions have 16 byte alignment requirements [1], while ARM non-temporal load instructions are available with 4 byte alignment requirements [2]. Both platforms offer non-temporal store instructions with 4 byte alignment requirements.

In addition to the primary function without any alignment requirements, we also provide functions for respectively 16 and 4 byte aligned access, for performance purposes.

The function names resemble standard C library function names, but their signatures are intentionally different. No need to drag legacy into it.

NB: Don't comment on spaces for indentation; a patch will follow DPDK coding style and use TAB.

[1] https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_load
[2] https://developer.arm.com/documentation/100076/0100/A64-Instruction-Set-Reference/A64-Floating-point-Instructions/LDNP--SIMD-and-FP-

V2:
- Only copy from non-temporal source to non-temporal destination.
  I.e. remove the two variants with only source and/or destination being non-temporal.
- Do not require alignment.
  Instead, offer additional 4 and 16 byte aligned functions for performance purposes.
- Implemented two of the functions for x86.
- Remove memset function.

Signed-off-by: Morten Brørup
---

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data from non-temporal source to non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Should be 4 byte aligned, for optimal performance.
 * @param src
 *   Pointer to the non-temporal source data.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Should be divisible by 4, for optimal performance.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
/* Implementation T.B.D. */

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 16 byte from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt16a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
{
    const void * const end = RTE_PTR_ADD(src, len);

    RTE_ASSERT(rte_is_aligned(dst, sizeof(__m128i)));
    RTE_ASSERT(rte_is_aligned(src, sizeof(__m128i)));
    RTE_ASSERT(rte_is_aligned(len, sizeof(__m128i)));

    /* Copy large portion of data. */
    while (RTE_PTR_DIFF(end, src) >= 4 * sizeof(__m128i)) {
        register __m128i xmm0, xmm1, xmm2, xmm3;

        /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter.
         */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(RTE_PTR_ADD(src, 0 * sizeof(__m128i)));
        xmm1 = _mm_stream_load_si128(RTE_PTR_ADD(src, 1 * sizeof(__m128i)));
        xmm2 = _mm_stream_load_si128(RTE_PTR_ADD(src, 2 * sizeof(__m128i)));
        xmm3 = _mm_stream_load_si128(RTE_PTR_ADD(src, 3 * sizeof(__m128i)));
#pragma GCC diagnostic pop
        _mm_stream_si128(RTE_PTR_ADD(dst, 0 * sizeof(__m128i)), xmm0);
        _mm_stream_si128(RTE_PTR_ADD(dst, 1 * sizeof(__m128i)), xmm1);
        _mm_stream_si128(RTE_PTR_ADD(dst, 2 * sizeof(__m128i)), xmm2);
        _mm_stream_si128(RTE_PTR_ADD(dst, 3 * sizeof(__m128i)), xmm3);
        src = RTE_PTR_ADD(src, 4 * sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, 4 * sizeof(__m128i));
    }

    /* Copy remaining data. */
    while (src != end) {
        register __m128i xmm;

        /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm = _mm_stream_load_si128(src);
#pragma GCC diagnostic pop
        _mm_stream_si128(dst, xmm);
        src = RTE_PTR_ADD(src, sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, sizeof(__m128i));
    }
}

/**
 * @warning
 * @b EXPERIMENTAL: this API may change without prior notice.
 *
 * Copy data in blocks of 4 byte from aligned non-temporal source
 * to aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 4 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 4 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 4.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nt4a(void * __rte_restrict dst, const void * __rte_restrict src, size_t len)
{
    int32_t buf[sizeof(__m128i) / sizeof(int32_t)] __rte_aligned(sizeof(__m128i));
    /** Address of source data, rounded down to achieve alignment.
     */
    const void * srca = RTE_PTR_ALIGN_FLOOR(src, sizeof(__m128i));
    /** Address of end of source data, rounded down to achieve alignment. */
    const void * const srcenda = RTE_PTR_ALIGN_FLOOR(RTE_PTR_ADD(src, len), sizeof(__m128i));
    const int offset = RTE_PTR_DIFF(src, srca) / sizeof(int32_t);
    register __m128i xmm0;

    RTE_ASSERT(rte_is_aligned(dst, sizeof(int32_t)));
    RTE_ASSERT(rte_is_aligned(src, sizeof(int32_t)));
    RTE_ASSERT(rte_is_aligned(len, sizeof(int32_t)));

    if (unlikely(len == 0))
        return;

    /* Copy first, non-__m128i aligned, part of source data. */
    if (offset) {
        /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
        _mm_store_si128((void *)buf, xmm0);
#pragma GCC diagnostic pop

        switch (offset) {
        case 1:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[1]);
            if (unlikely(len == 1 * sizeof(int32_t)))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[2]);
            if (unlikely(len == 2 * sizeof(int32_t)))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[3]);
            break;
        case 2:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[2]);
            if (unlikely(len == 1 * sizeof(int32_t)))
                return;
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[3]);
            break;
        case 3:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[3]);
            break;
        }

        srca = RTE_PTR_ADD(srca, (4 - offset) * sizeof(int32_t));
        dst = RTE_PTR_ADD(dst, (4 - offset) * sizeof(int32_t));
    }

    /* Copy middle, __m128i aligned, part of source data. */
    while (srca != srcenda) {
        /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter.
         */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
#pragma GCC diagnostic pop
        _mm_store_si128((void *)buf, xmm0);
        _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
        _mm_stream_si32(RTE_PTR_ADD(dst, 3 * sizeof(int32_t)), buf[3]);
        srca = RTE_PTR_ADD(srca, sizeof(__m128i));
        dst = RTE_PTR_ADD(dst, 4 * sizeof(int32_t));
    }

    /* Copy last, non-__m128i aligned, part of source data. */
    if (RTE_PTR_DIFF(RTE_PTR_ADD(src, len), srca) != 0) {
        /* Number of remaining bytes past the last aligned block (srca == srcenda here). */
        const size_t rem = RTE_PTR_DIFF(RTE_PTR_ADD(src, len), srca);

        /* Note: Workaround for _mm_stream_load_si128() not taking a const pointer as parameter. */
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdiscarded-qualifiers"
        xmm0 = _mm_stream_load_si128(srca);
        _mm_store_si128((void *)buf, xmm0);
#pragma GCC diagnostic pop

        switch (rem / sizeof(int32_t)) {
        case 3:
            _mm_stream_si32(RTE_PTR_ADD(dst, 2 * sizeof(int32_t)), buf[2]);
            /* fallthrough */
        case 2:
            _mm_stream_si32(RTE_PTR_ADD(dst, 1 * sizeof(int32_t)), buf[1]);
            /* fallthrough */
        case 1:
            _mm_stream_si32(RTE_PTR_ADD(dst, 0 * sizeof(int32_t)), buf[0]);
            break;
        }
    }
}