From: Morten Brørup
Subject: [RFC] non-temporal memory access functions
Date: Thu, 30 Jun 2022 14:53:45 +0200
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35D87195@smartserver.smartshare.dk>
List-Id: DPDK patches and discussions

This RFC proposes a set of functions optimized for non-temporal memory
handling.

Applications sometimes copy large amounts of data to another memory
location, which is only used much later.
In this case, it is inefficient to pollute the cache with the copied
data.

I have only provided the API, and omitted most of the implementation
details.
The implementation is irrelevant if the community disapproves of the
concept.

Although the function names resemble standard C library function names,
their signatures are intentionally different. No need to drag legacy
into it.
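To make the concept concrete, here is a minimal sketch (not the proposed implementation) of how a copy to a 16 byte aligned non-temporal destination could look on x86 with SSE2 streaming stores. The function name nt_copy_to_aligned16() is hypothetical and not part of DPDK. On targets without SSE2, the sketch simply falls back to memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, for illustration only (not a DPDK function).
 * Copies len bytes from src to dst, bypassing the cache on the store
 * side where SSE2 is available.
 * Assumptions: dst is 16 byte aligned, len is divisible by 16,
 * src has no alignment requirements, and the buffers do not overlap.
 */
static inline void
nt_copy_to_aligned16(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t)dst & 15) == 0);
	assert((len & 15) == 0);
#if defined(__SSE2__)
	const uint8_t *s = src;
	uint8_t *d = dst;
	size_t i;

	for (i = 0; i < len; i += 16) {
		/* Ordinary unaligned load from the source. */
		__m128i v = _mm_loadu_si128((const __m128i *)(s + i));
		/* Non-temporal store; requires a 16 byte aligned address. */
		_mm_stream_si128((__m128i *)(d + i), v);
	}
	/* Order the streaming stores before any subsequent stores. */
	_mm_sfence();
#else
	/* No streaming stores available in this sketch: plain copy. */
	memcpy(dst, src, len);
#endif
}
```

A caller passes a 16 byte aligned destination and a length divisible by 16. The source may be unaligned, which is why the sketch uses an unaligned load paired with a streaming store.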

The x86 non-temporal streaming instructions have alignment requirements,
so I suggest we require minimum 16 byte alignment.

/**
 * Copy data to 16 byte aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the source data.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
	__access__(read_only, 2, 3)))
void rte_memcpy_ntd16(void * __rte_restrict dst,
	const void * __rte_restrict src, size_t len);

/**
 * Copy data from 16 byte aligned non-temporal source.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   No alignment requirements.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
	__access__(read_only, 2, 3)))
void rte_memcpy_nts16(void * __rte_restrict dst,
	const void * __rte_restrict src, size_t len);

/**
 * Copy data from 16 byte aligned non-temporal source
 * to 16 byte aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2), __access__(write_only, 1, 3),
	__access__(read_only, 2, 3)))
void rte_memcpy_ntsd16(void * __rte_restrict dst,
	const void * __rte_restrict src, size_t len);

/**
 * Fill 16 byte aligned non-temporal memory.
 *
 * @param dst
 *   Pointer to the non-temporal memory.
 *   Must be 16 byte aligned.
 * @param c
 *   Byte to fill with.
 * @param len
 *   Number of bytes to fill.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1), __access__(write_only, 1, 3)))
void rte_memset_nt16(void * dst, unsigned char c, size_t len);

In addition to these, we could also provide variants with 64 byte
alignment (and length divisible by 64). Those would use 64 in their
names instead of 16. E.g. the name of the 64 byte aligned variant
would be rte_memset_nt64().

Personally, I think the 16 byte aligned variants suffice, and 64 byte
aligned variants would be overkill. Remember, the performance gain is
achieved by not polluting the cache. And the implementation can still
use AVX512 instructions for large copies, if desired.

Another thing for discussion is the location of these functions.
Should the memcpy functions go into the existing rte_memcpy.h files, and
a new rte_memset.h file be created?
Or should a new rte_nt.h file be created for functions optimized for
non-temporal memory?

Personally, I think these belong together with the existing functions,
rather than in a separate "non-temporal memory" library.

When considering this question, remember that other functions using
non-temporal memory access might be added in the future, e.g.:

/**
 * @internal Calculate a sum of all words in the non-temporal buffer.
 * Helper routine for the rte_raw_cksum_nt().
 *
 * 16 byte alignment is required when accessing non-temporal memory.
 * If the buffer is not 16 byte aligned, the preceding bytes are also
 * read.
 * If the end of the buffer is not 16 byte aligned, the following bytes
 * are also read.
 * Any such additional bytes are obviously not included in the checksum.
 *
 * @param buf
 *   Pointer to the non-temporal buffer.
 *   No alignment requirements.
 * @param len
 *   Length of the buffer.
 * @return
 *   Sum of all words in the buffer.
 */
__rte_experimental
extern
__attribute__((__nonnull__(1), __access__(read_only, 1, 2)))
uint32_t __rte_raw_cksum_nt(const void *buf, size_t len);


Med venlig hilsen / Kind regards,
-Morten Brørup