DPDK patches and discussions
* [RFC] non-temporal memory access functions
@ 2022-06-30 12:53 Morten Brørup
From: Morten Brørup @ 2022-06-30 12:53 UTC (permalink / raw)
  To: dev

This RFC proposes a set of functions optimized for non-temporal memory handling.

Applications sometimes copy large amounts of data to another memory location,
which is only used much later.
In this case, it is inefficient to pollute the cache with the copied data.

I have only provided the API, and omitted most of the implementation details.
The implementation is irrelevant if the community disapproves of the concept.

Although the function names resemble standard C library function names,
their signatures are intentionally different. No need to drag legacy into it.

The x86 non-temporal streaming instructions have alignment requirements,
so I suggest we require a minimum of 16 byte alignment.

/**
 * Copy data to 16 byte aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the source data.
 *   No alignment requirements.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2),
		__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ntd16(void * __rte_restrict dst,
		const void * __rte_restrict src,
		size_t len);
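
As a rough illustration of what a body for the declaration above could look like on x86, here is a minimal sketch. It assumes SSE2 is available at compile time and otherwise falls back to memcpy(). The name memcpy_ntd16_sketch() is hypothetical, and placing the store fence inside the function is an assumption; an actual implementation might leave ordering to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; a sketch only, not the proposed rte_memcpy_ntd16(). */
static inline void
memcpy_ntd16_sketch(void *restrict dst, const void *restrict src, size_t len)
{
	/* Preconditions taken from the proposed API contract. */
	assert(((uintptr_t)dst & 15) == 0);
	assert((len & 15) == 0);

#if defined(__SSE2__)
	size_t off;

	for (off = 0; off < len; off += 16) {
		/* Regular unaligned load from the (temporal) source. */
		__m128i v = _mm_loadu_si128(
				(const __m128i *)((const uint8_t *)src + off));
		/* Streaming store (MOVNTDQ); requires 16 byte alignment. */
		_mm_stream_si128((__m128i *)((uint8_t *)dst + off), v);
	}
	/* Assumption: order the streaming stores before returning. */
	_mm_sfence();
#else
	/* Assumed fallback where no streaming store is available. */
	memcpy(dst, src, len);
#endif
}
```

The source is read with an unaligned load because the contract puts no alignment requirement on src; only the streaming store needs the 16 byte alignment.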

/**
 * Copy data from 16 byte aligned non-temporal source.
 *
 * @param dst
 *   Pointer to the destination of the data.
 *   No alignment requirements.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2),
		__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_nts16(void * __rte_restrict dst,
		const void * __rte_restrict src,
		size_t len);
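
For the non-temporal source direction, a comparable sketch could use the SSE4.1 streaming load. The name memcpy_nts16_sketch() is hypothetical, the code assumes SSE4.1 is enabled at compile time, and it falls back to memcpy() otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE4_1__)
#include <smmintrin.h>
#endif

/* Hypothetical name; a sketch only, not the proposed rte_memcpy_nts16(). */
static inline void
memcpy_nts16_sketch(void *restrict dst, const void *restrict src, size_t len)
{
	/* Preconditions taken from the proposed API contract. */
	assert(((uintptr_t)src & 15) == 0);
	assert((len & 15) == 0);

#if defined(__SSE4_1__)
	size_t off;

	for (off = 0; off < len; off += 16) {
		/*
		 * Streaming load (MOVNTDQA); requires 16 byte alignment.
		 * The cast drops const because some compiler headers declare
		 * the parameter as a non-const pointer.
		 */
		__m128i v = _mm_stream_load_si128(
				(__m128i *)((uintptr_t)src + off));
		/* Regular unaligned store to the (temporal) destination. */
		_mm_storeu_si128((__m128i *)((uint8_t *)dst + off), v);
	}
#else
	/* Assumed fallback where no streaming load is available. */
	memcpy(dst, src, len);
#endif
}
```

How much the streaming load actually avoids cache pollution depends on the CPU and the memory type of the source, so this only shows the shape of the loop.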

/**
 * Copy data from 16 byte aligned non-temporal source
 * to 16 byte aligned non-temporal destination.
 *
 * @param dst
 *   Pointer to the non-temporal destination of the data.
 *   Must be 16 byte aligned.
 * @param src
 *   Pointer to the non-temporal source data.
 *   Must be 16 byte aligned.
 * @param len
 *   Number of bytes to copy.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1, 2),
		__access__(write_only, 1, 3), __access__(read_only, 2, 3)))
void rte_memcpy_ntsd16(void * __rte_restrict dst,
		const void * __rte_restrict src,
		size_t len);

/**
 * Fill 16 byte aligned non-temporal memory.
 *
 * @param dst
 *   Pointer to the non-temporal memory.
 *   Must be 16 byte aligned.
 * @param c
 *   Byte to fill with.
 * @param len
 *   Number of bytes to fill.
 *   Must be divisible by 16.
 */
__rte_experimental
static __rte_always_inline
__attribute__((__nonnull__(1), __access__(write_only, 1, 3)))
void rte_memset_nt16(void * dst, unsigned char c, size_t len);
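
A possible body for the fill function, again only as a sketch: the byte is broadcast into a 16 byte vector once and then written with streaming stores. The name memset_nt16_sketch() is hypothetical, SSE2 is assumed at compile time with a memset() fallback, and the trailing fence is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical name; a sketch only, not the proposed rte_memset_nt16(). */
static inline void
memset_nt16_sketch(void *dst, unsigned char c, size_t len)
{
	/* Preconditions taken from the proposed API contract. */
	assert(((uintptr_t)dst & 15) == 0);
	assert((len & 15) == 0);

#if defined(__SSE2__)
	/* Broadcast the fill byte to all 16 lanes once. */
	const __m128i v = _mm_set1_epi8((char)c);
	size_t off;

	for (off = 0; off < len; off += 16)
		_mm_stream_si128((__m128i *)((uint8_t *)dst + off), v);
	/* Assumption: order the streaming stores before returning. */
	_mm_sfence();
#else
	/* Assumed fallback where no streaming store is available. */
	memset(dst, c, len);
#endif
}
```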


In addition to these, we could also provide variants with 64 byte alignment
(and length divisible by 64).
Those would use 64 in their names instead of 16.
E.g. the 64 byte aligned variant of rte_memset_nt16() would be named
rte_memset_nt64().

Personally, I think the 16 byte aligned variants suffice, and 64 byte
aligned variants would be overkill.
Remember, the performance gain is achieved by not polluting the cache.
And the implementation can still use AVX512 instructions for large copies,
if desired.
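
To make the last point concrete, here is a small sketch of how a function with only the 16 byte contract could still split a large copy so that the bulk is handled in 64 byte blocks: a head up to the next 64 byte boundary, a body of whole 64 byte blocks, and a tail. The struct and function names are hypothetical, and only the arithmetic is shown, not the wide stores themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: byte counts for the three phases of a copy. */
struct nt_split {
	size_t head; /* 16 byte steps until dst reaches a 64 byte boundary */
	size_t body; /* whole 64 byte blocks, candidates for wider stores */
	size_t tail; /* remaining 16 byte steps */
};

/* Hypothetical helper; dst_addr and len follow the 16 byte contract. */
static inline struct nt_split
nt_split64_sketch(uintptr_t dst_addr, size_t len)
{
	struct nt_split s;
	size_t to_boundary;

	assert((dst_addr & 15) == 0);
	assert((len & 15) == 0);

	/* Distance to the next 64 byte boundary; 0 if already aligned. */
	to_boundary = (size_t)((64 - (dst_addr & 63)) & 63);
	s.head = to_boundary < len ? to_boundary : len;
	s.body = (len - s.head) & ~(size_t)63;
	s.tail = len - s.head - s.body;
	return s;
}
```

Because both the address and the length are multiples of 16, the head and tail are also multiples of 16, so the 16 byte streaming stores cover them without any byte-wise remainder handling.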


Another thing for discussion is the location of these functions.

Should the memcpy function go into the existing rte_memcpy.h files,
and a new rte_memset.h file be created? Or should a new rte_nt.h file
be created for functions optimized for non-temporal memory?

Personally, I think these belong together with the existing functions,
rather than in a separate "non-temporal memory" library.

When considering this question, remember that other functions using
non-temporal memory access might be added in the future, e.g.:

/**
 * @internal Calculate a sum of all words in the non-temporal buffer.
 * Helper routine for rte_raw_cksum_nt().
 *
 * 16 byte alignment is required when accessing non-temporal memory.
 * If the start of the buffer is not 16 byte aligned, the preceding bytes are
 * also read.
 * If the end of the buffer is not 16 byte aligned, the following bytes are
 * also read.
 * Any such additional bytes are obviously not included in the checksum.
 *
 * @param buf
 *   Pointer to the non-temporal buffer.
 *   No alignment requirements.
 * @param len
 *   Length of the buffer.
 * @return
 *   Sum of all words in the buffer.
 */
__rte_experimental
extern
__attribute__((__nonnull__(1), __access__(read_only, 1, 2)))
uint32_t __rte_raw_cksum_nt(const void *buf, size_t len);
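
The over-read described in the comment above can be sketched as follows: whole 16 byte aligned blocks are loaded, and bytes outside the buffer are skipped before summing. All names are hypothetical, "words" is assumed to mean 16-bit words in host byte order with a trailing odd byte padded with zero, and the aligned load is a plain memcpy() standing in for a non-temporal load. The caller is assumed to guarantee that the surrounding aligned blocks are readable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an aligned 16 byte non-temporal load. */
static inline void
load16_aligned_sketch(uint8_t out[16], const uint8_t *aligned)
{
	memcpy(out, aligned, 16);
}

/* Hypothetical name; a sketch only, not the proposed __rte_raw_cksum_nt(). */
static uint32_t
raw_cksum_nt_sketch(const void *buf, size_t len)
{
	/* Bytes between the preceding 16 byte boundary and buf. */
	size_t lead = (size_t)((uintptr_t)buf & 15);
	const uint8_t *base = (const uint8_t *)buf - lead;
	size_t total = lead + len; /* offset of the buffer end from base */
	uint8_t pair[2];
	unsigned int filled = 0;
	uint32_t sum = 0;
	size_t off, i;

	if (len == 0)
		return 0;

	for (off = 0; off < total; off += 16) {
		uint8_t chunk[16];

		/* May read bytes before and after the buffer. */
		load16_aligned_sketch(chunk, base + off);
		for (i = 0; i < 16; i++) {
			size_t pos = off + i;
			uint16_t w;

			/* Skip the additional bytes outside the buffer. */
			if (pos < lead || pos >= total)
				continue;
			pair[filled++] = chunk[i];
			if (filled == 2) {
				memcpy(&w, pair, sizeof(w));
				sum += w;
				filled = 0;
			}
		}
	}
	if (filled == 1) {
		uint16_t w;

		/* Trailing odd byte, padded with zero. */
		pair[1] = 0;
		memcpy(&w, pair, sizeof(w));
		sum += w;
	}
	return sum;
}
```

The byte-wise inner loop is only there to keep the masking logic readable; a real implementation would presumably mask and sum whole vectors.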


Kind regards,
-Morten Brørup

