DPDK patches and discussions
From: "Morten Brørup" <mb@smartsharesystems.com>
To: "Konstantin Ananyev" <konstantin.ananyev@huawei.com>,
	<dev@dpdk.org>, "Bruce Richardson" <bruce.richardson@intel.com>,
	"Konstantin Ananyev" <konstantin.v.ananyev@yandex.ru>,
	"Vipin Varghese" <vipin.varghese@amd.com>
Cc: "Stephen Hemminger" <stephen@networkplumber.org>
Subject: RE: [PATCH v5] eal/x86: optimize memcpy of small sizes
Date: Mon, 12 Jan 2026 09:02:35 +0100	[thread overview]
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35F6564F@smartserver.smartshare.dk> (raw)
In-Reply-To: <c2bf94e103f64dcaad115e1d3d39d654@huawei.com>

> > > -	/**
> > > -	 * Use the following structs to avoid violating C standard
> > > -	 * alignment requirements and to avoid strict aliasing bugs
> > > -	 */
> > > -	struct __rte_packed_begin rte_uint64_alias {
> > > -		uint64_t val;
> > > -	} __rte_packed_end __rte_may_alias;
> > > -	struct __rte_packed_begin rte_uint32_alias {
> > > -		uint32_t val;
> > > -	} __rte_packed_end __rte_may_alias;
> > > -	struct __rte_packed_begin rte_uint16_alias {
> > > -		uint16_t val;
> > > -	} __rte_packed_end __rte_may_alias;

The discussion about the optimized checksum function [1] has shown us that memcpy() sometimes prevents Clang from optimizing (loop unrolling and vectorizing) and potentially causes strict aliasing bugs with GCC. So I will work on a new patch version that keeps using the above types instead of introducing memcpy() inside rte_memcpy().

[1]: https://inbox.dpdk.org/dev/CAFn2buBzBLFLVN-K=u3MgBEbQ-hqbgJLVpDx3vSXVKJpa0yPNg@mail.gmail.com/
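
For reference, here is roughly how an 8-byte copy looks when going through the alias type instead of memcpy() (the copy8_alias helper name is mine, just for illustration): the load/store is performed through a packed, may_alias struct, so GCC's strict aliasing rules are respected without blocking Clang's vectorization of the surrounding code.

/* Packed + may_alias type, as in the code quoted above. */
struct __rte_packed_begin rte_uint64_alias {
	uint64_t val;
} __rte_packed_end __rte_may_alias;

/* Illustrative helper (not from the patch): copy 8 bytes via the alias type. */
static inline void
copy8_alias(void *dst, const void *src)
{
	((struct rte_uint64_alias *)dst)->val =
			((const struct rte_uint64_alias *)src)->val;
}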

> > > +static __rte_always_inline void
> > > +rte_mov48(uint8_t *dst, const uint8_t *src)
> > > +{
> > > +#if defined RTE_MEMCPY_AVX
> > > +	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > +	rte_mov32((uint8_t *)dst - 32 + 48, (const uint8_t *)src - 32 + 48);
> 
> Just a thought: would the compiler and CPU be smart enough to realize
> that there is no dependency between these 2 ops, and that they can be
> executed in any order?
> Maybe do mov32(); mov16() instead?
> Again, didn't test anything, just a thought.

Good idea.
I simply copied what the existing AVX code did for copying 48 bytes, but I agree with your suggestion.
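
Untested sketch of what the AVX branch could look like with that change (32 + 16 bytes, so the two stores do not overlap):

static __rte_always_inline void
rte_mov48(uint8_t *dst, const uint8_t *src)
{
#if defined RTE_MEMCPY_AVX
	/* Non-overlapping moves; no dependency between the two ops. */
	rte_mov32(dst, src);
	rte_mov16(dst + 32, src + 32);
#else /* SSE implementation, unchanged */
	rte_mov16(dst + 0 * 16, src + 0 * 16);
	rte_mov16(dst + 1 * 16, src + 1 * 16);
	rte_mov16(dst + 2 * 16, src + 2 * 16);
#endif
}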

> 
> > > +#else /* SSE implementation */
> > > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > > +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > > +#endif
> > > +}
> > > +
> > >  /**
> > >   * Copy 64 bytes from one location to another,
> > >   * locations should not overlap.
> > > @@ -172,6 +143,137 @@ rte_mov256(uint8_t *dst, const uint8_t *src)
> > >  	rte_mov128(dst + 1 * 128, src + 1 * 128);
> > >  }
> > >
> > > +/**
> > > + * Copy bytes from one location to another,
> > > + * locations should not overlap.
> > > + * Use with n <= 16.
> > > + *
> > > + * Note: Copying uninitialized memory is perfectly acceptable.
> > > + * Using e.g. memcpy(dst, src, 8) instead of
> > > + * *(unaligned_uint64_t *)dst = *(const unaligned_uint64_t *)src
> > > + * avoids compiler warnings that the source data may be uninitialized
> > > + * [-Wmaybe-uninitialized].
> > > + */
> > > +static __rte_always_inline void *
> > > +rte_mov16_or_less(void *dst, const void *src, size_t n)
> > > +{
> > > +	/* Faster way when size is known at build time. */
> > > +	if (__rte_constant(n)) {
> > > +		if (n == 2)
> > > +			return memcpy(dst, src, 2);
> > > +		if (n == 4)
> > > +			return memcpy(dst, src, 4);
> > > +		if (n == 6) /* 4 + 2 */
> > > +			return memcpy(dst, src, 6);
> > > +		if (n == 8)
> > > +			return memcpy(dst, src, 8);
> > > +		if (n == 10) /* 8 + 2 */
> > > +			return memcpy(dst, src, 10);
> > > +		if (n == 12) /* 8 + 4 */
> > > +			return memcpy(dst, src, 12);
> > > +		if (n == 16) {
> > > +			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > +			return dst;
> > > +		}
> 
> If n is constant, wouldn't the compiler unroll such a memcpy itself?
> Especially for such small (<= 16) values?
> I mean, can't we just do:
> if (n < 16) memcpy(dst, src, n); else rte_mov16(dst, src);

Unfortunately not. For e.g. n == 13, we want to use the overlapping-copies trick, which requires only two 8-byte copy operations instead of three copy operations (8-byte + 4-byte + 1-byte).
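
To illustrate, for n == 13 the runtime path below effectively expands to two overlapping 8-byte copies, covering bytes 0..7 and 5..12:

	memcpy(dst, src, 8);                                      /* bytes 0..7  */
	memcpy((uint8_t *)dst + 5, (const uint8_t *)src + 5, 8);  /* bytes 5..12 */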

> 
> > > +	}
> > > +
> > > +	/*
> > > +	 * Note: Using "n & X" generates 3-byte "test" instructions,
> > > +	 * instead of "n >= X", which would generate 4-byte "cmp" instructions.
> > > +	 */
> > > +	if (n & 0x18) { /* n >= 8, including n == 0x10, hence n & 0x18. */
> > > +		/* Copy 8 ~ 16 bytes. */
> > > +		memcpy(dst, src, 8);
> > > +		memcpy((uint8_t *)dst - 8 + n, (const uint8_t *)src - 8 + n, 8);
> > > +	} else if (n & 0x4) {
> > > +		/* Copy 4 ~ 7 bytes. */
> > > +		memcpy(dst, src, 4);
> > > +		memcpy((uint8_t *)dst - 4 + n, (const uint8_t *)src - 4 + n, 4);
> > > +	} else if (n & 0x2) {
> > > +		/* Copy 2 ~ 3 bytes. */
> > > +		memcpy(dst, src, 2);
> > > +		memcpy((uint8_t *)dst - 2 + n, (const uint8_t *)src - 2 + n, 2);
> > > +	} else if (n & 0x1) {
> > > +		/* Copy 1 byte. */
> > > +		memcpy(dst, src, 1);
> > > +	}
> > > +	return dst;
> > > +}
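
For completeness, a hypothetical sanity check of the bit-test dispatch (a sketch, not part of the patch): for every n <= 16, exactly one branch fires and all n bytes get copied.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
/* plus rte_memcpy.h for rte_mov16_or_less() */

static void
test_mov16_or_less(void)
{
	uint8_t src[16], dst[16];
	size_t n, i;

	for (n = 0; n <= 16; n++) {
		for (i = 0; i < 16; i++) {
			src[i] = (uint8_t)(i + 1);
			dst[i] = 0;
		}
		/* n is not a compile-time constant here, so the runtime
		 * bit-test path is exercised. */
		rte_mov16_or_less(dst, src, n);
		for (i = 0; i < n; i++)
			assert(dst[i] == src[i]);
	}
}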


Thread overview: 27+ messages
2025-11-20 11:45 [PATCH] eal/x86: reduce memcpy code duplication Morten Brørup
2025-11-21 10:35 ` [PATCH v2] eal/x86: optimize memcpy of small sizes Morten Brørup
2025-11-21 16:57   ` Stephen Hemminger
2025-11-21 17:02     ` Bruce Richardson
2025-11-21 17:11       ` Stephen Hemminger
2025-11-21 21:36         ` Morten Brørup
2025-11-21 10:40 ` Morten Brørup
2025-11-21 10:40 ` [PATCH v3] " Morten Brørup
2025-11-24 13:36   ` Morten Brørup
2025-11-24 15:46     ` Patrick Robb
2025-11-28 14:02   ` Konstantin Ananyev
2025-11-28 15:55     ` Morten Brørup
2025-11-28 18:10       ` Konstantin Ananyev
2025-11-29  2:17         ` Morten Brørup
2025-12-01  9:35           ` Konstantin Ananyev
2025-12-01 10:41             ` Morten Brørup
2025-11-24 20:31 ` [PATCH v4] " Morten Brørup
2025-11-25  8:19   ` Morten Brørup
2025-12-01 15:55 ` [PATCH v5] " Morten Brørup
2025-12-03 13:29   ` Morten Brørup
2026-01-03 17:53   ` Morten Brørup
2026-01-09 15:05     ` Varghese, Vipin
2026-01-11 15:52     ` Konstantin Ananyev
2026-01-11 16:01       ` Stephen Hemminger
2026-01-12  8:02       ` Morten Brørup [this message]
2026-01-12 16:00         ` Scott Mitchell
2026-01-12 12:03 ` [PATCH v6] " Morten Brørup
