DPDK patches and discussions
* [PATCH v12 0/3] net: optimize raw checksum computation
@ 2026-01-10  1:56 scott.k.mitch1
  2026-01-10  1:56 ` [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests scott.k.mitch1
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: scott.k.mitch1 @ 2026-01-10  1:56 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott Mitchell

From: Scott Mitchell <scott.k.mitch1@gmail.com>

This series optimizes __rte_raw_cksum() by replacing memcpy-based access
with unaligned_uint16_t pointer access, enabling vectorization in both
GCC and Clang. The series is split into three patches to clearly separate
the core optimization from compiler-specific workarounds.

Performance improvement from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x

Changes in v12:
- Split into 3-patch series per reviewer feedback
- Patch 1/3: Core optimization and test additions
- Patch 2/3: UBSAN alignment workaround (separate from GCC bug)
- Patch 3/3: GCC optimization bug workaround
- Reverted len & 1 to len % 2 and restored unlikely() per feedback
- Renamed RTE_SUPPRESS_UNINITIALIZED_WARNING to RTE_FORCE_INIT_BARRIER
- Applied minimal changes (no refactoring) to existing code
- Deferred hinic driver refactoring to future series

Note: Patch 1/3 will trigger compiler warnings/failures on GCC versions
with the optimization bug (GCC 11.5.0 and others seen on DPDK CI). These
are resolved by patches 2/3 and 3/3.

Scott Mitchell (3):
  net: optimize __rte_raw_cksum and add tests
  eal: add workaround for UBSAN alignment false positive
  eal/net: add workaround for GCC optimization bug

 app/test/meson.build             |   1 +
 app/test/test_cksum_fuzz.c       | 240 +++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c       |   2 +-
 drivers/net/hinic/hinic_pmd_tx.c |   2 +
 drivers/net/mlx5/mlx5_flow_dv.c  |   2 +
 lib/eal/include/rte_common.h     |  23 +++
 lib/net/rte_cksum.h              |  15 +-
 lib/net/rte_ip4.h                |   1 +
 lib/net/rte_ip6.h                |   1 +
 9 files changed, 277 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

--
2.39.5 (Apple Git-154)



* [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests
  2026-01-10  1:56 [PATCH v12 0/3] net: optimize raw checksum computation scott.k.mitch1
@ 2026-01-10  1:56 ` scott.k.mitch1
  2026-01-10  2:28   ` Scott Mitchell
  2026-01-10 14:47   ` Morten Brørup
  2026-01-10  1:56 ` [PATCH v12 2/3] eal: add workaround for UBSAN alignment false positive scott.k.mitch1
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 9+ messages in thread
From: scott.k.mitch1 @ 2026-01-10  1:56 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott, Scott Mitchell

From: Scott <scott_mitchell@apple.com>

__rte_raw_cksum uses a loop with memcpy on each iteration.
GCC 15+ is able to vectorize the loop but Clang 18.1 is not.
Replacing the memcpy with unaligned_uint16_t pointer access enables
both GCC and Clang to vectorize with SSE/AVX/AVX-512.

This patch adds comprehensive fuzz testing and updates the performance
test to measure the optimization impact.

Performance results from cksum_perf_autotest on Intel Xeon
(Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):

  Block size    Before    After    Improvement
         100      0.40     0.24        ~40%
        1500      0.50     0.06        ~8x
        9000      0.49     0.06        ~8x

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 app/test/meson.build       |   1 +
 app/test/test_cksum_fuzz.c | 240 +++++++++++++++++++++++++++++++++++++
 app/test/test_cksum_perf.c |   2 +-
 lib/net/rte_cksum.h        |  14 +--
 4 files changed, 247 insertions(+), 10 deletions(-)
 create mode 100644 app/test/test_cksum_fuzz.c

diff --git a/app/test/meson.build b/app/test/meson.build
index efec42a6bf..c92325ad58 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -38,6 +38,7 @@ source_file_deps = {
     'test_byteorder.c': [],
     'test_cfgfile.c': ['cfgfile'],
     'test_cksum.c': ['net'],
+    'test_cksum_fuzz.c': ['net'],
     'test_cksum_perf.c': ['net'],
     'test_cmdline.c': [],
     'test_cmdline_cirbuf.c': [],
diff --git a/app/test/test_cksum_fuzz.c b/app/test/test_cksum_fuzz.c
new file mode 100644
index 0000000000..839861f57d
--- /dev/null
+++ b/app/test/test_cksum_fuzz.c
@@ -0,0 +1,240 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2026 Apple Inc.
+ */
+
+#include <stdio.h>
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_hexdump.h>
+#include <rte_cksum.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+#include "test.h"
+
+/*
+ * Fuzz test for __rte_raw_cksum optimization.
+ * Compares the optimized implementation against the original reference
+ * implementation across random data of various lengths.
+ */
+
+#define DEFAULT_ITERATIONS 1000
+#define MAX_TEST_LEN 65536  /* 64K to match GRO frame sizes */
+
+/*
+ * Original (reference) implementation of __rte_raw_cksum from DPDK v23.11.
+ * This is retained here for comparison testing against the optimized version.
+ */
+static inline uint32_t
+__rte_raw_cksum_reference(const void *buf, size_t len, uint32_t sum)
+{
+	const void *end;
+
+	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
+	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
+		uint16_t v;
+
+		memcpy(&v, buf, sizeof(uint16_t));
+		sum += v;
+	}
+
+	/* if length is odd, keeping it byte order independent */
+	if (unlikely(len % 2)) {
+		uint16_t left = 0;
+
+		memcpy(&left, end, 1);
+		sum += left;
+	}
+
+	return sum;
+}
+
+static void
+init_random_buffer(uint8_t *buf, size_t len)
+{
+	size_t i;
+
+	for (i = 0; i < len; i++)
+		buf[i] = (uint8_t)rte_rand();
+}
+
+static inline uint32_t
+get_initial_sum(bool random_initial_sum)
+{
+	return random_initial_sum ? (rte_rand() & 0xFFFFFFFF) : 0;
+}
+
+/*
+ * Test a single buffer length with specific alignment and initial sum
+ */
+static int
+test_cksum_fuzz_length_aligned(size_t len, bool aligned, uint32_t initial_sum)
+{
+	uint8_t *data;
+	uint8_t *buf;
+	size_t alloc_size;
+	uint32_t sum_ref, sum_opt;
+
+	if (len == 0 && !aligned) {
+		/* Skip unaligned test for zero length - nothing to test */
+		return TEST_SUCCESS;
+	}
+
+	/* Allocate exact size for aligned, +1 for unaligned offset */
+	alloc_size = aligned ? len : len + 1;
+	if (alloc_size == 0)
+		alloc_size = 1;  /* rte_malloc doesn't like 0 */
+
+	data = rte_malloc(NULL, alloc_size, 64);
+	if (data == NULL) {
+		printf("Failed to allocate %zu bytes\n", alloc_size);
+		return TEST_FAILED;
+	}
+
+	buf = aligned ? data : (data + 1);
+
+	init_random_buffer(buf, len);
+
+	sum_ref = __rte_raw_cksum_reference(buf, len, initial_sum);
+	sum_opt = __rte_raw_cksum(buf, len, initial_sum);
+
+	if (sum_ref != sum_opt) {
+		printf("MISMATCH at len=%zu aligned='%s' initial_sum=0x%08x ref=0x%08x opt=0x%08x\n",
+		       len, aligned ? "aligned" : "unaligned",
+		       initial_sum, sum_ref, sum_opt);
+		rte_hexdump(stdout, "failing buffer", buf, len);
+		rte_free(data);
+		return TEST_FAILED;
+	}
+
+	rte_free(data);
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test a length with both alignments
+ */
+static int
+test_cksum_fuzz_length(size_t len, uint32_t initial_sum)
+{
+	int rc;
+
+	/* Test aligned */
+	rc = test_cksum_fuzz_length_aligned(len, true, initial_sum);
+	if (rc != TEST_SUCCESS)
+		return rc;
+
+	/* Test unaligned */
+	rc = test_cksum_fuzz_length_aligned(len, false, initial_sum);
+
+	return rc;
+}
+
+/*
+ * Test specific edge case lengths
+ */
+static int
+test_cksum_fuzz_edge_cases(void)
+{
+	/* Edge case lengths that might trigger bugs */
+	static const size_t edge_lengths[] = {
+		0, 1, 2, 3, 4, 5, 6, 7, 8,
+		15, 16, 17,
+		31, 32, 33,
+		63, 64, 65,
+		127, 128, 129,
+		255, 256, 257,
+		511, 512, 513,
+		1023, 1024, 1025,
+		1500, 1501,  /* MTU boundaries */
+		2047, 2048, 2049,
+		4095, 4096, 4097,
+		8191, 8192, 8193,
+		16383, 16384, 16385,
+		32767, 32768, 32769,
+		65534, 65535, 65536  /* 64K GRO boundaries */
+	};
+	unsigned int i;
+	int rc;
+
+	printf("Testing edge case lengths...\n");
+
+	for (i = 0; i < RTE_DIM(edge_lengths); i++) {
+		/* Test with zero initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], 0);
+		if (rc != TEST_SUCCESS)
+			return rc;
+
+		/* Test with random initial sum */
+		rc = test_cksum_fuzz_length(edge_lengths[i], get_initial_sum(true));
+		if (rc != TEST_SUCCESS)
+			return rc;
+	}
+
+	return TEST_SUCCESS;
+}
+
+/*
+ * Test random lengths with optional random initial sums
+ */
+static int
+test_cksum_fuzz_random(unsigned int iterations, bool random_initial_sum)
+{
+	unsigned int i;
+	int rc;
+
+	printf("Testing random lengths (0-%d)%s...\n", MAX_TEST_LEN,
+	       random_initial_sum ? " with random initial sums" : "");
+
+	for (i = 0; i < iterations; i++) {
+		size_t len = rte_rand() % (MAX_TEST_LEN + 1);
+
+		rc = test_cksum_fuzz_length(len, get_initial_sum(random_initial_sum));
+		if (rc != TEST_SUCCESS) {
+			printf("Failed at len=%zu\n", len);
+			return rc;
+		}
+	}
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_cksum_fuzz(void)
+{
+	int rc;
+	unsigned int iterations = DEFAULT_ITERATIONS;
+	printf("### __rte_raw_cksum optimization fuzz test ###\n");
+	printf("Iterations per test: %u\n\n", iterations);
+
+	/* Test edge cases */
+	rc = test_cksum_fuzz_edge_cases();
+	if (rc != TEST_SUCCESS) {
+		printf("Edge case test FAILED\n");
+		return rc;
+	}
+	printf("Edge case test PASSED\n\n");
+
+	/* Test random lengths with zero initial sum */
+	rc = test_cksum_fuzz_random(iterations, false);
+	if (rc != TEST_SUCCESS) {
+		printf("Random length test FAILED\n");
+		return rc;
+	}
+	printf("Random length test PASSED\n\n");
+
+	/* Test random lengths with random initial sums */
+	rc = test_cksum_fuzz_random(iterations, true);
+	if (rc != TEST_SUCCESS) {
+		printf("Random initial sum test FAILED\n");
+		return rc;
+	}
+	printf("Random initial sum test PASSED\n\n");
+
+	printf("All fuzz tests PASSED!\n");
+	return TEST_SUCCESS;
+}
+
+REGISTER_FAST_TEST(cksum_fuzz_autotest, true, true, test_cksum_fuzz);
diff --git a/app/test/test_cksum_perf.c b/app/test/test_cksum_perf.c
index 0b919cd59f..6b1d4589e0 100644
--- a/app/test/test_cksum_perf.c
+++ b/app/test/test_cksum_perf.c
@@ -15,7 +15,7 @@
 #define NUM_BLOCKS 10
 #define ITERATIONS 1000000
 
-static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501 };
+static const size_t data_sizes[] = { 20, 21, 100, 101, 1500, 1501, 9000, 9001, 65536, 65537 };
 
 static __rte_noinline uint16_t
 do_rte_raw_cksum(const void *buf, size_t len)
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index a8e8927952..8dbe1179e8 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -42,15 +42,11 @@ extern "C" {
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-	const void *end;
-
-	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
-	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
-		uint16_t v;
-
-		memcpy(&v, buf, sizeof(uint16_t));
-		sum += v;
-	}
+	/* Process uint16 chunks to preserve overflow/carry math. GCC/Clang vectorize the loop. */
+	const unaligned_uint16_t *buf16 = (const unaligned_uint16_t *)buf;
+	const unaligned_uint16_t *end = buf16 + (len / sizeof(uint16_t));
+	for (; buf16 != end; buf16++)
+		sum += *buf16;
 
 	/* if length is odd, keeping it byte order independent */
 	if (unlikely(len % 2)) {
-- 
2.39.5 (Apple Git-154)



* [PATCH v12 2/3] eal: add workaround for UBSAN alignment false positive
  2026-01-10  1:56 [PATCH v12 0/3] net: optimize raw checksum computation scott.k.mitch1
  2026-01-10  1:56 ` [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests scott.k.mitch1
@ 2026-01-10  1:56 ` scott.k.mitch1
  2026-01-10 15:02   ` Morten Brørup
  2026-01-10  1:56 ` [PATCH v12 3/3] eal/net: add workaround for GCC optimization bug scott.k.mitch1
  2026-01-10 16:59 ` [PATCH v12 0/3] net: optimize raw checksum computation Stephen Hemminger
  3 siblings, 1 reply; 9+ messages in thread
From: scott.k.mitch1 @ 2026-01-10  1:56 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott, Scott Mitchell

From: Scott <scott_mitchell@apple.com>

The optimized __rte_raw_cksum() uses unaligned_uint16_t pointer access
which triggers UBSAN alignment warnings even though the access is safe
due to the unaligned type definition.

Add __rte_no_ubsan_alignment attribute to suppress these false positive
warnings while preserving other UBSAN checks.

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 lib/eal/include/rte_common.h | 9 +++++++++
 lib/net/rte_cksum.h          | 1 +
 2 files changed, 10 insertions(+)

diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 9e7d84f929..00d428e295 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -546,6 +546,15 @@ static void __attribute__((destructor(RTE_PRIO(prio)), used)) func(void)
 #define __rte_no_asan
 #endif
 
+/**
+ * Disable UndefinedBehaviorSanitizer alignment check on some code
+ */
+#if defined(RTE_CC_CLANG) || defined(RTE_CC_GCC)
+#define __rte_no_ubsan_alignment __attribute__((no_sanitize("alignment")))
+#else
+#define __rte_no_ubsan_alignment
+#endif
+
 /*********** Macros for pointer arithmetic ********/
 
 /**
diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
index 8dbe1179e8..9a7fda563d 100644
--- a/lib/net/rte_cksum.h
+++ b/lib/net/rte_cksum.h
@@ -39,6 +39,7 @@ extern "C" {
  * @return
  *   sum += Sum of all words in the buffer.
  */
+__rte_no_ubsan_alignment
 static inline uint32_t
 __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
 {
-- 
2.39.5 (Apple Git-154)



* [PATCH v12 3/3] eal/net: add workaround for GCC optimization bug
  2026-01-10  1:56 [PATCH v12 0/3] net: optimize raw checksum computation scott.k.mitch1
  2026-01-10  1:56 ` [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests scott.k.mitch1
  2026-01-10  1:56 ` [PATCH v12 2/3] eal: add workaround for UBSAN alignment false positive scott.k.mitch1
@ 2026-01-10  1:56 ` scott.k.mitch1
  2026-01-10 15:29   ` Morten Brørup
  2026-01-10 16:59 ` [PATCH v12 0/3] net: optimize raw checksum computation Stephen Hemminger
  3 siblings, 1 reply; 9+ messages in thread
From: scott.k.mitch1 @ 2026-01-10  1:56 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott, Scott Mitchell

From: Scott <scott_mitchell@apple.com>

GCC has a bug where it incorrectly elides struct initialization in
inline functions when strict aliasing is enabled (-O2/-O3/-Os), causing
reads from uninitialized memory. This affects both designated initializers
and manual field assignment.

Add RTE_FORCE_INIT_BARRIER macro that uses an asm volatile memory barrier
to prevent the compiler from incorrectly optimizing away struct
initialization. Apply the workaround to pseudo-header checksum functions
in rte_ip4.h, rte_ip6.h, hinic driver, and mlx5 driver.
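
A minimal sketch of the affected pattern and the workaround (illustrative
only; the helper name is hypothetical, and the pseudo-header layout simply
mirrors the one used in rte_ip4.h):

```c
#include <rte_byteorder.h>
#include <rte_cksum.h>
#include <rte_common.h>
#include <rte_ip4.h>

static inline uint16_t
phdr_cksum_sketch(const struct rte_ipv4_hdr *hdr, uint16_t l4_len)
{
	struct {
		uint32_t src_addr; /* IP address of source host. */
		uint32_t dst_addr; /* IP address of destination host. */
		uint8_t  zero;     /* zero. */
		uint8_t  proto;    /* L4 protocol type. */
		uint16_t len;      /* L4 length, big-endian. */
	} psd_hdr = {
		.src_addr = hdr->src_addr,
		.dst_addr = hdr->dst_addr,
		.zero = 0,
		.proto = hdr->next_proto_id,
		.len = rte_cpu_to_be_16(l4_len),
	};

	/* Affected GCC versions can elide the stores above when
	 * rte_raw_cksum() reads psd_hdr through uint16_t pointers;
	 * the barrier forces the initialization to be materialized. */
	RTE_FORCE_INIT_BARRIER(psd_hdr);
	return rte_raw_cksum(&psd_hdr, sizeof(psd_hdr));
}
```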

Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
---
 drivers/net/hinic/hinic_pmd_tx.c |  2 ++
 drivers/net/mlx5/mlx5_flow_dv.c  |  2 ++
 lib/eal/include/rte_common.h     | 14 ++++++++++++++
 lib/net/rte_ip4.h                |  1 +
 lib/net/rte_ip6.h                |  1 +
 5 files changed, 20 insertions(+)

diff --git a/drivers/net/hinic/hinic_pmd_tx.c b/drivers/net/hinic/hinic_pmd_tx.c
index 22fb0bffaf..570715531d 100644
--- a/drivers/net/hinic/hinic_pmd_tx.c
+++ b/drivers/net/hinic/hinic_pmd_tx.c
@@ -725,6 +725,7 @@ hinic_ipv4_phdr_cksum(const struct rte_ipv4_hdr *ipv4_hdr, uint64_t ol_flags)
 		rte_cpu_to_be_16(rte_be_to_cpu_16(ipv4_hdr->total_length) -
 				 rte_ipv4_hdr_len(ipv4_hdr));
 	}
+	RTE_FORCE_INIT_BARRIER(psd_hdr);
 	return rte_raw_cksum(&psd_hdr, sizeof(psd_hdr));
 }
 
@@ -743,6 +744,7 @@ hinic_ipv6_phdr_cksum(const struct rte_ipv6_hdr *ipv6_hdr, uint64_t ol_flags)
 	else
 		psd_hdr.len = ipv6_hdr->payload_len;
 
+	RTE_FORCE_INIT_BARRIER(psd_hdr);
 	sum = __rte_raw_cksum(&ipv6_hdr->src_addr,
 		sizeof(ipv6_hdr->src_addr) + sizeof(ipv6_hdr->dst_addr), 0);
 	sum = __rte_raw_cksum(&psd_hdr, sizeof(psd_hdr), sum);
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index 47f6d28410..1eeeb6747f 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -4445,6 +4445,8 @@ __flow_encap_decap_resource_register(struct rte_eth_dev *dev,
 			.reserve = 0,
 		}
 	};
+	RTE_FORCE_INIT_BARRIER(encap_decap_key);
+
 	struct mlx5_flow_cb_ctx ctx = {
 		.error = error,
 		.data = resource,
diff --git a/lib/eal/include/rte_common.h b/lib/eal/include/rte_common.h
index 00d428e295..d58e856d96 100644
--- a/lib/eal/include/rte_common.h
+++ b/lib/eal/include/rte_common.h
@@ -555,6 +555,20 @@ static void __attribute__((destructor(RTE_PRIO(prio)), used)) func(void)
 #define __rte_no_ubsan_alignment
 #endif
 
+/**
+ * Force struct initialization to prevent GCC optimization bug.
+ * GCC has a bug where it incorrectly elides struct initialization in
+ * inline functions when strict aliasing is enabled, causing reads from
+ * uninitialized memory. This memory barrier prevents the misoptimization.
+ */
+#ifdef RTE_CC_GCC
+#define RTE_FORCE_INIT_BARRIER(var) do {	\
+	asm volatile("" : "+m" (var));		\
+} while (0)
+#else
+#define RTE_FORCE_INIT_BARRIER(var)
+#endif
+
 /*********** Macros for pointer arithmetic ********/
 
 /**
diff --git a/lib/net/rte_ip4.h b/lib/net/rte_ip4.h
index 822a660cfb..a0839c584e 100644
--- a/lib/net/rte_ip4.h
+++ b/lib/net/rte_ip4.h
@@ -238,6 +238,7 @@ rte_ipv4_phdr_cksum(const struct rte_ipv4_hdr *ipv4_hdr, uint64_t ol_flags)
 		psd_hdr.len = rte_cpu_to_be_16((uint16_t)(l3_len -
 			rte_ipv4_hdr_len(ipv4_hdr)));
 	}
+	RTE_FORCE_INIT_BARRIER(psd_hdr);
 	return rte_raw_cksum(&psd_hdr, sizeof(psd_hdr));
 }
 
diff --git a/lib/net/rte_ip6.h b/lib/net/rte_ip6.h
index d1abf1f5d5..902f100a44 100644
--- a/lib/net/rte_ip6.h
+++ b/lib/net/rte_ip6.h
@@ -572,6 +572,7 @@ rte_ipv6_phdr_cksum(const struct rte_ipv6_hdr *ipv6_hdr, uint64_t ol_flags)
 	else
 		psd_hdr.len = ipv6_hdr->payload_len;
 
+	RTE_FORCE_INIT_BARRIER(psd_hdr);
 	sum = __rte_raw_cksum(&ipv6_hdr->src_addr,
 		sizeof(ipv6_hdr->src_addr) + sizeof(ipv6_hdr->dst_addr),
 		0);
-- 
2.39.5 (Apple Git-154)



* Re: [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests
  2026-01-10  1:56 ` [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests scott.k.mitch1
@ 2026-01-10  2:28   ` Scott Mitchell
  2026-01-10 14:47   ` Morten Brørup
  1 sibling, 0 replies; 9+ messages in thread
From: Scott Mitchell @ 2026-01-10  2:28 UTC (permalink / raw)
  To: dev; +Cc: mb, stephen, Scott

I will fix the `ERROR:NO_AUTHOR_SIGN_OFF` issue in a followup submission,
but will wait in case there is other feedback to address.


* RE: [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests
  2026-01-10  1:56 ` [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests scott.k.mitch1
  2026-01-10  2:28   ` Scott Mitchell
@ 2026-01-10 14:47   ` Morten Brørup
  1 sibling, 0 replies; 9+ messages in thread
From: Morten Brørup @ 2026-01-10 14:47 UTC (permalink / raw)
  To: scott.k.mitch1, dev; +Cc: stephen, Scott

> From: Scott <scott_mitchell@apple.com>
> 
> __rte_raw_cksum uses a loop with memcpy on each iteration.
> GCC 15+ is able to vectorize the loop but Clang 18.1 is not.
> Replacing the memcpy with unaligned_uint16_t pointer access enables
> both GCC and Clang to vectorize with SSE/AVX/AVX-512.
> 
> This patch adds comprehensive fuzz testing and updates the performance
> test to measure the optimization impact.
> 
> Performance results from cksum_perf_autotest on Intel Xeon
> (Cascade Lake, AVX-512) built with Clang 18.1 (TSC cycles/byte):
> 
>   Block size    Before    After    Improvement
>          100      0.40     0.24        ~40%
>         1500      0.50     0.06        ~8x
>         9000      0.49     0.06        ~8x
> 
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
> ---

Probably makes no practical difference, but consider marking the __rte_raw_cksum() function __rte_pure:
https://elixir.bootlin.com/dpdk/v25.11/source/lib/eal/include/rte_common.h#L228
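
For illustration, the marking would sit right before the definition,
something like this (a sketch only; __rte_pure is the existing wrapper for
__attribute__((pure)) in rte_common.h, and where to place it is up to you):

```c
/* pure rather than const: the function has no side effects, but it
 * does read the memory that buf points to, so __rte_pure is the
 * strongest attribute that is still correct here. */
__rte_pure
static inline uint32_t
__rte_raw_cksum(const void *buf, size_t len, uint32_t sum);
```

This would let the compiler fold repeated calls with identical arguments.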

With or without __rte_pure marking,
Acked-by: Morten Brørup <mb@smartsharesystems.com>



* RE: [PATCH v12 2/3] eal: add workaround for UBSAN alignment false positive
  2026-01-10  1:56 ` [PATCH v12 2/3] eal: add workaround for UBSAN alignment false positive scott.k.mitch1
@ 2026-01-10 15:02   ` Morten Brørup
  0 siblings, 0 replies; 9+ messages in thread
From: Morten Brørup @ 2026-01-10 15:02 UTC (permalink / raw)
  To: scott.k.mitch1, dev, stephen; +Cc: Scott

> From: Scott <scott_mitchell@apple.com>
> 
> The optimized __rte_raw_cksum() uses unaligned_uint16_t pointer access
> which triggers UBSAN alignment warnings even though the access is safe
> due to the unaligned type definition.
> 
> Add __rte_no_ubsan_alignment attribute to suppress these false positive
> warnings while preserving other UBSAN checks.

I think the root cause of the false positive warnings is a faulty type definition of unaligned_uint16_t in rte_common.h [1].

For architectures with strict alignment requirements, the type is correctly defined as:
typedef uint16_t unaligned_uint16_t __rte_aligned(1);

But for architectures with relaxed alignment requirements, the type definition is incomplete:
typedef uint16_t unaligned_uint16_t;

When UBSAN runs with the latter definition (which doesn't remove the type's alignment requirement), UBSAN is correctly warning about the misalignment.

Either UBSAN should run with the first definition (i.e. with __rte_aligned(1)), or the type definitions with __rte_aligned(1) should be used for all architectures.

[1]: https://elixir.bootlin.com/dpdk/v25.11/source/lib/eal/include/rte_common.h#L124
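
For concreteness, the second option would reduce to using the
strict-alignment definitions unconditionally, roughly like this (a sketch;
GCC and Clang allow the aligned attribute on a typedef to lower a type's
alignment):

```c
/* Sketch: define the unaligned_* types with alignof == 1 on every
 * architecture, so UBSAN sees the access as intentionally unaligned. */
typedef uint64_t unaligned_uint64_t __rte_aligned(1);
typedef uint32_t unaligned_uint32_t __rte_aligned(1);
typedef uint16_t unaligned_uint16_t __rte_aligned(1);
```

On architectures with efficient unaligned loads the generated code should be
unchanged; only the type's declared alignment drops.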

Stephen, WDYT?

> 
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
> ---
>  lib/eal/include/rte_common.h | 9 +++++++++
>  lib/net/rte_cksum.h          | 1 +
>  2 files changed, 10 insertions(+)
> 
> diff --git a/lib/eal/include/rte_common.h
> b/lib/eal/include/rte_common.h
> index 9e7d84f929..00d428e295 100644
> --- a/lib/eal/include/rte_common.h
> +++ b/lib/eal/include/rte_common.h
> @@ -546,6 +546,15 @@ static void
> __attribute__((destructor(RTE_PRIO(prio)), used)) func(void)
>  #define __rte_no_asan
>  #endif
> 
> +/**
> + * Disable UndefinedBehaviorSanitizer alignment check on some code
> + */
> +#if defined(RTE_CC_CLANG) || defined(RTE_CC_GCC)
> +#define __rte_no_ubsan_alignment
> __attribute__((no_sanitize("alignment")))
> +#else
> +#define __rte_no_ubsan_alignment
> +#endif
> +
>  /*********** Macros for pointer arithmetic ********/
> 
>  /**
> diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
> index 8dbe1179e8..9a7fda563d 100644
> --- a/lib/net/rte_cksum.h
> +++ b/lib/net/rte_cksum.h
> @@ -39,6 +39,7 @@ extern "C" {
>   * @return
>   *   sum += Sum of all words in the buffer.
>   */
> +__rte_no_ubsan_alignment
>  static inline uint32_t
>  __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>  {
> --
> 2.39.5 (Apple Git-154)



* RE: [PATCH v12 3/3] eal/net: add workaround for GCC optimization bug
  2026-01-10  1:56 ` [PATCH v12 3/3] eal/net: add workaround for GCC optimization bug scott.k.mitch1
@ 2026-01-10 15:29   ` Morten Brørup
  0 siblings, 0 replies; 9+ messages in thread
From: Morten Brørup @ 2026-01-10 15:29 UTC (permalink / raw)
  To: scott.k.mitch1, dev; +Cc: stephen, Scott

> From: Scott <scott_mitchell@apple.com>
> 
> GCC has a bug where it incorrectly elides struct initialization in
> inline functions when strict aliasing is enabled (-O2/-O3/-Os), causing
> reads from uninitialized memory. This affects both designated
> initializers and manual field assignment.

Does this still happen after you replaced the RTE_PTR_ADD() with native pointer arithmetic in the checksum function?
In other words: Is this workaround still necessary?

This is a showstopper:
If the workaround is necessary, applications with similar use cases also need to apply the workaround.
If we cannot somehow enforce that, the series is likely to break some applications, which is unacceptable.

> 
> Add RTE_FORCE_INIT_BARRIER macro that uses an asm volatile memory
> barrier
> to prevent the compiler from incorrectly optimizing away struct
> initialization. Apply the workaround to pseudo-header checksum
> functions
> in rte_ip4.h, rte_ip6.h, hinic driver, and mlx5 driver.
> 
> Signed-off-by: Scott Mitchell <scott.k.mitch1@gmail.com>
> ---
>  drivers/net/hinic/hinic_pmd_tx.c |  2 ++
>  drivers/net/mlx5/mlx5_flow_dv.c  |  2 ++
>  lib/eal/include/rte_common.h     | 14 ++++++++++++++
>  lib/net/rte_ip4.h                |  1 +
>  lib/net/rte_ip6.h                |  1 +
>  5 files changed, 20 insertions(+)
> 
> diff --git a/drivers/net/hinic/hinic_pmd_tx.c
> b/drivers/net/hinic/hinic_pmd_tx.c
> index 22fb0bffaf..570715531d 100644
> --- a/drivers/net/hinic/hinic_pmd_tx.c
> +++ b/drivers/net/hinic/hinic_pmd_tx.c
> @@ -725,6 +725,7 @@ hinic_ipv4_phdr_cksum(const struct rte_ipv4_hdr
> *ipv4_hdr, uint64_t ol_flags)
>  		rte_cpu_to_be_16(rte_be_to_cpu_16(ipv4_hdr->total_length) -
>  				 rte_ipv4_hdr_len(ipv4_hdr));
>  	}
> +	RTE_FORCE_INIT_BARRIER(psd_hdr);
>  	return rte_raw_cksum(&psd_hdr, sizeof(psd_hdr));
>  }
> 
> @@ -743,6 +744,7 @@ hinic_ipv6_phdr_cksum(const struct rte_ipv6_hdr
> *ipv6_hdr, uint64_t ol_flags)
>  	else
>  		psd_hdr.len = ipv6_hdr->payload_len;
> 
> +	RTE_FORCE_INIT_BARRIER(psd_hdr);
>  	sum = __rte_raw_cksum(&ipv6_hdr->src_addr,
>  		sizeof(ipv6_hdr->src_addr) + sizeof(ipv6_hdr->dst_addr),
> 0);
>  	sum = __rte_raw_cksum(&psd_hdr, sizeof(psd_hdr), sum);
> diff --git a/drivers/net/mlx5/mlx5_flow_dv.c
> b/drivers/net/mlx5/mlx5_flow_dv.c
> index 47f6d28410..1eeeb6747f 100644
> --- a/drivers/net/mlx5/mlx5_flow_dv.c
> +++ b/drivers/net/mlx5/mlx5_flow_dv.c
> @@ -4445,6 +4445,8 @@ __flow_encap_decap_resource_register(struct
> rte_eth_dev *dev,
>  			.reserve = 0,
>  		}
>  	};
> +	RTE_FORCE_INIT_BARRIER(encap_decap_key);
> +
>  	struct mlx5_flow_cb_ctx ctx = {
>  		.error = error,
>  		.data = resource,
> diff --git a/lib/eal/include/rte_common.h
> b/lib/eal/include/rte_common.h
> index 00d428e295..d58e856d96 100644
> --- a/lib/eal/include/rte_common.h
> +++ b/lib/eal/include/rte_common.h
> @@ -555,6 +555,20 @@ static void
> __attribute__((destructor(RTE_PRIO(prio)), used)) func(void)
>  #define __rte_no_ubsan_alignment
>  #endif
> 
> +/**
> + * Force struct initialization to prevent GCC optimization bug.
> + * GCC has a bug where it incorrectly elides struct initialization in
> + * inline functions when strict aliasing is enabled, causing reads
> from
> + * uninitialized memory. This memory barrier prevents the
> misoptimization.
> + */
> +#ifdef RTE_CC_GCC
> +#define RTE_FORCE_INIT_BARRIER(var) do {	\
> +	asm volatile("" : "+m" (var));		\
> +} while (0)
> +#else
> +#define RTE_FORCE_INIT_BARRIER(var)
> +#endif
> +
>  /*********** Macros for pointer arithmetic ********/
> 
>  /**
> diff --git a/lib/net/rte_ip4.h b/lib/net/rte_ip4.h
> index 822a660cfb..a0839c584e 100644
> --- a/lib/net/rte_ip4.h
> +++ b/lib/net/rte_ip4.h
> @@ -238,6 +238,7 @@ rte_ipv4_phdr_cksum(const struct rte_ipv4_hdr
> *ipv4_hdr, uint64_t ol_flags)
>  		psd_hdr.len = rte_cpu_to_be_16((uint16_t)(l3_len -
>  			rte_ipv4_hdr_len(ipv4_hdr)));
>  	}
> +	RTE_FORCE_INIT_BARRIER(psd_hdr);
>  	return rte_raw_cksum(&psd_hdr, sizeof(psd_hdr));
>  }
> 
> diff --git a/lib/net/rte_ip6.h b/lib/net/rte_ip6.h
> index d1abf1f5d5..902f100a44 100644
> --- a/lib/net/rte_ip6.h
> +++ b/lib/net/rte_ip6.h
> @@ -572,6 +572,7 @@ rte_ipv6_phdr_cksum(const struct rte_ipv6_hdr
> *ipv6_hdr, uint64_t ol_flags)
>  	else
>  		psd_hdr.len = ipv6_hdr->payload_len;
> 
> +	RTE_FORCE_INIT_BARRIER(psd_hdr);
>  	sum = __rte_raw_cksum(&ipv6_hdr->src_addr,
>  		sizeof(ipv6_hdr->src_addr) + sizeof(ipv6_hdr->dst_addr),
>  		0);
> --
> 2.39.5 (Apple Git-154)



* Re: [PATCH v12 0/3] net: optimize raw checksum computation
  2026-01-10  1:56 [PATCH v12 0/3] net: optimize raw checksum computation scott.k.mitch1
                   ` (2 preceding siblings ...)
  2026-01-10  1:56 ` [PATCH v12 3/3] eal/net: add workaround for GCC optimization bug scott.k.mitch1
@ 2026-01-10 16:59 ` Stephen Hemminger
  3 siblings, 0 replies; 9+ messages in thread
From: Stephen Hemminger @ 2026-01-10 16:59 UTC (permalink / raw)
  To: scott.k.mitch1; +Cc: dev, mb

On Fri,  9 Jan 2026 20:56:48 -0500
scott.k.mitch1@gmail.com wrote:

> From: Scott Mitchell <scott.k.mitch1@gmail.com>
> 
> This series optimizes __rte_raw_cksum() by replacing memcpy-based access
> with unaligned_uint16_t pointer access, enabling vectorization in both
> GCC and Clang. The series is split into three patches to clearly separate
> the core optimization from compiler-specific workarounds.
> 
> Performance improvement from cksum_perf_autotest on Intel Xeon
> (Cascade Lake, AVX-512) with Clang 18.1 (TSC cycles/byte):
> 
>   Block size    Before    After    Improvement
>          100      0.40     0.24        ~40%
>         1500      0.50     0.06        ~8x
>         9000      0.49     0.06        ~8x
> 
> Changes in v12:
> - Split into 3-patch series per reviewer feedback
> - Patch 1/3: Core optimization and test additions
> - Patch 2/3: UBSAN alignment workaround (separate from GCC bug)
> - Patch 3/3: GCC optimization bug workaround
> - Reverted len & 1 to len % 2 and restored unlikely() per feedback
> - Renamed RTE_SUPPRESS_UNINITIALIZED_WARNING to RTE_FORCE_INIT_BARRIER
> - Applied minimal changes (no refactoring) to existing code
> - Deferred hinic driver refactoring to future series
> 
> Note: Patch 1/3 will trigger compiler warnings/failures on GCC versions
> with the optimization bug (GCC 11.5.0 and others seen on DPDK CI). These
> are resolved by patches 2/3 and 3/3.
> 
> Scott Mitchell (3):
>   net: optimize __rte_raw_cksum and add tests
>   eal: add workaround for UBSAN alignment false positive
>   eal/net: add workaround for GCC optimization bug
> 
>  app/test/meson.build             |   1 +
>  app/test/test_cksum_fuzz.c       | 240 +++++++++++++++++++++++++++++++
>  app/test/test_cksum_perf.c       |   2 +-
>  drivers/net/hinic/hinic_pmd_tx.c |   2 +
>  drivers/net/mlx5/mlx5_flow_dv.c  |   2 +
>  lib/eal/include/rte_common.h     |  23 +++
>  lib/net/rte_cksum.h              |  15 +-
>  lib/net/rte_ip4.h                |   1 +
>  lib/net/rte_ip6.h                |   1 +
>  9 files changed, 277 insertions(+), 10 deletions(-)
>  create mode 100644 app/test/test_cksum_fuzz.c
> 
> --
> 2.39.5 (Apple Git-154)
> 

I do not like to see so much compiler hacking, and I am not sure it is worth
having so much special-case code for a few cycles.

Feeling lazy, I asked an AI to compare the FreeBSD and Linux checksum
implementations, and its answer was enlightening.

The optimum solution, in my opinion, would be:
   - IPv4- and IPv6-specific routines using assembly on a per-platform basis.
     Add-with-carry will be faster than all these compiler flags
     (a rough sketch follows below).
   - a non-inline function for the general data case.

This should eliminate the need for compiler flags.
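
To make the suggestion concrete, here is a rough sketch (not part of this
series) of the add-with-carry shape for the IPv4 header case. It is x86-64
GCC/Clang extended asm only, assumes a plain 20-byte header (ihl == 5), and
the function name is hypothetical:

```c
#include <stdint.h>
#include <string.h>

static inline uint16_t
ipv4_hdr_cksum_adc_sketch(const void *hdr)
{
	uint64_t a, b, sum;
	uint32_t c;

	/* memcpy keeps the loads alignment- and aliasing-safe. */
	memcpy(&a, hdr, 8);
	memcpy(&b, (const uint8_t *)hdr + 8, 8);
	memcpy(&c, (const uint8_t *)hdr + 16, 4);

	sum = a;
	/* Let the hardware carry flag do the one's-complement wrap. */
	asm("addq %[b], %[s]\n\t"
	    "adcq %[c], %[s]\n\t"
	    "adcq $0, %[s]"
	    : [s] "+r" (sum)
	    : [b] "r" (b), [c] "r" ((uint64_t)c)
	    : "cc");

	/* Fold 64 -> 32 -> 16 bits, then complement. */
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffffffff) + (sum >> 32);
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The folds rely on the RFC 1071 property that the one's complement sum can be
computed in any word size and folded down afterwards.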


Compare the algorithm used to compute standard IP checksum in FreeBSD and Linux

Great question! Both implement the same RFC 1071 one's complement checksum, but their optimization strategies differ in interesting ways.

## Core Algorithm

Both compute a 16-bit one's complement sum: accumulate 16-bit words, fold carries back in, and complement the result. The differences are in the implementation details.

## Linux Approach

Linux uses a layered architecture with generic C fallbacks and arch-specific optimizations:

**Generic path** (`lib/checksum.c`, `include/net/checksum.h`):
- `do_csum()` uses a 64-bit accumulator, processing 32 bytes per loop iteration (heavily unrolled)
- Defers folding until the end to minimize carry handling
- `csum_fold()` does the final 32→16 bit fold with a clever trick:
  ```c
  sum = (sum & 0xffff) + (sum >> 16);
  sum = (sum & 0xffff) + (sum >> 16);
  return ~sum;
  ```

**IP header fast path** (`ip_fast_csum()`):
- Specialized for the common 20-byte header case
- On x86-64, uses ADC (add-with-carry) chains to exploit hardware carry propagation
- ARM64 uses similar ADCS instruction chains

## FreeBSD Approach

FreeBSD's `in_cksum()` (`sys/netinet/in_cksum.c`) takes a different approach:

**mbuf-oriented design**:
- Walks mbuf chains directly, handling the common case of headers split across mbufs
- Tracks byte-swapping state when data crosses 16-bit boundaries between mbufs

**Accumulation strategy**:
- Uses `int` (32-bit) accumulator in the generic version
- Processes in 16-byte chunks with explicit unrolling
- Folds more frequently (after each mbuf) rather than deferring everything

**Architecture-specific versions**:
- x86 versions historically used inline assembly with ADC chains
- More recent versions lean on compiler intrinsics

## Key Differences

| Aspect | Linux | FreeBSD |
|--------|-------|---------|
| Accumulator | 64-bit, fold once at end | 32-bit, fold per mbuf |
| Loop unroll | 32 bytes/iteration | 16 bytes/iteration |
| Data structure | Works on flat buffers via `skb_checksum()` | Direct mbuf chain walking |
| Alignment | Handles via initial byte adjustment | Similar, but tracks "odd byte" state across mbufs |
| IP header | Dedicated `ip_fast_csum()` | Often uses general `in_cksum()` |

## Performance Implications

Linux's approach tends to be faster for large contiguous buffers due to larger unrolling and deferred folding. FreeBSD's mbuf-aware design is arguably cleaner for scattered data but may do more fold operations.

On modern CPUs with good branch prediction and out-of-order execution, the differences are often marginal—the ADC chain optimizations matter more than the high-level strategy. Both have moved toward letting the compiler handle more optimization rather than hand-tuned assembly, though Linux still maintains more arch-specific variants.

Given your DPDK work, you've probably noticed DPDK's `rte_raw_cksum()` takes yet another approach—often vectorized with SIMD when available, which can significantly outperform both kernel implementations for large payloads.

