From: Stephen Hemminger <stephen@networkplumber.org>
To: scott.k.mitch1@gmail.com
Cc: dev@dpdk.org, mb@smartsharesystems.com
Subject: Re: [PATCH v12 0/3] net: optimize raw checksum computation
Date: Sat, 10 Jan 2026 08:59:29 -0800
Message-ID: <20260110085929.712a0a87@phoenix.local>
In-Reply-To: <20260110015651.26201-1-scott.k.mitch1@gmail.com>
On Fri, 9 Jan 2026 20:56:48 -0500
scott.k.mitch1@gmail.com wrote:
> From: Scott Mitchell <scott.k.mitch1@gmail.com>
>
> This series optimizes __rte_raw_cksum() by replacing memcpy-based access
> with unaligned_uint16_t pointer access, enabling vectorization in both
> GCC and Clang. The series is split into three patches to clearly separate
> the core optimization from compiler-specific workarounds.
>
> Performance improvement from cksum_perf_autotest on Intel Xeon
> (Cascade Lake, AVX-512) with Clang 18.1 (TSC cycles/byte):
>
> Block size Before After Improvement
> 100 0.40 0.24 ~40%
> 1500 0.50 0.06 ~8x
> 9000 0.49 0.06 ~8x
>
> Changes in v12:
> - Split into 3-patch series per reviewer feedback
> - Patch 1/3: Core optimization and test additions
> - Patch 2/3: UBSAN alignment workaround (separate from GCC bug)
> - Patch 3/3: GCC optimization bug workaround
> - Reverted len & 1 to len % 2 and restored unlikely() per feedback
> - Renamed RTE_SUPPRESS_UNINITIALIZED_WARNING to RTE_FORCE_INIT_BARRIER
> - Applied minimal changes (no refactoring) to existing code
> - Deferred hinic driver refactoring to future series
>
> Note: Patch 1/3 will trigger compiler warnings/failures on GCC versions
> with the optimization bug (GCC 11.5.0 and others seen on DPDK CI). These
> are resolved by patches 2/3 and 3/3.
>
> Scott Mitchell (3):
> net: optimize __rte_raw_cksum and add tests
> eal: add workaround for UBSAN alignment false positive
> eal/net: add workaround for GCC optimization bug
>
> app/test/meson.build | 1 +
> app/test/test_cksum_fuzz.c | 240 +++++++++++++++++++++++++++++++
> app/test/test_cksum_perf.c | 2 +-
> drivers/net/hinic/hinic_pmd_tx.c | 2 +
> drivers/net/mlx5/mlx5_flow_dv.c | 2 +
> lib/eal/include/rte_common.h | 23 +++
> lib/net/rte_cksum.h | 15 +-
> lib/net/rte_ip4.h | 1 +
> lib/net/rte_ip6.h | 1 +
> 9 files changed, 277 insertions(+), 10 deletions(-)
> create mode 100644 app/test/test_cksum_fuzz.c
>
> --
> 2.39.5 (Apple Git-154)
>
I do not like seeing so much compiler hacking, and I am not sure it is
worth this much special-case code for a few cycles.
Feeling lazy, I asked an AI to compare the FreeBSD and Linux checksum
implementations, and its answer was enlightening.
In my opinion, the optimum solution would be:
- IPv4- and IPv6-specific header routines using assembly on a per-platform
  basis. Add-with-carry will be faster than all these compiler flags.
- a non-inline function for the general case with payload data.
This should eliminate the need for compiler flags.
Compare the algorithm used to compute standard IP checksum in FreeBSD and Linux
Great question! Both implement the same RFC 1071 one's complement checksum, but their optimization strategies differ in interesting ways.
## Core Algorithm
Both compute a 16-bit one's complement sum: accumulate 16-bit words, fold carries back in, and complement the result. The differences are in the implementation details.
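As a minimal, portable sketch of that shared algorithm (illustrative only; neither kernel uses exactly this code, and the function name here is invented):

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 one's-complement checksum over a byte buffer.
 * Bytes are combined big-endian style so the result is the same
 * on any host, which keeps the illustration simple. */
static uint16_t rfc1071_cksum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {                 /* accumulate 16-bit words */
        sum += ((uint32_t)buf[0] << 8) | buf[1];
        buf += 2;
        len -= 2;
    }
    if (len)                          /* trailing odd byte, zero-padded */
        sum += (uint32_t)buf[0] << 8;

    while (sum >> 16)                 /* fold carries back into low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;            /* one's complement of the sum */
}
```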
## Linux Approach
Linux uses a layered architecture with generic C fallbacks and arch-specific optimizations:
**Generic path** (`lib/checksum.c`, `include/net/checksum.h`):
- `do_csum()` uses a 64-bit accumulator, processing 32 bytes per loop iteration (heavily unrolled)
- Defers folding until the end to minimize carry handling
- `csum_fold()` does the final 32→16 bit fold with a clever trick:
```c
sum = (sum & 0xffff) + (sum >> 16);
sum = (sum & 0xffff) + (sum >> 16);
return ~sum;
```
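The same trick can be written as standalone helpers (hypothetical names, not the kernel's): two folds always suffice because the first fold's result is at most 0xFFFF + 0xFFFF = 0x1FFFE, so no loop is needed.

```c
#include <stdint.h>

/* Reduce a 32-bit one's-complement accumulator to 16 bits.
 * After the first fold the value is at most 0x1FFFE, so the
 * second fold always fits in 16 bits. */
static uint16_t csum_fold32(uint32_t sum)
{
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* The same idea one level up, matching Linux's 64-bit deferred
 * accumulation: fold 64 -> 32 with end-around carry, then reuse
 * the 32 -> 16 fold. */
static uint16_t csum_fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    return csum_fold32((uint32_t)sum);
}
```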
**IP header fast path** (`ip_fast_csum()`):
- Specialized for the common 20-byte header case
- On x86-64, uses ADC (add-with-carry) chains to exploit hardware carry propagation
- ARM64 uses similar ADCS instruction chains
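A portable sketch of what those ADC chains buy (function names invented here; the real `ip_fast_csum()` implementations are hand-written assembly): the end-around-carry addition below is the pattern compilers typically lower to add/adc.

```c
#include <stdint.h>
#include <string.h>

/* One's-complement add with end-around carry: if the addition
 * wraps, add the carry back in. This is what the assembly
 * versions chain with ADC/ADCS instructions. */
static uint32_t add32_carry(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s + (s < b);               /* end-around carry */
}

/* Checksum of a 20-byte IPv4 header (no options): a portable
 * approximation of the ip_fast_csum() fast path. */
static uint16_t ipv4_hdr_cksum(const void *hdr)
{
    uint32_t w[5], sum;

    memcpy(w, hdr, 20);               /* unaligned-safe 32-bit loads */
    sum = add32_carry(w[0], w[1]);
    sum = add32_carry(sum, w[2]);
    sum = add32_carry(sum, w[3]);
    sum = add32_carry(sum, w[4]);
    sum = (sum & 0xffff) + (sum >> 16);
    sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Because one's-complement addition is commutative across byte lanes, a header whose checksum field holds the value computed this way sums back to 0xFFFF, so re-checksumming it yields 0.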
## FreeBSD Approach
FreeBSD's `in_cksum()` (`sys/netinet/in_cksum.c`) takes a different approach:
**mbuf-oriented design**:
- Walks mbuf chains directly, handling the common case of headers split across mbufs
- Tracks byte-swapping state when data crosses 16-bit boundaries between mbufs
**Accumulation strategy**:
- Uses `int` (32-bit) accumulator in the generic version
- Processes in 16-byte chunks with explicit unrolling
- Folds more frequently (after each mbuf) rather than deferring everything
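The byte-swapping state is the subtle part. A sketch of the idea (simplified invented helpers, not FreeBSD's actual `in_cksum()`): when a chunk starts at an odd global offset, its locally computed sum is byte-swapped relative to the stream, so it must be swapped back before accumulating.

```c
#include <stddef.h>
#include <stdint.h>

struct chunk { const uint8_t *data; size_t len; };

/* Big-endian 16-bit sum of one contiguous chunk, folded to 16 bits. */
static uint16_t chunk_sum(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    while (n > 1) { s += ((uint32_t)p[0] << 8) | p[1]; p += 2; n -= 2; }
    if (n) s += (uint32_t)p[0] << 8;
    while (s >> 16) s = (s & 0xffff) + (s >> 16);
    return (uint16_t)s;
}

/* Checksum across scattered chunks, mbuf-style: track the running
 * byte offset and un-swap chunks that start on an odd boundary. */
static uint16_t scattered_cksum(const struct chunk *c, size_t nchunks)
{
    uint32_t sum = 0;
    size_t off = 0;

    for (size_t i = 0; i < nchunks; i++) {
        uint32_t s = chunk_sum(c[i].data, c[i].len);
        if (off & 1)                  /* odd start: swap bytes back */
            s = ((s & 0xff) << 8) | (s >> 8);
        sum += s;
        off += c[i].len;
    }
    while (sum >> 16) sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Splitting a buffer at an odd offset and summing the pieces this way gives the same result as checksumming it flat, which is exactly the property a chain walker needs.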
**Architecture-specific versions**:
- x86 versions historically used inline assembly with ADC chains
- More recent versions lean on compiler intrinsics
## Key Differences
| Aspect | Linux | FreeBSD |
|--------|-------|---------|
| Accumulator | 64-bit, fold once at end | 32-bit, fold per mbuf |
| Loop unroll | 32 bytes/iteration | 16 bytes/iteration |
| Data structure | Flat buffers via `csum_partial()`; `skb_checksum()` walks fragmented skbs | Direct mbuf chain walking |
| Alignment | Handles via initial byte adjustment | Similar, but tracks "odd byte" state across mbufs |
| IP header | Dedicated `ip_fast_csum()` | Often uses general `in_cksum()` |
## Performance Implications
Linux's approach tends to be faster for large contiguous buffers due to larger unrolling and deferred folding. FreeBSD's mbuf-aware design is arguably cleaner for scattered data but may do more fold operations.
On modern CPUs with good branch prediction and out-of-order execution, the differences are often marginal—the ADC chain optimizations matter more than the high-level strategy. Both have moved toward letting the compiler handle more optimization rather than hand-tuned assembly, though Linux still maintains more arch-specific variants.
Given your DPDK work, you've probably noticed DPDK's `__rte_raw_cksum()` takes yet another approach: plain C written so the compiler can auto-vectorize it, which can significantly outperform both kernel implementations for large payloads.
Thread overview: 10+ messages
2026-01-10 1:56 scott.k.mitch1
2026-01-10 1:56 ` [PATCH v12 1/3] net: optimize __rte_raw_cksum and add tests scott.k.mitch1
2026-01-10 2:28 ` Scott Mitchell
2026-01-10 14:47 ` Morten Brørup
2026-01-10 1:56 ` [PATCH v12 2/3] eal: add workaround for UBSAN alignment false positive scott.k.mitch1
2026-01-10 15:02 ` Morten Brørup
2026-01-10 1:56 ` [PATCH v12 3/3] eal/net: add workaround for GCC optimization bug scott.k.mitch1
2026-01-10 15:29 ` Morten Brørup
2026-01-11 6:21 ` Scott Mitchell
2026-01-10 16:59 ` Stephen Hemminger [this message]