Subject: RE: [PATCH v3] net: optimize raw checksum computation
Date: Wed, 7 Jan 2026 18:56:23 +0100
Message-ID: <98CBD80474FA8B44BF855DF32C47DC35F65630@smartserver.smartshare.dk>
In-Reply-To: <20260107170415.80275-1-scott.k.mitch1@gmail.com>
References: <20260107170415.80275-1-scott.k.mitch1@gmail.com>
From: Morten Brørup
List-Id: DPDK patches and discussions

> From: scott.k.mitch1@gmail.com [mailto:scott.k.mitch1@gmail.com]
> Sent: Wednesday, 7 January 2026 18.04
>
> From: Scott Mitchell
>
> Optimize __rte_raw_cksum() by processing data in larger unrolled loops
> instead of iterating word-by-word. The new implementation processes
> 64-byte blocks (32 x uint16_t) in the hot path, followed by smaller
> 32/16/8/4/2-byte chunks.

Playing around with Godbolt: https://godbolt.org/z/oYdP9xxfG

With the original code (built with -msse4.2), the compiler vectorizes the
loop to process 16-byte chunks (instead of the 2-byte chunks the source
code indicates). When built with -mavx512f, it processes 32-byte chunks.

IMHO, the compiled output of the new code is too big; more than 12 kB of
instructions consumes too much L1 instruction cache.
I suppose the compiler both vectorizes and unrolls the loops.

>
> Uses uint32_t accumulator with explicit casts to prevent signed integer
> overflow and leverages unaligned_uint16_t for safe unaligned access on
> all platforms. Adds __rte_no_ubsan_alignment attribute to suppress false
> positive alignment warnings from UndefinedBehaviorSanitizer.
>
> Performance results from cksum_perf_autotest (TSC cycles/byte):
> Block size    Before       After        Improvement
> 100           0.40-0.64    0.13-0.14    ~3-4x
> 1500          0.49-0.51    0.10-0.11    ~4-5x
> 9000          0.48-0.51    0.11-0.12    ~4x

On which machine do you achieve these perf numbers?

Can a measurable performance increase be achieved using significantly
smaller compiled code than this patch?
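To illustrate the direction I am thinking of, here is a rough, untested
sketch of a middle ground: unroll only to 16 bytes per iteration and leave
wider vectorization to the compiler. The name __rte_raw_cksum_small is just
for illustration, and it assumes the same unaligned_uint16_t typedef the
patch uses (from rte_common.h):

	#include <stdint.h>
	#include <stddef.h>
	#include <string.h>
	#include <rte_common.h>	/* assumed source of unaligned_uint16_t */

	/* Untested sketch, not the patch: modest 16-byte unroll,
	 * letting the compiler vectorize further if it chooses to.
	 */
	static inline uint32_t
	__rte_raw_cksum_small(const void *buf, size_t len, uint32_t sum)
	{
		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
		size_t n;

		/* Hot loop: 16 bytes (8 x uint16_t) per iteration. */
		for (n = len >> 4; n != 0; n--, p16 += 8)
			sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
				p16[4] + p16[5] + p16[6] + p16[7];

		/* Remaining whole 16-bit words. */
		for (n = (len & 15) >> 1; n != 0; n--, p16++)
			sum += *p16;

		/* Odd trailing byte; memcpy keeps it byte-order independent. */
		if (len & 1) {
			uint16_t left = 0;

			memcpy(&left, p16, 1);
			sum += left;
		}

		return sum;
	}

Completely unverified for performance; only meant to show what a smaller
code footprint could look like.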
>
> Signed-off-by: Scott Mitchell
> ---
> Changes in v3:
> - Added __rte_no_ubsan_alignment macro to suppress false-positive UBSAN
>   alignment warnings when using unaligned_uint16_t
> - Fixed false-positive GCC maybe-uninitialized warning in rte_ip6.h exposed
>   by optimization (can be split to separate patch once verified on CI)
>
> Changes in v2:
> - Fixed UndefinedBehaviorSanitizer errors by adding uint32_t casts to
>   prevent signed integer overflow in addition chains
> - Restored uint32_t sum accumulator instead of uint64_t
> - Added 64k length to test_cksum_perf.c
>
> diff --git a/lib/net/rte_cksum.h b/lib/net/rte_cksum.h
> index a8e8927952..d6e313dea5 100644
> --- a/lib/net/rte_cksum.h
> +++ b/lib/net/rte_cksum.h
> @@ -39,24 +39,64 @@ extern "C" {
>   * @return
>   *   sum += Sum of all words in the buffer.
>   */
> +__rte_no_ubsan_alignment
>  static inline uint32_t
>  __rte_raw_cksum(const void *buf, size_t len, uint32_t sum)
>  {
> -	const void *end;
> +	/* Process in 64 byte blocks (32 x uint16_t). */
> +	/* Always process as uint16_t chunks to preserve overflow/carry. */
> +	const void *end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, 64));
> +	while (buf != end) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> +			p16[4] + p16[5] + p16[6] + p16[7] +
> +			p16[8] + p16[9] + p16[10] + p16[11] +
> +			p16[12] + p16[13] + p16[14] + p16[15] +
> +			p16[16] + p16[17] + p16[18] + p16[19] +
> +			p16[20] + p16[21] + p16[22] + p16[23] +
> +			p16[24] + p16[25] + p16[26] + p16[27] +
> +			p16[28] + p16[29] + p16[30] + p16[31];
> +		buf = RTE_PTR_ADD(buf, 64);
> +	}
>
> -	for (end = RTE_PTR_ADD(buf, RTE_ALIGN_FLOOR(len, sizeof(uint16_t)));
> -	     buf != end; buf = RTE_PTR_ADD(buf, sizeof(uint16_t))) {
> -		uint16_t v;
> +	if (len & 32) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> +			p16[4] + p16[5] + p16[6] + p16[7] +
> +			p16[8] + p16[9] + p16[10] + p16[11] +
> +			p16[12] + p16[13] + p16[14] + p16[15];
> +		buf = RTE_PTR_ADD(buf, 32);
> +	}
>
> -		memcpy(&v, buf, sizeof(uint16_t));
> -		sum += v;
> +	if (len & 16) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3] +
> +			p16[4] + p16[5] + p16[6] + p16[7];
> +		buf = RTE_PTR_ADD(buf, 16);
>  	}
>
> -	/* if length is odd, keeping it byte order independent */
> -	if (unlikely(len % 2)) {
> -		uint16_t left = 0;
> +	if (len & 8) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1] + p16[2] + p16[3];
> +		buf = RTE_PTR_ADD(buf, 8);
> +	}
>
> -		memcpy(&left, end, 1);
> +	if (len & 4) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += (uint32_t)p16[0] + p16[1];
> +		buf = RTE_PTR_ADD(buf, 4);
> +	}
> +
> +	if (len & 2) {
> +		const unaligned_uint16_t *p16 = (const unaligned_uint16_t *)buf;
> +		sum += *p16;
> +		buf = RTE_PTR_ADD(buf, 2);
> +	}
> +
> +	/* If length is odd use memcpy for byte order independence */
> +	if (len & 1) {
> +		uint16_t left = 0;
> +		memcpy(&left, buf, 1);
>  		sum += left;
>  	}
>
> diff --git a/lib/net/rte_ip6.h b/lib/net/rte_ip6.h
> index d1abf1f5d5..af65a39815 100644
> --- a/lib/net/rte_ip6.h
> +++ b/lib/net/rte_ip6.h
> @@ -564,7 +564,7 @@ rte_ipv6_phdr_cksum(const struct rte_ipv6_hdr *ipv6_hdr, uint64_t ol_flags)
>  	struct {
>  		rte_be32_t len;   /* L4 length. */
>  		rte_be32_t proto; /* L4 protocol - top 3 bytes must be zero */
> -	} psd_hdr;
> +	} psd_hdr = {0}; /* Empty initializer avoids false-positive maybe-uninitialized warning */
>
>  	psd_hdr.proto = (uint32_t)(ipv6_hdr->proto << 24);
>  	if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))

Maybe ipv6 can be fixed like this instead:

-	if (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG))
-		psd_hdr.len = 0;
-	else
-		psd_hdr.len = ipv6_hdr->payload_len;
+	psd_hdr.len = (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) ?
+		0 : ipv6_hdr->payload_len;

> --
> 2.39.5 (Apple Git-154)
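Coming back to the ipv6 suggestion above, an untested sketch of how the
relevant part of rte_ipv6_phdr_cksum() would then read; only the len
assignment changes, and since both arms of the ternary assign len, the
struct should no longer need the {0} initializer to silence the warning:

	struct {
		rte_be32_t len;   /* L4 length. */
		rte_be32_t proto; /* L4 protocol - top 3 bytes must be zero */
	} psd_hdr;

	psd_hdr.proto = (uint32_t)(ipv6_hdr->proto << 24);
	/* Both arms of the ternary assign len, so psd_hdr is fully
	 * initialized on every path before it is checksummed. */
	psd_hdr.len = (ol_flags & (RTE_MBUF_F_TX_TCP_SEG | RTE_MBUF_F_TX_UDP_SEG)) ?
		0 : ipv6_hdr->payload_len;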