From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id ED99D489A6; Wed, 22 Oct 2025 16:14:49 +0200 (CEST) Received: from mails.dpdk.org (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id D16B740262; Wed, 22 Oct 2025 16:14:49 +0200 (CEST) Received: from dkmailrelay1.smartsharesystems.com (smartserver.smartsharesystems.com [77.243.40.215]) by mails.dpdk.org (Postfix) with ESMTP id E43D94025E for ; Wed, 22 Oct 2025 16:14:48 +0200 (CEST) Received: from smartserver.smartsharesystems.com (smartserver.smartsharesys.local [192.168.4.10]) by dkmailrelay1.smartsharesystems.com (Postfix) with ESMTP id ACCA62041E; Wed, 22 Oct 2025 16:14:48 +0200 (CEST) Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Subject: RE: [PATCH v2] mbuf: optimize segment prefree Date: Wed, 22 Oct 2025 16:14:46 +0200 Message-ID: <98CBD80474FA8B44BF855DF32C47DC35F654EF@smartserver.smartshare.dk> X-MimeOLE: Produced By Microsoft Exchange V6.5 In-Reply-To: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: [PATCH v2] mbuf: optimize segment prefree Thread-Index: AdxDXf6IdtlsiNUFRyS1jxcsf5+PNgAABo8A References: <20250827213535.21602-1-mb@smartsharesystems.com> <20251020120202.80114-1-mb@smartsharesystems.com> <98CBD80474FA8B44BF855DF32C47DC35F654EE@smartserver.smartshare.dk> From: =?iso-8859-1?Q?Morten_Br=F8rup?= To: "Bruce Richardson" Cc: "Konstantin Ananyev" , "Chengwen Feng" , , "Stephen Hemminger" , "Wathsala Vithanage" X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org > From: Bruce Richardson [mailto:bruce.richardson@intel.com] > Sent: Wednesday, 22 October 2025 16.12 >=20 > On Wed, Oct 22, 2025 at 03:53:21PM +0200, Morten Br=F8rup wrote: > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com] > > > Sent: Wednesday, 22 October 2025 11.08 > > > > > > On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Br=F8rup wrote: > > > > Refactored rte_pktmbuf_prefree_seg() for both performance and > > > readability. > > > > > > > > With the optimized RTE_MBUF_DIRECT() macro, the common likely > code > > > path > > > > now fits within one instruction cache line on x86-64 when built > with > > > GCC. > > > > > > > > Signed-off-by: Morten Br=F8rup > > > > > > Reviewed-by: Bruce Richardson > > > > > > > [...] > > > > > > #define RTE_MBUF_DIRECT(mb) \ > > > > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | > RTE_MBUF_F_EXTERNAL))) > > > > > > > > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86) > > > > +/* Optimization for code size. > > > > + * GCC only optimizes single-bit MSB tests this way, so we do = it > by > > > hand with multi-bit. > > > > + * > > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are > both in > > > the MSB of the > > > > + * 64-bit ol_flags field, so we only compare this one byte > instead > > > of all 64 bits. > > > > + * On little endian architecture, the MSB of a 64-bit integer = is > at > > > byte offest 7. > > > > + * > > > > + * Note: Tested using GCC version 16.0.0 20251019 > (experimental). > > > > + * > > > > + * Without this optimization, GCC generates 17 bytes of > > > instructions: > > > > + * movabs rax,0x6000000000000000 // 10 bytes > > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes > > > > + * sete al // 3 bytes > > > > + * With this optimization, GCC generates only 7 bytes of > > > instructions: > > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes > > > > + * sete al // 3 bytes > > > > + */ > > > > +#undef RTE_MBUF_DIRECT > > > > +#define RTE_MBUF_DIRECT(mb) \ > > > > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, > ol_flags) + > > > 7] & \ > > > > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 > * > > > 8)))) > > > > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> = (7 > * > > > 8)) << (7 * 8) =3D=3D > > > > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL), > > > > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in > > > MSB."); > > > > +#endif > > > > + > > > Couple of comments/thoughts/questions here. > > > > > > * This looks like a compiler limitation that should be fixed in > GCC. IF > > > we > > > put this optimization in, how will we know when/if we can remove > it > > > again > > > in future? I'm not sure we want this hanging around forever. > > > > Agree. > > There are plenty of hand crafted optimizations in DPDK, which are > already obsolete; > > it seems no one has found a good way of identifying them. Including > myself. > > > > > * Can the static_assert - which just checks flags are in the MSB - > be > > > * simplified to e.g. > > > "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) = =3D=3D 0" > > > or "__builtin_ctzll(...) > (7 * CHAR_BIT)" > > > * As in prev bullet, I tend to prefer use of CHAR_BIT over hard- > coded > > > 8. > > > > In v3, I have simplified both the static_assert and the optimized > macro as you suggested on Slack, > > with some minor improvements. > > > > > * Is it necessary to limit this to just GCC and x86? If it leads = to > the > > > best code on x86, why not include for all compilers? What about > non- > > > x86 > > > LE platforms? > > > > I had already tested ARM64, where it didn't make a difference; now I > have added a note about it. > > I also tested ARM32, which doesn't benefit either, but I didn't add = a > note about it. > > I also tested Loongarch (on Godbolt), which does benefit from it, so > I added it. > > > > Now, as I'm writing this email, Godbolt shows that RISC-V and POWER > could also benefit. > > Maybe we should just replace the standard macro with the optimized > macro. WDYT? > > > I think that's not a bad idea. At least everything would be = consistent. I'll post a v4. Then we can go back to v3 if it looks too weird. >=20 > /Bruce