* Re: [PATCH] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
@ 2025-08-27 23:17 ` Stephen Hemminger
2025-10-06 17:46 ` Wathsala Vithanage
2025-10-06 14:49 ` Morten Brørup
` (6 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Stephen Hemminger @ 2025-08-27 23:17 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev
On Wed, 27 Aug 2025 21:35:34 +0000
Morten Brørup <mb@smartsharesystems.com> wrote:
>
> +/* GCC only optimizes single-bit MSB tests this way, so do it by hand with multi-bit. */
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
> +#endif
Complex enough expression that I would prefer this be an inline function
with some more temporary variables and more comments.
Like the magic 7 for mask??
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mbuf: optimize segment prefree
2025-08-27 23:17 ` Stephen Hemminger
@ 2025-10-06 17:46 ` Wathsala Vithanage
2025-10-06 18:26 ` Morten Brørup
0 siblings, 1 reply; 29+ messages in thread
From: Wathsala Vithanage @ 2025-10-06 17:46 UTC (permalink / raw)
To: dev
On 8/27/25 18:17, Stephen Hemminger wrote:
> On Wed, 27 Aug 2025 21:35:34 +0000
> Morten Brørup <mb@smartsharesystems.com> wrote:
>
>>
>> +/* GCC only optimizes single-bit MSB tests this way, so do it by hand with multi-bit. */
>> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
>> +#undef RTE_MBUF_DIRECT
>> +#define RTE_MBUF_DIRECT(mb) \
>> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
>> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
>> +#endif
> Complex enough expression that I would prefer this be an inline function
> with some more temporary variables and more comments.
> Like the magic 7 for mask??
+1
--wathsala
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH] mbuf: optimize segment prefree
2025-10-06 17:46 ` Wathsala Vithanage
@ 2025-10-06 18:26 ` Morten Brørup
0 siblings, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-06 18:26 UTC (permalink / raw)
To: Wathsala Vithanage, stephen; +Cc: dev
> From: Wathsala Vithanage [mailto:wathsala.vithanage@arm.com]
> Sent: Monday, 6 October 2025 19.46
>
> On 8/27/25 18:17, Stephen Hemminger wrote:
> > On Wed, 27 Aug 2025 21:35:34 +0000
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> >>
> >> +/* GCC only optimizes single-bit MSB tests this way, so do it by hand with multi-bit. */
> >> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> >> +#undef RTE_MBUF_DIRECT
> >> +#define RTE_MBUF_DIRECT(mb) \
> >> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> >> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
> >> +#endif
> > Complex enough expression that I would prefer this be an inline function
> > with some more temporary variables and more comments.
> > Like the magic 7 for mask??
>
> +1
So, instead of overriding the macro definition in the GCC exception case, you prefer something like:
#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
static __rte_always_inline bool
RTE_MBUF_DIRECT(const struct rte_mbuf * const mb)
{...}
#else
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
#endif
which would give us a function with a macro-like (upper case) name.
How about I just add a more detailed description to this macro?
After all, formally, it's an exception to the simple default macro.
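For illustration, a filled-in version of that function could look like the following (derived directly from the optimized macro above; not compile-tested):

#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
static __rte_always_inline bool
RTE_MBUF_DIRECT(const struct rte_mbuf * const mb)
{
	/* On little endian, byte offset 7 holds the MSB of the 64-bit ol_flags. */
	const uint8_t ol_flags_msb =
		((const uint8_t *)mb)[offsetof(struct rte_mbuf, ol_flags) + 7];
	/* Both flags live in that byte; shift the 64-bit mask down to one byte. */
	const uint8_t mask =
		(uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56);

	return !(ol_flags_msb & mask);
}
#endif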
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
2025-08-27 23:17 ` Stephen Hemminger
@ 2025-10-06 14:49 ` Morten Brørup
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
` (5 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-06 14:49 UTC (permalink / raw)
To: dev
PING for review.
Only received feedback from Stephen about the GCC-optimized RTE_MBUF_DIRECT() macro being complex, and that it could be an inline function instead.
Venlig hilsen / Kind regards,
-Morten Brørup
> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 27 August 2025 23.36
> To: dev@dpdk.org
> Cc: Morten Brørup
> Subject: [PATCH] mbuf: optimize segment prefree
>
> Refactored rte_pktmbuf_prefree_seg() for both performance and
> readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> lib/mbuf/rte_mbuf.h | 52 ++++++++++++++++------------------------
> lib/mbuf/rte_mbuf_core.h | 8 +++++++
> 2 files changed, 28 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 06ab7502a5..f4a348597f 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1423,44 +1424,31 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one &&
> + __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..a5242274d7 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -715,6 +715,14 @@ struct rte_mbuf_ext_shared_info {
> #define RTE_MBUF_DIRECT(mb) \
> (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
>
> +/* GCC only optimizes single-bit MSB tests this way, so do it by hand with multi-bit. */
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
> +#endif
> +
> /** Uninitialized or unspecified port. */
> #define RTE_MBUF_PORT_INVALID UINT16_MAX
> /** For backwards compatibility. */
> --
> 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v2] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
2025-08-27 23:17 ` Stephen Hemminger
2025-10-06 14:49 ` Morten Brørup
@ 2025-10-20 12:02 ` Morten Brørup
2025-10-20 14:24 ` Konstantin Ananyev
` (2 more replies)
2025-10-22 13:23 ` [PATCH v3] " Morten Brørup
` (4 subsequent siblings)
7 siblings, 3 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-20 12:02 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage; +Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 27 +++++++++++++++++++++
2 files changed, 46 insertions(+), 32 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..41f40e1967 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -715,6 +715,33 @@ struct rte_mbuf_ext_shared_info {
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
+/* Optimization for code size.
+ * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
+ * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
+ *
+ * Note: Tested using GCC version 16.0.0 20251019 (experimental).
+ *
+ * Without this optimization, GCC generates 17 bytes of instructions:
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * With this optimization, GCC generates only 7 bytes of instructions:
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ */
+#undef RTE_MBUF_DIRECT
+#define RTE_MBUF_DIRECT(mb) \
+ (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
+ (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
+static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
+ (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
+ "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
+#endif
+
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
/** For backwards compatibility. */
--
2.43.0
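For reference, the codegen difference described in the comment can be reproduced outside DPDK with a minimal stand-alone snippet (hypothetical test file; built with gcc -O2 on x86-64):

#include <stdint.h>

/* Full 64-bit test: compiles to the movabs + and + sete sequence shown above. */
int is_direct_wide(const uint64_t *ol_flags)
{
	return !(*ol_flags & 0x6000000000000000ULL);
}

/* Single-byte test of the MSB: compiles to the test BYTE PTR + sete sequence. */
int is_direct_narrow(const uint8_t *ol_flags)
{
	return !(ol_flags[7] & 0x60);
}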
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v2] mbuf: optimize segment prefree
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
@ 2025-10-20 14:24 ` Konstantin Ananyev
2025-10-21 8:38 ` fengchengwen
2025-10-22 9:08 ` Bruce Richardson
2 siblings, 0 replies; 29+ messages in thread
From: Konstantin Ananyev @ 2025-10-20 14:24 UTC (permalink / raw)
To: Morten Brørup, dev, Stephen Hemminger, Wathsala Vithanage
>
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> v2:
> * Fixed typo in commit description.
> * Fixed indentation.
> * Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
> (Stephen Hemminger)
> * Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
> macro is valid, specifically that the tested bits are in the MSB of the
> 64-bit field.
> ---
> lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
> lib/mbuf/rte_mbuf_core.h | 27 +++++++++++++++++++++
> 2 files changed, 46 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 3df22125de..2004391f57 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> > @@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..41f40e1967 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -715,6 +715,33 @@ struct rte_mbuf_ext_shared_info {
> #define RTE_MBUF_DIRECT(mb) \
> > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
>
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +/* Optimization for code size.
> > + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> > + *
> > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> > + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> > + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> + *
> + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> + *
> + * Without this optimization, GCC generates 17 bytes of instructions:
> + * movabs rax,0x6000000000000000 // 10 bytes
> + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> + * sete al // 3 bytes
> + * With this optimization, GCC generates only 7 bytes of instructions:
> + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> + * sete al // 3 bytes
> + */
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
> > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
> > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
> +#endif
> +
> /** Uninitialized or unspecified port. */
> #define RTE_MBUF_PORT_INVALID UINT16_MAX
> /** For backwards compatibility. */
> --
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] mbuf: optimize segment prefree
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
2025-10-20 14:24 ` Konstantin Ananyev
@ 2025-10-21 8:38 ` fengchengwen
2025-10-22 9:08 ` Bruce Richardson
2 siblings, 0 replies; 29+ messages in thread
From: fengchengwen @ 2025-10-21 8:38 UTC (permalink / raw)
To: Morten Brørup, dev, Stephen Hemminger, Wathsala Vithanage
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
On 10/20/2025 8:02 PM, Morten Brørup wrote:
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
Learned a lot, thanks
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] mbuf: optimize segment prefree
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
2025-10-20 14:24 ` Konstantin Ananyev
2025-10-21 8:38 ` fengchengwen
@ 2025-10-22 9:08 ` Bruce Richardson
2025-10-22 13:53 ` Morten Brørup
2 siblings, 1 reply; 29+ messages in thread
From: Bruce Richardson @ 2025-10-22 9:08 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, Stephen Hemminger, Wathsala Vithanage
On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
Comments inline below.
> ---
> v2:
> * Fixed typo in commit description.
> * Fixed indentation.
> * Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
> (Stephen Hemminger)
> * Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
> macro is valid, specifically that the tested bits are in the MSB of the
> 64-bit field.
> ---
> lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
> lib/mbuf/rte_mbuf_core.h | 27 +++++++++++++++++++++
> 2 files changed, 46 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 3df22125de..2004391f57 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
Nice refactor, much more readable, thanks.
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..41f40e1967 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -715,6 +715,33 @@ struct rte_mbuf_ext_shared_info {
> #define RTE_MBUF_DIRECT(mb) \
> (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
>
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +/* Optimization for code size.
> + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> + *
> + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> + *
> + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> + *
> + * Without this optimization, GCC generates 17 bytes of instructions:
> + * movabs rax,0x6000000000000000 // 10 bytes
> + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> + * sete al // 3 bytes
> + * With this optimization, GCC generates only 7 bytes of instructions:
> + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> + * sete al // 3 bytes
> + */
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
> +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
> + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
> +#endif
> +
Couple of comments/thoughts/questions here.
* This looks like a compiler limitation that should be fixed in GCC. If we
  put this optimization in, how will we know when/if we can remove it again
  in future? I'm not sure we want this hanging around forever.
* Can the static_assert - which just checks flags are in the MSB - be
  simplified to e.g.
  "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
  or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
* As in prev bullet, I tend to prefer use of CHAR_BIT over hard-coded 8.
* Is it necessary to limit this to just GCC and x86? If it leads to the
best code on x86, why not include for all compilers? What about non-x86
LE platforms?
* Does the actual macro need to be that long and complex? If we simplify a
bit, does the compiler go back to generating bad code? For example:
using "(mb->ol_flags >> 56) & ((RTE_MBUF_F_INDIRECT | ..) >> 56)"
* If the above is true, do we need to actually put this in assembler to
  guarantee the compiler generates good code in all situations?
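To make the shift-based suggestion concrete, a compilable sketch could be (untested; whether GCC then still emits the single-byte test is exactly the open question above):

#define RTE_MBUF_DIRECT(mb) \
	(!(((mb)->ol_flags >> 56) & \
	((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))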
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v2] mbuf: optimize segment prefree
2025-10-22 9:08 ` Bruce Richardson
@ 2025-10-22 13:53 ` Morten Brørup
2025-10-22 14:12 ` Bruce Richardson
0 siblings, 1 reply; 29+ messages in thread
From: Morten Brørup @ 2025-10-22 13:53 UTC (permalink / raw)
To: Bruce Richardson, Konstantin Ananyev, Chengwen Feng
Cc: dev, Stephen Hemminger, Wathsala Vithanage
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 22 October 2025 11.08
>
> On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> > Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
> >
> > With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> > now fits within one instruction cache line on x86-64 when built with GCC.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>
> Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
>
[...]
> > #define RTE_MBUF_DIRECT(mb) \
> > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> >
> > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> > +/* Optimization for code size.
> > + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> > + *
> > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> > + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> > + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> > + *
> > + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> > + *
> > + * Without this optimization, GCC generates 17 bytes of instructions:
> > + * movabs rax,0x6000000000000000 // 10 bytes
> > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > + * sete al // 3 bytes
> > + * With this optimization, GCC generates only 7 bytes of instructions:
> > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > + * sete al // 3 bytes
> > + */
> > +#undef RTE_MBUF_DIRECT
> > +#define RTE_MBUF_DIRECT(mb) \
> > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
> > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
> > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
> > +#endif
> > +
> Couple of comments/thoughts/questions here.
>
> * This looks like a compiler limitation that should be fixed in GCC. If we
>   put this optimization in, how will we know when/if we can remove it again
>   in future? I'm not sure we want this hanging around forever.
Agree.
There are plenty of hand-crafted optimizations in DPDK which are already obsolete;
it seems no one, myself included, has found a good way of identifying them.
> * Can the static_assert - which just checks flags are in the MSB - be
>   simplified to e.g.
>   "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
>   or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
> * As in prev bullet, I tend to prefer use of CHAR_BIT over hard-coded 8.
In v3, I have simplified both the static_assert and the optimized macro as you suggested on Slack,
with some minor improvements.
> * Is it necessary to limit this to just GCC and x86? If it leads to the
>   best code on x86, why not include for all compilers? What about
>   non-x86 LE platforms?
I had already tested ARM64, where it didn't make a difference; now I have added a note about it.
I also tested ARM32, which doesn't benefit either, but I didn't add a note about it.
I also tested Loongarch (on Godbolt), which does benefit from it, so I added it.
Now, as I'm writing this email, Godbolt shows that RISC-V and POWER could also benefit.
Maybe we should just replace the standard macro with the optimized macro. WDYT?
> * Does the actual macro need to be that long and complex? If we simplify a
>   bit, does the compiler go back to generating bad code? For example:
>   using "(mb->ol_flags >> 56) & ((RTE_MBUF_F_INDIRECT | ..) >> 56)"
Simplified in v3.
> * If the above is true, do we need to actually put this in assembler to
>   guarantee the compiler generates good code in all situations?
No need for assembler.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v2] mbuf: optimize segment prefree
2025-10-22 13:53 ` Morten Brørup
@ 2025-10-22 14:12 ` Bruce Richardson
2025-10-22 14:14 ` Morten Brørup
0 siblings, 1 reply; 29+ messages in thread
From: Bruce Richardson @ 2025-10-22 14:12 UTC (permalink / raw)
To: Morten Brørup
Cc: Konstantin Ananyev, Chengwen Feng, dev, Stephen Hemminger,
Wathsala Vithanage
On Wed, Oct 22, 2025 at 03:53:21PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Wednesday, 22 October 2025 11.08
> >
> > On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> > > Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
> > >
> > > With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> > > now fits within one instruction cache line on x86-64 when built with GCC.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >
> > Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> >
>
> [...]
>
> > > #define RTE_MBUF_DIRECT(mb) \
> > > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > >
> > > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> > > +/* Optimization for code size.
> > > + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> > > + *
> > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> > > + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> > > + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> > > + *
> > > + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> > > + *
> > > + * Without this optimization, GCC generates 17 bytes of instructions:
> > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > + * sete al // 3 bytes
> > > + * With this optimization, GCC generates only 7 bytes of instructions:
> > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > + * sete al // 3 bytes
> > > + */
> > > +#undef RTE_MBUF_DIRECT
> > > +#define RTE_MBUF_DIRECT(mb) \
> > > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> > > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
> > > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
> > > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
> > > +#endif
> > > +
> > Couple of comments/thoughts/questions here.
> >
> > > * This looks like a compiler limitation that should be fixed in GCC. If we
> > >   put this optimization in, how will we know when/if we can remove it again
> > >   in future? I'm not sure we want this hanging around forever.
>
> Agree.
> > There are plenty of hand-crafted optimizations in DPDK which are already obsolete;
> > it seems no one, myself included, has found a good way of identifying them.
>
> > > * Can the static_assert - which just checks flags are in the MSB - be
> > >   simplified to e.g.
> > >   "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
> > >   or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
> > > * As in prev bullet, I tend to prefer use of CHAR_BIT over hard-coded 8.
>
> In v3, I have simplified both the static_assert and the optimized macro as you suggested on Slack,
> with some minor improvements.
>
> > > * Is it necessary to limit this to just GCC and x86? If it leads to the
> > >   best code on x86, why not include for all compilers? What about
> > >   non-x86 LE platforms?
>
> I had already tested ARM64, where it didn't make a difference; now I have added a note about it.
> > I also tested ARM32, which doesn't benefit either, but I didn't add a note about it.
> > I also tested Loongarch (on Godbolt), which does benefit from it, so I added it.
> Maybe we should just replace the standard macro with the optimized macro. WDYT?
>
I think that's not a bad idea. At least everything would be consistent.
/Bruce
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v2] mbuf: optimize segment prefree
2025-10-22 14:12 ` Bruce Richardson
@ 2025-10-22 14:14 ` Morten Brørup
0 siblings, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-22 14:14 UTC (permalink / raw)
To: Bruce Richardson
Cc: Konstantin Ananyev, Chengwen Feng, dev, Stephen Hemminger,
Wathsala Vithanage
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 22 October 2025 16.12
>
> On Wed, Oct 22, 2025 at 03:53:21PM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Wednesday, 22 October 2025 11.08
> > >
> > > On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> > > > Refactored rte_pktmbuf_prefree_seg() for both performance and
> > > readability.
> > > >
> > > > With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> > > > now fits within one instruction cache line on x86-64 when built with GCC.
> > > >
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> > >
> >
> > [...]
> >
> > > > #define RTE_MBUF_DIRECT(mb) \
> > > > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> RTE_MBUF_F_EXTERNAL)))
> > > >
> > > > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> > > > +/* Optimization for code size.
> > > > + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> > > > + *
> > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> > > > + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> > > > + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> > > > + *
> > > > + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> > > > + *
> > > > + * Without this optimization, GCC generates 17 bytes of instructions:
> > > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > > + * sete al // 3 bytes
> > > > + * With this optimization, GCC generates only 7 bytes of instructions:
> > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > + * sete al // 3 bytes
> > > > + */
> > > > +#undef RTE_MBUF_DIRECT
> > > > +#define RTE_MBUF_DIRECT(mb) \
> > > > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> > > > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
> > > > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
> > > > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > > > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
> > > > +#endif
> > > > +
> > > Couple of comments/thoughts/questions here.
> > >
> > > > * This looks like a compiler limitation that should be fixed in GCC. If we
> > > >   put this optimization in, how will we know when/if we can remove it again
> > > >   in future? I'm not sure we want this hanging around forever.
> >
> > Agree.
> > > There are plenty of hand-crafted optimizations in DPDK which are already obsolete;
> > > it seems no one, myself included, has found a good way of identifying them.
> >
> > > > * Can the static_assert - which just checks flags are in the MSB - be
> > > >   simplified to e.g.
> > > >   "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
> > > >   or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
> > > > * As in prev bullet, I tend to prefer use of CHAR_BIT over hard-coded 8.
> >
> > In v3, I have simplified both the static_assert and the optimized macro as you suggested on Slack,
> > with some minor improvements.
> >
> > > > * Is it necessary to limit this to just GCC and x86? If it leads to the
> > > >   best code on x86, why not include for all compilers? What about
> > > >   non-x86 LE platforms?
> >
> > I had already tested ARM64, where it didn't make a difference; now I have added a note about it.
> > I also tested ARM32, which doesn't benefit either, but I didn't add a note about it.
> > I also tested Loongarch (on Godbolt), which does benefit from it, so I added it.
> >
> > Now, as I'm writing this email, Godbolt shows that RISC-V and POWER could also benefit.
> > Maybe we should just replace the standard macro with the optimized macro. WDYT?
> >
> I think that's not a bad idea. At least everything would be consistent.
I'll post a v4. Then we can go back to v3 if it looks too weird.
>
> /Bruce
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v3] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
` (2 preceding siblings ...)
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
@ 2025-10-22 13:23 ` Morten Brørup
2025-10-22 14:47 ` [PATCH v4] " Morten Brørup
` (3 subsequent siblings)
7 siblings, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-22 13:23 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng, Bruce Richardson
Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
---
v3:
* Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
numerical value instead of long names. (Bruce Richardson)
* Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 40 +++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+), 32 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..7c0bddc362 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -715,6 +715,46 @@ struct rte_mbuf_ext_shared_info {
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+#if defined(RTE_TOOLCHAIN_GCC) && (defined(RTE_ARCH_X86) || defined(RTE_ARCH_LOONGARCH))
+/* Optimization for code size.
+ * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
+ * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
+ *
+ * Note: Tested on x86-64 using GCC version 16.0.0 20251019 (experimental).
+ *
+ * Without this optimization, GCC generates 17 bytes of instructions:
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * With this optimization, GCC generates only 7 bytes of instructions:
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ *
+ * Note: Tested on loongarch using GCC version 15.2.0.
+ *
+ * Without this optimization, GCC generates 5 instructions:
+ * ld.d $a0, $a0, 24
+ * move $t0, $zero
+ * lu52i.d $t0, $t0, 1536
+ * and $a0, $a0, $t0
+ * sltui $a0, $a0, 1
+ * With this optimization, GCC generates only 3 instructions:
+ * ld.bu $a0, $a0, 31
+ * andi $a0, $a0, 0x60
+ * sltui $a0, $a0, 1
+ *
+ * Note: Tested on ARM64; GCC generates 3 instructions both without and with the optimization,
+ * so there is no need for the optimization there.
+ */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 7");
+#undef RTE_MBUF_DIRECT
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
+#endif
+
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
/** For backwards compatibility. */
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v4] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
` (3 preceding siblings ...)
2025-10-22 13:23 ` [PATCH v3] " Morten Brørup
@ 2025-10-22 14:47 ` Morten Brørup
2025-10-22 15:02 ` Bruce Richardson
2025-10-23 8:01 ` [PATCH v5] " Morten Brørup
` (2 subsequent siblings)
7 siblings, 1 reply; 29+ messages in thread
From: Morten Brørup @ 2025-10-22 14:47 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng, Bruce Richardson
Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
---
v4:
* Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
architectures.
v3:
* Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
numerical value instead of long names. (Bruce Richardson)
* Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 47 ++++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+), 32 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..37f2975158 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -706,6 +706,7 @@ struct rte_mbuf_ext_shared_info {
*/
#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & RTE_MBUF_F_EXTERNAL)
+#if !defined(RTE_TOOLCHAIN_GCC) || defined __DOXYGEN__
/**
* Returns TRUE if given mbuf is direct, or FALSE otherwise.
*
@@ -714,6 +715,52 @@ struct rte_mbuf_ext_shared_info {
*/
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+#else /* RTE_TOOLCHAIN_GCC */
+#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
+/* Macro optimized for code size.
+ * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
+ * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
+ *
+ * Note: Tested on x86-64 using GCC version 16.0.0 20251019 (experimental).
+ *
+ * Without this optimization, GCC generates 17 bytes of instructions:
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * With this optimization, GCC generates only 7 bytes of instructions:
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ *
+ * Note: Tested on loongarch using GCC version 15.2.0.
+ *
+ * Without this optimization, GCC generates 5 instructions:
+ * ld.d $a0, $a0, 24
+ * move $t0, $zero
+ * lu52i.d $t0, $t0, 1536
+ * and $a0, $a0, $t0
+ * sltui $a0, $a0, 1
+ * With this optimization, GCC generates only 3 instructions:
+ * ld.bu $a0, $a0, 31
+ * andi $a0, $a0, 0x60
+ * sltui $a0, $a0, 1
+ *
+ * Note: GCC also generates smaller code size with the optimized macro on many other architectures.
+ *
+ * Note: GCC generates the same code size as with the plain macro on ARM (64 and 32 bit).
+ */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 7");
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
+#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
+/* As described above; but on big endian architecture, the MSB is at byte offset 0. */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 0");
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
+#endif /* RTE_BYTE_ORDER */
+#endif /* RTE_TOOLCHAIN_GCC */
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v4] mbuf: optimize segment prefree
2025-10-22 14:47 ` [PATCH v4] " Morten Brørup
@ 2025-10-22 15:02 ` Bruce Richardson
2025-10-22 18:28 ` Morten Brørup
2025-10-23 7:04 ` Morten Brørup
0 siblings, 2 replies; 29+ messages in thread
From: Bruce Richardson @ 2025-10-22 15:02 UTC (permalink / raw)
To: Morten Brørup
Cc: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng
On Wed, Oct 22, 2025 at 02:47:08PM +0000, Morten Brørup wrote:
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
> v4:
> * Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
> architectures.
> v3:
> * Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
> numerical value instead of long names. (Bruce Richardson)
> * Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
> v2:
> * Fixed typo in commit description.
> * Fixed indentation.
> * Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
> (Stephen Hemminger)
> * Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
> macro is valid, specifically that the tested bits are in the MSB of the
> 64-bit field.
> ---
> lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
> lib/mbuf/rte_mbuf_core.h | 47 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 66 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 3df22125de..2004391f57 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..37f2975158 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -706,6 +706,7 @@ struct rte_mbuf_ext_shared_info {
> */
> #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & RTE_MBUF_F_EXTERNAL)
>
> +#if !defined(RTE_TOOLCHAIN_GCC) || defined __DOXYGEN__
> /**
> * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> *
> @@ -714,6 +715,52 @@ struct rte_mbuf_ext_shared_info {
> */
> #define RTE_MBUF_DIRECT(mb) \
> (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> +#else /* RTE_TOOLCHAIN_GCC */
Do we need the non-gcc block?
> +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> +/* Macro optimized for code size.
> + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> + *
> + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> + *
> + * Note: Tested on x86-64 using GCC version 16.0.0 20251019 (experimental).
> + *
> + * Without this optimization, GCC generates 17 bytes of instructions:
> + * movabs rax,0x6000000000000000 // 10 bytes
> + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> + * sete al // 3 bytes
> + * With this optimization, GCC generates only 7 bytes of instructions:
> + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> + * sete al // 3 bytes
> + *
> + * Note: Tested on loongarch using GCC version 15.2.0.
> + *
> + * Without this optimization, GCC generates 5 instructions:
> + * ld.d $a0, $a0, 24
> + * move $t0, $zero
> + * lu52i.d $t0, $t0, 1536
> + * and $a0, $a0, $t0
> + * sltui $a0, $a0, 1
> + * With this optimization, GCC generates only 3 instructions:
> + * ld.bu $a0, $a0, 31
> + * andi $a0, $a0, 0x60
> + * sltui $a0, $a0, 1
> + *
> + * Note: GCC also generates smaller code size with the optimized macro on many other architectures.
> + *
> + * Note: GCC generates the same code size as with the plain macro on ARM (64 and 32 bit).
> + */
> +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
> + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 7");
> +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
> +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> +/* As described above; but on big endian architecture, the MSB is at byte offset 0. */
> +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
> + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 0");
> +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
> +#endif /* RTE_BYTE_ORDER */
Minor nit, the static assert is common, so can be put outside of the endianness
check rather than duplicated. I'd also suggest putting the comment
outside it too, so that the endianness check only covers one line each.
Also, in the comment, showing an example on one arch is probably enough;
showing two sort of implies that you should show all, while if you show
just one, you can clearly state it's just one example.
BTW: if you want to remove the endian ifdefs, you can compute the offset
as e.g.
#define OLF_MSB (7 * (RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN)) // 7 for LE, 0 for BE
#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[OLF_MSB] & 0x60)
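Combined with the shared static_assert, the whole block could then collapse to something like (a sketch, untested):

#define OLF_MSB (7 * (RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN)) /* 7 for LE, 0 for BE */
static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
	"(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 in the MSB");
#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[OLF_MSB] & 0x60)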
> +#endif /* RTE_TOOLCHAIN_GCC */
>
> /** Uninitialized or unspecified port. */
> #define RTE_MBUF_PORT_INVALID UINT16_MAX
> --
> 2.43.0
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v4] mbuf: optimize segment prefree
2025-10-22 15:02 ` Bruce Richardson
@ 2025-10-22 18:28 ` Morten Brørup
2025-10-23 7:04 ` Morten Brørup
1 sibling, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-22 18:28 UTC (permalink / raw)
To: Bruce Richardson
Cc: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 22 October 2025 17.03
>
> On Wed, Oct 22, 2025 at 02:47:08PM +0000, Morten Brørup wrote:
> > Refactored rte_pktmbuf_prefree_seg() for both performance and
> readability.
> >
> > With the optimized RTE_MBUF_DIRECT() macro, the common likely code
> path
> > now fits within one instruction cache line on x86-64 when built with
> GCC.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> > Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> > Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> > v4:
> > * Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
> > architectures.
> > v3:
> > * Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
> > numerical value instead of long names. (Bruce Richardson)
> > * Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
> > v2:
> > * Fixed typo in commit description.
> > * Fixed indentation.
> > * Added detailed description to the optimized RTE_MBUF_DIRECT()
> macro.
> > (Stephen Hemminger)
> > * Added static_assert() to verify that the optimized
> RTE_MBUF_DIRECT()
> > macro is valid, specifically that the tested bits are in the MSB of
> the
> > 64-bit field.
> > ---
> > lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-----------------------
> --
> > lib/mbuf/rte_mbuf_core.h | 47 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 66 insertions(+), 32 deletions(-)
> >
> > diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> > index 3df22125de..2004391f57 100644
> > --- a/lib/mbuf/rte_mbuf.h
> > +++ b/lib/mbuf/rte_mbuf.h
> > @@ -31,6 +31,7 @@
> > * http://www.kohala.com/start/tcpipiv2.html
> > */
> >
> > +#include <stdbool.h>
> > #include <stdint.h>
> >
> > #include <rte_common.h>
> > @@ -1458,44 +1459,30 @@ static inline int
> __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> > static __rte_always_inline struct rte_mbuf *
> > rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> > {
> > - __rte_mbuf_sanity_check(m, 0);
> > -
> > - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> > -
> > - if (!RTE_MBUF_DIRECT(m)) {
> > - rte_pktmbuf_detach(m);
> > - if (RTE_MBUF_HAS_EXTBUF(m) &&
> > - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > - __rte_pktmbuf_pinned_extbuf_decref(m))
> > - return NULL;
> > - }
> > -
> > - if (m->next != NULL)
> > - m->next = NULL;
> > - if (m->nb_segs != 1)
> > - m->nb_segs = 1;
> > + bool refcnt_not_one;
> >
> > - return m;
> > + __rte_mbuf_sanity_check(m, 0);
> >
> > - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> > + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> > + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> > + return NULL;
> >
> > - if (!RTE_MBUF_DIRECT(m)) {
> > - rte_pktmbuf_detach(m);
> > - if (RTE_MBUF_HAS_EXTBUF(m) &&
> > - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > - __rte_pktmbuf_pinned_extbuf_decref(m))
> > - return NULL;
> > - }
> > + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> > + rte_pktmbuf_detach(m);
> > + if (RTE_MBUF_HAS_EXTBUF(m) &&
> > + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> > + __rte_pktmbuf_pinned_extbuf_decref(m))
> > + return NULL;
> > + }
> >
> > - if (m->next != NULL)
> > - m->next = NULL;
> > - if (m->nb_segs != 1)
> > - m->nb_segs = 1;
> > + if (refcnt_not_one)
> > rte_mbuf_refcnt_set(m, 1);
> > + if (m->nb_segs != 1)
> > + m->nb_segs = 1;
> > + if (m->next != NULL)
> > + m->next = NULL;
> >
> > - return m;
> > - }
> > - return NULL;
> > + return m;
> > }
> >
> > /**
> > diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> > index a0df265b5d..37f2975158 100644
> > --- a/lib/mbuf/rte_mbuf_core.h
> > +++ b/lib/mbuf/rte_mbuf_core.h
> > @@ -706,6 +706,7 @@ struct rte_mbuf_ext_shared_info {
> > */
> > #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags &
> RTE_MBUF_F_EXTERNAL)
> >
> > +#if !defined(RTE_TOOLCHAIN_GCC) || defined __DOXYGEN__
> > /**
> > * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> > *
> > @@ -714,6 +715,52 @@ struct rte_mbuf_ext_shared_info {
> > */
> > #define RTE_MBUF_DIRECT(mb) \
> > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > +#else /* RTE_TOOLCHAIN_GCC */
>
> Do we need the non-gcc block?
>
> > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > +/* Macro optimized for code size.
> > + * GCC only optimizes single-bit MSB tests this way, so we do it by
> hand with multi-bit.
> > + *
> > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in
> the MSB of the
> > + * 64-bit ol_flags field, so we only compare this one byte instead
> of all 64 bits.
> > + * On little endian architecture, the MSB of a 64-bit integer is at
> byte offset 7.
> > + *
> > + * Note: Tested on x86-64 using GCC version 16.0.0 20251019
> (experimental).
> > + *
> > + * Without this optimization, GCC generates 17 bytes of
> instructions:
> > + * movabs rax,0x6000000000000000 // 10 bytes
> > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > + * sete al // 3 bytes
> > + * With this optimization, GCC generates only 7 bytes of
> instructions:
> > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > + * sete al // 3 bytes
> > + *
> > + * Note: Tested on loongarch using GCC version 15.2.0.
> > + *
> > + * Without this optimization, GCC generates 5 instructions:
> > + * ld.d $a0, $a0, 24
> > + * move $t0, $zero
> > + * lu52i.d $t0, $t0, 1536
> > + * and $a0, $a0, $t0
> > + * sltui $a0, $a0, 1
> > + * With this optimization, GCC generates only 3 instructions:
> > + * ld.bu $a0, $a0, 31
> > + * andi $a0, $a0, 0x60
> > + * sltui $a0, $a0, 1
> > + *
> > + * Note: GCC also generates smaller code size with the optimized
> macro on many other architectures.
> > + *
> > + * Note: GCC generates the same code size as with the plain macro on
> ARM (64 and 32 bit).
> > + */
> > +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) ==
> UINT64_C(0x60) << (7 * CHAR_BIT),
> > + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte
> offset 7");
> > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> >ol_flags))[7] & 0x60)
> > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > +/* As described above; but on big endian architecture, the MSB is at
> byte offset 0. */
> > +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) ==
> UINT64_C(0x60) << (7 * CHAR_BIT),
> > + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte
> offset 0");
> > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> >ol_flags))[0] & 0x60)
> > +#endif /* RTE_BYTE_ORDER */
>
> Minor nit, the static assert is common, so can be put out of the
> endianness
> check rather than duplicated.
I considered a generic common static_assert, but I prefer individual ones with specific error texts (byte offset 7 or 0).
> I'd also suggest putting the comment
> outside it too, so that the endianness check only covers one line each.
Good idea.
While doing this, I'll see how it looks with a common static_assert and individual ones.
> Also, in the comment, showing an example on one arch is probably
> enough,
> showing two sort of implies that you should show all, while if you show
> just one, you can clearly state it's just one example.
Agree. Less is more.
>
> BTW: if you want to remove the endian ifdefs, you can compute the
> offset
> as e.g.
>
> #define OLF_MSB (7 * (RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN)) // 7 for
> LE, 0 for BE
> #define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> >ol_flags))[OLF_MSB] & 0x60)
I don't think it improves source code readability, so I'm not going to do it.
Also, it's a public header file, so OLF_MSB would need some obscure (properly namespaced) name. Alternatively, it could reside in rte_common.h with a generic name for similar purposes.
>
>
> > +#endif /* RTE_TOOLCHAIN_GCC */
> >
> > /** Uninitialized or unspecified port. */
> > #define RTE_MBUF_PORT_INVALID UINT16_MAX
> > --
> > 2.43.0
> >
Will post a v5.
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v4] mbuf: optimize segment prefree
2025-10-22 15:02 ` Bruce Richardson
2025-10-22 18:28 ` Morten Brørup
@ 2025-10-23 7:04 ` Morten Brørup
1 sibling, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 7:04 UTC (permalink / raw)
To: Bruce Richardson
Cc: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng
> > +#if !defined(RTE_TOOLCHAIN_GCC) || defined __DOXYGEN__
> > /**
> > * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> > *
> > @@ -714,6 +715,52 @@ struct rte_mbuf_ext_shared_info {
> > */
> > #define RTE_MBUF_DIRECT(mb) \
> > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > +#else /* RTE_TOOLCHAIN_GCC */
>
> Do we need the non-gcc block?
I have now tested with a bunch of compilers on Godbolt, and it seems only clang doesn't need this optimization, so I'll remove the compiler check and make the optimized variant the only one.
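For anyone who wants to repeat the comparison, a minimal Godbolt-style harness might look like this (names are illustrative; the byte index assumes little endian, and the mask matches the usual bit positions of the two flags):

#include <stdbool.h>
#include <stdint.h>

/* RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL, i.e. bits 62 and 61. */
#define FLAGS (UINT64_C(0x60) << 56)

/* Plain variant: full 64-bit mask test. */
bool direct_plain(const uint64_t *ol_flags) { return !(*ol_flags & FLAGS); }

/* Optimized variant: probe only the most significant byte. */
bool direct_byte(const uint64_t *ol_flags) { return !(((const char *)ol_flags)[7] & 0x60); }

Compile both at -O2 with each compiler/target and compare the emitted code size.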
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v5] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
` (4 preceding siblings ...)
2025-10-22 14:47 ` [PATCH v4] " Morten Brørup
@ 2025-10-23 8:01 ` Morten Brørup
2025-10-23 8:08 ` Bruce Richardson
2025-10-23 8:51 ` Konstantin Ananyev
2025-10-23 12:48 ` [PATCH v6] " Morten Brørup
2025-10-23 16:18 ` [PATCH v7] " Morten Brørup
7 siblings, 2 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 8:01 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng, Bruce Richardson
Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
---
v5:
* Removed the plain RTE_MBUF_DIRECT() macro, and only use the optimized
variant. (Bruce Richardson)
Further testing on Godbolt confirmed that other compilers benefit from
the optimized macro too.
* Shortened the description of the RTE_MBUF_DIRECT() macro, and only
provide one example of code emitted by a compiler. (Bruce Richardson)
* Consolidated the static_assert() into one, covering both little and big
endian.
This also reduces the amount of endian-conditional source code and
improves readability.
(Bruce Richardson)
* Added comment about MSB meaning "most significant byte".
v4:
* Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
architectures.
v3:
* Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
numerical value instead of long names. (Bruce Richardson)
* Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 33 +++++++++++++++++++++++---
2 files changed, 49 insertions(+), 35 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..1dfeab0511 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -711,9 +711,36 @@ struct rte_mbuf_ext_shared_info {
*
* If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
* can be defined as a direct mbuf.
- */
-#define RTE_MBUF_DIRECT(mb) \
- (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+ *
+ * Note: Macro optimized for code size.
+ *
+ * The plain macro would be:
+ * #define RTE_MBUF_DIRECT(mb) \
+ * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB (most significant
+ * byte) of the 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ *
+ * E.g., GCC version 16.0.0 20251019 (experimental) generates the following code for x86-64.
+ *
+ * With the plain macro, 17 bytes of instructions:
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * With this optimized macro, only 7 bytes of instructions:
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ */
+#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
+/* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
+#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
+/* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
+#endif
+/* Verify the optimization above. */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at MSB");
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
--
2.43.0
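For context, a typical caller of this function looks roughly like the following PMD-style sketch (tx_free_seg is a hypothetical helper name, not part of the patch):

static inline void
tx_free_seg(struct rte_mbuf *m)
{
        m = rte_pktmbuf_prefree_seg(m);
        if (m != NULL)
                rte_mempool_put(m->pool, m); /* last reference; return mbuf to its pool */
}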
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 8:01 ` [PATCH v5] " Morten Brørup
@ 2025-10-23 8:08 ` Bruce Richardson
2025-10-23 8:51 ` Konstantin Ananyev
1 sibling, 0 replies; 29+ messages in thread
From: Bruce Richardson @ 2025-10-23 8:08 UTC (permalink / raw)
To: Morten Brørup
Cc: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng
On Thu, Oct 23, 2025 at 08:01:36AM +0000, Morten Brørup wrote:
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
> v5:
> * Removed the plain RTE_MBUF_DIRECT() macro, and only use the optimized
> variant. (Bruce Richardson)
> Further testing on Godbolt confirmed that other compilers benefit from
> the optimized macro too.
> * Shortened the description of the RTE_MBUF_DIRECT() macro, and only
> provide one example of code emitted by a compiler. (Bruce Richardson)
> * Consolidated the static_assert() into one, covering both little and big
> endian.
> This also reduces the amount of endian-conditional source code and
> improves readability.
> (Bruce Richardson)
> * Added comment about MSB meaning "most significant byte".
LGTM now thanks!
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 8:01 ` [PATCH v5] " Morten Brørup
2025-10-23 8:08 ` Bruce Richardson
@ 2025-10-23 8:51 ` Konstantin Ananyev
2025-10-23 11:17 ` Morten Brørup
1 sibling, 1 reply; 29+ messages in thread
From: Konstantin Ananyev @ 2025-10-23 8:51 UTC (permalink / raw)
To: Morten Brørup, dev, Stephen Hemminger, Wathsala Vithanage,
Fengchengwen, Bruce Richardson
>
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
> v5:
> * Removed the plain RTE_MBUF_DIRECT() macro, and only use the optimized
> variant. (Bruce Richardson)
> Further testing on Godbolt confirmed that other compilers benefit from
> the optimized macro too.
> * Shortened the description of the RTE_MBUF_DIRECT() macro, and only
> provide one example of code emitted by a compiler. (Bruce Richardson)
> * Consolidated the static_assert() into one, covering both little and big
> endian.
> This also reduces the amount of endian-conditional source code and
> improves readability.
> (Bruce Richardson)
> * Added comment about MSB meaning "most significant byte".
> v4:
> * Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
> architectures.
> v3:
> * Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
> numerical value instead of long names. (Bruce Richardson)
> * Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
> v2:
> * Fixed typo in commit description.
> * Fixed indentation.
> * Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
> (Stephen Hemminger)
> * Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
> macro is valid, specifically that the tested bits are in the MSB of the
> 64-bit field.
> ---
> lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
> lib/mbuf/rte_mbuf_core.h | 33 +++++++++++++++++++++++---
> 2 files changed, 49 insertions(+), 35 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 3df22125de..2004391f57 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1458,44 +1459,30 @@ static inline int
> __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..1dfeab0511 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -711,9 +711,36 @@ struct rte_mbuf_ext_shared_info {
> *
> * If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
> * can be defined as a direct mbuf.
> - */
> -#define RTE_MBUF_DIRECT(mb) \
> - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> + *
> + * Note: Macro optimized for code size.
> + *
> + * The plain macro would be:
> + * #define RTE_MBUF_DIRECT(mb) \
> + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> + *
> > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB (most significant
> > + * byte) of the 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> > + *
> > + * E.g., GCC version 16.0.0 20251019 (experimental) generates the following code for x86-64.
> + *
> + * With the plain macro, 17 bytes of instructions:
> + * movabs rax,0x6000000000000000 // 10 bytes
> + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> + * sete al // 3 bytes
> + * With this optimized macro, only 7 bytes of instructions:
> + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> + * sete al // 3 bytes
> + */
> +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> +/* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
> +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
> +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> +/* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
> +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
A stupid q: why not simply do:
(mb->ol_flags >> 56) & 0x60
then?
Should help avoid all these LE/BE casts, etc.
> +#endif
> +/* Verify the optimization above. */
> +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
> + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at MSB");
>
> /** Uninitialized or unspecified port. */
> #define RTE_MBUF_PORT_INVALID UINT16_MAX
> --
> 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 8:51 ` Konstantin Ananyev
@ 2025-10-23 11:17 ` Morten Brørup
2025-10-23 14:04 ` Konstantin Ananyev
0 siblings, 1 reply; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 11:17 UTC (permalink / raw)
To: Konstantin Ananyev, dev, Stephen Hemminger, Wathsala Vithanage,
Fengchengwen, Bruce Richardson
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Thursday, 23 October 2025 10.51
>
> > -#define RTE_MBUF_DIRECT(mb) \
> > - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > + *
> > + * Note: Macro optimized for code size.
> > + *
> > + * The plain macro would be:
> > + * #define RTE_MBUF_DIRECT(mb) \
> > + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > + *
> > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB (most significant
> > + * byte) of the 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> > + *
> > + * E.g., GCC version 16.0.0 20251019 (experimental) generates the following code for x86-64.
> > + *
> > + * With the plain macro, 17 bytes of instructions:
> > + * movabs rax,0x6000000000000000 // 10 bytes
> > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > + * sete al // 3 bytes
> > + * With this optimized macro, only 7 bytes of instructions:
> > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > + * sete al // 3 bytes
> > + */
> > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > +/* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
> > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
> > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > +/* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
> > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
>
> A stupid q: why then not simply do:
> (mb->ol_flags >> 56) & 0x60
> then?
> Should help to all these LE/BE casts, etc.
GCC is too stupid for that too.
Playing around with Godbolt shows that
return !((char)(p[3] >> 56) & 0x60);
becomes
movzx eax,BYTE PTR [rdi+0x1f] // 4 bytes
test al,0x60 // 2 bytes
Instead of simply
test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
Good suggestion, though.
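For reference, a self-contained version of that experiment (p stands in for the mbuf, with ol_flags at p[3], i.e. byte offset 0x18; little endian assumed):

#include <stdbool.h>
#include <stdint.h>

bool direct_shift(const uint64_t *p)
{
        return !((char)(p[3] >> 56) & 0x60); /* movzx + test: 6 bytes */
}

bool direct_byte(const uint64_t *p)
{
        return !(((const char *)&p[3])[7] & 0x60); /* single testb: 4 bytes */
}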
>
> > +#endif
> > +/* Verify the optimization above. */
> > +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
> > + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at MSB");
> >
> > /** Uninitialized or unspecified port. */
> > #define RTE_MBUF_PORT_INVALID UINT16_MAX
> > --
> > 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 11:17 ` Morten Brørup
@ 2025-10-23 14:04 ` Konstantin Ananyev
2025-10-23 14:48 ` Morten Brørup
0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Ananyev @ 2025-10-23 14:04 UTC (permalink / raw)
To: Morten Brørup, dev, Stephen Hemminger, Wathsala Vithanage,
Fengchengwen, Bruce Richardson
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Thursday, 23 October 2025 10.51
> >
> > > -#define RTE_MBUF_DIRECT(mb) \
> > > - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > > + *
> > > + * Note: Macro optimized for code size.
> > > + *
> > > + * The plain macro would be:
> > > + * #define RTE_MBUF_DIRECT(mb) \
> > > + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > RTE_MBUF_F_EXTERNAL)))
> > > + *
> > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in
> > the
> > > MSB (most significant
> > > + * byte) of the 64-bit ol_flags field, so we only compare this one
> > byte instead of all
> > > 64 bits.
> > > + *
> > > + * E.g., GCC version 16.0.0 20251019 (experimental) generates the
> > following code
> > > for x86-64.
> > > + *
> > > + * With the plain macro, 17 bytes of instructions:
> > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > + * sete al // 3 bytes
> > > + * With this optimized macro, only 7 bytes of instructions:
> > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > + * sete al // 3 bytes
> > > + */
> > > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > > +/* On little endian architecture, the MSB of a 64-bit integer is at
> > byte offset 7. */
> > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > >ol_flags))[7] & 0x60)
> > > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > > +/* On big endian architecture, the MSB of a 64-bit integer is at
> > byte offset 0. */
> > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > >ol_flags))[0] & 0x60)
> >
> > A stupid q: why then not simply do:
> > (mb->ol_flags >> 56) & 0x60
> > then?
> > Should help to all these LE/BE casts, etc.
>
> GCC is too stupid for that too.
>
> Playing around with Godbolt shows that
> return !((char)(p[3] >> 56) & 0x60);
> becomes
> movzx eax,BYTE PTR [rdi+0x1f] // 4 bytes
> test al,0x60 // 2 bytes
> Instead of simply
> test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
And these 2 extra bytes of instructions, are they really that critical?
My guess is we wouldn't notice any real diff here.
But if it really is, can I ask you to create a new define for 0x60,
to avoid hardcoded constants in the code?
Might be something like
#define RTE_MBUF_F_INDIRECT_EXTERNAL_1B ...
or so.
Konstantin
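For illustration, such a define might look like the following (the name is from the suggestion above; the body is a hypothetical completion of the "..."):

/* The INDIRECT and EXTERNAL flags, as seen in the MSB of ol_flags. */
#define RTE_MBUF_F_INDIRECT_EXTERNAL_1B \
        ((char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))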
> Good suggestion, though.
>
> >
> > > +#endif
> > > +/* Verify the optimization above. */
> > > +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) ==
> > > UINT64_C(0x60) << (7 * CHAR_BIT),
> > > + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at
> > MSB");
> > >
> > > /** Uninitialized or unspecified port. */
> > > #define RTE_MBUF_PORT_INVALID UINT16_MAX
> > > --
> > > 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 14:04 ` Konstantin Ananyev
@ 2025-10-23 14:48 ` Morten Brørup
2025-10-23 15:27 ` Konstantin Ananyev
0 siblings, 1 reply; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 14:48 UTC (permalink / raw)
To: Konstantin Ananyev, dev, Stephen Hemminger, Wathsala Vithanage,
Fengchengwen, Bruce Richardson
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Thursday, 23 October 2025 16.05
>
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > Sent: Thursday, 23 October 2025 10.51
> > >
> > > > -#define RTE_MBUF_DIRECT(mb) \
> > > > - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> RTE_MBUF_F_EXTERNAL)))
> > > > + *
> > > > + * Note: Macro optimized for code size.
> > > > + *
> > > > + * The plain macro would be:
> > > > + * #define RTE_MBUF_DIRECT(mb) \
> > > > + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > > RTE_MBUF_F_EXTERNAL)))
> > > > + *
> > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are
> both in
> > > the
> > > > MSB (most significant
> > > > + * byte) of the 64-bit ol_flags field, so we only compare this
> one
> > > byte instead of all
> > > > 64 bits.
> > > > + *
> > > > + * E.g., GCC version 16.0.0 20251019 (experimental) generates
> the
> > > following code
> > > > for x86-64.
> > > > + *
> > > > + * With the plain macro, 17 bytes of instructions:
> > > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > > + * sete al // 3 bytes
> > > > + * With this optimized macro, only 7 bytes of instructions:
> > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > + * sete al // 3 bytes
> > > > + */
> > > > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > > > +/* On little endian architecture, the MSB of a 64-bit integer is
> at
> > > byte offset 7. */
> > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > >ol_flags))[7] & 0x60)
> > > > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > > > +/* On big endian architecture, the MSB of a 64-bit integer is at
> > > byte offset 0. */
> > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > >ol_flags))[0] & 0x60)
> > >
> > > A stupid q: why then not simply do:
> > > (mb->ol_flags >> 56) & 0x60
> > > then?
> > > Should help to all these LE/BE casts, etc.
> >
> > GCC is too stupid for that too.
> >
> > Playing around with Godbolt shows that
> > return !((char)(p[3] >> 56) & 0x60);
> > becomes
> > movzx eax,BYTE PTR [rdi+0x1f] // 4 bytes
> > test al,0x60 // 2 bytes
> > Instead of simply
> > test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
>
> And these 2 extra bytes in instructions, are that really that critical?
> My guess, we wouldn't notice any real diff here.
The optimized macro made the common code path of the refactored rte_pktmbuf_prefree_seg() fit into one cache line.
IIRC, all 10 bytes of savings were required for this.
> But if it really is, can I ask you to create a new define for 0x60,
> to avoid hardcoded constants in the code?
> Might be something like
> #define RTE_MBUF_F_INDIRECT_EXTERNAL_1B ...
> or so.
I started out using the field names, but Bruce suggested using 0x60 for readability, making the macros shorter, which IMO looks good.
I don't like adding special names just for this, so either we stick with 0x60 or go for "(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT))", something like this:
#ifdef __DOXYGEN__
#define RTE_MBUF_DIRECT(mb) \
!(((const char *)(&(mb)->ol_flags))[MSB_OFFSET /* 7 or 0, depending on endianness */] & \
(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
#else /* !__DOXYGEN__ */
#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
/* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
#define RTE_MBUF_DIRECT(mb) \
!(((const char *)(&(mb)->ol_flags))[7] & \
(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
/* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
#define RTE_MBUF_DIRECT(mb) \
!(((const char *)(&(mb)->ol_flags))[0] & \
(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
#endif /* RTE_BYTE_ORDER */
#endif /* !__DOXYGEN__ */
/* Verify the optimization above. */
static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) & (UINT64_C(0xFF) << (7 * CHAR_BIT))) ==
(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
"(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not at MSB");
> Konstantin
>
> > Good suggestion, though.
> >
> > >
> > > > +#endif
> > > > +/* Verify the optimization above. */
> > > > +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) ==
> > > > UINT64_C(0x60) << (7 * CHAR_BIT),
> > > > + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at
> > > MSB");
> > > >
> > > > /** Uninitialized or unspecified port. */
> > > > #define RTE_MBUF_PORT_INVALID UINT16_MAX
> > > > --
> > > > 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 14:48 ` Morten Brørup
@ 2025-10-23 15:27 ` Konstantin Ananyev
2025-10-23 15:46 ` Morten Brørup
0 siblings, 1 reply; 29+ messages in thread
From: Konstantin Ananyev @ 2025-10-23 15:27 UTC (permalink / raw)
To: Morten Brørup, dev, Stephen Hemminger, Wathsala Vithanage,
Fengchengwen, Bruce Richardson
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Thursday, 23 October 2025 16.05
> >
> > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > Sent: Thursday, 23 October 2025 10.51
> > > >
> > > > > -#define RTE_MBUF_DIRECT(mb) \
> > > > > - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > RTE_MBUF_F_EXTERNAL)))
> > > > > + *
> > > > > + * Note: Macro optimized for code size.
> > > > > + *
> > > > > + * The plain macro would be:
> > > > > + * #define RTE_MBUF_DIRECT(mb) \
> > > > > + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > > > RTE_MBUF_F_EXTERNAL)))
> > > > > + *
> > > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are
> > both in
> > > > the
> > > > > MSB (most significant
> > > > > + * byte) of the 64-bit ol_flags field, so we only compare this
> > one
> > > > byte instead of all
> > > > > 64 bits.
> > > > > + *
> > > > > + * E.g., GCC version 16.0.0 20251019 (experimental) generates
> > the
> > > > following code
> > > > > for x86-64.
> > > > > + *
> > > > > + * With the plain macro, 17 bytes of instructions:
> > > > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > > > + * sete al // 3 bytes
> > > > > + * With this optimized macro, only 7 bytes of instructions:
> > > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > > + * sete al // 3 bytes
> > > > > + */
> > > > > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > > > > +/* On little endian architecture, the MSB of a 64-bit integer is
> > at
> > > > byte offset 7. */
> > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > > >ol_flags))[7] & 0x60)
> > > > > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > > > > +/* On big endian architecture, the MSB of a 64-bit integer is at
> > > > byte offset 0. */
> > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > > >ol_flags))[0] & 0x60)
> > > >
> > > > A stupid q: why then not simply do:
> > > > (mb->ol_flags >> 56) & 0x60
> > > > then?
> > > > Should help to all these LE/BE casts, etc.
> > >
> > > GCC is too stupid for that too.
> > >
> > > Playing around with Godbolt shows that
> > > return !((char)(p[3] >> 56) & 0x60);
> > > becomes
> > > movzx eax,BYTE PTR [rdi+0x1f] // 4 bytes
> > > test al,0x60 // 2 bytes
> > > Instead of simply
> > > test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> >
> > And these 2 extra bytes in instructions, are that really that critical?
> > My guess, we wouldn't notice any real diff here.
>
> The optimized macro made the common code path of the refactored
> rte_pktmbuf_prefree_seg() fit into one cache line.
> IIRC, all 10 bytes saving were required for this.
I understand that. But will that change provide a measurable impact,
in terms of cycles/op or pps or so?
> > But if it really is, can I ask you to create a new define for 0x60,
> > to avoid hardcoded constants in the code?
> > Might be something like
> > #define RTE_MBUF_F_INDIRECT_EXTERNAL_1B ...
> > or so.
>
> I started out using the field names, but Bruce suggested using 0x60 for
> readability, making the macros shorter, which IMO looks good.
>
> I don't like adding special names just for this, so either we stick with 0x60 or go for
> "(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> CHAR_BIT))", something like this:
My vote would be to use the construction above.
Might be worth putting it in a new macro for readability.
Konstantin
> #ifdef __DOXYGEN__
> #define RTE_MBUF_DIRECT(mb) \
> !(((const char *)(&(mb)->ol_flags))[MSB_OFFSET /* 7 or 0, depending on
> endianness */] & \
> (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> CHAR_BIT)))
> #else /* !__DOXYGEN__ */
> #if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> /* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
> #define RTE_MBUF_DIRECT(mb) \
> !(((const char *)(&(mb)->ol_flags))[7] & \
> (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> CHAR_BIT)))
> #elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> /* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
> #define RTE_MBUF_DIRECT(mb) \
> !(((const char *)(&(mb)->ol_flags))[0] & \
> (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> CHAR_BIT)))
> #endif /* RTE_BYTE_ORDER */
> #endif /* !__DOXYGEN__ */
> /* Verify the optimization above. */
> static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) &
> (UINT64_C(0xFF) << (7 * CHAR_BIT))) ==
> (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not at MSB");
>
> > Konstantin
> >
> > > Good suggestion, though.
> > >
> > > >
> > > > > +#endif
> > > > > +/* Verify the optimization above. */
> > > > > +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) ==
> > > > > UINT64_C(0x60) << (7 * CHAR_BIT),
> > > > > + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60
> at
> > > > MSB");
> > > > >
> > > > > /** Uninitialized or unspecified port. */
> > > > > #define RTE_MBUF_PORT_INVALID UINT16_MAX
> > > > > --
> > > > > 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 15:27 ` Konstantin Ananyev
@ 2025-10-23 15:46 ` Morten Brørup
2025-10-23 16:03 ` Bruce Richardson
0 siblings, 1 reply; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 15:46 UTC (permalink / raw)
To: Konstantin Ananyev, dev, Stephen Hemminger, Wathsala Vithanage,
Fengchengwen, Bruce Richardson
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Thursday, 23 October 2025 17.27
>
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > Sent: Thursday, 23 October 2025 16.05
> > >
> > > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > > Sent: Thursday, 23 October 2025 10.51
> > > > >
> > > > > > -#define RTE_MBUF_DIRECT(mb) \
> > > > > > - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > > RTE_MBUF_F_EXTERNAL)))
> > > > > > + *
> > > > > > + * Note: Macro optimized for code size.
> > > > > > + *
> > > > > > + * The plain macro would be:
> > > > > > + * #define RTE_MBUF_DIRECT(mb) \
> > > > > > + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > > > > RTE_MBUF_F_EXTERNAL)))
> > > > > > + *
> > > > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are
> > > both in
> > > > > the
> > > > > > MSB (most significant
> > > > > > + * byte) of the 64-bit ol_flags field, so we only compare
> this
> > > one
> > > > > byte instead of all
> > > > > > 64 bits.
> > > > > > + *
> > > > > > + * E.g., GCC version 16.0.0 20251019 (experimental)
> generates
> > > the
> > > > > following code
> > > > > > for x86-64.
> > > > > > + *
> > > > > > + * With the plain macro, 17 bytes of instructions:
> > > > > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > > > > + * sete al // 3 bytes
> > > > > > + * With this optimized macro, only 7 bytes of instructions:
> > > > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > > > + * sete al // 3 bytes
> > > > > > + */
> > > > > > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > > > > > +/* On little endian architecture, the MSB of a 64-bit
> integer is
> > > at
> > > > > byte offset 7. */
> > > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > > > >ol_flags))[7] & 0x60)
> > > > > > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > > > > > +/* On big endian architecture, the MSB of a 64-bit integer
> is at
> > > > > byte offset 0. */
> > > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > > > >ol_flags))[0] & 0x60)
> > > > >
> > > > > A stupid q: why then not simply do:
> > > > > (mb->ol_flags >> 56) & 0x60
> > > > > then?
> > > > > Should help to all these LE/BE casts, etc.
> > > >
> > > > GCC is too stupid for that too.
> > > >
> > > > Playing around with Godbolt shows that
> > > > return !((char)(p[3] >> 56) & 0x60);
> > > > becomes
> > > > movzx eax,BYTE PTR [rdi+0x1f] // 4 bytes
> > > > test al,0x60 // 2 bytes
> > > > Instead of simply
> > > > test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > >
> > > And these 2 extra bytes in instructions, are that really that
> critical?
> > > My guess, we wouldn't notice any real diff here.
> >
> > The optimized macro made the common code path of the refactored
> > rte_pktmbuf_prefree_seg() fit into one cache line.
> > IIRC, all 10 bytes saving were required for this.
>
> I understand that. but is that change will provide a measurable impact,
> in terms of cycles/op or pps or so?
L1 instruction cache is important; reducing code size of a per-packet function might have an effect in some cases.
I don't have any metrics other than code size for this optimization.
I just tested to see if I recalled correctly, and here's the generated code with the two different macros:
#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
0000000000000670 <review_rte_pktmbuf_prefree_seg>:
670: f3 0f 1e fa endbr64
674: 41 57 push %r15
676: 41 56 push %r14
678: 41 55 push %r13
67a: 41 54 push %r12
67c: 55 push %rbp
67d: 53 push %rbx
67e: 48 89 fb mov %rdi,%rbx
681: 48 83 ec 18 sub $0x18,%rsp
685: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
68c: 00 00
68e: 48 89 44 24 08 mov %rax,0x8(%rsp)
693: 31 c0 xor %eax,%eax
695: 0f b7 6f 12 movzwl 0x12(%rdi),%ebp
699: 66 83 fd 01 cmp $0x1,%bp
69d: 75 51 jne 6f0 <review_rte_pktmbuf_prefree_seg+0x80>
** Look here {
69f: f6 47 1f 60 testb $0x60,0x1f(%rdi)
6a3: 75 6b jne 710 <review_rte_pktmbuf_prefree_seg+0xa0>
** }
6a5: 66 83 7b 14 01 cmpw $0x1,0x14(%rbx)
6aa: 74 09 je 6b5 <review_rte_pktmbuf_prefree_seg+0x45>
6ac: b8 01 00 00 00 mov $0x1,%eax
6b1: 66 89 43 14 mov %ax,0x14(%rbx)
6b5: 48 83 7b 40 00 cmpq $0x0,0x40(%rbx)
6ba: 74 08 je 6c4 <review_rte_pktmbuf_prefree_seg+0x54>
6bc: 48 c7 43 40 00 00 00 movq $0x0,0x40(%rbx)
6c3: 00
6c4: 48 89 d8 mov %rbx,%rax
6c7: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
6cc: 64 48 2b 14 25 28 00 sub %fs:0x28,%rdx
6d3: 00 00
6d5: 0f 85 3a 02 00 00 jne 915 <review_rte_pktmbuf_prefree_seg+0x2a5>
6db: 48 83 c4 18 add $0x18,%rsp
6df: 5b pop %rbx
6e0: 5d pop %rbp
6e1: 41 5c pop %r12
6e3: 41 5d pop %r13
6e5: 41 5e pop %r14
6e7: 41 5f pop %r15
6e9: c3 ret
#define RTE_MBUF_DIRECT(mb) \
!((char)((mb)->ol_flags >> (7 * CHAR_BIT)) & \
(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
0000000000000690 <review_rte_pktmbuf_prefree_seg>:
690: f3 0f 1e fa endbr64
694: 41 57 push %r15
696: 41 56 push %r14
698: 41 55 push %r13
69a: 41 54 push %r12
69c: 55 push %rbp
69d: 53 push %rbx
69e: 48 89 fb mov %rdi,%rbx
6a1: 48 83 ec 18 sub $0x18,%rsp
6a5: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
6ac: 00 00
6ae: 48 89 44 24 08 mov %rax,0x8(%rsp)
6b3: 31 c0 xor %eax,%eax
6b5: 0f b7 6f 12 movzwl 0x12(%rdi),%ebp
6b9: 66 83 fd 01 cmp $0x1,%bp
6bd: 75 59 jne 718 <review_rte_pktmbuf_prefree_seg+0x88>
** Look here {
6bf: 48 8b 47 18 mov 0x18(%rdi),%rax
6c3: 48 89 c2 mov %rax,%rdx
6c6: 48 c1 ea 38 shr $0x38,%rdx
6ca: 83 e2 60 and $0x60,%edx
6cd: 75 71 jne 740 <review_rte_pktmbuf_prefree_seg+0xb0>
** }
6cf: 66 83 7b 14 01 cmpw $0x1,0x14(%rbx)
6d4: 74 09 je 6df <review_rte_pktmbuf_prefree_seg+0x4f>
6d6: b8 01 00 00 00 mov $0x1,%eax
6db: 66 89 43 14 mov %ax,0x14(%rbx)
6df: 48 83 7b 40 00 cmpq $0x0,0x40(%rbx)
6e4: 74 08 je 6ee <review_rte_pktmbuf_prefree_seg+0x5e>
6e6: 48 c7 43 40 00 00 00 movq $0x0,0x40(%rbx)
6ed: 00
6ee: 48 89 d8 mov %rbx,%rax
6f1: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
6f6: 64 48 2b 14 25 28 00 sub %fs:0x28,%rdx
6fd: 00 00
6ff: 0f 85 50 02 00 00 jne 955 <review_rte_pktmbuf_prefree_seg+0x2c5>
705: 48 83 c4 18 add $0x18,%rsp
709: 5b pop %rbx
70a: 5d pop %rbp
70b: 41 5c pop %r12
70d: 41 5d pop %r13
70f: 41 5e pop %r14
711: 41 5f pop %r15
713: c3 ret
>
> > > But if it really is, can I ask you to create a new define for 0x60,
> > > to avoid hardcoded constants in the code?
> > > Might be something like
> > > #define RTE_MBUF_F_INDIRECT_EXTERNAL_1B ...
> > > or so.
> >
> > I started out using the field names, but Bruce suggested using 0x60
> for
> > readability, making the macros shorter, which IMO looks good.
> >
> > I don't like adding special names just for this, so either we stick
> with 0x60 or go for
> > "(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> > CHAR_BIT))", something like this:
>
> My vote would be to use the construction above.
> Might be put it in a new macro for readability.
> Konstantin
The optimization requires casting ol_flags to a byte array and then reading the MSB; otherwise GCC and some other compilers are too stupid to perform the optimization.
So, should I post a v7 with the code proposed below (to get rid of the 0x60 numerical value)?
>
> > #ifdef __DOXYGEN__
> > #define RTE_MBUF_DIRECT(mb) \
> > !(((const char *)(&(mb)->ol_flags))[MSB_OFFSET /* 7 or 0,
> depending on
> > endianness */] & \
> > (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> > CHAR_BIT)))
> > #else /* !__DOXYGEN__ */
> > #if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > /* On little endian architecture, the MSB of a 64-bit integer is at
> byte offset 7. */
> > #define RTE_MBUF_DIRECT(mb) \
> > !(((const char *)(&(mb)->ol_flags))[7] & \
> > (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> > CHAR_BIT)))
> > #elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > /* On big endian architecture, the MSB of a 64-bit integer is at byte
> offset 0. */
> > #define RTE_MBUF_DIRECT(mb) \
> > !(((const char *)(&(mb)->ol_flags))[0] & \
> > (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> > CHAR_BIT)))
> > #endif /* RTE_BYTE_ORDER */
> > #endif /* !__DOXYGEN__ */
> > /* Verify the optimization above. */
> > static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) &
> > (UINT64_C(0xFF) << (7 * CHAR_BIT))) ==
> > (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not at MSB");
> >
> > > Konstantin
> > >
> > > > Good suggestion, though.
> > > >
> > > > >
> > > > > > +#endif
> > > > > > +/* Verify the optimization above. */
> > > > > > +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) ==
> > > > > > UINT64_C(0x60) << (7 * CHAR_BIT),
> > > > > > + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60
> > at
> > > > > MSB");
> > > > > >
> > > > > > /** Uninitialized or unspecified port. */
> > > > > > #define RTE_MBUF_PORT_INVALID UINT16_MAX
> > > > > > --
> > > > > > 2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 15:46 ` Morten Brørup
@ 2025-10-23 16:03 ` Bruce Richardson
2025-10-23 16:24 ` Morten Brørup
0 siblings, 1 reply; 29+ messages in thread
From: Bruce Richardson @ 2025-10-23 16:03 UTC (permalink / raw)
To: Morten Brørup
Cc: Konstantin Ananyev, dev, Stephen Hemminger, Wathsala Vithanage,
Fengchengwen
On Thu, Oct 23, 2025 at 05:46:49PM +0200, Morten Brørup wrote:
> > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > Sent: Thursday, 23 October 2025 17.27
> >
> > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > Sent: Thursday, 23 October 2025 16.05
> > > >
> > > > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > > > Sent: Thursday, 23 October 2025 10.51
> > > > > >
> > > > > > > -#define RTE_MBUF_DIRECT(mb) \
> > > > > > > - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > > > RTE_MBUF_F_EXTERNAL)))
> > > > > > > + *
> > > > > > > + * Note: Macro optimized for code size.
> > > > > > > + *
> > > > > > > + * The plain macro would be:
> > > > > > > + * #define RTE_MBUF_DIRECT(mb) \
> > > > > > > + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> > > > > > RTE_MBUF_F_EXTERNAL)))
> > > > > > > + *
> > > > > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are
> > > > both in
> > > > > > the
> > > > > > > MSB (most significant
> > > > > > > + * byte) of the 64-bit ol_flags field, so we only compare
> > this
> > > > one
> > > > > > byte instead of all
> > > > > > > 64 bits.
> > > > > > > + *
> > > > > > > + * E.g., GCC version 16.0.0 20251019 (experimental)
> > generates
> > > > the
> > > > > > following code
> > > > > > > for x86-64.
> > > > > > > + *
> > > > > > > + * With the plain macro, 17 bytes of instructions:
> > > > > > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > > > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > > > > > + * sete al // 3 bytes
> > > > > > > + * With this optimized macro, only 7 bytes of instructions:
> > > > > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > > > > + * sete al // 3 bytes
> > > > > > > + */
> > > > > > > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > > > > > > +/* On little endian architecture, the MSB of a 64-bit
> > integer is
> > > > at
> > > > > > byte offset 7. */
> > > > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > > > > >ol_flags))[7] & 0x60)
> > > > > > > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > > > > > > +/* On big endian architecture, the MSB of a 64-bit integer
> > is at
> > > > > > byte offset 0. */
> > > > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)-
> > > > > > >ol_flags))[0] & 0x60)
> > > > > >
> > > > > > A stupid q: why then not simply do:
> > > > > > (mb->ol_flags >> 56) & 0x60
> > > > > > then?
> > > > > > Should help to all these LE/BE casts, etc.
> > > > >
> > > > > GCC is too stupid for that too.
> > > > >
> > > > > Playing around with Godbolt shows that
> > > > > return !((char)(p[3] >> 56) & 0x60);
> > > > > becomes
> > > > > movzx eax,BYTE PTR [rdi+0x1f] // 4 bytes
> > > > > test al,0x60 // 2 bytes
> > > > > Instead of simply
> > > > > test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > >
> > > > And these 2 extra bytes in instructions, are that really that
> > critical?
> > > > My guess, we wouldn't notice any real diff here.
> > >
> > > The optimized macro made the common code path of the refactored
> > > rte_pktmbuf_prefree_seg() fit into one cache line.
> > > IIRC, all 10 bytes saving were required for this.
> >
> > I understand that. but is that change will provide a measurable impact,
> > in terms of cycles/op or pps or so?
>
> L1 instruction cache is important; reducing code size of a per-packet function might have an effect in some cases.
> I don't have other metrics than code size for this optimization.
>
> I just tested to see if I recalled correctly, and here's the generated code with the two different macros:
>
> #define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
>
> 0000000000000670 <review_rte_pktmbuf_prefree_seg>:
> 670: f3 0f 1e fa endbr64
> 674: 41 57 push %r15
> 676: 41 56 push %r14
> 678: 41 55 push %r13
> 67a: 41 54 push %r12
> 67c: 55 push %rbp
> 67d: 53 push %rbx
> 67e: 48 89 fb mov %rdi,%rbx
> 681: 48 83 ec 18 sub $0x18,%rsp
> 685: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
> 68c: 00 00
> 68e: 48 89 44 24 08 mov %rax,0x8(%rsp)
> 693: 31 c0 xor %eax,%eax
> 695: 0f b7 6f 12 movzwl 0x12(%rdi),%ebp
> 699: 66 83 fd 01 cmp $0x1,%bp
> 69d: 75 51 jne 6f0 <review_rte_pktmbuf_prefree_seg+0x80>
> ** Look here {
> 69f: f6 47 1f 60 testb $0x60,0x1f(%rdi)
> 6a3: 75 6b jne 710 <review_rte_pktmbuf_prefree_seg+0xa0>
> ** }
> 6a5: 66 83 7b 14 01 cmpw $0x1,0x14(%rbx)
> 6aa: 74 09 je 6b5 <review_rte_pktmbuf_prefree_seg+0x45>
> 6ac: b8 01 00 00 00 mov $0x1,%eax
> 6b1: 66 89 43 14 mov %ax,0x14(%rbx)
> 6b5: 48 83 7b 40 00 cmpq $0x0,0x40(%rbx)
> 6ba: 74 08 je 6c4 <review_rte_pktmbuf_prefree_seg+0x54>
> 6bc: 48 c7 43 40 00 00 00 movq $0x0,0x40(%rbx)
> 6c3: 00
> 6c4: 48 89 d8 mov %rbx,%rax
> 6c7: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
> 6cc: 64 48 2b 14 25 28 00 sub %fs:0x28,%rdx
> 6d3: 00 00
> 6d5: 0f 85 3a 02 00 00 jne 915 <review_rte_pktmbuf_prefree_seg+0x2a5>
> 6db: 48 83 c4 18 add $0x18,%rsp
> 6df: 5b pop %rbx
> 6e0: 5d pop %rbp
> 6e1: 41 5c pop %r12
> 6e3: 41 5d pop %r13
> 6e5: 41 5e pop %r14
> 6e7: 41 5f pop %r15
> 6e9: c3 ret
>
> #define RTE_MBUF_DIRECT(mb) \
> !((char)((mb)->ol_flags >> (7 * CHAR_BIT)) & \
> (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
>
> 0000000000000690 <review_rte_pktmbuf_prefree_seg>:
> 690: f3 0f 1e fa endbr64
> 694: 41 57 push %r15
> 696: 41 56 push %r14
> 698: 41 55 push %r13
> 69a: 41 54 push %r12
> 69c: 55 push %rbp
> 69d: 53 push %rbx
> 69e: 48 89 fb mov %rdi,%rbx
> 6a1: 48 83 ec 18 sub $0x18,%rsp
> 6a5: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
> 6ac: 00 00
> 6ae: 48 89 44 24 08 mov %rax,0x8(%rsp)
> 6b3: 31 c0 xor %eax,%eax
> 6b5: 0f b7 6f 12 movzwl 0x12(%rdi),%ebp
> 6b9: 66 83 fd 01 cmp $0x1,%bp
> 6bd: 75 59 jne 718 <review_rte_pktmbuf_prefree_seg+0x88>
> ** Look here {
> 6bf: 48 8b 47 18 mov 0x18(%rdi),%rax
> 6c3: 48 89 c2 mov %rax,%rdx
> 6c6: 48 c1 ea 38 shr $0x38,%rdx
> 6ca: 83 e2 60 and $0x60,%edx
> 6cd: 75 71 jne 740 <review_rte_pktmbuf_prefree_seg+0xb0>
> ** }
> 6cf: 66 83 7b 14 01 cmpw $0x1,0x14(%rbx)
> 6d4: 74 09 je 6df <review_rte_pktmbuf_prefree_seg+0x4f>
> 6d6: b8 01 00 00 00 mov $0x1,%eax
> 6db: 66 89 43 14 mov %ax,0x14(%rbx)
> 6df: 48 83 7b 40 00 cmpq $0x0,0x40(%rbx)
> 6e4: 74 08 je 6ee <review_rte_pktmbuf_prefree_seg+0x5e>
> 6e6: 48 c7 43 40 00 00 00 movq $0x0,0x40(%rbx)
> 6ed: 00
> 6ee: 48 89 d8 mov %rbx,%rax
> 6f1: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
> 6f6: 64 48 2b 14 25 28 00 sub %fs:0x28,%rdx
> 6fd: 00 00
> 6ff: 0f 85 50 02 00 00 jne 955 <review_rte_pktmbuf_prefree_seg+0x2c5>
> 705: 48 83 c4 18 add $0x18,%rsp
> 709: 5b pop %rbx
> 70a: 5d pop %rbp
> 70b: 41 5c pop %r12
> 70d: 41 5d pop %r13
> 70f: 41 5e pop %r14
> 711: 41 5f pop %r15
> 713: c3 ret
>
> >
> > > > But if it really is, can I ask you to create a new define for 0x60,
> > > > to avoid hardcoded constants in the code?
> > > > Might be something like
> > > > #define RTE_MBUF_F_INDIRECT_EXTERNAL_1B ...
> > > > or so.
> > >
> > > I started out using the field names, but Bruce suggested using 0x60
> > for
> > > readability, making the macros shorter, which IMO looks good.
> > >
> > > I don't like adding special names just for this, so either we stick
> > with 0x60 or go for
> > > "(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> > > CHAR_BIT))", something like this:
> >
> > My vote would be to use the construction above.
> > Might be put it in a new macro for readability.
> > Konstantin
>
> The optimization requires casting ol_flags to a byte array and then reading the MSB; otherwise GCC and some other compilers are too stupid to perform the optimization.
> So, should I post a v7 with the code proposed below (to get rid of the 0x60 numerical value)?
>
While I'm not going to massively complain about removing it, I think using
the numeric value is absolutely fine because we check its validity using a
static_assert, which also serves to document where the constant comes from.
/Bruce
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v5] mbuf: optimize segment prefree
2025-10-23 16:03 ` Bruce Richardson
@ 2025-10-23 16:24 ` Morten Brørup
0 siblings, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 16:24 UTC (permalink / raw)
To: Bruce Richardson, Konstantin Ananyev, thomas
Cc: dev, Stephen Hemminger, Wathsala Vithanage, Fengchengwen
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Thursday, 23 October 2025 18.04
>
> On Thu, Oct 23, 2025 at 05:46:49PM +0200, Morten Brørup wrote:
> > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > Sent: Thursday, 23 October 2025 17.27
> > >
> > > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > > Sent: Thursday, 23 October 2025 16.05
> > > > >
> > > > > > > From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> > > > > > > Sent: Thursday, 23 October 2025 10.51
> > > > > > >
> > > > > > > > -#define RTE_MBUF_DIRECT(mb) \
> > > > > > > > - (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > > > > > > > + *
> > > > > > > > + * Note: Macro optimized for code size.
> > > > > > > > + *
> > > > > > > > + * The plain macro would be:
> > > > > > > > + * #define RTE_MBUF_DIRECT(mb) \
> > > > > > > > + * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > > > > > > > + *
> > > > > > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB (most significant
> > > > > > > > + * byte) of the 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> > > > > > > > + *
> > > > > > > > + * E.g., GCC version 16.0.0 20251019 (experimental) generates the following code for x86-64.
> > > > > > > > + *
> > > > > > > > + * With the plain macro, 17 bytes of instructions:
> > > > > > > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > > > > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > > > > > > + * sete al // 3 bytes
> > > > > > > > + * With this optimized macro, only 7 bytes of instructions:
> > > > > > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > > > > > + * sete al // 3 bytes
> > > > > > > > + */
> > > > > > > > +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> > > > > > > > +/* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
> > > > > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
> > > > > > > > +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> > > > > > > > +/* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
> > > > > > > > +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
> > > > > > >
> > > > > > > A stupid q: why then not simply do:
> > > > > > > (mb->ol_flags >> 56) & 0x60
> > > > > > > then?
> > > > > > > Should help to avoid all these LE/BE casts, etc.
> > > > > >
> > > > > > GCC is too stupid for that too.
> > > > > >
> > > > > > Playing around with Godbolt shows that
> > > > > > return !((char)(p[3] >> 56) & 0x60);
> > > > > > becomes
> > > > > > movzx eax,BYTE PTR [rdi+0x1f] // 4 bytes
> > > > > > test al,0x60 // 2 bytes
> > > > > > Instead of simply
> > > > > > test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > >
> > > > > And these 2 extra bytes in instructions, are they really that critical?
> > > > > My guess, we wouldn't notice any real diff here.
> > > >
> > > > The optimized macro made the common code path of the refactored
> > > > rte_pktmbuf_prefree_seg() fit into one cache line.
> > > > IIRC, all 10 bytes of savings were required for this.
> > >
> > > I understand that, but will that change provide a measurable impact,
> > > in terms of cycles/op or pps or so?
> >
> > L1 instruction cache is important; reducing the code size of a per-packet
> > function might have an effect in some cases.
> > I don't have other metrics than code size for this optimization.
> >
> > I just tested to see if I recalled correctly, and here's the
> > generated code with the two different macros:
> >
> > #define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
> >
> > 0000000000000670 <review_rte_pktmbuf_prefree_seg>:
> > 670: f3 0f 1e fa endbr64
> > 674: 41 57 push %r15
> > 676: 41 56 push %r14
> > 678: 41 55 push %r13
> > 67a: 41 54 push %r12
> > 67c: 55 push %rbp
> > 67d: 53 push %rbx
> > 67e: 48 89 fb mov %rdi,%rbx
> > 681: 48 83 ec 18 sub $0x18,%rsp
> > 685: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
> > 68c: 00 00
> > 68e: 48 89 44 24 08 mov %rax,0x8(%rsp)
> > 693: 31 c0 xor %eax,%eax
> > 695: 0f b7 6f 12 movzwl 0x12(%rdi),%ebp
> > 699: 66 83 fd 01 cmp $0x1,%bp
> > 69d: 75 51 jne 6f0 <review_rte_pktmbuf_prefree_seg+0x80>
> > ** Look here {
> > 69f: f6 47 1f 60 testb $0x60,0x1f(%rdi)
> > 6a3: 75 6b jne 710 <review_rte_pktmbuf_prefree_seg+0xa0>
> > ** }
> > 6a5: 66 83 7b 14 01 cmpw $0x1,0x14(%rbx)
> > 6aa: 74 09 je 6b5 <review_rte_pktmbuf_prefree_seg+0x45>
> > 6ac: b8 01 00 00 00 mov $0x1,%eax
> > 6b1: 66 89 43 14 mov %ax,0x14(%rbx)
> > 6b5: 48 83 7b 40 00 cmpq $0x0,0x40(%rbx)
> > 6ba: 74 08 je 6c4 <review_rte_pktmbuf_prefree_seg+0x54>
> > 6bc: 48 c7 43 40 00 00 00 movq $0x0,0x40(%rbx)
> > 6c3: 00
> > 6c4: 48 89 d8 mov %rbx,%rax
> > 6c7: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
> > 6cc: 64 48 2b 14 25 28 00 sub %fs:0x28,%rdx
> > 6d3: 00 00
> > 6d5: 0f 85 3a 02 00 00 jne 915 <review_rte_pktmbuf_prefree_seg+0x2a5>
> > 6db: 48 83 c4 18 add $0x18,%rsp
> > 6df: 5b pop %rbx
> > 6e0: 5d pop %rbp
> > 6e1: 41 5c pop %r12
> > 6e3: 41 5d pop %r13
> > 6e5: 41 5e pop %r14
> > 6e7: 41 5f pop %r15
> > 6e9: c3 ret
> >
> > #define RTE_MBUF_DIRECT(mb) \
> > !((char)((mb)->ol_flags >> (7 * CHAR_BIT)) & \
> > (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
> >
> > 0000000000000690 <review_rte_pktmbuf_prefree_seg>:
> > 690: f3 0f 1e fa endbr64
> > 694: 41 57 push %r15
> > 696: 41 56 push %r14
> > 698: 41 55 push %r13
> > 69a: 41 54 push %r12
> > 69c: 55 push %rbp
> > 69d: 53 push %rbx
> > 69e: 48 89 fb mov %rdi,%rbx
> > 6a1: 48 83 ec 18 sub $0x18,%rsp
> > 6a5: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
> > 6ac: 00 00
> > 6ae: 48 89 44 24 08 mov %rax,0x8(%rsp)
> > 6b3: 31 c0 xor %eax,%eax
> > 6b5: 0f b7 6f 12 movzwl 0x12(%rdi),%ebp
> > 6b9: 66 83 fd 01 cmp $0x1,%bp
> > 6bd: 75 59 jne 718 <review_rte_pktmbuf_prefree_seg+0x88>
> > ** Look here {
> > 6bf: 48 8b 47 18 mov 0x18(%rdi),%rax
> > 6c3: 48 89 c2 mov %rax,%rdx
> > 6c6: 48 c1 ea 38 shr $0x38,%rdx
> > 6ca: 83 e2 60 and $0x60,%edx
> > 6cd: 75 71 jne 740 <review_rte_pktmbuf_prefree_seg+0xb0>
> > * }
> > 6cf: 66 83 7b 14 01 cmpw $0x1,0x14(%rbx)
> > 6d4: 74 09 je 6df <review_rte_pktmbuf_prefree_seg+0x4f>
> > 6d6: b8 01 00 00 00 mov $0x1,%eax
> > 6db: 66 89 43 14 mov %ax,0x14(%rbx)
> > 6df: 48 83 7b 40 00 cmpq $0x0,0x40(%rbx)
> > 6e4: 74 08 je 6ee <review_rte_pktmbuf_prefree_seg+0x5e>
> > 6e6: 48 c7 43 40 00 00 00 movq $0x0,0x40(%rbx)
> > 6ed: 00
> > 6ee: 48 89 d8 mov %rbx,%rax
> > 6f1: 48 8b 54 24 08 mov 0x8(%rsp),%rdx
> > 6f6: 64 48 2b 14 25 28 00 sub %fs:0x28,%rdx
> > 6fd: 00 00
> > 6ff: 0f 85 50 02 00 00 jne 955 <review_rte_pktmbuf_prefree_seg+0x2c5>
> > 705: 48 83 c4 18 add $0x18,%rsp
> > 709: 5b pop %rbx
> > 70a: 5d pop %rbp
> > 70b: 41 5c pop %r12
> > 70d: 41 5d pop %r13
> > 70f: 41 5e pop %r14
> > 711: 41 5f pop %r15
> > 713: c3 ret
> >
> > >
> > > > > But if it really is, can I ask you to create a new define for 0x60,
> > > > > to avoid hardcoded constants in the code?
> > > > > Might be something like
> > > > > #define RTE_MBUF_F_INDIRECT_EXTERNAL_1B ...
> > > > > or so.
> > > >
> > > > I started out using the field names, but Bruce suggested using 0x60 for
> > > > readability, making the macros shorter, which IMO looks good.
> > > >
> > > > I don't like adding special names just for this, so either we stick with 0x60 or go for
> > > > "(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT))", something like this:
> > >
> > > My vote would be to use the construction above.
> > > Might be put it in a new macro for readability.
> > > Konstantin
> >
> > The optimization requires casting ol_flags as a byte array and then
> > reading the MSB; otherwise GCC and some other compilers are too stupid
> > to perform the optimization.
> > So, should I post a v7 with the code proposed below (to get rid of
> > the 0x60 numerical value)?
> >
> While I'm not going to massively complain about removing it, I think using
> the numeric value is absolutely fine because we check its validity using a
> static_assert, which also serves to document where the constant comes from.
>
> /Bruce
I have posted a v7 patch, so you can all see how it looks with long names vs. numerical value.
I just want the optimization, and have no strong preference for one or the other.
If none of you have any strong opinions, let's let @Thomas choose. :-)
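
For quick side-by-side reference, the little endian variants of the two
candidates (a sketch; the full macros in v6 and v7 also handle big endian
and Doxygen):

/* v6: numerical value */
#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)

/* v7: long names */
#define RTE_MBUF_DIRECT(mb) \
	!(((const char *)(&(mb)->ol_flags))[7] & \
	(char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))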
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v6] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
` (5 preceding siblings ...)
2025-10-23 8:01 ` [PATCH v5] " Morten Brørup
@ 2025-10-23 12:48 ` Morten Brørup
2025-10-23 16:18 ` [PATCH v7] " Morten Brørup
7 siblings, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 12:48 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng, Bruce Richardson, Thomas Monjalon
Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
---
v6:
* Fixed Doxygen error in RTE_MBUF_DIRECT() macro description. (Thomas Monjalon)
* Added specific RTE_MBUF_DIRECT() macro implementation for API documentation
purposes, using flag names instead of numerical value.
v5:
* Removed the plain RTE_MBUF_DIRECT() macro, and only use the optimized
variant. (Bruce Richardson)
Further testing on Godbolt confirmed that other compilers benefit from
the optimized macro too.
* Shortened the description of the RTE_MBUF_DIRECT() macro, and only
provide one example of code emitted by a compiler. (Bruce Richardson)
* Consolidated the static_assert() into one, covering both little and big
endian.
This also reduces the amount of endian-conditional source code and
improves readability.
(Bruce Richardson)
* Added comment about MSB meaning "most significant byte".
v4:
* Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
architectures.
v3:
* Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
numerical value instead of long names. (Bruce Richardson)
* Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 43 +++++++++++++++++++++++++++++++--
2 files changed, 60 insertions(+), 34 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..fb38eca575 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -711,9 +711,48 @@ struct rte_mbuf_ext_shared_info {
*
* If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
* can be defined as a direct mbuf.
- */
+ *
+ * Note: Macro optimized for code size.
+ *
+ * The plain macro would be:
+ * \code{.c}
+ * #define RTE_MBUF_DIRECT(mb) \
+ * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+ * \endcode
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB (most significant
+ * byte) of the 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ *
+ * E.g., GCC version 16.0.0 20251019 (experimental) generates the following code for x86-64.
+ *
+ * With the plain macro, 17 bytes of instructions:
+ * \code
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * \endcode
+ * With this optimized macro, only 7 bytes of instructions:
+ * \code
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ * \endcode
+ */
+#ifdef __DOXYGEN__
#define RTE_MBUF_DIRECT(mb) \
- (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+ !(((const char *)(&(mb)->ol_flags))[MSB_OFFSET /* 7 or 0, depending on endianness */] & \
+ (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)) /* 0x60 */)
+#else /* !__DOXYGEN__ */
+#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
+/* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
+#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
+/* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
+#endif /* RTE_BYTE_ORDER */
+#endif /* !__DOXYGEN__ */
+/* Verify the optimization above. */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at MSB");
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
--
2.43.0
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v7] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
` (6 preceding siblings ...)
2025-10-23 12:48 ` [PATCH v6] " Morten Brørup
@ 2025-10-23 16:18 ` Morten Brørup
7 siblings, 0 replies; 29+ messages in thread
From: Morten Brørup @ 2025-10-23 16:18 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng, Bruce Richardson, Thomas Monjalon
Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
---
v7:
* Go back to long names instead of numerical value in RTE_MBUF_DIRECT()
macro.
(Konstantin Ananyev)
* Updated static_assert() accordingly.
v6:
* Fixed Doxygen error in RTE_MBUF_DIRECT() macro description.
(Thomas Monjalon)
* Added specific RTE_MBUF_DIRECT() macro implementation for API
documentation purposes, using flag names instead of numerical value.
v5:
* Removed the plain RTE_MBUF_DIRECT() macro, and only use the optimized
variant. (Bruce Richardson)
Further testing on Godbolt confirmed that other compilers benefit from
the optimized macro too.
* Shortened the description of the RTE_MBUF_DIRECT() macro, and only
provide one example of code emitted by a compiler. (Bruce Richardson)
* Consolidated the static_assert() into one, covering both little and big
endian.
This also reduces the amount of endian-conditional source code and
improves readability.
(Bruce Richardson)
* Added comment about MSB meaning "most significant byte".
v4:
* Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
architectures.
v3:
* Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
numerical value instead of long names. (Bruce Richardson)
* Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 48 +++++++++++++++++++++++++++++++++++--
2 files changed, 65 insertions(+), 34 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..506f7146f1 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -711,9 +711,53 @@ struct rte_mbuf_ext_shared_info {
*
* If a mbuf embeds its own data after the rte_mbuf structure, this mbuf
* can be defined as a direct mbuf.
- */
+ *
+ * Note: Macro optimized for code size.
+ *
+ * The plain macro would be:
+ * \code{.c}
+ * #define RTE_MBUF_DIRECT(mb) \
+ * (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+ * \endcode
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB (most significant
+ * byte) of the 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ *
+ * E.g., GCC version 16.0.0 20251019 (experimental) generates the following code for x86-64.
+ *
+ * With the plain macro, 17 bytes of instructions:
+ * \code
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * \endcode
+ * With this optimized macro, only 7 bytes of instructions:
+ * \code
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ * \endcode
+ */
+#ifdef __DOXYGEN__
+#define RTE_MBUF_DIRECT(mb) \
+ !(((const char *)(&(mb)->ol_flags))[MSB_OFFSET /* 7 or 0, depending on endianness */] & \
+ (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
+#else /* !__DOXYGEN__ */
+#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
+/* On little endian architecture, the MSB of a 64-bit integer is at byte offset 7. */
+#define RTE_MBUF_DIRECT(mb) \
+ !(((const char *)(&(mb)->ol_flags))[7] & \
+ (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
+#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
+/* On big endian architecture, the MSB of a 64-bit integer is at byte offset 0. */
#define RTE_MBUF_DIRECT(mb) \
- (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+ !(((const char *)(&(mb)->ol_flags))[0] & \
+ (char)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * CHAR_BIT)))
+#endif /* RTE_BYTE_ORDER */
+#endif /* !__DOXYGEN__ */
+/* Verify the optimization above. */
+static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) & (UINT64_C(0xFF) << (7 * CHAR_BIT))) ==
+ (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not at MSB");
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
--
2.43.0
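
For context, a minimal sketch of how a driver TX completion path typically
consumes rte_pktmbuf_prefree_seg(); the helper name here is hypothetical
and not part of this patch:

/* Return a transmitted segment to its mempool, unless other references
 * to it remain (in which case prefree returns NULL). */
static inline void
tx_free_seg(struct rte_mbuf *m)
{
	struct rte_mbuf *seg = rte_pktmbuf_prefree_seg(m);

	if (seg != NULL)
		rte_mbuf_raw_free(seg);
}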
^ permalink raw reply [flat|nested] 29+ messages in thread