* [PATCH] mbuf: optimize segment prefree
@ 2025-08-27 21:35 Morten Brørup
2025-08-27 23:17 ` Stephen Hemminger
` (4 more replies)
0 siblings, 5 replies; 15+ messages in thread
From: Morten Brørup @ 2025-08-27 21:35 UTC (permalink / raw)
To: dev; +Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
lib/mbuf/rte_mbuf.h | 52 ++++++++++++++++------------------------
lib/mbuf/rte_mbuf_core.h | 8 +++++++
2 files changed, 28 insertions(+), 32 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 06ab7502a5..f4a348597f 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1423,44 +1424,31 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one &&
+ __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..a5242274d7 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -715,6 +715,14 @@ struct rte_mbuf_ext_shared_info {
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+/* GCC only optimizes single-bit MSB tests this way, so do it by hand with multi-bit. */
+#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
+#undef RTE_MBUF_DIRECT
+#define RTE_MBUF_DIRECT(mb) \
+ (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
+ (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
+#endif
+
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
/** For backwards compatibility. */
--
2.43.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
@ 2025-08-27 23:17 ` Stephen Hemminger
2025-10-06 17:46 ` Wathsala Vithanage
2025-10-06 14:49 ` Morten Brørup
` (3 subsequent siblings)
4 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2025-08-27 23:17 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev
On Wed, 27 Aug 2025 21:35:34 +0000
Morten Brørup <mb@smartsharesystems.com> wrote:
>
> +/* GCC only optimizes single-bit MSB tests this way, so do it by hand with multi-bit. */
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
> +#endif
Complex enough expression that I would prefer this be an inline function
with some more temporary variables and more comments.
Like the magic 7 for mask??
* RE: [PATCH] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
2025-08-27 23:17 ` Stephen Hemminger
@ 2025-10-06 14:49 ` Morten Brørup
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
` (2 subsequent siblings)
4 siblings, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2025-10-06 14:49 UTC (permalink / raw)
To: dev
PING for review.
Only received feedback from Stephen, who found the GCC-optimized RTE_MBUF_DIRECT() macro complex and suggested it could be an inline function instead.
Venlig hilsen / Kind regards,
-Morten Brørup
> -----Original Message-----
> From: Morten Brørup [mailto:mb@smartsharesystems.com]
> Sent: Wednesday, 27 August 2025 23.36
> To: dev@dpdk.org
> Cc: Morten Brørup
> Subject: [PATCH] mbuf: optimize segment prefree
>
> Refactored rte_pktmbuf_prefree_seg() for both performance and
> readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with
> GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> lib/mbuf/rte_mbuf.h | 52 ++++++++++++++++------------------------
> lib/mbuf/rte_mbuf_core.h | 8 +++++++
> 2 files changed, 28 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 06ab7502a5..f4a348597f 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1423,44 +1424,31 @@ static inline int
> __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one &&
> + __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..a5242274d7 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -715,6 +715,14 @@ struct rte_mbuf_ext_shared_info {
> #define RTE_MBUF_DIRECT(mb) \
> (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
>
> +/* GCC only optimizes single-bit MSB tests this way, so do it by hand
> with multi-bit. */
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) +
> 7] & \
> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
> +#endif
> +
> /** Uninitialized or unspecified port. */
> #define RTE_MBUF_PORT_INVALID UINT16_MAX
> /** For backwards compatibility. */
> --
> 2.43.0
* Re: [PATCH] mbuf: optimize segment prefree
2025-08-27 23:17 ` Stephen Hemminger
@ 2025-10-06 17:46 ` Wathsala Vithanage
2025-10-06 18:26 ` Morten Brørup
0 siblings, 1 reply; 15+ messages in thread
From: Wathsala Vithanage @ 2025-10-06 17:46 UTC (permalink / raw)
To: dev
On 8/27/25 18:17, Stephen Hemminger wrote:
> On Wed, 27 Aug 2025 21:35:34 +0000
> Morten Brørup <mb@smartsharesystems.com> wrote:
>
>>
>> +/* GCC only optimizes single-bit MSB tests this way, so do it by hand with multi-bit. */
>> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
>> +#undef RTE_MBUF_DIRECT
>> +#define RTE_MBUF_DIRECT(mb) \
>> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
>> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
>> +#endif
> Complex enough expression that I would prefer this be an inline function
> with some more temporary variables and more comments.
> Like the magic 7 for mask??
+1
--wathsala
* RE: [PATCH] mbuf: optimize segment prefree
2025-10-06 17:46 ` Wathsala Vithanage
@ 2025-10-06 18:26 ` Morten Brørup
0 siblings, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2025-10-06 18:26 UTC (permalink / raw)
To: Wathsala Vithanage, stephen; +Cc: dev
> From: Wathsala Vithanage [mailto:wathsala.vithanage@arm.com]
> Sent: Monday, 6 October 2025 19.46
>
> On 8/27/25 18:17, Stephen Hemminger wrote:
> > On Wed, 27 Aug 2025 21:35:34 +0000
> > Morten Brørup <mb@smartsharesystems.com> wrote:
> >
> >>
> >> +/* GCC only optimizes single-bit MSB tests this way, so do it by
> hand with multi-bit. */
> >> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> >> +#undef RTE_MBUF_DIRECT
> >> +#define RTE_MBUF_DIRECT(mb) \
> >> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) +
> 7] & \
> >> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> 56)))
> >> +#endif
> > Complex enough expression that I would prefer this be an inline
> function
> > with some more temporary variables and more comments.
> > Like the magic 7 for mask??
>
> +1
So, instead of overriding the macro definition in the GCC exception case, you prefer something like:
#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
static __rte_always_inline bool
RTE_MBUF_DIRECT(const struct rte_mbuf * const mb)
{...}
#else
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
#endif
which would give us a function with a macro-like (upper case) name.
How about I just add a more detailed description to this macro?
After all, formally, it's an exception to the simple default macro.
* [PATCH v2] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
2025-08-27 23:17 ` Stephen Hemminger
2025-10-06 14:49 ` Morten Brørup
@ 2025-10-20 12:02 ` Morten Brørup
2025-10-20 14:24 ` Konstantin Ananyev
` (2 more replies)
2025-10-22 13:23 ` [PATCH v3] " Morten Brørup
2025-10-22 14:47 ` [PATCH v4] " Morten Brørup
4 siblings, 3 replies; 15+ messages in thread
From: Morten Brørup @ 2025-10-20 12:02 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage; +Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
---
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 27 +++++++++++++++++++++
2 files changed, 46 insertions(+), 32 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..41f40e1967 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -715,6 +715,33 @@ struct rte_mbuf_ext_shared_info {
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
+/* Optimization for code size.
+ * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
+ * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ * On little endian architecture, the MSB of a 64-bit integer is at byte offest 7.
+ *
+ * Note: Tested using GCC version 16.0.0 20251019 (experimental).
+ *
+ * Without this optimization, GCC generates 17 bytes of instructions:
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * With this optimization, GCC generates only 7 bytes of instructions:
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ */
+#undef RTE_MBUF_DIRECT
+#define RTE_MBUF_DIRECT(mb) \
+ (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
+ (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
+static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
+ (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
+ "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
+#endif
+
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
/** For backwards compatibility. */
--
2.43.0
* RE: [PATCH v2] mbuf: optimize segment prefree
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
@ 2025-10-20 14:24 ` Konstantin Ananyev
2025-10-21 8:38 ` fengchengwen
2025-10-22 9:08 ` Bruce Richardson
2 siblings, 0 replies; 15+ messages in thread
From: Konstantin Ananyev @ 2025-10-20 14:24 UTC (permalink / raw)
To: Morten Brørup, dev, Stephen Hemminger, Wathsala Vithanage
>
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> ---
> v2:
> * Fixed typo in commit description.
> * Fixed indentation.
> * Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
> (Stephen Hemminger)
> * Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
> macro is valid, specifically that the tested bits are in the MSB of the
> 64-bit field.
> ---
> lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
> lib/mbuf/rte_mbuf_core.h | 27 +++++++++++++++++++++
> 2 files changed, 46 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 3df22125de..2004391f57 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1458,44 +1459,30 @@ static inline int
> __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..41f40e1967 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -715,6 +715,33 @@ struct rte_mbuf_ext_shared_info {
> #define RTE_MBUF_DIRECT(mb) \
> (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> RTE_MBUF_F_EXTERNAL)))
>
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +/* Optimization for code size.
> + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with
> multi-bit.
> + *
> + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in
> the MSB of the
> + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> + *
> + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> + *
> + * Without this optimization, GCC generates 17 bytes of instructions:
> + * movabs rax,0x6000000000000000 // 10 bytes
> + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> + * sete al // 3 bytes
> + * With this optimization, GCC generates only 7 bytes of instructions:
> + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> + * sete al // 3 bytes
> + */
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> 8))))
> +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))
> << (7 * 8) ==
> + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in
> MSB.");
> +#endif
> +
> /** Uninitialized or unspecified port. */
> #define RTE_MBUF_PORT_INVALID UINT16_MAX
> /** For backwards compatibility. */
> --
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> 2.43.0
* Re: [PATCH v2] mbuf: optimize segment prefree
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
2025-10-20 14:24 ` Konstantin Ananyev
@ 2025-10-21 8:38 ` fengchengwen
2025-10-22 9:08 ` Bruce Richardson
2 siblings, 0 replies; 15+ messages in thread
From: fengchengwen @ 2025-10-21 8:38 UTC (permalink / raw)
To: Morten Brørup, dev, Stephen Hemminger, Wathsala Vithanage
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
On 10/20/2025 8:02 PM, Morten Brørup wrote:
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
Learned a lot, thanks
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
* Re: [PATCH v2] mbuf: optimize segment prefree
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
2025-10-20 14:24 ` Konstantin Ananyev
2025-10-21 8:38 ` fengchengwen
@ 2025-10-22 9:08 ` Bruce Richardson
2025-10-22 13:53 ` Morten Brørup
2 siblings, 1 reply; 15+ messages in thread
From: Bruce Richardson @ 2025-10-22 9:08 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, Stephen Hemminger, Wathsala Vithanage
On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
Comments inline below.
> ---
> v2:
> * Fixed typo in commit description.
> * Fixed indentation.
> * Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
> (Stephen Hemminger)
> * Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
> macro is valid, specifically that the tested bits are in the MSB of the
> 64-bit field.
> ---
> lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
> lib/mbuf/rte_mbuf_core.h | 27 +++++++++++++++++++++
> 2 files changed, 46 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 3df22125de..2004391f57 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
Nice refactor, much more readable, thanks.
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..41f40e1967 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -715,6 +715,33 @@ struct rte_mbuf_ext_shared_info {
> #define RTE_MBUF_DIRECT(mb) \
> (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
>
> +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> +/* Optimization for code size.
> + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> + *
> + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> + *
> + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> + *
> + * Without this optimization, GCC generates 17 bytes of instructions:
> + * movabs rax,0x6000000000000000 // 10 bytes
> + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> + * sete al // 3 bytes
> + * With this optimization, GCC generates only 7 bytes of instructions:
> + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> + * sete al // 3 bytes
> + */
> +#undef RTE_MBUF_DIRECT
> +#define RTE_MBUF_DIRECT(mb) \
> + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) + 7] & \
> + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8))))
> +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 * 8)) << (7 * 8) ==
> + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in MSB.");
> +#endif
> +
Couple of comments/thoughts/questions here.
* This looks like a compiler limitation that should be fixed in GCC. If we
put this optimization in, how will we know when/if we can remove it again
in future? I'm not sure we want this hanging around forever.
* Can the static_assert - which just checks flags are in the MSB - be
  simplified to e.g.
"((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
* As in prev bullet, I tend to prefer use of CHAR_BIT over hard-coded 8.
* Is it necessary to limit this to just GCC and x86? If it leads to the
best code on x86, why not include for all compilers? What about non-x86
LE platforms?
* Does the actual macro need to be that long and complex? If we simplify a
bit, does the compiler go back to generating bad code? For example:
using "(mb->ol_flags >> 56) & ((RTE_MBUF_F_INDIRECT | ..) >> 56)"
* If the above is true, do we need to actually put this in assembler to
  guarantee the compiler generates good code in all situations?
* [PATCH v3] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
` (2 preceding siblings ...)
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
@ 2025-10-22 13:23 ` Morten Brørup
2025-10-22 14:47 ` [PATCH v4] " Morten Brørup
4 siblings, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2025-10-22 13:23 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng, Bruce Richardson
Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
---
v3:
* Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
numerical value instead of long names. (Bruce Richardson)
* Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 40 +++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+), 32 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..7c0bddc362 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -715,6 +715,46 @@ struct rte_mbuf_ext_shared_info {
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+#if defined(RTE_TOOLCHAIN_GCC) && (defined(RTE_ARCH_X86) || defined(RTE_ARCH_LOONGARCH))
+/* Optimization for code size.
+ * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
+ * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
+ *
+ * Note: Tested on x86-64 using GCC version 16.0.0 20251019 (experimental).
+ *
+ * Without this optimization, GCC generates 17 bytes of instructions:
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * With this optimization, GCC generates only 7 bytes of instructions:
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ *
+ * Note: Tested on loongarch using GCC version 15.2.0.
+ *
+ * Without this optimization, GCC generates 5 instructions:
+ * ld.d $a0, $a0, 24
+ * move $t0, $zero
+ * lu52i.d $t0, $t0, 1536
+ * and $a0, $a0, $t0
+ * sltui $a0, $a0, 1
+ * With this optimization, GCC generates only 3 instructions:
+ * ld.bu $a0, $a0, 31
+ * andi $a0, $a0, 0x60
+ * sltui $a0, $a0, 1
+ *
+ * Note: Tested on ARM64, GCC generates 3 instructions both without and with the optimization,
+ * so no need for the optimization here.
+ */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 7");
+#undef RTE_MBUF_DIRECT
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
+#endif
+
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
/** For backwards compatibility. */
--
2.43.0
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [PATCH v2] mbuf: optimize segment prefree
2025-10-22 9:08 ` Bruce Richardson
@ 2025-10-22 13:53 ` Morten Brørup
2025-10-22 14:12 ` Bruce Richardson
0 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2025-10-22 13:53 UTC (permalink / raw)
To: Bruce Richardson, Konstantin Ananyev, Chengwen Feng
Cc: dev, Stephen Hemminger, Wathsala Vithanage
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 22 October 2025 11.08
>
> On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> > Refactored rte_pktmbuf_prefree_seg() for both performance and
> readability.
> >
> > With the optimized RTE_MBUF_DIRECT() macro, the common likely code
> path
> > now fits within one instruction cache line on x86-64 when built with
> GCC.
> >
> > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
>
> Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
>
[...]
> > #define RTE_MBUF_DIRECT(mb) \
> > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> >
> > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> > +/* Optimization for code size.
> > + * GCC only optimizes single-bit MSB tests this way, so we do it by
> hand with multi-bit.
> > + *
> > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in
> the MSB of the
> > + * 64-bit ol_flags field, so we only compare this one byte instead
> of all 64 bits.
> > + * On little endian architecture, the MSB of a 64-bit integer is at
> byte offset 7.
> > + *
> > + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> > + *
> > + * Without this optimization, GCC generates 17 bytes of
> instructions:
> > + * movabs rax,0x6000000000000000 // 10 bytes
> > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > + * sete al // 3 bytes
> > + * With this optimization, GCC generates only 7 bytes of
> instructions:
> > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > + * sete al // 3 bytes
> > + */
> > +#undef RTE_MBUF_DIRECT
> > +#define RTE_MBUF_DIRECT(mb) \
> > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) +
> 7] & \
> > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> 8))))
> > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> 8)) << (7 * 8) ==
> > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in
> MSB.");
> > +#endif
> > +
> Couple of comments/thoughts/questions here.
>
> * This looks like a compiler limitation that should be fixed in GCC. IF
> we
> put this optimization in, how will we know when/if we can remove it
> again
> in future? I'm not sure we want this hanging around forever.
Agree.
There are plenty of hand crafted optimizations in DPDK, which are already obsolete;
it seems no one has found a good way of identifying them. Including myself.
> * Can the static_assert - which just checks flags are in the MSB - be
> * simplified to e.g.
> "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
> or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
> * As in prev bullet, I tend to prefer use of CHAR_BIT over hard-coded
> 8.
In v3, I have simplified both the static_assert and the optimized macro as you suggested on Slack,
with some minor improvements.
> * Is it necessary to limit this to just GCC and x86? If it leads to the
> best code on x86, why not include for all compilers? What about non-
> x86
> LE platforms?
I had already tested ARM64, where it didn't make a difference; now I have added a note about it.
I also tested ARM32, which doesn't benefit either, but I didn't add a note about it.
I also tested Loongarch (on Godbolt), which does benefit from it, so I added it.
Now, as I'm writing this email, Godbolt shows that RISC-V and POWER could also benefit.
Maybe we should just replace the standard macro with the optimized macro. WDYT?
> * Does the actual macro need to be that long and complex? If we
> simplify a
> bit, does the compiler go back to generating bad code? For example:
> using "(mb->ol_flags >> 56) & ((RTE_MBUF_F_INDIRECT | ..) >> 56)"
Simplified in v3.
> * If the above is true, do we need to actually put this in in assembler
> to
> guarantee compiler generates good code in all situations?
No need for assembler.
* Re: [PATCH v2] mbuf: optimize segment prefree
2025-10-22 13:53 ` Morten Brørup
@ 2025-10-22 14:12 ` Bruce Richardson
2025-10-22 14:14 ` Morten Brørup
0 siblings, 1 reply; 15+ messages in thread
From: Bruce Richardson @ 2025-10-22 14:12 UTC (permalink / raw)
To: Morten Brørup
Cc: Konstantin Ananyev, Chengwen Feng, dev, Stephen Hemminger,
Wathsala Vithanage
On Wed, Oct 22, 2025 at 03:53:21PM +0200, Morten Brørup wrote:
> > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > Sent: Wednesday, 22 October 2025 11.08
> >
> > On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> > > Refactored rte_pktmbuf_prefree_seg() for both performance and
> > readability.
> > >
> > > With the optimized RTE_MBUF_DIRECT() macro, the common likely code
> > path
> > > now fits within one instruction cache line on x86-64 when built with
> > GCC.
> > >
> > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> >
> > Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> >
>
> [...]
>
> > > #define RTE_MBUF_DIRECT(mb) \
> > > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> > >
> > > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> > > +/* Optimization for code size.
> > > + * GCC only optimizes single-bit MSB tests this way, so we do it by
> > hand with multi-bit.
> > > + *
> > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in
> > the MSB of the
> > > + * 64-bit ol_flags field, so we only compare this one byte instead
> > of all 64 bits.
> > > + * On little endian architecture, the MSB of a 64-bit integer is at
> > byte offset 7.
> > > + *
> > > + * Note: Tested using GCC version 16.0.0 20251019 (experimental).
> > > + *
> > > + * Without this optimization, GCC generates 17 bytes of
> > instructions:
> > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > + * sete al // 3 bytes
> > > + * With this optimization, GCC generates only 7 bytes of
> > instructions:
> > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > + * sete al // 3 bytes
> > > + */
> > > +#undef RTE_MBUF_DIRECT
> > > +#define RTE_MBUF_DIRECT(mb) \
> > > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf, ol_flags) +
> > 7] & \
> > > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> > 8))))
> > > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7 *
> > 8)) << (7 * 8) ==
> > > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in
> > MSB.");
> > > +#endif
> > > +
> > Couple of comments/thoughts/questions here.
> >
> > * This looks like a compiler limitation that should be fixed in GCC. IF
> > we
> > put this optimization in, how will we know when/if we can remove it
> > again
> > in future? I'm not sure we want this hanging around forever.
>
> Agree.
> There are plenty of hand crafted optimizations in DPDK, which are already obsolete;
> it seems no one has found a good way of identifying them. Including myself.
>
> > * Can the static_assert - which just checks flags are in the MSB - be
> > * simplified to e.g.
> > "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
> > or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
> > * As in prev bullet, I tend to prefer use of CHAR_BIT over hard-coded
> > 8.
>
> In v3, I have simplified both the static_assert and the optimized macro as you suggested on Slack,
> with some minor improvements.
>
> > * Is it necessary to limit this to just GCC and x86? If it leads to the
> > best code on x86, why not include for all compilers? What about non-
> > x86
> > LE platforms?
>
> I had already tested ARM64, where it didn't make a difference; now I have added a note about it.
> I also tested ARM32, which doesn't benefit either, but I didn't add a note about it.
> I also tested Loongarch (on Godbolt), which does benefit from it, so I added it.
>
> Now, as I'm writing this email, Godbolt shows that RISC-V and POWER could also benefit.
> Maybe we should just replace the standard macro with the optimized macro. WDYT?
>
I think that's not a bad idea. At least everything would be consistent.
/Bruce
* RE: [PATCH v2] mbuf: optimize segment prefree
2025-10-22 14:12 ` Bruce Richardson
@ 2025-10-22 14:14 ` Morten Brørup
0 siblings, 0 replies; 15+ messages in thread
From: Morten Brørup @ 2025-10-22 14:14 UTC (permalink / raw)
To: Bruce Richardson
Cc: Konstantin Ananyev, Chengwen Feng, dev, Stephen Hemminger,
Wathsala Vithanage
> From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> Sent: Wednesday, 22 October 2025 16.12
>
> On Wed, Oct 22, 2025 at 03:53:21PM +0200, Morten Brørup wrote:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > Sent: Wednesday, 22 October 2025 11.08
> > >
> > > On Mon, Oct 20, 2025 at 12:02:01PM +0000, Morten Brørup wrote:
> > > > Refactored rte_pktmbuf_prefree_seg() for both performance and
> > > readability.
> > > >
> > > > With the optimized RTE_MBUF_DIRECT() macro, the common likely
> code
> > > path
> > > > now fits within one instruction cache line on x86-64 when built
> with
> > > GCC.
> > > >
> > > > Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> > >
> > > Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> > >
> >
> > [...]
> >
> > > > #define RTE_MBUF_DIRECT(mb) \
> > > > (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT |
> RTE_MBUF_F_EXTERNAL)))
> > > >
> > > > +#if defined(RTE_TOOLCHAIN_GCC) && defined(RTE_ARCH_X86)
> > > > +/* Optimization for code size.
> > > > + * GCC only optimizes single-bit MSB tests this way, so we do it
> by
> > > hand with multi-bit.
> > > > + *
> > > > + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are
> both in
> > > the MSB of the
> > > > + * 64-bit ol_flags field, so we only compare this one byte
> instead
> > > of all 64 bits.
> > > > + * On little endian architecture, the MSB of a 64-bit integer is
> at
> > > byte offset 7.
> > > > + *
> > > > + * Note: Tested using GCC version 16.0.0 20251019
> (experimental).
> > > > + *
> > > > + * Without this optimization, GCC generates 17 bytes of
> > > instructions:
> > > > + * movabs rax,0x6000000000000000 // 10 bytes
> > > > + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> > > > + * sete al // 3 bytes
> > > > + * With this optimization, GCC generates only 7 bytes of
> > > instructions:
> > > > + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> > > > + * sete al // 3 bytes
> > > > + */
> > > > +#undef RTE_MBUF_DIRECT
> > > > +#define RTE_MBUF_DIRECT(mb) \
> > > > + (!(((const uint8_t *)(mb))[offsetof(struct rte_mbuf,
> ol_flags) +
> > > 7] & \
> > > > + (uint8_t)((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7
> *
> > > 8))))
> > > > +static_assert(((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) >> (7
> *
> > > 8)) << (7 * 8) ==
> > > > + (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL),
> > > > + "RTE_MBUF_F_INDIRECT and/or RTE_MBUF_F_EXTERNAL are not in
> > > MSB.");
> > > > +#endif
> > > > +
> > > Couple of comments/thoughts/questions here.
> > >
> > > * This looks like a compiler limitation that should be fixed in
> GCC. IF
> > > we
> > > put this optimization in, how will we know when/if we can remove
> it
> > > again
> > > in future? I'm not sure we want this hanging around forever.
> >
> > Agree.
> > There are plenty of hand crafted optimizations in DPDK, which are
> already obsolete;
> > it seems no one has found a good way of identifying them. Including
> myself.
> >
> > > * Can the static_assert - which just checks flags are in the MSB -
> be
> > > * simplified to e.g.
> > > "((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) << CHAR_BIT) == 0"
> > > or "__builtin_ctzll(...) > (7 * CHAR_BIT)"
> > > * As in prev bullet, I tend to prefer use of CHAR_BIT over hard-
> coded
> > > 8.
> >
> > In v3, I have simplified both the static_assert and the optimized
> macro as you suggested on Slack,
> > with some minor improvements.
> >
> > > * Is it necessary to limit this to just GCC and x86? If it leads to
> the
> > > best code on x86, why not include for all compilers? What about
> non-
> > > x86
> > > LE platforms?
> >
> > I had already tested ARM64, where it didn't make a difference; now I
> have added a note about it.
> > I also tested ARM32, which doesn't benefit either, but I didn't add a
> note about it.
> > I also tested Loongarch (on Godbolt), which does benefit from it, so
> I added it.
> >
> > Now, as I'm writing this email, Godbolt shows that RISC-V and POWER
> could also benefit.
> > Maybe we should just replace the standard macro with the optimized
> macro. WDYT?
> >
> I think that's not a bad idea. At least everything would be consistent.
I'll post a v4. Then we can go back to v3 if it looks too weird.
>
> /Bruce
* [PATCH v4] mbuf: optimize segment prefree
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
` (3 preceding siblings ...)
2025-10-22 13:23 ` [PATCH v3] " Morten Brørup
@ 2025-10-22 14:47 ` Morten Brørup
2025-10-22 15:02 ` Bruce Richardson
4 siblings, 1 reply; 15+ messages in thread
From: Morten Brørup @ 2025-10-22 14:47 UTC (permalink / raw)
To: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng, Bruce Richardson
Cc: Morten Brørup
Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
now fits within one instruction cache line on x86-64 when built with GCC.
Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
Acked-by: Chengwen Feng <fengchengwen@huawei.com>
Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
---
v4:
* Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
architectures.
v3:
* Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
numerical value instead of long names. (Bruce Richardson)
* Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
v2:
* Fixed typo in commit description.
* Fixed indentation.
* Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
(Stephen Hemminger)
* Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
macro is valid, specifically that the tested bits are in the MSB of the
64-bit field.
---
lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
lib/mbuf/rte_mbuf_core.h | 47 ++++++++++++++++++++++++++++++++++++
2 files changed, 66 insertions(+), 32 deletions(-)
diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 3df22125de..2004391f57 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -31,6 +31,7 @@
* http://www.kohala.com/start/tcpipiv2.html
*/
+#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
@@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
- __rte_mbuf_sanity_check(m, 0);
-
- if (likely(rte_mbuf_refcnt_read(m) == 1)) {
-
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
-
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ bool refcnt_not_one;
- return m;
+ __rte_mbuf_sanity_check(m, 0);
- } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
+ refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
+ if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
+ return NULL;
- if (!RTE_MBUF_DIRECT(m)) {
- rte_pktmbuf_detach(m);
- if (RTE_MBUF_HAS_EXTBUF(m) &&
- RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
- __rte_pktmbuf_pinned_extbuf_decref(m))
- return NULL;
- }
+ if (unlikely(!RTE_MBUF_DIRECT(m))) {
+ rte_pktmbuf_detach(m);
+ if (RTE_MBUF_HAS_EXTBUF(m) &&
+ RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
+ __rte_pktmbuf_pinned_extbuf_decref(m))
+ return NULL;
+ }
- if (m->next != NULL)
- m->next = NULL;
- if (m->nb_segs != 1)
- m->nb_segs = 1;
+ if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
+ if (m->nb_segs != 1)
+ m->nb_segs = 1;
+ if (m->next != NULL)
+ m->next = NULL;
- return m;
- }
- return NULL;
+ return m;
}
/**
diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
index a0df265b5d..37f2975158 100644
--- a/lib/mbuf/rte_mbuf_core.h
+++ b/lib/mbuf/rte_mbuf_core.h
@@ -706,6 +706,7 @@ struct rte_mbuf_ext_shared_info {
*/
#define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & RTE_MBUF_F_EXTERNAL)
+#if !defined(RTE_TOOLCHAIN_GCC) || defined __DOXYGEN__
/**
* Returns TRUE if given mbuf is direct, or FALSE otherwise.
*
@@ -714,6 +715,52 @@ struct rte_mbuf_ext_shared_info {
*/
#define RTE_MBUF_DIRECT(mb) \
(!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
+#else /* RTE_TOOLCHAIN_GCC */
+#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
+/* Macro optimized for code size.
+ * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
+ *
+ * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
+ * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
+ * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
+ *
+ * Note: Tested on x86-64 using GCC version 16.0.0 20251019 (experimental).
+ *
+ * Without this optimization, GCC generates 17 bytes of instructions:
+ * movabs rax,0x6000000000000000 // 10 bytes
+ * and rax,QWORD PTR [rdi+0x18] // 4 bytes
+ * sete al // 3 bytes
+ * With this optimization, GCC generates only 7 bytes of instructions:
+ * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
+ * sete al // 3 bytes
+ *
+ * Note: Tested on loongarch using GCC version 15.2.0.
+ *
+ * Without this optimization, GCC generates 5 instructions:
+ * ld.d $a0, $a0, 24
+ * move $t0, $zero
+ * lu52i.d $t0, $t0, 1536
+ * and $a0, $a0, $t0
+ * sltui $a0, $a0, 1
+ * With this optimization, GCC generates only 3 instructions:
+ * ld.bu $a0, $a0, 31
+ * andi $a0, $a0, 0x60
+ * sltui $a0, $a0, 1
+ *
+ * Note: GCC also generates smaller code size with the optimized macro on many other architectures.
+ *
+ * Note: GCC generates the same code size as with the plain macro on ARM (64 and 32 bit).
+ */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 7");
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
+#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
+/* As described above; but on big endian architecture, the MSB is at byte offset 0. */
+static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
+ "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 0");
+#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
+#endif /* RTE_BYTE_ORDER */
+#endif /* RTE_TOOLCHAIN_GCC */
/** Uninitialized or unspecified port. */
#define RTE_MBUF_PORT_INVALID UINT16_MAX
--
2.43.0
* Re: [PATCH v4] mbuf: optimize segment prefree
2025-10-22 14:47 ` [PATCH v4] " Morten Brørup
@ 2025-10-22 15:02 ` Bruce Richardson
0 siblings, 0 replies; 15+ messages in thread
From: Bruce Richardson @ 2025-10-22 15:02 UTC (permalink / raw)
To: Morten Brørup
Cc: dev, Stephen Hemminger, Wathsala Vithanage, Konstantin Ananyev,
Chengwen Feng
On Wed, Oct 22, 2025 at 02:47:08PM +0000, Morten Brørup wrote:
> Refactored rte_pktmbuf_prefree_seg() for both performance and readability.
>
> With the optimized RTE_MBUF_DIRECT() macro, the common likely code path
> now fits within one instruction cache line on x86-64 when built with GCC.
>
> Signed-off-by: Morten Brørup <mb@smartsharesystems.com>
> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Acked-by: Chengwen Feng <fengchengwen@huawei.com>
> Reviewed-by: Bruce Richardson <bruce.richardson@intel.com>
> ---
> v4:
> * Enabled the optimized RTE_MBUF_DIRECT() macro for GCC on all
> architectures.
> v3:
> * Rewrote the optimized RTE_MBUF_DIRECT() macro for readability; use
> numerical value instead of long names. (Bruce Richardson)
> * Enabled the optimized RTE_MBUF_DIRECT() macro for Loongarch too.
> v2:
> * Fixed typo in commit description.
> * Fixed indentation.
> * Added detailed description to the optimized RTE_MBUF_DIRECT() macro.
> (Stephen Hemminger)
> * Added static_assert() to verify that the optimized RTE_MBUF_DIRECT()
> macro is valid, specifically that the tested bits are in the MSB of the
> 64-bit field.
> ---
> lib/mbuf/rte_mbuf.h | 51 +++++++++++++++-------------------------
> lib/mbuf/rte_mbuf_core.h | 47 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 66 insertions(+), 32 deletions(-)
>
> diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
> index 3df22125de..2004391f57 100644
> --- a/lib/mbuf/rte_mbuf.h
> +++ b/lib/mbuf/rte_mbuf.h
> @@ -31,6 +31,7 @@
> * http://www.kohala.com/start/tcpipiv2.html
> */
>
> +#include <stdbool.h>
> #include <stdint.h>
>
> #include <rte_common.h>
> @@ -1458,44 +1459,30 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> - __rte_mbuf_sanity_check(m, 0);
> -
> - if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> -
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> -
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + bool refcnt_not_one;
>
> - return m;
> + __rte_mbuf_sanity_check(m, 0);
>
> - } else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> + refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
> + if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
> + return NULL;
>
> - if (!RTE_MBUF_DIRECT(m)) {
> - rte_pktmbuf_detach(m);
> - if (RTE_MBUF_HAS_EXTBUF(m) &&
> - RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> - __rte_pktmbuf_pinned_extbuf_decref(m))
> - return NULL;
> - }
> + if (unlikely(!RTE_MBUF_DIRECT(m))) {
> + rte_pktmbuf_detach(m);
> + if (RTE_MBUF_HAS_EXTBUF(m) &&
> + RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
> + __rte_pktmbuf_pinned_extbuf_decref(m))
> + return NULL;
> + }
>
> - if (m->next != NULL)
> - m->next = NULL;
> - if (m->nb_segs != 1)
> - m->nb_segs = 1;
> + if (refcnt_not_one)
> rte_mbuf_refcnt_set(m, 1);
> + if (m->nb_segs != 1)
> + m->nb_segs = 1;
> + if (m->next != NULL)
> + m->next = NULL;
>
> - return m;
> - }
> - return NULL;
> + return m;
> }
>
> /**
> diff --git a/lib/mbuf/rte_mbuf_core.h b/lib/mbuf/rte_mbuf_core.h
> index a0df265b5d..37f2975158 100644
> --- a/lib/mbuf/rte_mbuf_core.h
> +++ b/lib/mbuf/rte_mbuf_core.h
> @@ -706,6 +706,7 @@ struct rte_mbuf_ext_shared_info {
> */
> #define RTE_MBUF_HAS_EXTBUF(mb) ((mb)->ol_flags & RTE_MBUF_F_EXTERNAL)
>
> +#if !defined(RTE_TOOLCHAIN_GCC) || defined __DOXYGEN__
> /**
> * Returns TRUE if given mbuf is direct, or FALSE otherwise.
> *
> @@ -714,6 +715,52 @@ struct rte_mbuf_ext_shared_info {
> */
> #define RTE_MBUF_DIRECT(mb) \
> (!((mb)->ol_flags & (RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL)))
> +#else /* RTE_TOOLCHAIN_GCC */
Do we need the non-gcc block?
> +#if RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN
> +/* Macro optimized for code size.
> + * GCC only optimizes single-bit MSB tests this way, so we do it by hand with multi-bit.
> + *
> + * The flags RTE_MBUF_F_INDIRECT and RTE_MBUF_F_EXTERNAL are both in the MSB of the
> + * 64-bit ol_flags field, so we only compare this one byte instead of all 64 bits.
> + * On little endian architecture, the MSB of a 64-bit integer is at byte offset 7.
> + *
> + * Note: Tested on x86-64 using GCC version 16.0.0 20251019 (experimental).
> + *
> + * Without this optimization, GCC generates 17 bytes of instructions:
> + * movabs rax,0x6000000000000000 // 10 bytes
> + * and rax,QWORD PTR [rdi+0x18] // 4 bytes
> + * sete al // 3 bytes
> + * With this optimization, GCC generates only 7 bytes of instructions:
> + * test BYTE PTR [rdi+0x1f],0x60 // 4 bytes
> + * sete al // 3 bytes
> + *
> + * Note: Tested on loongarch using GCC version 15.2.0.
> + *
> + * Without this optimization, GCC generates 5 instructions:
> + * ld.d $a0, $a0, 24
> + * move $t0, $zero
> + * lu52i.d $t0, $t0, 1536
> + * and $a0, $a0, $t0
> + * sltui $a0, $a0, 1
> + * With this optimization, GCC generates only 3 instructions:
> + * ld.bu $a0, $a0, 31
> + * andi $a0, $a0, 0x60
> + * sltui $a0, $a0, 1
> + *
> + * Note: GCC also generates smaller code size with the optimized macro on many other architectures.
> + *
> + * Note: GCC generates the same code size as with the plain macro on ARM (64 and 32 bit).
> + */
> +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
> + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 7");
> +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[7] & 0x60)
> +#elif RTE_BYTE_ORDER == RTE_BIG_ENDIAN
> +/* As described above; but on big endian architecture, the MSB is at byte offset 0. */
> +static_assert((RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) == UINT64_C(0x60) << (7 * CHAR_BIT),
> + "(RTE_MBUF_F_INDIRECT | RTE_MBUF_F_EXTERNAL) is not 0x60 at byte offset 0");
> +#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[0] & 0x60)
> +#endif /* RTE_BYTE_ORDER */
Minor nit, the static assert is common, so can be put out of the endianness
check rather than duplicated. I'd also suggest putting the comment
outside it too, so that the endianness check only covers one line each.
Also, in the comment, showing an example on one arch is probably enough,
showing two sort of implies that you should show all, while if you show
just one, you can clearly state it's just one example.
BTW: if you want to remove the endian ifdefs, you can compute the offset
as e.g.
#define OLF_MSB (7 * (RTE_BYTE_ORDER == RTE_LITTLE_ENDIAN)) // 7 for LE, 0 for BE
#define RTE_MBUF_DIRECT(mb) !(((const char *)(&(mb)->ol_flags))[OLF_MSB] & 0x60)
> +#endif /* RTE_TOOLCHAIN_GCC */
>
> /** Uninitialized or unspecified port. */
> #define RTE_MBUF_PORT_INVALID UINT16_MAX
> --
> 2.43.0
>
end of thread, other threads:[~2025-10-22 15:02 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-27 21:35 [PATCH] mbuf: optimize segment prefree Morten Brørup
2025-08-27 23:17 ` Stephen Hemminger
2025-10-06 17:46 ` Wathsala Vithanage
2025-10-06 18:26 ` Morten Brørup
2025-10-06 14:49 ` Morten Brørup
2025-10-20 12:02 ` [PATCH v2] " Morten Brørup
2025-10-20 14:24 ` Konstantin Ananyev
2025-10-21 8:38 ` fengchengwen
2025-10-22 9:08 ` Bruce Richardson
2025-10-22 13:53 ` Morten Brørup
2025-10-22 14:12 ` Bruce Richardson
2025-10-22 14:14 ` Morten Brørup
2025-10-22 13:23 ` [PATCH v3] " Morten Brørup
2025-10-22 14:47 ` [PATCH v4] " Morten Brørup
2025-10-22 15:02 ` Bruce Richardson