* mbuf fast-free requirements analysis
From: Morten Brørup @ 2025-12-15 11:06 UTC (permalink / raw)
To: dev, techboard
Executive Summary:
My analysis shows that the mbuf library is not a barrier to fast-freeing
segmented packet mbufs, and thus fast-free of jumbo frames is possible.
Detailed Analysis:
The purpose of the mbuf fast-free Tx optimization is to reduce
rte_pktmbuf_free_seg() to something much simpler in the ethdev drivers, by
eliminating the code path related to indirect mbufs.
Ideally, we want to simplify the ethdev driver's function that frees the
transmitted mbufs, so it can free them directly to their mempool without
accessing the mbufs themselves.
If the driver cannot access the mbuf itself, it cannot determine which
mempool it belongs to.
We don't want the driver to access every mbuf being freed; but if all
mbufs of a Tx queue belong to the same mempool, the driver can determine
the mempool by looking at just one of them.
REQUIREMENT 1: The mbufs of a Tx queue must come from the same mempool.
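To illustrate what requirement 1 buys us, here is a minimal sketch of a
fast-free Tx completion path; txq_fast_free_bulk() is a hypothetical helper
of mine, not an existing DPDK or driver function:

#include <rte_mbuf.h>
#include <rte_mempool.h>

static inline void
txq_fast_free_bulk(struct rte_mbuf **mbufs, unsigned int n)
{
	/* Requirement 1 guarantees a single mempool per Tx queue,
	 * so the pool can be read from the first mbuf only. */
	struct rte_mempool *mp = mbufs[0]->pool;

	/* Return the whole batch without inspecting the remaining mbufs. */
	rte_mempool_put_bulk(mp, (void **)mbufs, n);
}

This is essentially what fast-free capable drivers already do today for
single-segment packets; the rest of the analysis is about when it is safe.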
When an mbuf is freed to its mempool, some of the fields in the mbuf must
be initialized.
So, for fast-free, this must be done by the driver's function that
prepares the Tx descriptor.
This is a requirement on the driver, not on the application.
Now, let's dig into the code for freeing an mbuf.
Note: For readability purposes, I'll cut out some code and comments
unrelated to this topic.
static __rte_always_inline void
rte_pktmbuf_free_seg(struct rte_mbuf *m)
{
m = rte_pktmbuf_prefree_seg(m);
if (likely(m != NULL))
rte_mbuf_raw_free(m);
}
rte_mbuf_raw_free(m) is simple, so nothing to gain there:
/**
* Put mbuf back into its original mempool.
*
* The caller must ensure that the mbuf is direct and properly
* reinitialized (refcnt=1, next=NULL, nb_segs=1), as done by
* rte_pktmbuf_prefree_seg().
*/
static __rte_always_inline void
rte_mbuf_raw_free(struct rte_mbuf *m)
{
rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_FREE);
rte_mempool_put(m->pool, m);
}
Note that the description says that the mbuf must be direct.
This is not entirely accurate; the mbuf is allowed to use a pinned
external buffer, if the mbuf holds the only reference to it.
(Most of the mbuf library functions have this documentation inaccuracy,
which should be fixed some day.)
So, the fast-free optimization really comes down to
rte_pktmbuf_prefree_seg(m), which must not return NULL.
Let's dig into that.
/**
* Decrease reference counter and unlink a mbuf segment
*
* This function does the same than a free, except that it does not
* return the segment to its pool.
* It decreases the reference counter, and if it reaches 0, it is
* detached from its parent for an indirect mbuf.
*
* @return
* - (m) if it is the last reference. It can be recycled or freed.
* - (NULL) if the mbuf still has remaining references on it.
*/
static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
bool refcnt_not_one;
refcnt_not_one = unlikely(rte_mbuf_refcnt_read(m) != 1);
if (refcnt_not_one && __rte_mbuf_refcnt_update(m, -1) != 0)
return NULL;
if (unlikely(!RTE_MBUF_DIRECT(m))) {
rte_pktmbuf_detach(m);
if (RTE_MBUF_HAS_EXTBUF(m) &&
RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
__rte_pktmbuf_pinned_extbuf_decref(m))
return NULL;
}
if (refcnt_not_one)
rte_mbuf_refcnt_set(m, 1);
if (m->nb_segs != 1)
m->nb_segs = 1;
if (m->next != NULL)
m->next = NULL;
return m;
}
This function can only succeed (i.e. return non-NULL) when 'refcnt' is 1
(or reaches 0).
REQUIREMENT 2: The driver must hold the only reference to the mbuf,
i.e. 'm->refcnt' must be 1.
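As a concrete example of what violates requirement 2 on the application
side - a sketch, where 'port_id' and 'queue_id' are assumed to come from
the application's setup:

#include <rte_ethdev.h>
#include <rte_mbuf.h>

/* Hypothetical application code illustrating a requirement 2 violation. */
static void
tx_with_retained_reference(uint16_t port_id, uint16_t queue_id,
		struct rte_mbuf *m)
{
	/* The application keeps its own reference, e.g. for retransmission. */
	rte_mbuf_refcnt_update(m, 1);	/* refcnt is now 2 */

	/* After this, the driver does NOT hold the only reference to 'm',
	 * so this queue must not be configured with
	 * RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE.
	 * (Return value ignored for brevity.) */
	(void)rte_eth_tx_burst(port_id, queue_id, &m, 1);
}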
When the function succeeds, it initializes the mbuf fields as required by
rte_mbuf_raw_free() before returning.
Now, since the driver has exclusive access to the mbuf, it is free to
initialize 'm->next' and 'm->nb_segs' at any time.
It could do that when preparing the Tx descriptor.
This is very interesting, because it means that fast-free does not
prohibit segmented packets!
(But the driver must have sufficient Tx descriptors for all segments in
the mbuf.)
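For illustration, a sketch of how a driver could do this while walking the
segment chain at descriptor preparation time; 'struct my_txq' and
txq_fill_seg_desc() are placeholders for the driver's own queue structure
and descriptor logic, not existing DPDK APIs:

#include <rte_mbuf.h>

struct my_txq;	/* placeholder for the driver's Tx queue structure */
void txq_fill_seg_desc(struct my_txq *txq, struct rte_mbuf *seg);	/* placeholder */

static inline void
txq_prepare_pkt(struct my_txq *txq, struct rte_mbuf *pkt)
{
	struct rte_mbuf *seg = pkt;

	while (seg != NULL) {
		struct rte_mbuf *next = seg->next;

		/* One Tx descriptor per segment (hardware specific). */
		txq_fill_seg_desc(txq, seg);

		/* Reset the fields rte_mbuf_raw_free() relies on while the
		 * segment is already hot in the cache; the completion path
		 * can then skip rte_pktmbuf_prefree_seg() entirely. */
		seg->next = NULL;
		seg->nb_segs = 1;

		seg = next;
	}
}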
Now, let's dig into rte_pktmbuf_prefree_seg()'s block handling non-direct
mbufs, i.e. cloned mbufs and mbufs with an external buffer:
if (unlikely(!RTE_MBUF_DIRECT(m))) {
rte_pktmbuf_detach(m);
if (RTE_MBUF_HAS_EXTBUF(m) &&
RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
__rte_pktmbuf_pinned_extbuf_decref(m))
return NULL;
}
Starting with rte_pktmbuf_detach():
static inline void rte_pktmbuf_detach(struct rte_mbuf *m)
{
struct rte_mempool *mp = m->pool;
uint32_t mbuf_size, buf_len;
uint16_t priv_size;
if (RTE_MBUF_HAS_EXTBUF(m)) {
/*
* The mbuf has the external attached buffer,
* we should check the type of the memory pool where
* the mbuf was allocated from to detect the pinned
* external buffer.
*/
uint32_t flags = rte_pktmbuf_priv_flags(mp);
if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF) {
/*
* The pinned external buffer should not be
* detached from its backing mbuf, just exit.
*/
return;
}
__rte_pktmbuf_free_extbuf(m);
} else {
__rte_pktmbuf_free_direct(m);
}
priv_size = rte_pktmbuf_priv_size(mp);
mbuf_size = (uint32_t)(sizeof(struct rte_mbuf) + priv_size);
buf_len = rte_pktmbuf_data_room_size(mp);
m->priv_size = priv_size;
m->buf_addr = (char *)m + mbuf_size;
rte_mbuf_iova_set(m, rte_mempool_virt2iova(m) + mbuf_size);
m->buf_len = (uint16_t)buf_len;
rte_pktmbuf_reset_headroom(m);
m->data_len = 0;
m->ol_flags = 0;
}
The only quick and simple code path through this function is when the mbuf
uses a pinned external buffer:
if (RTE_MBUF_HAS_EXTBUF(m)) {
uint32_t flags = rte_pktmbuf_priv_flags(mp);
if (flags & RTE_PKTMBUF_POOL_F_PINNED_EXT_BUF)
return;
REQUIREMENT 3: The mbuf must not be cloned or use a non-pinned external buffer.
Continuing with the next part of rte_pktmbuf_prefree_seg()'s block:
if (RTE_MBUF_HAS_EXTBUF(m) &&
RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
__rte_pktmbuf_pinned_extbuf_decref(m))
return NULL;
Let's dig into __rte_pktmbuf_pinned_extbuf_decref():
/**
* @internal Handle the packet mbufs with attached pinned external buffer
* on the mbuf freeing:
*
* - return zero if reference counter in shinfo is one. It means there is
* no more reference to this pinned buffer and mbuf can be returned to
* the pool
*
* - otherwise (if reference counter is not one), decrement reference
* counter and return non-zero value to prevent freeing the backing mbuf.
*
* Returns non zero if mbuf should not be freed.
*/
static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
{
struct rte_mbuf_ext_shared_info *shinfo;
/* Clear flags, mbuf is being freed. */
m->ol_flags = RTE_MBUF_F_EXTERNAL;
shinfo = m->shinfo;
/* Optimize for performance - do not dec/reinit */
if (likely(rte_mbuf_ext_refcnt_read(shinfo) == 1))
return 0;
/*
* Direct usage of add primitive to avoid
* duplication of comparing with one.
*/
if (likely(rte_atomic_fetch_add_explicit(&shinfo->refcnt, -1,
rte_memory_order_acq_rel) - 1))
return 1;
/* Reinitialize counter before mbuf freeing. */
rte_mbuf_ext_refcnt_set(shinfo, 1);
return 0;
}
Essentially, if the mbuf does use a pinned external buffer,
rte_pktmbuf_prefree_seg() only succeeds if that pinned external buffer is
referred to solely by this mbuf.
REQUIREMENT 4: If the mbuf uses a pinned external buffer, the mbuf must
hold the only reference to that pinned external buffer, i.e. in that case,
'm->shinfo->refcnt' must be 1.
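Pulling the four requirements together, here is a sketch of a debug-time
check a driver or application could run per segment before relying on
fast-free; mbuf_fast_free_ok() is my own name, not an existing DPDK API,
and 'txq_mp' is assumed to be the single mempool configured for the Tx
queue:

#include <stdbool.h>
#include <rte_mbuf.h>

static inline bool
mbuf_fast_free_ok(const struct rte_mbuf *m, const struct rte_mempool *txq_mp)
{
	/* Requirement 1: same mempool for all mbufs of the Tx queue. */
	if (m->pool != txq_mp)
		return false;
	/* Requirement 2: the caller holds the only reference to the mbuf. */
	if (rte_mbuf_refcnt_read(m) != 1)
		return false;
	/* Requirement 3: direct, or carrying a pinned external buffer. */
	if (!RTE_MBUF_DIRECT(m) &&
	    !(RTE_MBUF_HAS_EXTBUF(m) && RTE_MBUF_HAS_PINNED_EXTBUF(m)))
		return false;
	/* Requirement 4: a pinned external buffer must only be referenced
	 * by this mbuf. */
	if (RTE_MBUF_HAS_EXTBUF(m) && RTE_MBUF_HAS_PINNED_EXTBUF(m) &&
	    rte_mbuf_ext_refcnt_read(m->shinfo) != 1)
		return false;
	return true;
}

Such a check only belongs in debug builds (e.g. behind RTE_ASSERT()); in
the fast path the driver simply trusts the application, as fast-free does
today.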
Please review.
If I'm not mistaken, the mbuf library is not a barrier to fast-freeing
segmented packet mbufs, and thus fast-free of jumbo frames is possible.
We need a driver developer to confirm that my suggested approach -
resetting the mbuf fields, incl. 'm->nb_segs' and 'm->next', when
preparing the Tx descriptor - is viable.
* Re: mbuf fast-free requirements analysis
From: Bruce Richardson @ 2025-12-15 11:46 UTC (permalink / raw)
To: Morten Brørup; +Cc: dev, techboard
On Mon, Dec 15, 2025 at 12:06:38PM +0100, Morten Brørup wrote:
> [full quote of the original analysis snipped]
Excellent analysis, Morten. If I get a chance some time this release cycle,
I will try implementing this change in our drivers and see if it makes any
difference.
/Bruce
* RE: mbuf fast-free requirements analysis
From: Konstantin Ananyev @ 2025-12-15 14:41 UTC (permalink / raw)
To: Morten Brørup, dev, techboard
> [full quote of the original analysis snipped]
Great analysis; it makes a lot of sense to me.
Shall we then add a special API to make PMD maintainers' lives a bit easier?
Something like rte_mbuf_fast_free_prep(mp, mb), which would optionally check
that the requirements outlined above are satisfied for a given mbuf, and
also reset the mbuf fields to the expected values?
Konstantin
* RE: mbuf fast-free requirements analysis
From: Morten Brørup @ 2025-12-15 16:14 UTC (permalink / raw)
To: Konstantin Ananyev, dev, techboard, bruce.richardson
> From: Konstantin Ananyev [mailto:konstantin.ananyev@huawei.com]
> Sent: Monday, 15 December 2025 15.41
>
> > [full quote of the original analysis snipped]
>
> Great analysis; it makes a lot of sense to me.
> Shall we then add a special API to make PMD maintainers' lives a bit easier?
> Something like rte_mbuf_fast_free_prep(mp, mb), which would optionally check
> that the requirements outlined above are satisfied for a given mbuf, and
> also reset the mbuf fields to the expected values?
Good idea, Konstantin.
Detailed suggestion below.
Note that __rte_mbuf_raw_sanity_check_mp() is used to check the requirements after 'nb_segs' and 'next' have been initialized.
/**
* Reinitialize an mbuf for freeing back into the mempool.
*
* The caller must ensure that the mbuf comes from the specified mempool,
* is direct and only referred to by the caller (refcnt=1).
*
* This function is intended for drivers using mbuf fast release:
* call it in the transmit function when the transmit descriptor is initialized,
* so that the completion path can later call rte_mbuf_raw_free()
* when the packet segment has been transmitted.
*
* @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
*
* @param mp
* The mempool to which the mbuf belongs.
* @param m
* The mbuf being reinitialized.
*/
static __rte_always_inline void
rte_mbuf_raw_prefree_seg(const struct rte_mempool *mp, struct rte_mbuf *m)
{
if (m->nb_segs != 1)
m->nb_segs = 1;
if (m->next != NULL)
m->next = NULL;
__rte_mbuf_raw_sanity_check_mp(m, mp);
rte_mbuf_history_mark(m, RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
}
/**
* Reinitialize a bulk of mbufs for freeing back into the mempool.
*
* The caller must ensure that the mbufs come from the specified mempool,
* are direct and only referred to by the caller (refcnt=1).
*
* This function is intended for drivers using mbuf fast release:
* call it in the transmit function when the transmit descriptors are initialized,
* so that the completion path can later call rte_mbuf_raw_free_bulk()
* when the packet segments have been transmitted.
*
* @see RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE
*
* @param mp
* The mempool to which the mbufs belong.
* @param mbufs
* Array of pointers to mbufs being reinitialized.
* The array must not contain NULL pointers.
* @param count
* Array size.
*/
static __rte_always_inline void
rte_mbuf_raw_prefree_seg_bulk(const struct rte_mempool *mp, struct rte_mbuf **mbufs, unsigned int count)
{
for (unsigned int idx = 0; idx < count; idx++) {
struct rte_mbuf *m = mbufs[idx];
if (m->nb_segs != 1)
m->nb_segs = 1;
if (m->next != NULL)
m->next = NULL;
__rte_mbuf_raw_sanity_check_mp(m, mp);
}
rte_mbuf_history_mark_bulk(mbufs, count, RTE_MBUF_HISTORY_OP_LIB_PREFREE_RAW);
}
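For context, roughly how a driver could use this pair - a fragment, not
complete driver code; 'txq', its 'mp' and 'sw_ring' fields, 'slot' and
'seg' are placeholders:

	/* Tx path, per segment, right after filling the Tx descriptor: */
	rte_mbuf_raw_prefree_seg(txq->mp, seg);
	txq->sw_ring[slot] = seg;

	/* Completion path: the mbuf is already reinitialized, so it goes
	 * straight back to its pool: */
	rte_mbuf_raw_free(txq->sw_ring[slot]);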
> Konstantin