DPDK patches and discussions
From: Bruce Richardson <bruce.richardson@intel.com>
To: Thomas Monjalon <thomas@monjalon.net>
Cc: "Morten Brørup" <mb@smartsharesystems.com>,
	"Andrew Rybchenko" <andrew.rybchenko@oktetlabs.ru>,
	"Konstantin Ananyev" <konstantin.ananyev@huawei.com>,
	"Ajit Khaparde" <ajit.khaparde@broadcom.com>,
	"Somnath Kotur" <somnath.kotur@broadcom.com>,
	"Nithin Dabilpuram" <ndabilpuram@marvell.com>,
	"Kiran Kumar K" <kirankumark@marvell.com>,
	"Sunil Kumar Kori" <skori@marvell.com>,
	"Satha Rao" <skoteshwar@marvell.com>,
	"Harman Kalra" <hkalra@marvell.com>,
	"Hemant Agrawal" <hemant.agrawal@nxp.com>,
	"Sachin Saxena" <sachin.saxena@oss.nxp.com>,
	"Shai Brandes" <shaibran@amazon.com>,
	"Evgeny Schemeilin" <evgenys@amazon.com>,
	"Ron Beider" <rbeider@amazon.com>,
	"Amit Bernstein" <amitbern@amazon.com>,
	"Wajeeh Atrash" <atrwajee@amazon.com>,
	"Gaetan Rivet" <grive@u256.net>,
	yangxingui <yangxingui@h-partners.com>,
	Fengchengwen <fengchengwen@huawei.com>,
	"Praveen Shetty" <praveen.shetty@intel.com>,
	"Vladimir Medvedkin" <vladimir.medvedkin@intel.com>,
	"Anatoly Burakov" <anatoly.burakov@intel.com>,
	"Jingjing Wu" <jingjing.wu@intel.com>,
	"Rosen Xu" <rosen.xu@altera.com>,
	"Andrew Boyer" <andrew.boyer@amd.com>,
	"Dariusz Sosnowski" <dsosnowski@nvidia.com>,
	"Viacheslav Ovsiienko" <viacheslavo@nvidia.com>,
	"Bing Zhao" <bingz@nvidia.com>, "Ori Kam" <orika@nvidia.com>,
	"Suanming Mou" <suanmingm@nvidia.com>,
	"Matan Azrad" <matan@nvidia.com>,
	"Wenbo Cao" <caowenbo@mucse.com>,
	"Jerin Jacob" <jerinj@marvell.com>,
	"Maciej Czekaj" <mczekaj@marvell.com>,
	dev@dpdk.org, techboard@dpdk.org,
	"Ivan Malov" <ivan.malov@arknetworks.am>
Subject: Re: Fixing MBUF_FAST_FREE TX offload requirements?
Date: Wed, 29 Oct 2025 15:45:43 +0000
Message-ID: <aQI2p-ODpm8oxTMy@bricha3-mobl1.ger.corp.intel.com>
In-Reply-To: <5090512.QZNE9M9tJY@thomas>

On Wed, Oct 29, 2025 at 03:57:40PM +0100, Thomas Monjalon wrote:
> 29/10/2025 13:23, Morten Brørup:
> > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > On Wed, Oct 29, 2025 at 12:16:37PM +0300, Andrew Rybchenko wrote:
> > > > On 9/18/25 5:12 PM, Konstantin Ananyev wrote:
> > > > > > > From: Bruce Richardson [mailto:bruce.richardson@intel.com]
> > > > > > > On Thu, Sep 18, 2025 at 10:50:11AM +0200, Morten Brørup wrote:
> > > > > > > > Dear NIC driver maintainers (CC: DPDK Tech Board),
> > > > > > > >
> > > > > > > > The DPDK Tech Board has discussed that patch [1] (included
> > > > > > > > in DPDK 25.07) extended the documented requirements to the
> > > > > > > > RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE offload.
> > > > > > > > These changes put additional limitations on applications'
> > > > > > > > use of the MBUF_FAST_FREE TX offload, and made
> > > > > > > > MBUF_FAST_FREE mutually exclusive with MULTI_SEGS (which is
> > > > > > > > typically used for jumbo frame support).
> > > > > > > > The Tech Board discussed that these changes do not reflect
> > > > > > > > the intention of the MBUF_FAST_FREE TX offload, and wants
> > > > > > > > to fix it.
> > > > > > > > Mainly, MBUF_FAST_FREE and MULTI_SEGS should not be
> > > > > > > > mutually exclusive.
> > > > > > > >
> > > > > > > > The original RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE requirements
> > > > > > > > were:
> > > > > > > > When set, application must guarantee that
> > > > > > > > 1) per-queue all mbufs come from the same mempool, and
> > > > > > > > 2) mbufs have refcnt = 1.
> > > > > > > >
> > > > > > > > The patch added the following requirements to the
> > > > > > > > MBUF_FAST_FREE offload, reflecting
> > > > > > > > rte_pktmbuf_prefree_seg() postconditions:
> > > > > > > > 3) mbufs are direct,
> > > > > > > > 4) mbufs have next = NULL and nb_segs = 1.
> > > > > > > >
> > > > > > > > Now, the key question is:
> > > > > > > > Can we roll back to the original two requirements?
> > > > > > > > Or do the drivers also depend on the third and/or fourth
> > > > > > > > requirements?
> > > > > > > >
> > > > > > > > <advertisement>
> > > > > > > > Drivers freeing mbufs directly to a mempool should use the
> > > > > > > > new rte_mbuf_raw_free_bulk() instead of
> > > > > > > > rte_mempool_put_bulk(), so the preconditions for freeing
> > > > > > > > mbufs directly into a mempool are validated in mbuf debug
> > > > > > > > mode (with RTE_LIBRTE_MBUF_DEBUG enabled).
> > > > > > > > Similarly, rte_mbuf_raw_alloc_bulk() should be used instead
> > > > > > > > of rte_mempool_get_bulk().
> > > > > > > > </advertisement>
> > > > > > > >
> > > > > > > > PS: The feature documentation [2] still reflects the
> > > > > > > > original requirements.
> > > > > > > >
> > > > > > > > [1]: https://github.com/DPDK/dpdk/commit/55624173bacb2becaa67793b71391884876673c1
> > > > > > > > [2]: https://elixir.bootlin.com/dpdk/v25.07/source/doc/guides/nics/features.rst#L125
> > > > > > > >
> > > > > > > >
> > > > > > > > Venlig hilsen / Kind regards,
> > > > > > > > -Morten Brørup
> > > > > > > >
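To make the four requirements quoted above concrete, here is a minimal sketch of the kind of per-mbuf precondition check a debug-mode raw-free helper could perform. The struct layouts and the helper name are stand-ins for illustration, not the real rte_mbuf/rte_mempool definitions:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for struct rte_mempool / struct rte_mbuf (illustration only). */
struct mempool { int id; };
struct mbuf {
    struct mempool *pool;
    struct mbuf *next;
    void *shinfo;        /* non-NULL here stands for indirect/extbuf mbufs */
    uint16_t refcnt;
    uint16_t nb_segs;
};

/* Hypothetical debug check mirroring the four FAST_FREE requirements. */
static int fast_free_preconditions_ok(const struct mbuf *m,
                                      const struct mempool *mp)
{
    return m->pool == mp &&      /* 1) mbuf from the per-queue mempool */
           m->refcnt == 1 &&     /* 2) refcnt = 1 */
           m->shinfo == NULL &&  /* 3) mbuf is direct */
           m->next == NULL &&    /* 4) next = NULL ... */
           m->nb_segs == 1;      /*    ... and nb_segs = 1 */
}
```

A driver's fast-free path would call this only under a debug build flag, keeping the release-build free path unconditional.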
> > > > > > > I'm a little torn on this question, because I can see
> > > > > > > benefits for both approaches. Firstly, it would be nice if we
> > > > > > > made FAST_FREE as accessible for driver use as it was
> > > > > > > originally, with minimal requirements. However, on looking at
> > > > > > > the code, I believe that many drivers actually took it to
> > > > > > > mean that scattered packets couldn't occur in that case
> > > > > > > either, so the use was incorrect.
> > > > > >
> > > > > > I primarily look at Intel drivers, and that's how I read the
> > > > > > driver code too.
> > > > > >
> > > > > > > Similarly, and secondly, if we do have the extra
> > > > > > > requirements for FAST_FREE, it does mean that any use of it
> > > > > > > can be very, very minimal and efficient, since we don't need
> > > > > > > to check anything before freeing the buffers.
> > > > > > >
> > > > > > > Given where we are now, I think keeping the more restrictive
> > > > > > > definition of FAST_FREE is the way to go - keeping it
> > > > > > > exclusive with MULTI_SEGS - because it means that we are
> > > > > > > less likely to have bugs. If we look to change it back, I
> > > > > > > think we'd have to check all drivers to ensure they are
> > > > > > > using the flag safely.
> > > > > >
> > > > > > However, those driver bugs are not new.
> > > > > > If we haven't received bug reports from users affected by
> > > > > > them, maybe we can disregard them (in this discussion about
> > > > > > pros and cons).
> > > > > > I prefer we register them as driver bugs, instead of changing
> > > > > > the API to accommodate bugs in the drivers.
> > > > > >
> > > > > > From an application perspective, here's an idea for
> > > > > > consideration:
> > > > > > Assuming that indirect mbufs are uncommon, we keep
> > > > > > requirement #3.
> > > > > > To allow MULTI_SEGS (jumbo frames) with FAST_FREE, we get rid
> > > > > > of requirement #4.
> > > > >
> > > > > Do we really need to enable FAST_FREE for jumbo-frames?
> > > > > Jumbo-frames usually means much smaller PPS number and actual
> > > > > RX/TX overhead becomes really tiny.
> > > >
> > > > +1
> > > > > > Since the driver knows that refcnt == 1, the driver can set
> > > > > > next = NULL and nb_segs = 1 at any time, either when writing
> > > > > > the TX descriptor (when it reads the mbuf anyway), or when
> > > > > > freeing the mbuf.
> > > > > > Regarding performance, this means that the driver's TX code
> > > > > > path has to write to the mbufs (i.e. adding the performance
> > > > > > cost of memory store operations) when segmented - but that is
> > > > > > a universal requirement when freeing segmented mbufs to the
> > > > > > mempool.
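The idea above can be sketched as follows. The struct and the helper name are illustrative stand-ins (not driver or rte_mbuf code): with refcnt known to be 1, the driver can restore the single-segment invariants in the same pass that walks the chain to free each segment:

```c
#include <stddef.h>
#include <stdint.h>

struct mbuf {                 /* stand-in for struct rte_mbuf */
    struct mbuf *next;
    uint16_t nb_segs;
    uint16_t refcnt;
};

/* Hypothetical free-time fixup: reset next/nb_segs on every segment just
 * before returning it to the pool, walking the chain exactly once.
 * Returns the number of segments visited. */
static unsigned int reset_chain_for_free(struct mbuf *head)
{
    unsigned int n = 0;
    struct mbuf *m = head;
    while (m != NULL) {
        struct mbuf *next = m->next;
        m->next = NULL;       /* requirement #4: next = NULL ... */
        m->nb_segs = 1;       /* ... and nb_segs = 1 */
        n++;
        m = next;
    }
    return n;
}
```

The memory stores happen only for chained mbufs, which is the universal cost of freeing segmented mbufs mentioned above.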
> > > > >
> > > > > It might work, but I think it will become way too complicated.
> > > > > Again, I don't know who is going to inspect/fix all the drivers.
> > > > > Just not allowing FAST_FREE for multi-seg seems like a much
> > > > > simpler and safer approach.
> > > > > > For even more optimized driver performance, as Bruce
> > > > > > mentions...
> > > > > > If a port is configured for FAST_FREE and not MULTI_SEGS, the
> > > > > > driver can use a super lean transmit function.
> > > > > > Since the driver's transmit function pointer is per port (not
> > > > > > per queue), this would require the driver to provide the
> > > > > > MULTI_SEGS capability only per port, and not per queue. (Or we
> > > > > > would have to add a NOT_MULTI_SEGS offload flag, to ensure
> > > > > > that no queue is configured for MULTI_SEGS.)
> > > >
> > > >
> > > > FAST_FREE is not a real Tx offload, since there is no promise
> > > > from the driver to do something (like other Tx offloads, e.g.
> > > > checksumming or TSO). Is it a promise to ignore the refcount, or
> > > > to look at the memory pool of some packets only? I guess not. If
> > > > so, basically any driver may advertise it and simply ignore it if
> > > > the offload is requested, but the driver can do nothing with
> > > > these limitations on input data.
> > > >
> > > > It is in fact a performance hint, and a promise from the
> > > > application to follow the specified limitations on Tx mbufs.
> > > >
> > > > So, if an application specifies both FAST_FREE and MULTI_SEG, but
> > > > the driver code can't FAST_FREE with MULTI_SEG, it should just
> > > > ignore FAST_FREE. That's it. The performance hint is simply
> > > > useless in this case.
> > > > There is no point in making FAST_FREE and MULTI_SEG mutually
> > > > exclusive.
> > > > If some drivers can really support both - great. If not, just
> > > > ignore FAST_FREE and support MULTI_SEG.
> > > >
> > > > "mbufs are direct" must be a FAST_FREE requirement, since
> > > > otherwise freeing is not simple. I guess it was simply lost in
> > > > the original definition of FAST_FREE.
> > 
> > Agree about the "mbufs are direct" statement being lost in the original definition.
> > It can be extended to include mbufs using "pinned external buffer with refcnt==1", because freeing those is just as simple as freeing "direct" mbufs.
> > 
> > > >
> > > That's a good point and explanation of things. Perhaps we'd be
> > > better to deprecate FAST_FREE and replace it with a couple of
> > > explicit hints that better explain what they are?
> > > 
> > > - RTE_ETH_TX_HINT_DIRECT_MBUFS
> > 
> > In the FAST_FREE case, this hint would be TX_HINT_MBUF_DIRECT_OR_SINGLE_OWNER_PINNED_EXTBUF.
> > 
> > > - RTE_ETH_TX_HINT_SINGLE_MEMPOOL
> > 
> > Prefer TX_HINT_SINGLE_MEMPOOL -> TX_HINT_SAME_MEMPOOL, so we can add a globally scoped TX_HINT_SINGLE_MEMPOOL later.
> > 
> > Also, RTE_ETH_TX_HINT_NON_SEGMENTED can be added later.
> > 
> > I strongly agree with the finer granularity for the hints; the optimization of freeing to one mempool instead of a variety of mempools is orthogonal to the optimization of not having to consider indirect mbufs.
> > And the drivers are free to only optimize if multiple hints are present; so there is no downside to using a finer granularity for hints.
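To make the proposed granularity concrete, the split might look something like this. The names and bit values are hypothetical, assembled from the names floated in this thread, and are not an existing DPDK API:

```c
#include <stdint.h>

/* Hypothetical fine-grained Tx hint bits (illustration, not DPDK API).
 * Each bit is an independent promise from the application; a driver may
 * optimize on any subset it understands and ignore the rest. */
#define TX_HINT_MBUF_DIRECT_OR_SINGLE_OWNER_PINNED_EXTBUF (UINT64_C(1) << 0)
#define TX_HINT_SAME_MEMPOOL                              (UINT64_C(1) << 1)
#define TX_HINT_NON_SEGMENTED                             (UINT64_C(1) << 2)

/* The legacy FAST_FREE semantics would then be the conjunction of the
 * first two promises, with NON_SEGMENTED available separately later. */
#define TX_HINT_LEGACY_FAST_FREE \
    (TX_HINT_MBUF_DIRECT_OR_SINGLE_OWNER_PINNED_EXTBUF | \
     TX_HINT_SAME_MEMPOOL)
```

Because each bit stands alone, a driver that can only exploit SAME_MEMPOOL still gains something even when the application cannot promise single-segment mbufs.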
> 
> Yes we can have finer granularity.
> 
> 
> > Although we are reusing "offload" fields for hints, there's no need for drivers to announce capability for such hints, including FAST_FREE; since the drivers can freely ignore any hints, hint capability doesn't contain any information about the driver's ability to do anything useful with the hints.
> 
> Capability does not need to be announced,
> but it would be useful to have debug logs when an optimization is enabled.
> I'm not sure how we can enforce such logs in drivers.
> 
> 
> > Regarding naming, we should use "promise" instead of "hint",
> > to emphasize that the "hint" is not allowed to be violated.
> 
> I'm not sure why, but I'm not comfortable with the word "promise".
> To me, a "hint" is already something strong.
> 

Agree. Also, promise is too long a word. Hint is shorter.
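To illustrate the performance argument running through this thread, here is a sketch of the two Tx-completion cleanup paths, using stand-in types and functions rather than actual driver code: the FAST_FREE path degenerates to one unconditional bulk put, while the generic path must inspect every mbuf (a simplified, non-atomic version of what rte_pktmbuf_prefree_seg() forces per segment):

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 64

/* Stand-ins for illustration, not rte_mempool/rte_mbuf. */
struct mempool { void *objs[RING_SIZE]; unsigned int n; };
struct mbuf { struct mempool *pool; struct mbuf *next;
              uint16_t refcnt; uint16_t nb_segs; };

/* Stand-in for rte_mempool_put_bulk(): one bulk operation, no checks. */
static void mempool_put_bulk(struct mempool *mp, struct mbuf **mbufs,
                             unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        mp->objs[mp->n++] = mbufs[i];
}

/* FAST_FREE cleanup: the application promised same pool, refcnt == 1,
 * direct, single-segment - so everything goes back in one bulk put. */
static void tx_free_fast(struct mempool *mp, struct mbuf **done,
                         unsigned int n)
{
    mempool_put_bulk(mp, done, n);
}

/* Generic cleanup: each mbuf checked individually (refcnt handling is
 * non-atomic here purely to keep the sketch short). Returns the number
 * of mbufs actually returned to their pools. */
static unsigned int tx_free_generic(struct mbuf **done, unsigned int n)
{
    unsigned int freed = 0;
    for (unsigned int i = 0; i < n; i++) {
        struct mbuf *m = done[i];
        if (m->refcnt == 1) {        /* sole owner: safe to free */
            m->next = NULL;
            m->nb_segs = 1;
            m->pool->objs[m->pool->n++] = m;
            freed++;
        } else {
            m->refcnt--;             /* shared: just drop our reference */
        }
    }
    return freed;
}
```

The per-mbuf branch, the possible refcnt update, and the per-pool dispatch are exactly the costs the FAST_FREE promise lets a driver skip.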


Thread overview: 11+ messages
2025-09-18  8:50 Morten Brørup
2025-09-18  9:09 ` Bruce Richardson
2025-09-18 10:00   ` Morten Brørup
2025-09-18 14:12     ` Konstantin Ananyev
2025-10-29  9:16       ` Andrew Rybchenko
2025-10-29  9:23         ` Bruce Richardson
2025-10-29 12:23           ` Morten Brørup
2025-10-29 14:57             ` Thomas Monjalon
2025-10-29 15:45               ` Bruce Richardson [this message]
2025-09-18 15:13 ` Stephen Hemminger
2025-10-28 17:44   ` Nithin Dabilpuram
