From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
To: "Morten Brørup" <mb@smartsharesystems.com>,
"Stephen Hemminger" <stephen@networkplumber.org>,
dev@dpdk.org,
"Konstantin Ananyev" <konstantin.ananyev@huawei.com>
Cc: bruce.richardson@intel.com, Thomas Monjalon <thomas@monjalon.net>,
John McNamara <john.mcnamara@intel.com>
Subject: Re: [RFC] ethdev: fix UAF in rte_eth_tx_burst with filtering
Date: Mon, 10 Nov 2025 12:02:15 +0300 [thread overview]
Message-ID: <1fd7889d-9187-4bf6-b7c0-863783bfc7ed@oktetlabs.ru> (raw)
In-Reply-To: <98CBD80474FA8B44BF855DF32C47DC35F65544@smartserver.smartshare.dk>
On 11/10/25 11:34, Morten Brørup wrote:
> +TO: Konstantin, regarding Stage Ordered Ring.
>
>> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
>> Sent: Sunday, 9 November 2025 16.23
>>
>> If I remember correctly we should avoid function names in summary.
>>
>> (I'm really surprised that callbacks and rte_eth_tx_burst() are
>> supposed to reorder mbufs in tx_pkts array, but it looks like it
>> the only sensible option.)
>
> Reordering the packets is incompatible with the Stage Ordered Ring.
> When the transmit function returns e.g. 2 non-consumed packets, the metadata will be associated with the 2 last packets in the original burst passed to the transmit function, not the 2 packets moved to the last position in the array.
Good point.
>
>>
>> On 10/30/25 02:54, Stephen Hemminger wrote:
>>> When packets are filtered by Tx callback, they return value
>>> needs to include those packets; otherwise the application will
>>> do a duplicate free of those mbufs.
>>>
>>> The example would be application that sent a burst of 8 packets,
>>> and 2 were filtered but the PMD was busy and could not send any.
>>> The result should be a return value of 2 and mbufs 0..1 should
>>> be freed and 2..7 should be untouched and up to application to
>>> resend or drop.
>>>
>>> Original buggy code would return 0 and 0..1 would be freed
>>> 2..7 would be unused. Application would then try and drop
>>> or resend an already freed mbuf.
>>>
>>> Bugzilla ID: 1816
>>> Fixes: 4dc294158cac ("ethdev: support optional Rx and Tx callbacks")
>>> Cc: bruce.richardson@intel.com
>>> Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
>>> ---
>>> lib/ethdev/rte_ethdev.h | 47 ++++++++++++++++++++++++++++++--------
>> ---
>>> 1 file changed, 35 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>>> index a66c2abbdb..d2cfe390f9 100644
>>> --- a/lib/ethdev/rte_ethdev.h
>>> +++ b/lib/ethdev/rte_ethdev.h
>>> @@ -6714,6 +6714,10 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
>> queue_id,
>>> }
>>> #endif
>>>
>>> + uint16_t requested_pkts = nb_pkts;
>>> + uint16_t filtered_pkts = 0;
>>> + rte_mbuf_history_mark_bulk(tx_pkts, nb_pkts,
>> RTE_MBUF_HISTORY_OP_TX);
>>> +
>>> #ifdef RTE_ETHDEV_RXTX_CALLBACKS
>>> {
>>> void *cb;
>>> @@ -6727,22 +6731,41 @@ rte_eth_tx_burst(uint16_t port_id, uint16_t
>> queue_id,
>>> cb = rte_atomic_load_explicit(&p->txq.clbk[queue_id],
>>> rte_memory_order_relaxed);
>>> if (unlikely(cb != NULL))
>>> - nb_pkts = rte_eth_call_tx_callbacks(port_id,
>> queue_id,
>>> - tx_pkts, nb_pkts, cb);
>>> + filtered_pkts = nb_pkts -
>>> + rte_eth_call_tx_callbacks(port_id, queue_id,
>> tx_pkts, nb_pkts, cb);
>>> }
>>> #endif
>>>
>>> - uint16_t requested_pkts = nb_pkts;
>>> - rte_mbuf_history_mark_bulk(tx_pkts, nb_pkts,
>> RTE_MBUF_HISTORY_OP_TX);
>>> -
>>> - nb_pkts = p->tx_pkt_burst(qd, tx_pkts, nb_pkts);
>>> -
>>> - if (requested_pkts > nb_pkts)
>>> - rte_mbuf_history_mark_bulk(tx_pkts + nb_pkts,
>>> - requested_pkts - nb_pkts,
>> RTE_MBUF_HISTORY_OP_TX_BUSY);
>>> + uint16_t sent_pkts = p->tx_pkt_burst(qd, tx_pkts, nb_pkts -
>> filtered_pkts);
>>> + uint16_t result_pkts = sent_pkts + filtered_pkts;
>>> +
>>> + if (unlikely(result_pkts < requested_pkts)) {
>>
>> First of all unlikely() is wrong here. It is not an error path. It is
>> OK
>> when not everything is sent and it should not cost too much as unlikely
>> does.
>> Second, I'd remove the condition completely. It is not a big deal to
>> calculate unsent_pkts and call trace (if enabled) with zero
>> unsent_pkts.
>>
>>> + uint16_t unsent_pkts = requested_pkts - result_pkts;
>>> +
>>> + if (unlikely(filtered_pkts > 0)) {
>>
>> Same here for unlikely().
>> Also since filtered_pkts appear in RTE_ETHDEV_RXTX_CALLBACKS case,
>> I'd put it under corresponding conditional compilation to be
>> consistent.
>> Yes, code with conditional compilation is hard to read, but consistency
>> is more important here.
>>
>>> + /* Need to reshuffle packet list so that unsent
>> packets are
>>> + * after the return value
>>> + *
>>> + * Original: MMMMMM
>>> + * Filter: MMMMMF filtered = 1
>>> + * Tx_burst: SSSUUF sent = 3 unsent = 2
>>> + *
>>> + * Return: SSSFUU
>>> + * return from tx_burst is 4
>>> + */
>>> + memmove(tx_pkts + result_pkts, tx_pkts + sent_pkts,
>>> + unsent_pkts * sizeof(struct rte_mbuf *));
>>> +
>>> + /* Tx filter process does not drop the packets until
>> here */
>>> + rte_pktmbuf_free_bulk(tx_pkts + sent_pkts,
>> filtered_pkts);
>
> NAK.
> If the callback consumes the packet, the callback must free it too. The callback could consume the packet for other purposes, e.g. redirection, so freeing it here would introduce use-after-free bugs.
Good point as well. However, it still requires reordering which has
problem mentioned above.
>
>>> + }
>>> +
>>> + rte_mbuf_history_mark_bulk(tx_pkts + result_pkts,
>> unsent_pkts,
>>> + RTE_MBUF_HISTORY_OP_TX_BUSY);
>>> + }
>>>
>>> - rte_ethdev_trace_tx_burst(port_id, queue_id, (void **)tx_pkts,
>> nb_pkts);
>>> - return nb_pkts;
>>> + rte_ethdev_trace_tx_burst(port_id, queue_id, (void **)tx_pkts,
>> result_pkts);
>>> + return result_pkts;
>>> }
>>>
>>> /**
>
> IMO, letting a Tx callback change the number of packets to actually transmit has too many side effects.
> E.g. port packet counters are not updated accordingly if a callback consumes a packet and silently drops it.
>
> The Ethdev Tx API is not designed for the ethdev layer to handle any of retransmit-or-drop (when insufficient descriptors), redirect, mirror, drop (redirect/mirror/drop of all packets or selectively based on e.g. VLAN or BPF filter).
>
> IMO, we should update the callback API to reflect this, by changing the callback return type to void.
>
prev parent reply other threads:[~2025-11-10 9:02 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-29 23:54 Stephen Hemminger
2025-11-09 15:22 ` Andrew Rybchenko
2025-11-10 8:34 ` Morten Brørup
2025-11-10 9:02 ` Andrew Rybchenko [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1fd7889d-9187-4bf6-b7c0-863783bfc7ed@oktetlabs.ru \
--to=andrew.rybchenko@oktetlabs.ru \
--cc=bruce.richardson@intel.com \
--cc=dev@dpdk.org \
--cc=john.mcnamara@intel.com \
--cc=konstantin.ananyev@huawei.com \
--cc=mb@smartsharesystems.com \
--cc=stephen@networkplumber.org \
--cc=thomas@monjalon.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).