* mbuf refcnt issue
@ 2025-04-04 22:00 Lombardo, Ed
2025-04-04 22:29 ` Dmitry Kozlyuk
0 siblings, 1 reply; 3+ messages in thread
From: Lombardo, Ed @ 2025-04-04 22:00 UTC (permalink / raw)
To: users
Hi,
I have an application that receives packets and transmits them. The packet data is inspected, and the mbuf is later freed to the mempool.
The pipeline works as follows. The Rx packet mbuf is saved to an Rx worker ring. The application threads then process the packets and decide whether to transmit each one; if so, they increment the mbuf refcnt to 2. The batch of mbufs to transmit is put in a Tx ring, from which the Tx thread pulls them and calls rte_eth_tx_burst() with the batch (limited to 400 mbufs). In theory, the transmit operation will decrement the mbuf refcnt. In our application, the Tx of an mbuf may be followed by another application thread freeing that mbuf, or vice versa. We have no way to synchronize these threads.
Are mbuf refcnt updates thread-safe enough to allow non-deterministic handling of mbufs across multiple threads? The decision to transmit the mbuf, the refcnt increment, and the enqueue to the Tx ring are all completed before the application says it is finished and frees the mbufs.
In my error-checking code I am seeing mbuf refcnt values such as 65520, 65529, 65530, 65534, and 65535 in the refcnt checks of the early pipeline stages.
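For reference, the mbuf refcnt is a 16-bit unsigned field, so values just below 65536 are what such a counter shows after it has been decremented more times than it was incremented. The sketch below only illustrates that arithmetic; it does not prove that underflow is the cause here (memory corruption or reuse of an already freed mbuf could look similar). The names model_drop_refs(), model_excess_decrements() and MODEL_MAX_EXPECTED_REFS are hypothetical and are not DPDK APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: this pipeline never holds more than a few references
 * per mbuf, so anything above this is treated as suspicious. */
#define MODEL_MAX_EXPECTED_REFS 4u

/* Apply n decrements to a 16-bit counter, wrapping like uint16_t does. */
static uint16_t model_drop_refs(uint16_t refcnt, unsigned n)
{
    return (uint16_t)(refcnt - n);
}

/* For a suspicious reading, return how many decrements below zero
 * would produce it; return 0 if the value is a plausible count. */
static unsigned model_excess_decrements(uint16_t refcnt)
{
    if (refcnt <= MODEL_MAX_EXPECTED_REFS)
        return 0;
    return 65536u - refcnt;
}
```

Under this reading, 65535 corresponds to one decrement too many and 65520 to sixteen, which may help when matching debug output against the number of free calls in each stage.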
I read online and in the DPDK code that the mbuf refcnt update is atomic and thread-safe, so this is good.
What is unclear to me is this: when rte_eth_tx_burst() returns the number of packets transmitted, does that mean the transmission is complete and the mbuf refcnt has been decremented by 1 on return? Or has the Tx queue merely been populated, with the refcnt not decremented until the packet is actually transmitted, or, much worse, some time later?
Is the DPDK Tx operation intended to be the last stage of any pipeline, freeing the mbuf once it is successfully transmitted?
Any help resolving this issue is appreciated.
Thanks,
Ed
* Re: mbuf refcnt issue
2025-04-04 22:00 mbuf refcnt issue Lombardo, Ed
@ 2025-04-04 22:29 ` Dmitry Kozlyuk
2025-04-07 19:53 ` Lombardo, Ed
0 siblings, 1 reply; 3+ messages in thread
From: Dmitry Kozlyuk @ 2025-04-04 22:29 UTC (permalink / raw)
To: Lombardo, Ed, users
Hi Ed,
On 05.04.2025 01:00, Lombardo, Ed wrote:
>
> Hi,
>
> I have an application that receives packets and transmits them. The
> packet data is inspected, and the mbuf is later freed to the mempool.
>
> The pipeline works as follows. The Rx packet mbuf is saved to an Rx
> worker ring. The application threads then process the packets and
> decide whether to transmit each one; if so, they increment the mbuf
> refcnt to 2.
>
Do I understand the pipeline correctly?
Rx thread:
    receive mbuf
    put mbuf into the ring
    inspect mbuf
    free mbuf

Worker thread:
    take mbuf from the ring
    if decided to transmit it,
        increment refcnt
        transmit mbuf

If so, there's a problem: after the Rx thread puts the mbuf into the ring,
the mbuf is owned by both the Rx thread and the ring, so its refcnt must
be 2 when it enters the ring:

Rx thread:
    receive mbuf
    increment refcnt
    put mbuf into the ring
    inspect mbuf
    free mbuf (just decrements refcnt if > 1)

Worker thread:
    take mbuf from the ring
    if decided to transmit it,
        transmit (or put into the bulk transmitted later)
    else
        free mbuf (just decrements refcnt if > 1)

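The ownership rule in the corrected pipeline can be sketched in a few lines of C. This is a simplified model built on C11 atomics, not DPDK code: struct model_mbuf, model_rx(), model_ref() and model_free() are hypothetical stand-ins for the real mbuf, its allocation on receive, rte_mbuf_refcnt_update(m, 1) and rte_pktmbuf_free(). The point it illustrates is that once the second reference is taken before the mbuf enters the ring, the two holders can release in either order and the buffer goes back to the pool exactly once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the counter matters. */
struct model_mbuf {
    _Atomic uint16_t refcnt; /* 16-bit, like the real mbuf refcnt */
    bool in_pool;            /* true once returned to the modelled mempool */
};

/* Models receive: the mbuf leaves the pool with one reference. */
static void model_rx(struct model_mbuf *m)
{
    atomic_store(&m->refcnt, 1);
    m->in_pool = false;
}

/* Models taking an extra reference before sharing the mbuf. */
static void model_ref(struct model_mbuf *m)
{
    atomic_fetch_add(&m->refcnt, 1);
}

/* Models a free: drop one reference; only the last holder returns
 * the buffer to the pool. Returns true if this call released it. */
static bool model_free(struct model_mbuf *m)
{
    uint16_t before = atomic_fetch_sub(&m->refcnt, 1);

    assert(before != 0); /* more frees than references is a bug */
    if (before == 1) {
        m->in_pool = true;
        return true;
    }
    return false;
}

/* Rx takes the second reference before the enqueue, then both holders
 * release. Returns how many times the buffer reached the pool. */
static int model_shared_lifetime(void)
{
    struct model_mbuf m;
    int released = 0;

    model_rx(&m);
    model_ref(&m);                /* refcnt == 2 when entering the ring */
    released += model_free(&m);   /* first holder (Rx or worker) */
    assert(!m.in_pool);           /* still owned by the other holder */
    released += model_free(&m);   /* second holder */
    return released;
}
```

If model_ref() is moved after the enqueue, as in the first pipeline, the other holder can run model_free() while the count is still 1, and the model's assertion on the following free fires.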
> The batch of mbufs to transmit is put in a Tx ring, from which the Tx
> thread pulls them and calls rte_eth_tx_burst() with the batch (limited
> to 400 mbufs). In theory, the transmit operation will decrement the
> mbuf refcnt. In our application, the Tx of an mbuf may be followed by
> another application thread freeing that mbuf, or vice versa. We have
> no way to synchronize these threads.
>
> Are mbuf refcnt updates thread-safe enough to allow non-deterministic
> handling of mbufs across multiple threads? The decision to transmit
> the mbuf, the refcnt increment, and the enqueue to the Tx ring are all
> completed before the application says it is finished and frees the
> mbufs.
>
Have you validated this assumption?
If my understanding above is correct, there's no synchronization and
thus no guarantees.
>
> In my error-checking code I am seeing mbuf refcnt values such as
> 65520, 65529, 65530, 65534, and 65535 in the refcnt checks of the
> early pipeline stages.
>
> I read online and in the DPDK code that the mbuf refcnt update is
> atomic and thread-safe, so this is good.
>
> What is unclear to me is this: when rte_eth_tx_burst() returns the
> number of packets transmitted, does that mean the transmission is
> complete and the mbuf refcnt has been decremented by 1 on return? Or
> has the Tx queue merely been populated, with the refcnt not
> decremented until the packet is actually transmitted, or, much worse,
> some time later?
>
> Is the DPDK Tx operation intended to be the last stage of any
> pipeline, freeing the mbuf once it is successfully transmitted?
>
Return from rte_eth_tx_burst() means that the mbufs are queued for transmission.
Hardware completes transmission asynchronously.
The next call to rte_eth_tx_burst() will poll HW,
learn the status of mbufs *previously* queued,
and call rte_pktmbuf_free() for those that have been transmitted.
The latter will free mbufs to the mempool if and only if refcnt == 1.
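
The timing described above can be made concrete with a small model. This is a sketch, not driver code: struct model_txq, model_tx_burst() and model_free() are hypothetical stand-ins, and the model assumes that everything queued by one call has been sent by the time of the next call, which real hardware does not guarantee. It shows that an mbuf handed to Tx with refcnt 2 still has refcnt 2 when the burst call returns, and that the driver's reference is only dropped on a later call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_TXQ_SIZE 8

/* Hypothetical stand-in for struct rte_mbuf. */
struct model_mbuf {
    uint16_t refcnt;
    int in_pool; /* nonzero once returned to the modelled mempool */
};

/* Hypothetical Tx queue: remembers mbufs queued but not yet reclaimed. */
struct model_txq {
    struct model_mbuf *pending[MODEL_TXQ_SIZE];
    size_t n_pending;
};

/* Models the free: return to the pool only if refcnt == 1,
 * otherwise just drop one reference. */
static void model_free(struct model_mbuf *m)
{
    if (m->refcnt == 1)
        m->in_pool = 1;
    else
        m->refcnt--;
}

/* Models a Tx burst: first reclaim what earlier calls queued
 * (assumed sent by now), then queue the new packets.
 * Returns the number of packets queued by this call. */
static size_t model_tx_burst(struct model_txq *q,
                             struct model_mbuf **pkts, size_t n)
{
    size_t i;

    for (i = 0; i < q->n_pending; i++)
        model_free(q->pending[i]);
    q->n_pending = 0;

    for (i = 0; i < n && q->n_pending < MODEL_TXQ_SIZE; i++)
        q->pending[q->n_pending++] = pkts[i];
    return i;
}
```

In this model the return value counts packets queued, not packets sent, so any refcnt check made right after the call still sees the driver's reference.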
* RE: mbuf refcnt issue
2025-04-04 22:29 ` Dmitry Kozlyuk
@ 2025-04-07 19:53 ` Lombardo, Ed
0 siblings, 0 replies; 3+ messages in thread
From: Lombardo, Ed @ 2025-04-07 19:53 UTC (permalink / raw)
To: Dmitry Kozlyuk, users
Hi Dmitry,
I see sporadic mbufs with an incorrect refcnt, i.e. it should be 1 but I see 2 or 3, or it should be 2 but I see 3. It looks as if the mbuf is returned to the mempool and reused by DPDK receive without its refcnt being reset to 1, as I would expect. I tried to structure the software to eliminate any asynchronous events in which multiple threads call rte_mbuf_refcnt_update() or rte_pktmbuf_free() and could cause odd behavior. That is, I increment the mbuf refcnt when the application first calls to request packets; I then return the list of packets to the application, which processes them and determines which ones to drop (there should be very few drops). In my testing I transmit every packet received, so the flow is much simplified. From this point on, multiple threads touch the mbuf refcnt (through either rte_eth_tx_burst() or rte_pktmbuf_free()).
I am testing against DPDK 24.11.1, built for the generic platform.
The pipeline and its threads are outlined below.
1. Rx thread (two threads: 1, 2)
   Free the mbufs in the Ack ring that the application has processed.
   Receive mbufs from NIC via rte_ring_dequeue_burst().
   Put mbufs into the Worker ring.
2. App requests packets (two threads: 3, 4)
   Process packets from the Worker ring and mark the packets to Tx (via an internal array of pkt info).
   Increment mbuf refcnt by 1 (assuming all packets will be transmitted).
   Add mbufs to the Ack ring for freeing later.
   At this point, no other thread should touch these mbufs.
3. App processes packets and updates the internal array of pkt info (two or more threads, x)
   (Here I simulate the App processing packets; I mark them all for Tx.)
4. App calls to request to send a burst of packets (2 threads, x+2)
   Process packets marked for transmit and put them in rte_mbuf pointer arrays.
   For packets to drop, call rte_pktmbuf_free() (this step is not executed because all packets are marked for transmit).
   When done, add the mbufs to the Tx rings. Two Tx threads process packets from their Tx ring, one per NIC port.
5. App calls an API to free the mbufs in the Ack ring.
* Tx thread (two threads)
   Monitor the Tx rings and transmit with rte_eth_tx_burst().
   If DPDK fails to send any mbufs, call rte_pktmbuf_free() on those that could not be transmitted. This case is avoided for now because the Tx rate is low; I wanted to rule it out.
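
The handling of a partially accepted burst in the Tx thread can be sketched as follows. This is a model under stated assumptions, not the application's code: model_tx_burst() is a hypothetical stand-in for rte_eth_tx_burst() that accepts at most `room` packets, and model_free() stands in for rte_pktmbuf_free(). It illustrates that accepted mbufs now belong to the driver and must not be freed by the caller, while each mbuf that was not accepted still carries the Tx reference, which has to be dropped exactly once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf. */
struct model_mbuf {
    uint16_t refcnt;
    int in_pool; /* nonzero once returned to the modelled mempool */
};

/* Models the free: return to the pool only if refcnt == 1. */
static void model_free(struct model_mbuf *m)
{
    if (m->refcnt == 1)
        m->in_pool = 1;
    else
        m->refcnt--;
}

/* Models a Tx burst with limited descriptor room: takes the first
 * min(n, room) packets and returns how many it took. */
static uint16_t model_tx_burst(struct model_mbuf **pkts,
                               uint16_t n, uint16_t room)
{
    (void)pkts;
    return n < room ? n : room;
}

/* Send a batch; drop the Tx reference of every mbuf not accepted.
 * Returns the number of mbufs dropped. */
static uint16_t model_send_or_drop(struct model_mbuf **pkts,
                                   uint16_t n, uint16_t room)
{
    uint16_t sent = model_tx_burst(pkts, n, room);
    uint16_t i;

    for (i = sent; i < n; i++)
        model_free(pkts[i]);
    return (uint16_t)(n - sent);
}
```

Retrying the unsent tail instead of dropping it is another option; either way, only the range from the returned count up to n is still the caller's to handle.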
The order of pipeline steps #4 and #5 cannot be controlled, but I am counting on the mbuf refcnt to hold correct values so that, in the end, the mbuf is freed to the mempool for DPDK to reuse on receive.
I added debug output and I see that mbuf refcnts are not correct. I don't believe the mempool is first-in, first-out, so mbuf addresses show up in random order, possibly because of caching.
Also worth mentioning: if the DPDK Tx burst fails to transmit all the supplied mbufs, I am planning to call rte_pktmbuf_free() on those not transmitted. That is another thread touching the mbufs.
Is this multi-threaded approach a problem in DPDK?
Thanks,
Ed