* mbuf refcnt issue
@ 2025-04-04 22:00 Lombardo, Ed
  2025-04-04 22:29 ` Dmitry Kozlyuk
  0 siblings, 1 reply; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-04 22:00 UTC
To: users

Hi,

I have an application that receives packets and transmits them. The packet data is inspected, and the mbuf is later freed to the mempool.

The pipeline works as follows: the Rx packet mbuf is saved to an Rx worker ring, then the application threads process the packets and decide whether to transmit each one. If so, the mbuf refcnt is incremented to 2. The batch of mbufs to transmit is put in a Tx ring for the Tx thread to pull from and pass to rte_eth_tx_burst() (limited to 400 mbufs per batch). In theory, the transmit operation will decrement the mbuf refcnt. In our application, the Tx of an mbuf may be followed by another application thread freeing that mbuf, or vice versa. We have no way to synchronize these threads.

Are mbuf refcnt updates thread safe, so that mbufs can be handled in a non-deterministic order by multiple threads? The decision to transmit the mbuf, the refcnt increment, and the enqueue to the Tx ring are all completed before the application says it is finished and frees the mbufs.

In my error-checking code I am seeing mbuf refcnt values such as 65520, 65529, 65530, 65534, and 65535 in the refcnt checks of the early pipeline stages.

I read online and in the DPDK code that the mbuf refcnt update is atomic and thread safe, so this is good.

What is unclear to me is this: when rte_eth_tx_burst() returns the number of packets transmitted, does that mean the transmit has completed and the mbuf refcnt has been decremented by 1 on return? Or is the Tx engine queue merely populated, with the refcnt not decremented until the packet is actually transmitted, or, much worse, at some later time?

Is the DPDK Tx operation intended to be the last stage of any pipeline, freeing the mbuf if it is successfully transmitted?

Any help resolving my issue is appreciated.

Thanks,
Ed
* Re: mbuf refcnt issue
  2025-04-04 22:00 mbuf refcnt issue Lombardo, Ed
@ 2025-04-04 22:29 ` Dmitry Kozlyuk
  2025-04-07 19:53 ` Lombardo, Ed
  0 siblings, 1 reply; 12+ messages in thread
From: Dmitry Kozlyuk @ 2025-04-04 22:29 UTC
To: Lombardo, Ed, users

Hi Ed,

On 05.04.2025 01:00, Lombardo, Ed wrote:
>
> Hi,
>
> I have an application where we receive packets and transmit them. The
> packet data is inspected and later mbuf is freed to mempool.
>
> The pipeline is such that the Rx packet mbuf is saved to rx worker
> ring, then the application threads process the packets and decides if
> to transmit the packet and if true then increments the mbuf to a value
> of 2.
>

Do I understand the pipeline correctly?

Rx thread:
    receive mbuf
    put mbuf into the ring
    inspect mbuf
    free mbuf

Worker thread:
    take mbuf from the ring
    if decided to transmit it,
        increment refcnt
        transmit mbuf

If so, there's a problem: after the Rx thread puts the mbuf into the ring, the mbuf is owned by both the Rx thread and the ring, so its refcnt must be 2 when it enters the ring:

Rx thread:
    receive mbuf
    increment refcnt
    put mbuf into the ring
    inspect mbuf
    free mbuf (just decrements refcnt if > 1)

Worker thread:
    take mbuf from the ring
    if decided to transmit it,
        transmit (or put into the bulk transmitted later)
    else
        free mbuf (just decrements refcnt if > 1)

> The batch of mbufs to transmit are put in a Tx ring queue for the Tx
> thread to pull from and call the DPDK rte_eth_tx_burst() with the
> batch of mbufs (limited to 400 mbufs). In theory the transmit
> operation will decrement the mbuf refcnt. In our application we could
> see the tx of the mbuf followed by another application thread that
> calls to free the mbufs, or vice versa. We have no way to synchronize
> these threads.
>
> Is the mbuf refcnt updates thread safe to allow un-deterministic
> handling of the mbufs among multiple threads? The decision to
> transmit the mbuf and increment the mbuf refcnt and load in the tx
> ring is completed before the application says it is finished and frees
> the mbufs.
>

Have you validated this assumption? If my understanding above is correct, there's no synchronization and thus no guarantees.

>
> I am seeing in my error checking code the mbuf refcnt contains large
> values like 65520, 65529, 65530, 65534, 65535 in the early pipeline
> stage refcnt checks.
>
> I read online and in the DPDK code that the mbuf refcnt update is
> atomic, and is thread safe; so, this is good.
>
> Now this part is unclear to me and that is when the rte_eth_tx_burst()
> is called and returns the number of packets transmitted , does this
> mean that transmit of the packets are completed and mbuf refcnt is
> decremented by 1 on return, or maybe the Tx engine queue is populated
> and mbuf refcnt is not decremented until it is actually transmitted,
> or much worse later in time.
>
> Is the DPDK Tx operation intended to be the last stage of any pipeline
> that will free the mbuf if successfully transmitted?
>

Return from rte_eth_tx_burst() means that the mbufs are queued for transmission. Hardware completes transmission asynchronously. The next call to rte_eth_tx_burst() will poll the HW, learn the status of mbufs *previously* queued, and call rte_pktmbuf_free() for those that have been transmitted. The latter will free mbufs to the mempool if and only if refcnt == 1.
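The deferred free described in the reply above can be illustrated with a small stand-alone C model. This is only a sketch under stated assumptions: it does not use DPDK, the names toy_mbuf, toy_txq, toy_free() and toy_tx_burst() are hypothetical, the refcnt is a plain counter rather than DPDK's atomic implementation, and every previously queued buffer is assumed to have completed by the time of the next burst call.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_TXQ_LEN 8

/* Hypothetical stand-in for an mbuf: a reference counter and a pool flag. */
struct toy_mbuf {
	uint16_t refcnt;
	int in_pool;
};

/* Each owner drops its reference here; the last one returns the buffer. */
static void toy_free(struct toy_mbuf *m)
{
	if (--m->refcnt == 0)
		m->in_pool = 1;
}

/* Hypothetical Tx queue: remembers what the previous burst queued. */
struct toy_txq {
	struct toy_mbuf *sent[TOY_TXQ_LEN];
	size_t n_sent;
};

/* Model of a Tx burst: first release the buffers queued by the previous
 * call (assumed completed by now), then queue the new ones WITHOUT
 * freeing them. Returns the number of buffers accepted. */
static size_t toy_tx_burst(struct toy_txq *q, struct toy_mbuf **pkts, size_t n)
{
	size_t i;

	for (i = 0; i < q->n_sent; i++)
		toy_free(q->sent[i]);
	q->n_sent = 0;

	for (i = 0; i < n && i < TOY_TXQ_LEN; i++)
		q->sent[q->n_sent++] = pkts[i];
	return i;
}
```

In this model a buffer with two owners (the application and the Tx path) still has refcnt 2 right after toy_tx_burst() returns. It goes back to the pool only after both the application has called toy_free() and a later toy_tx_burst() call has released the Tx path's reference, in whichever order those two events happen.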
* RE: mbuf refcnt issue
  2025-04-04 22:29 ` Dmitry Kozlyuk
@ 2025-04-07 19:53 ` Lombardo, Ed
  2025-04-08 22:33 ` Lombardo, Ed
  0 siblings, 1 reply; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-07 19:53 UTC
To: Dmitry Kozlyuk, users

Hi Dmitry,

I see issues where occasional mbufs show an incorrect refcnt, i.e. it should be 1 but I see 2 or 3, or it should be 2 but I see 3. It is as if the mbuf is returned to the mempool and reused by DPDK receive, and the refcnt appears not to be initialized to 1 as I would expect.

I tried to structure the software to eliminate any asynchronous events, caused by multiple threads calling rte_mbuf_refcnt_update() or rte_pktmbuf_free(), that could cause weird behavior. That is, I increment the mbuf refcnt when the application first requests packets on receive; then I return the list of packets to the application, which processes them and determines which ones to drop (there should be very few drops). In my testing I transmit all received packets, so the flow is much simplified. From this point on, multiple threads touch the mbuf refcnt (via either rte_eth_tx_burst() or rte_pktmbuf_free()).

I am testing against DPDK 24.11.1, built for the generic platform.

The pipeline, with its many threads, is outlined below.

1. Rx thread (two threads: 1, 2)
   Free mbufs processed by the application in the Ack ring.
   Receive mbufs from NIC via rte_ring_dequeue_burst().
   Put mbufs into the Worker ring.

2. App requests packets (two threads: 3, 4)
   Process packets from the Worker ring and mark the packets to Tx (via an internal array of pkt info).
   Increment mbuf refcnt by 1 (assume all packets will be transmitted).
   Add mbufs to the Ack ring for freeing later.
   At this point, no other thread should touch these mbufs.

3. App processes packets and updates the internal array of pkt info (two or more threads, x).
   (Here I simulate the App processing packets; I mark them all for Tx.)

4. App requests to send a burst of packets (2 threads, x+2)
   Process packets marked for transmit and put them in rte_mbuf pointer arrays.
   Call rte_pktmbuf_free() on the packets to drop (this step is not executed because all packets are marked for transmit).
   When done, add the mbufs to the Tx rings. Two Tx threads process packets from their Tx ring, one per NIC port.

5. App calls an API to free the mbufs in the Ack ring.

* Tx thread (two threads)
   Monitor the Tx rings and transmit with rte_eth_tx_burst().
   If DPDK fails to send any mbufs, call rte_pktmbuf_free() on those that could not be transmitted. This case does not occur in my test because the Tx rate is low; I wanted to rule it out.

The order of pipeline steps #4 and #5 cannot be controlled, but I am counting on the mbuf refcnt to hold correct values so that in the end the mbuf is freed to the mempool for DPDK to reuse on receive.

I added debug output and I see that the mbuf refcnts are not correct. I don't believe the mempool is first in, first out, so mbuf addresses show up in random order, perhaps because of caching.

Also worth mentioning: if the DPDK Tx burst fails to transmit all the supplied mbufs, I plan to call rte_pktmbuf_free() on those not transmitted. That is another thread touching mbufs.

Is this multi-thread approach a problem in DPDK?

Thanks,
Ed
* RE: mbuf refcnt issue
  2025-04-07 19:53 ` Lombardo, Ed
@ 2025-04-08 22:33 ` Lombardo, Ed
  2025-04-09 3:53 ` Stephen Hemminger
  2025-04-09 16:15 ` Lombardo, Ed
  0 siblings, 2 replies; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-08 22:33 UTC
To: Dmitry Kozlyuk, users

Hi Dmitry,

I added an mbuf refcnt check at the point where we get the packets from rte_eth_rx_burst(), and I am finding many mbufs with refcnt set to 2 when I expect 1. This is more evidence that something has gone wrong, perhaps within DPDK and the mempool management, and I don't know how my setup got into this state. I changed the traffic type from Telnet to Enterprise traffic, and rte_eth_tx_burst() no longer accepts mbufs. However, the mbuf pool is not depleted after I hit the point where Tx stops.

*** DPDK MEMORY STATS **
Reserved mbufs   : 786432
Available mbufs  : 786432
Allocated mbufs  : 0
Is Mempool Full  : 1
Is Mempool Empty : 0

Receiving mbufs with refcnt == 2 from rte_eth_rx_burst() leads me to believe that our application pipeline stage sets the refcnt to 2, but when the Tx thread is blocked, the mempool still allows rte_eth_rx_burst() to reuse these mbufs without touching the refcnt. Or is every mbuf returned by rte_eth_rx_burst() guaranteed to have its refcnt initialized to 1?

I also tried DPDK 23.11.2 and see the same issue.

This morning I found that once the Tx operation stops, I can set a breakpoint just before rte_eth_tx_burst() and change tx_queue_id to a different value; Tx then picks up again and stops shortly afterwards. Once Tx stops on a tx_queue_id, Tx on that queue fails and never recovers.

Any thoughts on my issue?

Thanks,
Ed
* Re: mbuf refcnt issue
  2025-04-08 22:33 ` Lombardo, Ed
@ 2025-04-09 3:53 ` Stephen Hemminger
  2025-04-09 4:46 ` Lombardo, Ed
  2025-04-09 16:15 ` Lombardo, Ed
  1 sibling, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2025-04-09 3:53 UTC
To: Lombardo, Ed; +Cc: Dmitry Kozlyuk, users

On Tue, 8 Apr 2025 22:33:56 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:

> Hi Dmitry,
> I added mbuf refcnt check at the point where we get the packets from DPDK rte_eth_rx_burst() and I am finding many mbufs refcnt set to 2 when I expect them to be 1. This is more evidence that something has gone wrong within DPDK and the mempool management, perhaps, and I don't know how my setup went into this state. I changed the traffic type from Telnet to Enterprise traffic and rte_eth_tx_burst() no longer accepts mbufs.

Not likely a DPDK bug. More likely you are having application problems.

Have you tried enabling things like RTE_LIBRTE_MBUF_DEBUG and RTE_LIBRTE_MEMPOOL_DEBUG and RTE_ENABLE_ASSERT?

Also using a current version of DPDK, address sanitizer, and latest GCC or CLANG can uncover issues with use after free.
* RE: mbuf refcnt issue
  2025-04-09 3:53 ` Stephen Hemminger
@ 2025-04-09 4:46 ` Lombardo, Ed
  2025-04-09 16:24 ` Stephen Hemminger
  0 siblings, 1 reply; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-09 4:46 UTC
To: Stephen Hemminger; +Cc: Dmitry Kozlyuk, users

Hi Stephen,

I am looking at rte_pktmbuf_free() in rte_mbuf.h, and it is not clear to me that it checks whether the mbuf refcnt is 1 before decrementing it and allowing the mbuf and its segments (if any) to be returned to the free pool.

Could my application's issue be that I have Tx threads that transmit packets and do rte_pktmbuf_free(), while another thread calls rte_pktmbuf_free() on the same mbuf? I ensured that I bump the mbuf refcnt to 2 before other threads can process the same mbuf.

Thanks,
Ed
* Re: mbuf refcnt issue
  2025-04-09 4:46 ` Lombardo, Ed
@ 2025-04-09 16:24 ` Stephen Hemminger
  2025-04-09 23:22 ` Lombardo, Ed
  0 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2025-04-09 16:24 UTC
To: Lombardo, Ed; +Cc: Dmitry Kozlyuk, users

On Wed, 9 Apr 2025 04:46:09 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:

> Hi Stephen,
> I am looking a the rte_mbuf.h file for rte_pktmbuf_free() and it is not clear to me that it checks if the mbuf refcnt is 1 before decrementing it and allowing the mbuf and segments (if any) to be returned to free pool.
>
> Could my application issue be I have tx threads that transmit packets and does rte_pktmbuf_free(), while one other thread will perform rte_pktmbuf_free() on the same mbuf? I ensured I bump the mbuf refcnt to 2 before other threads can process the same mbuf.
>
> Thanks,
> Ed

It doesn't need to check refcnt there. The check is done later (since an mbuf can be multi-segment).

rte_pktmbuf_free
  -> rte_pktmbuf_free_seg
    -> rte_pktmbuf_prefree_seg

static __rte_always_inline struct rte_mbuf *
rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
{
	__rte_mbuf_sanity_check(m, 0);

	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
		/* normal fast path; breaks the chain */
		...
	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
		/* refcnt > 1 logic */
		...
	}
	...
}

Note, the refcnt doesn't always go to zero when the mbuf is put back in the pool. The refcnt for a freed mbuf (in the pool) doesn't matter: it is free, it is dead. The refcnt is reset to 1 when the mbuf is extracted from the pool.
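The two branches quoted in the reply above can be mimicked with a small stand-alone C model, which may help in reasoning about one too many frees. This is only a sketch: toy_mbuf and toy_free() are hypothetical names, the counter is not atomic, and restoring the counter to 1 in the second branch is an assumption of the model rather than a quote of the DPDK source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for struct rte_mbuf: only the 16-bit reference counter. */
struct toy_mbuf {
	uint16_t refcnt;
	unsigned int times_returned; /* how often it was put back in the pool */
};

/* Mirrors the two branches quoted above. Returns true if the buffer
 * went back to the pool on this call. */
static bool toy_free(struct toy_mbuf *m)
{
	if (m->refcnt == 1) {
		/* fast path: sole owner, the counter is left at 1 */
		m->times_returned++;
		return true;
	}
	m->refcnt = (uint16_t)(m->refcnt - 1);
	if (m->refcnt == 0) {
		m->refcnt = 1; /* model assumption: ready for the next user */
		m->times_returned++;
		return true;
	}
	return false; /* another owner still holds a reference */
}
```

Starting from refcnt 2, the first toy_free() only decrements and the second returns the buffer. A third call finds refcnt == 1 again and returns the same buffer a second time without any error, which is how a buffer could end up in the pool twice. Calling toy_free() on a counter that is already 0 wraps the 16-bit value to 65535, which is one way values like those in the original report could arise.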
* RE: mbuf refcnt issue
  2025-04-09 16:24 ` Stephen Hemminger
@ 2025-04-09 23:22 ` Lombardo, Ed
  2025-04-10 3:25 ` Stephen Hemminger
  0 siblings, 1 reply; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-09 23:22 UTC
To: Stephen Hemminger; +Cc: Dmitry Kozlyuk, users

Hi,

I just finished modifying and testing our application so that it only transmits packets received on a NIC interface and lets rte_eth_tx_burst() free the mbufs, and everything works fine for both traffic types. This proves to me that my implementation of processing the packets, queueing them to the Tx ring, and transmitting from the Tx ring is not buggy, which I had carefully verified in gdb early on. I still believe the problem is in our application, where many threads can call rte_pktmbuf_free() on the same mbuf.

I added these lines in my driver source file:
#define RTE_LITRTE_MBUF_DEBUG 1
#define RTE_LIBRTE_MEMPOOL_DEBUG 1
#define RTE_ENABLE_ASSERT 1

I don't see any asserts occur during my Tx packet testing.

The DPDK header files show the atomic ifdef checks:
rte_build_config.h:#define RTE_MBUF_REFCNT_ATOMIC
rte_mbuf_core.h: * or non-atomic) is controlled by the RTE_MBUF_REFCNT_ATOMIC flag.
rte_mbuf.h:#ifdef RTE_MBUF_REFCNT_ATOMIC
rte_mbuf.h:#else /* ! RTE_MBUF_REFCNT_ATOMIC */
rte_mbuf.h:#endif /* RTE_MBUF_REFCNT_ATOMIC */

I verified, by building our application against the DPDK rte_mbuf.h header, that the atomic functions for mbuf refcnt reads/writes are turned ON: I added junk characters to that code path, and the compiler reported syntax errors.

So I am back to the question of why I get mbuf issues when multiple threads process the same mbuf.

Any more suggestions?

Thanks,
Ed
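One detail that may be worth checking with the defines listed in the message above: the first one is spelled RTE_LITRTE_MBUF_DEBUG, while the option suggested earlier in the thread is RTE_LIBRTE_MBUF_DEBUG. The preprocessor only reacts to the exact name, and a define in one source file does not change code that was compiled elsewhere. The stand-alone sketch below shows the effect; demo_sanity_check() and demo_debug_enabled() are hypothetical names standing in for a header's debug hook, not DPDK APIs.

```c
#include <stdbool.h>

/* Spelled as in the message above ("LITRTE"), not as the header tests it. */
#define RTE_LITRTE_MBUF_DEBUG 1

/* Stand-in for a library header: the check exists only when the macro
 * with the exact expected name is defined before this point. */
#ifdef RTE_LIBRTE_MBUF_DEBUG
#define demo_sanity_check(refcnt) ((refcnt) != 0)
#else
#define demo_sanity_check(refcnt) (true) /* compiled out: always passes */
#endif

/* Hypothetical helper: reports whether the debug check is compiled in. */
static bool demo_debug_enabled(void)
{
#ifdef RTE_LIBRTE_MBUF_DEBUG
	return true;
#else
	return false;
#endif
}
```

With the misspelled name, demo_debug_enabled() returns false and demo_sanity_check(0) still passes, so no assertion could ever fire from this check. If the same happens in the real build, passing the options on the compiler command line for both the application and the DPDK build would be a more reliable way to enable them than per-file defines.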
* Re: mbuf refcnt issue
  2025-04-09 23:22 ` Lombardo, Ed
@ 2025-04-10  3:25 ` Stephen Hemminger
  2025-04-10  3:58 ` Lombardo, Ed
  0 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2025-04-10 3:25 UTC
To: Lombardo, Ed; +Cc: Dmitry Kozlyuk, users

On Wed, 9 Apr 2025 23:22:50 +0000
"Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:

> Hi,
> I just finished modifying and testing our application to just do transmit of packets received on an NIC interface and let the rte_eth_tx_burst () free the mbuf and all works fine for both traffic types. This proves to me that my implementation of processing the packets and queueing them to tx ring and transmit from the tx ring is not buggy, which I had carefully verified in gdb early on. I still believe there is a problem with our application with many threads that can do rte_pktmbuf_free() on the same mbuf.
>
> I added these lines in my driver source file:
> #define RTE_LITRTE_MBUF_DEBUG 1
> #define RTE_LIBRTE_MEMPOOL_DEBUG 1
> #define RTE_ENABLE_ASSERT 1
>
> I don't see any asserts occur during my tx packet testing.
>
> The dpdk header files show the Atomic ifdef checks
> rte_build_config.h:#define RTE_MBUF_REFCNT_ATOMIC
> rte_mbuf_core.h: * or non-atomic) is controlled by the RTE_MBUF_REFCNT_ATOMIC flag.
> rte_mbuf.h:#ifdef RTE_MBUF_REFCNT_ATOMIC
> rte_mbuf.h:#else /* ! RTE_MBUF_REFCNT_ATOMIC */
> rte_mbuf.h:#endif /* RTE_MBUF_REFCNT_ATOMIC */
>
> I verified in building our application with DPDK rte_mbuf.h header file that the atomic functions for mbuf refcnt read/writes are turned ON. I added junk characters, and the compiler spotted syntax errors.
>
> So, I am back to the question as to why I get mbuf issue with multiple threads processing the same mbuf?
>
> Any more suggestions.
>
> Thanks,
> Ed
>
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Wednesday, April 9, 2025 12:24 PM
> To: Lombardo, Ed <Ed.Lombardo@netscout.com>
> Cc: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>; users@dpdk.org
> Subject: Re: mbuf refcnt issue
>
> External Email: This message originated outside of NETSCOUT. Do not click links or open attachments unless you recognize the sender and know the content is safe.
>
> On Wed, 9 Apr 2025 04:46:09 +0000
> "Lombardo, Ed" <Ed.Lombardo@netscout.com> wrote:
>
> > Hi Stephen,
> > I am looking a the rte_mbuf.h file for rte_pktmbuf_free() and it is not clear to me that it checks if the mbuf refcnt is 1 before decrementing it and allowing the mbuf and segments (if any) to be returned to free pool.
> >
> > Could my application issue be I have tx threads that transmit packets and does rte_pktmbuf_free(), while one other thread will perform rte_pktmbuf_free() on the same mbuf? I ensured I bump the mbuf refcnt to 2 before other threads can process the same mbuf.
> >
> > Thanks,
> > Ed
>
> It doesn't need to check refcnt there. The check is done later (since mbuf can be multi segment).
>
> rte_pktmbuf_free
>   -> rte_pktmbuf_free_seg
>     -> rte_pktmbuf_prefree_seg
>
> static __rte_always_inline struct rte_mbuf *
> rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
> {
> 	__rte_mbuf_sanity_check(m, 0);
>
> 	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
> 		normal fast path. breaks the chain.
> 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
> 		refcnt > 1 logic
>
> Note, the refcnt doesn't always go to zero when the mbuf is put back in the pool.
> The refcnt for a freed mbuf (in the pool) doesn't matter, it is free, it is dead.
> The refcnt is reset to 1 when mbuf is extracted from the pool.
>
>

You might find something by poisoning the mbuf when it is freed, so that any attempt to use the data would get junk. Also, add check at start of rte_pktmbuf_free_seg() to catch dup free.

Something like this, but only compile tested.
It will break some of the functional tests, because they tend to make bogus dummy mbufs.

diff --git a/lib/mbuf/rte_mbuf.h b/lib/mbuf/rte_mbuf.h
index 06ab7502a5..6088b34506 100644
--- a/lib/mbuf/rte_mbuf.h
+++ b/lib/mbuf/rte_mbuf.h
@@ -1423,10 +1423,11 @@ static inline int __rte_pktmbuf_pinned_extbuf_decref(struct rte_mbuf *m)
 static __rte_always_inline struct rte_mbuf *
 rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 {
-	__rte_mbuf_sanity_check(m, 0);
 
-	if (likely(rte_mbuf_refcnt_read(m) == 1)) {
+	__rte_mbuf_sanity_check(m, 0);
 
+	unsigned int refcnt = rte_mbuf_refcnt_read(m);
+	if (likely(refcnt == 1)) {
 		if (!RTE_MBUF_DIRECT(m)) {
 			rte_pktmbuf_detach(m);
 			if (RTE_MBUF_HAS_EXTBUF(m) &&
@@ -1435,13 +1436,15 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
+		m->refcnt = 0;
 		if (m->next != NULL)
 			m->next = NULL;
 		if (m->nb_segs != 1)
 			m->nb_segs = 1;
 
 		return m;
-
+	} else if (unlikely(refcnt == 0 || refcnt >= UINT16_MAX - 1)) {
+		rte_panic("mbuf refcnt underflow %u\n", refcnt);
 	} else if (__rte_mbuf_refcnt_update(m, -1) == 0) {
 
 		if (!RTE_MBUF_DIRECT(m)) {
@@ -1452,6 +1455,7 @@ rte_pktmbuf_prefree_seg(struct rte_mbuf *m)
 				return NULL;
 		}
 
+		m->refcnt = 0;
 		if (m->next != NULL)
 			m->next = NULL;
 		if (m->nb_segs != 1)
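
Editor's sketch: the control flow of the patched free path can be tried out without DPDK. The model below is a minimal stand-in, not the library code: `struct toy_mbuf`, `toy_prefree()`, `toy_unchecked_decrement()` and the result enum are hypothetical names, and only the reference count is modelled. It also shows one plausible way (an assumption, not a diagnosis) for counts such as 65535 to appear: an unchecked decrement of a 16-bit counter that is already 0.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_mbuf: only the refcnt is modelled. */
struct toy_mbuf {
	_Atomic uint16_t refcnt;
};

enum toy_free_result {
	TOY_STILL_REFERENCED, /* another owner remains, buffer stays live */
	TOY_RETURN_TO_POOL,   /* caller held the last reference */
	TOY_DOUBLE_FREE       /* count was already 0 or had wrapped around */
};

/* Mirrors the branch order of the patch above on the toy structure. */
static enum toy_free_result
toy_prefree(struct toy_mbuf *m)
{
	uint16_t refcnt = atomic_load(&m->refcnt);

	if (refcnt == 1) {
		/* Fast path: sole owner. Poison the count so a second free is visible. */
		atomic_store(&m->refcnt, 0);
		return TOY_RETURN_TO_POOL;
	}
	if (refcnt == 0 || refcnt >= UINT16_MAX - 1)
		return TOY_DOUBLE_FREE; /* the patch calls rte_panic() at this point */

	/* Shared buffer: drop one reference; whoever reaches 0 frees it. */
	if (atomic_fetch_sub(&m->refcnt, 1) == 1) {
		atomic_store(&m->refcnt, 0);
		return TOY_RETURN_TO_POOL;
	}
	return TOY_STILL_REFERENCED;
}

/* What an unchecked decrement does to a 16-bit count: 0 wraps to 65535. */
static uint16_t
toy_unchecked_decrement(uint16_t refcnt)
{
	return (uint16_t)(refcnt - 1);
}
```

With a buffer that starts at a count of 2, the first `toy_prefree()` reports that it is still referenced, the second returns it to the pool, and a third is flagged as a double free instead of silently wrapping the counter.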
* RE: mbuf refcnt issue
  2025-04-10  3:25 ` Stephen Hemminger
@ 2025-04-10  3:58 ` Lombardo, Ed
  2025-04-10 21:15 ` Lombardo, Ed
  0 siblings, 1 reply; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-10 3:58 UTC
To: Stephen Hemminger; +Cc: Dmitry Kozlyuk, users

Hi Stephen,
I should mention that I added mbuf refcnt checks at all pipeline stages. What caught my eye was that in the first stage of the pipeline, for an mbuf provided by rte_eth_rx_burst(), I logged a refcnt of 2 when it should be 1. So something went bad after the event that triggered the error. The mbufs kept arriving, so the mempool was still making mbufs available for receiving packets from the NIC interface. So weird.

I will try the patch you provided.
* RE: mbuf refcnt issue
  2025-04-10  3:58 ` Lombardo, Ed
@ 2025-04-10 21:15 ` Lombardo, Ed
  0 siblings, 0 replies; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-10 21:15 UTC
To: Stephen Hemminger; +Cc: Dmitry Kozlyuk, users

Hi Stephen,
The overnight tx test passed with no problems.

This morning I modified our application's DPDK driver: I set the mbuf refcnt in the second stage of our pipeline, then enqueued the mbuf pointers to the ack ring, and then added the same mbufs to the tx ring. I then provided Mobile traffic. The same MEMPOOL log was seen and, in addition, it failed to transmit.

I checked that rte_mempool_create() is called with the flags set to 0. Is a flags value of 0 an issue for my use case?

I removed the higher-layer application in case it was corrupting the mbufs, and this has been eliminated. I then tested multi-threaded tx, with one thread calling rte_eth_tx_burst() and a second thread calling rte_pktmbuf_free(), and I see problems again.

It is pretty convincing that these two DPDK APIs cannot be called by multiple threads, and maybe rte_pktmbuf_free() must be called only after the mbuf is transmitted. If this is the case, I am back to square one.

I verified that rte_pktmbuf_free() on an mbuf with refcnt == 2 decrements it to refcnt == 1. rte_eth_tx_burst() should also decrement the mbuf refcnt by 1 if it is seen as 2. In theory, I believe the last thread to release the mbuf, the one that sees a refcnt of 1, should trigger the return of that mbuf to the mempool.

Is my understanding incorrect?

Thanks,
Ed
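
Editor's sketch: the rule being debated, that each consumer must hold its own reference before the buffer is shared, can be checked with a small single-threaded model. This is an illustration under stated assumptions, not DPDK code: `toy_release()`, `toy_handoff()` and `struct handoff_result` are hypothetical names, the counter is a plain integer rather than an atomic, and the two releases stand for the ack-ring free and the free done on Tx completion, in whichever order they happen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct handoff_result {
	int pool_returns; /* times the buffer went back to the pool */
	int double_frees; /* releases that found no reference left */
};

/* One owner lets go of the buffer; the outcome is tallied in *r. */
static void
toy_release(uint16_t *refcnt, struct handoff_result *r)
{
	if (*refcnt == 0)
		r->double_frees++;      /* nothing left to release */
	else if (--(*refcnt) == 0)
		r->pool_returns++;      /* last owner frees the buffer */
}

/* Share one buffer between two consumers, with or without an extra reference. */
static struct handoff_result
toy_handoff(bool retain_before_sharing)
{
	struct handoff_result r = {0, 0};
	uint16_t refcnt = 1;            /* value right after allocation or Rx */

	if (retain_before_sharing)
		refcnt++;               /* one reference per consumer */

	toy_release(&refcnt, &r);       /* e.g. the ack-ring consumer */
	toy_release(&refcnt, &r);       /* e.g. the Tx completion path */
	return r;
}
```

With the extra reference taken first, the two releases produce exactly one return to the pool regardless of their order. Without it, the second release is a double free, which in a real mempool could hand the same buffer out twice.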
* RE: mbuf refcnt issue
  2025-04-08 22:33 ` Lombardo, Ed
  2025-04-09  3:53 ` Stephen Hemminger
@ 2025-04-09 16:15 ` Lombardo, Ed
  1 sibling, 0 replies; 12+ messages in thread
From: Lombardo, Ed @ 2025-04-09 16:15 UTC
To: Dmitry Kozlyuk, users

Hi,
I added to my source code:
#define RTE_LITRTE_MBUF_DEBUG 1
#define RTE_LIBRTE_MEMPOOL_DEBUG 1
#define RTE_ENABLE_ASSERT 1

When I send traffic, a core file is generated. It failed in rte_pktmbuf_free() when __rte_mbuf_raw_sanity_check() was called. The assertion RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1) fails because this mbuf has refcnt=2.

rte_mbuf.h shows the RTE_ASSERT check on refcnt:

565 static __rte_always_inline void
566 __rte_mbuf_raw_sanity_check(__rte_unused const struct rte_mbuf *m)
567 {
568 	RTE_ASSERT(rte_mbuf_refcnt_read(m) == 1);
569 	RTE_ASSERT(m->next == NULL);
570 	RTE_ASSERT(m->nb_segs == 1);
571 	__rte_mbuf_sanity_check(m, 0);
572 }

In gdb I inspected the mbuf and it shows refcnt=2, which is possible in our multi-threaded design. My understanding was that the refcnt is used to decide when the mbuf(s) should be returned to the mempool. In our design, a refcnt of 2 means the mbuf is also targeted for the Tx thread, which may not have transmitted and freed it yet. If I wanted to hand this mbuf to several different Tx threads, my issue would be even worse.

(gdb) bt
#0  0x00007fc90548baac in __pthread_kill_implementation () from /lib/libc.so.6
#1  0x00007fc90543e686 in raise () from /lib/libc.so.6
#2  0x00007fc905428833 in abort () from /lib/libc.so.6
#3  0x000000000042e2ba in __rte_panic (funcname=<optimized out>, format=0x338a848 "line %d\tassert \"%s\" failed\n%.0s") at ../lib/eal/common/eal_common_debug.c:26
#4  0x0000000001cfa713 in __rte_mbuf_raw_sanity_check (m=0x190ef93c0) at /users/lombardoe/70_ISNG_AED_Merge_new/nsagent/lib/dpdk-2411.1_gen_dbg/include/dpdk/rte_mbuf.h:568
#5  rte_mbuf_raw_free (m=0x190ef93c0) at /users/lombardoe/70_ISNG_AED_Merge_new/nsagent/lib/dpdk-2411.1_gen_dbg/include/dpdk/rte_mbuf.h:628
#6  rte_pktmbuf_free_seg (m=0x190ef93c0) at /users/lombardoe/70_ISNG_AED_Merge_new/nsagent/lib/dpdk-2411.1_gen_dbg/include/dpdk/rte_mbuf.h:1403
#7  rte_pktmbuf_free (m=0x190ef93c0) at /users/lombardoe/70_ISNG_AED_Merge_new/nsagent/lib/dpdk-2411.1_gen_dbg/include/dpdk/rte_mbuf.h:1424

I enabled the atomic guard for the mbuf refcnt in meson:
  mbuf_refcnt_atomic    true    [true, false]    Atomically access the mbuf refcnt.

And the rte_build_config.h file shows "#define RTE_MBUF_REFCNT_ATOMIC".

This looks to me like a collision on the same mbuf and its refcnt in a multi-threaded design. It also appears that rte_pktmbuf_free() does not decrement the refcnt when it is not equal to 1; it tries to free the mbuf. I don't see how the mbuf(s) can be handled by multiple threads (i.e. the tx path, plus freeing those not transmitted).

Thanks,
Ed

-----Original Message-----
From: Lombardo, Ed
Sent: Tuesday, April 8, 2025 6:34 PM
To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>; users@dpdk.org
Subject: RE: mbuf refcnt issue

Hi Dmitry,
I added mbuf refcnt check at the point where we get the packets from DPDK rte_eth_rx_burst() and I am finding many mbufs refcnt set to 2 when I expect them to be 1. This is more evidence that something has gone wrong within DPDK and the mempool management, perhaps, and I don't know how my setup went into this state.

I changed the traffic type from Telnet to Enterprise traffic and rte_eth_tx_burst() no longer accepts mbufs. However, the membuf pool is not depleted after I hit the point the Tx stops.

*** DPDK MEMORY STATS **
Reserved mbufs   : 786432
Available mbufs  : 786432
Allocated mbufs  : 0
Is Mempool Full  : 1
Is Mempool Empty : 0

The receiving of mbufs from rte_eth_rx_burst() that have refcnt == 2 is leading me to believe that our application pipeline stage sets the refcnt to 2 but when the Tx thread is blocked, the mempool still allows the rte_eth_rx_burst() to reuse these mbufs and does not touch the mbuf refcnt? Or does any mbufs provided by rte_eth_rx_burst() guarantee the mbuf refcnt will be initialized to 1?

I also tried DPDK 23.11.2 and see same issue.

This morning, I found that once the Tx operation stops, I can set the breakpoint just before rte_eth_tx_burst() and change the tx_queue_id to different value then the tx of packets pickup again and then stops shortly. Once the Tx stops on a tx_queue_id then tx packets fail and never clears up again.

Any thoughts to my issue?

Thanks,
Ed

-----Original Message-----
From: Lombardo, Ed
Sent: Monday, April 7, 2025 3:54 PM
To: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>; users@dpdk.org
Subject: RE: mbuf refcnt issue

Hi Dmitry,
I see issues where spotty mbufs show incorrect refcnt, i.e. should be 1 but I see 2 or 3, or should be 2 but I see 3. Like the mbuf is returned to mempool and reused by DPDK receive (and appears that mbuf refcnt is not initialized to 1, as I would expect).

I tried to pay attention in the software to eliminate any asynchronous events by multiple threads that perform either rte_mbuf_refcnt_update() and rte_pktmbuf_free() that could cause a weird behavior.
Meaning, on receive of packets the application first calls to request packets is where I increment the mbuf refcnt, then I return the list of packets to the application which will then process the packets and determine which ones to drop (should be very few drops). In my testing, I am transmitting all the packets received so is much simplified. From this point on there are multiple threads touching the mbuf refcnt (either DPDK rte_eth_tx_burst()or rte_pktmbuf_free()).

I am testing against DPDK 24.11.1 as generic platform build.

The Pipeline with many threads are explained somewhat below.

1. Rx thread: (two threads 1,2)
   Free mbufs processed by application in Ack ring.
   Receive mbufs from NIC via rte_ring_dequeue_burst()
   Put mbufs into Worker ring.

2. App request for packets: (two threads, 3, 4)
   Process packets from worker ring, and mark the packets to tx (via internal array of pkt info)
   Increment mbuf refcnt by 1 (assume will transmit all packets)
   Add mbufs to Ack ring for freeing later.
   At this point, no other thread should touch these mbufs.

3. App process packets and update internal array of pkt info. (two or more threads, x)
   (here I simulate that the App is processing packets, I mark them all for tx.)

4. App calls to request to send burst of packets: (2 threads, x+2)
   Process packets marked for transmit and put in rte_mbuf pointer arrays.
   Those packets to drop call rte_pktmbuf_free() (this step is not executed because all packets are marked for transmit)
   When done then add mbufs to tx rings. Two Tx threads will process packets from their tx ring, one per NIC port.

5. App calls an API to free mbufs in the Ack ring.

* Tx thread (two threads)
   Monitor tx rings and transmit with rte_eth_tx_burst()
   If DPDK fails to send any mbufs, perform rte_pktmbuf_free() on those mbufs that could not be transmitted. This event is prevented because Tx rate is low. Wanted to rule this out.

The order of pipeline #4 and #5 cannot be controlled, but counting on the mbuf refcnt to hold correct values so at the end the mbuf is freed to the mempool for DPDK reuse on receive.

I added debug and I see mbuf refcnts are not correct. I don't believe the mempool is a first in first out so mbuf addresses are showing in random, maybe cached.

Also worth mentioning if the DPDK Tx burst fails to transmit all the supplied mbufs I am planning to issue the rte_pktmbuf_free() for those not transmitted. Another thread touching mbufs.

Is this multi-thread approach a problem in DPDK?

Thanks,
Ed

-----Original Message-----
From: Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
Sent: Friday, April 4, 2025 6:29 PM
To: Lombardo, Ed <Ed.Lombardo@netscout.com>; users@dpdk.org
Subject: Re: mbuf refcnt issue

External Email: This message originated outside of NETSCOUT. Do not click links or open attachments unless you recognize the sender and know the content is safe.

Hi Ed,

On 05.04.2025 01:00, Lombardo, Ed wrote:
>
> Hi,
>
> I have an application where we receive packets and transmit them. The packet data is inspected and later mbuf is freed to mempool.
>
> The pipeline is such that the Rx packet mbuf is saved to rx worker ring, then the application threads process the packets and decides if to transmit the packet and if true then increments the mbuf to a value of 2.
>

Do I understand the pipeline correctly?

Rx thread:
    receive mbuf
    put mbuf into the ring
    inspect mbuf
    free mbuf

Worker thread:
    take mbuf from the ring
    if decided to transmit it,
        increment refcnt
        transmit mbuf

If so, there's a problem that after Rx thread puts mbuf into the ring, mbuf is owned by Rx thread and the ring, so its refcnt must be 2 when it enters the ring:

Rx thread:
    receive mbuf
    increment refcnt
    put mbuf into the ring
    inspect mbuf
    free mbuf (just decrements refcnt if > 1)

Worker thread:
    take mbuf from the ring
    if decided to transmit it,
        transmit (or put into the bulk transmitted later)
    else
        free mbuf (just decrements refcnt if > 1)

> The batch of mbufs to transmit are put in a Tx ring queue for the Tx thread to pull from and call the DPDK rte_eth_tx_burst() with the batch of mbufs (limited to 400 mbufs). In theory the transmit operation will decrement the mbuf refcnt. In our application we could see the tx of the mbuf followed by another application thread that calls to free the mbufs, or vice versa. We have no way to synchronize these threads.
>
> Is the mbuf refcnt updates thread safe to allow un-deterministic handling of the mbufs among multiple threads? The decision to transmit the mbuf and increment the mbuf refcnt and load in the tx ring is completed before the application says it is finished and frees the mbufs.
>

Have you validated this assumption?
If my understanding above is correct, there's no synchronization and thus no guarantees.

>
> I am seeing in my error checking code the mbuf refcnt contains large values like 65520, 65529, 65530, 65534, 65535 in the early pipeline stage refcnt checks.
>
> I read online and in the DPDK code that the mbuf refcnt update is atomic, and is thread safe; so, this is good.
>
> Now this part is unclear to me and that is when the rte_eth_tx_burst() is called and returns the number of packets transmitted , does this mean that transmit of the packets are completed and mbuf refcnt is decremented by 1 on return, or maybe the Tx engine queue is populated and mbuf refcnt is not decremented until it is actually transmitted, or much worse later in time.
>
> Is the DPDK Tx operation intended to be the last stage of any pipeline that will free the mbuf if successfully transmitted?
>

Return from rte_eth_tx_burst() means that mbufs are queued for transmission. Hardware completes transmission asynchronously. The next call to rte_eth_tx_burst() will poll HW, learn the status of mbufs *previously* queued, and call rte_pktmbuf_free() for those that are transmitted. The latter will free mbufs to mempool if and only if refcnt == 1.
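
Editor's sketch: the deferred release described in the reply above can be modelled in a few lines. This is a simplified illustration, not a driver: `struct toy_txq` and `toy_tx_burst()` are hypothetical, each "mbuf" is reduced to an `int` reference count, and the model assumes that every buffer queued by the previous call has completed by the time of the next call. Real drivers release buffers only once the hardware reports the descriptors done, so the release can happen later than in this model.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TXQ_DEPTH 8

/* Hypothetical Tx queue: remembers buffers handed to the "hardware". */
struct toy_txq {
	int *in_flight[TOY_TXQ_DEPTH];
	size_t n_in_flight;
};

/* Model of a Tx burst: release earlier buffers first, then queue new ones. */
static size_t
toy_tx_burst(struct toy_txq *q, int **pkts, size_t n)
{
	size_t i, sent = 0;

	/* Step 1: drop the references of buffers queued by earlier calls. */
	for (i = 0; i < q->n_in_flight; i++)
		(*q->in_flight[i])--;   /* stands in for rte_pktmbuf_free() */
	q->n_in_flight = 0;

	/* Step 2: queue the new buffers; their counts are not touched yet. */
	while (sent < n && q->n_in_flight < TOY_TXQ_DEPTH)
		q->in_flight[q->n_in_flight++] = pkts[sent++];

	return sent;
}
```

In this model a buffer queued with a count of 2 still has a count of 2 when the call returns, and drops to 1 only during a later call. An application thread that inspects or frees the buffer in between is therefore racing with the Tx path unless it holds its own reference.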

end of thread, newest message: 2025-04-10 21:15 UTC

Thread overview: 12+ messages
2025-04-04 22:00 mbuf refcnt issue Lombardo, Ed
2025-04-04 22:29 ` Dmitry Kozlyuk
2025-04-07 19:53   ` Lombardo, Ed
2025-04-08 22:33     ` Lombardo, Ed
2025-04-09  3:53       ` Stephen Hemminger
2025-04-09  4:46         ` Lombardo, Ed
2025-04-09 16:24           ` Stephen Hemminger
2025-04-09 23:22             ` Lombardo, Ed
2025-04-10  3:25               ` Stephen Hemminger
2025-04-10  3:58                 ` Lombardo, Ed
2025-04-10 21:15                   ` Lombardo, Ed
2025-04-09 16:15       ` Lombardo, Ed