* Multiprocess App Problems with tx_burst @ 2024-12-31 17:49 Alan Beadle 2025-01-04 16:22 ` Alan Beadle 0 siblings, 1 reply; 10+ messages in thread From: Alan Beadle @ 2024-12-31 17:49 UTC (permalink / raw) To: users Hi everyone, I am working on a multi-process DPDK application. It uses one NIC, one port, and both separate processes send as well as receive, and they share memory for synchronization and IPC. I had previously made a mistake in setting up the lcores, and all of the processes were assigned to the same physical core. This seems to have concealed some DPDK thread safety issues which I am now dealing with. I understand that rte_eth_tx_burst() and rte_eth_rx_burst() are not thread safe. Previously I did not have any synchronization around these functions. Now that I am successfully using separate cores, I have added a shared spinlock around all invocations of these functions, as well as around all mbuf frees and allocations. However, when my code sends a packet, it checks the return value of rte_eth_tx_burst() to ensure that the packet was actually sent. If it fails to send, my app exits with an error. This was not previously happening, but now it happens every time I run it. I thought this was due to the lack of synchronization but it is still happening after I added the lock. Why would rte_eth_tx_burst() be failing now? Thank you, -Alan ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2024-12-31 17:49 Multiprocess App Problems with tx_burst Alan Beadle @ 2025-01-04 16:22 ` Alan Beadle 2025-01-04 18:40 ` Dmitry Kozlyuk 0 siblings, 1 reply; 10+ messages in thread From: Alan Beadle @ 2025-01-04 16:22 UTC (permalink / raw) To: users Hi everyone, I'm still stuck on this. Most likely I am doing something wrong in the initialization phase. I am trying to follow the standard code example for symmetric multi-process, but since my code is doing very different things from this example I cannot even begin to guess where I am going wrong. I do not even know if what I am trying to do is permissible in the DPDK API. It would be very helpful if someone could provide an initialization checklist for my use case (below). As explained previously, I have several separately launched processes. These processes already share a memory region for local communication. I want all of these processes to have equal ability to read incoming packets, place pointers to the mbufs in shared memory, and wake each other up when packets destined for a particular one of these processes arrive. I have one X550-T2 NIC and I am only using one of the physical ports. It connects to a second machine which is doing essentially the same thing, running the same DPDK code. In summary, each of my processes should be able to receive packets on behalf of the others, and leave pointers to rx'ed mbufs for each other in shared memory according to which process the mbuf was destined for. Outbound packets may also be shared with local peer processes for reading. In order to do this I am also bumping the mbuf refcount until the peer process has read the mbuf. I already thought I had all of this working fine, but it turns out that they were all taking turns on the same physical core, and everything breaks when they are run concurrently on separate cores. 
I have seen conflicting information in online threads about the thread safety of the various DPDK functions that I am using. I tried adding synchronization around DPDK allocation and tx/rx bursts to no avail. My code detects weird errors where either mbufs contain unexpected things (invalid reuse?) or tx bursts start to fail in one of the processes. Frankly I also feel very confused about how ports, queues, mempools, etc work and I suspect that a lot of what I have been reading is outdated or faulty information. Any guidance at all would be greatly appreciated! -Alan On Tue, Dec 31, 2024 at 12:49 PM Alan Beadle <ab.beadle@gmail.com> wrote: > > Hi everyone, > > I am working on a multi-process DPDK application. It uses one NIC, one > port, and both separate processes send as well as receive, and they > share memory for synchronization and IPC. > > I had previously made a mistake in setting up the lcores, and all of > the processes were assigned to the same physical core. This seems to > have concealed some DPDK thread safety issues which I am now dealing > with. > > I understand that rte_eth_tx_burst() and rte_eth_rx_burst() are not > thread safe. Previously I did not have any synchronization around > these functions. Now that I am successfully using separate cores, I > have added a shared spinlock around all invocations of these > functions, as well as around all mbuf frees and allocations. > > However, when my code sends a packet, it checks the return value of > rte_eth_tx_burst() to ensure that the packet was actually sent. If it > fails to send, my app exits with an error. This was not previously > happening, but now it happens every time I run it. I thought this was > due to the lack of synchronization but it is still happening after I > added the lock. Why would rte_eth_tx_burst() be failing now? > > Thank you, > -Alan ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-04 16:22 ` Alan Beadle @ 2025-01-04 18:40 ` Dmitry Kozlyuk 2025-01-04 19:16 ` Alan Beadle 0 siblings, 1 reply; 10+ messages in thread From: Dmitry Kozlyuk @ 2025-01-04 18:40 UTC (permalink / raw) To: Alan Beadle; +Cc: users 2025-01-04 11:22 (UTC-0500), Alan Beadle: > Hi everyone, > > I'm still stuck on this. Most likely I am doing something wrong in the > initialization phase. I am trying to follow the standard code example > for symmetric multi-process, but since my code is doing very different > things from this example I cannot even begin to guess where I am going > wrong. I do not even know if what I am trying to do is permissible in > the DPDK API. > > It would be very helpful if someone could provide an initialization > checklist for my use case (below). > > As explained previously, I have several separately launched processes. > These processes already share a memory region for local communication. > I want all of these processes to have equal ability to read incoming > packets, place pointers to the mbufs in shared memory, and wake each > other up when packets destined for a particular one of these processes > arrive. I have one X550-T2 NIC and I am only using one of the > physical ports. It connects to a second machine which is doing > essentially the same thing, running the same DPDK code. > > In summary, each of my processes should be able to > receive packets on behalf of the others, and leave pointers to > rx'ed mbufs for each other in shared memory according to which process > the mbuf was destined for. Outbound packets may also be shared with > local peer processes for reading. In order to do this I am also > bumping the mbuf refcount until the peer process has read the mbuf. 
> > I already thought I had all of this working fine, but it turns out > that they were all taking turns on the same physical core, and > everything breaks when they are run concurrently on separate cores. I > have seen conflicting information in online threads about the thread > safety of the various DPDK functions that I am using. I tried adding > synchronization around DPDK allocation and tx/rx bursts to no avail. > My code detects weird errors where either mbufs contain unexpected > things (invalid reuse?) or tx bursts start to fail in one of the > processes. > > Frankly I also feel very confused about how ports, queues, mempools, > etc work and I suspect that a lot of what I have been reading is > outdated or faulty information. > > Any guidance at all would be greatly appreciated! > -Alan > > On Tue, Dec 31, 2024 at 12:49 PM Alan Beadle <ab.beadle@gmail.com> wrote: > > > > Hi everyone, > > > > I am working on a multi-process DPDK application. It uses one NIC, one > > port, and both separate processes send as well as receive, and they > > share memory for synchronization and IPC. > > > > I had previously made a mistake in setting up the lcores, and all of > > the processes were assigned to the same physical core. This seems to > > have concealed some DPDK thread safety issues which I am now dealing > > with. > > > > I understand that rte_eth_tx_burst() and rte_eth_rx_burst() are not > > thread safe. Previously I did not have any synchronization around > > these functions. Now that I am successfully using separate cores, I > > have added a shared spinlock around all invocations of these > > functions, as well as around all mbuf frees and allocations. > > > > However, when my code sends a packet, it checks the return value of > > rte_eth_tx_burst() to ensure that the packet was actually sent. If it > > fails to send, my app exits with an error. This was not previously > > happening, but now it happens every time I run it. 
I thought this was > > due to the lack of synchronization but it is still happening after I > > added the lock. Why would rte_eth_tx_burst() be failing now? > > > > Thank you, > > -Alan Hi Alan, A lot is still unclear, so let's start gradually. It is the queues that are thread-unsafe, not the calls to rte_eth_rx/tx_burst(). You can call rte_eth_rx/tx_burst() concurrently without synchronization as long as they operate on different queues. Typically you assign each lcore to operate on one or more queues, but never have a single queue operated by multiple lcores. Otherwise you need to synchronize access, which obviously hurts scaling. Does this hold in your case? An lcore is a thread to which DPDK can dispatch work. By default it is pinned to one physical core unless --lcores is used. What is the lcore-to-CPU mapping in your case? What is the design of your app regarding processes, lcores, and queues? That is: which process runs which lcores, and which queues do the latter serve? P.S. Please don't top-post. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-04 18:40 ` Dmitry Kozlyuk @ 2025-01-04 19:16 ` Alan Beadle 2025-01-04 22:01 ` Dmitry Kozlyuk 0 siblings, 1 reply; 10+ messages in thread From: Alan Beadle @ 2025-01-04 19:16 UTC (permalink / raw) To: Dmitry Kozlyuk; +Cc: users > It is the queues that are thread-unsafe, not the calls to rte_eth_rx/tx_burst(). > You can call rte_eth_rx/tx_burst() concurrently without synchronization > as long as they operate on different queues. > Typically you assign each lcore to operate on one or more queues, > but never have a single queue operated by multiple lcores. > Otherwise you need to synchronize access, which obviously hurts scaling. > Does this hold in your case? > An lcore is a thread to which DPDK can dispatch work. > By default it is pinned to one physical core unless --lcores is used. > What is the lcore-to-CPU mapping in your case? > > What is the design of your app regarding processes, lcores, and queues? > That is: which process runs which lcores, and which queues do the latter serve? > > P.S. Please don't top-post. Hi Dmitry, On one machine I have two processes and on the other there are three processes. For simplicity we can focus on the former, but the only difference is the number of secondary processes. It is also worth noting that I am developing a shared library which uses DPDK internally, rather than the app code using DPDK directly. Therefore, all of my DPDK command line args for rte_eal_init() are actually kept in char arrays inside of the library code. I am setting up one tx queue and one rx queue via the primary process init code. The code here resembles the "basic forwarding" sample application (in the skeleton/ subdir). Please let me know whether it would be possible for each process to use entirely separate queues and still pass mbuf pointers around. Before I worry about scaling though, I want correctness first. 
My application is more likely to be CPU-bound than network-bound but for other reasons (this is a research project) I must use user-mode networking, which is why I am using DPDK. I will explain the processes and lcores below. First of all though, my application uses several types of packets. There are handshake packets (acknacks and the like) and data packets. The data packets are addressed to specific subsets of my secondary processes (for now just 1 or 2 secondary processes exist per machine, but support for even more is in principle part of my design). Sometimes the data should also be read by other peer processes on the same machine (including the daemon/primary process) so I chose to make the mbufs readable instead of allocating a duplicate local buffer. It is important that mbuf pointers from one process will work in the others. Otherwise all of my data would need to be duplicated into non-dpdk shared buffers too. The first process is the "daemon". This is the primary process. It uses DPDK through my shared library (which uses DPDK internally, as explained above). The daemon just polls the NIC and periodically cleans up my non-DPDK data structures in shared memory. The intent is to rely on the daemon to watch for packets during periods of low activity and avoid unnecessary CPU usage. When a packet arrives it can wake the correct secondary process by finding a futex in shared memory for that process. On both machines the daemon is mapped to core 2 with the parameter "-l 2". The second process is the "server". It uses separate app code from the daemon but calls into the same library. Like the daemon, it receives and parses packets. The server can originate new data packets, and can also reply to inbound data packets with more data packets to be sent back to processes on the other machine. It sleeps on a shared futex during periods of inactivity. 
If there were additional secondary processes (as is the case on the other machine) it could wake them when packets arrive for those other processes, again using futexes in shared memory. On both machines this second process is mapped to core 4 with the parameter "-l 4". The other machine has another secondary process (a third process) which is on core 6 with "-l 6". For the purposes of this discussion, it behaves similarly to the server process above (sends, receives, and sometimes sleeps). Thank you, -Alan ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-04 19:16 ` Alan Beadle @ 2025-01-04 22:01 ` Dmitry Kozlyuk 2025-01-05 16:01 ` Alan Beadle 0 siblings, 1 reply; 10+ messages in thread From: Dmitry Kozlyuk @ 2025-01-04 22:01 UTC (permalink / raw) To: Alan Beadle; +Cc: users 2025-01-04 14:16 (UTC-0500), Alan Beadle: > I am setting up one tx queue and one rx queue via the primary process > init code. The code here resembles the "basic forwarding" sample > application (in the skeleton/ subdir). Please let me know whether it > would be possible for each process to use entirely separate queues and > still pass mbuf pointers around. It is possible. Mbufs are in memory shared by all processes. For example, the primary process can set up 2 Rx queues: to use queue 0 by itself, and to let the secondary process use queue 1. Steering incoming packets to the right queue (or set of queues) would be another question, however. > I will explain the processes and lcores below. First of all though, my > application uses several types of packets. There are handshake packets > (acknacks and the like) and data packets. The data packets are > addressed to specific subsets of my secondary processes (for now just > 1 or 2 secondary processes exist per machine, but support for even > more is in principle part of my design). Sometimes the data should > also be read by other peer processes on the same machine (including > the daemon/primary process) so I chose to make the mbufs readable > instead of allocating a duplicate local buffer. It is important that > mbuf pointers from one process will work in the others. Otherwise all > of my data would need to be duplicated into non-dpdk shared buffers > too. > > The first process is the "daemon". This is the primary process. It > uses DPDK through my shared library (which uses DPDK internally, as > explained above). The daemon just polls the NIC and periodically > cleans up my non-DPDK data structures in shared memory. 
The intent is > to rely on the daemon to watch for packets during periods of low > activity and avoid unnecessary CPU usage. When a packet arrives it can > wake the correct secondary process by finding a futex in shared memory > for that process. On both machines the daemon is mapped to core 2 with > the parameter "-l 2". > > The second process is the "server". It uses separate app code from the > daemon but calls into the same library. Like the daemon, it receives > and parses packets. The server can originate new data packets, and can > also reply to inbound data packets with more data packets to be sent > back to processes on the other machine. It sleeps on a shared futex > during periods of inactivity. If there were additional secondary > processes (as is the case on the other machine) it could wake them > when packets arrive for those other processes, again using futexes in > shared memory. On both machines this second process is mapped to core > 4 with the parameter "-l 4". > > The other machine has another secondary process (a third process) > which is on core 6 with "-l 6". For the purposes of this discussion, > it behaves similarly to the server process above (sends, receives, and > sometimes sleeps). So, "daemon" and "server" may try using the same queue sometimes, correct? Synchronizing all access to the single queue should work in this case. BTW, rte_eth_tx_burst() returning >0 does not mean the packets have been sent. It only means they have been enqueued for sending. At some point the NIC will complete the send, and only then can the PMD free the mbuf (or decrement its reference count). For most PMDs, this happens on a subsequent call to rte_eth_tx_burst(). Which PMD and HW is it? Have you tried to print as many stats as possible when rte_eth_tx_burst() can't consume all packets (rte_eth_stats_get(), rte_eth_xstats_get())? ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-04 22:01 ` Dmitry Kozlyuk @ 2025-01-05 16:01 ` Alan Beadle 2025-01-06 16:05 ` Alan Beadle 0 siblings, 1 reply; 10+ messages in thread From: Alan Beadle @ 2025-01-05 16:01 UTC (permalink / raw) To: Dmitry Kozlyuk; +Cc: users > So, "daemon" and "server" may try using the same queue sometimes, correct? > Synchronizing all access to the single queue should work in this case. That is correct. > BTW, rte_eth_tx_burst() returning >0 does not mean the packets have been sent. > It only means they have been enqueued for sending. > At some point the NIC will complete the send, > and only then can the PMD free the mbuf (or decrement its reference count). > For most PMDs, this happens on a subsequent call to rte_eth_tx_burst(). > Which PMD and HW is it? Here is the output of 'dpdk-devbind.py --status': Network devices using DPDK-compatible driver ============================================ 0000:65:00.1 'Ethernet Controller 10G X550T 1563' drv=vfio-pci unused=uio_pci_generic > Have you tried to print as many stats as possible when rte_eth_tx_burst() > can't consume all packets (rte_eth_stats_get(), rte_eth_xstats_get())? In setting this up, I discovered that this error only occurs when the primary process on the other host exits (due to an error) or is not initially running (the NIC is "down" in this case?). It happens consistently when I only launch the processes on one of the two machines. ***But*** counterintuitively, it looks like packets are successfully "sent" by the daemon until the other process begins to run. In case it is useful, I summarize the stats for this case below. Note that I am also seeing another error. Sometimes, rather than tx failing, my app detects incorrect/corrupted mbuf contents and exits immediately. It appears that mbufs are being re-allocated when they should not be. I thought I had finally solved this (see my earlier threads) but with multi-core concurrency this problem has returned. 
It is very possible that this error is somewhere in my own library code, as it looks like the accompanying non-DPDK structures are also being corrupted (probably first). For background, I maintain a hash table of header structs to track individual mbufs. The sequence numbers in the headers should match those contained in the mbuf's payload. This check is failing after a few hundred successful data messages have been exchanged between the hosts. The sequence number in the mbuf shows that it is in the wrong hash bucket, and the sequence number in the header is a large corrupted value which is out of range for my sequence numbers (and also not matching the bucket). Back to the issue of failed tx bursts: Here are the stats I observed after a packet failed to send from the daemon (after only launching the primary+secondary processes on one of the machines). This failure occurred after the daemon had successfully "sent" hundreds of handshake packets (to nowhere, presumably?), and it occurred as soon as the second process had finished initialization: ipackets:0, opackets:0, ibytes:0, obytes:0, ierrors:0, oerrors:0 Got 146 xstats Port:0, tx_q0_packets:1138 Port:0, tx_q0_bytes:125180 Port:0, mac_local_errors:2 Port:0, out_pkts_untagged:5 (All other stats had a value of 0 and are omitted). I will continue investigating the corruption bug in the (likely) case that it is in my library code. In the meantime please let me know if I am using DPDK incorrectly. Thank you again! -Alan ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-05 16:01 ` Alan Beadle @ 2025-01-06 16:05 ` Alan Beadle 2025-01-06 20:10 ` Dmitry Kozlyuk 0 siblings, 1 reply; 10+ messages in thread From: Alan Beadle @ 2025-01-06 16:05 UTC (permalink / raw) To: Dmitry Kozlyuk; +Cc: users > Note that I am also seeing another error. Sometimes, rather than tx > failing, my app detects incorrect/corrupted mbuf contents and exits > immediately. It appears that mbufs are being re-allocated when they > should not be. I thought I had finally solved this (see my earlier > threads) but with multi-core concurrency this problem has returned. It > is very possible that this error is somewhere in my own library code, > as it looks like the accompanying non-DPDK structures are also being > corrupted (probably first). > > For background, I maintain a hash table of header structs to track > individual mbufs. The sequence numbers in the headers should match > those contained in the mbuf's payload. This check is failing after a > few hundred successful data messages have been exchanged between the > hosts. The sequence number in the mbuf shows that it is in the wrong > hash bucket, and the sequence number in the header is a large > corrupted value which is out of range for my sequence numbers (and > also not matching the bucket). > There is definitely something going wrong with the mbuf allocator. Each run results in such different errors that it is difficult to add instrumentation for a specific one, but one frequent error is that a newly allocated mbuf already has a refcnt of 2, and contains data that I am still using elsewhere. At each call to rte_pktmbuf_alloc() (with locks around it) I immediately do a rte_mbuf_refcnt_read() and ensure that it is 1. Sometimes it is 2. This should never occur and I believe it proves that DPDK is not working as expected here for some reason. -Alan ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-06 16:05 ` Alan Beadle @ 2025-01-06 20:10 ` Dmitry Kozlyuk 2025-01-06 20:34 ` Alan Beadle 0 siblings, 1 reply; 10+ messages in thread From: Dmitry Kozlyuk @ 2025-01-06 20:10 UTC (permalink / raw) To: Alan Beadle; +Cc: users 2025-01-06 11:05 (UTC-0500), Alan Beadle: > There is definitely something going wrong with the mbuf allocator. > Each run results in such different errors that it is difficult to add > instrumentation for a specific one, but one frequent error is that a > newly allocated mbuf already has a refcnt of 2, and contains data that > I am still using elsewhere. > At each call to rte_pktmbuf_alloc() (with locks around it) > I immediately do a rte_mbuf_refcnt_read() and ensure > that it is 1. Sometimes it is 2. This should never occur and I believe > it proves that DPDK is not working as expected here for some reason. I suspect that mbufs in use are put into mempool somehow. Which functions do you use to free mbufs to the pool on processing paths that do not end with `rte_eth_tx_burst()`? You can build DPDK with `-Dc_args='-DRTE_LIBRTE_MBUF_DEBUG'` to enable debug checks in the library. Unless `RTE_MEMPOOL_F_SC_GET` is used, `rte_pktmbuf_alloc()` is thread-safe. ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-06 20:10 ` Dmitry Kozlyuk @ 2025-01-06 20:34 ` Alan Beadle 2025-01-07 16:09 ` Alan Beadle 0 siblings, 1 reply; 10+ messages in thread From: Alan Beadle @ 2025-01-06 20:34 UTC (permalink / raw) To: Dmitry Kozlyuk; +Cc: users > I suspect that mbufs in use are put into mempool somehow. > Which functions do you use to free mbufs to the pool > on processing paths that do not end with `rte_eth_tx_burst()`? > You can build DPDK with `-Dc_args='-DRTE_LIBRTE_MBUF_DEBUG'` > to enable debug checks in the library. I am using rte_pktmbuf_free(). My understanding is that it decrements the refcount, and unless it really reaches 0, will not free it back to the mempool. This is the only function that I ever use to decrement the refcount of any mbuf, or to free them. I think I might have found an error in my library's heap allocator which can result in duplicate references to copies of the same data header, which are supposed to be different headers. This could explain some of my problems if the above usage is correct. I will continue investigating this possibility. One thing gives me doubt. My code includes frequent checks of all refcounts, so if it was accessing something with a refcount of 0 then it should detect that, and so far it has not. I'll also try those debug checks. If they can detect a double free, that would be helpful evidence. Thank you, -Alan ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Multiprocess App Problems with tx_burst 2025-01-06 20:34 ` Alan Beadle @ 2025-01-07 16:09 ` Alan Beadle 0 siblings, 0 replies; 10+ messages in thread From: Alan Beadle @ 2025-01-07 16:09 UTC (permalink / raw) To: Dmitry Kozlyuk; +Cc: users > I think I might have found an error in my library's heap allocator > which can result in duplicate references to copies of the same data > header, which are supposed to be different headers. This could explain > some of my problems if the above usage is correct. I will continue > investigating this possibility. One thing gives me doubt. My code > includes frequent checks of all refcounts, so if it was accessing > something with a refcount of 0 then it should detect that, and so far > it has not. I have confirmed that this was the cause. Somehow the zero refcount was not detected in my periodic checks. Everything seems to work ok now. In summary, here is what was happening. The heap allocator I was using inside of my library had a concurrency bug where in rare cases it would allocate the same buffer to multiple concurrent threads. This led to two indirect references to the same mbuf, and the false appearance that the same mbuf had been returned twice by the DPDK allocator while still in use. Concurrency is hard. Thank you for your patient help Dmitry. -Alan ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-01-07 16:10 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-12-31 17:49 Multiprocess App Problems with tx_burst Alan Beadle 2025-01-04 16:22 ` Alan Beadle 2025-01-04 18:40 ` Dmitry Kozlyuk 2025-01-04 19:16 ` Alan Beadle 2025-01-04 22:01 ` Dmitry Kozlyuk 2025-01-05 16:01 ` Alan Beadle 2025-01-06 16:05 ` Alan Beadle 2025-01-06 20:10 ` Dmitry Kozlyuk 2025-01-06 20:34 ` Alan Beadle 2025-01-07 16:09 ` Alan Beadle