DPDK usage discussions
* Multiprocess App Problems with tx_burst
@ 2024-12-31 17:49 Alan Beadle
  2025-01-04 16:22 ` Alan Beadle
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Beadle @ 2024-12-31 17:49 UTC (permalink / raw)
  To: users

Hi everyone,

I am working on a multi-process DPDK application. It uses one NIC and
one port; the separate processes each send as well as receive, and they
share memory for synchronization and IPC.

I had previously made a mistake in setting up the lcores, and all of
the processes were assigned to the same physical core. This seems to
have concealed some DPDK thread safety issues which I am now dealing
with.

I understand that rte_eth_tx_burst() and rte_eth_rx_burst() are not
thread safe. Previously I did not have any synchronization around
these functions. Now that I am successfully using separate cores, I
have added a shared spinlock around all invocations of these
functions, as well as around all mbuf frees and allocations.

However, when my code sends a packet, it checks the return value of
rte_eth_tx_burst() to ensure that the packet was actually sent. If it
fails to send, my app exits with an error. This was not previously
happening, but now it happens every time I run it. I thought this was
due to the lack of synchronization but it is still happening after I
added the lock. Why would rte_eth_tx_burst() be failing now?
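
For reference, the send path looks roughly like this (simplified
sketch; the lock, port, and queue identifiers stand in for what my
library really uses):

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_spinlock.h>

/* Spinlock placed in the shared-memory region used by all processes. */
extern rte_spinlock_t *eth_lock;

/* Send a single mbuf on port 0, queue 0, treating "not enqueued" as fatal. */
static int
send_one(struct rte_mbuf *m)
{
    uint16_t nb_tx;

    rte_spinlock_lock(eth_lock);
    nb_tx = rte_eth_tx_burst(0 /* port */, 0 /* queue */, &m, 1);
    rte_spinlock_unlock(eth_lock);

    if (nb_tx < 1)
        return -1;  /* this is the failure path that now triggers */
    return 0;
}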

Thank you,
-Alan


* Re: Multiprocess App Problems with tx_burst
  2024-12-31 17:49 Multiprocess App Problems with tx_burst Alan Beadle
@ 2025-01-04 16:22 ` Alan Beadle
  2025-01-04 18:40   ` Dmitry Kozlyuk
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Beadle @ 2025-01-04 16:22 UTC (permalink / raw)
  To: users

Hi everyone,

I'm still stuck on this. Most likely I am doing something wrong in the
initialization phase. I am trying to follow the standard code example
for symmetric multi-process, but since my code is doing very different
things from this example I cannot even begin to guess where I am going
wrong. I do not even know if what I am trying to do is permissible in
the DPDK API.

It would be very helpful if someone could provide an initialization
checklist for my use case (below).

As explained previously, I have several separately launched processes.
These processes already share a memory region for local communication.
I want all of these processes to have equal ability to read incoming
packets, place pointers to the mbufs in shared memory, and wake each
other up when packets destined for a particular one of these processes
arrive. I have one X550-T2 NIC and I am only using one of the
physical ports. It connects to a second machine which is doing
essentially the same thing, running the same DPDK code.

In summary, all of my processes should be equally able to receive
packets on behalf of each other, and leave pointers to
rx'ed mbufs for each other in shared memory according to which process
the mbuf was destined for. Outbound packets may also be shared with
local peer processes for reading. In order to do this I am also
bumping the mbuf refcount until the peer process has read the mbuf.
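
Concretely, the hand-off looks roughly like this (sketch only; the
helper names are made up):

#include <rte_mbuf.h>

/* Take an extra reference before exposing the mbuf pointer to a peer,
 * so the buffer cannot be recycled until the peer is done reading it. */
static void
share_with_peer(struct rte_mbuf *m)
{
    rte_mbuf_refcnt_update(m, 1);   /* refcnt: n -> n + 1 */
    /* ... publish the mbuf pointer in shared memory, wake the peer ... */
}

/* Called by the peer once it has finished reading the payload. */
static void
peer_done(struct rte_mbuf *m)
{
    rte_pktmbuf_free(m);            /* drops the extra reference */
}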

I already thought I had all of this working fine, but it turns out
that they were all taking turns on the same physical core, and
everything breaks when they are run concurrently on separate cores. I
have seen conflicting information in online threads about the thread
safety of the various DPDK functions that I am using. I tried adding
synchronization around DPDK allocation and tx/rx bursts to no avail.
My code detects weird errors where either mbufs contain unexpected
things (invalid reuse?) or tx bursts start to fail in one of the
processes.

Frankly I also feel very confused about how ports, queues, mempools,
etc work and I suspect that a lot of what I have been reading is
outdated or faulty information.

Any guidance at all would be greatly appreciated!
-Alan

On Tue, Dec 31, 2024 at 12:49 PM Alan Beadle <ab.beadle@gmail.com> wrote:
>
> Hi everyone,
>
> I am working on a multi-process DPDK application. It uses one NIC and
> one port; the separate processes each send as well as receive, and they
> share memory for synchronization and IPC.
>
> I had previously made a mistake in setting up the lcores, and all of
> the processes were assigned to the same physical core. This seems to
> have concealed some DPDK thread safety issues which I am now dealing
> with.
>
> I understand that rte_eth_tx_burst() and rte_eth_rx_burst() are not
> thread safe. Previously I did not have any synchronization around
> these functions. Now that I am successfully using separate cores, I
> have added a shared spinlock around all invocations of these
> functions, as well as around all mbuf frees and allocations.
>
> However, when my code sends a packet, it checks the return value of
> rte_eth_tx_burst() to ensure that the packet was actually sent. If it
> fails to send, my app exits with an error. This was not previously
> happening, but now it happens every time I run it. I thought this was
> due to the lack of synchronization but it is still happening after I
> added the lock. Why would rte_eth_tx_burst() be failing now?
>
> Thank you,
> -Alan


* Re: Multiprocess App Problems with tx_burst
  2025-01-04 16:22 ` Alan Beadle
@ 2025-01-04 18:40   ` Dmitry Kozlyuk
  2025-01-04 19:16     ` Alan Beadle
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Kozlyuk @ 2025-01-04 18:40 UTC (permalink / raw)
  To: Alan Beadle; +Cc: users

2025-01-04 11:22 (UTC-0500), Alan Beadle:
> Hi everyone,
> 
> I'm still stuck on this. Most likely I am doing something wrong in the
> initialization phase. I am trying to follow the standard code example
> for symmetric multi-process, but since my code is doing very different
> things from this example I cannot even begin to guess where I am going
> wrong. I do not even know if what I am trying to do is permissible in
> the DPDK API.
> 
> It would be very helpful if someone could provide an initialization
> checklist for my use case (below).
> 
> As explained previously, I have several separately launched processes.
> These processes already share a memory region for local communication.
> I want all of these processes to have equal ability to read incoming
> packets, place pointers to the mbufs in shared memory, and wake each
> other up when packets destined for a particular one of these processes
> arrive. I have one X550-T2 NIC and I am only using one of the
> physical ports. It connects to a second machine which is doing
> essentially the same thing, running the same DPDK code.
> 
> In summary, all of my processes should be equally able to receive
> packets on behalf of each other, and leave pointers to
> rx'ed mbufs for each other in shared memory according to which process
> the mbuf was destined for. Outbound packets may also be shared with
> local peer processes for reading. In order to do this I am also
> bumping the mbuf refcount until the peer process has read the mbuf.
> 
> I already thought I had all of this working fine, but it turns out
> that they were all taking turns on the same physical core, and
> everything breaks when they are run concurrently on separate cores. I
> have seen conflicting information in online threads about the thread
> safety of the various DPDK functions that I am using. I tried adding
> synchronization around DPDK allocation and tx/rx bursts to no avail.
> My code detects weird errors where either mbufs contain unexpected
> things (invalid reuse?) or tx bursts start to fail in one of the
> processes.
> 
> Frankly I also feel very confused about how ports, queues, mempools,
> etc work and I suspect that a lot of what I have been reading is
> outdated or faulty information.
> 
> Any guidance at all would be greatly appreciated!
> -Alan
> 
> On Tue, Dec 31, 2024 at 12:49 PM Alan Beadle <ab.beadle@gmail.com> wrote:
> >
> > Hi everyone,
> >
> > I am working on a multi-process DPDK application. It uses one NIC and
> > one port; the separate processes each send as well as receive, and they
> > share memory for synchronization and IPC.
> >
> > I had previously made a mistake in setting up the lcores, and all of
> > the processes were assigned to the same physical core. This seems to
> > have concealed some DPDK thread safety issues which I am now dealing
> > with.
> >
> > I understand that rte_eth_tx_burst() and rte_eth_rx_burst() are not
> > thread safe. Previously I did not have any synchronization around
> > these functions. Now that I am successfully using separate cores, I
> > have added a shared spinlock around all invocations of these
> > functions, as well as around all mbuf frees and allocations.
> >
> > However, when my code sends a packet, it checks the return value of
> > rte_eth_tx_burst() to ensure that the packet was actually sent. If it
> > fails to send, my app exits with an error. This was not previously
> > happening, but now it happens every time I run it. I thought this was
> > due to the lack of synchronization but it is still happening after I
> > added the lock. Why would rte_eth_tx_burst() be failing now?
> >
> > Thank you,
> > -Alan  

Hi Alan,

A lot is still unclear, let's start gradually.

It is the queues that are thread-unsafe, not the calls to rte_eth_rx/tx_burst() as such.
You can call rte_eth_rx/tx_burst() concurrently without synchronization
if they operate on different queues.
Typically you assign each lcore to operate on one or more queues,
but no queue to be operated by multiple lcores.
Otherwise you need to synchronize access, which obviously hurts scaling.
Does this hold in your case?
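
For illustration, the usual arrangement is one Rx queue per polling
lcore, something like this (port, queue mapping, and burst size are
placeholders):

#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Each lcore polls its own Rx queue; no lock is needed because no queue
 * is ever touched by more than one lcore. */
static int
rx_loop(void *arg)
{
    const uint16_t port_id = 0;
    const uint16_t queue_id = (uint16_t)(uintptr_t)arg; /* one queue per lcore */
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        uint16_t nb = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++) {
            /* ... hand bufs[i] to the right consumer ... */
            rte_pktmbuf_free(bufs[i]);  /* placeholder: just drop it */
        }
    }
    return 0;
}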

Lcore is a thread to which DPDK can dispatch work.
By default it is pinned to one physical core unless --lcores is used.
What is lcore-to-CPU mapping in your case?

What is the design of your app regarding processes, lcores, and queues?
That is: which process runs which lcores, and which queues do the latter serve?

P.S. Please don't top-post.


* Re: Multiprocess App Problems with tx_burst
  2025-01-04 18:40   ` Dmitry Kozlyuk
@ 2025-01-04 19:16     ` Alan Beadle
  2025-01-04 22:01       ` Dmitry Kozlyuk
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Beadle @ 2025-01-04 19:16 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: users

> It is the queues that are thread-unsafe, not the calls to rte_eth_rx/tx_burst() as such.
> You can call rte_eth_rx/tx_burst() concurrently without synchronization
> if they operate on different queues.
> Typically you assign each lcore to operate on one or more queues,
> but no queue to be operated by multiple lcores.
> Otherwise you need to synchronize access, which obviously hurts scaling.
> Does this hold in your case?

> Lcore is a thread to which DPDK can dispatch work.
> By default it is pinned to one physical core unless --lcores is used.
> What is lcore-to-CPU mapping in your case?
>
> What is the design of your app regarding processes, lcores, and queues?
> That is: which process runs which lcores, and which queues do the latter serve?
>
> P.S. Please don't top-post.

Hi Dmitry,

On one machine I have two processes and on the other there are three
processes. For simplicity we can focus on the former but the only
difference is the number of secondary processes.

It is also worth noting that I am developing a shared library which
uses DPDK internally, rather than the app code directly using DPDK.
Therefore, all of my DPDK command line args for rte_eal_init() are
actually kept in char arrays inside of the library code.
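
Roughly, those embedded arguments look like the sketch below (values,
including --proc-type, are illustrative; the -l cores match the
assignments I describe further down):

#include <rte_eal.h>

/* Sketch of EAL arguments kept in static arrays inside the library. */
static char *daemon_eal_args[] = {
    "mylib", "-l", "2", "--proc-type=primary",
};
static char *server_eal_args[] = {
    "mylib", "-l", "4", "--proc-type=secondary",
};

static int
lib_eal_init(int is_primary)
{
    char **argv = is_primary ? daemon_eal_args : server_eal_args;
    int argc = 4;

    return rte_eal_init(argc, argv);  /* negative return means init failed */
}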

I am setting up one tx queue and one rx queue via the primary process
init code. The code here resembles the "basic forwarding" sample
application (in the skeleton/ subdir). Please let me know whether it
would be possible for each process to use entirely separate queues and
still pass mbuf pointers around.

Before I worry about scaling though, I want correctness first. My
application is more likely to be CPU-bound than network-bound but for
other reasons (this is a research project) I must use user-mode
networking, which is why I am using DPDK.

I will explain the processes and lcores below. First of all though, my
application uses several types of packets. There are handshake packets
(acknacks and the like) and data packets. The data packets are
addressed to specific subsets of my secondary processes (for now just
1 or 2 secondary processes exist per machine, but support for even
more is in principle part of my design). Sometimes the data should
also be read by other peer processes on the same machine (including
the daemon/primary process) so I chose to make the mbufs readable
instead of allocating a duplicate local buffer. It is important that
mbuf pointers from one process will work in the others. Otherwise all
of my data would need to be duplicated into non-dpdk shared buffers
too.

The first process is the "daemon". This is the primary process. It
uses DPDK through my shared library (which uses DPDK internally, as
explained above). The daemon just polls the NIC and periodically
cleans up my non-DPDK data structures in shared memory. The intent is
to rely on the daemon to watch for packets during periods of low
activity and avoid unnecessary CPU usage. When a packet arrives it can
wake the correct secondary process by finding a futex in shared memory
for that process. On both machines the daemon is mapped to core 2 with
the parameter "-l 2".

The second process is the "server". It uses separate app code from the
daemon but calls into the same library. Like the daemon, it receives
and parses packets. The server can originate new data packets, and can
also reply to inbound data packets with more data packets to be sent
back to processes on the other machine. It sleeps on a shared futex
during periods of inactivity. If there were additional secondary
processes (as is the case on the other machine) it could wake them
when packets arrive for those other processes, again using futexes in
shared memory. On both machines this second process is mapped to core
4 with the parameter "-l 4".

The other machine has another secondary process (a third process)
which is on core 6 with "-l 6". For the purposes of this discussion,
it behaves similarly to the server process above (sends, receives, and
sometimes sleeps).

Thank you,
-Alan


* Re: Multiprocess App Problems with tx_burst
  2025-01-04 19:16     ` Alan Beadle
@ 2025-01-04 22:01       ` Dmitry Kozlyuk
  2025-01-05 16:01         ` Alan Beadle
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Kozlyuk @ 2025-01-04 22:01 UTC (permalink / raw)
  To: Alan Beadle; +Cc: users

2025-01-04 14:16 (UTC-0500), Alan Beadle:
> I am setting up one tx queue and one rx queue via the primary process
> init code. The code here resembles the "basic forwarding" sample
> application (in the skeleton/ subdir). Please let me know whether it
> would be possible for each process to use entirely separate queues and
> still pass mbuf pointers around.

It is possible. Mbufs are in memory shared by all processes.
For example, the primary process can set up 2 Rx queues:
to use queue 0 by itself, and to let the secondary process use queue 1.
Steering incoming packets to the right queue (or set of queues)
would be another question, however.
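
A minimal sketch of that primary-side setup (descriptor counts and
error handling are placeholders):

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

#define RX_DESC 1024
#define TX_DESC 1024

/* Primary process: 2 Rx queues so the primary polls queue 0 and the
 * secondary polls queue 1. */
static int
primary_port_init(uint16_t port_id, struct rte_mempool *pool)
{
    struct rte_eth_conf conf = {0};
    int socket = rte_eth_dev_socket_id(port_id);
    int ret;

    ret = rte_eth_dev_configure(port_id, 2 /* Rx */, 1 /* Tx */, &conf);
    if (ret < 0)
        return ret;

    if ((ret = rte_eth_rx_queue_setup(port_id, 0, RX_DESC, socket, NULL, pool)) < 0 ||
        (ret = rte_eth_rx_queue_setup(port_id, 1, RX_DESC, socket, NULL, pool)) < 0 ||
        (ret = rte_eth_tx_queue_setup(port_id, 0, TX_DESC, socket, NULL)) < 0)
        return ret;

    return rte_eth_dev_start(port_id);
}

The secondary process would then poll its own queue with
rte_eth_rx_burst(port_id, 1, ...), with no lock needed.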

> I will explain the processes and lcores below. First of all though, my
> application uses several types of packets. There are handshake packets
> (acknacks and the like) and data packets. The data packets are
> addressed to specific subsets of my secondary processes (for now just
> 1 or 2 secondary processes exist per machine, but support for even
> more is in principle part of my design). Sometimes the data should
> also be read by other peer processes on the same machine (including
> the daemon/primary process) so I chose to make the mbufs readable
> instead of allocating a duplicate local buffer. It is important that
> mbuf pointers from one process will work in the others. Otherwise all
> of my data would need to be duplicated into non-dpdk shared buffers
> too.
> 
> The first process is the "daemon". This is the primary process. It
> uses DPDK through my shared library (which uses DPDK internally, as
> explained above). The daemon just polls the NIC and periodically
> cleans up my non-DPDK data structures in shared memory. The intent is
> to rely on the daemon to watch for packets during periods of low
> activity and avoid unnecessary CPU usage. When a packet arrives it can
> wake the correct secondary process by finding a futex in shared memory
> for that process. On both machines the daemon is mapped to core 2 with
> the parameter "-l 2".
> 
> The second process is the "server". It uses separate app code from the
> daemon but calls into the same library. Like the daemon, it receives
> and parses packets. The server can originate new data packets, and can
> also reply to inbound data packets with more data packets to be sent
> back to processes on the other machine. It sleeps on a shared futex
> during periods of inactivity. If there were additional secondary
> processes (as is the case on the other machine) it could wake them
> when packets arrive for those other processes, again using futexes in
> shared memory. On both machines this second process is mapped to core
> 4 with the parameter "-l 4".
> 
> The other machine has another secondary process (a third process)
> which is on core 6 with "-l 6". For the purposes of this discussion,
> it behaves similarly to the server process above (sends, receives, and
> sometimes sleeps).

So, "daemon" and "server" may try using the same queue sometimes, correct?
Synchronizing all access to the single queue should work in this case.

BTW, rte_eth_tx_burst() returning >0 does not mean the packets have been sent.
It only means they have been enqueued for sending.
At some point the NIC will complete sending,
and only then can the PMD free the mbuf (or decrement its reference count).
For most PMDs, this happens on a subsequent call to rte_eth_tx_burst().
Which PMD and HW is it?
Have you tried to print as many stats as possible when rte_eth_tx_burst()
can't consume all packets (rte_eth_stats_get(), rte_eth_xstats_get())?
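
For illustration, a dump along these lines (sketch, error handling trimmed):

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <rte_ethdev.h>

/* Print basic and extended stats for a port, e.g. right after
 * rte_eth_tx_burst() returns fewer packets than requested. */
static void
dump_port_stats(uint16_t port_id)
{
    struct rte_eth_stats st;
    struct rte_eth_xstat *xs;
    struct rte_eth_xstat_name *names;
    int n, i;

    if (rte_eth_stats_get(port_id, &st) == 0)
        printf("ipackets:%" PRIu64 " opackets:%" PRIu64
               " ierrors:%" PRIu64 " oerrors:%" PRIu64 "\n",
               st.ipackets, st.opackets, st.ierrors, st.oerrors);

    n = rte_eth_xstats_get(port_id, NULL, 0);    /* ask for required size */
    if (n <= 0)
        return;
    xs = malloc(sizeof(*xs) * n);
    names = malloc(sizeof(*names) * n);
    if (xs != NULL && names != NULL &&
        rte_eth_xstats_get(port_id, xs, n) == n &&
        rte_eth_xstats_get_names(port_id, names, n) == n) {
        for (i = 0; i < n; i++)
            if (xs[i].value != 0)
                printf("%s: %" PRIu64 "\n", names[i].name, xs[i].value);
    }
    free(xs);
    free(names);
}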


* Re: Multiprocess App Problems with tx_burst
  2025-01-04 22:01       ` Dmitry Kozlyuk
@ 2025-01-05 16:01         ` Alan Beadle
  2025-01-06 16:05           ` Alan Beadle
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Beadle @ 2025-01-05 16:01 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: users

> So, "daemon" and "server" may try using the same queue sometimes, correct?
> Synchronizing all access to the single queue should work in this case.

That is correct.

> BTW, rte_eth_tx_burst() returning >0 does not mean the packets have been sent.
> It only means they have been enqueued for sending.
> At some point the NIC will complete sending,
> and only then can the PMD free the mbuf (or decrement its reference count).
> For most PMDs, this happens on a subsequent call to rte_eth_tx_burst().
> Which PMD and HW is it?

Here is the output of 'dpdk-devbind.py --status':

Network devices using DPDK-compatible driver
============================================
0000:65:00.1 'Ethernet Controller 10G X550T 1563' drv=vfio-pci unused=uio_pci_generic


> Have you tried to print as many stats as possible when rte_eth_tx_burst()
> can't consume all packets (rte_eth_stats_get(), rte_eth_xstats_get())?

In setting this up, I discovered that this error only occurs when the
primary process on the other host exits (due to an error) or is not
initially running (the NIC is "down" in this case?). It happens
consistently when I only launch the processes on one of the two
machines. ***But*** counterintuitively, it looks like packets are
successfully "sent" by the daemon until the other process begins to
run. In case it is useful, I summarize the stats for this case below.

Note that I am also seeing another error. Sometimes, rather than tx
failing, my app detects incorrect/corrupted mbuf contents and exits
immediately. It appears that mbufs are being re-allocated when they
should not be. I thought I had finally solved this (see my earlier
threads) but with multi-core concurrency this problem has returned. It
is very possible that this error is somewhere in my own library code,
as it looks like the accompanying non-DPDK structures are also being
corrupted (probably first).

For background, I maintain a hash table of header structs to track
individual mbufs. The sequence numbers in the headers should match
those contained in the mbuf's payload. This check is failing after a
few hundred successful data messages have been exchanged between the
hosts. The sequence number in the mbuf shows that it is in the wrong
hash bucket, and the sequence number in the header is a large
corrupted value which is out of range for my sequence numbers (and
also not matching the bucket).

Back to the issue of failed tx bursts: Here are the stats I observed
after a packet failed to send from the daemon (after only launching
the primary+secondary processes on one of the machines). This failure
occurred after the daemon had successfully "sent" hundreds of
handshake packets (to nowhere, presumably?), and it happened as soon
as the second process had finished initialization:

ipackets:0, opackets:0, ibytes:0, obytes:0, ierrors:0, oerrors:0
Got 146 xstats
Port:0, tx_q0_packets:1138
Port:0, tx_q0_bytes:125180
Port:0, mac_local_errors:2
Port:0, out_pkts_untagged:5
(All other stats had a value of 0 and are omitted).

I will continue investigating the corruption bug in the (likely) case
that it is in my library code. In the meantime please let me know if I
am using DPDK incorrectly. Thank you again!
-Alan


* Re: Multiprocess App Problems with tx_burst
  2025-01-05 16:01         ` Alan Beadle
@ 2025-01-06 16:05           ` Alan Beadle
  2025-01-06 20:10             ` Dmitry Kozlyuk
  0 siblings, 1 reply; 9+ messages in thread
From: Alan Beadle @ 2025-01-06 16:05 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: users

> Note that I am also seeing another error. Sometimes, rather than tx
> failing, my app detects incorrect/corrupted mbuf contents and exits
> immediately. It appears that mbufs are being re-allocated when they
> should not be. I thought I had finally solved this (see my earlier
> threads) but with multi-core concurrency this problem has returned. It
> is very possible that this error is somewhere in my own library code,
> as it looks like the accompanying non-DPDK structures are also being
> corrupted (probably first).
>
> For background, I maintain a hash table of header structs to track
> individual mbufs. The sequence numbers in the headers should match
> those contained in the mbuf's payload. This check is failing after a
> few hundred successful data messages have been exchanged between the
> hosts. The sequence number in the mbuf shows that it is in the wrong
> hash bucket, and the sequence number in the header is a large
> corrupted value which is out of range for my sequence numbers (and
> also not matching the bucket).
>

There is definitely something going wrong with the mbuf allocator.
Each run results in such different errors that it is difficult to add
instrumentation for a specific one, but one frequent error is that a
newly allocated mbuf already has a refcnt of 2, and contains data that
I am still using elsewhere. At each call to rte_pktmbuf_alloc() (with
locks around it) I immediately do a rte_mbuf_refcnt_read() and ensure
that it is 1. Sometimes it is 2. This should never occur and I believe
it proves that DPDK is not working as expected here for some reason.
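
The check in question is essentially this (sketch; the lock and
mempool are whatever my library already uses):

#include <stdlib.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>
#include <rte_spinlock.h>

extern rte_spinlock_t *alloc_lock;

/* Allocate an mbuf and verify that it comes back with refcnt == 1. */
static struct rte_mbuf *
checked_alloc(struct rte_mempool *pool)
{
    struct rte_mbuf *m;

    rte_spinlock_lock(alloc_lock);
    m = rte_pktmbuf_alloc(pool);
    rte_spinlock_unlock(alloc_lock);

    if (m != NULL && rte_mbuf_refcnt_read(m) != 1)
        abort();  /* freshly allocated mbufs must have refcnt 1, not 2 */
    return m;
}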

-Alan


* Re: Multiprocess App Problems with tx_burst
  2025-01-06 16:05           ` Alan Beadle
@ 2025-01-06 20:10             ` Dmitry Kozlyuk
  2025-01-06 20:34               ` Alan Beadle
  0 siblings, 1 reply; 9+ messages in thread
From: Dmitry Kozlyuk @ 2025-01-06 20:10 UTC (permalink / raw)
  To: Alan Beadle; +Cc: users

2025-01-06 11:05 (UTC-0500), Alan Beadle:
> There is definitely something going wrong with the mbuf allocator.
> Each run results in such different errors that it is difficult to add
> instrumentation for a specific one, but one frequent error is that a
> newly allocated mbuf already has a refcnt of 2, and contains data that
> I am still using elsewhere.
> At each call to rte_pktmbuf_alloc() (with locks around it)
> I immediately do a rte_mbuf_refcnt_read() and ensure
> that it is 1. Sometimes it is 2. This should never occur and I believe
> it proves that DPDK is not working as expected here for some reason.

I suspect that mbufs in use are put into mempool somehow.
Which functions do you use to free mbufs to the pool
on processing paths that do not end with `rte_eth_tx_burst()`?
You can build DPDK with `-Dc_args='-DRTE_LIBRTE_MBUF_DEBUG'`
to enable debug checks in the library.

Unless `RTE_MEMPOOL_F_SC_GET` is used, `rte_pktmbuf_alloc()` is thread-safe.
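
For illustration, the check that this define enables on the mbuf
alloc/free paths can also be called directly from application code; it
panics if the mbuf looks inconsistent, e.g. a zero refcnt (sketch):

#include <rte_mbuf.h>

/* Explicit audit of an mbuf; rte_panic()s on inconsistency. */
static void
audit_mbuf(const struct rte_mbuf *m)
{
    rte_mbuf_sanity_check(m, 1 /* is_header: m is the first segment */);
}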



* Re: Multiprocess App Problems with tx_burst
  2025-01-06 20:10             ` Dmitry Kozlyuk
@ 2025-01-06 20:34               ` Alan Beadle
  0 siblings, 0 replies; 9+ messages in thread
From: Alan Beadle @ 2025-01-06 20:34 UTC (permalink / raw)
  To: Dmitry Kozlyuk; +Cc: users

> I suspect that mbufs in use are put into mempool somehow.
> Which functions do you use to free mbufs to the pool
> on processing paths that do not end with `rte_eth_tx_burst()`?
> You can build DPDK with `-Dc_args='-DRTE_LIBRTE_MBUF_DEBUG'`
> to enable debug checks in the library.

I am using rte_pktmbuf_free(). My understanding is that it decrements
the refcount, and unless it really reaches 0, will not free it back to
the mempool. This is the only function that I ever use to decrement
the refcount of any mbuf, or to free them.
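
Combined with the earlier point that the PMD only drops its reference
after the NIC completes transmission, the accounting I am relying on
is roughly this (sketch):

#include <rte_mbuf.h>

/*
 * Expected reference accounting for a transmitted mbuf that is also
 * shared with a local peer:
 *
 *   rte_pktmbuf_alloc(pool)         refcnt == 1
 *   rte_mbuf_refcnt_update(m, 1)    refcnt == 2   (peer will read it)
 *   rte_eth_tx_burst(...)           enqueued; the PMD drops one reference
 *                                   only after the NIC completes, usually
 *                                   on a later tx_burst call
 *   rte_pktmbuf_free(m)  (peer)     drops the other reference; whichever
 *                                   side is last returns the mbuf to the pool
 */
static void
peer_release(struct rte_mbuf *m)
{
    rte_pktmbuf_free(m);   /* frees to the mempool only when refcnt hits 0 */
}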

I think I might have found an error in my library's heap allocator
which can result in duplicate references to copies of the same data
header, which are supposed to be different headers. This could explain
some of my problems if the above usage is correct. I will continue
investigating this possibility. One thing gives me doubt. My code
includes frequent checks of all refcounts, so if it was accessing
something with a refcount of 0 then it should detect that, and so far
it has not.

I'll also try those debug checks. If they can detect a double free,
that would be helpful evidence.

Thank you,
-Alan

