* rte_eth_dev_rss_reta_update() locking considerations?
@ 2025-07-15 16:15 Scott Wasson
2025-07-15 17:30 ` Ivan Malov
2025-07-15 21:40 ` Stephen Hemminger
0 siblings, 2 replies; 5+ messages in thread
From: Scott Wasson @ 2025-07-15 16:15 UTC (permalink / raw)
To: users
Hi,
We're using multiqueue, and RSS doesn't always balance the load very well. I had a clever idea to periodically measure the load distribution (cpu load on the IO cores) in the background pthread, and use rte_eth_dev_rss_reta_update() to adjust the redirection table dynamically if the imbalance exceeds a given threshold. In practice it seems to work nicely. But I'm concerned about:
https://doc.dpdk.org/api/rte__ethdev_8h.html#a3c1540852c9cf1e576a883902c2e310d
Which states:
By default, all the functions of the Ethernet Device API exported by a PMD are lock-free functions which assume to not be invoked in parallel on different logical cores to work on the same target object. For instance, the receive function of a PMD cannot be invoked in parallel on two logical cores to poll the same Rx queue [of the same port]. Of course, this function can be invoked in parallel by different logical cores on different Rx queues. It is the responsibility of the upper level application to enforce this rule.
In this context, what is the "target object"? The queue_id of the port? Or the port itself? Would I need to add port-level spinlocks around every invocation of rte_eth_dev_*()? That's a hard no, it would destroy performance.
Alternatively, if I were to periodically call rte_eth_dev_rss_reta_update() from the IO cores instead of the background core, as the above paragraph suggests, that doesn't seem correct either. The function takes a reta_conf[] array that affects all RETA entries for that port and maps them to a queue_id. Is it safe to remap RETA entries for a given port on one IO core while another IO core is potentially reading from its rx queue for that same port? That problem seems not much different from remapping in the background core as I am now.
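For reference, here is a stripped-down sketch of what our periodic update amounts to (the function name and the round-robin fill are only illustrative; the real code derives the queue mapping from the measured per-core load):

#include <errno.h>
#include <string.h>
#include <rte_ethdev.h>

/* Sketch: rewrite the whole RETA of port_id, spreading entries across
 * nb_queues. A real implementation would build the mapping from the
 * measured per-core load instead of round-robin. */
static int
rebalance_reta(uint16_t port_id, uint16_t nb_queues)
{
    struct rte_eth_dev_info dev_info;
    int ret = rte_eth_dev_info_get(port_id, &dev_info);

    if (ret != 0)
        return ret;
    if (dev_info.reta_size == 0)
        return -ENOTSUP;

    uint16_t reta_size = dev_info.reta_size;   /* e.g. 128 or 512 */
    struct rte_eth_rss_reta_entry64 reta_conf[reta_size / RTE_ETH_RETA_GROUP_SIZE];

    memset(reta_conf, 0, sizeof(reta_conf));
    for (uint16_t i = 0; i < reta_size; i++) {
        uint16_t grp = i / RTE_ETH_RETA_GROUP_SIZE;
        uint16_t pos = i % RTE_ETH_RETA_GROUP_SIZE;

        reta_conf[grp].mask |= UINT64_C(1) << pos;   /* update this entry */
        reta_conf[grp].reta[pos] = i % nb_queues;    /* target Rx queue */
    }

    return rte_eth_dev_rss_reta_update(port_id, reta_conf, reta_size);
}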
I'm starting to suspect this function was intended to be initialized once on startup before rte_eth_dev_start(), and/or the ports must be stopped before calling it. If that's the case, then I'll call this idea too clever by half and give it up now.
Thanks in advance for your help!
-Scott
* Re: rte_eth_dev_rss_reta_update() locking considerations?
2025-07-15 16:15 rte_eth_dev_rss_reta_update() locking considerations? Scott Wasson
@ 2025-07-15 17:30 ` Ivan Malov
2025-07-15 21:40 ` Stephen Hemminger
1 sibling, 0 replies; 5+ messages in thread
From: Ivan Malov @ 2025-07-15 17:30 UTC (permalink / raw)
To: Scott Wasson; +Cc: users, Tom Barbette
Hi Scott,
On Tue, 15 Jul 2025, Scott Wasson wrote:
>
> Hi,
>
> We’re using multiqueue, and RSS doesn’t always balance the load very well. I had a clever idea to periodically measure the load distribution (cpu load on the IO cores) in the
> background pthread, and use rte_eth_dev_rss_reta_update() to adjust the redirection table dynamically if the imbalance exceeds a given threshold. In practice it seems to work nicely.
As far as I remember, there has already been an academic project [1] that would
do (almost) the same thing: dynamically reprogram the table based on the current
load status. I vaguely remember mentions of the DPDK RETA update in the video [2], so
perhaps Tom can shed some light on how the API is invoked lock-wise (Cc Tom).
[1] https://dejankosticgithub.github.io/documents/publications/rsspp-conext19.pdf
[2] https://www.youtube.com/watch?v=YV3aJOxjUqI
> But I’m concerned about:
>
> https://doc.dpdk.org/api/rte__ethdev_8h.html#a3c1540852c9cf1e576a883902c2e310d
>
> Which states:
>
>
> By default, all the functions of the Ethernet Device API exported by a PMD are lock-free functions which assume to not be invoked in parallel on different logical cores to work on the
> same target object. For instance, the receive function of a PMD cannot be invoked in parallel on two logical cores to poll the same Rx queue [of the same port]. Of course, this function
> can be invoked in parallel by different logical cores on different Rx queues. It is the responsibility of the upper level application to enforce this rule.
>
> In this context, what is the “target object”? The queue_id of the port? Or the port itself? Would I need to add port-level spinlocks around every invocation of rte_eth_dev_*()?
> That’s a hard no, it would destroy performance.
My guess is that it mostly refers to the queue receive/transmit operations, so
the target object might be the DMA queue (Rx/Tx) of the device in question. As
for the control-plane APIs in 'ethdev', I would imagine the typical usage of
these was meant to be via a dedicated control-plane core. If one can gather
load statistics from the IO workers in some clever message-oriented way, so
that the main (control) lcore can read them periodically and invoke the RETA
update, that would ideally not create contention on the control-plane lock of
the port. Moreover, for the RETA update API in particular, I guess many
drivers have an implicit/internal port lock in place, so it is going to be
leveraged anyway.
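Roughly along these lines, as a sketch only (the per-lcore counters, the 2x threshold and the function names are placeholders for whatever policy the application already has; the actual RETA update would then be issued from this same control lcore):

#include <stdbool.h>
#include <stdint.h>
#include <rte_common.h>
#include <rte_lcore.h>

/* Per-worker load counters: IO lcores add the cycles they spend on real
 * work; the control lcore samples and resets them periodically. */
static uint64_t busy_cycles[RTE_MAX_LCORE];

/* Called from the IO worker poll loop. */
static inline void
account_busy(uint64_t cycles)
{
    __atomic_fetch_add(&busy_cycles[rte_lcore_id()], cycles, __ATOMIC_RELAXED);
}

/* Placeholder policy: flag an imbalance when the busiest worker did more
 * than twice the work of the least busy one. */
static bool
load_is_imbalanced(const uint64_t *load)
{
    uint64_t lo = UINT64_MAX, hi = 0;
    unsigned int lcore;

    RTE_LCORE_FOREACH_WORKER(lcore) {
        lo = RTE_MIN(lo, load[lcore]);
        hi = RTE_MAX(hi, load[lcore]);
    }
    return hi > 2 * lo;
}

/* Called periodically on the main (control) lcore: sample the counters and
 * report whether a RETA update is warranted; the update itself (e.g. via
 * rte_eth_dev_rss_reta_update()) would then be done right here, on this
 * single lcore, so the control path is never touched concurrently. */
static bool
control_tick(uint64_t load[RTE_MAX_LCORE])
{
    unsigned int lcore;

    RTE_LCORE_FOREACH_WORKER(lcore)
        load[lcore] = __atomic_exchange_n(&busy_cycles[lcore], 0,
                                          __ATOMIC_RELAXED);
    return load_is_imbalanced(load);
}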
>
> Alternatively, if I were to periodically call rte_eth_dev_rss_reta_update() from the IO cores instead of the background core, as the above paragraph suggests, that doesn’t seem correct
> either. The function takes a reta_conf[] array that affects all RETA entries for that port and maps them to a queue_id. Is it safe to remap RETA entries for a given port on one IO core
> while another IO core is potentially reading from its rx queue for that same port? That problem seems not much different from remapping in the background core as I am now.
That may not be desirable for the reason explained above: implicit port locks
in vendor-specific implementations of the control-plane APIs like RETA update.
Regarding the question on whether it is safe to reprogram RETA "on the fly" (we
may assume this is done from the main/control lcore): it was my impression that
doing so would affect some HW component sitting deep in the NIC that distributes
packets across DMA queues (which are one level above), not the queues per se,
which just contain packets that have been distributed so far. Reprogramming RSS
in general might not have that much to do with DMA queues, if I'm not mistaken.
>
> I’m starting to suspect this function was intended to be initialized once on startup before rte_eth_dev_start(), and/or the ports must be stopped before calling it. If that’s the case,
> then I’ll call this idea too clever by half and give it up now.
Not really. There are PMDs that support RETA update in the started state. It
should be fine to invoke this API from a single control-plane core. Why not?
If the device does not support it in started state, it is the duty of the driver
to either remember the table to be set on next port start or return an error.
Also, while the RETA update API of 'ethdev' is meant to control the "global"
RSS setting, there is also the RTE flow API action 'RSS' [3], which can be
shared among multiple flow rule objects by means of a container action,
'INDIRECT' [4]. For example, one can create a flow rule generic enough to
target a wide subset of flows (or multiple flow rules targeting the same
'INDIRECT' RSS action) and then use an update API [5] to update specifically
this shared action. That being said, once again, this update is better done
from some central place (the main lcore), as even if the PMD says it supports
the thread-safe flow API (by setting flag [6]), this may just mean that the
driver uses locks internally. However, there might be vendors who support the
so-called "asynchronous flow API", which uses slightly different flow
management APIs ([7], for instance); those are clearly designed to be invoked
by IO workers when they see interesting traffic and either need to insert new
flows or update some shared actions "on the fly".
[3] https://doc.dpdk.org/api-25.03/rte__flow_8h.html#a78f0386e683cfc491462a771df8b971aa72428c7c1896fe4dfdc2dbed85214d27
[4] https://doc.dpdk.org/api-25.03/rte__flow_8h.html#a78f0386e683cfc491462a771df8b971aa47ea41707def29ff416e233434ab33a6
[5] https://doc.dpdk.org/api-25.03/rte__flow_8h.html#aea5b96385043898923f3b1690a72d2c0
[6] https://doc.dpdk.org/api-25.03/rte__ethdev_8h.html#a3c1540852c9cf1e576a883902c2e310d
[7] https://doc.dpdk.org/api-25.03/rte__flow_8h.html#a5097c64396d74102d8f2ae119c9dc7d5
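Purely for illustration, a rough sketch of the indirect RSS approach described above (the catch-all pattern, RSS types, queue list and function names are made up; the exact 'update' payload a driver expects is worth double-checking, testpmd for instance passes the full action for RSS):

#include <rte_ethdev.h>
#include <rte_flow.h>
#include <rte_errno.h>

/* One shared (indirect) RSS action, attached to a catch-all rule; the
 * control lcore can later retarget the queue set through the handle. */
static struct rte_flow_action_handle *rss_handle;

static int
indirect_rss_setup(uint16_t port_id, const uint16_t *queues, uint16_t nb_q)
{
    struct rte_flow_error err;
    struct rte_flow_action_rss rss = {
        .types = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP,
        .queue_num = nb_q,
        .queue = queues,
    };
    struct rte_flow_action rss_act = {
        .type = RTE_FLOW_ACTION_TYPE_RSS, .conf = &rss,
    };
    struct rte_flow_indir_action_conf iconf = { .ingress = 1 };

    rss_handle = rte_flow_action_handle_create(port_id, &iconf, &rss_act, &err);
    if (rss_handle == NULL)
        return -rte_errno;

    /* Catch-all rule steering all ingress traffic through the shared action. */
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_INDIRECT, .conf = rss_handle },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(port_id, &attr, pattern, actions, &err) != NULL ?
           0 : -rte_errno;
}

/* Later, from the same control lcore: retarget the shared RSS action. */
static int
indirect_rss_retarget(uint16_t port_id, const uint16_t *queues, uint16_t nb_q)
{
    struct rte_flow_error err;
    struct rte_flow_action_rss rss = {
        .types = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP | RTE_ETH_RSS_UDP,
        .queue_num = nb_q,
        .queue = queues,
    };
    struct rte_flow_action rss_act = {
        .type = RTE_FLOW_ACTION_TYPE_RSS, .conf = &rss,
    };

    return rte_flow_action_handle_update(port_id, rss_handle, &rss_act, &err);
}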
Thank you.
>
> Thanks in advance for your help!
>
> -Scott
>
* Re: rte_eth_dev_rss_reta_update() locking considerations?
2025-07-15 16:15 rte_eth_dev_rss_reta_update() locking considerations? Scott Wasson
2025-07-15 17:30 ` Ivan Malov
@ 2025-07-15 21:40 ` Stephen Hemminger
2025-07-23 12:22 ` Tom Barbette
1 sibling, 1 reply; 5+ messages in thread
From: Stephen Hemminger @ 2025-07-15 21:40 UTC (permalink / raw)
To: Scott Wasson; +Cc: users
On Tue, 15 Jul 2025 16:15:22 +0000
Scott Wasson <swasson@microsoft.com> wrote:
> Hi,
>
> We're using multiqueue, and RSS doesn't always balance the load very well. I had a clever idea to periodically measure the load distribution (cpu load on the IO cores) in the background pthread, and use rte_eth_dev_rss_reta_update() to adjust the redirection table dynamically if the imbalance exceeds a given threshold. In practice it seems to work nicely. But I'm concerned about:
>
> https://doc.dpdk.org/api/rte__ethdev_8h.html#a3c1540852c9cf1e576a883902c2e310d
>
> Which states:
>
> By default, all the functions of the Ethernet Device API exported by a PMD are lock-free functions which assume to not be invoked in parallel on different logical cores to work on the same target object. For instance, the receive function of a PMD cannot be invoked in parallel on two logical cores to poll the same Rx queue [of the same port]. Of course, this function can be invoked in parallel by different logical cores on different Rx queues. It is the responsibility of the upper level application to enforce this rule.
>
> In this context, what is the "target object"? The queue_id of the port? Or the port itself? Would I need to add port-level spinlocks around every invocation of rte_eth_dev_*()? That's a hard no, it would destroy performance.
>
> Alternatively, if I were to periodically call rte_eth_dev_rss_reta_update() from the IO cores instead of the background core, as the above paragraph suggests, that doesn't seem correct either. The function takes a reta_conf[] array that affects all RETA entries for that port and maps them to a queue_id. Is it safe to remap RETA entries for a given port on one IO core while another IO core is potentially reading from its rx queue for that same port? That problem seems not much different from remapping in the background core as I am now.
>
> I'm starting to suspect this function was intended to be initialized once on startup before rte_eth_dev_start(), and/or the ports must be stopped before calling it. If that's the case, then I'll call this idea too clever by half and give it up now.
>
> Thanks in advance for your help!
>
> -Scott
>
There is no locking in the driver control path.
It is expected that the application will manage access to the control path (RSS being one example)
so that only one thread modifies the PMD.
* Re: rte_eth_dev_rss_reta_update() locking considerations?
2025-07-15 21:40 ` Stephen Hemminger
@ 2025-07-23 12:22 ` Tom Barbette
2025-07-23 12:42 ` Ivan Malov
0 siblings, 1 reply; 5+ messages in thread
From: Tom Barbette @ 2025-07-23 12:22 UTC (permalink / raw)
To: Stephen Hemminger, Scott Wasson; +Cc: users
Hi all,
As Ivan mentioned, this is exactly what we did in RSS++.
Regarding the concern about reprogramming RSS "live": it depends on the NIC. I remember the Intel card we used could use the "global" API just fine. For the Mellanox cards we had to use the rte_flow RSS action, as reprogramming the global RETA table would lead to a (partial?) device restart and the loss of many packets. We had to play with priorities and prefixes, but rte_flow and mlx5 support has evolved since then, so it might be a bit simpler now, perhaps just using priorities and groups.
The biggest challenge was the state, as written in the paper. We ended up using the rte_flow rules anyway, so we could use an epoch "mark" action that marks the version of the distribution table and allows efficient hand-off of flow state from one core to another.
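Roughly, the per-epoch rule looks like this as a sketch (the pattern, the RSS conf argument and the function names are illustrative, and the mark is only visible on the mbuf if the NIC actually reports it):

#include <rte_flow.h>
#include <rte_mbuf.h>

/* Sketch of the epoch "mark": each rule tags packets with the version of
 * the distribution table that steered them, so the receiving core can tell
 * old-table packets from new-table ones while flow state is migrated. */
static struct rte_flow *
install_epoch_rule(uint16_t port_id, uint32_t epoch,
                   const struct rte_flow_action_rss *rss)
{
    struct rte_flow_error err;
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    struct rte_flow_action_mark mark = { .id = epoch };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_RSS,  .conf = rss },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(port_id, &attr, pattern, actions, &err);
}

/* Rx path: read the epoch back when the driver reports the mark. */
static inline uint32_t
pkt_epoch(const struct rte_mbuf *m)
{
    return (m->ol_flags & RTE_MBUF_F_RX_FDIR_ID) ? m->hash.fdir.hi : 0;
}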
The code of RSS++ is still a bit coupled to FastClick, but it was mostly separated out already, here: https://github.com/tbarbette/fastclick/tree/main/vendor/nicscheduler
We also had a version for the Linux Kernel with XDP for counting.
We can chat about that if you want.
NB: my address has changed, I'm not at KTH anymore.
Cheers,
Tom
From: Stephen Hemminger <stephen@networkplumber.org>
Date: Tuesday, 15 July 2025 at 23:40
To: Scott Wasson <swasson@microsoft.com>
Cc: users@dpdk.org <users@dpdk.org>
Subject: Re: rte_eth_dev_rss_reta_update() locking considerations?
On Tue, 15 Jul 2025 16:15:22 +0000
Scott Wasson <swasson@microsoft.com> wrote:
> Hi,
>
> We're using multiqueue, and RSS doesn't always balance the load very well. I had a clever idea to periodically measure the load distribution (cpu load on the IO cores) in the background pthread, and use rte_eth_dev_rss_reta_update() to adjust the redirection table dynamically if the imbalance exceeds a given threshold. In practice it seems to work nicely. But I'm concerned about:
>
> https://doc.dpdk.org/api/rte__ethdev_8h.html#a3c1540852c9cf1e576a883902c2e310d
>
> Which states:
>
> By default, all the functions of the Ethernet Device API exported by a PMD are lock-free functions which assume to not be invoked in parallel on different logical cores to work on the same target object. For instance, the receive function of a PMD cannot be invoked in parallel on two logical cores to poll the same Rx queue [of the same port]. Of course, this function can be invoked in parallel by different logical cores on different Rx queues. It is the responsibility of the upper level application to enforce this rule.
>
> In this context, what is the "target object"? The queue_id of the port? Or the port itself? Would I need to add port-level spinlocks around every invocation of rte_eth_dev_*()? That's a hard no, it would destroy performance.
>
> Alternatively, if I were to periodically call rte_eth_dev_rss_reta_update() from the IO cores instead of the background core, as the above paragraph suggests, that doesn't seem correct either. The function takes a reta_conf[] array that affects all RETA entries for that port and maps them to a queue_id. Is it safe to remap RETA entries for a given port on one IO core while another IO core is potentially reading from its rx queue for that same port? That problem seems not much different from remapping in the background core as I am now.
>
> I'm starting to suspect this function was intended to be initialized once on startup before rte_eth_dev_start(), and/or the ports must be stopped before calling it. If that's the case, then I'll call this idea too clever by half and give it up now.
>
> Thanks in advance for your help!
>
> -Scott
>
There is no locking in driver path for control.
It is expected that application will manage access to control path (RSS being one example)
so that only one thread modifies the PMD.
* Re: rte_eth_dev_rss_reta_update() locking considerations?
2025-07-23 12:22 ` Tom Barbette
@ 2025-07-23 12:42 ` Ivan Malov
0 siblings, 0 replies; 5+ messages in thread
From: Ivan Malov @ 2025-07-23 12:42 UTC (permalink / raw)
To: Tom Barbette; +Cc: Stephen Hemminger, Scott Wasson, users
Hi Tom,
On Wed, 23 Jul 2025, Tom Barbette wrote:
>
> Hi all,
>
> As Ivan mentioned, this is exactly what we did in RSS++.
>
> For the concern about RSS reprogramming in « live », it depends on the NIC. I remember the Intel card we used could use the “global” API just fine. For the Mellanox cards we had to use
> the rte_flow RSS action as reprogramming the global RETA table would lead to a (partial ?) device restart and would lead to the loss of many packets. We had to play with priority and
Valid point, indeed. So some drivers, just as with the MTU update in the
started state, may need an internal port restart. Thanks for clarifying this.
> prefixes, but
>
> rte_flow and mlx5 support has evolved since then, it might be a bit simpler, just using priorities and groups maybe.
>
> The biggest challenge was the state, as written in the paper. We ended up with using the rte_flow rules anyway so we can use an epoch “mark” action that marks the version of the
> distribution table and allow an efficient passing of the state of flows going from one core to another.
>
> The code of RSS++ is still coupled a bit to FastClick, but it was mostly separated already here : https://github.com/tbarbette/fastclick/tree/main/vendor/nicscheduler
>
> We also had a version for the Linux Kernel with XDP for counting.
>
> We can chat about that if you want.
>
> NB : my address has changed, I’m not at kth anymore.
I apologise for mixing that up. Found it at the top of https://github.com/rsspp.
Thank you.
>
> Cheers,
>
> Tom
>
> From: Stephen Hemminger <stephen@networkplumber.org>
> Date: Tuesday, 15 July 2025 at 23:40
> To: Scott Wasson <swasson@microsoft.com>
> Cc: users@dpdk.org <users@dpdk.org>
> Subject: Re: rte_eth_dev_rss_reta_update() locking considerations?
>
> On Tue, 15 Jul 2025 16:15:22 +0000
> Scott Wasson <swasson@microsoft.com> wrote:
>
> > Hi,
> >
> > We're using multiqueue, and RSS doesn't always balance the load very well. I had a clever idea to periodically measure the load distribution (cpu load on the IO cores) in the
> background pthread, and use rte_eth_dev_rss_reta_update() to adjust the redirection table dynamically if the imbalance exceeds a given threshold. In practice it seems to work nicely.
> But I'm concerned about:
> >
> > https://doc.dpdk.org/api/rte__ethdev_8h.html#a3c1540852c9cf1e576a883902c2e310d
> >
> > Which states:
> >
> > By default, all the functions of the Ethernet Device API exported by a PMD are lock-free functions which assume to not be invoked in parallel on different logical cores to work on the
> same target object. For instance, the receive function of a PMD cannot be invoked in parallel on two logical cores to poll the same Rx queue [of the same port]. Of course, this function
> can be invoked in parallel by different logical cores on different Rx queues. It is the responsibility of the upper level application to enforce this rule.
> >
> > In this context, what is the "target object"? The queue_id of the port? Or the port itself? Would I need to add port-level spinlocks around every invocation of rte_eth_dev_*()?
> That's a hard no, it would destroy performance.
> >
> > Alternatively, if I were to periodically call rte_eth_dev_rss_reta_update() from the IO cores instead of the background core, as the above paragraph suggests, that doesn't seem correct
> either. The function takes a reta_conf[] array that affects all RETA entries for that port and maps them to a queue_id. Is it safe to remap RETA entries for a given port on one IO core
> while another IO core is potentially reading from its rx queue for that same port? That problem seems not much different from remapping in the background core as I am now.
> >
> > I'm starting to suspect this function was intended to be initialized once on startup before rte_eth_dev_start(), and/or the ports must be stopped before calling it. If that's the
> case, then I'll call this idea too clever by half and give it up now.
> >
> > Thanks in advance for your help!
> >
> > -Scott
> >
>
> There is no locking in driver path for control.
> It is expected that application will manage access to control path (RSS being one example)
> so that only one thread modifies the PMD.
>
end of thread, other threads:[~2025-07-23 12:42 UTC | newest]
2025-07-15 16:15 rte_eth_dev_rss_reta_update() locking considerations? Scott Wasson
2025-07-15 17:30 ` Ivan Malov
2025-07-15 21:40 ` Stephen Hemminger
2025-07-23 12:22 ` Tom Barbette
2025-07-23 12:42 ` Ivan Malov