* [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
From: Khadem Ullah @ 2025-07-21 7:38 UTC
To: dev; +Cc: rasland, stable, Khadem Ullah

When the primary process exits, the shared mlx5 state becomes
unavailable to secondary processes. If a secondary process attempts
to query device information (e.g., via testpmd), a NULL dereference
may occur due to missing shared data.

This patch adds a check for shared context availability and fails
gracefully while preventing a crash.

Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
Cc: stable@dpdk.org

Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
---
 drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 68d1c1bfa7..1848f6536a 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
 	 * Since we need one CQ per QP, the limit is the minimum number
 	 * between the two values.
 	 */
+	if (priv == NULL || priv->sh == NULL) {
+		DRV_LOG(ERR,
+			"mlx5 shared data unavailable (primary process likely exited)");
+		rte_errno = ENODEV;
+		return -rte_errno;
+	}
 	max = RTE_MIN(priv->sh->dev_cap.max_cq, priv->sh->dev_cap.max_qp);
 	/* max_rx_queues is uint16_t. */
 	max = RTE_MIN(max, (unsigned int)UINT16_MAX);
--
2.43.0
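
For readers who want to try the guard outside the driver, here is a minimal standalone sketch of the same pattern. It is not the mlx5 code: `struct shared_ctx`, `struct port_priv` and `query_max_queues()` are hypothetical stand-ins invented for illustration, and the error is returned directly rather than through rte_errno. Note that a check like this can only catch a pointer that is actually NULL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's private structures. */
struct shared_ctx {
	unsigned int max_cq;
	unsigned int max_qp;
};

struct port_priv {
	struct shared_ctx *sh; /* NULL once the shared context is gone */
};

/*
 * Store the usable queue limit in *max_out and return 0,
 * or return -ENODEV when the shared context is missing.
 */
static int
query_max_queues(const struct port_priv *priv, unsigned int *max_out)
{
	unsigned int max;

	assert(max_out != NULL);
	/* Same shape as the check in the patch: fail instead of dereferencing NULL. */
	if (priv == NULL || priv->sh == NULL)
		return -ENODEV;
	/* One CQ per QP, so the limit is the smaller of the two values. */
	max = priv->sh->max_cq < priv->sh->max_qp ?
	      priv->sh->max_cq : priv->sh->max_qp;
	/* The result has to fit a uint16_t queue count. */
	if (max > UINT16_MAX)
		max = UINT16_MAX;
	*max_out = max;
	return 0;
}
```

A caller would treat a negative return value as "port unusable" and skip any further queries on that port.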
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
From: Dariusz Sosnowski @ 2025-07-21 10:58 UTC
To: Khadem Ullah, Thomas Monjalon, Andrew Rybchenko
Cc: dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

+ mlx5 maintainers

Thank you for the patch.

Could you please include other PMD maintainers (or other maintainers, depending on changed code)
in the future patches?
There is a script which automatically adds maintainers while sending a patch.
It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches

On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
> When the primary process exits, the shared mlx5 state becomes
> unavailable to secondary processes. If a secondary process attempts
> to query device information (e.g., via testpmd), a NULL dereference
> may occur due to missing shared data.
>
> This patch adds a check for shared context availability and fails
> gracefully while preventing a crash.
>
> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
> Cc: stable@dpdk.org
>
> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
> ---
>  drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
> index 68d1c1bfa7..1848f6536a 100644
> --- a/drivers/net/mlx5/mlx5_ethdev.c
> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
>  	 * Since we need one CQ per QP, the limit is the minimum number
>  	 * between the two values.
>  	 */
> +	if (priv == NULL || priv->sh == NULL) {
> +		DRV_LOG(ERR,
> +			"mlx5 shared data unavailable (primary process likely exited)");
> +		rte_errno = ENODEV;
> +		return -rte_errno;
> +	}

I don't think it's an issue on PMD level, but rather on
ethdev/multi-process handling level.

When primary process closes the port, ethdev library zeroes and frees
device data shared between processes.
ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
invalid data. rte_eth_dev_info_get() is not the only API call affected.

If the primary process closes the port before exiting
(like testpmd does) and it exits before the secondary,
then any driver call seems invalid because of that use-after-free behavior.

@Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
in secondary process after primary has gracefully exited is supported?

>  	max = RTE_MIN(priv->sh->dev_cap.max_cq, priv->sh->dev_cap.max_qp);
>  	/* max_rx_queues is uint16_t. */
>  	max = RTE_MIN(max, (unsigned int)UINT16_MAX);
> --
> 2.43.0
>
* Re: [PATCH] net/mlx5: fix crash when using meter in transfer flow
From: Khadem Ullah @ 2025-07-21 11:38 UTC
To: dsosnowski, thomas, andrew.rybchenko
Cc: dev, rasland, stable, viacheslavo, orika, suanmingm, matan

Hi Dariusz,

Thanks for the feedback.

Agreed, this check only papers over a deeper issue with the multi-process
lifecycle in ethdev. If the primary process exits after freeing ethdev data,
all subsequent PMD-level calls (e.g., dev_infos_get) in the secondary are
inherently unsafe due to dangling pointers.

Unless ethdev explicitly invalidates the port (or the secondary process
detects the primary's exit), the current behavior results in use-after-free
regardless of PMD.

For now, this patch avoids a crash in one case (dev_infos_get), but long-term
a proper ethdev-level solution is needed, by either:
- Marking ports invalid on primary exit,
- Notifying secondaries to tear down,
- Or refusing API access after shared data disappears.

Until then, we can keep this as a safety patch to prevent segfaults in
common test scenarios.
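
As a rough illustration of the first and third options in the list above (mark the port invalid, then refuse API access), here is a minimal sketch. Everything in it is an assumption made for illustration: `struct port_state` and the three functions are hypothetical and do not exist in ethdev, and the sketch assumes the record holding the flag lives in memory that is never freed while any process may still look at it, which is exactly what the current ethdev data does not guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-port record kept in memory that outlives the port. */
struct port_state {
	atomic_bool valid;       /* set by the primary on probe, cleared on close */
	unsigned int max_queues; /* example payload guarded by the flag */
};

/* Primary side: publish the port once its data is ready. */
static void
port_mark_valid(struct port_state *st, unsigned int max_queues)
{
	assert(st != NULL);
	st->max_queues = max_queues;
	atomic_store_explicit(&st->valid, true, memory_order_release);
}

/* Primary side: invalidate the port before releasing its data. */
static void
port_invalidate(struct port_state *st)
{
	assert(st != NULL);
	atomic_store_explicit(&st->valid, false, memory_order_release);
}

/* Any process: refuse access once the port has been invalidated. */
static int
port_info_get(struct port_state *st, unsigned int *max_queues)
{
	assert(st != NULL && max_queues != NULL);
	if (!atomic_load_explicit(&st->valid, memory_order_acquire))
		return -ENODEV;
	*max_queues = st->max_queues;
	return 0;
}
```

The flag alone does not close the race between the validity check and the use of the data; a real design would still need the primary to wait for in-flight callers, or a notification to secondaries, before freeing anything.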
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
From: Ivan Malov @ 2025-07-21 11:46 UTC
To: Dariusz Sosnowski
Cc: Khadem Ullah, Thomas Monjalon, Andrew Rybchenko, dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:

> + mlx5 maintainers
>
> Thank you for the patch.
>
> Could you please include other PMD maintainers (or other maintainers, depending on changed code)
> in the future patches?
> There is a script which automatically adds maintainers while sending a patch.
> It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
>
> On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
>> When the primary process exits, the shared mlx5 state becomes
>> unavailable to secondary processes. If a secondary process attempts
>> to query device information (e.g., via testpmd), a NULL dereference
>> may occur due to missing shared data.
>>
>> This patch adds a check for shared context availability and fails
>> gracefully while preventing a crash.
>>
>> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
>> ---
>>  drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
>> index 68d1c1bfa7..1848f6536a 100644
>> --- a/drivers/net/mlx5/mlx5_ethdev.c
>> +++ b/drivers/net/mlx5/mlx5_ethdev.c
>> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
>>  	 * Since we need one CQ per QP, the limit is the minimum number
>>  	 * between the two values.
>>  	 */
>> +	if (priv == NULL || priv->sh == NULL) {
>> +		DRV_LOG(ERR,
>> +			"mlx5 shared data unavailable (primary process likely exited)");
>> +		rte_errno = ENODEV;
>> +		return -rte_errno;
>> +	}
>
> I don't think it's an issue on PMD level, but rather on
> ethdev/multi-process handling level.

I may be very wrong here, but can't the PMD use its internal 'shared' state to
somehow memorise the fact that a secondary process has attached, in order to not
let the primary process close in the first place (before the secondary process
detaches)? Or is this going to violate some established convention?

Thank you.

>
> When primary process closes the port, ethdev library zeroes and frees
> device data shared between processes.
> ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> invalid data. rte_eth_dev_info_get() is not the only API call affected.
>
> If the primary process closes the port before exiting
> (like testpmd does) and it exits before the secondary,
> then any driver call seems invalid because of that use-after-free behavior.
>
> @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> in secondary process after primary has gracefully exited is supported?
>
>>  	max = RTE_MIN(priv->sh->dev_cap.max_cq, priv->sh->dev_cap.max_qp);
>>  	/* max_rx_queues is uint16_t. */
>>  	max = RTE_MIN(max, (unsigned int)UINT16_MAX);
>> --
>> 2.43.0
>>
>
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
From: Thomas Monjalon @ 2025-07-21 11:59 UTC
To: Dariusz Sosnowski, Ivan Malov
Cc: Khadem Ullah, Andrew Rybchenko, dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

21/07/2025 13:46, Ivan Malov:
> On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
>
> > + mlx5 maintainers
> >
> > Thank you for the patch.
> >
> > Could you please include other PMD maintainers (or other maintainers, depending on changed code)
> > in the future patches?
> > There is a script which automatically adds maintainers while sending a patch.
> > It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
> >
> > On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
> >> When the primary process exits, the shared mlx5 state becomes
> >> unavailable to secondary processes. If a secondary process attempts
> >> to query device information (e.g., via testpmd), a NULL dereference
> >> may occur due to missing shared data.
> >>
> >> This patch adds a check for shared context availability and fails
> >> gracefully while preventing a crash.
> >>
> >> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
> >> Cc: stable@dpdk.org
> >>
> >> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
> >> ---
> >>  drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
> >> index 68d1c1bfa7..1848f6536a 100644
> >> --- a/drivers/net/mlx5/mlx5_ethdev.c
> >> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> >> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
> >>  	 * Since we need one CQ per QP, the limit is the minimum number
> >>  	 * between the two values.
> >>  	 */
> >> +	if (priv == NULL || priv->sh == NULL) {
> >> +		DRV_LOG(ERR,
> >> +			"mlx5 shared data unavailable (primary process likely exited)");
> >> +		rte_errno = ENODEV;
> >> +		return -rte_errno;
> >> +	}
> >
> > I don't think it's an issue on PMD level, but rather on
> > ethdev/multi-process handling level.
>
> I may be very wrong here, but can't the PMD use its internal 'shared' state to
> somehow memorise the fact that a secondary process has attached, in order to not
> let the primary process close in the first place (before the secondary process
> detaches)? Or is this going to violate some established convention?

It looks to be a good idea.

> > When primary process closes the port, ethdev library zeroes and frees
> > device data shared between processes.
> > ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> > invalid data. rte_eth_dev_info_get() is not the only API call affected.
> >
> > If the primary process closes the port before exiting
> > (like testpmd does) and it exits before the secondary,
> > then any driver call seems invalid because of that use-after-free behavior.
> >
> > @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> > in secondary process after primary has gracefully exited is supported?

No, there is no statement about whether it is supported or not.
I think we should at least return an error instead of crashing.
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
From: Dariusz Sosnowski @ 2025-07-21 14:01 UTC
To: Thomas Monjalon
Cc: Ivan Malov, Khadem Ullah, Andrew Rybchenko, dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

On Mon, Jul 21, 2025 at 01:59:59PM +0200, Thomas Monjalon wrote:
> 21/07/2025 13:46, Ivan Malov:
> > On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
> >
> > > + mlx5 maintainers
> > >
> > > Thank you for the patch.
> > >
> > > Could you please include other PMD maintainers (or other maintainers, depending on changed code)
> > > in the future patches?
> > > There is a script which automatically adds maintainers while sending a patch.
> > > It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
> > >
> > > On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
> > >> When the primary process exits, the shared mlx5 state becomes
> > >> unavailable to secondary processes. If a secondary process attempts
> > >> to query device information (e.g., via testpmd), a NULL dereference
> > >> may occur due to missing shared data.
> > >>
> > >> This patch adds a check for shared context availability and fails
> > >> gracefully while preventing a crash.
> > >>
> > >> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
> > >> Cc: stable@dpdk.org
> > >>
> > >> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
> > >> ---
> > >>  drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
> > >>  1 file changed, 6 insertions(+)
> > >>
> > >> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
> > >> index 68d1c1bfa7..1848f6536a 100644
> > >> --- a/drivers/net/mlx5/mlx5_ethdev.c
> > >> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> > >> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
> > >>  	 * Since we need one CQ per QP, the limit is the minimum number
> > >>  	 * between the two values.
> > >>  	 */
> > >> +	if (priv == NULL || priv->sh == NULL) {
> > >> +		DRV_LOG(ERR,
> > >> +			"mlx5 shared data unavailable (primary process likely exited)");
> > >> +		rte_errno = ENODEV;
> > >> +		return -rte_errno;
> > >> +	}
> > >
> > > I don't think it's an issue on PMD level, but rather on
> > > ethdev/multi-process handling level.
> >
> > I may be very wrong here, but can't the PMD use its internal 'shared' state to
> > somehow memorise the fact that a secondary process has attached, in order to not
> > let the primary process close in the first place (before the secondary process
> > detaches)? Or is this going to violate some established convention?
>
> It looks to be a good idea.

I agree with the idea of adding these checks, but I do not entirely agree
with doing it at the driver level, since all drivers would have to duplicate
this logic.

Drivers already have to go through the ethdev library when a port is probed
in a secondary process: rte_eth_dev_attach_secondary() must be called to
retrieve the port data local to the process and the device data shared
between processes.
Memorizing whether a secondary process is using a given port can be added to
rte_eth_dev_attach_secondary(), and the relevant check for the primary process
can then be added to rte_eth_dev_close(), so that all drivers benefit.

What do you think?

>
> > > When primary process closes the port, ethdev library zeroes and frees
> > > device data shared between processes.
> > > ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> > > invalid data. rte_eth_dev_info_get() is not the only API call affected.
> > >
> > > If the primary process closes the port before exiting
> > > (like testpmd does) and it exits before the secondary,
> > > then any driver call seems invalid because of that use-after-free behavior.
> > >
> > > @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> > > in secondary process after primary has gracefully exited is supported?
>
> No there is no statement about whether it is supported or not.
> I think we should at least return an error instead of crashing.
>
>
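
To make the proposal above more concrete, here is one way the bookkeeping could look, as a standalone sketch only. The names `struct dev_shared`, `secondary_attach()`, `secondary_detach()` and `primary_close()` are hypothetical; they merely stand in for the places where rte_eth_dev_attach_secondary() and rte_eth_dev_close() would do the counting and the check. Nothing here is existing ethdev code, and the detach hook in particular is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical slice of the device data shared between processes. */
struct dev_shared {
	atomic_uint secondary_refs; /* secondaries currently attached */
};

/* Would run on the secondary attach path. */
static void
secondary_attach(struct dev_shared *d)
{
	assert(d != NULL);
	atomic_fetch_add(&d->secondary_refs, 1);
}

/* Would run when a secondary releases the port. */
static void
secondary_detach(struct dev_shared *d)
{
	unsigned int prev;

	assert(d != NULL);
	prev = atomic_fetch_sub(&d->secondary_refs, 1);
	assert(prev > 0); /* detach without a matching attach is a bug */
}

/* Would run on the primary close path: refuse while the port is in use. */
static int
primary_close(struct dev_shared *d)
{
	assert(d != NULL);
	if (atomic_load(&d->secondary_refs) != 0)
		return -EBUSY;
	/* Safe to zero and free the shared device data here. */
	return 0;
}
```

Two caveats with a plain counter: a secondary that crashes never decrements it, so the primary could be blocked forever, and an attach can still race with the check in the close path unless both run under a common lock.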
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
From: Stephen Hemminger @ 2025-07-21 14:50 UTC
To: Dariusz Sosnowski
Cc: Khadem Ullah, Thomas Monjalon, Andrew Rybchenko, dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou, Matan Azrad

On Mon, 21 Jul 2025 12:58:19 +0200
Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:

> I don't think it's an issue on PMD level, but rather on
> ethdev/multi-process handling level.
>
> When primary process closes the port, ethdev library zeroes and frees
> device data shared between processes.
> ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> invalid data. rte_eth_dev_info_get() is not the only API call affected.
>
> If the primary process closes the port before exiting
> (like testpmd does) and it exits before the secondary,
> then any driver call seems invalid because of that use-after-free behavior.
>
> @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> in secondary process after primary has gracefully exited is supported?

No, this is not supported.
A properly written secondary process monitors to see when primary exits.
There are many other parts of DPDK that assume primary is always available.
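
The monitoring mentioned above could be sketched as follows. This is a plain POSIX illustration, not DPDK code: `process_is_alive()` is a hypothetical helper, and it assumes the secondary has learned the primary's PID by some out-of-band means. DPDK's EAL also declares rte_eal_primary_proc_alive() in rte_eal.h, which looks like the more natural tool inside a DPDK application; the sketch avoids it only so that it compiles on its own.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Report whether a process with the given PID currently exists.
 * Signal 0 performs the existence and permission checks without
 * delivering anything to the target.
 */
static bool
process_is_alive(pid_t pid)
{
	/* pid <= 0 would address process groups, not a single process. */
	assert(pid > 0);
	if (kill(pid, 0) == 0)
		return true;
	/* EPERM: the process exists but may not be signalled by us. */
	return errno == EPERM;
}
```

A secondary could call this periodically from its main loop and stop touching ethdev ports as soon as it returns false. PIDs can be reused by the kernel, so this check is only a hint; a lock held by the primary on a shared file is a more robust signal.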