* [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
@ 2025-07-21 7:38 Khadem Ullah
2025-07-21 10:58 ` Dariusz Sosnowski
0 siblings, 1 reply; 14+ messages in thread
From: Khadem Ullah @ 2025-07-21 7:38 UTC (permalink / raw)
To: dev; +Cc: rasland, stable, Khadem Ullah
When the primary process exits, the shared mlx5 state becomes
unavailable to secondary processes. If a secondary process attempts
to query device information (e.g., via testpmd), a NULL dereference
may occur due to missing shared data.
This patch adds a check for shared context availability and fails
gracefully while preventing a crash.
Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
Cc: stable@dpdk.org
Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
---
drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 68d1c1bfa7..1848f6536a 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
* Since we need one CQ per QP, the limit is the minimum number
* between the two values.
*/
+ if (priv == NULL || priv->sh == NULL) {
+ DRV_LOG(ERR,
+ "mlx5 shared data unavailable (primary process likely exited)");
+ rte_errno = ENODEV;
+ return -rte_errno;
+ }
max = RTE_MIN(priv->sh->dev_cap.max_cq, priv->sh->dev_cap.max_qp);
/* max_rx_queues is uint16_t. */
max = RTE_MIN(max, (unsigned int)UINT16_MAX);
--
2.43.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
2025-07-21 7:38 [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Khadem Ullah
@ 2025-07-21 10:58 ` Dariusz Sosnowski
2025-07-21 11:38 ` [PATCH] net/mlx5: fix crash when using meter in transfer flow Khadem Ullah
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Dariusz Sosnowski @ 2025-07-21 10:58 UTC (permalink / raw)
To: Khadem Ullah, Thomas Monjalon, Andrew Rybchenko
Cc: dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad
+ mlx5 maintainers
Thank you for the patch.
Could you please include other PMD maintainers (or other maintainers, depending on changed code)
in future patches?
There is a script which automatically adds maintainers while sending a patch.
It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
> When the primary process exits, the shared mlx5 state becomes
> unavailable to secondary processes. If a secondary process attempts
> to query device information (e.g., via testpmd), a NULL dereference
> may occur due to missing shared data.
>
> This patch adds a check for shared context availability and fails
> gracefully while preventing a crash.
>
> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
> Cc: stable@dpdk.org
>
> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
> ---
> drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
> index 68d1c1bfa7..1848f6536a 100644
> --- a/drivers/net/mlx5/mlx5_ethdev.c
> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
> * Since we need one CQ per QP, the limit is the minimum number
> * between the two values.
> */
> + if (priv == NULL || priv->sh == NULL) {
> + DRV_LOG(ERR,
> + "mlx5 shared data unavailable (primary process likely exited)");
> + rte_errno = ENODEV;
> + return -rte_errno;
> + }
I don't think it's an issue on PMD level, but rather on
ethdev/multi-process handling level.
When primary process closes the port, ethdev library zeroes and frees
device data shared between processes.
ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
invalid data. rte_eth_dev_info_get() is not the only API call affected.
If the primary process closes the port before exiting
(like testpmd does) and it exits before the secondary,
then any driver call seems invalid because of that use-after-free behavior.
@Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
in secondary process after primary has gracefully exited is supported?
> max = RTE_MIN(priv->sh->dev_cap.max_cq, priv->sh->dev_cap.max_qp);
> /* max_rx_queues is uint16_t. */
> max = RTE_MIN(max, (unsigned int)UINT16_MAX);
> --
> 2.43.0
>
* Re: [PATCH] net/mlx5: fix crash when using meter in transfer flow
2025-07-21 10:58 ` Dariusz Sosnowski
@ 2025-07-21 11:38 ` Khadem Ullah
2025-07-21 11:46 ` [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Ivan Malov
2025-07-21 14:50 ` Stephen Hemminger
2 siblings, 0 replies; 14+ messages in thread
From: Khadem Ullah @ 2025-07-21 11:38 UTC (permalink / raw)
To: dsosnowski, thomas, andrew.rybchenko
Cc: dev, rasland, stable, viacheslavo, orika, suanmingm, matan
Hi Dariusz,
Thanks for the feedback. Agreed, this check only papers over a deeper issue with
multi-process lifecycle in ethdev.
If the primary process exits after freeing ethdev data, all subsequent PMD-level
calls (e.g., dev_infos_get) in secondary are inherently unsafe due to dangling
pointers.
Unless ethdev explicitly invalidates the port (or secondary process detects
primary exit), the current behavior results in use-after-free regardless of PMD.
For now, this patch avoids a crash in one case (dev_infos_get), but long-term
a proper ethdev-level solution is needed — either by:
- Marking ports invalid on primary exit,
- Notifying secondaries to tear down,
- Or refusing API access after shared data disappears.
Until then, we can keep this as a safety patch to prevent segfaults in common test scenarios.
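As an illustration of the first and third options above (marking ports invalid and refusing API access), here is a minimal, self-contained sketch. It is not ethdev code: `struct shared_port`, `struct local_port`, `port_attach()`, `port_close()` and `port_check()` are hypothetical names, and the sketch assumes that the shared slot holding the generation counter itself stays mapped after the port is closed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for per-port data kept in memory shared by all processes. */
struct shared_port {
	uint32_t generation; /* bumped by the primary each time the port is closed */
	void *dev_private;   /* driver data; NULL once the primary has released it */
};

/* Hypothetical per-process handle that a secondary keeps after attaching. */
struct local_port {
	struct shared_port *shared;
	uint32_t attached_generation; /* generation observed at attach time */
};

/* Secondary side: remember which incarnation of the port was attached. */
static void
port_attach(struct local_port *lp, struct shared_port *sp)
{
	assert(lp != NULL && sp != NULL);
	lp->shared = sp;
	lp->attached_generation = sp->generation;
}

/* Primary side: release driver data and make all older handles stale. */
static void
port_close(struct shared_port *sp)
{
	assert(sp != NULL);
	sp->dev_private = NULL;
	sp->generation++;
}

/* Guard to run before any driver callback: 0 if usable, -ENODEV if stale. */
static int
port_check(const struct local_port *lp)
{
	if (lp == NULL || lp->shared == NULL)
		return -ENODEV;
	if (lp->shared->generation != lp->attached_generation)
		return -ENODEV;
	if (lp->shared->dev_private == NULL)
		return -ENODEV;
	return 0;
}
```

A real implementation would need atomic accesses to the shared fields, and it only helps if the primary gets a chance to run the close path; it does nothing for a primary that is killed.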
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
2025-07-21 10:58 ` Dariusz Sosnowski
2025-07-21 11:38 ` [PATCH] net/mlx5: fix crash when using meter in transfer flow Khadem Ullah
@ 2025-07-21 11:46 ` Ivan Malov
2025-07-21 11:59 ` Thomas Monjalon
2025-07-21 14:50 ` Stephen Hemminger
2 siblings, 1 reply; 14+ messages in thread
From: Ivan Malov @ 2025-07-21 11:46 UTC (permalink / raw)
To: Dariusz Sosnowski
Cc: Khadem Ullah, Thomas Monjalon, Andrew Rybchenko, dev, rasland,
stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou,
Matan Azrad
On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
> + mlx5 maintainers
>
> Thank you for the patch.
>
> Could you please include other PMD maintainers (or other maintainers, depending on changed code)
> in future patches?
> There is a script which automatically adds maintainers while sending a patch.
> It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
>
> On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
>> When the primary process exits, the shared mlx5 state becomes
>> unavailable to secondary processes. If a secondary process attempts
>> to query device information (e.g., via testpmd), a NULL dereference
>> may occur due to missing shared data.
>>
>> This patch adds a check for shared context availability and fails
>> gracefully while preventing a crash.
>>
>> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
>> Cc: stable@dpdk.org
>>
>> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
>> ---
>> drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
>> index 68d1c1bfa7..1848f6536a 100644
>> --- a/drivers/net/mlx5/mlx5_ethdev.c
>> +++ b/drivers/net/mlx5/mlx5_ethdev.c
>> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
>> * Since we need one CQ per QP, the limit is the minimum number
>> * between the two values.
>> */
>> + if (priv == NULL || priv->sh == NULL) {
>> + DRV_LOG(ERR,
>> + "mlx5 shared data unavailable (primary process likely exited)");
>> + rte_errno = ENODEV;
>> + return -rte_errno;
>> + }
>
> I don't think it's an issue on PMD level, but rather on
> ethdev/multi-process handling level.
I may be very wrong here, but can't the PMD use its internal 'shared' state to
somehow memorise the fact that a secondary process has attached, in order to not
let the primary process close in the first place (before the secondary process
detaches)? Or is this going to violate some established convention?
Thank you.
>
> When primary process closes the port, ethdev library zeroes and frees
> device data shared between processes.
> ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> invalid data. rte_eth_dev_info_get() is not the only API call affected.
>
> If the primary process closes the port before exiting
> (like testpmd does) and it exits before the secondary,
> then any driver call seems invalid because of that use-after-free behavior.
>
> @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> in secondary process after primary has gracefully exited is supported?
>
>> max = RTE_MIN(priv->sh->dev_cap.max_cq, priv->sh->dev_cap.max_qp);
>> /* max_rx_queues is uint16_t. */
>> max = RTE_MIN(max, (unsigned int)UINT16_MAX);
>> --
>> 2.43.0
>>
>
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
2025-07-21 11:46 ` [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Ivan Malov
@ 2025-07-21 11:59 ` Thomas Monjalon
2025-07-21 14:01 ` Dariusz Sosnowski
0 siblings, 1 reply; 14+ messages in thread
From: Thomas Monjalon @ 2025-07-21 11:59 UTC (permalink / raw)
To: Dariusz Sosnowski, Ivan Malov
Cc: Khadem Ullah, Andrew Rybchenko, dev, rasland, stable,
Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou,
Matan Azrad
21/07/2025 13:46, Ivan Malov:
> On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
>
> > + mlx5 maintainers
> >
> > Thank you for the patch.
> >
> > Could you please include other PMD maintainers (or other maintainers, depending on changed code)
> > in future patches?
> > There is a script which automatically adds maintainers while sending a patch.
> > It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
> >
> > On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
> >> When the primary process exits, the shared mlx5 state becomes
> >> unavailable to secondary processes. If a secondary process attempts
> >> to query device information (e.g., via testpmd), a NULL dereference
> >> may occur due to missing shared data.
> >>
> >> This patch adds a check for shared context availability and fails
> >> gracefully while preventing a crash.
> >>
> >> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
> >> Cc: stable@dpdk.org
> >>
> >> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
> >> ---
> >> drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
> >> 1 file changed, 6 insertions(+)
> >>
> >> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
> >> index 68d1c1bfa7..1848f6536a 100644
> >> --- a/drivers/net/mlx5/mlx5_ethdev.c
> >> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> >> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
> >> * Since we need one CQ per QP, the limit is the minimum number
> >> * between the two values.
> >> */
> >> + if (priv == NULL || priv->sh == NULL) {
> >> + DRV_LOG(ERR,
> >> + "mlx5 shared data unavailable (primary process likely exited)");
> >> + rte_errno = ENODEV;
> >> + return -rte_errno;
> >> + }
> >
> > I don't think it's an issue on PMD level, but rather on
> > ethdev/multi-process handling level.
>
> I may be very wrong here, but can't the PMD use its internal 'shared' state to
> somehow memorise the fact that a secondary process has attached, in order to not
> let the primary process close in the first place (before the secondary process
> detaches)? Or is this going to violate some established convention?
It looks to be a good idea.
> > When primary process closes the port, ethdev library zeroes and frees
> > device data shared between processes.
> > ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> > invalid data. rte_eth_dev_info_get() is not the only API call affected.
> >
> > If the primary process closes the port before exiting
> > (like testpmd does) and it exits before the secondary,
> > then any driver call seems invalid because of that use-after-free behavior.
> >
> > @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> > in secondary process after primary has gracefully exited is supported?
No, there is no statement about whether it is supported or not.
I think we should at least return an error instead of crashing.
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
2025-07-21 11:59 ` Thomas Monjalon
@ 2025-07-21 14:01 ` Dariusz Sosnowski
2025-07-22 12:14 ` [PATCH] ethdev: add dev_private check for secondary process Khadem Ullah
2025-07-22 16:26 ` [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Ivan Malov
0 siblings, 2 replies; 14+ messages in thread
From: Dariusz Sosnowski @ 2025-07-21 14:01 UTC (permalink / raw)
To: Thomas Monjalon
Cc: Ivan Malov, Khadem Ullah, Andrew Rybchenko, dev, rasland, stable,
Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou,
Matan Azrad
On Mon, Jul 21, 2025 at 01:59:59PM +0200, Thomas Monjalon wrote:
> 21/07/2025 13:46, Ivan Malov:
> > On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
> >
> > > + mlx5 maintainers
> > >
> > > Thank you for the patch.
> > >
> > > Could you please include other PMD maintainers (or other maintainers, depending on changed code)
> > > in future patches?
> > > There is a script which automatically adds maintainers while sending a patch.
> > > It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
> > >
> > > On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
> > >> When the primary process exits, the shared mlx5 state becomes
> > >> unavailable to secondary processes. If a secondary process attempts
> > >> to query device information (e.g., via testpmd), a NULL dereference
> > >> may occur due to missing shared data.
> > >>
> > >> This patch adds a check for shared context availability and fails
> > >> gracefully while preventing a crash.
> > >>
> > >> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
> > >> Cc: stable@dpdk.org
> > >>
> > >> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
> > >> ---
> > >> drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
> > >> 1 file changed, 6 insertions(+)
> > >>
> > >> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
> > >> index 68d1c1bfa7..1848f6536a 100644
> > >> --- a/drivers/net/mlx5/mlx5_ethdev.c
> > >> +++ b/drivers/net/mlx5/mlx5_ethdev.c
> > >> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
> > >> * Since we need one CQ per QP, the limit is the minimum number
> > >> * between the two values.
> > >> */
> > >> + if (priv == NULL || priv->sh == NULL) {
> > >> + DRV_LOG(ERR,
> > >> + "mlx5 shared data unavailable (primary process likely exited)");
> > >> + rte_errno = ENODEV;
> > >> + return -rte_errno;
> > >> + }
> > >
> > > I don't think it's an issue on PMD level, but rather on
> > > ethdev/multi-process handling level.
> >
> > I may be very wrong here, but can't the PMD use its internal 'shared' state to
> > somehow memorise the fact that a secondary process has attached, in order to not
> > let the primary process close in the first place (before the secondary process
> > detaches)? Or is this going to violate some established convention?
>
> It looks to be a good idea.
I agree with the idea of adding these checks, but I do not entirely agree with
it being at the driver level, since all drivers would have to duplicate this logic.
Drivers already have to go through ethdev library when port is probed
on secondary process - rte_eth_dev_attach_secondary() must be
called to retrieve port data local to the process and device data shared
between processes.
Memorizing whether a secondary process is using given port can be added
to rte_eth_dev_attach_secondary(), and relevant check for primary
process can then be added to rte_eth_dev_close(), so that all drivers
benefit.
What do you think?
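The attach/close bookkeeping described above could look roughly like the following self-contained sketch. It does not use the real ethdev structures: `struct shared_dev_data`, `secondary_attach()`, `secondary_detach()` and `primary_close()` are hypothetical stand-ins for what might be added around rte_eth_dev_attach_secondary() and rte_eth_dev_close(), and the choice of -EBUSY and -ENODEV as return codes is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for device data shared between processes. */
struct shared_dev_data {
	atomic_uint secondary_refcnt; /* secondaries currently attached */
	atomic_bool closed;           /* set once the primary closed the port */
};

static void
dev_data_init(struct shared_dev_data *d)
{
	assert(d != NULL);
	atomic_init(&d->secondary_refcnt, 0);
	atomic_init(&d->closed, false);
}

/* Would be called from the secondary's attach path. */
static int
secondary_attach(struct shared_dev_data *d)
{
	assert(d != NULL);
	if (atomic_load(&d->closed))
		return -ENODEV; /* port is already gone */
	atomic_fetch_add(&d->secondary_refcnt, 1);
	return 0;
}

/* Would be called when the secondary releases the port. */
static void
secondary_detach(struct shared_dev_data *d)
{
	unsigned int prev;

	assert(d != NULL);
	prev = atomic_fetch_sub(&d->secondary_refcnt, 1);
	assert(prev > 0); /* detach without a matching attach is a bug */
	(void)prev;
}

/* Would be called at the start of the primary's close path. */
static int
primary_close(struct shared_dev_data *d)
{
	assert(d != NULL);
	if (atomic_load(&d->secondary_refcnt) > 0)
		return -EBUSY; /* refuse to close while secondaries are attached */
	atomic_store(&d->closed, true);
	return 0;
}
```

The check in `primary_close()` is not race-free against a concurrent attach, and a secondary that crashes never decrements the counter, so a real design would need a lock or a compare-and-swap protocol plus some way to recover stale references.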
>
> > > When primary process closes the port, ethdev library zeroes and frees
> > > device data shared between processes.
> > > ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> > > invalid data. rte_eth_dev_info_get() is not the only API call affected.
> > >
> > > If the primary process closes the port before exiting
> > > (like testpmd does) and it exits before the secondary,
> > > then any driver call seems invalid because of that use-after-free behavior.
> > >
> > > @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> > > in secondary process after primary has gracefully exited is supported?
>
> No, there is no statement about whether it is supported or not.
> I think we should at least return an error instead of crashing.
>
>
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
2025-07-21 10:58 ` Dariusz Sosnowski
2025-07-21 11:38 ` [PATCH] net/mlx5: fix crash when using meter in transfer flow Khadem Ullah
2025-07-21 11:46 ` [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Ivan Malov
@ 2025-07-21 14:50 ` Stephen Hemminger
2 siblings, 0 replies; 14+ messages in thread
From: Stephen Hemminger @ 2025-07-21 14:50 UTC (permalink / raw)
To: Dariusz Sosnowski
Cc: Khadem Ullah, Thomas Monjalon, Andrew Rybchenko, dev, rasland,
stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou,
Matan Azrad
On Mon, 21 Jul 2025 12:58:19 +0200
Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
> I don't think it's an issue on PMD level, but rather on
> ethdev/multi-process handling level.
>
> When primary process closes the port, ethdev library zeroes and frees
> device data shared between processes.
> ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
> invalid data. rte_eth_dev_info_get() is not the only API call affected.
>
> If the primary process closes the port before exiting
> (like testpmd does) and it exits before the secondary,
> then any driver call seems invalid because of that use-after-free behavior.
>
> @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
> in secondary process after primary has gracefully exited is supported?
No, this is not supported.
A properly written secondary process monitors to see when primary exits.
There are many other parts of DPDK that assume primary is always available.
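A minimal sketch of such monitoring in a secondary's main loop is shown below. The liveness probe is passed in as a callback so that the example stays self-contained; in a DPDK application it could presumably wrap rte_eal_primary_proc_alive(NULL), but that wiring is an assumption and should be checked against rte_eal.h of the release in use. `struct secondary_state`, `secondary_may_use_ports()` and `countdown_probe()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Liveness probe: returns nonzero while the primary process is alive. */
typedef int (*primary_alive_fn)(void *ctx);

/* Hypothetical per-secondary bookkeeping. */
struct secondary_state {
	primary_alive_fn probe;
	void *probe_ctx;
	bool primary_gone; /* latched: never cleared once set */
};

static void
secondary_state_init(struct secondary_state *s, primary_alive_fn probe,
		     void *probe_ctx)
{
	assert(s != NULL && probe != NULL);
	s->probe = probe;
	s->probe_ctx = probe_ctx;
	s->primary_gone = false;
}

/*
 * Call once per main-loop iteration, before touching any port.
 * Returns true while it is still safe to call into the ethdev API.
 */
static bool
secondary_may_use_ports(struct secondary_state *s)
{
	assert(s != NULL && s->probe != NULL);
	if (!s->primary_gone && s->probe(s->probe_ctx) == 0)
		s->primary_gone = true;
	return !s->primary_gone;
}

/* Demonstration probe only: reports "alive" for *ctx more polls. */
static int
countdown_probe(void *ctx)
{
	int *remaining = ctx;

	if (*remaining <= 0)
		return 0;
	(*remaining)--;
	return 1;
}
```

The result is latched on purpose: once the primary has been seen as gone, a later restart of the primary does not make the old port handles valid again, so the secondary should stop and re-attach rather than resume.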
* Re: [PATCH] ethdev: add dev_private check for secondary process
2025-07-21 14:01 ` Dariusz Sosnowski
@ 2025-07-22 12:14 ` Khadem Ullah
2025-07-22 16:26 ` [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Ivan Malov
1 sibling, 0 replies; 14+ messages in thread
From: Khadem Ullah @ 2025-07-22 12:14 UTC (permalink / raw)
To: thomas, andrew.rybchenko, ferruh.yigit, dsosnowski
Cc: dev, rasland, stable, viacheslavo, orika, suanmingm, matan
Yes, I think this can be addressed at both the PMD and ethdev layers.
To centralize the logic and avoid duplication across drivers, I’ve also submitted
a patch at the ethdev layer that checks for `dev_private` accessibility
in the secondary process context.
Please see the updated patch here:
https://patches.dpdk.org/project/dpdk/patch/20250722115439.1353573-1-14pwcse1224@uetpeshawar.edu.pk/
Looking forward to your feedback.
Best Regards,
Khadem Ullah
* Re: [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits
2025-07-21 14:01 ` Dariusz Sosnowski
2025-07-22 12:14 ` [PATCH] ethdev: add dev_private check for secondary process Khadem Ullah
@ 2025-07-22 16:26 ` Ivan Malov
1 sibling, 0 replies; 14+ messages in thread
From: Ivan Malov @ 2025-07-22 16:26 UTC (permalink / raw)
To: Dariusz Sosnowski
Cc: Thomas Monjalon, Khadem Ullah, Andrew Rybchenko, dev, rasland,
stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam, Suanming Mou,
Matan Azrad
Hi Dariusz, Khadem,
On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
> On Mon, Jul 21, 2025 at 01:59:59PM +0200, Thomas Monjalon wrote:
>> 21/07/2025 13:46, Ivan Malov:
>>> On Mon, 21 Jul 2025, Dariusz Sosnowski wrote:
>>>
>>>> + mlx5 maintainers
>>>>
>>>> Thank you for the patch.
>>>>
>>>> Could you please include other PMD maintainers (or other maintainers, depending on changed code)
>>>> in future patches?
>>>> There is a script which automatically adds maintainers while sending a patch.
>>>> It is described in: https://doc.dpdk.org/guides/contributing/patches.html#sending-patches
>>>>
>>>> On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote:
>>>>> When the primary process exits, the shared mlx5 state becomes
>>>>> unavailable to secondary processes. If a secondary process attempts
>>>>> to query device information (e.g., via testpmd), a NULL dereference
>>>>> may occur due to missing shared data.
>>>>>
>>>>> This patch adds a check for shared context availability and fails
>>>>> gracefully while preventing a crash.
>>>>>
>>>>> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop")
>>>>> Cc: stable@dpdk.org
>>>>>
>>>>> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
>>>>> ---
>>>>> drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++
>>>>> 1 file changed, 6 insertions(+)
>>>>>
>>>>> diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
>>>>> index 68d1c1bfa7..1848f6536a 100644
>>>>> --- a/drivers/net/mlx5/mlx5_ethdev.c
>>>>> +++ b/drivers/net/mlx5/mlx5_ethdev.c
>>>>> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *info)
>>>>> * Since we need one CQ per QP, the limit is the minimum number
>>>>> * between the two values.
>>>>> */
>>>>> + if (priv == NULL || priv->sh == NULL) {
>>>>> + DRV_LOG(ERR,
>>>>> + "mlx5 shared data unavailable (primary process likely exited)");
>>>>> + rte_errno = ENODEV;
>>>>> + return -rte_errno;
>>>>> + }
>>>>
>>>> I don't think it's an issue on PMD level, but rather on
>>>> ethdev/multi-process handling level.
>>>
>>> I may be very wrong here, but can't the PMD use its internal 'shared' state to
>>> somehow memorise the fact that a secondary process has attached, in order to not
>>> let the primary process close in the first place (before the secondary process
>>> detaches)? Or is this going to violate some established convention?
>>
>> It looks to be a good idea.
>
> I agree with the idea of adding these checks, but I do not entirely agree with
> it being at the driver level, since all drivers would have to duplicate this logic.
>
> Drivers already have to go through ethdev library when port is probed
> on secondary process - rte_eth_dev_attach_secondary() must be
> called to retrieve port data local to the process and device data shared
> between processes.
> Memorizing whether a secondary process is using given port can be added
> to rte_eth_dev_attach_secondary(), and relevant check for primary
> process can then be added to rte_eth_dev_close(), so that all drivers
> benefit.
>
> What do you think?
Yes, in general, the idea would be to have a "secondary reference counter" field
in 'rte_eth_dev_data' and have it incremented/decremented/checked in some
generic places. The issue is that, in addition to 'rte_eth_dev_close', a device
can be "closed" via its respective bus 'remove' method. Yes, there are drivers
that invoke 'rte_eth_dev_close' inside the 'remove' implementation, but not all
of them, unfortunately. For example, take a look at 'af_packet' and similar.
On the other hand, there are drivers that use 'rte_eth_dev_create' and its
counterpart 'rte_eth_dev_destroy' in the implementations of bus 'probe/remove',
but that is also not consistent across the 'drivers/net' tree.
Given all these issues and the fact that ethdev is not the only possible device
type, maybe 'rte_device' can house the reference counter, with 'rte_dev_probe'
and 'rte_dev_remove' augmented to do the increment/decrement/validate thing?
And, since 'rte_eth_dev_close' invokes the ethdev 'close' method directly, also add a
check there (and maybe to 'rte_eth_dev_reset', too)? Maybe I'm wrong.
However, all of that might still make no sense, given the fact that the primary
process can just die. So, maybe I am indeed very wrong and, as Stephen has
suggested in [1], this issue is to be addressed by clarifying the documentation.
[1] https://mails.dpdk.org/archives/dev/2025-July/321865.html
Thank you.
>
>>
>>>> When primary process closes the port, ethdev library zeroes and frees
>>>> device data shared between processes.
>>>> ethdev port data (rte_eth_dev) on secondary is not updated so it now points to
>>>> invalid data. rte_eth_dev_info_get() is not the only API call affected.
>>>>
>>>> If the primary process closes the port before exiting
>>>> (like testpmd does) and it exits before the secondary,
>>>> then any driver call seems invalid because of that use-after-free behavior.
>>>>
>>>> @Thomas, @Andrew - Do you happen to know if doing anything on ethdev ports
>>>> in secondary process after primary has gracefully exited is supported?
>>
>> No, there is no statement about whether it is supported or not.
>> I think we should at least return an error instead of crashing.
>>
>>
>
* Re: [PATCH] net/mlx5: fix crash when using meter in transfer flow
2025-07-16 7:23 ` Khadem Ullah
@ 2025-07-16 10:55 ` Dariusz Sosnowski
0 siblings, 0 replies; 14+ messages in thread
From: Dariusz Sosnowski @ 2025-07-16 10:55 UTC (permalink / raw)
To: Khadem Ullah; +Cc: dev, rasland, stable, viacheslavo, orika, suanmingm, matan
On Wed, Jul 16, 2025 at 03:23:19AM -0400, Khadem Ullah wrote:
> Hi Dariusz,
>
> Yes, you are right — I believe this has been fixed since DPDK v24.11 by the commit you mentioned:
> https://github.com/DPDK/dpdk/commit/c30b356a4d48542fe99c47aa470afc8cd1ced9f5
>
> Previously, this appeared to be an edge case where a segfault was triggered when a transfer rule was created (even when not using switchdev). But as confirmed, recent versions (v24.11.2 and newer) now correctly return ENOTSUP or a validation error instead of crashing.
>
> I understand that transfer rules are only valid in switchdev mode, and that using them outside that context is unsupported. I encountered the segfault when testing this specific combination, but it seems the issue has been properly addressed now in the latest versions.
>
> Thanks for confirming!
>
> Best regards,
> Khadem Ullah
No problem, happy to help.
Best regards,
Dariusz Sosnowski
* Re: [PATCH] net/mlx5: fix crash when using meter in transfer flow
2025-07-15 17:51 ` Dariusz Sosnowski
@ 2025-07-16 7:23 ` Khadem Ullah
2025-07-16 10:55 ` Dariusz Sosnowski
0 siblings, 1 reply; 14+ messages in thread
From: Khadem Ullah @ 2025-07-16 7:23 UTC (permalink / raw)
To: dsosnowski; +Cc: dev, rasland, stable, viacheslavo, orika, suanmingm, matan
Hi Dariusz,
Yes, you are right — I believe this has been fixed since DPDK v24.11 by the commit you mentioned:
https://github.com/DPDK/dpdk/commit/c30b356a4d48542fe99c47aa470afc8cd1ced9f5
Previously, this appeared to be an edge case where a segfault was triggered when a transfer rule was created (even when not using switchdev). But as confirmed, recent versions (v24.11.2 and newer) now correctly return ENOTSUP or a validation error instead of crashing.
I understand that transfer rules are only valid in switchdev mode, and that using them outside that context is unsupported. I encountered the segfault when testing this specific combination, but it seems the issue has been properly addressed now in the latest versions.
Thanks for confirming!
Best regards,
Khadem Ullah
* Re: [PATCH] net/mlx5: fix crash when using meter in transfer flow
2025-07-15 17:39 ` Dariusz Sosnowski
@ 2025-07-15 17:51 ` Dariusz Sosnowski
2025-07-16 7:23 ` Khadem Ullah
0 siblings, 1 reply; 14+ messages in thread
From: Dariusz Sosnowski @ 2025-07-15 17:51 UTC (permalink / raw)
To: Khadem Ullah
Cc: dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad
On Tue, Jul 15, 2025 at 07:39:53PM +0200, Dariusz Sosnowski wrote:
> + mlx5 maintainers
>
> Thank you for the patch.
>
> On Wed, Jul 02, 2025 at 08:36:17AM -0400, Khadem Ullah wrote:
> > When creating a flow rule with the transfer attribute and a meter action,
> > the driver did not validate this combination and would crash due to
> > unsupported handling.
> >
> > This patch adds explicit validation rejecting meter action in transfer
> > flows with an appropriate error message.
> >
> > Fixes: 46a5e6bc6a85 ("net/mlx5: prepare meter flow tables")
> > Cc: stable@dpdk.org
> >
> > Steps to reproduce:
> > 1. Launch testpmd:
> > ./build/app/dpdk-testpmd -l 0,1 -a <PCI BDF> -- -i --rxq=8 --txq=8
> >
> > 2. Inside testpmd:
> > add port meter profile trtcm_rfc2698 0 0 5 10 50 100 1
> > add port meter policy 0 0 g_actions mark id 3 / queue index 2 / end /
> > y_actions mark id 7 / queue index 3 / end r_actions drop / end
> > create port meter 0 0 0 0 yes 0xffff 0 y 0
> > 3. flow create 0 group 0 ingress pattern eth / ipv4 / end actions
> > jump group 1 / end
> > 4. The following causes a segmentation fault:
> > flow create 0 transfer ingress pattern eth / ipv4 /
> > end actions meter mtr_id 0 / end
>
> I tried these steps on v25.07-rc3 and the segfault does not reproduce for me.
> rte_flow_create() correctly returns ENOTSUP for that case.
>
> Similar case was segfaulting recently,
> but it was fixed by this commit: https://github.com/DPDK/dpdk/commit/c30b356a4d48542fe99c47aa470afc8cd1ced9f5
> This specific fix is available from v25.03 and is included in v24.11.2
>
> On other LTSes - v22.11.8 and v23.11.4 - segfault does not reproduce
> for me as well. Either ENOTSUP or validation error is reported.
>
> Which DPDK version are you using?
>
> I see that the flow rule is created with transfer attribute.
> Are you using a setup with switchdev enabled?
> (https://docs.kernel.org/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.html)
>
Also, for wider context:
- If switchdev is not enabled, then creating transfer flow
  rules is not allowed.
  Without switchdev enabled, mlx5 PMD cannot control
  the embedded switch of the NIC, so no transfer flow rules can be created.
  This is the reason for current behavior of returning ENOTSUP on main branch.
- If switchdev is enabled, transfer flow rules are allowed.
  There are however 2 important points:
  - Transfer attribute cannot be mixed with either ingress or egress.
    (https://doc.dpdk.org/guides/prog_guide/ethdev/flow_offload.html#attribute-transfer)
  - In default configuration, mlx5 PMD does not support QUEUE actions for transfer flow rules.
    In the steps above, meter policy contains QUEUE actions, so using
    such meter in transfer rules is not allowed and is rejected by the driver.
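The rules above can be summarised as a small validation routine. This is only a sketch under stated assumptions, not the mlx5 PMD's actual validation code: `struct flow_attr`, `struct port_caps` and `validate_transfer_rule()` are hypothetical, simplified stand-ins, the order of the checks is a guess, and apart from ENOTSUP for the no-switchdev case the specific error codes are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the flow rule attributes. */
struct flow_attr {
	bool ingress;
	bool egress;
	bool transfer;
};

/* Hypothetical, simplified view of what the port can do. */
struct port_caps {
	bool switchdev_enabled;        /* embedded switch is under driver control */
	bool transfer_queue_supported; /* false in the default configuration */
};

/*
 * Returns 0 if the rule may be created, or a negative errno value.
 * uses_queue_action is true if any action (including those in the
 * meter policy) sends packets to a queue.
 */
static int
validate_transfer_rule(const struct flow_attr *attr,
		       const struct port_caps *caps,
		       bool uses_queue_action)
{
	assert(attr != NULL && caps != NULL);

	if (!attr->transfer)
		return 0; /* nothing transfer-specific to check */
	if (!caps->switchdev_enabled)
		return -ENOTSUP; /* no control over the embedded switch */
	if (attr->ingress || attr->egress)
		return -EINVAL; /* transfer cannot be mixed with a direction */
	if (uses_queue_action && !caps->transfer_queue_supported)
		return -ENOTSUP; /* QUEUE actions not available here */
	return 0;
}
```

With these checks, the rule from the reproduction steps (transfer combined with ingress, and a meter policy containing QUEUE actions) is rejected on every path instead of reaching code that cannot handle it.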
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH] net/mlx5: fix crash when using meter in transfer flow
2025-07-02 12:36 [PATCH] net/mlx5: fix crash when using meter in transfer flow Khadem Ullah
@ 2025-07-15 17:39 ` Dariusz Sosnowski
2025-07-15 17:51 ` Dariusz Sosnowski
0 siblings, 1 reply; 14+ messages in thread
From: Dariusz Sosnowski @ 2025-07-15 17:39 UTC (permalink / raw)
To: Khadem Ullah
Cc: dev, rasland, stable, Viacheslav Ovsiienko, Bing Zhao, Ori Kam,
Suanming Mou, Matan Azrad
+ mlx5 maintainers
Thank you for the patch.
On Wed, Jul 02, 2025 at 08:36:17AM -0400, Khadem Ullah wrote:
> When creating a flow rule with the transfer attribute and a meter action,
> the driver did not validate this combination and would crash due to
> unsupported handling.
>
> This patch adds explicit validation rejecting meter action in transfer
> flows with an appropriate error message.
>
> Fixes: 46a5e6bc6a85 ("net/mlx5: prepare meter flow tables")
> Cc: stable@dpdk.org
>
> Steps to reproduce:
> 1. Launch testpmd:
> ./build/app/dpdk-testpmd -l 0,1 -a <PCI BDF> -- -i --rxq=8 --txq=8
>
> 2. Inside testpmd:
> add port meter profile trtcm_rfc2698 0 0 5 10 50 100 1
> add port meter policy 0 0 g_actions mark id 3 / queue index 2 / end /
> y_actions mark id 7 / queue index 3 / end r_actions drop / end
> create port meter 0 0 0 0 yes 0xffff 0 y 0
> 3. flow create 0 group 0 ingress pattern eth / ipv4 / end actions
> jump group 1 / end
> 4. The following causes a segmentation fault:
> flow create 0 transfer ingress pattern eth / ipv4 /
> end actions meter mtr_id 0 / end
I tried these steps on v25.07-rc3 and the segfault does not reproduce for me.
rte_flow_create() correctly returns ENOTSUP for that case.

A similar case was segfaulting recently,
but it was fixed by this commit: https://github.com/DPDK/dpdk/commit/c30b356a4d48542fe99c47aa470afc8cd1ced9f5
This specific fix is available from v25.03 and is included in v24.11.2.

On other LTSes - v22.11.8 and v23.11.4 - the segfault does not reproduce
for me either. Either ENOTSUP or a validation error is reported.

Which DPDK version are you using?

I see that the flow rule is created with the transfer attribute.
Are you using a setup with switchdev enabled?
(https://docs.kernel.org/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.html)

>
> This patch ensures proper handling of the meter action with
> a transfer rule to prevent this crash.
>
> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
> ---
> drivers/net/mlx5/mlx5_flow.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
> index 8db372123c..a7b793ef29 100644
> --- a/drivers/net/mlx5/mlx5_flow.c
> +++ b/drivers/net/mlx5/mlx5_flow.c
> @@ -7993,7 +7993,17 @@ mlx5_flow_create(struct rte_eth_dev *dev,
> struct rte_flow_attr *new_attr = (void *)(uintptr_t)attr;
> uint32_t prio = attr->priority;
> uintptr_t flow_idx;
> -
> + if (attr && attr->transfer) {
> + const struct rte_flow_action *act;
> + for (act = actions; act && act->type != RTE_FLOW_ACTION_TYPE_END; ++act) {
> + if (act->type == RTE_FLOW_ACTION_TYPE_METER) {
> + rte_flow_error_set(error, ENOTSUP,
> + RTE_FLOW_ERROR_TYPE_ACTION, act,
> + "Meter action is not supported in transfer flows");
> + return NULL;
> + }
> + }
> + }
> /*
> * If the device is not started yet, it is not allowed to created a
> * flow from application. PMD default flows and traffic control flows
> --
> 2.43.0
>
Best regards,
Dariusz Sosnowski
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH] net/mlx5: fix crash when using meter in transfer flow
@ 2025-07-02 12:36 Khadem Ullah
2025-07-15 17:39 ` Dariusz Sosnowski
0 siblings, 1 reply; 14+ messages in thread
From: Khadem Ullah @ 2025-07-02 12:36 UTC (permalink / raw)
To: dev; +Cc: rasland, stable, Khadem Ullah
When creating a flow rule with the transfer attribute and a meter action,
the driver did not validate this combination and would crash due to
unsupported handling.
This patch adds explicit validation rejecting meter action in transfer
flows with an appropriate error message.
Fixes: 46a5e6bc6a85 ("net/mlx5: prepare meter flow tables")
Cc: stable@dpdk.org
Steps to reproduce:
1. Launch testpmd:
./build/app/dpdk-testpmd -l 0,1 -a <PCI BDF> -- -i --rxq=8 --txq=8
2. Inside testpmd:
add port meter profile trtcm_rfc2698 0 0 5 10 50 100 1
add port meter policy 0 0 g_actions mark id 3 / queue index 2 / end /
y_actions mark id 7 / queue index 3 / end r_actions drop / end
create port meter 0 0 0 0 yes 0xffff 0 y 0
3. flow create 0 group 0 ingress pattern eth / ipv4 / end actions
jump group 1 / end
4. The following causes a segmentation fault:
flow create 0 transfer ingress pattern eth / ipv4 /
end actions meter mtr_id 0 / end
This patch ensures proper handling of the meter action with
a transfer rule to prevent this crash.
Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk>
---
drivers/net/mlx5/mlx5_flow.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/drivers/net/mlx5/mlx5_flow.c b/drivers/net/mlx5/mlx5_flow.c
index 8db372123c..a7b793ef29 100644
--- a/drivers/net/mlx5/mlx5_flow.c
+++ b/drivers/net/mlx5/mlx5_flow.c
@@ -7993,7 +7993,17 @@ mlx5_flow_create(struct rte_eth_dev *dev,
 	struct rte_flow_attr *new_attr = (void *)(uintptr_t)attr;
 	uint32_t prio = attr->priority;
 	uintptr_t flow_idx;
-
+	if (attr && attr->transfer) {
+		const struct rte_flow_action *act;
+		for (act = actions; act && act->type != RTE_FLOW_ACTION_TYPE_END; ++act) {
+			if (act->type == RTE_FLOW_ACTION_TYPE_METER) {
+				rte_flow_error_set(error, ENOTSUP,
+					RTE_FLOW_ERROR_TYPE_ACTION, act,
+					"Meter action is not supported in transfer flows");
+				return NULL;
+			}
+		}
+	}
 	/*
 	 * If the device is not started yet, it is not allowed to created a
 	 * flow from application. PMD default flows and traffic control flows
--
2.43.0
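
The check added by this patch amounts to scanning an END-terminated action array for a METER entry when the rule carries the transfer attribute. A minimal standalone sketch of that scan follows; the `demo_` types and the function are hypothetical stand-ins, not the rte_flow definitions and not the driver's code. The sketch checks both pointers before dereferencing them.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; NOT the rte_flow definitions. */
enum demo_action_type {
	DEMO_ACTION_END,
	DEMO_ACTION_JUMP,
	DEMO_ACTION_METER,
};

struct demo_action {
	enum demo_action_type type;
};

struct demo_attr {
	bool transfer;
};

/*
 * Return the first METER action of a transfer rule, or NULL if the rule is
 * not a transfer rule or holds no METER action. Both pointers are checked
 * before they are dereferenced.
 */
static const struct demo_action *
demo_find_transfer_meter(const struct demo_attr *attr,
			 const struct demo_action *actions)
{
	const struct demo_action *act;

	if (attr == NULL || actions == NULL || !attr->transfer)
		return NULL;
	for (act = actions; act->type != DEMO_ACTION_END; ++act)
		if (act->type == DEMO_ACTION_METER)
			return act;
	return NULL;
}
```

A caller could use the returned pointer as the offending action when reporting an error, in the same way the patch passes `act` to rte_flow_error_set().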
^ permalink raw reply [flat|nested] 14+ messages in thread
Thread overview: 14+ messages
2025-07-21 7:38 [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Khadem Ullah
2025-07-21 10:58 ` Dariusz Sosnowski
2025-07-21 11:38 ` [PATCH] net/mlx5: fix crash when using meter in transfer flow Khadem Ullah
2025-07-21 11:46 ` [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Ivan Malov
2025-07-21 11:59 ` Thomas Monjalon
2025-07-21 14:01 ` Dariusz Sosnowski
2025-07-22 12:14 ` [PATCH] ethdev: add dev_private check for secondary process Khadem Ullah
2025-07-22 16:26 ` [PATCH] net/mlx5: fix crash when secondary queries dev info after primary exits Ivan Malov
2025-07-21 14:50 ` Stephen Hemminger
-- strict thread matches above, loose matches on Subject: below --
2025-07-02 12:36 [PATCH] net/mlx5: fix crash when using meter in transfer flow Khadem Ullah
2025-07-15 17:39 ` Dariusz Sosnowski
2025-07-15 17:51 ` Dariusz Sosnowski
2025-07-16 7:23 ` Khadem Ullah
2025-07-16 10:55 ` Dariusz Sosnowski