Hi Ivan, Thanks for the detailed explanation. While primary termination is a known limitation, I believe the application or PMD should handle edge cases proactively to ensure stability. On Tue, Jul 22, 2025 at 9:26 PM Ivan Malov wrote: > Hi Dariusz, Khadem, > > On Mon, 21 Jul 2025, Dariusz Sosnowski wrote: > > > On Mon, Jul 21, 2025 at 01:59:59PM +0200, Thomas Monjalon wrote: > >> 21/07/2025 13:46, Ivan Malov: > >>> On Mon, 21 Jul 2025, Dariusz Sosnowski wrote: > >>> > >>>> + mlx5 maintainers > >>>> > >>>> Thank you for the patch. > >>>> > >>>> Could you please include other PMD maintainers (or other maintainers, > depending on changed code) > >>>> in the future patches? > >>>> There is a script which automatically adds maintainers while sending > a patch. > >>>> It is described in: > https://doc.dpdk.org/guides/contributing/patches.html#sending-patches > >>>> > >>>> On Mon, Jul 21, 2025 at 03:38:51AM -0400, Khadem Ullah wrote: > >>>>> When the primary process exits, the shared mlx5 state becomes > >>>>> unavailable to secondary processes. If a secondary process attempts > >>>>> to query device information (e.g., via testpmd), a NULL dereference > >>>>> may occur due to missing shared data. > >>>>> > >>>>> This patch adds a check for shared context availability and fails > >>>>> gracefully while preventing a crash. > >>>>> > >>>>> Fixes: e60fbd5b24fc ("mlx5: add device configure/start/stop") > >>>>> Cc: stable@dpdk.org > >>>>> > >>>>> Signed-off-by: Khadem Ullah <14pwcse1224@uetpeshawar.edu.pk> > >>>>> --- > >>>>> drivers/net/mlx5/mlx5_ethdev.c | 6 ++++++ > >>>>> 1 file changed, 6 insertions(+) > >>>>> > >>>>> diff --git a/drivers/net/mlx5/mlx5_ethdev.c > b/drivers/net/mlx5/mlx5_ethdev.c > >>>>> index 68d1c1bfa7..1848f6536a 100644 > >>>>> --- a/drivers/net/mlx5/mlx5_ethdev.c > >>>>> +++ b/drivers/net/mlx5/mlx5_ethdev.c > >>>>> @@ -368,6 +368,12 @@ mlx5_dev_infos_get(struct rte_eth_dev *dev, > struct rte_eth_dev_info *info) > >>>>> * Since we need one CQ per QP, the limit is the minimum number > >>>>> * between the two values. > >>>>> */ > >>>>> + if (priv == NULL || priv->sh == NULL) { > >>>>> + DRV_LOG(ERR, > >>>>> + "mlx5 shared data unavailable (primary process likely > exited)"); > >>>>> + rte_errno = ENODEV; > >>>>> + return -rte_errno; > >>>>> + } > >>>> > >>>> I don't think it's an issue on PMD level, but rather on > >>>> ethdev/multi-process handling level. > >>> > >>> I may be very wrong here, but can't the PMD use its internal 'shared' > state to > >>> somehow memorise the fact that a secondary process has attached, in > order to not > >>> let the primary process close in the first place (before the secondary > process > >>> detaches)? Or is this going to violate some established convention? > >> > >> It looks to be a good idea. > > > > I agree with idea of adding these checks, but not entirely agree with > > it being at driver level, since all drivers would have to duplicate this > logic. > > > > Drivers already have to go through ethdev library when port is probed > > on secondary process - rte_eth_dev_attach_secondary() must be > > called to retrieve port data local to the process and device data shared > > between processes. > > Memorizing whether a secondary process is using given port can be added > > to rte_eth_dev_attach_secondary(), and relevant check for primary > > process can then be added to rte_eth_dev_close(), so that all drivers > > benefit. > > > > What do you think? > > Yes, in general, the idea would be to have a "secondary reference counter" > field > in 'rte_eth_dev_data' and have it incremented/decremented/checked in some > generic places. The issue is that, in addition to 'rte_eth_dev_close', a > device > can be "closed" via its respective bus 'remove' method. Yes, there are > drivers > that invoke 'rte_eth_dev_close' inside the 'remove' implementation, but > not all > of them, unfortunately. For example, take a look at 'af_packet' and > similar. > > On the other hands, there are drivers that use 'rte_eth_dev_create' and its > counterpart 'rte_eth_dev_destroy' in the implementations of bus > 'probe/remove', > but that is also not consistent across the 'drivers/net' tree. > > Given all these issues and the fact that ethdev is not the only possible > device > type, may be 'rte_device' can house the reference counter, with > 'rte_dev_probe' > and 'rte_dev_remove' augmented to do the inrement/decrement/validate thing? > And, since 'rte_eth_dev_close' invokes direct ethdev 'close' method, also > add a > check to over there (and may be to 'rte_eth_dev_reset', too)? May be I'm > wrong. > > However, all of that might still make no sense, given the fact that the > primary > process can just die. So, may be I am indeed very wrong and, as Stephen has > suggested in [1], this issue it to be addressed by clarifying the > documentation. > > [1] https://mails.dpdk.org/archives/dev/2025-July/321865.html > > Thank you. > > > > >> > >>>> When primary process closes the port, ethdev library zeroes and frees > >>>> device data shared between processes. > >>>> ethdev port data (rte_eth_dev) on secondary is not updated so it now > points to > >>>> invalid data. rte_eth_dev_info_get() is not the only API call > affected. > >>>> > >>>> If the primary process closes the port before exiting > >>>> (like testpmd does) and it exits before the secondary, > >>>> the any driver call seems invalid because of that use-after-free > behavior. > >>>> > >>>> @Thomas, @Andrew - Do you happen to know if doing anything on ethdev > ports > >>>> in secondary process after primary has gracefully exited is supported? > >> > >> No there is no statement about whether it is supported or not. > >> I think we should at least return an error instead of crashing. > >> > >> > > > -- Engr. Khadem Ullah, Software Engineer, Dreambig Semiconductor Inc https://dreambigsemi.com/