From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
To: Konstantin Ananyev <konstantin.ananyev@huawei.com>,
Fengchengwen <fengchengwen@huawei.com>,
Stephen Hemminger <stephen@networkplumber.org>,
Ruifeng Wang <Ruifeng.Wang@arm.com>,
"Ajit Khaparde (ajit.khaparde@broadcom.com)"
<ajit.khaparde@broadcom.com>
Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>,
"jerinj@marvell.com" <jerinj@marvell.com>,
"thomas@monjalon.net" <thomas@monjalon.net>,
"dev@dpdk.org" <dev@dpdk.org>,
"s.v.naga.harish.k@intel.com" <s.v.naga.harish.k@intel.com>,
"erik.g.carrillo@intel.com" <erik.g.carrillo@intel.com>,
"abhinandan.gujjar@intel.com" <abhinandan.gujjar@intel.com>,
"stable@dpdk.org" <stable@dpdk.org>, nd <nd@arm.com>,
nd <nd@arm.com>
Subject: RE: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
Date: Tue, 28 Feb 2023 23:57:50 +0000 [thread overview]
Message-ID: <DBAPR08MB581445653C4211BD62E2BBB398AC9@DBAPR08MB5814.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <54fbf4e55cd44477b1e956f98a7a3c50@huawei.com>
> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@huawei.com>
> Sent: Thursday, February 23, 2023 7:31 AM
> To: Fengchengwen <fengchengwen@huawei.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Stephen Hemminger
> <stephen@networkplumber.org>; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> Ajit Khaparde (ajit.khaparde@broadcom.com)
> <ajit.khaparde@broadcom.com>
> Cc: Ashok Kaladi <ashok.k.kaladi@intel.com>; jerinj@marvell.com;
> thomas@monjalon.net; dev@dpdk.org; s.v.naga.harish.k@intel.com;
> erik.g.carrillo@intel.com; abhinandan.gujjar@intel.com; stable@dpdk.org; nd
> <nd@arm.com>
> Subject: RE: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup
>
>
>
> > >>>>>>> If ethdev enqueue or dequeue function is called during
> > >>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting
> > >>>>>>> the function pointers, but before setting the pointer to port data.
> > >>>>>>> In this case the newly registered enqueue/dequeue function
> > >>>>>>> will use dummy port data and end up in seg fault.
> > >>>>>>>
> > >>>>>>> This patch moves the updation of each data pointers before
> > >>>>>>> updating corresponding function pointers.
> > >>>>>>>
> > >>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into separate
> > >>>>>>> structure")
> > >>>>>>> Cc: stable@dpdk.org
> > >>>>
> > >>>> Why is something calling enqueue/dequeue when device is not fully
> > >> started.
> > >>>> A correctly written application would not call rx/tx burst until
> > >>>> after ethdev start had finished.
> > >>>
> > >>> Please refer the eb0d471a894 (ethdev: add proactive error handling
> > >>> mode), when driver recover itself, the application may still
> > >>> invoke
> > >> enqueue/dequeue API.
> > >>
> > >> Right now DPDK ethdev layer *does not* provide synchronization
> > >> mechanisms between data-path and control-path functions.
> > >> That was a deliberate deisgn choice. If we want to change that
> > >> rule, then I suppose we need a community consensus for it.
> > >> I think that if the driver wants to provide some sort of error
> > >> recovery procedure, then it has to provide some synchronization
> > >> mechanism inside it between data-path and control-path functions.
> > >> Actually looking at eb0d471a894 (ethdev: add proactive error
> > >> handling mode), and following patches I wonder how it creeped in?
> > >> It seems we just introduced a loophole for race condition with this
> > >> approach...
> >
> > Could you try to describe the specific scenario of loophole ?
>
> Ok, as I understand the existing mechanism:
>
> When PMD wants to start a recovery it has to:
> - invoke rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> That supposed to call user provided callback. After callback is finished PMD
> assumes
> that user is aware that recovery is about to start and should make some
> precautions.
> - when recovery is finished it invokes another callback:
> RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either can
> continue to
> use port or have to treat is as faulty.
>
> The idea is ok in principle, but there is a problem.
>
> lib/ethdev/rte_ethdev.h:
>
> /** Port recovering from a hardware or firmware error.
> * If PMD supports proactive error recovery,
> * it should trigger this event to notify application
> * that it detected an error and the recovery is being started.
>
> <<< !!!!!
> * Upon receiving the event, the application should not invoke any control
> path API
> * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
> * RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
> * The PMD will set the data path pointers to dummy functions,
> * and re-set the data path pointers to non-dummy functions
> * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
> <<< !!!!!
>
> That part is just wrong I believe.
> It should be:
> Upon receiving the event, the application should not invoke any *both control
> and data-path* API until receiving RTE_ETH_EVENT_RECOVERY_SUCCESS or
> RTE_ETH_EVENT_RECOVERY_FAILED event.
> Resetting data path pointers to dummy functions by PMD *before* invoking
> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING);
> introduces a race-condition with data-path threads, as such thread could
> already be inside RX/TX function or can already read RX/TX function/data
> pointers and be about to use them.
> And right now rte_ethdev layer doesn't provide any mechanism to check it or
> wait when they'll finish, etc.
>
> So, probably the simplest way to fix it with existing DPDK design:
> - user level callback RTE_ETH_EVENT_ERR_RECOVERING should return only
> after it ensures that *all*
> application threads (and processes) stopped using either control or data-path
> functions for that port
> (yes it means that application that wants to use this feature has to provide its
> own synchronization mechanism
> around data-path functions (RX/TX) that it is going to use).
Does this mean the application does not call either control plane or data plane APIs after the callback returns?
If the application can do this in the call back function, can it do the same outside of the call back function?
Correct me if I am wrong, I believe the call back is called in the context of the EAL thread. There could be multiple threads using the same port. There is a possibility that all these threads might call the call back function. So, who owns the responsibility to ensure the call back function is executed only once? PMD or the call back function?
> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones.
Why is this required if the control plane and data plane threads are not going to call any APIs? Resetting all these pointers is not at atomic operation, does it introduce any problems? For ex: if the application has to call function1 and function2 in sequence, what happens if function1 was not null but function2 became NULL by the time it is called?
How about a more simpler approach?
It should be possible to return an error code from the rte_eth_rx_burst API. The responsibility to stop calling any control plane and data plane APIs (this requires a simple synchronization mechanism. The cost of that should be less when there are no errors. I see applications like VPP already implement them) can be left to the application. The application can call the recovery API and release all the threads if the recovery was successful.
>
> And message to all PMD developers:
> *please stop updating rte_eth_fp_ops[] on your own*.
> That's a bad practice and it is not supposed to do things that way.
> There is a special API provided for these purposes:
> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it.
>
> BTW, I don't see any implementation for RTE_ETH_EVENT_ERR_RECOVERING
> within either testpmd or any other example apps.
> Am I missing something?
> If not, then probably it could be a good starting point - let's incorporate it inside
> testpmd (new forwarding engine probably) so everyone can test/try it.
>
> * It means that the application cannot send or receive any packets
> * during this period.
> * @note Before the PMD reports the recovery result,
> * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event
> again,
> * because a larger error may occur during the recovery.
> */
> RTE_ETH_EVENT_ERR_RECOVERING,
>
> > >> It probably needs to be either deprecated or reworked.
> > > Looking at the commit, it does not say anything about the data plane
> > > functions which probably means, the error recovery is
> > happening within the data plane thread. What happens to other data
> > plane threads that are polling the same port on which the error recovery is
> happening?
> >
> > The commit log says: "the PMD sets the data path pointers to dummy
> functions".
> >
> > So the data plane threads will receive non-packet and send zero with port
> which in error recovery.
> >
> > >
> > > Also, the commit log says that while the error recovery is under
> > > progress, the application should not call any control plane APIs.
> > > Does
> > that mean, the application has to check for error condition every time it calls a
> control plane API?
> >
> > If application has not register event (RTE_ETH_EVENT_ERR_RECOVERING)
> > callback, it could calls control plane API, but it will return failed.
> > If application has register above callback, it can wait for recovery result, or
> direct call without wait but this will return failed.
> >
> > >
> > > The commit message also says that "PMD makes sure the control path
> > > operations failed with retcode -EBUSY". It does not say how it
> > does this. But, any communication from the PMD thread to control plane
> thread may introduce race conditions if not done correctly.
> >
> > First there are no PMD thread, do you mean eal-intr-thread ?
> >
> > As for this question, you can see PMDs which already implement it, they both
> provides mutual exclusion protection.
> >
> > >
> > >>
> > >>>
> > >>>>
> > >>>> Would something like this work better?
> > >>>>
> > >>>> Note: there is another bug in current code. The check for link
> > >>>> state interrupt and link_ops could return -ENOTSUP and leave
> > >>>> device in
> > >> indeterminate state.
> > >>>> The check should be done before calling PMD.
> > >>>>
> > >>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> > >>>> index
> > >>>> 0266cc82acb6..d6c163ed85e7 100644
> > >>>> --- a/lib/ethdev/rte_ethdev.c
> > >>>> +++ b/lib/ethdev/rte_ethdev.c
> > >>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>> return 0;
> > >>>> }
> > >>>>
> > >>>> + if (dev->data->dev_conf.intr_conf.lsc == 0 &&
> > >>>> + dev->dev_ops->link_update == NULL) {
> > >>>> + RTE_ETHDEV_LOG(INFO,
> > >>>> + "Device with port_id=%"PRIu16" link
> update not
> > >> supported\n",
> > >>>> + port_id);
> > >>>> + return -ENOTSUP;
> > >>>> + }
> > >>>> +
> > >>>> ret = rte_eth_dev_info_get(port_id, &dev_info);
> > >>>> if (ret != 0)
> > >>>> return ret;
> > >>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>> eth_dev_mac_restore(dev, &dev_info);
> > >>>>
> > >>>> diag = (*dev->dev_ops->dev_start)(dev);
> > >>>> - if (diag == 0)
> > >>>> - dev->data->dev_started = 1;
> > >>>> - else
> > >>>> + if (diag != 0)
> > >>>> return eth_err(port_id, diag);
> > >>>>
> > >>>> ret = eth_dev_config_restore(dev, &dev_info, port_id); @@
> > >>>> -1611,16
> > >>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id)
> > >>>> return ret;
> > >>>> }
> > >>>>
> > >>>> - if (dev->data->dev_conf.intr_conf.lsc == 0) {
> > >>>> - if (*dev->dev_ops->link_update == NULL)
> > >>>> - return -ENOTSUP;
> > >>>> - (*dev->dev_ops->link_update)(dev, 0);
> > >>>> - }
> > >>>> -
> > >>>> /* expose selection of PMD fast-path functions */
> > >>>> eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev);
> > >>>>
> > >>>> + /* ensure state is set before marking device ready */
> > >>>> + rte_smp_wmb();
> > >>>> +
> > >>>> rte_ethdev_trace_start(port_id);
> > >>>> +
> > >>>> + /* Update current link state */
> > >>>> + if (dev->data->dev_conf.intr_conf.lsc == 0)
> > >>>> + (*dev->dev_ops->link_update)(dev, 0);
> > >>>> +
> > >>>> return 0;
> > >>>> }
> > >>>>
> > >>>>
> > >>>> .
> > >>>>
> > >
next prev parent reply other threads:[~2023-02-28 23:58 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-20 6:08 [PATCH 1/2] eventdev: fix race condition in fast-path set function Ashok Kaladi
2023-02-20 6:08 ` [PATCH 2/2] ethdev: fix race condition in fast-path ops setup Ashok Kaladi
2023-02-20 6:57 ` fengchengwen
2023-02-21 7:24 ` Ruifeng Wang
2023-02-21 17:00 ` Stephen Hemminger
2023-02-22 1:07 ` fengchengwen
2023-02-22 9:41 ` Ruifeng Wang
2023-02-22 10:41 ` Konstantin Ananyev
2023-02-22 22:48 ` Honnappa Nagarahalli
2023-02-23 1:15 ` Stephen Hemminger
2023-02-23 4:47 ` Honnappa Nagarahalli
2023-02-23 4:40 ` Honnappa Nagarahalli
2023-02-23 8:23 ` fengchengwen
2023-02-23 13:31 ` Konstantin Ananyev
2023-02-25 1:32 ` fengchengwen
2023-02-28 23:57 ` Honnappa Nagarahalli [this message]
2023-02-20 7:01 ` fengchengwen
2023-02-20 9:44 ` Konstantin Ananyev
2023-03-03 16:49 ` Ferruh Yigit
-- strict thread matches above, loose matches on Subject: below --
2023-02-20 6:06 [PATCH 1/2] eventdev: fix race condition in fast-path set function Ashok Kaladi
2023-02-20 6:06 ` [PATCH 2/2] ethdev: fix race condition in fast-path ops setup Ashok Kaladi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=DBAPR08MB581445653C4211BD62E2BBB398AC9@DBAPR08MB5814.eurprd08.prod.outlook.com \
--to=honnappa.nagarahalli@arm.com \
--cc=Ruifeng.Wang@arm.com \
--cc=abhinandan.gujjar@intel.com \
--cc=ajit.khaparde@broadcom.com \
--cc=ashok.k.kaladi@intel.com \
--cc=dev@dpdk.org \
--cc=erik.g.carrillo@intel.com \
--cc=fengchengwen@huawei.com \
--cc=jerinj@marvell.com \
--cc=konstantin.ananyev@huawei.com \
--cc=nd@arm.com \
--cc=s.v.naga.harish.k@intel.com \
--cc=stable@dpdk.org \
--cc=stephen@networkplumber.org \
--cc=thomas@monjalon.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).