From: Konstantin Ananyev <konstantin.v.ananyev@yandex.ru>
To: dev@dpdk.org
Subject: Re: [PATCH 1/5] ethdev: fix race-condition of proactive error handling mode
Date: Sun, 5 Mar 2023 14:53:25 +0000 [thread overview]
Message-ID: <18c5b676-ae72-e646-89af-d6cd636d923f@yandex.ru> (raw)
In-Reply-To: <bb00bf60-583c-ba6a-92d8-565a875008ab@amd.com>
03/03/2023 16:51, Ferruh Yigit пишет:
> On 3/2/2023 12:08 PM, Konstantin Ananyev wrote:
>>
>>> In the proactive error handling mode, the PMD will set the data path
>>> pointers to dummy functions and then try recovery, in this period the
>>> application may still invoking data path API. This will introduce a
>>> race-condition with data path which may lead to crash [1].
>>>
>>> Although the PMD added delay after setting data path pointers to cover
>>> the above race-condition, it reduces the probability, but it doesn't
>>> solve the problem.
>>>
>>> To solve the race-condition problem fundamentally, the following
>>> requirements are added:
>>> 1. The PMD should set the data path pointers to dummy functions after
>>> report RTE_ETH_EVENT_ERR_RECOVERING event.
>>> 2. The application should stop data path API invocation when process
>>> the RTE_ETH_EVENT_ERR_RECOVERING event.
>>> 3. The PMD should set the data path pointers to valid functions before
>>> report RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>> 4. The application should enable data path API invocation when process
>>> the RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>>
>
> How this is solving the race-condition, by pushing responsibility to
> stop data path to application?
Exactly, it becomes application responsibility to make sure data-path is
stopped/suspended before recovery will continue.
>
> What if application is not interested in recovery modes at all and not
> registered any callback for the recovery?
Are you saying there is no way for application to disable
automatic recovery in PMD if it is not interested
(or can't full-fill per-requesties for it)?
If so, then yes it is a problem and we need to fix it.
I assumed that such mechanism to disable unwanted events already exists,
but I can't find anything.
Wonder what would be the easiest way here - can PMD make a decision
based on callback return value, or do we need a new API to
enable/disable callbacks, or ...?
> I think driver should not rely on application for this, unless
> application explicitly says (to driver) that it is handling recovery,
> right now there is no way for driver to know this.
I think it is visa-versa:
application should not enable auto-recovery if it can't meet
per-requeststies for it (provide appropriate callback).
>
>>> Also, this patch introduce a driver internal function
>>> rte_eth_fp_ops_setup which used as an help function for PMD.
>>>
>>> [1] http://patchwork.dpdk.org/project/dpdk/patch/20230220060839.1267349-2-ashok.k.kaladi@intel.com/
>>>
>>> Fixes: eb0d471a8941 ("ethdev: add proactive error handling mode")
>>> Cc: stable@dpdk.org
>>>
>>> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
>>> ---
>>> doc/guides/prog_guide/poll_mode_drv.rst | 20 +++++++---------
>>> lib/ethdev/ethdev_driver.c | 8 +++++++
>>> lib/ethdev/ethdev_driver.h | 10 ++++++++
>>> lib/ethdev/rte_ethdev.h | 32 +++++++++++++++----------
>>> lib/ethdev/version.map | 1 +
>>> 5 files changed, 46 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst
>>> index c145a9066c..e380ff135a 100644
>>> --- a/doc/guides/prog_guide/poll_mode_drv.rst
>>> +++ b/doc/guides/prog_guide/poll_mode_drv.rst
>>> @@ -638,14 +638,9 @@ different from the application invokes recovery in PASSIVE mode,
>>> the PMD automatically recovers from error in PROACTIVE mode,
>>> and only a small amount of work is required for the application.
>>>
>>> -During error detection and automatic recovery,
>>> -the PMD sets the data path pointers to dummy functions
>>> -(which will prevent the crash),
>>> -and also make sure the control path operations fail with a return code ``-EBUSY``.
>>> -
>>> -Because the PMD recovers automatically,
>>> -the application can only sense that the data flow is disconnected for a while
>>> -and the control API returns an error in this period.
>>> +During error detection and automatic recovery, the PMD sets the data path
>>> +pointers to dummy functions and also make sure the control path operations
>>> +failed with a return code ``-EBUSY``.
>>>
>>> In order to sense the error happening/recovering,
>>> as well as to restore some additional configuration,
>>> @@ -653,9 +648,9 @@ three events are available:
>>>
>>> ``RTE_ETH_EVENT_ERR_RECOVERING``
>>> Notify the application that an error is detected
>>> - and the recovery is being started.
>>> + and the recovery is about to start.
>>> Upon receiving the event, the application should not invoke
>>> - any control path function until receiving
>>> + any control and data path API until receiving
>>> ``RTE_ETH_EVENT_RECOVERY_SUCCESS`` or ``RTE_ETH_EVENT_RECOVERY_FAILED`` event.
>>>
>>> .. note::
>>> @@ -666,8 +661,9 @@ three events are available:
>>>
>>> ``RTE_ETH_EVENT_RECOVERY_SUCCESS``
>>> Notify the application that the recovery from error is successful,
>>> - the PMD already re-configures the port,
>>> - and the effect is the same as a restart operation.
>>> + the PMD already re-configures the port.
>>> + The application should restore some additional configuration, and then
>>> + enable data path API invocation.
>>>
>>> ``RTE_ETH_EVENT_RECOVERY_FAILED``
>>> Notify the application that the recovery from error failed,
>>> diff --git a/lib/ethdev/ethdev_driver.c b/lib/ethdev/ethdev_driver.c
>>> index 0be1e8ca04..f994653fe9 100644
>>> --- a/lib/ethdev/ethdev_driver.c
>>> +++ b/lib/ethdev/ethdev_driver.c
>>> @@ -515,6 +515,14 @@ rte_eth_dma_zone_free(const struct rte_eth_dev *dev, const char *ring_name,
>>> return rc;
>>> }
>>>
>>> +void
>>> +rte_eth_fp_ops_setup(struct rte_eth_dev *dev)
>>> +{
>>> + if (dev == NULL)
>>> + return;
>>> + eth_dev_fp_ops_setup(rte_eth_fp_ops + dev->data->port_id, dev);
>>> +}
>>> +
>>> const struct rte_memzone *
>>> rte_eth_dma_zone_reserve(const struct rte_eth_dev *dev, const char *ring_name,
>>> uint16_t queue_id, size_t size, unsigned int align,
>>> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
>>> index 2c9d615fb5..0d964d1f67 100644
>>> --- a/lib/ethdev/ethdev_driver.h
>>> +++ b/lib/ethdev/ethdev_driver.h
>>> @@ -1621,6 +1621,16 @@ int
>>> rte_eth_dma_zone_free(const struct rte_eth_dev *eth_dev, const char *name,
>>> uint16_t queue_id);
>>>
>>> +/**
>>> + * @internal
>>> + * Setup eth fast-path API to ethdev values.
>>> + *
>>> + * @param dev
>>> + * Pointer to struct rte_eth_dev.
>>> + */
>>> +__rte_internal
>>> +void rte_eth_fp_ops_setup(struct rte_eth_dev *dev);
>>> +
>>> /**
>>> * @internal
>>> * Atomically set the link status for the specific device.
>>> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
>>> index 049641d57c..44ee7229c1 100644
>>> --- a/lib/ethdev/rte_ethdev.h
>>> +++ b/lib/ethdev/rte_ethdev.h
>>> @@ -3944,25 +3944,28 @@ enum rte_eth_event_type {
>>> */
>>> RTE_ETH_EVENT_RX_AVAIL_THRESH,
>>> /** Port recovering from a hardware or firmware error.
>>> - * If PMD supports proactive error recovery,
>>> - * it should trigger this event to notify application
>>> - * that it detected an error and the recovery is being started.
>>> - * Upon receiving the event, the application should not invoke any control path API
>>> - * (such as rte_eth_dev_configure/rte_eth_dev_stop...) until receiving
>>> - * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED event.
>>> - * The PMD will set the data path pointers to dummy functions,
>>> - * and re-set the data path pointers to non-dummy functions
>>> - * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>> - * It means that the application cannot send or receive any packets
>>> - * during this period.
>>> + *
>>> + * If PMD supports proactive error recovery, it should trigger this
>>> + * event to notify application that it detected an error and the
>>> + * recovery is about to start.
>>> + *
>>> + * Upon receiving the event, the application should not invoke any
>>> + * control and data path API until receiving
>>> + * RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED
>>> + * event.
>>> + *
>>> + * Once this event is reported, the PMD will set the data path pointers
>>> + * to dummy functions, and re-set the data path pointers to valid
>>> + * functions before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event.
>>> + *
>>> * @note Before the PMD reports the recovery result,
>>> * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING event again,
>>> * because a larger error may occur during the recovery.
>>> */
>>> RTE_ETH_EVENT_ERR_RECOVERING,
>>> /** Port recovers successfully from the error.
>>> - * The PMD already re-configured the port,
>>> - * and the effect is the same as a restart operation.
>>> + *
>>> + * The PMD already re-configured the port:
>>> * a) The following operation will be retained: (alphabetically)
>>> * - DCB configuration
>>> * - FEC configuration
>>> @@ -3989,6 +3992,9 @@ enum rte_eth_event_type {
>>> * (@see RTE_ETH_DEV_CAPA_FLOW_SHARED_OBJECT_KEEP)
>>> * c) Any other configuration will not be stored
>>> * and will need to be re-configured.
>>> + *
>>> + * The application should restore some additional configuration
>>> + * (see above case b/c), and then enable data path API invocation.
>>> */
>>> RTE_ETH_EVENT_RECOVERY_SUCCESS,
>>> /** Port recovery failed.
>>> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
>>> index 357d1a88c0..c273e0bdae 100644
>>> --- a/lib/ethdev/version.map
>>> +++ b/lib/ethdev/version.map
>>> @@ -320,6 +320,7 @@ INTERNAL {
>>> rte_eth_devices;
>>> rte_eth_dma_zone_free;
>>> rte_eth_dma_zone_reserve;
>>> + rte_eth_fp_ops_setup;
>>> rte_eth_hairpin_queue_peer_bind;
>>> rte_eth_hairpin_queue_peer_unbind;
>>> rte_eth_hairpin_queue_peer_update;
>>> --
>>
>> Acked-by: Konstantin Ananyev <konstantin.ananyev@huawei.com>
>>
>>> 2.17.1
>>
>
next prev parent reply other threads:[~2023-03-05 14:53 UTC|newest]
Thread overview: 95+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-01 3:06 [PATCH 0/5] " Chengwen Feng
2023-03-01 3:06 ` [PATCH 1/5] ethdev: " Chengwen Feng
2023-03-02 12:08 ` Konstantin Ananyev
2023-03-03 16:51 ` Ferruh Yigit
2023-03-05 14:53 ` Konstantin Ananyev [this message]
2023-03-06 8:55 ` Ferruh Yigit
2023-03-06 10:22 ` Konstantin Ananyev
2023-03-06 11:00 ` Ferruh Yigit
2023-03-06 11:05 ` Ajit Khaparde
2023-03-06 11:13 ` Konstantin Ananyev
2023-03-07 8:25 ` fengchengwen
2023-03-07 9:52 ` Konstantin Ananyev
2023-03-07 10:11 ` Konstantin Ananyev
2023-03-07 12:07 ` Ferruh Yigit
2023-03-07 12:26 ` fengchengwen
2023-03-07 12:39 ` Konstantin Ananyev
2023-03-09 2:05 ` Ajit Khaparde
2023-03-06 1:41 ` fengchengwen
2023-03-06 8:57 ` Ferruh Yigit
2023-03-06 9:10 ` Ferruh Yigit
2023-03-02 23:30 ` Honnappa Nagarahalli
2023-03-03 0:21 ` Konstantin Ananyev
2023-03-04 5:08 ` Honnappa Nagarahalli
2023-03-05 15:23 ` Konstantin Ananyev
2023-03-07 5:34 ` Honnappa Nagarahalli
2023-03-07 8:39 ` fengchengwen
2023-03-08 1:09 ` Honnappa Nagarahalli
2023-03-09 0:59 ` fengchengwen
2023-03-09 3:03 ` Honnappa Nagarahalli
2023-03-09 11:30 ` fengchengwen
2023-03-10 3:25 ` Honnappa Nagarahalli
2023-03-07 9:56 ` Konstantin Ananyev
2023-03-01 3:06 ` [PATCH 2/5] net/hns3: replace fp ops config function Chengwen Feng
2023-03-02 6:50 ` Dongdong Liu
2023-03-01 3:06 ` [PATCH 3/5] net/bnxt: fix race-condition when report error recovery Chengwen Feng
2023-03-02 12:23 ` Konstantin Ananyev
2023-03-01 3:06 ` [PATCH 4/5] net/bnxt: use fp ops setup function Chengwen Feng
2023-03-02 12:30 ` Konstantin Ananyev
2023-03-03 0:01 ` Konstantin Ananyev
2023-03-03 1:17 ` Ajit Khaparde
2023-03-03 2:02 ` fengchengwen
2023-03-03 1:38 ` fengchengwen
2023-03-05 15:57 ` Konstantin Ananyev
2023-03-06 2:47 ` Ajit Khaparde
2023-03-01 3:06 ` [PATCH 5/5] app/testpmd: add error recovery usage demo Chengwen Feng
2023-03-02 13:01 ` Konstantin Ananyev
2023-03-03 1:49 ` fengchengwen
2023-03-03 16:59 ` Ferruh Yigit
2023-09-21 11:12 ` [PATCH 0/5] fix race-condition of proactive error handling mode Ferruh Yigit
2023-10-07 2:32 ` fengchengwen
2023-10-20 10:07 ` [PATCH v2 0/7] " Chengwen Feng
2023-10-20 10:07 ` [PATCH v2 1/7] ethdev: " Chengwen Feng
2023-11-01 3:39 ` lihuisong (C)
2023-10-20 10:07 ` [PATCH v2 2/7] net/hns3: replace fp ops config function Chengwen Feng
2023-11-01 3:40 ` lihuisong (C)
2023-11-02 10:34 ` Konstantin Ananyev
2023-10-20 10:07 ` [PATCH v2 3/7] net/bnxt: fix race-condition when report error recovery Chengwen Feng
2023-11-02 16:28 ` Ajit Khaparde
2023-10-20 10:07 ` [PATCH v2 4/7] net/bnxt: use fp ops setup function Chengwen Feng
2023-11-01 3:48 ` lihuisong (C)
2023-11-02 10:34 ` Konstantin Ananyev
2023-11-02 16:29 ` Ajit Khaparde
2023-10-20 10:07 ` [PATCH v2 5/7] app/testpmd: add error recovery usage demo Chengwen Feng
2023-11-01 4:08 ` lihuisong (C)
2023-11-06 13:01 ` fengchengwen
2023-10-20 10:07 ` [PATCH v2 6/7] app/testpmd: extract event handling to event.c Chengwen Feng
2023-11-01 4:09 ` lihuisong (C)
2023-10-20 10:07 ` [PATCH v2 7/7] doc: testpmd support event handling section Chengwen Feng
2023-11-06 9:28 ` lihuisong (C)
2023-11-06 12:39 ` fengchengwen
2023-11-08 3:02 ` lihuisong (C)
2023-11-06 1:35 ` [PATCH v2 0/7] fix race-condition of proactive error handling mode fengchengwen
2023-11-06 13:11 ` [PATCH v3 " Chengwen Feng
2023-11-06 13:11 ` [PATCH v3 1/7] ethdev: " Chengwen Feng
2023-11-06 13:11 ` [PATCH v3 2/7] net/hns3: replace fp ops config function Chengwen Feng
2023-11-06 13:11 ` [PATCH v3 3/7] net/bnxt: fix race-condition when report error recovery Chengwen Feng
2023-11-06 13:11 ` [PATCH v3 4/7] net/bnxt: use fp ops setup function Chengwen Feng
2023-11-06 13:11 ` [PATCH v3 5/7] app/testpmd: add error recovery usage demo Chengwen Feng
2023-11-06 13:11 ` [PATCH v3 6/7] app/testpmd: extract event handling to event.c Chengwen Feng
2023-11-06 13:11 ` [PATCH v3 7/7] doc: testpmd support event handling section Chengwen Feng
2023-11-08 3:03 ` lihuisong (C)
2023-12-05 2:30 ` [PATCH v3 0/7] fix race-condition of proactive error handling mode fengchengwen
2024-01-15 1:44 ` fengchengwen
2024-01-29 1:16 ` fengchengwen
2024-02-18 3:41 ` fengchengwen
2024-05-08 9:22 ` fengchengwen
2024-09-05 9:24 ` [PATCH v4 " Chengwen Feng
2024-09-05 9:24 ` [PATCH v4 1/7] ethdev: " Chengwen Feng
2024-10-10 0:46 ` Stephen Hemminger
2024-09-05 9:24 ` [PATCH v4 2/7] net/hns3: replace fp ops config function Chengwen Feng
2024-09-05 9:25 ` [PATCH v4 3/7] net/bnxt: fix race-condition when report error recovery Chengwen Feng
2024-09-05 9:25 ` [PATCH v4 4/7] net/bnxt: use fp ops setup function Chengwen Feng
2024-09-05 9:25 ` [PATCH v4 5/7] app/testpmd: add error recovery usage demo Chengwen Feng
2024-09-05 9:25 ` [PATCH v4 6/7] app/testpmd: extract event handling to event.c Chengwen Feng
2024-09-05 9:25 ` [PATCH v4 7/7] doc: testpmd support event handling section Chengwen Feng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=18c5b676-ae72-e646-89af-d6cd636d923f@yandex.ru \
--to=konstantin.v.ananyev@yandex.ru \
--cc=dev@dpdk.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).