* [dpdk-dev] [PATCH] net/mlx5: fix event handler uninstall
@ 2019-05-25 9:26 Viacheslav Ovsiienko
2019-05-26 19:16 ` Shahaf Shuler
2019-05-27 4:58 ` [dpdk-dev] [PATCH v2] " Viacheslav Ovsiienko
0 siblings, 2 replies; 4+ messages in thread
From: Viacheslav Ovsiienko @ 2019-05-25 9:26 UTC (permalink / raw)
To: dev; +Cc: shahafs, yskoh
When a device is being closed and tries to unregister its interrupt
callback, there is a chance the handler is still active (called in the
context of the eal_intr_thread_main thread). If so,
rte_intr_callback_unregister() returns -EAGAIN and keeps the handler
registered, causing a crash once the underlying resource is gone.
This race condition may happen if event handling in the application takes
a long time. We should check the return code of the unregistering routine
and try again to unregister the handler. While retrying, a diagnostic
message is shown once a second.
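The retry described above can be sketched in isolation. This is a minimal, self-contained illustration rather than the driver code: fake_unregister() and struct fake_handler are hypothetical stand-ins for rte_intr_callback_unregister() and a handler that is still running, and the one-millisecond pause between attempts is an assumption of the sketch.

```c
#define _POSIX_C_SOURCE 200809L /* for nanosleep() under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a registered handler that is still running. */
struct fake_handler {
	int busy_calls; /* how many more calls will see the handler active */
};

/*
 * Hypothetical stand-in for rte_intr_callback_unregister(): returns
 * -EAGAIN while the handler is active, otherwise the number of
 * callbacks removed.
 */
static int
fake_unregister(struct fake_handler *h)
{
	if (h->busy_calls > 0) {
		h->busy_calls--;
		return -EAGAIN;
	}
	return 1;
}

/*
 * Retry until the callback is removed or a non-retryable error occurs.
 * Returns the number of attempts made, or a negative errno value.
 */
static int
unregister_with_retry(struct fake_handler *h)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
	int attempts = 0;

	assert(h != NULL);
	for (;;) {
		int ret = fake_unregister(h);

		attempts++;
		if (ret >= 0)
			return attempts;
		if (ret != -EAGAIN)
			return ret;
		/* Handler still active: back off briefly and try again. */
		nanosleep(&one_ms, NULL);
	}
}
```

Only -EAGAIN is treated as retryable here; any other negative code is handed back to the caller instead of looping.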
Fixes: 028b2a28c3cb ("net/mlx5: update event handler for multiport IB devices")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
---
drivers/net/mlx5/mlx5.c | 2 +-
drivers/net/mlx5/mlx5.h | 2 ++
drivers/net/mlx5/mlx5_ethdev.c | 79 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 78 insertions(+), 5 deletions(-)
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 9f5ec97..2344cb4 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -313,7 +313,7 @@ struct mlx5_dev_spawn_data {
**/
assert(!sh->intr_cnt);
if (sh->intr_cnt)
- rte_intr_callback_unregister
+ mlx5_intr_callback_unregister
(&sh->intr_handle, mlx5_dev_interrupt_handler, sh);
pthread_mutex_destroy(&sh->intr_mutex);
if (sh->pd)
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3eaaafd..5b5b93d 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -437,6 +437,8 @@ void mlx5_nl_check_switch_info(bool nun_vf_set,
struct mlx5_switch_info *switch_info);
void mlx5_translate_port_name(const char *port_name_in,
struct mlx5_switch_info *port_info_out);
+void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg);
/* mlx5_mac.c */
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index a8a7ece..f47297c 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -1226,9 +1226,80 @@ int mlx5_fw_version_get(struct rte_eth_dev *dev, char *fw_ver, size_t fw_size)
}
}
+/*
+ * Unregister callback handler safely. The handler may be active
+ * while we are trying to unregister it, in this case code -EAGAIN
+ * is returned by rte_intr_callback_unregister(). This routine checks
+ * the return code and tries to unregister handler again.
+ *
+ * @param handle
+ * interrupt handle
+ * @param cb_fn
+ * pointer to callback routine
+ * @param cb_arg
+ * opaque callback parameter
+ */
+void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+ /*
+ * Try to reduce timeout management overhead by not calling
+ * the timer related routines on the first iteration. If the
+ * unregistering succeeds on first call there will be no
+ * timer calls at all.
+ */
+ uint64_t twait = 0;
+ uint64_t start = 0;
+
+ do {
+ int ret;
+
+ ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+ if (ret >= 0)
+ return;
+ if (ret != -EAGAIN) {
+ DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+ assert(false);
+ return;
+ }
+ if (twait) {
+ struct timespec onems;
+
+ /* Wait one millisecond and try again. */
+ onems.tv_sec = 0;
+ onems.tv_nsec = NS_PER_S / MS_PER_S;
+ nanosleep(&onems, 0);
+ /* Check whether one second elapsed. */
+ if ((rte_get_timer_cycles() - start) <= twait)
+ continue;
+ } else {
+ /*
+ * We get the amount of timer ticks for one second.
+ * If this amount elapsed it means we spent one
+ * second in waiting. This branch is executed once
+ * on first iteration.
+ */
+ twait = rte_get_timer_hz();
+ assert(twait);
+ }
+ /*
+ * Timeout elapsed, show message (once a second) and retry.
+ * We have no other acceptable option here, if we ignore
+ * the unregistering return code the handler will not
+ * be unregistered, fd will be closed and we may get the
+ * crash. Hanging and messaging in the loop seems not to be
+ * the worst choice.
+ */
+ DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+ start = rte_get_timer_cycles();
+ } while (true);
+}
+
/**
* Uninstall shared asynchronous device events handler.
- * This function is implemeted to support event sharing
+ * This function is implemented to support event sharing
* between multiple ports of single IB device.
*
* @param dev
@@ -1254,7 +1325,7 @@ int mlx5_fw_version_get(struct rte_eth_dev *dev, char *fw_ver, size_t fw_size)
sh->port[priv->ibv_port - 1].ih_port_id = RTE_MAX_ETHPORTS;
if (!sh->intr_cnt || --sh->intr_cnt)
goto exit;
- rte_intr_callback_unregister(&sh->intr_handle,
+ mlx5_intr_callback_unregister(&sh->intr_handle,
mlx5_dev_interrupt_handler, sh);
sh->intr_handle.fd = 0;
sh->intr_handle.type = RTE_INTR_HANDLE_UNKNOWN;
@@ -1263,8 +1334,8 @@ int mlx5_fw_version_get(struct rte_eth_dev *dev, char *fw_ver, size_t fw_size)
}
/**
- * Install shared asyncronous device events handler.
- * This function is implemeted to support event sharing
+ * Install shared asynchronous device events handler.
+ * This function is implemented to support event sharing
* between multiple ports of single IB device.
*
* @param dev
--
1.8.3.1
* Re: [dpdk-dev] [PATCH] net/mlx5: fix event handler uninstall
2019-05-25 9:26 [dpdk-dev] [PATCH] net/mlx5: fix event handler uninstall Viacheslav Ovsiienko
@ 2019-05-26 19:16 ` Shahaf Shuler
2019-05-27 4:58 ` [dpdk-dev] [PATCH v2] " Viacheslav Ovsiienko
1 sibling, 0 replies; 4+ messages in thread
From: Shahaf Shuler @ 2019-05-26 19:16 UTC (permalink / raw)
To: Slava Ovsiienko, dev; +Cc: Yongseok Koh
Hi Slava,
Saturday, May 25, 2019 12:26 PM, Of Viacheslav Ovsiienko:
> Subject: [dpdk-dev] [PATCH] net/mlx5: fix event handler uninstall
>
> When a device is being closed and tries to unregister its interrupt
> callback, there is a chance the handler is still active (called in the
> context of the eal_intr_thread_main thread). If so,
> rte_intr_callback_unregister() returns -EAGAIN and keeps the handler
> registered, causing a crash once the underlying resource is gone.
>
> This race condition may happen if event handling in the application takes
> a long time. We should check the return code of the unregistering routine
> and try again to unregister the handler. While retrying, a diagnostic
> message is shown once a second.
>
> Fixes: 028b2a28c3cb ("net/mlx5: update event handler for multiport IB
> devices")
>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> Acked-by: Yongseok Koh <yskoh@mellanox.com>
> ---
[...]
> + */
> +void
> +mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
> + rte_intr_callback_fn cb_fn, void *cb_arg)
> +{
> + /*
> + * Try to reduce timeout management overhead by not calling
> + * the timer related routines on the first iteration. If the
> + * unregistering succeeds on first call there will be no
> + * timer calls at all.
> + */
> + uint64_t twait = 0;
> + uint64_t start = 0;
> +
> + do {
> + int ret;
> +
> + ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
> + if (ret >= 0)
> + return;
> + if (ret != -EAGAIN) {
> + DRV_LOG(INFO, "failed to unregister interrupt"
> + " handler (error: %d)", ret);
> + assert(false);
> + return;
> + }
> + if (twait) {
> + struct timespec onems;
> +
> + /* Wait one millisecond and try again. */
> + onems.tv_sec = 0;
> + onems.tv_nsec = NS_PER_S / MS_PER_S;
I get the errors below when trying to compile for BlueField:
/.autodirect/swgwork/shahafs/workspace/dpdk.org/drivers/net/mlx5/mlx5_ethdev.c:1272:20: error: 'NS_PER_S' undeclared (first use in this function); did you mean 'NB_SEGS'?
onems.tv_nsec = NS_PER_S / MS_PER_S;
^~~~~~~~
NB_SEGS
/.autodirect/swgwork/shahafs/workspace/dpdk.org/drivers/net/mlx5/mlx5_ethdev.c:1272:20: note: each undeclared identifier is reported only once for each function it appears in
/.autodirect/swgwork/shahafs/workspace/dpdk.org/drivers/net/mlx5/mlx5_ethdev.c:1272:31: error: 'MS_PER_S' undeclared (first use in this function); did you mean 'NS_PER_S'?
onems.tv_nsec = NS_PER_S / MS_PER_S;
^~~~~~~~
NS_PER_S
/.autodirect/swgwork/shahafs/workspace/dpdk.org/drivers/net/mlx5/mlx5_ethdev.c:1275:9: error: implicit declaration of function 'rte_get_timer_cycles'; did you mean 'rte_get_ptype_name'? [-Werror=implicit-function-declaration]
if ((rte_get_timer_cycles() - start) <= twait)
^~~~~~~~~~~~~~~~~~~~
rte_get_ptype_name
/.autodirect/swgwork/shahafs/workspace/dpdk.org/drivers/net/mlx5/mlx5_ethdev.c:1275:9: error: nested extern declaration of 'rte_get_timer_cycles' [-Werror=nested-externs]
/.autodirect/swgwork/shahafs/workspace/dpdk.org/drivers/net/mlx5/mlx5_ethdev.c:1284:12: error: implicit declaration of function 'rte_get_timer_hz'; did you mean 'rte_gettid'? [-Werror=implicit-function-declaration]
twait = rte_get_timer_hz();
^~~~~~~~~~~~~~~~
rte_gettid
/.autodirect/swgwork/shahafs/workspace/dpdk.org/drivers/net/mlx5/mlx5_ethdev.c:1284:12: error: nested extern declaration of 'rte_get_timer_hz' [-Werror=nested-externs]
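All of the errors above are undeclared identifiers: NS_PER_S, MS_PER_S, rte_get_timer_cycles() and rte_get_timer_hz() are declared in DPDK's rte_cycles.h, which mlx5_ethdev.c does not include directly in this version of the patch. As a rough illustration of what the failing statement computes, here is a minimal standalone sketch; the SKETCH_* macros are local stand-ins defined only for this example (they are not DPDK names), and the helper function names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L /* for nanosleep() under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Local stand-ins for the unit-conversion constants the patch uses.
 * In DPDK the real MS_PER_S and NS_PER_S come from <rte_cycles.h>.
 */
#define SKETCH_MS_PER_S 1000L
#define SKETCH_NS_PER_S 1000000000L

/* Build the one-millisecond interval used between retries. */
static struct timespec
one_millisecond(void)
{
	struct timespec ts;

	ts.tv_sec = 0;
	ts.tv_nsec = SKETCH_NS_PER_S / SKETCH_MS_PER_S;
	return ts;
}

/* Sleep for one millisecond; returns the result of nanosleep(). */
static int
sleep_one_millisecond(void)
{
	struct timespec ts = one_millisecond();

	/* tv_nsec must stay below one second to be a valid timespec. */
	assert(ts.tv_nsec < SKETCH_NS_PER_S);
	return nanosleep(&ts, NULL);
}
```

In the driver itself no local definitions should be needed once the header that declares these names is included.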
* [dpdk-dev] [PATCH v2] net/mlx5: fix event handler uninstall
2019-05-25 9:26 [dpdk-dev] [PATCH] net/mlx5: fix event handler uninstall Viacheslav Ovsiienko
2019-05-26 19:16 ` Shahaf Shuler
@ 2019-05-27 4:58 ` Viacheslav Ovsiienko
2019-06-02 7:53 ` Shahaf Shuler
1 sibling, 1 reply; 4+ messages in thread
From: Viacheslav Ovsiienko @ 2019-05-27 4:58 UTC (permalink / raw)
To: dev; +Cc: shahafs, yskoh
When a device is being closed and tries to unregister its interrupt
callback, there is a chance the handler is still active (called in the
context of the eal_intr_thread_main thread). If so,
rte_intr_callback_unregister() returns -EAGAIN and keeps the handler
registered, causing a crash once the underlying resource is gone.
This race condition may happen if event handling in the application takes
a long time. We should check the return code of the unregistering routine
and try again to unregister the handler. While retrying, a diagnostic
message is shown once a second.
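The once-a-second diagnostic can be illustrated separately from the retry itself. This is a minimal sketch, assuming a POSIX monotonic clock in place of the DPDK timer routines; struct log_throttle and the helper names are hypothetical and do not appear in the driver. As in the patch, the first failed attempt reports immediately and later ones report only after the period has elapsed.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() under strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical rate limiter for a diagnostic message. */
struct log_throttle {
	uint64_t period_ns; /* minimum gap between two messages */
	uint64_t last_ns;   /* time of the last message */
	bool armed;         /* false until the first message is emitted */
};

/* Current monotonic time in nanoseconds. */
static uint64_t
monotonic_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Decide whether a message may be emitted at time now_ns.
 * Kept free of clock calls so the policy can be checked without sleeping.
 */
static bool
throttle_should_log(struct log_throttle *t, uint64_t now_ns)
{
	assert(t != NULL);
	if (!t->armed) {
		/* First failure: report at once and start the period. */
		t->armed = true;
		t->last_ns = now_ns;
		return true;
	}
	if (now_ns - t->last_ns <= t->period_ns)
		return false;
	t->last_ns = now_ns;
	return true;
}

/* Convenience wrapper that samples the monotonic clock itself. */
static bool
throttle_should_log_now(struct log_throttle *t)
{
	return throttle_should_log(t, monotonic_ns());
}
```

Passing the timestamp in explicitly is a design choice of the sketch: it lets the throttling policy be tested with synthetic times.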
Fixes: 028b2a28c3cb ("net/mlx5: update event handler for multiport IB devices")
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
Acked-by: Yongseok Koh <yskoh@mellanox.com>
---
v1: http://patches.dpdk.org/patch/53697/
v2: resolve compilation issue for ARM config
drivers/net/mlx5/mlx5.c | 2 +-
drivers/net/mlx5/mlx5.h | 2 ++
drivers/net/mlx5/mlx5_ethdev.c | 80 +++++++++++++++++++++++++++++++++++++++---
3 files changed, 79 insertions(+), 5 deletions(-)
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 9f5ec97..2344cb4 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -313,7 +313,7 @@ struct mlx5_dev_spawn_data {
**/
assert(!sh->intr_cnt);
if (sh->intr_cnt)
- rte_intr_callback_unregister
+ mlx5_intr_callback_unregister
(&sh->intr_handle, mlx5_dev_interrupt_handler, sh);
pthread_mutex_destroy(&sh->intr_mutex);
if (sh->pd)
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 3eaaafd..5b5b93d 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -437,6 +437,8 @@ void mlx5_nl_check_switch_info(bool nun_vf_set,
struct mlx5_switch_info *switch_info);
void mlx5_translate_port_name(const char *port_name_in,
struct mlx5_switch_info *port_info_out);
+void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg);
/* mlx5_mac.c */
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index a8a7ece..fd43c1d 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -33,6 +33,7 @@
#include <rte_malloc.h>
#include <rte_string_fns.h>
#include <rte_rwlock.h>
+#include <rte_cycles.h>
#include "mlx5.h"
#include "mlx5_glue.h"
@@ -1226,9 +1227,80 @@ int mlx5_fw_version_get(struct rte_eth_dev *dev, char *fw_ver, size_t fw_size)
}
}
+/*
+ * Unregister callback handler safely. The handler may be active
+ * while we are trying to unregister it, in this case code -EAGAIN
+ * is returned by rte_intr_callback_unregister(). This routine checks
+ * the return code and tries to unregister handler again.
+ *
+ * @param handle
+ * interrupt handle
+ * @param cb_fn
+ * pointer to callback routine
+ * @param cb_arg
+ * opaque callback parameter
+ */
+void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+ /*
+ * Try to reduce timeout management overhead by not calling
+ * the timer related routines on the first iteration. If the
+ * unregistering succeeds on first call there will be no
+ * timer calls at all.
+ */
+ uint64_t twait = 0;
+ uint64_t start = 0;
+
+ do {
+ int ret;
+
+ ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+ if (ret >= 0)
+ return;
+ if (ret != -EAGAIN) {
+ DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+ assert(false);
+ return;
+ }
+ if (twait) {
+ struct timespec onems;
+
+ /* Wait one millisecond and try again. */
+ onems.tv_sec = 0;
+ onems.tv_nsec = NS_PER_S / MS_PER_S;
+ nanosleep(&onems, 0);
+ /* Check whether one second elapsed. */
+ if ((rte_get_timer_cycles() - start) <= twait)
+ continue;
+ } else {
+ /*
+ * We get the amount of timer ticks for one second.
+ * If this amount elapsed it means we spent one
+ * second in waiting. This branch is executed once
+ * on first iteration.
+ */
+ twait = rte_get_timer_hz();
+ assert(twait);
+ }
+ /*
+ * Timeout elapsed, show message (once a second) and retry.
+ * We have no other acceptable option here, if we ignore
+ * the unregistering return code the handler will not
+ * be unregistered, fd will be closed and we may get the
+ * crash. Hanging and messaging in the loop seems not to be
+ * the worst choice.
+ */
+ DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+ start = rte_get_timer_cycles();
+ } while (true);
+}
+
/**
* Uninstall shared asynchronous device events handler.
- * This function is implemeted to support event sharing
+ * This function is implemented to support event sharing
* between multiple ports of single IB device.
*
* @param dev
@@ -1254,7 +1326,7 @@ int mlx5_fw_version_get(struct rte_eth_dev *dev, char *fw_ver, size_t fw_size)
sh->port[priv->ibv_port - 1].ih_port_id = RTE_MAX_ETHPORTS;
if (!sh->intr_cnt || --sh->intr_cnt)
goto exit;
- rte_intr_callback_unregister(&sh->intr_handle,
+ mlx5_intr_callback_unregister(&sh->intr_handle,
mlx5_dev_interrupt_handler, sh);
sh->intr_handle.fd = 0;
sh->intr_handle.type = RTE_INTR_HANDLE_UNKNOWN;
@@ -1263,8 +1335,8 @@ int mlx5_fw_version_get(struct rte_eth_dev *dev, char *fw_ver, size_t fw_size)
}
/**
- * Install shared asyncronous device events handler.
- * This function is implemeted to support event sharing
+ * Install shared asynchronous device events handler.
+ * This function is implemented to support event sharing
* between multiple ports of single IB device.
*
* @param dev
--
1.8.3.1
* Re: [dpdk-dev] [PATCH v2] net/mlx5: fix event handler uninstall
2019-05-27 4:58 ` [dpdk-dev] [PATCH v2] " Viacheslav Ovsiienko
@ 2019-06-02 7:53 ` Shahaf Shuler
0 siblings, 0 replies; 4+ messages in thread
From: Shahaf Shuler @ 2019-06-02 7:53 UTC (permalink / raw)
To: Slava Ovsiienko, dev; +Cc: Yongseok Koh
Monday, May 27, 2019 7:59 AM, Viacheslav Ovsiienko:
> Subject: [dpdk-dev] [PATCH v2] net/mlx5: fix event handler uninstall
>
> When a device is being closed and tries to unregister its interrupt
> callback, there is a chance the handler is still active (called in the
> context of the eal_intr_thread_main thread). If so,
> rte_intr_callback_unregister() returns -EAGAIN and keeps the handler
> registered, causing a crash once the underlying resource is gone.
>
> This race condition may happen if event handling in the application takes
> a long time. We should check the return code of the unregistering routine
> and try again to unregister the handler. While retrying, a diagnostic
> message is shown once a second.
>
> Fixes: 028b2a28c3cb ("net/mlx5: update event handler for multiport IB
> devices")
>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> Acked-by: Yongseok Koh <yskoh@mellanox.com>
Applied to next-net-mlx, thanks.