DPDK patches and discussions
* [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
@ 2022-04-01  3:22 Spike Du
  2022-04-01  3:22 ` [RFC 1/6] net/mlx5: add LWM support for Rxq Spike Du
                   ` (6 more replies)
  0 siblings, 7 replies; 131+ messages in thread
From: Spike Du @ 2022-04-01  3:22 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

LWM (limit watermark) is a per-Rx-queue attribute: when the Rx queue
fullness reaches the LWM limit, HW sends an event to the DPDK application.
The host shaper can configure the shaper rate and the lwm-triggered flag
for a host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM event.

These two features can be combined to control traffic from the host port
to the wire port.
The workflow is: configure LWM on the Rx queue and enable the lwm-triggered
flag in the host shaper; after receiving an LWM event, wait until the Rx
queue is empty, then disable the shaper. Repeating this workflow reduces
Rx queue drops.
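
As a rough illustration of the workflow above, the following self-contained
C model tracks one Rx queue. It is only a sketch: struct lwm_model and both
functions are hypothetical names, not part of this series, and it assumes
the event is one-shot and re-armed once the queue has drained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Rx queue guarded by an LWM and a host shaper. */
struct lwm_model {
	uint32_t queue_size;  /* total descriptors in the Rx queue */
	uint32_t used;        /* descriptors currently holding packets */
	uint8_t lwm_percent;  /* fullness threshold in percent, 0 disables it */
	bool lwm_armed;       /* assumption: the event fires once per arming */
	bool shaper_on;       /* models the automatically enabled shaper */
};

/*
 * Packets arrive. Returns true when this call raises the LWM event,
 * which also switches the (lwm-triggered) shaper on.
 */
static bool
lwm_model_enqueue(struct lwm_model *m, uint32_t pkts)
{
	uint32_t room;

	assert(m->used <= m->queue_size);
	room = m->queue_size - m->used;
	m->used += pkts < room ? pkts : room;
	if (m->lwm_armed && m->lwm_percent != 0 &&
	    (uint64_t)m->used * 100 >=
	    (uint64_t)m->queue_size * m->lwm_percent) {
		m->lwm_armed = false;
		m->shaper_on = true;
		return true;
	}
	return false;
}

/*
 * The application drains packets. Once the queue is empty, the shaper
 * is disabled and the LWM is armed again, closing the cycle.
 */
static void
lwm_model_drain(struct lwm_model *m, uint32_t pkts)
{
	m->used -= pkts < m->used ? pkts : m->used;
	if (m->used == 0 && m->shaper_on) {
		m->shaper_on = false;
		m->lwm_armed = true;
	}
}
```

In this model the shaper stays on while the queue is only partly drained;
it is released only when the queue is completely empty, which mirrors the
"wait until the Rx queue is empty" step.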

Spike Du (6):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  net/mlx5: add LWM event handling support
  net/mlx5: add private API to configure Rxq LWM
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c                       | 149 ++++++++++++++++++
 app/test-pmd/config.c                        | 122 +++++++++++++++
 app/test-pmd/meson.build                     |   3 +
 app/test-pmd/testpmd.c                       |   3 +
 app/test-pmd/testpmd.h                       |   5 +
 doc/guides/nics/mlx5.rst                     |  87 +++++++++++
 doc/guides/rel_notes/release_22_03.rst       |   7 +
 drivers/common/mlx5/linux/meson.build        |  21 ++-
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/mlx5_prm.h               |  26 ++++
 drivers/common/mlx5/version.map              |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 ---------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++------------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +------
 drivers/net/mlx5/mlx5.c                      |  61 ++++++++
 drivers/net/mlx5/mlx5.h                      |  12 +-
 drivers/net/mlx5/mlx5_devx.c                 |  57 ++++++-
 drivers/net/mlx5/mlx5_devx.h                 |   1 +
 drivers/net/mlx5/mlx5_rx.c                   | 221 ++++++++++++++++++++++++++-
 drivers/net/mlx5/mlx5_rx.h                   |   9 ++
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +---
 drivers/net/mlx5/rte_pmd_mlx5.h              |  62 ++++++++
 drivers/net/mlx5/version.map                 |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 ---
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  52 +------
 27 files changed, 1057 insertions(+), 318 deletions(-)

-- 
1.8.3.1



* [RFC 1/6] net/mlx5: add LWM support for Rxq
  2022-04-01  3:22 [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Spike Du
@ 2022-04-01  3:22 ` Spike Du
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
  2022-04-01  3:22 ` [RFC 2/6] common/mlx5: share interrupt management Spike Du
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-04-01  3:22 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add an lwm (limit watermark) field to the Rxq object. It holds the
percentage of the Rx queue size that HW uses as the threshold for raising
an LWM event to the user.
Allow setting LWM in the modify_rq command.
Allow LWM to be configured dynamically by adding the RDY2RDY state change.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.h      |  1 +
 drivers/net/mlx5/mlx5_devx.c | 10 +++++++++-
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 12 insertions(+), 1 deletion(-)
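
To make the new transition concrete, here is a minimal standalone model of
the switch in mlx5_devx_modify_rq(). The enum and struct names are
illustrative stand-ins rather than the driver's definitions, and returning
-1 for an unknown type is a choice of this sketch (the patch simply falls
through its default case).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins; the real values live in the mlx5 headers. */
enum rq_state { RQ_RST, RQ_RDY, RQ_ERR };
enum rq_mod { MOD_RST2RDY, MOD_RDY2ERR, MOD_RDY2RST, MOD_RDY2RDY };

struct rq_modify {
	enum rq_state from; /* state the queue must currently be in */
	enum rq_state to;   /* state requested by the command */
	bool set_lwm;       /* whether the LWM field is part of the update */
	uint16_t lwm;       /* watermark value, meaningful if set_lwm */
};

/* Fill the modify request for a given transition type. */
static int
rq_modify_fill(int type, uint16_t lwm, struct rq_modify *out)
{
	assert(out != NULL);
	memset(out, 0, sizeof(*out));
	switch (type) {
	case MOD_RST2RDY:
		out->from = RQ_RST;
		out->to = RQ_RDY;
		out->set_lwm = true;
		out->lwm = lwm;
		return 0;
	case MOD_RDY2ERR:
		out->from = RQ_RDY;
		out->to = RQ_ERR;
		return 0;
	case MOD_RDY2RST:
		out->from = RQ_RDY;
		out->to = RQ_RST;
		return 0;
	case MOD_RDY2RDY:
		/* Same state on both sides: only the LWM changes. */
		out->from = RQ_RDY;
		out->to = RQ_RDY;
		out->set_lwm = true;
		out->lwm = lwm;
		return 0;
	default:
		return -1;
	}
}
```

The RDY2RDY case is what lets the watermark be changed on a running queue
without passing through reset.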

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 23a28f6..f3e6682 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1391,6 +1391,7 @@ enum mlx5_rxq_modify_type {
 	MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
 	MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
 	MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+	MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index af106bd..d6de882 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,8 @@
 	case MLX5_RXQ_MOD_RST2RDY:
 		rq_attr.rq_state = MLX5_RQC_STATE_RST;
 		rq_attr.state = MLX5_RQC_STATE_RDY;
+		rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+		rq_attr.lwm = rxq->lwm;
 		break;
 	case MLX5_RXQ_MOD_RDY2ERR:
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +87,12 @@
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
 		rq_attr.state = MLX5_RQC_STATE_RST;
 		break;
+	case MLX5_RXQ_MOD_RDY2RDY:
+		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+		rq_attr.state = MLX5_RQC_STATE_RDY;
+		rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+		rq_attr.lwm = rxq->lwm;
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 			 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index acebe33..98d7cae 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -174,6 +174,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
+	uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1



* [RFC 2/6] common/mlx5: share interrupt management
  2022-04-01  3:22 [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Spike Du
  2022-04-01  3:22 ` [RFC 1/6] net/mlx5: add LWM support for Rxq Spike Du
@ 2022-04-01  3:22 ` Spike Du
  2022-04-01  3:22 ` [RFC 3/6] net/mlx5: add LWM event handling support Spike Du
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-04-01  3:22 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

The code that creates and initializes rte_intr_handle is duplicated in
many places. Add a new mlx5_os API to do this and replace all the related
PMD code with this API.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map              |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 --------------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++++---------------------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 ++---------
 drivers/net/mlx5/mlx5.h                      |   2 -
 drivers/net/mlx5/mlx5_txpp.c                 |  28 ++----
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 -----
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  52 ++---------
 11 files changed, 217 insertions(+), 312 deletions(-)
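
The non-blocking step of the new helper can be sketched in isolation as
below. fd_set_nonblock() is a hypothetical name used only here; it relies
on POSIX fcntl() alone and, as a design choice of this sketch, it also
checks the F_GETFL result before using it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: add O_NONBLOCK to the file status flags of fd.
 * Returns 0 on success or a negative errno value on failure.
 */
static int
fd_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -errno;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -errno;
	return 0;
}
```

With the flag set, a read() on an empty descriptor returns -1 with errno
set to EAGAIN instead of blocking, which is what an interrupt callback
draining an event fd generally wants.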

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index 030ceb5..6e7c59b 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include <dirent.h>
 #include <net/if.h>
+#include <fcntl.h>
 
 #include <rte_errno.h>
 #include <rte_string_fns.h>
@@ -952,3 +953,133 @@
 		claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
 	memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	struct rte_intr_handle *tmp_intr_handle;
+	int ret, flags;
+
+	tmp_intr_handle = rte_intr_instance_alloc(mode);
+	if (!tmp_intr_handle) {
+		rte_errno = ENOMEM;
+		goto err;
+	}
+	if (set_fd_nonblock) {
+		flags = fcntl(fd, F_GETFL);
+		ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+		if (ret) {
+			rte_errno = errno;
+			goto err;
+		}
+	}
+	ret = rte_intr_fd_set(tmp_intr_handle, fd);
+	if (ret)
+		goto err;
+	ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+	if (ret)
+		goto err;
+	ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+	if (ret) {
+		rte_errno = -ret;
+		goto err;
+	}
+	return tmp_intr_handle;
+err:
+	if (tmp_intr_handle)
+		rte_intr_instance_free(tmp_intr_handle);
+	return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+			      rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+	uint64_t twait = 0;
+	uint64_t start = 0;
+
+	do {
+		int ret;
+
+		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+		if (ret >= 0)
+			return;
+		if (ret != -EAGAIN) {
+			DRV_LOG(INFO, "failed to unregister interrupt"
+				      " handler (error: %d)", ret);
+			MLX5_ASSERT(false);
+			return;
+		}
+		if (twait) {
+			struct timespec onems;
+
+			/* Wait one millisecond and try again. */
+			onems.tv_sec = 0;
+			onems.tv_nsec = NS_PER_S / MS_PER_S;
+			nanosleep(&onems, 0);
+			/* Check whether one second elapsed. */
+			if ((rte_get_timer_cycles() - start) <= twait)
+				continue;
+		} else {
+			/*
+			 * We get the amount of timer ticks for one second.
+			 * If this amount elapsed it means we spent one
+			 * second in waiting. This branch is executed once
+			 * on first iteration.
+			 */
+			twait = rte_get_timer_hz();
+			MLX5_ASSERT(twait);
+		}
+		/*
+		 * Timeout elapsed, show message (once a second) and retry.
+		 * We have no other acceptable option here, if we ignore
+		 * the unregistering return code the handler will not
+		 * be unregistered, fd will be closed and we may get the
+		 * crush. Hanging and messaging in the loop seems not to be
+		 * the worst choice.
+		 */
+		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+		start = rte_get_timer_cycles();
+	} while (true);
+}
+
+/**
+ * Rte_intr_handle destroy helper.
+ *
+ * @param[in] intr_handle
+ *   Rte_intr_handle to destroy.
+ * @param[in] cb
+ *   Callback which is registered to intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ */
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	if (rte_intr_fd_get(intr_handle) >= 0)
+		mlx5_intr_callback_unregister(intr_handle, cb, cb_arg);
+	rte_intr_instance_free(intr_handle);
+}
diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index 27f1192..479bb3c 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -15,6 +15,7 @@
 #include <rte_log.h>
 #include <rte_kvargs.h>
 #include <rte_devargs.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -299,4 +300,14 @@
 int
 mlx5_get_device_guid(const struct rte_pci_addr *dev, uint8_t *guid, size_t len);
 
+__rte_internal
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg);
+
+__rte_internal
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg);
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a23a30a..2900544 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,6 +152,7 @@ INTERNAL {
 	mlx5_mp_req_mempool_reg;
 	mlx5_mr_mempool2mr_bh;
 	mlx5_mr_mempool_populate_cache;
-
+	mlx5_os_interrupt_handler_create; # WINDOWS_NO_EXPORT
+	mlx5_os_interrupt_handler_destroy; # WINDOWS_NO_EXPORT
 	local: *;
 };
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index ee7973f..e9e9108 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -9,6 +9,7 @@
 #include <sys/types.h>
 
 #include <rte_errno.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -253,4 +254,27 @@
 __rte_internal
 int mlx5_os_umem_dereg(void *pumem);
 
+static inline struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)mode;
+	(void)set_fd_nonblock;
+	(void)fd;
+	(void)cb;
+	(void)cb_arg;
+	rte_errno = ENOTSUP;
+	return NULL;
+}
+
+static inline void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)intr_handle;
+	(void)cb;
+	(void)cb_arg;
+}
+
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 8fe73f1..a276b2b 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -881,77 +881,6 @@ struct ethtool_link_settings {
 	}
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	/*
-	 * Try to reduce timeout management overhead by not calling
-	 * the timer related routines on the first iteration. If the
-	 * unregistering succeeds on first call there will be no
-	 * timer calls at all.
-	 */
-	uint64_t twait = 0;
-	uint64_t start = 0;
-
-	do {
-		int ret;
-
-		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
-		if (ret >= 0)
-			return;
-		if (ret != -EAGAIN) {
-			DRV_LOG(INFO, "failed to unregister interrupt"
-				      " handler (error: %d)", ret);
-			MLX5_ASSERT(false);
-			return;
-		}
-		if (twait) {
-			struct timespec onems;
-
-			/* Wait one millisecond and try again. */
-			onems.tv_sec = 0;
-			onems.tv_nsec = NS_PER_S / MS_PER_S;
-			nanosleep(&onems, 0);
-			/* Check whether one second elapsed. */
-			if ((rte_get_timer_cycles() - start) <= twait)
-				continue;
-		} else {
-			/*
-			 * We get the amount of timer ticks for one second.
-			 * If this amount elapsed it means we spent one
-			 * second in waiting. This branch is executed once
-			 * on first iteration.
-			 */
-			twait = rte_get_timer_hz();
-			MLX5_ASSERT(twait);
-		}
-		/*
-		 * Timeout elapsed, show message (once a second) and retry.
-		 * We have no other acceptable option here, if we ignore
-		 * the unregistering return code the handler will not
-		 * be unregistered, fd will be closed and we may get the
-		 * crush. Hanging and messaging in the loop seems not to be
-		 * the worst choice.
-		 */
-		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
-		start = rte_get_timer_cycles();
-	} while (true);
-}
-
 /**
  * Handle DEVX interrupts from the NIC.
  * This function is probably called from the DPDK host thread.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index ff65efb..6274a42 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2494,40 +2494,6 @@
 	mlx5_pmd_socket_uninit();
 }
 
-static int
-mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
-{
-	int nlsk_fd, flags, ret;
-
-	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
-	if (nlsk_fd < 0) {
-		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-	flags = fcntl(nlsk_fd, F_GETFL);
-	ret = fcntl(nlsk_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret != 0) {
-		DRV_LOG(ERR, "Failed to make Netlink event socket non-blocking: %s",
-			strerror(errno));
-		rte_errno = errno;
-		goto error;
-	}
-	rte_intr_type_set(sh->intr_handle_nl, RTE_INTR_HANDLE_EXT);
-	rte_intr_fd_set(sh->intr_handle_nl, nlsk_fd);
-	if (rte_intr_callback_register(sh->intr_handle_nl,
-				       mlx5_dev_interrupt_handler_nl,
-				       sh) != 0) {
-		DRV_LOG(ERR, "Failed to register Netlink events interrupt");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-		goto error;
-	}
-	return 0;
-error:
-	close(nlsk_fd);
-	return -1;
-}
-
 /**
  * Install shared asynchronous device events handler.
  * This function is implemented to support event sharing
@@ -2539,76 +2505,47 @@
 void
 mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
-	int ret;
-	int flags;
 	struct ibv_context *ctx = sh->cdev->ctx;
+	int nlsk_fd;
 
-	sh->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
+	sh->intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 ctx->async_fd, mlx5_dev_interrupt_handler, sh);
+	if (!sh->intr_handle) {
+		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle, -1);
-
-	flags = fcntl(ctx->async_fd, F_GETFL);
-	ret = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret) {
-		DRV_LOG(INFO, "failed to change file descriptor async event"
-			" queue");
-	} else {
-		rte_intr_fd_set(sh->intr_handle, ctx->async_fd);
-		rte_intr_type_set(sh->intr_handle, RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle,
-					mlx5_dev_interrupt_handler, sh)) {
-			DRV_LOG(INFO, "Fail to install the shared interrupt.");
-			rte_intr_fd_set(sh->intr_handle, -1);
-		}
+	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
+	if (nlsk_fd < 0) {
+		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
+			rte_strerror(rte_errno));
+		return;
 	}
-	sh->intr_handle_nl = rte_intr_instance_alloc
-						(RTE_INTR_INSTANCE_F_SHARED);
+	sh->intr_handle_nl = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 nlsk_fd, mlx5_dev_interrupt_handler_nl, sh);
 	if (sh->intr_handle_nl == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle_nl, -1);
-	if (mlx5_os_dev_shared_handler_install_lsc(sh) < 0) {
-		DRV_LOG(INFO, "Fail to install the shared Netlink event handler.");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-	}
 	if (sh->cdev->config.devx) {
 #ifdef HAVE_IBV_DEVX_ASYNC
-		sh->intr_handle_devx =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-		if (!sh->intr_handle_devx) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
-			rte_errno = ENOMEM;
-			return;
-		}
-		rte_intr_fd_set(sh->intr_handle_devx, -1);
+		struct mlx5dv_devx_cmd_comp *devx_comp;
+
 		sh->devx_comp = (void *)mlx5_glue->devx_create_cmd_comp(ctx);
-		struct mlx5dv_devx_cmd_comp *devx_comp = sh->devx_comp;
+		devx_comp = sh->devx_comp;
 		if (!devx_comp) {
 			DRV_LOG(INFO, "failed to allocate devx_comp.");
 			return;
 		}
-		flags = fcntl(devx_comp->fd, F_GETFL);
-		ret = fcntl(devx_comp->fd, F_SETFL, flags | O_NONBLOCK);
-		if (ret) {
-			DRV_LOG(INFO, "failed to change file descriptor"
-				" devx comp");
+		sh->intr_handle_devx = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 devx_comp->fd,
+			 mlx5_dev_interrupt_handler_devx, sh);
+		if (!sh->intr_handle_devx) {
+			DRV_LOG(ERR, "Failed to allocate intr_handle.");
 			return;
 		}
-		rte_intr_fd_set(sh->intr_handle_devx, devx_comp->fd);
-		rte_intr_type_set(sh->intr_handle_devx,
-					 RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle_devx,
-					mlx5_dev_interrupt_handler_devx, sh)) {
-			DRV_LOG(INFO, "Fail to install the devx shared"
-				" interrupt.");
-			rte_intr_fd_set(sh->intr_handle_devx, -1);
-		}
 #endif /* HAVE_IBV_DEVX_ASYNC */
 	}
 }
@@ -2624,24 +2561,13 @@
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
-	int nlsk_fd;
-
-	if (rte_intr_fd_get(sh->intr_handle) >= 0)
-		mlx5_intr_callback_unregister(sh->intr_handle,
-					      mlx5_dev_interrupt_handler, sh);
-	rte_intr_instance_free(sh->intr_handle);
-	nlsk_fd = rte_intr_fd_get(sh->intr_handle_nl);
-	if (nlsk_fd >= 0) {
-		mlx5_intr_callback_unregister
-			(sh->intr_handle_nl, mlx5_dev_interrupt_handler_nl, sh);
-		close(nlsk_fd);
-	}
-	rte_intr_instance_free(sh->intr_handle_nl);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
+					  mlx5_dev_interrupt_handler, sh);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
+					  mlx5_dev_interrupt_handler_nl, sh);
 #ifdef HAVE_IBV_DEVX_ASYNC
-	if (rte_intr_fd_get(sh->intr_handle_devx) >= 0)
-		rte_intr_callback_unregister(sh->intr_handle_devx,
-				  mlx5_dev_interrupt_handler_devx, sh);
-	rte_intr_instance_free(sh->intr_handle_devx);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
+					  mlx5_dev_interrupt_handler_devx, sh);
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
diff --git a/drivers/net/mlx5/linux/mlx5_socket.c b/drivers/net/mlx5/linux/mlx5_socket.c
index 4882e5f..0e01aff 100644
--- a/drivers/net/mlx5/linux/mlx5_socket.c
+++ b/drivers/net/mlx5/linux/mlx5_socket.c
@@ -134,51 +134,6 @@
 }
 
 /**
- * Install interrupt handler.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @return
- *   0 on success, a negative errno value otherwise.
- */
-static int
-mlx5_pmd_interrupt_handler_install(void)
-{
-	MLX5_ASSERT(server_socket != -1);
-
-	server_intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
-	if (server_intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
-	if (rte_intr_fd_set(server_intr_handle, server_socket))
-		return -rte_errno;
-
-	if (rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	return rte_intr_callback_register(server_intr_handle,
-					  mlx5_pmd_socket_handle, NULL);
-}
-
-/**
- * Uninstall interrupt handler.
- */
-static void
-mlx5_pmd_interrupt_handler_uninstall(void)
-{
-	if (server_socket != -1) {
-		mlx5_intr_callback_unregister(server_intr_handle,
-					      mlx5_pmd_socket_handle,
-					      NULL);
-	}
-	rte_intr_fd_set(server_intr_handle, 0);
-	rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_UNKNOWN);
-	rte_intr_instance_free(server_intr_handle);
-}
-
-/**
  * Initialise the socket to communicate with external tools.
  *
  * @return
@@ -224,7 +179,10 @@
 			strerror(errno));
 		goto remove;
 	}
-	if (mlx5_pmd_interrupt_handler_install()) {
+	server_intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_PRIVATE, false,
+		 server_socket, mlx5_pmd_socket_handle, NULL);
+	if (server_intr_handle == NULL) {
 		DRV_LOG(WARNING, "cannot register interrupt handler for mlx5 socket: %s",
 			strerror(errno));
 		goto remove;
@@ -248,7 +206,8 @@
 {
 	if (server_socket == -1)
 		return;
-	mlx5_pmd_interrupt_handler_uninstall();
+	mlx5_os_interrupt_handler_destroy(server_intr_handle,
+					  mlx5_pmd_socket_handle, NULL);
 	claim_zero(close(server_socket));
 	server_socket = -1;
 	MKSTR(path, MLX5_SOCKET_PATH, getpid());
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index f3e6682..4821ff0 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1678,8 +1678,6 @@ int mlx5_sysfs_switch_info(unsigned int ifindex,
 			   struct mlx5_switch_info *info);
 void mlx5_translate_port_name(const char *port_name_in,
 			      struct mlx5_switch_info *port_info_out);
-void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-				   rte_intr_callback_fn cb_fn, void *cb_arg);
 int mlx5_sysfs_bond_info(unsigned int pf_ifindex, unsigned int *ifindex,
 			 char *ifname);
 int mlx5_get_module_info(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index fe74317..f853a67 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -741,11 +741,8 @@
 static void
 mlx5_txpp_stop_service(struct mlx5_dev_ctx_shared *sh)
 {
-	if (!rte_intr_fd_get(sh->txpp.intr_handle))
-		return;
-	mlx5_intr_callback_unregister(sh->txpp.intr_handle,
-				      mlx5_txpp_interrupt_handler, sh);
-	rte_intr_instance_free(sh->txpp.intr_handle);
+	mlx5_os_interrupt_handler_destroy(sh->txpp.intr_handle,
+					  mlx5_txpp_interrupt_handler, sh);
 }
 
 /* Attach interrupt handler and fires first request to Rearm Queue. */
@@ -769,23 +766,12 @@
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	sh->txpp.intr_handle =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->txpp.intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
 	fd = mlx5_os_get_devx_channel_fd(sh->txpp.echan);
-	if (rte_intr_fd_set(sh->txpp.intr_handle, fd))
-		return -rte_errno;
-
-	if (rte_intr_type_set(sh->txpp.intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	if (rte_intr_callback_register(sh->txpp.intr_handle,
-				       mlx5_txpp_interrupt_handler, sh)) {
-		rte_intr_fd_set(sh->txpp.intr_handle, 0);
-		DRV_LOG(ERR, "Failed to register CQE interrupt %d.", rte_errno);
+	sh->txpp.intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, false,
+		 fd, mlx5_txpp_interrupt_handler, sh);
+	if (!sh->txpp.intr_handle) {
+		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		return -rte_errno;
 	}
 	/* Subscribe CQ event to the event channel controlled by the driver. */
diff --git a/drivers/net/mlx5/windows/mlx5_ethdev_os.c b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
index c6315ce..b52faea 100644
--- a/drivers/net/mlx5/windows/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
@@ -117,28 +117,6 @@
 	return -ENOTSUP;
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	RTE_SET_USED(handle);
-	RTE_SET_USED(cb_fn);
-	RTE_SET_USED(cb_arg);
-}
-
 /**
  * DPDK callback to get flow control status.
  *
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
index 3416797..2ca48f5 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
@@ -59,26 +59,10 @@
 mlx5_vdpa_virtq_unset(struct mlx5_vdpa_virtq *virtq)
 {
 	unsigned int i;
-	int retries = MLX5_VDPA_INTR_RETRIES;
-	int ret = -EAGAIN;
-
-	if (rte_intr_fd_get(virtq->intr_handle) != -1) {
-		while (retries-- && ret == -EAGAIN) {
-			ret = rte_intr_callback_unregister(virtq->intr_handle,
-							mlx5_vdpa_virtq_handler,
-							virtq);
-			if (ret == -EAGAIN) {
-				DRV_LOG(DEBUG, "Try again to unregister fd %d "
-				"of virtq %d interrupt, retries = %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				(int)virtq->index, retries);
-
-				usleep(MLX5_VDPA_INTR_RETRIES_USEC);
-			}
-		}
-		rte_intr_fd_set(virtq->intr_handle, -1);
-	}
-	rte_intr_instance_free(virtq->intr_handle);
+	int ret;
+
+	mlx5_os_interrupt_handler_destroy(virtq->intr_handle,
+					  mlx5_vdpa_virtq_handler, virtq);
 	if (virtq->virtq) {
 		ret = mlx5_vdpa_virtq_stop(virtq->priv, virtq->index);
 		if (ret)
@@ -342,35 +326,13 @@
 	virtq->priv = priv;
 	rte_write32(virtq->index, priv->virtq_db_addr);
 	/* Setup doorbell mapping. */
-	virtq->intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+	virtq->intr_handle = mlx5_os_interrupt_handler_create(
+				  RTE_INTR_INSTANCE_F_SHARED, false,
+				  vq.kickfd, mlx5_vdpa_virtq_handler, virtq);
 	if (virtq->intr_handle == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		goto error;
 	}
-
-	if (rte_intr_fd_set(virtq->intr_handle, vq.kickfd))
-		goto error;
-
-	if (rte_intr_fd_get(virtq->intr_handle) == -1) {
-		DRV_LOG(WARNING, "Virtq %d kickfd is invalid.", index);
-	} else {
-		if (rte_intr_type_set(virtq->intr_handle, RTE_INTR_HANDLE_EXT))
-			goto error;
-
-		if (rte_intr_callback_register(virtq->intr_handle,
-					       mlx5_vdpa_virtq_handler,
-					       virtq)) {
-			rte_intr_fd_set(virtq->intr_handle, -1);
-			DRV_LOG(ERR, "Failed to register virtq %d interrupt.",
-				index);
-			goto error;
-		} else {
-			DRV_LOG(DEBUG, "Register fd %d interrupt for virtq %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				index);
-		}
-	}
 	/* Subscribe virtq error event. */
 	virtq->version++;
 	cookie = ((uint64_t)virtq->version << 32) + index;
-- 
1.8.3.1



* [RFC 3/6] net/mlx5: add LWM event handling support
  2022-04-01  3:22 [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Spike Du
  2022-04-01  3:22 ` [RFC 1/6] net/mlx5: add LWM support for Rxq Spike Du
  2022-04-01  3:22 ` [RFC 2/6] common/mlx5: share interrupt management Spike Du
@ 2022-04-01  3:22 ` Spike Du
  2022-04-01  3:22 ` [RFC 4/6] net/mlx5: add private API to configure Rxq LWM Spike Du
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-04-01  3:22 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

When the RQ WQE count reaches the LWM, the kernel driver raises an event
to SW.
Use a devx event_channel to catch this event and notify the user.
Allocate this channel per shared device.
The channel has a cookie that identifies the port and queue of the event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.c      | 61 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h      |  7 +++++
 drivers/net/mlx5/mlx5_devx.c | 47 ++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.c   | 29 +++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h   |  7 +++++
 5 files changed, 151 insertions(+)
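
One possible way for a single cookie to carry both the port and the queue
is simple bit packing. The layout below (queue index in the low 16 bits,
port id in the next 16) is an assumption made for illustration, as are the
lwm_cookie_* names; the encoding actually used by the series may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, for illustration only. */
#define LWM_COOKIE_QUEUE_BITS 16u
#define LWM_COOKIE_QUEUE_MASK 0xffffu

/* Combine a port id and a queue index into one 64-bit cookie. */
static uint64_t
lwm_cookie_pack(uint16_t port_id, uint16_t queue_id)
{
	return ((uint64_t)port_id << LWM_COOKIE_QUEUE_BITS) | queue_id;
}

/* Recover the port id and queue index from a cookie. */
static void
lwm_cookie_unpack(uint64_t cookie, uint16_t *port_id, uint16_t *queue_id)
{
	assert(port_id != NULL && queue_id != NULL);
	*port_id = (uint16_t)(cookie >> LWM_COOKIE_QUEUE_BITS);
	*queue_id = (uint16_t)(cookie & LWM_COOKIE_QUEUE_MASK);
}
```

A handler shared by every Rx queue on the device could then decode the
cookie delivered with each event to find which queue hit its watermark.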

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 72b1e35..334223e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <errno.h>
+#include <fcntl.h>
 
 #include <rte_malloc.h>
 #include <ethdev_driver.h>
@@ -22,6 +23,7 @@
 #include <rte_eal_paging.h>
 #include <rte_alarm.h>
 #include <rte_cycles.h>
+#include <rte_interrupts.h>
 
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
@@ -1521,6 +1523,64 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+	int fd_lwm;
+
+	pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+	priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+			(priv->sh->cdev->ctx,
+			 MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+	if (!priv->sh->devx_channel_lwm)
+		goto err;
+	fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+	priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+	if (!priv->sh->intr_handle_lwm)
+		goto err;
+	return 0;
+err:
+	mlx5_lwm_unset(priv->sh);
+	return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before freeing this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+	if (sh->intr_handle_lwm) {
+		mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+			mlx5_dev_interrupt_handler_lwm, (void *)-1);
+		sh->intr_handle_lwm = NULL;
+	}
+	if (sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(sh->devx_channel_lwm);
+		sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1597,6 +1657,7 @@ struct mlx5_dev_ctx_shared *
 		claim_zero(mlx5_devx_cmd_destroy(sh->td));
 	MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
 	pthread_mutex_destroy(&sh->txpp.mutex);
+	mlx5_lwm_unset(sh);
 	mlx5_free(sh);
 	return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 4821ff0..515ff33 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1264,6 +1264,9 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
 	unsigned int flow_max_priority;
 	enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+	void *devx_channel_lwm;
+	struct rte_intr_handle *intr_handle_lwm;
+	pthread_mutex_t lwm_config_lock;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1401,6 +1404,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1409,6 +1413,7 @@ struct mlx5_obj_ops {
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
 	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
 	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+	int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1599,6 +1604,8 @@ int mlx5_udp_tunnel_port_add(struct rte_eth_dev *dev,
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *priv);
+void mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh);
 
 /* Macro to iterate over all valid ports for mlx5 driver. */
 #define MLX5_ETH_FOREACH_DEV(port_id, dev) \
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index d6de882..e1e5d2d 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -230,6 +230,52 @@
 }
 
 /**
+ * Get LWM event for shared context, return the correct port/rxq for this event.
+ *
+ * @param priv
+ *   Mlx5_priv object.
+ * @param rxq_idx [out]
+ *   Which rxq gets this event.
+ * @param port_id [out]
+ *   Which port gets this event.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_rx_devx_get_event_lwm(struct mlx5_priv *priv, int *rxq_idx, int *port_id)
+{
+#ifdef HAVE_IBV_DEVX_EVENT
+	union {
+		struct mlx5dv_devx_async_event_hdr event_resp;
+		uint8_t buf[sizeof(struct mlx5dv_devx_async_event_hdr) + 128];
+	} out;
+	int ret;
+
+	memset(&out, 0, sizeof(out));
+	ret = mlx5_glue->devx_get_event(priv->sh->devx_channel_lwm,
+					&out.event_resp,
+					sizeof(out.buf));
+	if (ret < 0) {
+		rte_errno = errno;
+		DRV_LOG(WARNING, "%s err\n", __func__);
+		return -rte_errno;
+	}
+	*port_id = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_PORTID_OFFSET) & LWM_COOKIE_PORTID_MASK;
+	*rxq_idx = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_RXQID_OFFSET) & LWM_COOKIE_RXQID_MASK;
+	return 0;
+#else
+	(void)priv;
+	(void)rxq_idx;
+	(void)port_id;
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+#endif /* HAVE_IBV_DEVX_EVENT */
+}
+
+/**
  * Create a RQ object using DevX.
  *
  * @param rxq
@@ -1413,6 +1459,7 @@ struct mlx5_obj_ops devx_obj_ops = {
 	.rxq_event_get = mlx5_rx_devx_get_event,
 	.rxq_obj_modify = mlx5_devx_modify_rq,
 	.rxq_obj_release = mlx5_rxq_devx_obj_release,
+	.rxq_event_get_lwm = mlx5_rx_devx_get_event_lwm,
 	.ind_table_new = mlx5_devx_ind_table_new,
 	.ind_table_modify = mlx5_devx_ind_table_modify,
 	.ind_table_destroy = mlx5_devx_ind_table_destroy,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index e5eea0a..f72364e 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -1187,3 +1187,32 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	return -ENOTSUP;
 }
+
+/**
+ * Rte interrupt handler for LWM event.
+ * It first checks whether an event has arrived and, if so, invokes the
+ * callback registered for the LWM event on that rxq.
+ *
+ * @param args
+ *   Generic pointer to mlx5_priv.
+ */
+void
+mlx5_dev_interrupt_handler_lwm(void *args)
+{
+	struct mlx5_priv *priv = args;
+	struct mlx5_rxq_priv *rxq;
+	struct rte_eth_dev *dev;
+	int ret, rxq_idx = 0, port_id = 0;
+
+	ret = priv->obj_ops.rxq_event_get_lwm(priv, &rxq_idx, &port_id);
+	if (unlikely(ret < 0)) {
+		DRV_LOG(WARNING, "Cannot get LWM event context.");
+		return;
+	}
+	DRV_LOG(INFO, "%s get LWM event, port_id:%d rxq_id:%d.", __func__,
+		port_id, rxq_idx);
+	dev = &rte_eth_devices[port_id];
+	rxq = mlx5_rxq_get(dev, rxq_idx);
+	if (rxq && rxq->lwm_event_rxq_limit_reached)
+		rxq->lwm_event_rxq_limit_reached(port_id, rxq_idx);
+}
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 98d7cae..bf3c5e1 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	void (*lwm_event_rxq_limit_reached)(uint16_t port_id, uint16_t rxq_id);
 };
 
 /* External RX queue descriptor. */
@@ -294,6 +295,7 @@ void mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
+void mlx5_dev_interrupt_handler_lwm(void *args);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
@@ -674,4 +676,9 @@ uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 	return !!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED);
 }
 
+#define LWM_COOKIE_RXQID_OFFSET 0
+#define LWM_COOKIE_RXQID_MASK 0xffff
+#define LWM_COOKIE_PORTID_OFFSET 16
+#define LWM_COOKIE_PORTID_MASK 0xffff
+
 #endif /* RTE_PMD_MLX5_RX_H_ */
-- 
1.8.3.1



* [RFC 4/6] net/mlx5: add private API to configure Rxq LWM
  2022-04-01  3:22 [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Spike Du
                   ` (2 preceding siblings ...)
  2022-04-01  3:22 ` [RFC 3/6] net/mlx5: add LWM event handling support Spike Du
@ 2022-04-01  3:22 ` Spike Du
  2022-04-01  3:22 ` [RFC 5/6] net/mlx5: add private API to config host port shaper Spike Du
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-04-01  3:22 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

The new API allows setting, unsetting, and modifying an LWM (limit watermark)
event per Rxq.
When the Rx queue fullness reaches the LWM limit, the driver catches
an HW event and invokes the user callback.
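
The LWM is given as a percentage of the Rx queue size and is converted to a
WQE count before it is programmed. The sketch below mirrors that arithmetic
and the two range checks in rte_pmd_mlx5_config_rxq_lwm() from this patch;
the helper name lwm_percent_to_wqes() is hypothetical, and the precondition
that the queue holds at least one WQE is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert an LWM percentage into a WQE threshold.
 * Returns the threshold on success or a negative errno value.
 */
static int
lwm_percent_to_wqes(uint8_t lwm_percent, uint32_t wqe_cnt)
{
	uint64_t wqes;

	assert(wqe_cnt != 0); /* assumption: a queue has at least one WQE */
	if (lwm_percent > 99)
		return -E2BIG; /* only 0 (unarm) and 1..99 are accepted */
	/* Widen before multiplying so large queues cannot overflow. */
	wqes = (uint64_t)lwm_percent * wqe_cnt / 100;
	if (lwm_percent != 0 && wqes == 0)
		return -EINVAL; /* queue too small for this percentage */
	return (int)wqes;
}
```

For example, 30% of a 1024-entry queue rounds down to 307 WQEs, while 1% of
a 64-entry queue rounds down to zero and is rejected, which is the "too
small LWM configuration" case in the patch.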

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  4 ++
 doc/guides/rel_notes/release_22_03.rst |  6 +++
 drivers/common/mlx5/mlx5_prm.h         |  1 +
 drivers/net/mlx5/mlx5_rx.c             | 88 +++++++++++++++++++++++++++++++++-
 drivers/net/mlx5/mlx5_rx.h             |  1 +
 drivers/net/mlx5/rte_pmd_mlx5.h        | 32 +++++++++++++
 drivers/net/mlx5/version.map           |  1 +
 7 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a734d10..0e983a6 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -92,6 +92,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -507,6 +508,9 @@ Limitations
 - The NIC egress flow rules on representor port are not supported.
 
 
+- LWM:
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
+
 Statistics
 ----------
 
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 60e5b4f..0c9d3b6 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -187,6 +187,12 @@ New Features
 
   An API was added to get/set an asymmetric crypto session's user data.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added Rx queue LWM(Limit WaterMark) support.
+
 * **Updated Marvell cnxk crypto PMD.**
 
   * Added SHA256-HMAC support in lookaside protocol (IPsec) for CN10K.
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 44b1822..23b13e3 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3290,6 +3290,7 @@ struct mlx5_aso_wqe {
 
 enum {
 	MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+	MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index f72364e..0390412 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,15 +19,16 @@
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
 #include <mlx5_common_mr.h>
+#include <rte_pmd_mlx5.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
-
 static __rte_always_inline uint32_t
 rxq_cq_to_pkt_type(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
 		   volatile struct mlx5_mini_cqe8 *mcqe);
@@ -1216,3 +1217,88 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	if (rxq && rxq->lwm_event_rxq_limit_reached)
 		rxq->lwm_event_rxq_limit_reached(port_id, rxq_idx);
 }
+
+int
+rte_pmd_mlx5_config_rxq_lwm(uint16_t port_id, uint16_t rx_queue_id,
+			    uint8_t lwm,
+			    lwm_event_rxq_limit_reached_t cb)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	uint16_t event_nums[1] = {MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED};
+	struct mlx5_rxq_data *rxq_data;
+	struct mlx5_priv *priv;
+	uint32_t wqe_cnt;
+	uint64_t cookie;
+	int ret = 0;
+
+	if (!rxq) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	rxq_data = &rxq->ctrl->rxq;
+	priv = rxq->priv;
+	/* Ensure the Rq is created by devx. */
+	if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (lwm > 99) {
+		DRV_LOG(WARNING, "Too big LWM configuration.");
+		rte_errno = E2BIG;
+		return -rte_errno;
+	}
+	/* Start config LWM. */
+	pthread_mutex_lock(&priv->sh->lwm_config_lock);
+	if (rxq->lwm == 0 && lwm == 0) {
+		/* Both old/new values are 0, do nothing. */
+		ret = 0;
+		goto end;
+	}
+	wqe_cnt = mlx5_rxq_mprq_enabled(rxq_data)
+		? RTE_BIT32(rxq_data->cqe_n - rxq_data->log_strd_num) :
+		RTE_BIT32(rxq_data->cqe_n);
+	if (lwm) {
+		if (!priv->sh->devx_channel_lwm) {
+			ret = mlx5_lwm_setup(priv);
+			if (ret) {
+				DRV_LOG(WARNING,
+					"Failed to create shared_lwm.");
+				rte_errno = ENOMEM;
+				ret = -rte_errno;
+				goto end;
+			}
+		}
+		if (!rxq->lwm_devx_subscribed) {
+			cookie = ((uint32_t)
+				  (port_id << LWM_COOKIE_PORTID_OFFSET)) |
+				(rx_queue_id << LWM_COOKIE_RXQID_OFFSET);
+			ret = mlx5_os_devx_subscribe_devx_event
+				(priv->sh->devx_channel_lwm,
+				 rxq->devx_rq.rq->obj,
+				 sizeof(event_nums),
+				 event_nums,
+				 cookie);
+			if (ret) {
+				rte_errno = rte_errno ? rte_errno : EINVAL;
+				ret = -rte_errno;
+				goto end;
+			}
+			rxq->lwm_devx_subscribed = 1;
+		}
+	}
+	/* Save LWM to rxq and send modify_rq devx command. */
+	rxq->lwm = lwm * wqe_cnt / 100;
+	if (lwm && !rxq->lwm) {
+		/* With mprq, wqe_cnt may be < 100. */
+		DRV_LOG(WARNING, "Too small LWM configuration.");
+		rte_errno = EINVAL;
+		ret = -rte_errno;
+		goto end;
+	}
+	rxq->lwm_event_rxq_limit_reached = lwm ? cb : NULL;
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RDY);
+end:
+	pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	return ret;
+}
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index bf3c5e1..5e56258 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	uint32_t lwm_devx_subscribed:1;
 	void (*lwm_event_rxq_limit_reached)(uint16_t port_id, uint16_t rxq_id);
 };
 
diff --git a/drivers/net/mlx5/rte_pmd_mlx5.h b/drivers/net/mlx5/rte_pmd_mlx5.h
index 6e7907e..4d2fb42 100644
--- a/drivers/net/mlx5/rte_pmd_mlx5.h
+++ b/drivers/net/mlx5/rte_pmd_mlx5.h
@@ -109,6 +109,38 @@ int rte_pmd_mlx5_external_rx_queue_id_map(uint16_t port_id, uint16_t dpdk_idx,
 int rte_pmd_mlx5_external_rx_queue_id_unmap(uint16_t port_id,
 					    uint16_t dpdk_idx);
 
+typedef void (*lwm_event_rxq_limit_reached_t)(uint16_t port_id,
+					      uint16_t rxq_id);
+/**
+ * Arm an Rx queue LWM(limit watermark) event.
+ * When the Rx queue fullness reaches the LWM limit, the driver catches
+ * an HW event and invokes the user event callback.
+ * After the last event handling, the user needs to call this API again
+ * to arm an additional event.
+ *
+ * @param[in] port_id
+ *   The port identifier of the Ethernet device.
+ * @param[in] rxq_id
+ *   The rxq id.
+ * @param[in] lwm
+ *   The LWM value, defined as a percentage of the Rx queue size.
+ *   [1-99] to set a new LWM (update the old value).
+ *   0 to unarm the event.
+ * @param[in] cb
+ *   The LWM event callback.
+ *
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOMEM - not enough memory to create LWM event channel.
+ *   - EINVAL - the input Rxq is not created by devx.
+ *   - E2BIG  - lwm is bigger than 99.
+ */
+__rte_experimental
+int rte_pmd_mlx5_config_rxq_lwm(uint16_t port_id, uint16_t rxq_id,
+				uint8_t lwm,
+				lwm_event_rxq_limit_reached_t cb);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/mlx5/version.map b/drivers/net/mlx5/version.map
index 79cb79a..8c965dd 100644
--- a/drivers/net/mlx5/version.map
+++ b/drivers/net/mlx5/version.map
@@ -12,4 +12,5 @@ EXPERIMENTAL {
 	# added in 22.03
 	rte_pmd_mlx5_external_rx_queue_id_map;
 	rte_pmd_mlx5_external_rx_queue_id_unmap;
+	rte_pmd_mlx5_config_rxq_lwm;
 };
-- 
1.8.3.1



* [RFC 5/6] net/mlx5: add private API to config host port shaper
  2022-04-01  3:22 [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Spike Du
                   ` (3 preceding siblings ...)
  2022-04-01  3:22 ` [RFC 4/6] net/mlx5: add private API to configure Rxq LWM Spike Du
@ 2022-04-01  3:22 ` Spike Du
  2022-04-01  3:22 ` [RFC 6/6] app/testpmd: add LWM and Host Shaper command Spike Du
  2022-04-05  8:58 ` [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Jerin Jacob
  6 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-04-01  3:22 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

The host port shaper can be configured through the QSHR (QoS Shaper Host
Register).
Add a check in the build files so that this function is enabled only when
the mtcr_ul library (libmtcr_ul) is found.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper API configures the shaper rate and the lwm-triggered flag
for a host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM (Limit Watermark) event.
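
The way the rate argument and the lwm-triggered flag interact can be
sketched as a small state update, modelled on
rte_pmd_mlx5_config_host_shaper() in this patch. The struct
host_shaper_state and the function host_shaper_update() are hypothetical
names used only for this illustration; the register write itself is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit index of the flag, as MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED in the patch. */
#define HOST_SHAPER_FLAG_LWM_TRIGGERED 0

/* Hypothetical container for the two values kept per shared device. */
struct host_shaper_state {
	uint8_t rate;       /* unit is 100Mbps, 0 disables the shaper */
	bool lwm_triggered; /* arm the automatic shaper on LWM events */
};

/* Hypothetical helper: apply one configuration request to the state. */
static int
host_shaper_update(struct host_shaper_state *st, uint8_t rate, uint32_t flags)
{
	bool lwm_flag =
		!!(flags & (UINT32_C(1) << HOST_SHAPER_FLAG_LWM_TRIGGERED));

	assert(st != NULL);
	if (!lwm_flag) {
		/* Flag unset: the rate applies to the shaper immediately. */
		st->rate = rate;
		return 0;
	}
	/* Flag set: only rate 0 (disable) and 1 (100Mbps) are accepted. */
	switch (rate) {
	case 0:
		st->lwm_triggered = false;
		return 0;
	case 1:
		st->lwm_triggered = true;
		return 0;
	default:
		return -ENOTSUP;
	}
}
```

Note that a request with the flag set never changes the stored immediate
rate, and a request without the flag never changes lwm_triggered, so the two
settings can be adjusted independently.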

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |   7 +++
 doc/guides/rel_notes/release_22_03.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  21 +++++--
 drivers/common/mlx5/mlx5_prm.h         |  25 ++++++++
 drivers/net/mlx5/mlx5.h                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 104 +++++++++++++++++++++++++++++++++
 drivers/net/mlx5/rte_pmd_mlx5.h        |  30 ++++++++++
 drivers/net/mlx5/version.map           |   1 +
 8 files changed, 187 insertions(+), 4 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 0e983a6..35210c1 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -511,6 +512,12 @@ Limitations
 - LWM:
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField-2.
+  - When the host shaper is configured with the MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED
+    flag set, only rates 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 0c9d3b6..3ab4388 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -192,6 +192,7 @@ New Features
   Updated the Mellanox mlx5 driver with new features and improvements, including:
 
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto PMD.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index ed48245..c88c184 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -16,8 +16,9 @@ if dlopen_ibverbs
     ]
 endif
 
-libnames = [ 'mlx5', 'ibverbs' ]
+libnames = [ 'mlx5', 'ibverbs', 'mtcr_ul' ]
 libs = []
+libmtcr_ul_found = false
 foreach libname:libnames
     lib = dependency('lib' + libname, static:static_ibverbs, required:false, method: 'pkg-config')
     if not lib.found() and not static_ibverbs
@@ -28,10 +29,16 @@ foreach libname:libnames
         if not static_ibverbs and not dlopen_ibverbs
             ext_deps += lib
         endif
+        if libname == 'mtcr_ul'
+            libmtcr_ul_found = true
+            ext_deps += lib
+        endif
     else
-        build = false
-        reason = 'missing dependency, "' + libname + '"'
-        subdir_done()
+        if libname != 'mtcr_ul'
+            build = false
+            reason = 'missing dependency, "' + libname + '"'
+            subdir_done()
+        endif
     endif
 endforeach
 if static_ibverbs or dlopen_ibverbs
@@ -205,6 +212,12 @@ has_sym_args = [
         [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
             'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+    has_sym_args += [
+        [  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+            'mopen'],
+    ]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
     config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 23b13e3..3559927 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3768,6 +3768,7 @@ enum {
 	MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
 	MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
 	MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC005,
+	MLX5_QSHR_REGISTER_ID = 0x4030,
 };
 
 struct mlx5_ifc_register_mtutc_bits {
@@ -3782,6 +3783,30 @@ struct mlx5_ifc_register_mtutc_bits {
 	u8 time_adjustment[0x20];
 };
 
+struct mlx5_ifc_ets_global_config_register_bits {
+	u8 reserved_at_0[0x2];
+	u8 rate_limit_update[0x1];
+	u8 reserved_at_3[0x29];
+	u8 max_bw_units[0x4];
+	u8 reserved_at_48[0x8];
+	u8 max_bw_value[0x8];
+};
+
+#define ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED      0x0
+#define ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS 0x3
+#define ETS_GLOBAL_CONFIG_BW_UNIT_GBPS          0x4
+
+struct mlx5_ifc_register_qshr_bits {
+	u8 reserved_at_0[0x4];
+	u8 connected_host[0x1];
+	u8 vqos[0x1];
+	u8 fast_response[0x1];
+	u8 reserved_at_7[0x1];
+	u8 local_port[0x8];
+	u8 reserved_at_16[0x230];
+	struct mlx5_ifc_ets_global_config_register_bits global_config;
+};
+
 #define MLX5_MTUTC_TIMESTAMP_MODE_INTERNAL_TIMER 0
 #define MLX5_MTUTC_TIMESTAMP_MODE_REAL_TIME 1
 
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 515ff33..5dfd375 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1267,6 +1267,8 @@ struct mlx5_dev_ctx_shared {
 	void *devx_channel_lwm;
 	struct rte_intr_handle *intr_handle_lwm;
 	pthread_mutex_t lwm_config_lock;
+	uint32_t host_shaper_rate:8;
+	uint32_t lwm_triggered:1;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 0390412..6d5d11b 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -28,6 +28,9 @@
 #include "mlx5_rxtx.h"
 #include "mlx5_devx.h"
 #include "mlx5_rx.h"
+#ifdef HAVE_MLX5_MSTFLINT
+#include <mstflint/mtcr.h>
+#endif
 
 static __rte_always_inline uint32_t
 rxq_cq_to_pkt_type(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
@@ -1302,3 +1305,104 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	pthread_mutex_unlock(&priv->sh->lwm_config_lock);
 	return ret;
 }
+
+/**
+ * Mlx5 access register function to configure host shaper.
+ * It calls the libmtcr_ul API to access QSHR (QoS Shaper Host Register)
+ * in firmware.
+ *
+ * @param dev
+ *   Pointer to rte_eth_dev.
+ * @param lwm_triggered
+ *   Flag to enable/disable lwm_triggered bit in QSHR.
+ * @param rate
+ *   Host shaper rate, unit is 100Mbps, set to 0 means disable the shaper.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+static int
+mlxreg_config_host_shaper(struct rte_eth_dev *dev,
+			  bool lwm_triggered, uint8_t rate)
+{
+#ifdef HAVE_MLX5_MSTFLINT
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint32_t data[MLX5_ST_SZ_DW(register_qshr)] = {0};
+	int rc, retry_count = 3;
+	mfile *mf = NULL;
+	int status;
+	void *ptr;
+
+	mf = mopen(priv->sh->ibdev_name);
+	if (!mf) {
+		DRV_LOG(WARNING, "mopen failed\n");
+		rte_errno = ENOENT;
+		return -rte_errno;
+	}
+	MLX5_SET(register_qshr, data, connected_host, 1);
+	MLX5_SET(register_qshr, data, fast_response, lwm_triggered ? 1 : 0);
+	MLX5_SET(register_qshr, data, local_port, 1);
+	ptr = MLX5_ADDR_OF(register_qshr, data, global_config);
+	MLX5_SET(ets_global_config_register, ptr, rate_limit_update, 1);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_units,
+		 rate ? ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS :
+		 ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_value, rate);
+	do {
+		rc = maccess_reg(mf,
+				 MLX5_QSHR_REGISTER_ID,
+				 MACCESS_REG_METHOD_SET,
+				 (u_int32_t *)&data[0],
+				 sizeof(data),
+				 sizeof(data),
+				 sizeof(data),
+				 &status);
+		if ((rc != ME_ICMD_STATUS_IFC_BUSY &&
+		     status != ME_REG_ACCESS_BAD_PARAM) ||
+		    !(mf->flags & MDEVS_REM)) {
+			break;
+		}
+		DRV_LOG(WARNING, "%s retry.", __func__);
+		usleep(10000);
+	} while (retry_count-- > 0);
+	mclose(mf);
+	rte_errno = (rc == ME_REG_ACCESS_DEV_BUSY) ? EBUSY : EIO;
+	return rc ? -rte_errno : 0;
+#else
+	(void)dev;
+	(void)lwm_triggered;
+	(void)rate;
+	return -1;
+#endif
+}
+
+int rte_pmd_mlx5_config_host_shaper(int port_id, uint8_t rate,
+				    uint32_t flags)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	struct mlx5_priv *priv = dev->data->dev_private;
+	bool lwm_triggered =
+		!!(flags & RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+
+	if (!lwm_triggered) {
+		priv->sh->host_shaper_rate = rate;
+	} else {
+		switch (rate) {
+		case 0:
+		/* Rate 0 means disable lwm_triggered. */
+			priv->sh->lwm_triggered = 0;
+			break;
+		case 1:
+		/* Rate 1 means enable lwm_triggered. */
+			priv->sh->lwm_triggered = 1;
+			break;
+		default:
+			return -ENOTSUP;
+		}
+	}
+	return mlxreg_config_host_shaper(dev, priv->sh->lwm_triggered,
+					 priv->sh->host_shaper_rate);
+}
diff --git a/drivers/net/mlx5/rte_pmd_mlx5.h b/drivers/net/mlx5/rte_pmd_mlx5.h
index 4d2fb42..3a32463 100644
--- a/drivers/net/mlx5/rte_pmd_mlx5.h
+++ b/drivers/net/mlx5/rte_pmd_mlx5.h
@@ -141,6 +141,36 @@ int rte_pmd_mlx5_config_rxq_lwm(uint16_t port_id, uint16_t rxq_id,
 				uint8_t lwm,
 				lwm_event_rxq_limit_reached_t cb);
 
+/**
+ * The rate of the host port shaper will be updated directly at the next
+ * LWM event to the rate that comes with this flag set; set rate 0
+ * to disable this rate update.
+ * Unset this flag to update the rate of the host port shaper directly in
+ * the API call; use rate 0 to disable the current shaper.
+ */
+#define MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED 0
+
+/**
+ * Configure an HW shaper to limit Rx rate for a host port.
+ * The configuration will affect all the ethdev ports belonging to
+ * the same rte_device.
+ *
+ * @param[in] port_id
+ *   The port identifier of the Ethernet device.
+ * @param[in] rate
+ *   Unit is 100Mbps, setting the rate to 0 disables the shaper.
+ * @param[in] flags
+ *   Host shaper flags.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+__rte_experimental
+int rte_pmd_mlx5_config_host_shaper(int port_id, uint8_t rate, uint32_t flags);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/mlx5/version.map b/drivers/net/mlx5/version.map
index 8c965dd..5029e19 100644
--- a/drivers/net/mlx5/version.map
+++ b/drivers/net/mlx5/version.map
@@ -13,4 +13,5 @@ EXPERIMENTAL {
 	rte_pmd_mlx5_external_rx_queue_id_map;
 	rte_pmd_mlx5_external_rx_queue_id_unmap;
 	rte_pmd_mlx5_config_rxq_lwm;
+	rte_pmd_mlx5_config_host_shaper;
 };
-- 
1.8.3.1



* [RFC 6/6] app/testpmd: add LWM and Host Shaper command
  2022-04-01  3:22 [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Spike Du
                   ` (4 preceding siblings ...)
  2022-04-01  3:22 ` [RFC 5/6] net/mlx5: add private API to config host port shaper Spike Du
@ 2022-04-01  3:22 ` Spike Du
  2022-04-05  8:58 ` [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Jerin Jacob
  6 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-04-01  3:22 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add testpmd commands to configure the per-Rxq LWM and the host shaper.
- Command syntax:
  set port <port_id> rxq <rxq_id> lwm <lwm_num>
  set port <port_id> host_shaper lwm_triggered <0|1> rate <rate_num>

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> set port 1 host_shaper lwm_triggered 0 rate 50

Add sample code to handle the rxq LWM event: it waits a while so that the
rxq empties, then disables the host shaper and rearms the LWM event.
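
The order of the steps in that sample flow can be sketched as below. This is
not the testpmd code from the patch: the hook table struct lwm_hooks, the
function lwm_event_flow(), and the recording stubs are hypothetical, and a
real application would call the PMD APIs from the hooks instead of recording
the steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Steps of the sample flow, in the order the commit message lists them. */
enum lwm_step {
	LWM_STEP_WAIT_DRAIN,
	LWM_STEP_DISABLE_SHAPER,
	LWM_STEP_REARM,
};

/* Hypothetical hooks standing in for the delay and the PMD API calls. */
struct lwm_hooks {
	void (*wait_drain)(void *ctx, uint16_t port_id, uint16_t rxq_id);
	int (*disable_shaper)(void *ctx, uint16_t port_id);
	int (*rearm)(void *ctx, uint16_t port_id, uint16_t rxq_id,
		     uint8_t lwm);
};

/* Run the flow once for the queue that raised the LWM event. */
static int
lwm_event_flow(const struct lwm_hooks *h, void *ctx, uint16_t port_id,
	       uint16_t rxq_id, uint8_t lwm)
{
	int ret;

	assert(h != NULL);
	h->wait_drain(ctx, port_id, rxq_id);    /* let the rxq empty */
	ret = h->disable_shaper(ctx, port_id);  /* lift the rate limit */
	if (ret < 0)
		return ret;
	return h->rearm(ctx, port_id, rxq_id, lwm); /* arm the next event */
}

/* Recording stubs: they only note which step ran, for illustration. */
struct lwm_trace {
	enum lwm_step steps[4];
	size_t n;
};

static void
trace_add(struct lwm_trace *t, enum lwm_step s)
{
	if (t->n < sizeof(t->steps) / sizeof(t->steps[0]))
		t->steps[t->n++] = s;
}

static void
trace_wait_drain(void *ctx, uint16_t port_id, uint16_t rxq_id)
{
	(void)port_id;
	(void)rxq_id;
	trace_add(ctx, LWM_STEP_WAIT_DRAIN);
}

static int
trace_disable_shaper(void *ctx, uint16_t port_id)
{
	(void)port_id;
	trace_add(ctx, LWM_STEP_DISABLE_SHAPER);
	return 0;
}

static int
trace_rearm(void *ctx, uint16_t port_id, uint16_t rxq_id, uint8_t lwm)
{
	(void)port_id;
	(void)rxq_id;
	(void)lwm;
	trace_add(ctx, LWM_STEP_REARM);
	return 0;
}

static const struct lwm_hooks lwm_trace_hooks = {
	.wait_drain = trace_wait_drain,
	.disable_shaper = trace_disable_shaper,
	.rearm = trace_rearm,
};
```

Rearming comes last because, per the API description in patch 4/6, the event
has to be armed again after each delivery before another one can fire.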

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 app/test-pmd/cmdline.c   | 149 +++++++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/config.c    | 122 ++++++++++++++++++++++++++++++++++++++
 app/test-pmd/meson.build |   3 +
 app/test-pmd/testpmd.c   |   3 +
 app/test-pmd/testpmd.h   |   5 ++
 doc/guides/nics/mlx5.rst |  76 ++++++++++++++++++++++++
 6 files changed, 358 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 7ab0575..8a5fe26 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -17807,6 +17807,151 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	}
 };
 
+#ifdef RTE_NET_MLX5
+
+/* *** SET LIMIT WATERMARK FOR AN RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t rxq;
+	uint16_t rxq_num;
+	cmdline_fixed_string_t lwm;
+	uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_rxq_lwm_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->rxq, "rxq") == 0)
+	    && (strcmp(res->lwm, "lwm") == 0))
+		ret = set_rxq_lwm(res->port_num, res->rxq_num,
+				  res->lwm_num);
+	if (ret < 0)
+		printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+	.f = cmd_rxq_lwm_parsed,
+	.data = (void *)0,
+	.help_str = "set port <port_id> rxq <rxq_id> lwm <lwm_num>: "
+		"Set lwm for rxq on port_id",
+	.tokens = {
+		(void *)&cmd_rxq_lwm_set,
+		(void *)&cmd_rxq_lwm_port,
+		(void *)&cmd_rxq_lwm_portnum,
+		(void *)&cmd_rxq_lwm_rxq,
+		(void *)&cmd_rxq_lwm_rxqnum,
+		(void *)&cmd_rxq_lwm_lwm,
+		(void *)&cmd_rxq_lwm_lwmnum,
+		NULL,
+	},
+};
+
+/* *** SET HOST_SHAPER LWM TRIGGERED FOR A PORT *** */
+struct cmd_port_host_shaper_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t host_shaper;
+	cmdline_fixed_string_t lwm_triggered;
+	uint16_t fr;
+	cmdline_fixed_string_t rate;
+	uint8_t rate_num;
+};
+
+static void cmd_port_host_shaper_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_port_host_shaper_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->host_shaper, "host_shaper") == 0)
+	    && (strcmp(res->lwm_triggered, "lwm_triggered") == 0)
+	    && (strcmp(res->rate, "rate") == 0))
+		ret = set_port_host_shaper(res->port_num, res->fr,
+					   res->rate_num);
+	if (ret < 0)
+		printf("cmd_port_host_shaper error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_port_host_shaper_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_port_host_shaper_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_port_host_shaper_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_host_shaper =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 host_shaper, "host_shaper");
+cmdline_parse_token_string_t cmd_port_host_shaper_lwm_triggered =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 lwm_triggered, "lwm_triggered");
+cmdline_parse_token_num_t cmd_port_host_shaper_fr =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      fr, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_rate =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 rate, "rate");
+cmdline_parse_token_num_t cmd_port_host_shaper_rate_num =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      rate_num, RTE_UINT8);
+
+
+cmdline_parse_inst_t cmd_port_host_shaper = {
+	.f = cmd_port_host_shaper_parsed,
+	.data = (void *)0,
+	.help_str = "set port <port_id> host_shaper lwm_triggered <0|1> "
+	"rate <rate_num>: Set HOST_SHAPER lwm_triggered and rate with port_id",
+	.tokens = {
+		(void *)&cmd_port_host_shaper_set,
+		(void *)&cmd_port_host_shaper_port,
+		(void *)&cmd_port_host_shaper_portnum,
+		(void *)&cmd_port_host_shaper_host_shaper,
+		(void *)&cmd_port_host_shaper_lwm_triggered,
+		(void *)&cmd_port_host_shaper_fr,
+		(void *)&cmd_port_host_shaper_rate,
+		(void *)&cmd_port_host_shaper_rate_num,
+		NULL,
+	},
+};
+
+#endif
+
 /* ******************************************************************************** */
 
 /* list of instructions */
@@ -18093,6 +18238,10 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	(cmdline_parse_inst_t *)&cmd_show_capability,
 	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
 	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
+#ifdef RTE_NET_MLX5
+	(cmdline_parse_inst_t *)&cmd_rxq_lwm,
+	(cmdline_parse_inst_t *)&cmd_port_host_shaper,
+#endif
 	NULL,
 };
 
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index cc8e7aa..11ef7e3 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -39,6 +39,7 @@
 #include <rte_flow.h>
 #include <rte_mtr.h>
 #include <rte_errno.h>
+#include <rte_alarm.h>
 #ifdef RTE_NET_IXGBE
 #include <rte_pmd_ixgbe.h>
 #endif
@@ -52,6 +53,9 @@
 #include <rte_gro.h>
 #endif
 #include <rte_hexdump.h>
+#ifdef RTE_NET_MLX5
+#include <rte_pmd_mlx5.h>
+#endif
 
 #include "testpmd.h"
 #include "cmdline_mtr.h"
@@ -6281,3 +6285,121 @@ struct igb_ring_desc_16_bytes {
 		printf("  %s\n", buf);
 	}
 }
+
+#ifdef RTE_NET_MLX5
+static uint8_t lwms[RTE_MAX_ETHPORTS][RTE_MAX_QUEUES_PER_PORT+1];
+static uint8_t host_shaper_lwm_triggered[RTE_MAX_ETHPORTS];
+
+#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
+static void
+lwm_event_rxq_limit_reached(uint16_t port_id, uint16_t rxq_id);
+
+static void
+mlx5_shaper_disable(void *args)
+{
+	uint32_t port_rxq_id = (uint32_t)(uintptr_t)args;
+	uint16_t port_id = port_rxq_id & 0xffff;
+	unsigned int qid;
+
+	printf("%s disable shaper\n", __func__);
+	/* Rearm all previously configured Rx queues. */
+	for (qid = 0; qid < nb_rxq; qid++) {
+		/* Configure with rxq's saved LWM value to rearm LWM event */
+		if (rte_pmd_mlx5_config_rxq_lwm(port_id, qid, lwms[port_id][qid],
+						lwm_event_rxq_limit_reached))
+			printf("config lwm returns error\n");
+	}
+	/* Only disable the shaper when lwm_triggered is set. */
+	if (host_shaper_lwm_triggered[port_id] &&
+	    rte_pmd_mlx5_config_host_shaper(port_id, 0, 0))
+		printf("%s disable shaper returns error\n", __func__);
+}
+
+static void
+lwm_event_rxq_limit_reached(uint16_t port_id, uint16_t rxq_id)
+{
+	uint32_t port_rxq_id = port_id | ((uint32_t)rxq_id << 16);
+	rte_eal_alarm_set(SHAPER_DISABLE_DELAY_US,
+			  mlx5_shaper_disable, (void *)(uintptr_t)port_rxq_id);
+	printf("%s port_id:%u rxq_id:%u\n", __func__, port_id, rxq_id);
+}
+
+static void
+mlx5_lwm_intr_handle_cancel_alarm(uint16_t port_id, uint16_t qid)
+{
+	uint32_t port_rxq_id = port_id | ((uint32_t)qid << 16);
+	int retries = 1024;
+
+	rte_errno = 0;
+	while (--retries) {
+		rte_eal_alarm_cancel(mlx5_shaper_disable,
+				     (void *)(uintptr_t)port_rxq_id);
+		if (rte_errno != EINPROGRESS)
+			break;
+		rte_pause();
+	}
+}
+
+int
+set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm)
+{
+	struct rte_eth_link link;
+	int ret;
+
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+	ret = eth_link_get_nowait_print_err(port_id, &link);
+	if (ret < 0)
+		return -EINVAL;
+	if (lwm > 99)
+		return -EINVAL;
+	/* When disabling LWM, cancel the pending alarm. */
+	if (!lwm)
+		mlx5_lwm_intr_handle_cancel_alarm(port_id, queue_idx);
+	ret = rte_pmd_mlx5_config_rxq_lwm(port_id, queue_idx, lwm,
+						lwm_event_rxq_limit_reached);
+	if (ret)
+		return ret;
+	/* Save the input lwm only once the PMD has accepted it. */
+	lwms[port_id][queue_idx] = lwm;
+	return 0;
+}
+
+/** Configure host shaper's lwm_triggered and current rate.
+ *
+ * @param[in] lwm_triggered
+ *   Disable/enable lwm_triggered.
+ * @param[in] rate
+ *   Current host shaper rate, in units of 100Mbps.
+ * @return
+ *   On success, returns 0.
+ *   On failure, returns < 0.
+ */
+int
+set_port_host_shaper(portid_t port_id, uint16_t lwm_triggered, uint8_t rate)
+{
+	struct rte_eth_link link;
+	int ret;
+
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+	ret = eth_link_get_nowait_print_err(port_id, &link);
+	if (ret < 0)
+		return ret;
+	host_shaper_lwm_triggered[port_id] = lwm_triggered ? 1 : 0;
+	if (!lwm_triggered) {
+		ret = rte_pmd_mlx5_config_host_shaper(port_id, 0,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	} else {
+		ret = rte_pmd_mlx5_config_host_shaper(port_id, 1,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	}
+	if (ret)
+		return ret;
+	ret = rte_pmd_mlx5_config_host_shaper(port_id, rate, 0);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+#endif
diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 43130c8..c4fd379 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -73,3 +73,6 @@ endif
 if dpdk_conf.has('RTE_NET_DPAA')
     deps += ['bus_dpaa', 'mempool_dpaa', 'net_dpaa']
 endif
+if dpdk_conf.has('RTE_NET_MLX5')
+    deps += 'net_mlx5'
+endif
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe2ce19..3b53cd8 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -66,6 +66,9 @@
 #ifdef RTE_EXEC_ENV_WINDOWS
 #include <process.h>
 #endif
+#ifdef RTE_NET_MLX5
+#include <rte_pmd_mlx5.h>
+#endif
 
 #include "testpmd.h"
 
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 31f766c..aed2057 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1163,6 +1163,11 @@ uint16_t tx_pkt_set_dynf(uint16_t port_id, __rte_unused uint16_t queue,
 void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
 void flex_item_destroy(portid_t port_id, uint16_t flex_id);
 void port_flex_item_flush(portid_t port_id);
+#ifdef RTE_NET_MLX5
+int set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm);
+int set_port_host_shaper(portid_t port_id, uint16_t lwm_triggered,
+			 uint8_t rate);
+#endif
 
 extern int flow_parse(const char *src, void *result, unsigned int size,
 		      struct rte_flow_attr **attr,
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 35210c1..0df779f 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1677,3 +1677,79 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
    $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+How to use LWM and Host Shaper
+------------------------------
+
+LWM introduction
+~~~~~~~~~~~~~~~~
+
+LWM (Limit WaterMark) is a per Rx queue attribute, configured as a percentage
+of the Rx queue size. When the count of available WQEs in an Rx queue drops
+below the LWM, an event is sent to the PMD.
+
+Host shaper introduction
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The host shaper register is a per host port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to the same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from the host.
+The host shaper can be set in two modes: immediate, or deferred to an
+LWM event trigger. In immediate mode, the rate limit is applied to the host
+shaper immediately. In deferred mode, the shaper is not set until an
+LWM event is received by any Rx queue in a VF representor belonging to the host
+port. The only rate supported for deferred mode is 100Mbps (there is no limit
+on the supported rates for immediate mode). In deferred mode, the shaper is set
+on the host port by the firmware upon receiving the LWM event, which allows
+throttling host traffic on LWM events at minimum latency, preventing excess
+drops in the Rx queue.
+
+Testpmd CLI examples
+~~~~~~~~~~~~~~~~~~~~
+
+Testpmd provides sample commands to configure LWM.
+It also contains sample logic to handle the LWM event.
+The typical workflow is: testpmd configures LWM for the Rx queues, enables
+lwm_triggered in the host shaper and registers a callback. When traffic from
+the host is too high and the available WQE count drops below LWM, the PMD
+receives an event and firmware automatically configures a 100Mbps shaper on the
+host port. The PMD then calls the registered callback, which waits a while to
+let the Rx queue drain, then disables the host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is the uplink, port 1
+is a VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable LWM in testpmd by:
+
+.. code-block:: console
+
+   testpmd> set port 1 host_shaper lwm_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 lwm 30
+   testpmd> set port 1 rxq 1 lwm 30
+
+The first command disables the current host shaper and enables LWM triggered
+mode. The remaining commands set LWM to 30% of the Rx queue size for both Rx
+queues. When traffic from the host is too high, the testpmd console logs the
+received LWM event, and the host shaper is then disabled.
+The traffic rate from the host is controlled and fewer drops occur in Rx queues.
+
+To disable LWM and lwm_triggered, invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> set port 1 host_shaper lwm_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 lwm 0
+   testpmd> set port 1 rxq 1 lwm 0
+
+It is recommended that an application disable LWM and lwm_triggered before
+exiting, if it enabled them earlier.
+
+The shaper can also be configured with a rate, in units of 100Mbps. The command
+below sets the current shaper to 5Gbps and disables lwm_triggered.
+
+.. code-block:: console
+
+   testpmd> set port 1 host_shaper lwm_triggered 0 rate 50
+
-- 
1.8.3.1



* Re: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
  2022-04-01  3:22 [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Spike Du
                   ` (5 preceding siblings ...)
  2022-04-01  3:22 ` [RFC 6/6] app/testpmd: add LWM and Host Shaper command Spike Du
@ 2022-04-05  8:58 ` Jerin Jacob
  2022-04-26  2:42   ` Spike Du
  2022-04-29  5:48   ` Spike Du
  6 siblings, 2 replies; 131+ messages in thread
From: Jerin Jacob @ 2022-04-05  8:58 UTC (permalink / raw)
  To: Spike Du, Andrew Rybchenko, Cristian Dumitrescu, Ferruh Yigit, techboard
  Cc: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Thomas Monjalon,
	dpdk-dev, Raslan Darawsheh

On Fri, Apr 1, 2022 at 8:53 AM Spike Du <spiked@nvidia.com> wrote:
>
> LWM(limit watermark) is per RX queue attribute, when RX queue fullness reach
> the LWM limit, HW sends an event to dpdk application.
> Host shaper can configure shaper rate and lwm-triggered for a host port.
> The shaper limits the rate of traffic from host port to wire port.
> If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
> when one of the host port's Rx queues receives LWM event.
>
> These two features can combine to control traffic from host port to wire port.
> The work flow is configure LWM to RX queue and enable lwm-triggered flag in
> host shaper, after receiving LWM event, delay a while until RX queue is empty
> , then disable the shaper. We recycle this work flow to reduce RX queue drops.
>
> Spike Du (6):
>   net/mlx5: add LWM support for Rxq
>   common/mlx5: share interrupt management
>   net/mlx5: add LWM event handling support
>   net/mlx5: add private API to configure Rxq LWM
>   net/mlx5: add private API to config host port shaper
>   app/testpmd: add LWM and Host Shaper command

+ @Andrew Rybchenko  @Ferruh Yigit cristian.dumitrescu@intel.com

I think case one can be easily abstracted by adding a new
rte_eth_event_type event, and case two can be abstracted via the
existing Rx meter framework in ethdev.

Also, updating generic testpmd to support a PMD-specific API should be
avoided. I know there is existing stuff like this in testpmd, but I think
we should have a policy on adding PMD-specific commands to testpmd.

There are around 56 PMDs in ethdev now. If every PMD adds its own
PMD-specific API to testpmd, it will become bloated. At a minimum, such
code should live in a separate file in testpmd if we choose to take that path.

+ @techboard@dpdk.org


* RE: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
  2022-04-05  8:58 ` [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Jerin Jacob
@ 2022-04-26  2:42   ` Spike Du
  2022-05-01 12:50     ` Jerin Jacob
  2022-04-29  5:48   ` Spike Du
  1 sibling, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-04-26  2:42 UTC (permalink / raw)
  To: Jerin Jacob, Andrew Rybchenko, Cristian Dumitrescu, Ferruh Yigit,
	techboard
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dpdk-dev, Raslan Darawsheh

Hi Jerin,	
	Thanks for your comments and sorry for the late response.

	For case one, I think I can refine the design and add LWM (limit watermark) to rte_eth_rxconf, and add a new rte_eth_event_type event.
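
	To illustrate the LWM semantics described in the patch documentation (LWM is a percentage of the Rx queue size, 0 disables it, and the event fires when the available WQE count drops below it), here is a minimal sketch. The helper names `lwm_threshold` and `lwm_event_due` are hypothetical and are not part of DPDK or the mlx5 PMD; rounding the threshold down is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert an LWM percentage (0..99, the range
 * testpmd accepts in this RFC) into a number of Rx descriptors (WQEs).
 * Rounding down is an assumption; the hardware may round differently.
 */
static uint32_t
lwm_threshold(uint16_t nb_desc, uint8_t lwm_percent)
{
	assert(lwm_percent <= 99);
	return ((uint32_t)nb_desc * lwm_percent) / 100;
}

/*
 * Hypothetical helper: an LWM of 0 means the feature is disabled.
 * Otherwise the event is due once the available WQE count falls
 * below the threshold.
 */
static bool
lwm_event_due(uint16_t nb_desc, uint16_t avail_wqe, uint8_t lwm_percent)
{
	if (lwm_percent == 0)
		return false;
	return avail_wqe < lwm_threshold(nb_desc, lwm_percent);
}
```

	With a 1024-entry queue and an LWM of 30, this sketch places the threshold at 307 descriptors, so the event would become due once fewer than 307 WQEs remain available.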

	For case two (host shaper), I think we can't use the RX meter, because it is actually a TX shaper on a remote system. It is quite specific to the Mellanox/Nvidia BlueField 2 (BF2 for short) NIC. The NIC contains an ARM system. We have two terms here: Host-system stands for the system the BF2 NIC is inserted into; ARM-system stands for the embedded ARM in BF2. The ARM-system does the forwarding. This is how the host shaper works: we configure the register on the ARM-system, but it affects the Host-system's TX shaper, which means the shaper works on the remote port. It is not an RX meter concept, hence we can't use the DPDK RX meter framework. I'd suggest still using a private API.
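
	As a rough model of the host shaper behaviour (a sketch only, not the PMD implementation), the state can be thought of as an lwm_triggered flag plus a rate in units of 100Mbps, where 0 means no shaper. The struct and function names below are hypothetical; only the behaviour (firmware applies a 100Mbps shaper on an LWM event when lwm_triggered is set, and the application later disables it) comes from the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one host port's shaper state. */
struct host_shaper_model {
	bool lwm_triggered;	/* deferred mode armed */
	uint8_t rate_units;	/* rate in units of 100Mbps, 0 = no shaper */
};

/* Models "set port <id> host_shaper lwm_triggered <0|1> rate <n>". */
static void
model_config(struct host_shaper_model *s, bool lwm_triggered, uint8_t rate)
{
	assert(s != NULL);
	s->lwm_triggered = lwm_triggered;
	s->rate_units = rate;
}

/* Models the firmware reaction to an LWM event on any Rx queue. */
static void
model_lwm_event(struct host_shaper_model *s)
{
	assert(s != NULL);
	if (s->lwm_triggered)
		s->rate_units = 1; /* the only deferred-mode rate: 100Mbps */
}

/* Models the application disabling the shaper after the delay. */
static void
model_disable(struct host_shaper_model *s)
{
	assert(s != NULL);
	s->rate_units = 0;
}

/* Current shaper rate in Mbps. */
static uint32_t
model_rate_mbps(const struct host_shaper_model *s)
{
	assert(s != NULL);
	return (uint32_t)s->rate_units * 100u;
}
```

	In this model, configuring lwm_triggered with rate 0 leaves the port unshaped until an LWM event arrives, at which point the rate becomes 100Mbps until the application disables it again.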

	For the testpmd part, I understand your concern. Because we need a private API for the host shaper, and we need testpmd's forwarding code to show users how it works, we need to call the private API in testpmd. If the current patch is not acceptable, what is the correct way to do it? Is there any framework to isolate PMD private logic from testpmd common code, while still allowing private APIs to be called in testpmd?


Regards,
Spike.





* RE: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
  2022-04-05  8:58 ` [RFC 0/6] net/mlx5: introduce limit watermark and host shaper Jerin Jacob
  2022-04-26  2:42   ` Spike Du
@ 2022-04-29  5:48   ` Spike Du
  1 sibling, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-04-29  5:48 UTC (permalink / raw)
  To: Jerin Jacob, Andrew Rybchenko, Cristian Dumitrescu, techboard
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dpdk-dev, Raslan Darawsheh

Hi Jerin,	
	Thanks for your comments and sorry for the late response.

	For case one, I think I can refine the design and add LWM (limit watermark) to rte_eth_rxconf, and add a new rte_eth_event_type event.

	For case two (host shaper), I think we can't use the RX meter, because it is actually a TX shaper on a remote system. It is quite specific to the Mellanox/Nvidia BlueField 2 (BF2 for short) NIC. The NIC contains an ARM system. We have two terms here: Host-system stands for the system the BF2 NIC is inserted into; ARM-system stands for the embedded ARM in BF2. The ARM-system does the forwarding. This is how the host shaper works: we configure the register on the ARM-system, but it affects the Host-system's TX shaper, which means the shaper works on the remote port. It is not an RX meter concept, hence we can't use the DPDK RX meter framework. I'd suggest still using a private API.

	For the testpmd part, I understand your concern. Because we need a private API for the host shaper, and we need testpmd's forwarding code to show users how it works, we need to call the private API in testpmd. If the current patch is not acceptable, what is the correct way to do it? Is there any framework to isolate PMD private logic from testpmd common code, while still allowing private APIs to be called in testpmd?


Regards,
Spike.



* Re: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
  2022-04-26  2:42   ` Spike Du
@ 2022-05-01 12:50     ` Jerin Jacob
  2022-05-02  3:58       ` Spike Du
  0 siblings, 1 reply; 131+ messages in thread
From: Jerin Jacob @ 2022-05-01 12:50 UTC (permalink / raw)
  To: Spike Du
  Cc: Andrew Rybchenko, Cristian Dumitrescu, Ferruh Yigit, techboard,
	Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dpdk-dev, Raslan Darawsheh

On Tue, Apr 26, 2022 at 8:12 AM Spike Du <spiked@nvidia.com> wrote:
>
> Hi Jerin,

Hi Spike,

>         Thanks for your comments and sorry for the late response.
>
>         For case one, I think I can refine the design and add LWM(limit watermark) in rte_eth_rxconf, and add a new rte_eth_event_type event.

OK.

>
>         For case two(host shaper), I think we can't use RX meter, because it's actually TX shaper on a remote system. It's quite specific to Mellanox/Nvidia BlueField 2(BF2 for short) NIC. The NIC contains an ARM system. We have two terms here: Host-system stands for the system the BF2 NIC is inserted; ARM-system stands for the embedded ARM in BF2. ARM-system is doing the forwarding. This is the way host shaper works: we configure the register on ARM-system, but it affects Host-system's TX shaper, which means the shaper is working on the remote port, it's not a RX meter concept, hence we can't use DPDK RX meter framework. I'd suggest to still use private API.

OK. If the host is using the DPDK application then rte_tm can be used
on the egress side to enable the same. If it is not DPDK, then yes, we
need private APIs.

>
>         For testpmd part, I understand your concern. Because we need one private API for host shaper, and we need testpmd's forwarding code to show how it works to user, we need to call the private API in testpmd. If current patch is not acceptable, what's the correct way to do it? Any framework to isolate the PMD private logic from testpmd common code, but still give a chance to call private APIs in testpmd?

Please check "PMD API" item in
http://mails.dpdk.org/archives/dev/2022-April/239191.html



* RE: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
  2022-05-01 12:50     ` Jerin Jacob
@ 2022-05-02  3:58       ` Spike Du
  0 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-02  3:58 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Andrew Rybchenko, Cristian Dumitrescu, Ferruh Yigit, techboard,
	Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dpdk-dev, Raslan Darawsheh

Hi Jerin,
	
> >         For case two(host shaper), I think we can't use RX meter, because it's
> actually TX shaper on a remote system. It's quite specific to Mellanox/Nvidia
> BlueField 2(BF2 for short) NIC. The NIC contains an ARM system. We have
> two terms here: Host-system stands for the system the BF2 NIC is inserted;
> ARM-system stands for the embedded ARM in BF2. ARM-system is doing the
> forwarding. This is the way host shaper works: we configure the register on
> ARM-system, but it affects Host-system's TX shaper, which means the
> shaper is working on the remote port, it's not a RX meter concept, hence we
> can't use DPDK RX meter framework. I'd suggest to still use private API.
> 
> OK. If the host is using the DPDK application then rte_tm can be used on the
> egress side to enable the same. If it is not DPDK, then yes, we need private
> APIs.
	I see your point. The RX drop happens on the ARM-system, so by then it is too late to notify the Host-system to reduce its traffic rate. To achieve dropless operation, MLX developed
	this feature to configure the host shaper on the remote port. The Host-system is flexible: it may or may not use DPDK.

Regards,
Spike.


> -----Original Message-----
> From: Jerin Jacob <jerinjacobk@gmail.com>
> Sent: Sunday, May 1, 2022 8:51 PM
> To: Spike Du <spiked@nvidia.com>
> Cc: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>; Cristian
> Dumitrescu <cristian.dumitrescu@intel.com>; Ferruh Yigit
> <ferruh.yigit@intel.com>; techboard@dpdk.org; Matan Azrad
> <matan@nvidia.com>; Slava Ovsiienko <viacheslavo@nvidia.com>; Ori Kam
> <orika@nvidia.com>; NBU-Contact-Thomas Monjalon (EXTERNAL)
> <thomas@monjalon.net>; dpdk-dev <dev@dpdk.org>; Raslan Darawsheh
> <rasland@nvidia.com>
> Subject: Re: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
> 
> External email: Use caution opening links or attachments
> 
> 
> On Tue, Apr 26, 2022 at 8:12 AM Spike Du <spiked@nvidia.com> wrote:
> >
> > Hi Jerin,
> 
> Hi Spike,
> 
> >         Thanks for your comments and sorry for the late response.
> >
> >         For case one, I think I can refine the design and add LWM(limit
> watermark) in rte_eth_rxconf, and add a new rte_eth_event_type event.
> 
> OK.
> 
> >
> >         For case two(host shaper), I think we can't use RX meter, because it's
> actually TX shaper on a remote system. It's quite specific to Mellanox/Nvidia
> BlueField 2(BF2 for short) NIC. The NIC contains an ARM system. We have
> two terms here: Host-system stands for the system the BF2 NIC is inserted;
> ARM-system stands for the embedded ARM in BF2. ARM-system is doing the
> forwarding. This is the way host shaper works: we configure the register on
> ARM-system, but it affects Host-system's TX shaper, which means the
> shaper is working on the remote port, it's not a RX meter concept, hence we
> can't use DPDK RX meter framework. I'd suggest to still use private API.
> 
> OK. If the host is using the DPDK application then rte_tm can be used on the
> egress side to enable the same. If it is not DPDK, then yes, we need private
> APIs.
> 
> >
> >         For testpmd part, I understand your concern. Because we need one
> private API for host shaper, and we need testpmd's forwarding code to show
> how it works to user, we need to call the private API in testpmd. If current
> patch is not acceptable, what's the correct way to do it? Any framework to
> isolate the PMD private logic from testpmd common code, but still give a
> chance to call private APIs in testpmd?
> 
> Please check "PMD API" item in
> http://mails.dpdk.org/archives/dev/2022-April/239191.html
> 
> >
> >
> > Regards,
> > Spike.
> >
> >
> >
> > > -----Original Message-----
> > > From: Jerin Jacob <jerinjacobk@gmail.com>
> > > Sent: Tuesday, April 5, 2022 4:59 PM
> > > To: Spike Du <spiked@nvidia.com>; Andrew Rybchenko
> > > <andrew.rybchenko@oktetlabs.ru>; Cristian Dumitrescu
> > > <cristian.dumitrescu@intel.com>; Ferruh Yigit
> > > <ferruh.yigit@intel.com>; techboard@dpdk.org
> > > Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > > <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> > > Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dpdk-dev
> > > <dev@dpdk.org>; Raslan Darawsheh <rasland@nvidia.com>
> > > Subject: Re: [RFC 0/6] net/mlx5: introduce limit watermark and host
> > > shaper
> > >
> > >
> > > On Fri, Apr 1, 2022 at 8:53 AM Spike Du <spiked@nvidia.com> wrote:
> > > >
> > > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > > > fullness reach the LWM limit, HW sends an event to dpdk application.
> > > > Host shaper can configure shaper rate and lwm-triggered for a host port.
> > > > The shaper limits the rate of traffic from host port to wire port.
> > > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > > > automatically when one of the host port's Rx queues receives LWM
> event.
> > > >
> > > > These two features can combine to control traffic from host port
> > > > to wire
> > > port.
> > > > The work flow is configure LWM to RX queue and enable
> > > > lwm-triggered flag in host shaper, after receiving LWM event,
> > > > delay a while until RX queue is empty , then disable the shaper.
> > > > We recycle this work flow to
> > > reduce RX queue drops.
> > > >
> > > > Spike Du (6):
> > > >   net/mlx5: add LWM support for Rxq
> > > >   common/mlx5: share interrupt management
> > > >   net/mlx5: add LWM event handling support
> > > >   net/mlx5: add private API to configure Rxq LWM
> > > >   net/mlx5: add private API to config host port shaper
> > > >   app/testpmd: add LWM and Host Shaper command
> > >
> > > + @Andrew Rybchenko  @Ferruh Yigit cristian.dumitrescu@intel.com
> > >
> > > I think, case one, can be easily abstracted via adding new
> > > rte_eth_event_type event and case two can be abstracted via the
> > > existing Rx meter framework in ethdev.
> > >
> > > Also, Updating generic testpmd to support PMD specific API should be
> > > avoided, I know there is existing stuff in testpmd, I think, we
> > > should have the policy to add PMD specific commands to testpmd.
> > >
> > > There are around 56PMDs in ethdev now, If PMDs try to add PMD
> > > specific API in testpmd it will be bloated or at minimum, it should
> > > a separate file in testpmd if we choose to take that path.
> > >
> > > + @techboard@dpdk.org

* [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper
  2022-04-01  3:22 ` [RFC 1/6] net/mlx5: add LWM support for Rxq Spike Du
@ 2022-05-06  3:56   ` Spike Du
  2022-05-06  3:56     ` [RFC v1 1/7] net/mlx5: add LWM support for Rxq Spike Du
                       ` (7 more replies)
  0 siblings, 8 replies; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

LWM (limit watermark) is a per RX queue attribute: when the RX queue fullness
reaches the LWM limit, the HW sends an event to the DPDK application.
The host shaper can configure the shaper rate and the lwm-triggered flag for a
host port. The shaper limits the rate of traffic from the host port to the wire
port. If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM event.

These two features can be combined to control traffic from the host port to the
wire port. The workflow is: configure LWM on the RX queue and enable the
lwm-triggered flag in the host shaper; after receiving an LWM event, wait until
the RX queue is empty, then disable the shaper. Repeating this workflow reduces
RX queue drops.
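
As a rough illustration of the workflow above, here is a small self-contained
model in C. It is only a sketch and does not use the mlx5 or ethdev APIs: the
struct, the function names and the "fullness as a percentage of queue size"
comparison are assumptions made for the example. Only the 100Mbps rate and the
order of the steps come from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one host port with a single Rx queue. */
struct host_port_model {
	uint32_t rxq_size;    /* Rx queue depth in descriptors. */
	uint32_t rxq_used;    /* Descriptors currently holding packets. */
	uint8_t lwm_percent;  /* LWM as a percentage of queue size, 0 = off. */
	bool lwm_triggered;   /* Enable the shaper automatically on LWM event. */
	uint32_t shaper_mbps; /* 0 means no shaper is active. */
};

/* Rate of the automatically enabled shaper, as stated in the cover letter. */
#define LWM_SHAPER_MBPS 100u

/* True when the queue fullness has reached the watermark (assumed rule). */
static bool
model_lwm_reached(const struct host_port_model *p)
{
	if (p->lwm_percent == 0)
		return false;
	return (uint64_t)p->rxq_used * 100 >=
	       (uint64_t)p->rxq_size * p->lwm_percent;
}

/* "HW" side: packets arrive; the shaper kicks in on an LWM event. */
static void
model_hw_enqueue(struct host_port_model *p, uint32_t pkts)
{
	uint32_t room;

	assert(p->rxq_used <= p->rxq_size);
	room = p->rxq_size - p->rxq_used;
	p->rxq_used += pkts < room ? pkts : room;
	if (p->lwm_triggered && model_lwm_reached(p))
		p->shaper_mbps = LWM_SHAPER_MBPS;
}

/* Application side: drain the queue, disable the shaper once it is empty. */
static void
model_app_drain(struct host_port_model *p, uint32_t pkts)
{
	p->rxq_used -= pkts < p->rxq_used ? pkts : p->rxq_used;
	if (p->rxq_used == 0)
		p->shaper_mbps = 0;
}
```

With a 512-entry queue and a 50% watermark, this model enables the shaper when
the 256th descriptor is filled and releases it only after the queue has been
drained completely, which is the cycle the workflow repeats.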

Add a new ethdev API to set LWM, and add the event
RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle the LWM event. The host shaper does
not fit the existing DPDK framework and is specific to Nvidia NICs, so it uses
a PMD private API.

For integration with testpmd, put the private cmdline function and the LWM event
handler in the mlx5 PMD directory by adding a new file, mlx5_test.c. Only minimal
code is added to testpmd to invoke the interfaces from mlx5_test.c.

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based limit watermark
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based limit watermark
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c                       |  74 ++++++++
 app/test-pmd/config.c                        |  23 +++
 app/test-pmd/meson.build                     |   3 +
 app/test-pmd/testpmd.c                       |  13 ++
 app/test-pmd/testpmd.h                       |   1 +
 doc/guides/nics/mlx5.rst                     |  87 +++++++++
 doc/guides/rel_notes/release_22_07.rst       |   2 +
 drivers/common/mlx5/linux/meson.build        |  44 +++--
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/mlx5_prm.h               |  26 +++
 drivers/common/mlx5/version.map              |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 --------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 +++-----------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +-----
 drivers/net/mlx5/meson.build                 |   7 +-
 drivers/net/mlx5/mlx5.c                      |  62 +++++++
 drivers/net/mlx5/mlx5.h                      |  12 +-
 drivers/net/mlx5/mlx5_devx.c                 |  60 ++++++-
 drivers/net/mlx5/mlx5_devx.h                 |   1 +
 drivers/net/mlx5/mlx5_rx.c                   | 253 +++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h                   |  11 ++
 drivers/net/mlx5/mlx5_test.c                 | 191 ++++++++++++++++++++
 drivers/net/mlx5/mlx5_test.h                 |  27 +++
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h              |  30 ++++
 drivers/net/mlx5/version.map                 |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 ---
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  52 +-----
 lib/ethdev/ethdev_driver.h                   |   7 +
 lib/ethdev/rte_ethdev.c                      |  28 +++
 lib/ethdev/rte_ethdev.h                      |  30 +++-
 lib/ethdev/version.map                       |   3 +
 34 files changed, 1193 insertions(+), 331 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_test.c
 create mode 100644 drivers/net/mlx5/mlx5_test.h

-- 
1.8.3.1


* [RFC v1 1/7] net/mlx5: add LWM support for Rxq
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
@ 2022-05-06  3:56     ` Spike Du
  2022-05-06  3:56     ` [RFC v1 2/7] common/mlx5: share interrupt management Spike Du
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add an lwm (Limit WaterMark) field to the Rxq object. It indicates the
percentage of the RX queue size at which the HW raises an LWM event to the user.
Allow setting LWM in the modify_rq command.
Allow LWM to be configured dynamically by adding the RDY2RDY state change.
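
The state handling described above can be summarised with a small standalone
sketch. The enum and struct names below are illustrative stand-ins, not the
driver's types (the patch itself uses MLX5_RQC_STATE_* and
enum mlx5_rxq_modify_type). The sketch only mirrors the rule in the diff below:
RST2RDY carries the LWM only when it is non-zero, while RDY2RDY always carries
it, so a value of 0 can be written to a running queue.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the RQ states and modify types. */
enum rq_state { RQ_STATE_RST, RQ_STATE_RDY, RQ_STATE_ERR };
enum rq_mod_type {
	RQ_MOD_RST2RDY,
	RQ_MOD_RDY2ERR,
	RQ_MOD_RDY2RST,
	RQ_MOD_RDY2RDY,
};

/* Hypothetical summary of what a modify request would contain. */
struct rq_modify_req {
	enum rq_state cur;  /* State the queue must currently be in. */
	enum rq_state next; /* State requested by the command. */
	bool set_lwm;       /* Whether the LWM field is part of the command. */
	uint16_t lwm;       /* LWM value, meaningful only if set_lwm is true. */
};

static struct rq_modify_req
rq_modify_build(enum rq_mod_type type, uint16_t lwm)
{
	struct rq_modify_req req = { RQ_STATE_RST, RQ_STATE_RST, false, 0 };

	switch (type) {
	case RQ_MOD_RST2RDY:
		req.cur = RQ_STATE_RST;
		req.next = RQ_STATE_RDY;
		/* Only program LWM at start-up if one was configured. */
		if (lwm != 0) {
			req.set_lwm = true;
			req.lwm = lwm;
		}
		break;
	case RQ_MOD_RDY2ERR:
		req.cur = RQ_STATE_RDY;
		req.next = RQ_STATE_ERR;
		break;
	case RQ_MOD_RDY2RST:
		req.cur = RQ_STATE_RDY;
		req.next = RQ_STATE_RST;
		break;
	case RQ_MOD_RDY2RDY:
		/* State is unchanged; the command exists to update LWM. */
		req.cur = RQ_STATE_RDY;
		req.next = RQ_STATE_RDY;
		req.set_lwm = true;
		req.lwm = lwm;
		break;
	}
	return req;
}
```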

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.h      |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 ++++++++++++-
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 23a28f6..f3e6682 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1391,6 +1391,7 @@ enum mlx5_rxq_modify_type {
 	MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
 	MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
 	MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+	MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 03c0fac..4fbfcaa 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
 	case MLX5_RXQ_MOD_RST2RDY:
 		rq_attr.rq_state = MLX5_RQC_STATE_RST;
 		rq_attr.state = MLX5_RQC_STATE_RDY;
+		if (rxq->lwm) {
+			rq_attr.modify_bitmask |=
+				MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+			rq_attr.lwm = rxq->lwm;
+		}
 		break;
 	case MLX5_RXQ_MOD_RDY2ERR:
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
 		rq_attr.state = MLX5_RQC_STATE_RST;
 		break;
+	case MLX5_RXQ_MOD_RDY2RDY:
+		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+		rq_attr.state = MLX5_RQC_STATE_RDY;
+		rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+		rq_attr.lwm = rxq->lwm;
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 			 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
+	uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1


* [RFC v1 2/7] common/mlx5: share interrupt management
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
  2022-05-06  3:56     ` [RFC v1 1/7] net/mlx5: add LWM support for Rxq Spike Du
@ 2022-05-06  3:56     ` Spike Du
  2022-05-06  3:56     ` [RFC v1 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
                       ` (5 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this and replace all the related PMD code with
this API.
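
One of the steps folded into the shared helper is switching the fd to
non-blocking mode. A minimal POSIX-only sketch of that step follows. The
function name fd_set_nonblock is hypothetical and not part of the patch, and
unlike the helper in the diff it also checks the result of F_GETFL before
using it.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Set O_NONBLOCK on fd while keeping its other status flags.
 * Returns 0 on success, a negative errno value otherwise.
 */
static int
fd_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -errno;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -errno;
	return 0;
}
```

Reading the current flags first and OR-ing in O_NONBLOCK keeps whatever other
status flags the fd already has, which is why the callers replaced by this
patch all used the same F_GETFL/F_SETFL pair.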

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map              |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 --------------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++++---------------------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 ++---------
 drivers/net/mlx5/mlx5.h                      |   2 -
 drivers/net/mlx5/mlx5_txpp.c                 |  28 ++----
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 -----
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  52 ++---------
 11 files changed, 217 insertions(+), 312 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include <dirent.h>
 #include <net/if.h>
+#include <fcntl.h>
 
 #include <rte_errno.h>
 #include <rte_string_fns.h>
@@ -964,3 +965,133 @@
 		claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
 	memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	struct rte_intr_handle *tmp_intr_handle;
+	int ret, flags;
+
+	tmp_intr_handle = rte_intr_instance_alloc(mode);
+	if (!tmp_intr_handle) {
+		rte_errno = ENOMEM;
+		goto err;
+	}
+	if (set_fd_nonblock) {
+		flags = fcntl(fd, F_GETFL);
+		ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+		if (ret) {
+			rte_errno = errno;
+			goto err;
+		}
+	}
+	ret = rte_intr_fd_set(tmp_intr_handle, fd);
+	if (ret)
+		goto err;
+	ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+	if (ret)
+		goto err;
+	ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+	if (ret) {
+		rte_errno = -ret;
+		goto err;
+	}
+	return tmp_intr_handle;
+err:
+	if (tmp_intr_handle)
+		rte_intr_instance_free(tmp_intr_handle);
+	return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+			      rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+	uint64_t twait = 0;
+	uint64_t start = 0;
+
+	do {
+		int ret;
+
+		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+		if (ret >= 0)
+			return;
+		if (ret != -EAGAIN) {
+			DRV_LOG(INFO, "failed to unregister interrupt"
+				      " handler (error: %d)", ret);
+			MLX5_ASSERT(false);
+			return;
+		}
+		if (twait) {
+			struct timespec onems;
+
+			/* Wait one millisecond and try again. */
+			onems.tv_sec = 0;
+			onems.tv_nsec = NS_PER_S / MS_PER_S;
+			nanosleep(&onems, 0);
+			/* Check whether one second elapsed. */
+			if ((rte_get_timer_cycles() - start) <= twait)
+				continue;
+		} else {
+			/*
+			 * We get the amount of timer ticks for one second.
+			 * If this amount elapsed it means we spent one
+			 * second in waiting. This branch is executed once
+			 * on first iteration.
+			 */
+			twait = rte_get_timer_hz();
+			MLX5_ASSERT(twait);
+		}
+		/*
+		 * Timeout elapsed, show message (once a second) and retry.
+		 * We have no other acceptable option here, if we ignore
+		 * the unregistering return code the handler will not
+		 * be unregistered, fd will be closed and we may get the
+		 * crush. Hanging and messaging in the loop seems not to be
+		 * the worst choice.
+		 */
+		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+		start = rte_get_timer_cycles();
+	} while (true);
+}
+
+/**
+ * Rte_intr_handle destroy helper.
+ *
+ * @param[in] intr_handle
+ *   Rte_intr_handle to destroy.
+ * @param[in] cb
+ *   Callback which is registered to intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ */
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	if (rte_intr_fd_get(intr_handle) >= 0)
+		mlx5_intr_callback_unregister(intr_handle, cb, cb_arg);
+	rte_intr_instance_free(intr_handle);
+}
diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index 27f1192..479bb3c 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -15,6 +15,7 @@
 #include <rte_log.h>
 #include <rte_kvargs.h>
 #include <rte_devargs.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -299,4 +300,14 @@
 int
 mlx5_get_device_guid(const struct rte_pci_addr *dev, uint8_t *guid, size_t len);
 
+__rte_internal
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg);
+
+__rte_internal
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg);
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a23a30a..2900544 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -152,6 +152,7 @@ INTERNAL {
 	mlx5_mp_req_mempool_reg;
 	mlx5_mr_mempool2mr_bh;
 	mlx5_mr_mempool_populate_cache;
-
+	mlx5_os_interrupt_handler_create; # WINDOWS_NO_EXPORT
+	mlx5_os_interrupt_handler_destroy; # WINDOWS_NO_EXPORT
 	local: *;
 };
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index ee7973f..e9e9108 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -9,6 +9,7 @@
 #include <sys/types.h>
 
 #include <rte_errno.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -253,4 +254,27 @@
 __rte_internal
 int mlx5_os_umem_dereg(void *pumem);
 
+static inline struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)mode;
+	(void)set_fd_nonblock;
+	(void)fd;
+	(void)cb;
+	(void)cb_arg;
+	rte_errno = ENOTSUP;
+	return NULL;
+}
+
+static inline void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)intr_handle;
+	(void)cb;
+	(void)cb_arg;
+}
+
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 8fe73f1..a276b2b 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -881,77 +881,6 @@ struct ethtool_link_settings {
 	}
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	/*
-	 * Try to reduce timeout management overhead by not calling
-	 * the timer related routines on the first iteration. If the
-	 * unregistering succeeds on first call there will be no
-	 * timer calls at all.
-	 */
-	uint64_t twait = 0;
-	uint64_t start = 0;
-
-	do {
-		int ret;
-
-		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
-		if (ret >= 0)
-			return;
-		if (ret != -EAGAIN) {
-			DRV_LOG(INFO, "failed to unregister interrupt"
-				      " handler (error: %d)", ret);
-			MLX5_ASSERT(false);
-			return;
-		}
-		if (twait) {
-			struct timespec onems;
-
-			/* Wait one millisecond and try again. */
-			onems.tv_sec = 0;
-			onems.tv_nsec = NS_PER_S / MS_PER_S;
-			nanosleep(&onems, 0);
-			/* Check whether one second elapsed. */
-			if ((rte_get_timer_cycles() - start) <= twait)
-				continue;
-		} else {
-			/*
-			 * We get the amount of timer ticks for one second.
-			 * If this amount elapsed it means we spent one
-			 * second in waiting. This branch is executed once
-			 * on first iteration.
-			 */
-			twait = rte_get_timer_hz();
-			MLX5_ASSERT(twait);
-		}
-		/*
-		 * Timeout elapsed, show message (once a second) and retry.
-		 * We have no other acceptable option here, if we ignore
-		 * the unregistering return code the handler will not
-		 * be unregistered, fd will be closed and we may get the
-		 * crush. Hanging and messaging in the loop seems not to be
-		 * the worst choice.
-		 */
-		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
-		start = rte_get_timer_cycles();
-	} while (true);
-}
-
 /**
  * Handle DEVX interrupts from the NIC.
  * This function is probably called from the DPDK host thread.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index a821153..0741028 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2494,40 +2494,6 @@
 	mlx5_pmd_socket_uninit();
 }
 
-static int
-mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
-{
-	int nlsk_fd, flags, ret;
-
-	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
-	if (nlsk_fd < 0) {
-		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-	flags = fcntl(nlsk_fd, F_GETFL);
-	ret = fcntl(nlsk_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret != 0) {
-		DRV_LOG(ERR, "Failed to make Netlink event socket non-blocking: %s",
-			strerror(errno));
-		rte_errno = errno;
-		goto error;
-	}
-	rte_intr_type_set(sh->intr_handle_nl, RTE_INTR_HANDLE_EXT);
-	rte_intr_fd_set(sh->intr_handle_nl, nlsk_fd);
-	if (rte_intr_callback_register(sh->intr_handle_nl,
-				       mlx5_dev_interrupt_handler_nl,
-				       sh) != 0) {
-		DRV_LOG(ERR, "Failed to register Netlink events interrupt");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-		goto error;
-	}
-	return 0;
-error:
-	close(nlsk_fd);
-	return -1;
-}
-
 /**
  * Install shared asynchronous device events handler.
  * This function is implemented to support event sharing
@@ -2539,76 +2505,47 @@
 void
 mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
-	int ret;
-	int flags;
 	struct ibv_context *ctx = sh->cdev->ctx;
+	int nlsk_fd;
 
-	sh->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
+	sh->intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 ctx->async_fd, mlx5_dev_interrupt_handler, sh);
+	if (!sh->intr_handle) {
+		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle, -1);
-
-	flags = fcntl(ctx->async_fd, F_GETFL);
-	ret = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret) {
-		DRV_LOG(INFO, "failed to change file descriptor async event"
-			" queue");
-	} else {
-		rte_intr_fd_set(sh->intr_handle, ctx->async_fd);
-		rte_intr_type_set(sh->intr_handle, RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle,
-					mlx5_dev_interrupt_handler, sh)) {
-			DRV_LOG(INFO, "Fail to install the shared interrupt.");
-			rte_intr_fd_set(sh->intr_handle, -1);
-		}
+	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
+	if (nlsk_fd < 0) {
+		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
+			rte_strerror(rte_errno));
+		return;
 	}
-	sh->intr_handle_nl = rte_intr_instance_alloc
-						(RTE_INTR_INSTANCE_F_SHARED);
+	sh->intr_handle_nl = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 nlsk_fd, mlx5_dev_interrupt_handler_nl, sh);
 	if (sh->intr_handle_nl == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle_nl, -1);
-	if (mlx5_os_dev_shared_handler_install_lsc(sh) < 0) {
-		DRV_LOG(INFO, "Fail to install the shared Netlink event handler.");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-	}
 	if (sh->cdev->config.devx) {
 #ifdef HAVE_IBV_DEVX_ASYNC
-		sh->intr_handle_devx =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-		if (!sh->intr_handle_devx) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
-			rte_errno = ENOMEM;
-			return;
-		}
-		rte_intr_fd_set(sh->intr_handle_devx, -1);
+		struct mlx5dv_devx_cmd_comp *devx_comp;
+
 		sh->devx_comp = (void *)mlx5_glue->devx_create_cmd_comp(ctx);
-		struct mlx5dv_devx_cmd_comp *devx_comp = sh->devx_comp;
+		devx_comp = sh->devx_comp;
 		if (!devx_comp) {
 			DRV_LOG(INFO, "failed to allocate devx_comp.");
 			return;
 		}
-		flags = fcntl(devx_comp->fd, F_GETFL);
-		ret = fcntl(devx_comp->fd, F_SETFL, flags | O_NONBLOCK);
-		if (ret) {
-			DRV_LOG(INFO, "failed to change file descriptor"
-				" devx comp");
+		sh->intr_handle_devx = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 devx_comp->fd,
+			 mlx5_dev_interrupt_handler_devx, sh);
+		if (!sh->intr_handle_devx) {
+			DRV_LOG(ERR, "Failed to allocate intr_handle.");
 			return;
 		}
-		rte_intr_fd_set(sh->intr_handle_devx, devx_comp->fd);
-		rte_intr_type_set(sh->intr_handle_devx,
-					 RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle_devx,
-					mlx5_dev_interrupt_handler_devx, sh)) {
-			DRV_LOG(INFO, "Fail to install the devx shared"
-				" interrupt.");
-			rte_intr_fd_set(sh->intr_handle_devx, -1);
-		}
 #endif /* HAVE_IBV_DEVX_ASYNC */
 	}
 }
@@ -2624,24 +2561,13 @@
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
-	int nlsk_fd;
-
-	if (rte_intr_fd_get(sh->intr_handle) >= 0)
-		mlx5_intr_callback_unregister(sh->intr_handle,
-					      mlx5_dev_interrupt_handler, sh);
-	rte_intr_instance_free(sh->intr_handle);
-	nlsk_fd = rte_intr_fd_get(sh->intr_handle_nl);
-	if (nlsk_fd >= 0) {
-		mlx5_intr_callback_unregister
-			(sh->intr_handle_nl, mlx5_dev_interrupt_handler_nl, sh);
-		close(nlsk_fd);
-	}
-	rte_intr_instance_free(sh->intr_handle_nl);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
+					  mlx5_dev_interrupt_handler, sh);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
+					  mlx5_dev_interrupt_handler_nl, sh);
 #ifdef HAVE_IBV_DEVX_ASYNC
-	if (rte_intr_fd_get(sh->intr_handle_devx) >= 0)
-		rte_intr_callback_unregister(sh->intr_handle_devx,
-				  mlx5_dev_interrupt_handler_devx, sh);
-	rte_intr_instance_free(sh->intr_handle_devx);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
+					  mlx5_dev_interrupt_handler_devx, sh);
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
diff --git a/drivers/net/mlx5/linux/mlx5_socket.c b/drivers/net/mlx5/linux/mlx5_socket.c
index 4882e5f..0e01aff 100644
--- a/drivers/net/mlx5/linux/mlx5_socket.c
+++ b/drivers/net/mlx5/linux/mlx5_socket.c
@@ -134,51 +134,6 @@
 }
 
 /**
- * Install interrupt handler.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @return
- *   0 on success, a negative errno value otherwise.
- */
-static int
-mlx5_pmd_interrupt_handler_install(void)
-{
-	MLX5_ASSERT(server_socket != -1);
-
-	server_intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
-	if (server_intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
-	if (rte_intr_fd_set(server_intr_handle, server_socket))
-		return -rte_errno;
-
-	if (rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	return rte_intr_callback_register(server_intr_handle,
-					  mlx5_pmd_socket_handle, NULL);
-}
-
-/**
- * Uninstall interrupt handler.
- */
-static void
-mlx5_pmd_interrupt_handler_uninstall(void)
-{
-	if (server_socket != -1) {
-		mlx5_intr_callback_unregister(server_intr_handle,
-					      mlx5_pmd_socket_handle,
-					      NULL);
-	}
-	rte_intr_fd_set(server_intr_handle, 0);
-	rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_UNKNOWN);
-	rte_intr_instance_free(server_intr_handle);
-}
-
-/**
  * Initialise the socket to communicate with external tools.
  *
  * @return
@@ -224,7 +179,10 @@
 			strerror(errno));
 		goto remove;
 	}
-	if (mlx5_pmd_interrupt_handler_install()) {
+	server_intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_PRIVATE, false,
+		 server_socket, mlx5_pmd_socket_handle, NULL);
+	if (server_intr_handle == NULL) {
 		DRV_LOG(WARNING, "cannot register interrupt handler for mlx5 socket: %s",
 			strerror(errno));
 		goto remove;
@@ -248,7 +206,8 @@
 {
 	if (server_socket == -1)
 		return;
-	mlx5_pmd_interrupt_handler_uninstall();
+	mlx5_os_interrupt_handler_destroy(server_intr_handle,
+					  mlx5_pmd_socket_handle, NULL);
 	claim_zero(close(server_socket));
 	server_socket = -1;
 	MKSTR(path, MLX5_SOCKET_PATH, getpid());
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index f3e6682..4821ff0 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1678,8 +1678,6 @@ int mlx5_sysfs_switch_info(unsigned int ifindex,
 			   struct mlx5_switch_info *info);
 void mlx5_translate_port_name(const char *port_name_in,
 			      struct mlx5_switch_info *port_info_out);
-void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-				   rte_intr_callback_fn cb_fn, void *cb_arg);
 int mlx5_sysfs_bond_info(unsigned int pf_ifindex, unsigned int *ifindex,
 			 char *ifname);
 int mlx5_get_module_info(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index fe74317..f853a67 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -741,11 +741,8 @@
 static void
 mlx5_txpp_stop_service(struct mlx5_dev_ctx_shared *sh)
 {
-	if (!rte_intr_fd_get(sh->txpp.intr_handle))
-		return;
-	mlx5_intr_callback_unregister(sh->txpp.intr_handle,
-				      mlx5_txpp_interrupt_handler, sh);
-	rte_intr_instance_free(sh->txpp.intr_handle);
+	mlx5_os_interrupt_handler_destroy(sh->txpp.intr_handle,
+					  mlx5_txpp_interrupt_handler, sh);
 }
 
 /* Attach interrupt handler and fires first request to Rearm Queue. */
@@ -769,23 +766,12 @@
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	sh->txpp.intr_handle =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->txpp.intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
 	fd = mlx5_os_get_devx_channel_fd(sh->txpp.echan);
-	if (rte_intr_fd_set(sh->txpp.intr_handle, fd))
-		return -rte_errno;
-
-	if (rte_intr_type_set(sh->txpp.intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	if (rte_intr_callback_register(sh->txpp.intr_handle,
-				       mlx5_txpp_interrupt_handler, sh)) {
-		rte_intr_fd_set(sh->txpp.intr_handle, 0);
-		DRV_LOG(ERR, "Failed to register CQE interrupt %d.", rte_errno);
+	sh->txpp.intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, false,
+		 fd, mlx5_txpp_interrupt_handler, sh);
+	if (!sh->txpp.intr_handle) {
+		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		return -rte_errno;
 	}
 	/* Subscribe CQ event to the event channel controlled by the driver. */
diff --git a/drivers/net/mlx5/windows/mlx5_ethdev_os.c b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
index f975265..88d8213 100644
--- a/drivers/net/mlx5/windows/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
@@ -140,28 +140,6 @@
 	return 0;
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	RTE_SET_USED(handle);
-	RTE_SET_USED(cb_fn);
-	RTE_SET_USED(cb_arg);
-}
-
 /**
  * DPDK callback to get flow control status.
  *
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
index 3416797..2ca48f5 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
@@ -59,26 +59,10 @@
 mlx5_vdpa_virtq_unset(struct mlx5_vdpa_virtq *virtq)
 {
 	unsigned int i;
-	int retries = MLX5_VDPA_INTR_RETRIES;
-	int ret = -EAGAIN;
-
-	if (rte_intr_fd_get(virtq->intr_handle) != -1) {
-		while (retries-- && ret == -EAGAIN) {
-			ret = rte_intr_callback_unregister(virtq->intr_handle,
-							mlx5_vdpa_virtq_handler,
-							virtq);
-			if (ret == -EAGAIN) {
-				DRV_LOG(DEBUG, "Try again to unregister fd %d "
-				"of virtq %d interrupt, retries = %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				(int)virtq->index, retries);
-
-				usleep(MLX5_VDPA_INTR_RETRIES_USEC);
-			}
-		}
-		rte_intr_fd_set(virtq->intr_handle, -1);
-	}
-	rte_intr_instance_free(virtq->intr_handle);
+	int ret;
+
+	mlx5_os_interrupt_handler_destroy(virtq->intr_handle,
+					  mlx5_vdpa_virtq_handler, virtq);
 	if (virtq->virtq) {
 		ret = mlx5_vdpa_virtq_stop(virtq->priv, virtq->index);
 		if (ret)
@@ -342,35 +326,13 @@
 	virtq->priv = priv;
 	rte_write32(virtq->index, priv->virtq_db_addr);
 	/* Setup doorbell mapping. */
-	virtq->intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+	virtq->intr_handle = mlx5_os_interrupt_handler_create(
+				  RTE_INTR_INSTANCE_F_SHARED, false,
+				  vq.kickfd, mlx5_vdpa_virtq_handler, virtq);
 	if (virtq->intr_handle == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		goto error;
 	}
-
-	if (rte_intr_fd_set(virtq->intr_handle, vq.kickfd))
-		goto error;
-
-	if (rte_intr_fd_get(virtq->intr_handle) == -1) {
-		DRV_LOG(WARNING, "Virtq %d kickfd is invalid.", index);
-	} else {
-		if (rte_intr_type_set(virtq->intr_handle, RTE_INTR_HANDLE_EXT))
-			goto error;
-
-		if (rte_intr_callback_register(virtq->intr_handle,
-					       mlx5_vdpa_virtq_handler,
-					       virtq)) {
-			rte_intr_fd_set(virtq->intr_handle, -1);
-			DRV_LOG(ERR, "Failed to register virtq %d interrupt.",
-				index);
-			goto error;
-		} else {
-			DRV_LOG(DEBUG, "Register fd %d interrupt for virtq %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				index);
-		}
-	}
 	/* Subscribe virtq error event. */
 	virtq->version++;
 	cookie = ((uint64_t)virtq->version << 32) + index;
-- 
1.8.3.1



* [RFC v1 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
  2022-05-06  3:56     ` [RFC v1 1/7] net/mlx5: add LWM support for Rxq Spike Du
  2022-05-06  3:56     ` [RFC v1 2/7] common/mlx5: share interrupt management Spike Du
@ 2022-05-06  3:56     ` Spike Du
  2022-05-19  9:37       ` Andrew Rybchenko
  2022-05-06  3:56     ` [RFC v1 4/7] net/mlx5: add LWM event handling support Spike Du
                       ` (4 subsequent siblings)
  7 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

LWM (limit watermark) is a per-Rx-queue attribute. When the number of
usable descriptors in the Rx queue drops below the watermark, the DPDK
application is notified with the RTE_ETH_EVENT_RXQ_LIMIT_REACHED event.
To simplify configuration, LWM is expressed as a percentage of the Rx
queue size (number of descriptors), with a valid range of [0,99].
Setting LWM to 0 disables it.
Add the LWM configuration handler to eth_dev_ops.
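
As a rough illustration of the semantics above, the sketch below models
when the event would be due. The helper name lwm_limit_reached() and the
exact comparison are assumptions made for illustration only; the real
check is done by the device, not by application code.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of this patch): model the LWM condition.
 * nb_desc - Rx queue size in descriptors
 * usable  - descriptors currently available to the NIC
 * lwm_pct - watermark as a percentage of nb_desc, valid range [0,99]
 * Returns -EINVAL for an out-of-range percentage, 0 if LWM is disabled
 * or not reached, 1 if the usable descriptors are under the watermark.
 */
static int
lwm_limit_reached(uint32_t nb_desc, uint32_t usable, uint8_t lwm_pct)
{
	if (lwm_pct > 99)
		return -EINVAL;
	if (lwm_pct == 0)
		return 0; /* 0 means the watermark is disabled. */
	/* Compare usable/nb_desc < lwm_pct/100 without a division. */
	return (uint64_t)usable * 100 < (uint64_t)nb_desc * lwm_pct;
}
```

With a 1024-descriptor queue and LWM set to 30, this model reports the
limit as reached once fewer than 308 descriptors remain usable.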

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 lib/ethdev/ethdev_driver.h |  7 +++++++
 lib/ethdev/rte_ethdev.c    | 28 ++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 30 +++++++++++++++++++++++++++++-
 lib/ethdev/version.map     |  3 +++
 4 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc2..1e9cdbf 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,10 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
 				    const struct rte_eth_rxconf *rx_conf,
 				    struct rte_mempool *mb_pool);
 
+typedef int (*eth_rx_queue_set_lwm_t)(struct rte_eth_dev *dev,
+				      uint16_t rx_queue_id,
+				      uint8_t lwm);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
 				    uint16_t tx_queue_id,
@@ -1283,6 +1287,9 @@ struct eth_dev_ops {
 
 	/** Dump private info from device */
 	eth_dev_priv_dump_t eth_dev_priv_dump;
+
+	/** Set Rx queue limit watermark */
+	eth_rx_queue_set_lwm_t rx_queue_set_lwm;
 };
 
 /**
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 29a3d80..1e4fc6a 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4414,6 +4414,34 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
 							queue_idx, tx_rate));
 }
 
+int rte_eth_rx_queue_set_lwm(uint16_t port_id, uint16_t queue_idx,
+			     uint8_t lwm)
+{
+	struct rte_eth_dev *dev;
+	struct rte_eth_dev_info dev_info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_idx >= dev_info.max_rx_queues) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set Rx queue LWM: port %u: invalid queue ID=%u\n",
+			port_id, queue_idx);
+		return -EINVAL;
+	}
+
+	if (lwm > 99)
+		return -EINVAL;
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_set_lwm, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_set_lwm)(dev,
+							     queue_idx, lwm));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
 	uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8e..f29e53b 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1249,8 +1249,12 @@ struct rte_eth_rxconf {
 	 */
 	union rte_eth_rxseg *rx_seg;
 
-	uint64_t reserved_64s[2]; /**< Reserved for future fields */
+	uint64_t reserved_64s;
+	uint32_t reserved_32s;
+	uint32_t lwm:8;
+	uint32_t reserved_bits:24;
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
+
 };
 
 /**
@@ -3668,6 +3672,29 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t port_id,
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue based limit watermark.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_idx
+ *  The index of the receive queue
+ * @param lwm
+ *  The limit watermark percentage of Rx queue descriptor size.
+ *  The valid range is [0,99].
+ *  Setting 0 means disable limit watermark.
+ *
+ * @return
+ *   - (0) if successful.
+ *   - negative if failed.
+ */
+__rte_experimental
+int rte_eth_rx_queue_set_lwm(uint16_t port_id, uint16_t queue_idx,
+				uint8_t lwm);
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
@@ -3873,6 +3900,7 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	RTE_ETH_EVENT_RXQ_LIMIT_REACHED,/**< RX queue limit reached */
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 20391ab..8b85ad8 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -279,6 +279,9 @@ EXPERIMENTAL {
 	rte_flow_async_action_handle_create;
 	rte_flow_async_action_handle_destroy;
 	rte_flow_async_action_handle_update;
+
+	# added in 22.07
+	rte_eth_rx_queue_set_lwm;
 };
 
 INTERNAL {
-- 
1.8.3.1



* [RFC v1 4/7] net/mlx5: add LWM event handling support
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
                       ` (2 preceding siblings ...)
  2022-05-06  3:56     ` [RFC v1 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
@ 2022-05-06  3:56     ` Spike Du
  2022-05-06  3:56     ` [RFC v1 5/7] net/mlx5: support Rx queue based limit watermark Spike Du
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

When the number of available WQEs in an RQ reaches the LWM, the kernel
driver raises an event to SW.
Use a devx event_channel to catch this event and notify the user.
The channel is allocated once per shared device context.
Each event carries a cookie that identifies the port and Rx queue it
belongs to.
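
The cookie layout can be sketched as follows. The offsets and masks are
the LWM_COOKIE_* macros this patch adds to mlx5_rx.h; the two helper
functions are hypothetical and exist only to show the packing, they are
not part of the patch.

```c
#include <stdint.h>

/* Layout taken from the LWM_COOKIE_* macros added to mlx5_rx.h. */
#define LWM_COOKIE_RXQID_OFFSET 0
#define LWM_COOKIE_RXQID_MASK 0xffff
#define LWM_COOKIE_PORTID_OFFSET 16
#define LWM_COOKIE_PORTID_MASK 0xffff

/* Hypothetical helper: pack port and Rx queue IDs into a cookie. */
static uint64_t
lwm_cookie_encode(uint16_t port_id, uint16_t rxq_id)
{
	return ((uint64_t)port_id << LWM_COOKIE_PORTID_OFFSET) |
	       ((uint64_t)rxq_id << LWM_COOKIE_RXQID_OFFSET);
}

/* Hypothetical helper: recover both IDs from a received cookie. */
static void
lwm_cookie_decode(uint64_t cookie, uint16_t *port_id, uint16_t *rxq_id)
{
	*port_id = (cookie >> LWM_COOKIE_PORTID_OFFSET) &
		   LWM_COOKIE_PORTID_MASK;
	*rxq_id = (cookie >> LWM_COOKIE_RXQID_OFFSET) &
		  LWM_COOKIE_RXQID_MASK;
}
```

Port 1, Rx queue 2 is encoded as 0x10002, so one event channel can serve
every Rx queue that shares the device context.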

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.c      | 61 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h      |  7 +++++
 drivers/net/mlx5/mlx5_devx.c | 47 ++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.c   | 27 ++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h   |  7 +++++
 5 files changed, 149 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 72b1e35..334223e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <errno.h>
+#include <fcntl.h>
 
 #include <rte_malloc.h>
 #include <ethdev_driver.h>
@@ -22,6 +23,7 @@
 #include <rte_eal_paging.h>
 #include <rte_alarm.h>
 #include <rte_cycles.h>
+#include <rte_interrupts.h>
 
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
@@ -1521,6 +1523,64 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+	int fd_lwm;
+
+	pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+	priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+			(priv->sh->cdev->ctx,
+			 MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+	if (!priv->sh->devx_channel_lwm)
+		goto err;
+	fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+	priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+	if (!priv->sh->intr_handle_lwm)
+		goto err;
+	return 0;
+err:
+	mlx5_lwm_unset(priv->sh);
+	return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before freeing this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+	if (sh->intr_handle_lwm) {
+		mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+			mlx5_dev_interrupt_handler_lwm, (void *)-1);
+		sh->intr_handle_lwm = NULL;
+	}
+	if (sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(sh->devx_channel_lwm);
+		sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1597,6 +1657,7 @@ struct mlx5_dev_ctx_shared *
 		claim_zero(mlx5_devx_cmd_destroy(sh->td));
 	MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
 	pthread_mutex_destroy(&sh->txpp.mutex);
+	mlx5_lwm_unset(sh);
 	mlx5_free(sh);
 	return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 4821ff0..515ff33 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1264,6 +1264,9 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
 	unsigned int flow_max_priority;
 	enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+	void *devx_channel_lwm;
+	struct rte_intr_handle *intr_handle_lwm;
+	pthread_mutex_t lwm_config_lock;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1401,6 +1404,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1409,6 +1413,7 @@ struct mlx5_obj_ops {
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
 	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
 	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+	int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1599,6 +1604,8 @@ int mlx5_udp_tunnel_port_add(struct rte_eth_dev *dev,
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *priv);
+void mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh);
 
 /* Macro to iterate over all valid ports for mlx5 driver. */
 #define MLX5_ETH_FOREACH_DEV(port_id, dev) \
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4fbfcaa..25b998d 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -233,6 +233,52 @@
 }
 
 /**
+ * Get LWM event for shared context, return the correct port/rxq for this event.
+ *
+ * @param priv
+ *   Mlx5_priv object.
+ * @param rxq_idx [out]
+ *   Which rxq gets this event.
+ * @param port_id [out]
+ *   Which port gets this event.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_rx_devx_get_event_lwm(struct mlx5_priv *priv, int *rxq_idx, int *port_id)
+{
+#ifdef HAVE_IBV_DEVX_EVENT
+	union {
+		struct mlx5dv_devx_async_event_hdr event_resp;
+		uint8_t buf[sizeof(struct mlx5dv_devx_async_event_hdr) + 128];
+	} out;
+	int ret;
+
+	memset(&out, 0, sizeof(out));
+	ret = mlx5_glue->devx_get_event(priv->sh->devx_channel_lwm,
+					&out.event_resp,
+					sizeof(out.buf));
+	if (ret < 0) {
+		rte_errno = errno;
+		DRV_LOG(WARNING, "%s err\n", __func__);
+		return -rte_errno;
+	}
+	*port_id = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_PORTID_OFFSET) & LWM_COOKIE_PORTID_MASK;
+	*rxq_idx = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_RXQID_OFFSET) & LWM_COOKIE_RXQID_MASK;
+	return 0;
+#else
+	(void)priv;
+	(void)rxq_idx;
+	(void)port_id;
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+#endif /* HAVE_IBV_DEVX_EVENT */
+}
+
+/**
  * Create a RQ object using DevX.
  *
  * @param rxq
@@ -1419,6 +1465,7 @@ struct mlx5_obj_ops devx_obj_ops = {
 	.rxq_event_get = mlx5_rx_devx_get_event,
 	.rxq_obj_modify = mlx5_devx_modify_rq,
 	.rxq_obj_release = mlx5_rxq_devx_obj_release,
+	.rxq_event_get_lwm = mlx5_rx_devx_get_event_lwm,
 	.ind_table_new = mlx5_devx_ind_table_new,
 	.ind_table_modify = mlx5_devx_ind_table_modify,
 	.ind_table_destroy = mlx5_devx_ind_table_destroy,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index e5eea0a..6b2ef45 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -1187,3 +1187,30 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	return -ENOTSUP;
 }
+
+/**
+ * Rte interrupt handler for LWM event.
+ * It first checks whether an event has arrived and, if so, processes the callback for
+ * RTE_ETH_EVENT_RXQ_LIMIT_REACHED.
+ *
+ * @param args
+ *   Generic pointer to mlx5_priv.
+ */
+void
+mlx5_dev_interrupt_handler_lwm(void *args)
+{
+	struct mlx5_priv *priv = args;
+	struct rte_eth_dev *dev;
+	int ret, rxq_idx = 0, port_id = 0;
+
+	ret = priv->obj_ops.rxq_event_get_lwm(priv, &rxq_idx, &port_id);
+	if (unlikely(ret < 0)) {
+		DRV_LOG(WARNING, "Cannot get LWM event context.");
+		return;
+	}
+	DRV_LOG(INFO, "%s get LWM event, port_id:%d rxq_id:%d.", __func__,
+		port_id, rxq_idx);
+	dev = &rte_eth_devices[port_id];
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RXQ_LIMIT_REACHED,
+				     (void *)(uintptr_t)rxq_idx);
+}
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 25a5f2c..103509f 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -176,6 +176,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	void (*lwm_event_rxq_limit_reached)(uint16_t port_id, uint16_t rxq_id);
 };
 
 /* External RX queue descriptor. */
@@ -295,6 +296,7 @@ void mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
+void mlx5_dev_interrupt_handler_lwm(void *args);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
@@ -675,4 +677,9 @@ uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 	return !!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED);
 }
 
+#define LWM_COOKIE_RXQID_OFFSET 0
+#define LWM_COOKIE_RXQID_MASK 0xffff
+#define LWM_COOKIE_PORTID_OFFSET 16
+#define LWM_COOKIE_PORTID_MASK 0xffff
+
 #endif /* RTE_PMD_MLX5_RX_H_ */
-- 
1.8.3.1



* [RFC v1 5/7] net/mlx5: support Rx queue based limit watermark
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
                       ` (3 preceding siblings ...)
  2022-05-06  3:56     ` [RFC v1 4/7] net/mlx5: add LWM event handling support Spike Du
@ 2022-05-06  3:56     ` Spike Du
  2022-05-06  3:56     ` [RFC v1 6/7] net/mlx5: add private API to config host port shaper Spike Du
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add the mlx5-specific LWM (limit watermark) configuration handler.
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
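
The percentage is converted to a WQE count before it is programmed. The
sketch below mirrors the arithmetic of mlx5_rx_queue_set_lwm() and
mlx5_rxq_lwm_to_percentage() from this patch; the standalone helper
names are hypothetical and the code is an illustration, not the driver
implementation.

```c
#include <stdint.h>

/*
 * Hypothetical helper: percentage -> WQE count. A non-zero remainder is
 * rounded up, as the patch does, unless that would reach wqe_cnt.
 */
static uint32_t
lwm_percent_to_wqes(uint8_t lwm_pct, uint32_t wqe_cnt)
{
	uint32_t lwm = lwm_pct * wqe_cnt / 100;

	if (lwm_pct && (lwm_pct * wqe_cnt % 100) && lwm + 1 < wqe_cnt)
		lwm++;
	return lwm;
}

/* Hypothetical helper: WQE count -> percentage, as reported to users. */
static uint8_t
lwm_wqes_to_percent(uint32_t lwm, uint32_t wqe_cnt)
{
	return (uint8_t)(lwm * 100 / wqe_cnt);
}
```

For a 256-entry queue, 30% becomes 77 WQEs, which converts back to 30%.
A result of 0 for a non-zero percentage corresponds to the "Too small
LWM configuration" error path in the patch.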

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |   4 ++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h         |   1 +
 drivers/net/mlx5/mlx5.c                |   1 +
 drivers/net/mlx5/mlx5_rx.c             | 123 +++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h             |   3 +
 6 files changed, 133 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 4805d08..a7698c9 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -92,6 +92,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -518,6 +519,9 @@ Limitations
 - The NIC egress flow rules on representor port are not supported.
 
 
+- LWM:
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
+
 Statistics
 ----------
 
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 88d6e96..f3cf2f1 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -64,6 +64,7 @@ New Features
 
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
+  * Added Rx queue LWM(Limit WaterMark) support.
 
 
 Removed Items
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 44b1822..23b13e3 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3290,6 +3290,7 @@ struct mlx5_aso_wqe {
 
 enum {
 	MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+	MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 334223e..628003d 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2062,6 +2062,7 @@ struct mlx5_dev_ctx_shared *
 	.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
 	.vlan_filter_set = mlx5_vlan_filter_set,
 	.rx_queue_setup = mlx5_rx_queue_setup,
+	.rx_queue_set_lwm = mlx5_rx_queue_set_lwm,
 	.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
 	.tx_queue_setup = mlx5_tx_queue_setup,
 	.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 6b2ef45..68564ea 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
 #include <mlx5_common_mr.h>
+#include <rte_pmd_mlx5.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,16 @@
 	return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+	uint32_t wqe_cnt = 1 << rxq_data->elts_n;
+
+	return (rxq->lwm * 100 / wqe_cnt);
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +162,7 @@
 {
 	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
 	struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+	struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
 	if (!rxq)
 		return;
@@ -169,6 +182,7 @@
 	qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
 		RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
 		RTE_BIT32(rxq->elts_n);
+	qinfo->conf.lwm = mlx5_rxq_lwm_to_percentage(rxq_priv);
 }
 
 /**
@@ -1214,3 +1228,112 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RXQ_LIMIT_REACHED,
 				     (void *)(uintptr_t)rxq_idx);
 }
+
+/**
+ * DPDK callback to arm an Rx queue LWM(limit watermark) event.
+ * When the Rx queue fullness reaches the LWM limit, the driver catches
+ * a HW event and invokes the user event callback.
+ * After the last event handling, the user needs to call this API again
+ * to arm an additional event.
+ *
+ * @param dev
+ *   Pointer to the device structure.
+ * @param[in] rx_queue_id
+ *   Rx queue identifier.
+ * @param[in] lwm
+ *   The LWM value, defined as a percentage of the Rx queue size.
+ *   [1-99] to set a new LWM (update the old value).
+ *   0 to unarm the event.
+ *
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOMEM - not enough memory to create LWM event channel.
+ *   - EINVAL - the input Rxq is not created by devx.
+ *   - E2BIG  - lwm is bigger than 99.
+ */
+int
+mlx5_rx_queue_set_lwm(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+		      uint8_t lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint16_t port_id = PORT_ID(priv);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	uint16_t event_nums[1] = {MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED};
+	struct mlx5_rxq_data *rxq_data;
+	uint32_t wqe_cnt;
+	uint64_t cookie;
+	int ret = 0;
+
+	if (!rxq) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	rxq_data = &rxq->ctrl->rxq;
+	/* Ensure the Rq is created by devx. */
+	if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (lwm > 99) {
+		DRV_LOG(WARNING, "Too big LWM configuration.");
+		rte_errno = E2BIG;
+		return -rte_errno;
+	}
+	/* Start config LWM. */
+	pthread_mutex_lock(&priv->sh->lwm_config_lock);
+	if (rxq->lwm == 0 && lwm == 0) {
+		/* Both old/new values are 0, do nothing. */
+		ret = 0;
+		goto end;
+	}
+	wqe_cnt = 1 << rxq_data->elts_n;
+	if (lwm) {
+		if (!priv->sh->devx_channel_lwm) {
+			ret = mlx5_lwm_setup(priv);
+			if (ret) {
+				DRV_LOG(WARNING,
+					"Failed to create shared_lwm.");
+				rte_errno = ENOMEM;
+				ret = -rte_errno;
+				goto end;
+			}
+		}
+		if (!rxq->lwm_devx_subscribed) {
+			cookie = ((uint32_t)
+				  (port_id << LWM_COOKIE_PORTID_OFFSET)) |
+				(rx_queue_id << LWM_COOKIE_RXQID_OFFSET);
+			ret = mlx5_os_devx_subscribe_devx_event
+				(priv->sh->devx_channel_lwm,
+				 rxq->devx_rq.rq->obj,
+				 sizeof(event_nums),
+				 event_nums,
+				 cookie);
+			if (ret) {
+				rte_errno = rte_errno ? rte_errno : EINVAL;
+				ret = -rte_errno;
+				goto end;
+			}
+			rxq->lwm_devx_subscribed = 1;
+		}
+	}
+	/* Save LWM to rxq and send modify_rq devx command. */
+	rxq->lwm = lwm * wqe_cnt / 100;
+	/* Prevent integer division loss when converting the LWM count back to a percentage. */
+	if (lwm && (lwm * wqe_cnt % 100)) {
+		rxq->lwm = ((uint32_t)(rxq->lwm + 1) >= wqe_cnt) ?
+			rxq->lwm : (rxq->lwm + 1);
+	}
+	if (lwm && !rxq->lwm) {
+		/* With mprq, wqe_cnt may be < 100. */
+		DRV_LOG(WARNING, "Too small LWM configuration.");
+		rte_errno = EINVAL;
+		ret = -rte_errno;
+		goto end;
+	}
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RDY);
+end:
+	pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	return ret;
+}
+
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 103509f..483ca12 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -176,6 +176,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	uint32_t lwm_devx_subscribed:1;
 	void (*lwm_event_rxq_limit_reached)(uint16_t port_id, uint16_t rxq_id);
 };
 
@@ -297,6 +298,8 @@ int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 void mlx5_dev_interrupt_handler_lwm(void *args);
+int mlx5_rx_queue_set_lwm(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+			  uint8_t lwm);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
-- 
1.8.3.1



* [RFC v1 6/7] net/mlx5: add private API to config host port shaper
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
                       ` (4 preceding siblings ...)
  2022-05-06  3:56     ` [RFC v1 5/7] net/mlx5: support Rx queue based limit watermark Spike Du
@ 2022-05-06  3:56     ` Spike Du
  2022-05-06  3:56     ` [RFC v1 7/7] app/testpmd: add LWM and Host Shaper command Spike Du
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

The host port shaper can be configured through QSHR (QoS Shaper Host
Register). Add a check in the build files to decide whether this
function is enabled.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper can configure the shaper rate and the lwm-triggered flag
for a host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM (Limit Watermark) event.
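
The constraint documented in mlx5.rst by this patch (with lwm-triggered
set, only rate 0 and 100Mbps are supported) can be sketched as a small
check. The helper below is hypothetical and is not the driver's API; it
also assumes the rate is expressed in units of 100Mbps.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validation helper (not the driver's API).
 * Assumption: the rate is given in units of 100Mbps, 0 = disabled.
 */
static int
host_shaper_rate_check(uint8_t rate_100mbps, bool lwm_triggered)
{
	/* With lwm-triggered set, only 0 and 100Mbps are documented. */
	if (lwm_triggered && rate_100mbps > 1)
		return -EINVAL;
	return 0;
}
```

Under these assumptions a 1Gbps rate (10 units) is rejected when
lwm-triggered is set and accepted otherwise.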

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |   7 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  44 +++++++++-----
 drivers/common/mlx5/mlx5_prm.h         |  25 ++++++++
 drivers/net/mlx5/mlx5.h                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 103 +++++++++++++++++++++++++++++++++
 drivers/net/mlx5/rte_pmd_mlx5.h        |  30 ++++++++++
 drivers/net/mlx5/version.map           |   2 +
 8 files changed, 199 insertions(+), 15 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a7698c9..4e2ebff 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -522,6 +523,12 @@ Limitations
 - LWM:
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField-2.
+  - When configuring the host shaper with the MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED flag set,
+    only rates 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index f3cf2f1..96083eb 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -65,6 +65,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 
 Removed Items
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index ed48245..e332261 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -16,23 +16,24 @@ if dlopen_ibverbs
     ]
 endif
 
-libnames = [ 'mlx5', 'ibverbs' ]
+libnames = [ 'mlx5', 'ibverbs']
 libs = []
 foreach libname:libnames
-    lib = dependency('lib' + libname, static:static_ibverbs, required:false, method: 'pkg-config')
-    if not lib.found() and not static_ibverbs
-        lib = cc.find_library(libname, required:false)
-    endif
-    if lib.found()
-        libs += lib
-        if not static_ibverbs and not dlopen_ibverbs
-            ext_deps += lib
-        endif
-    else
-        build = false
-        reason = 'missing dependency, "' + libname + '"'
-        subdir_done()
-    endif
+	lib = dependency('lib' + libname, static:static_ibverbs,
+			required:false, method: 'pkg-config')
+	if not lib.found() and not static_ibverbs
+		lib = cc.find_library(libname, required:false)
+	endif
+	if lib.found()
+		libs += lib
+		if not static_ibverbs and not dlopen_ibverbs
+			ext_deps += lib
+		endif
+	else
+		build = false
+		reason = 'missing dependency, "' + libname + '"'
+		subdir_done()
+	endif
 endforeach
 if static_ibverbs or dlopen_ibverbs
     # Build without adding shared libs to Requires.private
@@ -45,6 +46,13 @@ if static_ibverbs
     ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+	libmtcr_ul_found = true
+	ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -205,6 +213,12 @@ has_sym_args = [
         [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
             'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+    has_sym_args += [
+        [  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+            'mopen'],
+    ]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
     config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 23b13e3..3559927 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3768,6 +3768,7 @@ enum {
 	MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
 	MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
 	MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC005,
+	MLX5_QSHR_REGISTER_ID = 0x4030,
 };
 
 struct mlx5_ifc_register_mtutc_bits {
@@ -3782,6 +3783,30 @@ struct mlx5_ifc_register_mtutc_bits {
 	u8 time_adjustment[0x20];
 };
 
+struct mlx5_ifc_ets_global_config_register_bits {
+	u8 reserved_at_0[0x2];
+	u8 rate_limit_update[0x1];
+	u8 reserved_at_3[0x29];
+	u8 max_bw_units[0x4];
+	u8 reserved_at_48[0x8];
+	u8 max_bw_value[0x8];
+};
+
+#define ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED      0x0
+#define ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS 0x3
+#define ETS_GLOBAL_CONFIG_BW_UNIT_GBPS          0x4
+
+struct mlx5_ifc_register_qshr_bits {
+	u8 reserved_at_0[0x4];
+	u8 connected_host[0x1];
+	u8 vqos[0x1];
+	u8 fast_response[0x1];
+	u8 reserved_at_7[0x1];
+	u8 local_port[0x8];
+	u8 reserved_at_16[0x230];
+	struct mlx5_ifc_ets_global_config_register_bits global_config;
+};
+
 #define MLX5_MTUTC_TIMESTAMP_MODE_INTERNAL_TIMER 0
 #define MLX5_MTUTC_TIMESTAMP_MODE_REAL_TIME 1
 
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 515ff33..5dfd375 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1267,6 +1267,8 @@ struct mlx5_dev_ctx_shared {
 	void *devx_channel_lwm;
 	struct rte_intr_handle *intr_handle_lwm;
 	pthread_mutex_t lwm_config_lock;
+	uint32_t host_shaper_rate:8;
+	uint32_t lwm_triggered:1;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 68564ea..cc68a91 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -28,6 +28,9 @@
 #include "mlx5_rxtx.h"
 #include "mlx5_devx.h"
 #include "mlx5_rx.h"
+#ifdef HAVE_MLX5_MSTFLINT
+#include <mstflint/mtcr.h>
+#endif
 
 
 static __rte_always_inline uint32_t
@@ -1337,3 +1340,103 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	return ret;
 }
 
+/**
+ * Mlx5 access register function to configure host shaper.
+ * It calls API in libmtcr_ul to access QSHR(Qos Shaper Host Register)
+ * in firmware.
+ *
+ * @param dev
+ *   Pointer to rte_eth_dev.
+ * @param lwm_triggered
+ *   Flag to enable/disable lwm_triggered bit in QSHR.
+ * @param rate
+ *   Host shaper rate, unit is 100Mbps, set to 0 means disable the shaper.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+static int
+mlxreg_config_host_shaper(struct rte_eth_dev *dev,
+			  bool lwm_triggered, uint8_t rate)
+{
+#ifdef HAVE_MLX5_MSTFLINT
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint32_t data[MLX5_ST_SZ_DW(register_qshr)] = {0};
+	int rc, retry_count = 3;
+	mfile *mf = NULL;
+	int status;
+	void *ptr;
+
+	mf = mopen(priv->sh->ibdev_name);
+	if (!mf) {
+		DRV_LOG(WARNING, "mopen failed\n");
+		rte_errno = ENOENT;
+		return -rte_errno;
+	}
+	MLX5_SET(register_qshr, data, connected_host, 1);
+	MLX5_SET(register_qshr, data, fast_response, lwm_triggered ? 1 : 0);
+	MLX5_SET(register_qshr, data, local_port, 1);
+	ptr = MLX5_ADDR_OF(register_qshr, data, global_config);
+	MLX5_SET(ets_global_config_register, ptr, rate_limit_update, 1);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_units,
+		 rate ? ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS :
+		 ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_value, rate);
+	do {
+		rc = maccess_reg(mf,
+				 MLX5_QSHR_REGISTER_ID,
+				 MACCESS_REG_METHOD_SET,
+				 (u_int32_t *)&data[0],
+				 sizeof(data),
+				 sizeof(data),
+				 sizeof(data),
+				 &status);
+		if ((rc != ME_ICMD_STATUS_IFC_BUSY &&
+		     status != ME_REG_ACCESS_BAD_PARAM) ||
+		    !(mf->flags & MDEVS_REM)) {
+			break;
+		}
+		DRV_LOG(WARNING, "%s retry.", __func__);
+		usleep(10000);
+	} while (retry_count-- > 0);
+	mclose(mf);
+	rte_errno = (rc == ME_REG_ACCESS_DEV_BUSY) ? EBUSY : EIO;
+	return rc ? -rte_errno : 0;
+#else
+	(void)dev;
+	(void)lwm_triggered;
+	(void)rate;
+	return -1;
+#endif
+}
+
+int rte_pmd_mlx5_config_host_shaper(int port_id, uint8_t rate,
+				    uint32_t flags)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	struct mlx5_priv *priv = dev->data->dev_private;
+	bool lwm_triggered =
+		!!(flags & RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+
+	if (!lwm_triggered) {
+		priv->sh->host_shaper_rate = rate;
+	} else {
+		switch (rate) {
+		case 0:
+		/* Rate 0 means disable lwm_triggered. */
+			priv->sh->lwm_triggered = 0;
+			break;
+		case 1:
+		/* Rate 1 means enable lwm_triggered. */
+			priv->sh->lwm_triggered = 1;
+			break;
+		default:
+			return -ENOTSUP;
+		}
+	}
+	return mlxreg_config_host_shaper(dev, priv->sh->lwm_triggered,
+					 priv->sh->host_shaper_rate);
+}
diff --git a/drivers/net/mlx5/rte_pmd_mlx5.h b/drivers/net/mlx5/rte_pmd_mlx5.h
index 6e7907e..d0e8cae 100644
--- a/drivers/net/mlx5/rte_pmd_mlx5.h
+++ b/drivers/net/mlx5/rte_pmd_mlx5.h
@@ -109,6 +109,36 @@ int rte_pmd_mlx5_external_rx_queue_id_map(uint16_t port_id, uint16_t dpdk_idx,
 int rte_pmd_mlx5_external_rx_queue_id_unmap(uint16_t port_id,
 					    uint16_t dpdk_idx);
 
+/**
+ * The rate of the host port shaper will be updated directly at the next
+ * LWM event to the rate that comes with this flag set; set rate 0
+ * to disable this rate update.
+ * Unset this flag to update the rate of the host port shaper directly in
+ * the API call; use rate 0 to disable the current shaper.
+ */
+#define MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED 0
+
+/**
+ * Configure an HW shaper to limit Rx rate for a host port.
+ * The configuration will affect all the ethdev ports belonging to
+ * the same rte_device.
+ *
+ * @param[in] port_id
+ *   The port identifier of the Ethernet device.
+ * @param[in] rate
+ *   Unit is 100Mbps, setting the rate to 0 disables the shaper.
+ * @param[in] flags
+ *   Host shaper flags.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+__rte_experimental
+int rte_pmd_mlx5_config_host_shaper(int port_id, uint8_t rate, uint32_t flags);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/mlx5/version.map b/drivers/net/mlx5/version.map
index 79cb79a..905ab30 100644
--- a/drivers/net/mlx5/version.map
+++ b/drivers/net/mlx5/version.map
@@ -12,4 +12,6 @@ EXPERIMENTAL {
 	# added in 22.03
 	rte_pmd_mlx5_external_rx_queue_id_map;
 	rte_pmd_mlx5_external_rx_queue_id_unmap;
+	# added in 22.07
+	rte_pmd_mlx5_config_host_shaper;
 };
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [RFC v1 7/7] app/testpmd: add LWM and Host Shaper command
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
                       ` (5 preceding siblings ...)
  2022-05-06  3:56     ` [RFC v1 6/7] net/mlx5: add private API to config host port shaper Spike Du
@ 2022-05-06  3:56     ` Spike Du
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-06  3:56 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add testpmd commands to configure the per-rxq LWM and the host shaper.
- Command syntax:
  set port <port_id> rxq <rxq_id> lwm <lwm_num>
  mlx5 set port <port_id> host_shaper lwm_triggered <0|1> rate <rate_num>

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50
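
As a quick check of the rate unit, here is a minimal sketch of the conversion
between the `rate` argument and Mbps. Both helper names are hypothetical and
are not part of this patch; the only facts taken from the text above are that
the unit is 100Mbps and that the rate is passed as an 8-bit value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of this patch: the rate argument is
 * expressed in units of 100 Mbps, and 0 means the shaper is disabled. */
static uint32_t
host_shaper_rate_to_mbps(uint8_t rate)
{
	return (uint32_t)rate * 100u;
}

/* Hypothetical inverse: only multiples of 100 Mbps that fit in the
 * 8-bit rate argument can be expressed. */
static uint8_t
host_shaper_mbps_to_rate(uint32_t mbps)
{
	assert(mbps % 100u == 0);
	assert(mbps / 100u <= UINT8_MAX);
	return (uint8_t)(mbps / 100u);
}
```

With these helpers, `rate 50` corresponds to 5000 Mbps, matching the 5Gbps
example above.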

Add sample code to handle the rxq LWM event: it waits a while so that the
rxq empties, then disables the host shaper and rearms the LWM event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 app/test-pmd/cmdline.c       |  74 +++++++++++++++++
 app/test-pmd/config.c        |  23 ++++++
 app/test-pmd/meson.build     |   3 +
 app/test-pmd/testpmd.c       |  13 +++
 app/test-pmd/testpmd.h       |   1 +
 doc/guides/nics/mlx5.rst     |  76 +++++++++++++++++
 drivers/net/mlx5/meson.build |   7 +-
 drivers/net/mlx5/mlx5_test.c | 191 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_test.h |  27 ++++++
 9 files changed, 413 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_test.c
 create mode 100644 drivers/net/mlx5/mlx5_test.h

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 6ffea8e..f98cdf5 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -67,6 +67,9 @@
 #include "cmdline_mtr.h"
 #include "cmdline_tm.h"
 #include "bpf_cmd.h"
+#ifdef RTE_NET_MLX5
+#include "mlx5_test.h"
+#endif
 
 static struct cmdline *testpmd_cl;
 
@@ -17807,6 +17810,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	}
 };
 
+/* *** SET LIMIT WATERMARK FOR AN RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t rxq;
+	uint16_t rxq_num;
+	cmdline_fixed_string_t lwm;
+	uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_rxq_lwm_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->rxq, "rxq") == 0)
+	    && (strcmp(res->lwm, "lwm") == 0))
+		ret = set_rxq_lwm(res->port_num, res->rxq_num,
+				  res->lwm_num);
+	if (ret < 0)
+		printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+	.f = cmd_rxq_lwm_parsed,
+	.data = (void *)0,
+	.help_str = "set port <port_id> rxq <rxq_id> lwm <lwm_num>: "
+		"Set lwm for rxq on port_id",
+	.tokens = {
+		(void *)&cmd_rxq_lwm_set,
+		(void *)&cmd_rxq_lwm_port,
+		(void *)&cmd_rxq_lwm_portnum,
+		(void *)&cmd_rxq_lwm_rxq,
+		(void *)&cmd_rxq_lwm_rxqnum,
+		(void *)&cmd_rxq_lwm_lwm,
+		(void *)&cmd_rxq_lwm_lwmnum,
+		NULL,
+	},
+};
+
 /* ******************************************************************************** */
 
 /* list of instructions */
@@ -18093,6 +18163,10 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	(cmdline_parse_inst_t *)&cmd_show_capability,
 	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
 	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
+	(cmdline_parse_inst_t *)&cmd_rxq_lwm,
+#ifdef RTE_NET_MLX5
+	(cmdline_parse_inst_t *)&_rte_pmd_mlx5_cmd_port_host_shaper,
+#endif
 	NULL,
 };
 
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index cc8e7aa..609fde1 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -6281,3 +6281,26 @@ struct igb_ring_desc_16_bytes {
 		printf("  %s\n", buf);
 	}
 }
+
+int
+set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm)
+{
+	struct rte_eth_link link;
+	int ret;
+
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+	ret = eth_link_get_nowait_print_err(port_id, &link);
+	if (ret < 0)
+		return -EINVAL;
+	if (lwm > 99)
+		return -EINVAL;
+	ret = rte_eth_rx_queue_set_lwm(port_id, queue_idx, lwm);
+
+	if (ret)
+		return ret;
+	/* Save the input lwm. */
+	ports[port_id].rx_conf[queue_idx].lwm = lwm;
+	return 0;
+}
+
diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 43130c8..c4fd379 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -73,3 +73,6 @@ endif
 if dpdk_conf.has('RTE_NET_DPAA')
     deps += ['bus_dpaa', 'mempool_dpaa', 'net_dpaa']
 endif
+if dpdk_conf.has('RTE_NET_MLX5')
+    deps += 'net_mlx5'
+endif
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index fe2ce19..683374c 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -66,6 +66,9 @@
 #ifdef RTE_EXEC_ENV_WINDOWS
 #include <process.h>
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_test.h"
+#endif
 
 #include "testpmd.h"
 
@@ -417,6 +420,7 @@ struct fwd_engine * fwd_engines[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RXQ_LIMIT_REACHED] = "rxq limit reached",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -3539,6 +3543,7 @@ struct pmd_test_command {
 eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		  void *ret_param)
 {
+	uint16_t rxq_idx;
 	RTE_SET_USED(param);
 	RTE_SET_USED(ret_param);
 
@@ -3570,6 +3575,14 @@ struct pmd_test_command {
 		ports[port_id].port_status = RTE_PORT_CLOSED;
 		printf("Port %u is closed\n", port_id);
 		break;
+	case RTE_ETH_EVENT_RXQ_LIMIT_REACHED:
+		rxq_idx = (uint16_t)(uintptr_t)ret_param;
+		printf("recv rxq_limit_reached event, port:%d rxq_id:%d\n", port_id,
+		       rxq_idx);
+#ifdef RTE_NET_MLX5
+		mlx5_test_lwm_event_rxq_limit_reached(port_id, rxq_idx);
+#endif
+		break;
 	default:
 		break;
 	}
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 31f766c..b570ea7 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1163,6 +1163,7 @@ uint16_t tx_pkt_set_dynf(uint16_t port_id, __rte_unused uint16_t queue,
 void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
 void flex_item_destroy(portid_t port_id, uint16_t flex_id);
 void port_flex_item_flush(portid_t port_id);
+int set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm);
 
 extern int flow_parse(const char *src, void *result, unsigned int size,
 		      struct rte_flow_attr **attr,
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 4e2ebff..1e6e3c5 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1688,3 +1688,79 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
    $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+How to use LWM and Host Shaper
+------------------------------
+
+LWM introduction
+~~~~~~~~~~~~~~~~
+
+LWM (Limit WaterMark) is a per-Rx-queue attribute; it should be configured
+as a percentage of the Rx queue size. When the Rx queue's available WQE count
+drops below the LWM, an event is sent to the PMD.
+
+Host shaper introduction
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+The host shaper register is a per-host-port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to the same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from the host.
+The host shaper has two modes for setting the shaper: immediate and deferred
+to the LWM event trigger. In immediate mode, the rate limit is applied to the
+host shaper immediately. In deferred mode, the shaper is not set until an
+LWM event is received by any Rx queue in a VF representor belonging to the host
+port. The only rate supported for deferred mode is 100Mbps (there is no limit
+on the supported rates for immediate mode). In deferred mode, the shaper is set
+on the host port by the firmware upon receiving the LWM event, which allows
+throttling host traffic on LWM events at minimum latency, preventing excess
+drops in the Rx queue.
+
+Testpmd CLI examples
+~~~~~~~~~~~~~~~~~~~~
+
+There are sample commands to configure LWM in testpmd.
+Testpmd also contains sample logic to handle the LWM event.
+The typical workflow is: testpmd configures LWM for the Rx queues, enables
+lwm_triggered in the host shaper and registers a callback. When traffic from
+the host is too high and the available WQE count drops below the LWM, the PMD
+receives an event and the firmware configures a 100Mbps shaper on the host port
+automatically. The PMD then calls the previously registered callback, which
+waits a while to let the Rx queue empty, then disables the host shaper.
+
+Let's assume we have a simple BlueField-2 setup: port 0 is the uplink, port 1
+is a VF representor. Each port has 2 Rx queues. To control traffic from the
+host to the ARM, we can enable LWM in testpmd with:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 lwm 30
+   testpmd> set port 1 rxq 1 lwm 30
+
+The first command disables the current host shaper and enables lwm_triggered.
+The remaining commands configure LWM to 30% of the Rx queue size for both Rx
+queues. When traffic from the host is too high, testpmd prints a log message
+about the received LWM event, then the host shaper is disabled.
+The traffic rate from the host is controlled and fewer drops happen in Rx queues.
+
+To disable LWM and lwm_triggered, we can invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 lwm 0
+   testpmd> set port 1 rxq 1 lwm 0
+
+It's recommended that an application disables LWM and lwm_triggered before
+exiting, if it enabled them earlier.
+
+We can also configure the shaper with a value; the rate unit is 100Mbps. The
+command below sets the current shaper to 5Gbps and disables lwm_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50
+
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..4c4eea4 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -8,8 +8,10 @@ if not (is_linux or is_windows)
     subdir_done()
 endif
 
-deps += ['hash', 'common_mlx5']
-headers = files('rte_pmd_mlx5.h')
+deps += ['hash', 'common_mlx5', 'cmdline']
+headers = files('rte_pmd_mlx5.h',
+		'mlx5_test.h',
+	)
 sources = files(
         'mlx5.c',
         'mlx5_ethdev.c',
@@ -38,6 +40,7 @@ sources = files(
         'mlx5_vlan.c',
         'mlx5_utils.c',
         'mlx5_devx.c',
+        'mlx5_test.c',
 )
 
 if is_linux
diff --git a/drivers/net/mlx5/mlx5_test.c b/drivers/net/mlx5/mlx5_test.c
new file mode 100644
index 0000000..43d25fe
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_test.c
@@ -0,0 +1,191 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#include <rte_prefetch.h>
+#include <rte_common.h>
+#include <rte_branch_prediction.h>
+#include <rte_ether.h>
+#include <rte_alarm.h>
+#include <mlx5_common.h>
+#include <rte_pmd_mlx5.h>
+#include <rte_ethdev.h>
+#include "mlx5_autoconf.h"
+#include "mlx5_defs.h"
+#include "mlx5.h"
+#include "mlx5_utils.h"
+#include "mlx5_devx.h"
+#include "mlx5_rx.h"
+#include "mlx5_test.h"
+
+static uint8_t host_shaper_lwm_triggered[RTE_MAX_ETHPORTS];
+#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
+
+/**
+ * Disable the host shaper and re-arm LWM event.
+ *
+ * @param[in] args
+ *   uint32_t integer combining port_id and rxq_id.
+ */
+static void
+mlx5_test_host_shaper_disable(void *args)
+{
+	uint32_t port_rxq_id = (uint32_t)(uintptr_t)args;
+	uint16_t port_id = port_rxq_id & 0xffff;
+	uint16_t qid = (port_rxq_id >> 16) & 0xffff;
+	struct rte_eth_rxq_info qinfo;
+
+	printf("%s disable shaper\n", __func__);
+	if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
+		printf("rx_queue_info_get returns error\n");
+		return;
+	}
+	/* Rearm the LWM event. */
+	if (rte_eth_rx_queue_set_lwm(port_id, qid, qinfo.conf.lwm)) {
+		printf("config lwm returns error\n");
+		return;
+	}
+	/* Only disable the shaper when lwm_triggered is set. */
+	if (host_shaper_lwm_triggered[port_id] &&
+	    rte_pmd_mlx5_config_host_shaper(port_id, 0, 0))
+		printf("%s disable shaper returns error\n", __func__);
+}
+
+void
+mlx5_test_lwm_event_rxq_limit_reached(uint16_t port_id, uint16_t rxq_id)
+{
+	uint32_t port_rxq_id = port_id | ((uint32_t)rxq_id << 16);
+
+	rte_eal_alarm_set(SHAPER_DISABLE_DELAY_US,
+			  mlx5_test_host_shaper_disable,
+			  (void *)(uintptr_t)port_rxq_id);
+	printf("%s port_id:%u rxq_id:%u\n", __func__, port_id, rxq_id);
+}
+
+/**
+ * Configure host shaper's lwm_triggered and current rate.
+ *
+ * @param[in] lwm_triggered
+ *   Disable/enable lwm_triggered.
+ * @param[in] rate
+ *   Configure current host shaper rate.
+ * @return
+ *   On success, returns 0.
+ *   On failure, returns < 0.
+ */
+static int
+mlx5_test_set_port_host_shaper(uint16_t port_id, uint16_t lwm_triggered, uint8_t rate)
+{
+	struct rte_eth_link link;
+	bool port_id_valid = false;
+	uint16_t pid;
+	int ret;
+
+	RTE_ETH_FOREACH_DEV(pid)
+		if (port_id == pid) {
+			port_id_valid = true;
+			break;
+		}
+	if (!port_id_valid)
+		return -EINVAL;
+	ret = rte_eth_link_get_nowait(port_id, &link);
+	if (ret < 0)
+		return ret;
+	host_shaper_lwm_triggered[port_id] = lwm_triggered ? 1 : 0;
+	if (!lwm_triggered) {
+		ret = rte_pmd_mlx5_config_host_shaper(port_id, 0,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	} else {
+		ret = rte_pmd_mlx5_config_host_shaper(port_id, 1,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	}
+	if (ret)
+		return ret;
+	ret = rte_pmd_mlx5_config_host_shaper(port_id, rate, 0);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+/* *** SET HOST_SHAPER FOR A PORT *** */
+struct cmd_port_host_shaper_result {
+	cmdline_fixed_string_t mlx5;
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t host_shaper;
+	cmdline_fixed_string_t lwm_triggered;
+	uint16_t fr;
+	cmdline_fixed_string_t rate;
+	uint8_t rate_num;
+};
+
+static void cmd_port_host_shaper_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_port_host_shaper_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->mlx5, "mlx5") == 0) &&
+	    (strcmp(res->set, "set") == 0) &&
+	    (strcmp(res->port, "port") == 0) &&
+	    (strcmp(res->host_shaper, "host_shaper") == 0) &&
+	    (strcmp(res->lwm_triggered, "lwm_triggered") == 0) &&
+	    (strcmp(res->rate, "rate") == 0))
+		ret = mlx5_test_set_port_host_shaper(res->port_num, res->fr,
+					   res->rate_num);
+	if (ret < 0)
+		printf("cmd_port_host_shaper error: (%s)\n", strerror(-ret));
+}
+
+cmdline_parse_token_string_t cmd_port_host_shaper_mlx5 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				mlx5, "mlx5");
+cmdline_parse_token_string_t cmd_port_host_shaper_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_port_host_shaper_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_port_host_shaper_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_host_shaper =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 host_shaper, "host_shaper");
+cmdline_parse_token_string_t cmd_port_host_shaper_lwm_triggered =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 lwm_triggered, "lwm_triggered");
+cmdline_parse_token_num_t cmd_port_host_shaper_fr =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      fr, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_rate =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 rate, "rate");
+cmdline_parse_token_num_t cmd_port_host_shaper_rate_num =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      rate_num, RTE_UINT8);
+cmdline_parse_inst_t _rte_pmd_mlx5_cmd_port_host_shaper = {
+	.f = cmd_port_host_shaper_parsed,
+	.data = (void *)0,
+	.help_str = "mlx5 set port <port_id> host_shaper lwm_triggered <0|1> "
+	"rate <rate_num>: Set HOST_SHAPER lwm_triggered and rate with port_id",
+	.tokens = {
+		(void *)&cmd_port_host_shaper_mlx5,
+		(void *)&cmd_port_host_shaper_set,
+		(void *)&cmd_port_host_shaper_port,
+		(void *)&cmd_port_host_shaper_portnum,
+		(void *)&cmd_port_host_shaper_host_shaper,
+		(void *)&cmd_port_host_shaper_lwm_triggered,
+		(void *)&cmd_port_host_shaper_fr,
+		(void *)&cmd_port_host_shaper_rate,
+		(void *)&cmd_port_host_shaper_rate_num,
+		NULL,
+	},
+};
diff --git a/drivers/net/mlx5/mlx5_test.h b/drivers/net/mlx5/mlx5_test.h
new file mode 100644
index 0000000..16efe88
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_test.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_TEST_H_
+#define RTE_PMD_MLX5_TEST_H_
+
+#include <cmdline_parse.h>
+#include <cmdline_parse_num.h>
+#include <cmdline_parse_string.h>
+
+/**
+ * RTE_ETH_EVENT_RXQ_LIMIT_REACHED handler sample code.
+ * It's called in testpmd; the workflow here is to delay a while until
+ * the Rx queue is empty, then disable the host shaper.
+ *
+ * @param[in] port_id
+ *   Port identifier.
+ * @param[in] rxq_id
+ *   Rx queue identifier.
+ */
+void
+mlx5_test_lwm_event_rxq_limit_reached(uint16_t port_id, uint16_t rxq_id);
+
+extern cmdline_parse_inst_t _rte_pmd_mlx5_cmd_port_host_shaper;
+#endif
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v1 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-06  3:56     ` [RFC v1 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
@ 2022-05-19  9:37       ` Andrew Rybchenko
  0 siblings, 0 replies; 131+ messages in thread
From: Andrew Rybchenko @ 2022-05-19  9:37 UTC (permalink / raw)
  To: Spike Du, matan, viacheslavo, orika, thomas; +Cc: dev, rasland

On 5/6/22 06:56, Spike Du wrote:
> LWM(limit watermark) is a per Rx queue attribute that notifies dpdk

Mentioning dpdk here is not necessary.

I'm not sure that "attribute" can notify application. Please,
reword the description.

> application event of RTE_ETH_EVENT_RXQ_LIMIT_REACHED when the Rx
> queue's usable descriptor is under the watermark.
> To simplify its configuration, LWM is a percentage of Rx queue
> descriptor size with valid value of [0,99].
> Setting LWM to 0 means disable it.

... which is the default.

Can I request a notification when no descriptors are left?
1 seems to be close to the answer, but not in the case of big
Rx rings.
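
To illustrate the granularity concern, here is a rough sketch. It assumes the
percentage is converted to a descriptor count by plain integer division; that
conversion and the helper name are assumptions, not taken from the patch.
With an 8192-entry ring, the smallest non-zero setting of 1% already
corresponds to 81 descriptors, so "no descriptors left" cannot be expressed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical conversion, assuming plain integer division; the real
 * PMD may round differently. */
static uint32_t
lwm_percent_to_desc(uint32_t ring_size, uint8_t lwm_percent)
{
	/* The proposed API restricts the percentage to [0, 99]. */
	assert(lwm_percent <= 99);
	return ring_size * lwm_percent / 100;
}
```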

> Add LWM's configuration handle in eth_dev_ops.

"handle" sounds bad here. Maybe "driver callback" or "driver
method".

> 
> Signed-off-by: Spike Du <spiked@nvidia.com>
> ---
>   lib/ethdev/ethdev_driver.h |  7 +++++++
>   lib/ethdev/rte_ethdev.c    | 28 ++++++++++++++++++++++++++++
>   lib/ethdev/rte_ethdev.h    | 30 +++++++++++++++++++++++++++++-
>   lib/ethdev/version.map     |  3 +++
>   4 files changed, 67 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc2..1e9cdbf 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -470,6 +470,10 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
>   				    const struct rte_eth_rxconf *rx_conf,
>   				    struct rte_mempool *mb_pool);
>   
> +typedef int (*eth_rx_queue_set_lwm_t)(struct rte_eth_dev *dev,
> +				      uint16_t rx_queue_id,
> +				      uint8_t lwm);
> +

Please, add full description including parameters and return
values.

>   /** @internal Setup a transmit queue of an Ethernet device. */
>   typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
>   				    uint16_t tx_queue_id,
> @@ -1283,6 +1287,9 @@ struct eth_dev_ops {
>   
>   	/** Dump private info from device */
>   	eth_dev_priv_dump_t eth_dev_priv_dump;
> +
> +	/** Set Rx queue limit watermark */
> +	eth_rx_queue_set_lwm_t rx_queue_set_lwm;
>   };
>   
>   /**
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index 29a3d80..1e4fc6a 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -4414,6 +4414,34 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
>   							queue_idx, tx_rate));
>   }
>   
> +int rte_eth_rx_queue_set_lwm(uint16_t port_id, uint16_t queue_idx,
> +			     uint8_t lwm)
> +{
> +	struct rte_eth_dev *dev;
> +	struct rte_eth_dev_info dev_info;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	if (queue_idx > dev_info.max_rx_queues) {

It should be >=
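
A minimal sketch of the intended bound check follows. The helper name is
illustrative only and is not ethdev API. Valid queue indexes run from 0 to
the queue count minus one, so an index equal to the count has to be rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative helper, not ethdev API: valid indexes are
 * 0 .. nb_queues - 1, so queue_idx == nb_queues is out of range. */
static bool
rx_queue_idx_is_valid(uint16_t queue_idx, uint16_t nb_queues)
{
	return queue_idx < nb_queues;
}
```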

> +		RTE_ETHDEV_LOG(ERR,
> +			"Set queue rate limit:port %u: invalid queue ID=%u\n",
> +			port_id, queue_idx);
> +		return -EINVAL;
> +	}
> +
> +	if (lwm > 99)
> +		return -EINVAL;
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_set_lwm, -ENOTSUP);
> +	return eth_err(port_id, (*dev->dev_ops->rx_queue_set_lwm)(dev,
> +							     queue_idx, lwm));
> +}
> +
>   RTE_INIT(eth_dev_init_fp_ops)
>   {
>   	uint32_t i;
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8e..f29e53b 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1249,8 +1249,12 @@ struct rte_eth_rxconf {
>   	 */
>   	union rte_eth_rxseg *rx_seg;
>   
> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	uint64_t reserved_64s;
> +	uint32_t reserved_32s;
> +	uint32_t lwm:8;
> +	uint32_t reserved_bits:24;

I strongly dislike bit fields for such a purpose. It should
be a uint8_t field.

Since we break ABI below anyway, we can break it here as well.

>   	void *reserved_ptrs[2];   /**< Reserved for future fields */
> +

No unrelated changes, please.

>   };
>   
>   /**
> @@ -3668,6 +3672,29 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t port_id,
>    */
>   int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Set Rx queue based limit watermark.
> + *
> + * @param port_id
> + *  The port identifier of the Ethernet device.
> + * @param queue_idx
> + *  The index of the receive queue
> + * @param lwm
> + *  The limit watermark percentage of Rx queue descriptor size.
> + *  The valid range is [0,99].
> + *  Setting 0 means disable limit watermark.
> + *
> + * @return
> + *   - (0) if successful.
> + *   - negative if failed.

Please, be precise with negative return values specification
and its meaning.

> + */
> +__rte_experimental
> +int rte_eth_rx_queue_set_lwm(uint16_t port_id, uint16_t queue_idx,
> +				uint8_t lwm);
> +
>   typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
>   		void *userdata);
>   
> @@ -3873,6 +3900,7 @@ enum rte_eth_event_type {
>   	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>   	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>   	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> +	RTE_ETH_EVENT_RXQ_LIMIT_REACHED,/**< RX queue limit reached */

RX -> Rx

As I understand it, this is an ABI breakage.

>   	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>   };
>   
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index 20391ab..8b85ad8 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -279,6 +279,9 @@ EXPERIMENTAL {
>   	rte_flow_async_action_handle_create;
>   	rte_flow_async_action_handle_destroy;
>   	rte_flow_async_action_handle_update;
> +
> +	# added in 22.07
> +	rte_eth_rx_queue_set_lwm;
>   };
>   
>   INTERNAL {



* [RFC v2 0/7] introduce per-queue limit watermark and host shaper
  2022-05-06  3:56   ` [RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper Spike Du
                       ` (6 preceding siblings ...)
  2022-05-06  3:56     ` [RFC v1 7/7] app/testpmd: add LWM and Host Shaper command Spike Du
@ 2022-05-22  5:58     ` Spike Du
  2022-05-22  5:58       ` [RFC v2 1/7] net/mlx5: add LWM support for Rxq Spike Du
                         ` (7 more replies)
  7 siblings, 8 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:58 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

LWM (limit watermark) is a per-Rx-queue attribute: when the Rx queue fullness reaches the LWM limit, the HW sends an event to the DPDK application.
The host shaper can configure the shaper rate and the lwm-triggered flag for a host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically when one of the host port's Rx queues receives an LWM event.

These two features can be combined to control traffic from the host port to the wire port.
The workflow is: configure LWM on the Rx queue and enable the lwm-triggered flag in the host shaper; after receiving an LWM event, wait until the Rx queue is empty, then disable the shaper. This workflow is repeated to reduce Rx queue drops.
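
The workflow above can be sketched as a small self-contained model.
Everything here (struct sketch_state and the sketch_* functions) is
hypothetical and only models the described behavior; it does not use the
ethdev or mlx5 APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Rx queue plus the host shaper. */
struct sketch_state {
	uint32_t rxq_used;    /* descriptors currently holding packets */
	uint32_t rxq_size;    /* total descriptors in the queue */
	uint8_t lwm_percent;  /* 0 disables the watermark */
	bool lwm_triggered;   /* shaper arms itself on an LWM event */
	bool shaper_on;       /* rate limiter currently active */
};

/* HW side: report an event once fullness reaches the watermark. */
static bool
sketch_hw_check_lwm(struct sketch_state *s)
{
	assert(s->rxq_size > 0);
	if (s->lwm_percent == 0)
		return false;
	if (s->rxq_used * 100 < (uint32_t)s->lwm_percent * s->rxq_size)
		return false;
	if (s->lwm_triggered)
		s->shaper_on = true;
	return true;
}

/* Application side: wait for the queue to drain, then lift the shaper. */
static void
sketch_app_handle_lwm(struct sketch_state *s)
{
	while (s->rxq_used > 0)
		s->rxq_used--;	/* stands in for polling packets */
	s->shaper_on = false;
}
```

After sketch_app_handle_lwm() returns, the queue is below the watermark
again, so the next sketch_hw_check_lwm() call starts a new cycle.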

Add a new ethdev library API to set LWM, and add the ethdev event RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle the LWM event. The host shaper does not align with the existing DPDK framework and is specific to Nvidia NICs, so it uses a PMD private API.

For integration with testpmd, put the private cmdline function and the LWM event handler in the mlx5 PMD directory by adding a new file, mlx5_testpmd.c. Only minimal code is added to testpmd to invoke the interfaces from mlx5_testpmd.c.

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based limit watermark
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based limit watermark
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c                       |  74 +++++
 app/test-pmd/config.c                        |  23 ++
 app/test-pmd/meson.build                     |   4 +
 app/test-pmd/testpmd.c                       |  24 ++
 app/test-pmd/testpmd.h                       |   1 +
 doc/guides/nics/mlx5.rst                     |  84 ++++++
 doc/guides/rel_notes/release_22_07.rst       |   2 +
 drivers/common/mlx5/linux/meson.build        |  13 +
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 +++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h               |  26 ++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 -----
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++-------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +---
 drivers/net/mlx5/mlx5.c                      |  68 +++++
 drivers/net/mlx5/mlx5.h                      |  12 +-
 drivers/net/mlx5/mlx5_devx.c                 |  60 +++-
 drivers/net/mlx5/mlx5_devx.h                 |   1 +
 drivers/net/mlx5/mlx5_rx.c                   | 292 +++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h                   |  13 +
 drivers/net/mlx5/mlx5_testpmd.c              | 184 ++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h              |  27 ++
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +-
 drivers/net/mlx5/rte_pmd_mlx5.h              |  30 ++
 drivers/net/mlx5/version.map                 |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 +--
 lib/ethdev/ethdev_driver.h                   |  22 ++
 lib/ethdev/rte_ethdev.c                      |  52 ++++
 lib/ethdev/rte_ethdev.h                      |  74 ++++-
 lib/ethdev/version.map                       |   4 +
 33 files changed, 1305 insertions(+), 309 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
2.27.0



* [RFC v2 1/7] net/mlx5: add LWM support for Rxq
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
@ 2022-05-22  5:58       ` Spike Du
  2022-05-22  5:58       ` [RFC v2 2/7] common/mlx5: share interrupt management Spike Du
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:58 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add an lwm (limit watermark) field to the Rxq object, which indicates the
percentage of the Rx queue size at which the HW raises an LWM event to the user.
Allow LWM setting in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.h      |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 ++++++++++++-
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee8cf..305edffe71 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
 	MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
 	MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
 	MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+	MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f9433a..c918a50ae9 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 	case MLX5_RXQ_MOD_RST2RDY:
 		rq_attr.rq_state = MLX5_RQC_STATE_RST;
 		rq_attr.state = MLX5_RQC_STATE_RDY;
+		if (rxq->lwm) {
+			rq_attr.modify_bitmask |=
+				MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+			rq_attr.lwm = rxq->lwm;
+		}
 		break;
 	case MLX5_RXQ_MOD_RDY2ERR:
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
 		rq_attr.state = MLX5_RQC_STATE_RST;
 		break;
+	case MLX5_RXQ_MOD_RDY2RDY:
+		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+		rq_attr.state = MLX5_RQC_STATE_RDY;
+		rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+		rq_attr.lwm = rxq->lwm;
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a6b9..ebd1da455a 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@ int mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 			 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6b62..25a5f2c1fa 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
+	uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
2.27.0



* [RFC v2 2/7] common/mlx5: share interrupt management
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
  2022-05-22  5:58       ` [RFC v2 1/7] net/mlx5: add LWM support for Rxq Spike Du
@ 2022-05-22  5:58       ` Spike Du
  2022-05-22  5:58       ` [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:58 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this, and replace all the related PMD code with this
API.
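
One step the shared helper performs is switching the fd to non-blocking
mode. A minimal POSIX sketch of that step, with both fcntl() calls
checked, could look like this. The name sketch_set_nonblock is
hypothetical and not part of the patch.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: put fd into non-blocking mode.
 * Returns 0 on success or a negative errno value on failure.
 */
static int
sketch_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	/* F_GETFL itself can fail, for example on a closed fd. */
	if (flags == -1)
		return -errno;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -errno;
	return 0;
}
```

Checking the F_GETFL result keeps a failed read of the flags from being
OR-ed into the F_SETFL call.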

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 ++++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 ----------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++---------------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +-------
 drivers/net/mlx5/mlx5.h                      |   2 -
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +---
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 ----
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 +------
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5cd1..f10a981a37 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include <dirent.h>
 #include <net/if.h>
+#include <fcntl.h>
 
 #include <rte_errno.h>
 #include <rte_string_fns.h>
@@ -964,3 +965,133 @@ mlx5_os_wrapped_mkey_destroy(struct mlx5_pmd_wrapped_mr *pmd_mr)
 		claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
 	memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	struct rte_intr_handle *tmp_intr_handle;
+	int ret, flags;
+
+	tmp_intr_handle = rte_intr_instance_alloc(mode);
+	if (!tmp_intr_handle) {
+		rte_errno = ENOMEM;
+		goto err;
+	}
+	if (set_fd_nonblock) {
+		flags = fcntl(fd, F_GETFL);
+		ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+		if (ret) {
+			rte_errno = errno;
+			goto err;
+		}
+	}
+	ret = rte_intr_fd_set(tmp_intr_handle, fd);
+	if (ret)
+		goto err;
+	ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+	if (ret)
+		goto err;
+	ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+	if (ret) {
+		rte_errno = -ret;
+		goto err;
+	}
+	return tmp_intr_handle;
+err:
+	if (tmp_intr_handle)
+		rte_intr_instance_free(tmp_intr_handle);
+	return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+			      rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+	uint64_t twait = 0;
+	uint64_t start = 0;
+
+	do {
+		int ret;
+
+		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+		if (ret >= 0)
+			return;
+		if (ret != -EAGAIN) {
+			DRV_LOG(INFO, "failed to unregister interrupt"
+				      " handler (error: %d)", ret);
+			MLX5_ASSERT(false);
+			return;
+		}
+		if (twait) {
+			struct timespec onems;
+
+			/* Wait one millisecond and try again. */
+			onems.tv_sec = 0;
+			onems.tv_nsec = NS_PER_S / MS_PER_S;
+			nanosleep(&onems, 0);
+			/* Check whether one second elapsed. */
+			if ((rte_get_timer_cycles() - start) <= twait)
+				continue;
+		} else {
+			/*
+			 * We get the amount of timer ticks for one second.
+			 * If this amount elapsed it means we spent one
+			 * second in waiting. This branch is executed once
+			 * on first iteration.
+			 */
+			twait = rte_get_timer_hz();
+			MLX5_ASSERT(twait);
+		}
+		/*
+		 * Timeout elapsed, show message (once a second) and retry.
+		 * We have no other acceptable option here, if we ignore
+		 * the unregistering return code the handler will not
+		 * be unregistered, fd will be closed and we may get the
+		 * crash. Hanging and messaging in the loop seems not to be
+		 * the worst choice.
+		 */
+		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+		start = rte_get_timer_cycles();
+	} while (true);
+}
+
+/**
+ * Rte_intr_handle destroy helper.
+ *
+ * @param[in] intr_handle
+ *   Rte_intr_handle to destroy.
+ * @param[in] cb
+ *   Callback which is registered to intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ */
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	if (rte_intr_fd_get(intr_handle) >= 0)
+		mlx5_intr_callback_unregister(intr_handle, cb, cb_arg);
+	rte_intr_instance_free(intr_handle);
+}
diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index 27f1192205..479bb3c7cb 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -15,6 +15,7 @@
 #include <rte_log.h>
 #include <rte_kvargs.h>
 #include <rte_devargs.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -299,4 +300,14 @@ __rte_internal
 int
 mlx5_get_device_guid(const struct rte_pci_addr *dev, uint8_t *guid, size_t len);
 
+__rte_internal
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg);
+
+__rte_internal
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg);
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a23a30a6c0..413dec14ab 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -153,5 +153,7 @@ INTERNAL {
 	mlx5_mr_mempool2mr_bh;
 	mlx5_mr_mempool_populate_cache;
 
+	mlx5_os_interrupt_handler_create; # WINDOWS_NO_EXPORT
+	mlx5_os_interrupt_handler_destroy; # WINDOWS_NO_EXPORT
 	local: *;
 };
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index ee7973f1ec..e9e9108127 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -9,6 +9,7 @@
 #include <sys/types.h>
 
 #include <rte_errno.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -253,4 +254,27 @@ void *mlx5_os_umem_reg(void *ctx, void *addr, size_t size, uint32_t access);
 __rte_internal
 int mlx5_os_umem_dereg(void *pumem);
 
+static inline struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)mode;
+	(void)set_fd_nonblock;
+	(void)fd;
+	(void)cb;
+	(void)cb_arg;
+	rte_errno = ENOTSUP;
+	return NULL;
+}
+
+static inline void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)intr_handle;
+	(void)cb;
+	(void)cb_arg;
+}
+
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 8fe73f1adb..a276b2ba4f 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -881,77 +881,6 @@ mlx5_dev_interrupt_handler(void *cb_arg)
 	}
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	/*
-	 * Try to reduce timeout management overhead by not calling
-	 * the timer related routines on the first iteration. If the
-	 * unregistering succeeds on first call there will be no
-	 * timer calls at all.
-	 */
-	uint64_t twait = 0;
-	uint64_t start = 0;
-
-	do {
-		int ret;
-
-		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
-		if (ret >= 0)
-			return;
-		if (ret != -EAGAIN) {
-			DRV_LOG(INFO, "failed to unregister interrupt"
-				      " handler (error: %d)", ret);
-			MLX5_ASSERT(false);
-			return;
-		}
-		if (twait) {
-			struct timespec onems;
-
-			/* Wait one millisecond and try again. */
-			onems.tv_sec = 0;
-			onems.tv_nsec = NS_PER_S / MS_PER_S;
-			nanosleep(&onems, 0);
-			/* Check whether one second elapsed. */
-			if ((rte_get_timer_cycles() - start) <= twait)
-				continue;
-		} else {
-			/*
-			 * We get the amount of timer ticks for one second.
-			 * If this amount elapsed it means we spent one
-			 * second in waiting. This branch is executed once
-			 * on first iteration.
-			 */
-			twait = rte_get_timer_hz();
-			MLX5_ASSERT(twait);
-		}
-		/*
-		 * Timeout elapsed, show message (once a second) and retry.
-		 * We have no other acceptable option here, if we ignore
-		 * the unregistering return code the handler will not
-		 * be unregistered, fd will be closed and we may get the
-		 * crush. Hanging and messaging in the loop seems not to be
-		 * the worst choice.
-		 */
-		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
-		start = rte_get_timer_cycles();
-	} while (true);
-}
-
 /**
  * Handle DEVX interrupts from the NIC.
  * This function is probably called from the DPDK host thread.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index a821153b35..0741028dab 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2494,40 +2494,6 @@ mlx5_os_net_cleanup(void)
 	mlx5_pmd_socket_uninit();
 }
 
-static int
-mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
-{
-	int nlsk_fd, flags, ret;
-
-	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
-	if (nlsk_fd < 0) {
-		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-	flags = fcntl(nlsk_fd, F_GETFL);
-	ret = fcntl(nlsk_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret != 0) {
-		DRV_LOG(ERR, "Failed to make Netlink event socket non-blocking: %s",
-			strerror(errno));
-		rte_errno = errno;
-		goto error;
-	}
-	rte_intr_type_set(sh->intr_handle_nl, RTE_INTR_HANDLE_EXT);
-	rte_intr_fd_set(sh->intr_handle_nl, nlsk_fd);
-	if (rte_intr_callback_register(sh->intr_handle_nl,
-				       mlx5_dev_interrupt_handler_nl,
-				       sh) != 0) {
-		DRV_LOG(ERR, "Failed to register Netlink events interrupt");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-		goto error;
-	}
-	return 0;
-error:
-	close(nlsk_fd);
-	return -1;
-}
-
 /**
  * Install shared asynchronous device events handler.
  * This function is implemented to support event sharing
@@ -2539,76 +2505,47 @@ mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
-	int ret;
-	int flags;
 	struct ibv_context *ctx = sh->cdev->ctx;
+	int nlsk_fd;
 
-	sh->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
+	sh->intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 ctx->async_fd, mlx5_dev_interrupt_handler, sh);
+	if (!sh->intr_handle) {
+		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle, -1);
-
-	flags = fcntl(ctx->async_fd, F_GETFL);
-	ret = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret) {
-		DRV_LOG(INFO, "failed to change file descriptor async event"
-			" queue");
-	} else {
-		rte_intr_fd_set(sh->intr_handle, ctx->async_fd);
-		rte_intr_type_set(sh->intr_handle, RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle,
-					mlx5_dev_interrupt_handler, sh)) {
-			DRV_LOG(INFO, "Fail to install the shared interrupt.");
-			rte_intr_fd_set(sh->intr_handle, -1);
-		}
+	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
+	if (nlsk_fd < 0) {
+		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
+			rte_strerror(rte_errno));
+		return;
 	}
-	sh->intr_handle_nl = rte_intr_instance_alloc
-						(RTE_INTR_INSTANCE_F_SHARED);
+	sh->intr_handle_nl = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 nlsk_fd, mlx5_dev_interrupt_handler_nl, sh);
 	if (sh->intr_handle_nl == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle_nl, -1);
-	if (mlx5_os_dev_shared_handler_install_lsc(sh) < 0) {
-		DRV_LOG(INFO, "Fail to install the shared Netlink event handler.");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-	}
 	if (sh->cdev->config.devx) {
 #ifdef HAVE_IBV_DEVX_ASYNC
-		sh->intr_handle_devx =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-		if (!sh->intr_handle_devx) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
-			rte_errno = ENOMEM;
-			return;
-		}
-		rte_intr_fd_set(sh->intr_handle_devx, -1);
+		struct mlx5dv_devx_cmd_comp *devx_comp;
+
 		sh->devx_comp = (void *)mlx5_glue->devx_create_cmd_comp(ctx);
-		struct mlx5dv_devx_cmd_comp *devx_comp = sh->devx_comp;
+		devx_comp = sh->devx_comp;
 		if (!devx_comp) {
 			DRV_LOG(INFO, "failed to allocate devx_comp.");
 			return;
 		}
-		flags = fcntl(devx_comp->fd, F_GETFL);
-		ret = fcntl(devx_comp->fd, F_SETFL, flags | O_NONBLOCK);
-		if (ret) {
-			DRV_LOG(INFO, "failed to change file descriptor"
-				" devx comp");
+		sh->intr_handle_devx = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 devx_comp->fd,
+			 mlx5_dev_interrupt_handler_devx, sh);
+		if (!sh->intr_handle_devx) {
+			DRV_LOG(ERR, "Failed to allocate intr_handle.");
 			return;
 		}
-		rte_intr_fd_set(sh->intr_handle_devx, devx_comp->fd);
-		rte_intr_type_set(sh->intr_handle_devx,
-					 RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle_devx,
-					mlx5_dev_interrupt_handler_devx, sh)) {
-			DRV_LOG(INFO, "Fail to install the devx shared"
-				" interrupt.");
-			rte_intr_fd_set(sh->intr_handle_devx, -1);
-		}
 #endif /* HAVE_IBV_DEVX_ASYNC */
 	}
 }
@@ -2624,24 +2561,13 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
-	int nlsk_fd;
-
-	if (rte_intr_fd_get(sh->intr_handle) >= 0)
-		mlx5_intr_callback_unregister(sh->intr_handle,
-					      mlx5_dev_interrupt_handler, sh);
-	rte_intr_instance_free(sh->intr_handle);
-	nlsk_fd = rte_intr_fd_get(sh->intr_handle_nl);
-	if (nlsk_fd >= 0) {
-		mlx5_intr_callback_unregister
-			(sh->intr_handle_nl, mlx5_dev_interrupt_handler_nl, sh);
-		close(nlsk_fd);
-	}
-	rte_intr_instance_free(sh->intr_handle_nl);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
+					  mlx5_dev_interrupt_handler, sh);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
+					  mlx5_dev_interrupt_handler_nl, sh);
 #ifdef HAVE_IBV_DEVX_ASYNC
-	if (rte_intr_fd_get(sh->intr_handle_devx) >= 0)
-		rte_intr_callback_unregister(sh->intr_handle_devx,
-				  mlx5_dev_interrupt_handler_devx, sh);
-	rte_intr_instance_free(sh->intr_handle_devx);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
+					  mlx5_dev_interrupt_handler_devx, sh);
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
diff --git a/drivers/net/mlx5/linux/mlx5_socket.c b/drivers/net/mlx5/linux/mlx5_socket.c
index 4882e5fa2f..0e01aff0e7 100644
--- a/drivers/net/mlx5/linux/mlx5_socket.c
+++ b/drivers/net/mlx5/linux/mlx5_socket.c
@@ -133,51 +133,6 @@ mlx5_pmd_socket_handle(void *cb __rte_unused)
 		fclose(file);
 }
 
-/**
- * Install interrupt handler.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @return
- *   0 on success, a negative errno value otherwise.
- */
-static int
-mlx5_pmd_interrupt_handler_install(void)
-{
-	MLX5_ASSERT(server_socket != -1);
-
-	server_intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
-	if (server_intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
-	if (rte_intr_fd_set(server_intr_handle, server_socket))
-		return -rte_errno;
-
-	if (rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	return rte_intr_callback_register(server_intr_handle,
-					  mlx5_pmd_socket_handle, NULL);
-}
-
-/**
- * Uninstall interrupt handler.
- */
-static void
-mlx5_pmd_interrupt_handler_uninstall(void)
-{
-	if (server_socket != -1) {
-		mlx5_intr_callback_unregister(server_intr_handle,
-					      mlx5_pmd_socket_handle,
-					      NULL);
-	}
-	rte_intr_fd_set(server_intr_handle, 0);
-	rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_UNKNOWN);
-	rte_intr_instance_free(server_intr_handle);
-}
-
 /**
  * Initialise the socket to communicate with external tools.
  *
@@ -224,7 +179,10 @@ mlx5_pmd_socket_init(void)
 			strerror(errno));
 		goto remove;
 	}
-	if (mlx5_pmd_interrupt_handler_install()) {
+	server_intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_PRIVATE, false,
+		 server_socket, mlx5_pmd_socket_handle, NULL);
+	if (server_intr_handle == NULL) {
 		DRV_LOG(WARNING, "cannot register interrupt handler for mlx5 socket: %s",
 			strerror(errno));
 		goto remove;
@@ -248,7 +206,8 @@ mlx5_pmd_socket_uninit(void)
 {
 	if (server_socket == -1)
 		return;
-	mlx5_pmd_interrupt_handler_uninstall();
+	mlx5_os_interrupt_handler_destroy(server_intr_handle,
+					  mlx5_pmd_socket_handle, NULL);
 	claim_zero(close(server_socket));
 	server_socket = -1;
 	MKSTR(path, MLX5_SOCKET_PATH, getpid());
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 305edffe71..7ebb2cc961 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1682,8 +1682,6 @@ int mlx5_sysfs_switch_info(unsigned int ifindex,
 			   struct mlx5_switch_info *info);
 void mlx5_translate_port_name(const char *port_name_in,
 			      struct mlx5_switch_info *port_info_out);
-void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-				   rte_intr_callback_fn cb_fn, void *cb_arg);
 int mlx5_sysfs_bond_info(unsigned int pf_ifindex, unsigned int *ifindex,
 			 char *ifname);
 int mlx5_get_module_info(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index fe74317fe8..f853a67f58 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -741,11 +741,8 @@ mlx5_txpp_interrupt_handler(void *cb_arg)
 static void
 mlx5_txpp_stop_service(struct mlx5_dev_ctx_shared *sh)
 {
-	if (!rte_intr_fd_get(sh->txpp.intr_handle))
-		return;
-	mlx5_intr_callback_unregister(sh->txpp.intr_handle,
-				      mlx5_txpp_interrupt_handler, sh);
-	rte_intr_instance_free(sh->txpp.intr_handle);
+	mlx5_os_interrupt_handler_destroy(sh->txpp.intr_handle,
+					  mlx5_txpp_interrupt_handler, sh);
 }
 
 /* Attach interrupt handler and fires first request to Rearm Queue. */
@@ -769,23 +766,12 @@ mlx5_txpp_start_service(struct mlx5_dev_ctx_shared *sh)
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	sh->txpp.intr_handle =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->txpp.intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
 	fd = mlx5_os_get_devx_channel_fd(sh->txpp.echan);
-	if (rte_intr_fd_set(sh->txpp.intr_handle, fd))
-		return -rte_errno;
-
-	if (rte_intr_type_set(sh->txpp.intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	if (rte_intr_callback_register(sh->txpp.intr_handle,
-				       mlx5_txpp_interrupt_handler, sh)) {
-		rte_intr_fd_set(sh->txpp.intr_handle, 0);
-		DRV_LOG(ERR, "Failed to register CQE interrupt %d.", rte_errno);
+	sh->txpp.intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, false,
+		 fd, mlx5_txpp_interrupt_handler, sh);
+	if (!sh->txpp.intr_handle) {
+		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		return -rte_errno;
 	}
 	/* Subscribe CQ event to the event channel controlled by the driver. */
diff --git a/drivers/net/mlx5/windows/mlx5_ethdev_os.c b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
index f97526580d..88d8213f55 100644
--- a/drivers/net/mlx5/windows/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
@@ -140,28 +140,6 @@ mlx5_set_mtu(struct rte_eth_dev *dev, uint16_t mtu)
 	return 0;
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	RTE_SET_USED(handle);
-	RTE_SET_USED(cb_fn);
-	RTE_SET_USED(cb_arg);
-}
-
 /**
  * DPDK callback to get flow control status.
  *
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
index e025be47d2..fd447cc650 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
@@ -93,22 +93,10 @@ mlx5_vdpa_virtqs_cleanup(struct mlx5_vdpa_priv *priv)
 static int
 mlx5_vdpa_virtq_unset(struct mlx5_vdpa_virtq *virtq)
 {
-	int ret = -EAGAIN;
-
-	if (rte_intr_fd_get(virtq->intr_handle) >= 0) {
-		while (ret == -EAGAIN) {
-			ret = rte_intr_callback_unregister(virtq->intr_handle,
-					mlx5_vdpa_virtq_kick_handler, virtq);
-			if (ret == -EAGAIN) {
-				DRV_LOG(DEBUG, "Try again to unregister fd %d of virtq %hu interrupt",
-					rte_intr_fd_get(virtq->intr_handle),
-					virtq->index);
-				usleep(MLX5_VDPA_INTR_RETRIES_USEC);
-			}
-		}
-		rte_intr_fd_set(virtq->intr_handle, -1);
-	}
-	rte_intr_instance_free(virtq->intr_handle);
+	int ret;
+
+	mlx5_os_interrupt_handler_destroy(virtq->intr_handle,
+					  mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->virtq) {
 		ret = mlx5_vdpa_virtq_stop(virtq->priv, virtq->index);
 		if (ret)
@@ -365,35 +353,13 @@ mlx5_vdpa_virtq_setup(struct mlx5_vdpa_priv *priv, int index)
 	virtq->priv = priv;
 	rte_write32(virtq->index, priv->virtq_db_addr);
 	/* Setup doorbell mapping. */
-	virtq->intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+	virtq->intr_handle = mlx5_os_interrupt_handler_create(
+				  RTE_INTR_INSTANCE_F_SHARED, false,
+				  vq.kickfd, mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->intr_handle == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		goto error;
 	}
-
-	if (rte_intr_fd_set(virtq->intr_handle, vq.kickfd))
-		goto error;
-
-	if (rte_intr_fd_get(virtq->intr_handle) == -1) {
-		DRV_LOG(WARNING, "Virtq %d kickfd is invalid.", index);
-	} else {
-		if (rte_intr_type_set(virtq->intr_handle, RTE_INTR_HANDLE_EXT))
-			goto error;
-
-		if (rte_intr_callback_register(virtq->intr_handle,
-					       mlx5_vdpa_virtq_kick_handler,
-					       virtq)) {
-			rte_intr_fd_set(virtq->intr_handle, -1);
-			DRV_LOG(ERR, "Failed to register virtq %d interrupt.",
-				index);
-			goto error;
-		} else {
-			DRV_LOG(DEBUG, "Register fd %d interrupt for virtq %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				index);
-		}
-	}
 	/* Subscribe virtq error event. */
 	virtq->version++;
 	cookie = ((uint64_t)virtq->version << 32) + index;
-- 
2.27.0



* [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
  2022-05-22  5:58       ` [RFC v2 1/7] net/mlx5: add LWM support for Rxq Spike Du
  2022-05-22  5:58       ` [RFC v2 2/7] common/mlx5: share interrupt management Spike Du
@ 2022-05-22  5:58       ` Spike Du
  2022-05-22 15:23         ` Stephen Hemminger
                           ` (2 more replies)
  2022-05-22  5:58       ` [RFC v2 4/7] net/mlx5: add LWM event handling support Spike Du
                         ` (4 subsequent siblings)
  7 siblings, 3 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:58 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

LWM (limit watermark) describes the fullness of an Rx queue. If the Rx
queue fullness is above the LWM, the device triggers the event
RTE_ETH_EVENT_RX_LWM.
LWM is defined as a percentage of the Rx queue size, with valid values in
[0,99].
Setting LWM to 0 disables it, which is the default.
When the percentage is translated to a number of queue descriptors, the
number should be bigger than 0 and less than the queue size.
Add LWM configuration and query driver callbacks to eth_dev_ops.
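
The percentage rules above can be sketched in a few lines of standalone C.
This is only an illustration, not code from the patch: the helper name
lwm_percent_to_desc() and its -1 error return are assumptions made for the
example, and a real driver may round differently.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): translate an LWM
 * percentage into a descriptor count for a queue of nb_desc entries.
 * Returns 0 on success, -1 if the percentage is out of [0,99] or the
 * resulting count is not strictly between 0 and nb_desc.
 */
static int
lwm_percent_to_desc(uint8_t lwm, uint32_t nb_desc, uint32_t *desc)
{
	uint32_t n;

	if (lwm > 99)
		return -1;
	if (lwm == 0) {
		/* 0 means the watermark is disabled. */
		*desc = 0;
		return 0;
	}
	n = (uint32_t)lwm * nb_desc / 100;
	/* A non-zero percentage must map to a usable descriptor count. */
	if (n == 0 || n >= nb_desc)
		return -1;
	*desc = n;
	return 0;
}
```

With a 64-entry queue, for instance, a 1% watermark truncates to zero
descriptors and is rejected by this sketch.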

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 lib/ethdev/ethdev_driver.h | 22 ++++++++++++
 lib/ethdev/rte_ethdev.c    | 52 +++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 74 +++++++++++++++++++++++++++++++++++++-
 lib/ethdev/version.map     |  4 +++
 4 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..12ec5e7e19 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
 				    const struct rte_eth_rxconf *rx_conf,
 				    struct rte_mempool *mb_pool);
 
+/**
+ * @internal Set Rx queue limit watermark.
+ * see @rte_eth_rx_lwm_set()
+ */
+typedef int (*eth_rx_queue_lwm_set_t)(struct rte_eth_dev *dev,
+				      uint16_t rx_queue_id,
+				      uint8_t lwm);
+
+/**
+ * @internal Query queue limit watermark.
+ * see @rte_eth_rx_lwm_query()
+ */
+
+typedef int (*eth_rx_queue_lwm_query_t)(struct rte_eth_dev *dev,
+					uint16_t *rx_queue_id,
+					uint8_t *lwm);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
 				    uint16_t tx_queue_id,
@@ -1168,6 +1185,11 @@ struct eth_dev_ops {
 	/** Priority flow control queue configure */
 	priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
 
+	/** Set Rx queue limit watermark */
+	eth_rx_queue_lwm_set_t rx_queue_lwm_set;
+	/** Query Rx queue limit watermark */
+	eth_rx_queue_lwm_query_t rx_queue_lwm_query;
+
 	/** Set Unicast Table Array */
 	eth_uc_hash_table_set_t    uc_hash_table_set;
 	/** Set Unicast hash bitmap */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 8520aec561..0a46c71288 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4429,6 +4429,58 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
 							queue_idx, tx_rate));
 }
 
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id,
+		       uint8_t lwm)
+{
+	struct rte_eth_dev *dev;
+	struct rte_eth_dev_info dev_info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id >= dev_info.max_rx_queues) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue LWM:port %u: invalid queue ID=%u.\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (lwm > 99)
+		return -EINVAL;
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_set, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_set)(dev,
+							     queue_id, lwm));
+}
+
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+			 uint8_t *lwm)
+{
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_dev *dev;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id == NULL)
+		return -EINVAL;
+	if (*queue_id >= dev_info.max_rx_queues)
+		*queue_id = 0;
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_query, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_query)(dev,
+							     queue_id, lwm));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
 	uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8ee10..687ae5ff29 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
 	 */
 	union rte_eth_rxseg *rx_seg;
 
-	uint64_t reserved_64s[2]; /**< Reserved for future fields */
+	/**
+	 * Per-queue Rx limit watermark defined as percentage of Rx queue
+	 * size. If Rx queue receives traffic higher than this percentage,
+	 * the event RTE_ETH_EVENT_RX_LWM is triggered.
+	 */
+	uint8_t lwm;
+
+	uint8_t reserved_bits[3];
+	uint32_t reserved_32s;
+	uint64_t reserved_64s;
 	void *reserved_ptrs[2];   /**< Reserved for future fields */
 };
 
@@ -3668,6 +3677,64 @@ int rte_eth_dev_get_vlan_offload(uint16_t port_id);
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue based limit watermark.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The index of the receive queue.
+ * @param lwm
+ *  The limit watermark percentage of Rx queue size which describes
+ *  the fullness of Rx queue. If the Rx queue fullness is above LWM,
+ *  the device will trigger the event RTE_ETH_EVENT_RX_LWM.
+ *  [1-99] to set a new LWM.
+ *  0 to disable watermark monitoring.
+ *
+ * @return
+ *   - 0 if successful.
+ *   - negative if failed.
+ */
+__rte_experimental
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id, uint8_t lwm);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Query Rx queue based limit watermark.
+ * The function queries all queues in the port circularly until one
+ * pending LWM event is found or no pending LWM event is found.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The API caller sets the starting Rx queue id in the pointer.
+ *  If the queue_id is bigger than the maximum queue id of the port,
+ *  it is rewound to 0 so that the application can keep calling
+ *  this function to handle all pending LWM events in the queues
+ *  with a simple increment between calls.
+ *  If an Rx queue has a pending LWM event, the pointer is updated
+ *  with this Rx queue id; otherwise this pointer's content is
+ *  unchanged.
+ * @param lwm
+ *  The pointer to the limit watermark percentage of Rx queue.
+ *  If Rx queue with pending lwm event is found, the queue's LWM
+ *  percentage is stored in this pointer, otherwise the pointer's
+ *  content is unchanged.
+ *
+ * @return
+ *   - 1 if a Rx queue with pending lwm event is found.
+ *   - 0 if no Rx queue with pending lwm event is found.
+ *   - -EINVAL if queue_id is NULL.
+ */
+__rte_experimental
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+			 uint8_t *lwm);
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
@@ -3873,6 +3940,11 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	/**
+	 *  watermark value is exceeded in a queue.
+	 *  see @rte_eth_rx_lwm_set()
+	 */
+	RTE_ETH_EVENT_RX_LWM,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 20391ab29e..5cf44c5d1d 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -279,6 +279,10 @@ EXPERIMENTAL {
 	rte_flow_async_action_handle_create;
 	rte_flow_async_action_handle_destroy;
 	rte_flow_async_action_handle_update;
+
+	# added in 22.07
+	rte_eth_rx_lwm_set;
+	rte_eth_rx_lwm_query;
 };
 
 INTERNAL {
-- 
2.27.0



* [RFC v2 4/7] net/mlx5: add LWM event handling support
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
                         ` (2 preceding siblings ...)
  2022-05-22  5:58       ` [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
@ 2022-05-22  5:58       ` Spike Du
  2022-05-22  5:58       ` [RFC v2 5/7] net/mlx5: support Rx queue based limit watermark Spike Du
                         ` (3 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:58 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

When LWM meets RQ WQE, the kernel driver raises an event to SW.
Use devx event_channel to catch this and to notify the user.
Allocate this channel per shared device.
Each event on the channel carries a cookie that identifies the port and
queue that raised it.
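
The cookie layout can be sketched as below. The offsets and masks are the
LWM_COOKIE_* values this patch adds to mlx5_rx.h; the two helper functions
are hypothetical names used only for illustration, since the driver does
the packing and unpacking inline.

```c
#include <stdint.h>

/* Values as defined by this patch in mlx5_rx.h. */
#define LWM_COOKIE_RXQID_OFFSET 0
#define LWM_COOKIE_RXQID_MASK 0xffff
#define LWM_COOKIE_PORTID_OFFSET 16
#define LWM_COOKIE_PORTID_MASK 0xffff

/* Hypothetical helper: build the cookie given at event subscription. */
static uint64_t
lwm_cookie_pack(uint16_t port_id, uint16_t rxq_id)
{
	return ((uint64_t)port_id << LWM_COOKIE_PORTID_OFFSET) |
	       ((uint64_t)rxq_id << LWM_COOKIE_RXQID_OFFSET);
}

/* Hypothetical helper: recover port and queue from an event cookie. */
static void
lwm_cookie_unpack(uint64_t cookie, int *port_id, int *rxq_idx)
{
	*port_id = (int)(((uint32_t)cookie >> LWM_COOKIE_PORTID_OFFSET) &
			 LWM_COOKIE_PORTID_MASK);
	*rxq_idx = (int)(((uint32_t)cookie >> LWM_COOKIE_RXQID_OFFSET) &
			 LWM_COOKIE_RXQID_MASK);
}
```

Only the low 32 bits of the cookie are used, which leaves 16 bits each
for the port id and the Rx queue index.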

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.c      | 66 ++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h      |  7 ++++
 drivers/net/mlx5/mlx5_devx.c | 47 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h   |  7 ++++
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f0988712df..e04a66625e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <errno.h>
+#include <fcntl.h>
 
 #include <rte_malloc.h>
 #include <ethdev_driver.h>
@@ -22,6 +23,7 @@
 #include <rte_eal_paging.h>
 #include <rte_alarm.h>
 #include <rte_cycles.h>
+#include <rte_interrupts.h>
 
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
@@ -1524,6 +1526,69 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 	return NULL;
 }
 
+/**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+	int fd_lwm;
+
+	pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+	priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+			(priv->sh->cdev->ctx,
+			 MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+	if (!priv->sh->devx_channel_lwm)
+		goto err;
+	fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+	priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+	if (!priv->sh->intr_handle_lwm)
+		goto err;
+	return 0;
+err:
+	if (priv->sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(priv->sh->devx_channel_lwm);
+		priv->sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+	return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before freeing this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+	if (sh->intr_handle_lwm) {
+		mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+			mlx5_dev_interrupt_handler_lwm, (void *)-1);
+		sh->intr_handle_lwm = NULL;
+	}
+	if (sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(sh->devx_channel_lwm);
+		sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
 /**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
@@ -1601,6 +1666,7 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 		claim_zero(mlx5_devx_cmd_destroy(sh->td));
 	MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
 	pthread_mutex_destroy(&sh->txpp.mutex);
+	mlx5_lwm_unset(sh);
 	mlx5_free(sh);
 	return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc961..a76f2fed3d 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
 	unsigned int flow_max_priority;
 	enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+	void *devx_channel_lwm;
+	struct rte_intr_handle *intr_handle_lwm;
+	pthread_mutex_t lwm_config_lock;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
 	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
 	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+	int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1603,6 +1608,8 @@ int mlx5_net_remove(struct mlx5_common_device *cdev);
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *priv);
+void mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh);
 
 /* Macro to iterate over all valid ports for mlx5 driver. */
 #define MLX5_ETH_FOREACH_DEV(port_id, dev) \
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index c918a50ae9..6886ae1f22 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -232,6 +232,52 @@ mlx5_rx_devx_get_event(struct mlx5_rxq_obj *rxq_obj)
 #endif /* HAVE_IBV_DEVX_EVENT */
 }
 
+/**
+ * Get LWM event for shared context, return the correct port/rxq for this event.
+ *
+ * @param priv
+ *   Mlx5_priv object.
+ * @param rxq_idx [out]
+ *   Which rxq gets this event.
+ * @param port_id [out]
+ *   Which port gets this event.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_rx_devx_get_event_lwm(struct mlx5_priv *priv, int *rxq_idx, int *port_id)
+{
+#ifdef HAVE_IBV_DEVX_EVENT
+	union {
+		struct mlx5dv_devx_async_event_hdr event_resp;
+		uint8_t buf[sizeof(struct mlx5dv_devx_async_event_hdr) + 128];
+	} out;
+	int ret;
+
+	memset(&out, 0, sizeof(out));
+	ret = mlx5_glue->devx_get_event(priv->sh->devx_channel_lwm,
+					&out.event_resp,
+					sizeof(out.buf));
+	if (ret < 0) {
+		rte_errno = errno;
+		DRV_LOG(WARNING, "%s err\n", __func__);
+		return -rte_errno;
+	}
+	*port_id = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_PORTID_OFFSET) & LWM_COOKIE_PORTID_MASK;
+	*rxq_idx = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_RXQID_OFFSET) & LWM_COOKIE_RXQID_MASK;
+	return 0;
+#else
+	(void)priv;
+	(void)rxq_idx;
+	(void)port_id;
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+#endif /* HAVE_IBV_DEVX_EVENT */
+}
+
 /**
  * Create a RQ object using DevX.
  *
@@ -1421,6 +1467,7 @@ struct mlx5_obj_ops devx_obj_ops = {
 	.rxq_event_get = mlx5_rx_devx_get_event,
 	.rxq_obj_modify = mlx5_devx_modify_rq,
 	.rxq_obj_release = mlx5_rxq_devx_obj_release,
+	.rxq_event_get_lwm = mlx5_rx_devx_get_event_lwm,
 	.ind_table_new = mlx5_devx_ind_table_new,
 	.ind_table_modify = mlx5_devx_ind_table_modify,
 	.ind_table_destroy = mlx5_devx_ind_table_destroy,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index e5eea0ad94..7d556c2b45 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -1187,3 +1187,36 @@ mlx5_check_vec_rx_support(struct rte_eth_dev *dev __rte_unused)
 {
 	return -ENOTSUP;
 }
+
+/**
+ * Rte interrupt handler for LWM event.
+ * It first checks if the event arrives, if so process the callback for
+ * RTE_ETH_EVENT_RX_LWM.
+ *
+ * @param args
+ *   Generic pointer to mlx5_priv.
+ */
+void
+mlx5_dev_interrupt_handler_lwm(void *args)
+{
+	struct mlx5_priv *priv = args;
+	struct mlx5_rxq_priv *rxq;
+	struct rte_eth_dev *dev;
+	int ret, rxq_idx = 0, port_id = 0;
+
+	ret = priv->obj_ops.rxq_event_get_lwm(priv, &rxq_idx, &port_id);
+	if (unlikely(ret < 0)) {
+		DRV_LOG(WARNING, "Cannot get LWM event context.");
+		return;
+	}
+	DRV_LOG(INFO, "%s get LWM event, port_id:%d rxq_id:%d.", __func__,
+		port_id, rxq_idx);
+	dev = &rte_eth_devices[port_id];
+	rxq = mlx5_rxq_get(dev, rxq_idx);
+	if (rxq) {
+		pthread_mutex_lock(&priv->sh->lwm_config_lock);
+		rxq->lwm_event_pending = 1;
+		pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	}
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_LWM, NULL);
+}
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 25a5f2c1fa..068dff5863 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -176,6 +176,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	uint32_t lwm_event_pending:1;
 };
 
 /* External RX queue descriptor. */
@@ -295,6 +296,7 @@ void mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
+void mlx5_dev_interrupt_handler_lwm(void *args);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
@@ -675,4 +677,9 @@ mlx5_is_external_rxq(struct rte_eth_dev *dev, uint16_t queue_idx)
 	return !!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED);
 }
 
+#define LWM_COOKIE_RXQID_OFFSET 0
+#define LWM_COOKIE_RXQID_MASK 0xffff
+#define LWM_COOKIE_PORTID_OFFSET 16
+#define LWM_COOKIE_PORTID_MASK 0xffff
+
 #endif /* RTE_PMD_MLX5_RX_H_ */
-- 
2.27.0



* [RFC v2 5/7] net/mlx5: support Rx queue based limit watermark
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
                         ` (3 preceding siblings ...)
  2022-05-22  5:58       ` [RFC v2 4/7] net/mlx5: add LWM event handling support Spike Du
@ 2022-05-22  5:58       ` Spike Du
  2022-05-22  5:58       ` [RFC v2 6/7] net/mlx5: add private API to config host port shaper Spike Du
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:58 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add mlx5 specific LWM (limit watermark) configuration and query handlers.
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.
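
The circular search done by the query handler can be sketched on a plain
array of pending flags. This is a simplified model under stated
assumptions, not the driver code: lwm_query_next() and the pending[] array
are hypothetical, and the locking done around the flag in the driver is
left out.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the query: scan nb_queues pending flags
 * circularly, starting from *queue_id. On a hit, clear the flag, store
 * the queue index in *queue_id and return 1. Return 0 if nothing is
 * pending (*queue_id is left unchanged), -1 on invalid arguments.
 */
static int
lwm_query_next(uint8_t *pending, uint16_t nb_queues, uint16_t *queue_id)
{
	uint16_t id, n;

	if (pending == NULL || queue_id == NULL || nb_queues == 0)
		return -1;
	/* An out-of-range start index wraps to the first queue. */
	id = *queue_id >= nb_queues ? 0 : *queue_id;
	for (n = 0; n < nb_queues; n++) {
		if (pending[id]) {
			pending[id] = 0;
			*queue_id = id;
			return 1;
		}
		id = (id + 1) % nb_queues;
	}
	return 0;
}
```

A caller can drain all pending events by incrementing the returned queue
index and calling the function again until it returns 0.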

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  12 ++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h         |   1 +
 drivers/net/mlx5/mlx5.c                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 156 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h             |   5 +
 6 files changed, 177 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56de11..79f56018ef 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- LWM:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
    $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+LWM introduction
+----------------
+
+LWM (Limit WaterMark) is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue fullness is above LWM, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index a60a0d5f16..253bc7e381 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -80,6 +80,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue LWM(Limit WaterMark) support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5100..3b5e60532a 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
 	MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+	MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a66625e..35ae51b3af 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ const struct eth_dev_ops mlx5_dev_ops = {
 	.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
 	.vlan_filter_set = mlx5_vlan_filter_set,
 	.rx_queue_setup = mlx5_rx_queue_setup,
+	.rx_queue_lwm_set = mlx5_rx_queue_lwm_set,
+	.rx_queue_lwm_query = mlx5_rx_queue_lwm_query,
 	.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
 	.tx_queue_setup = mlx5_tx_queue_setup,
 	.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 7d556c2b45..d30522e6df 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
 #include <mlx5_common_mr.h>
+#include <rte_pmd_mlx5.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,17 @@ mlx5_rx_descriptor_status(void *rx_queue, uint16_t offset)
 	return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+	uint32_t wqe_cnt = 1 << rxq_data->elts_n;
+
+	/* ethdev LWM describes fullness, mlx5 LWM describes emptiness. */
+	return rxq->lwm ? (100 - rxq->lwm * 100 / wqe_cnt) : 0;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +163,7 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 {
 	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
 	struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+	struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
 	if (!rxq)
 		return;
@@ -169,6 +183,8 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 	qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
 		RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
 		RTE_BIT32(rxq->elts_n);
+	qinfo->conf.lwm = rxq_priv ?
+		mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1204,34 @@ mlx5_check_vec_rx_support(struct rte_eth_dev *dev __rte_unused)
 	return -ENOTSUP;
 }
 
+int
+mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev,
+			uint16_t *queue_id, uint8_t *lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	unsigned int rxq_id, found = 0, n;
+	struct mlx5_rxq_priv *rxq;
+
+	if (!queue_id)
+		return -EINVAL;
+	/* Query all the Rx queues of the port in a circular way. */
+	for (rxq_id = *queue_id, n = 0; n < priv->rxqs_n; n++) {
+		rxq = mlx5_rxq_get(dev, rxq_id);
+		if (rxq && rxq->lwm_event_pending) {
+			pthread_mutex_lock(&priv->sh->lwm_config_lock);
+			rxq->lwm_event_pending = 0;
+			pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+			*queue_id = rxq_id;
+			found = 1;
+			if (lwm)
+				*lwm =  mlx5_rxq_lwm_to_percentage(rxq);
+			break;
+		}
+		rxq_id = (rxq_id + 1) % priv->rxqs_n;
+	}
+	return found;
+}
+
 /**
  * Rte interrupt handler for LWM event.
  * It first checks if the event arrives, if so process the callback for
@@ -1220,3 +1264,115 @@ mlx5_dev_interrupt_handler_lwm(void *args)
 	}
 	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_LWM, NULL);
 }
+
+/**
+ * DPDK callback to arm an Rx queue LWM (limit watermark) event.
+ * When the Rx queue fullness reaches the LWM limit, the driver catches
+ * a HW event and invokes the user event callback.
+ * After the last event handling, the user needs to call this API again
+ * to arm an additional event.
+ *
+ * @param dev
+ *   Pointer to the device structure.
+ * @param[in] rx_queue_id
+ *   Rx queue identifier.
+ * @param[in] lwm
+ *   The LWM value, defined as a percentage of the Rx queue size.
+ *   [1-99] to set a new LWM (update the old value).
+ *   0 to unarm the event.
+ *
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOMEM - not enough memory to create LWM event channel.
+ *   - EINVAL - the input Rxq is not created by devx.
+ *   - E2BIG  - lwm is bigger than 99.
+ */
+int
+mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+		      uint8_t lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint16_t port_id = PORT_ID(priv);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	uint16_t event_nums[1] = {MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED};
+	struct mlx5_rxq_data *rxq_data;
+	uint32_t wqe_cnt;
+	uint64_t cookie;
+	int ret = 0;
+
+	if (!rxq) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	rxq_data = &rxq->ctrl->rxq;
+	/* Ensure the Rq is created by devx. */
+	if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (lwm > 99) {
+		DRV_LOG(WARNING, "Too big LWM configuration.");
+		rte_errno = E2BIG;
+		return -rte_errno;
+	}
+	/* Start config LWM. */
+	pthread_mutex_lock(&priv->sh->lwm_config_lock);
+	if (rxq->lwm == 0 && lwm == 0) {
+		/* Both old/new values are 0, do nothing. */
+		ret = 0;
+		goto end;
+	}
+	wqe_cnt = 1 << rxq_data->elts_n;
+	if (lwm) {
+		if (!priv->sh->devx_channel_lwm) {
+			ret = mlx5_lwm_setup(priv);
+			if (ret) {
+				DRV_LOG(WARNING,
+					"Failed to create shared_lwm.");
+				rte_errno = ENOMEM;
+				ret = -rte_errno;
+				goto end;
+			}
+		}
+		if (!rxq->lwm_devx_subscribed) {
+			cookie = ((uint32_t)
+				  (port_id << LWM_COOKIE_PORTID_OFFSET)) |
+				(rx_queue_id << LWM_COOKIE_RXQID_OFFSET);
+			ret = mlx5_os_devx_subscribe_devx_event
+				(priv->sh->devx_channel_lwm,
+				 rxq->devx_rq.rq->obj,
+				 sizeof(event_nums),
+				 event_nums,
+				 cookie);
+			if (ret) {
+				rte_errno = rte_errno ? rte_errno : EINVAL;
+				ret = -rte_errno;
+				goto end;
+			}
+			rxq->lwm_devx_subscribed = 1;
+		}
+	}
+	/* The ethdev LWM describes fullness, mlx5 lwm describes emptiness. */
+	if (lwm)
+		lwm = 100 - lwm;
+	/* Save LWM to rxq and send modify_rq devx command. */
+	rxq->lwm = lwm * wqe_cnt / 100;
+	/* Prevent integer division loss when switching lwm number to percentage. */
+	if (lwm && (lwm * wqe_cnt % 100)) {
+		rxq->lwm = ((uint32_t)(rxq->lwm + 1) >= wqe_cnt) ?
+			rxq->lwm : (rxq->lwm + 1);
+	}
+	if (lwm && !rxq->lwm) {
+		/* With mprq, wqe_cnt may be < 100. */
+		DRV_LOG(WARNING, "Too small LWM configuration.");
+		rte_errno = EINVAL;
+		ret = -rte_errno;
+		goto end;
+	}
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RDY);
+end:
+	pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	return ret;
+}
+
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 068dff5863..e078aaf3dc 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -177,6 +177,7 @@ struct mlx5_rxq_priv {
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
 	uint32_t lwm_event_pending:1;
+	uint32_t lwm_devx_subscribed:1;
 };
 
 /* External RX queue descriptor. */
@@ -297,6 +298,10 @@ int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 void mlx5_dev_interrupt_handler_lwm(void *args);
+int mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+			  uint8_t lwm);
+int mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev, uint16_t *rx_queue_id,
+			    uint8_t *lwm);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
-- 
2.27.0



* [RFC v2 6/7] net/mlx5: add private API to config host port shaper
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
                         ` (4 preceding siblings ...)
  2022-05-22  5:58       ` [RFC v2 5/7] net/mlx5: support Rx queue based limit watermark Spike Du
@ 2022-05-22  5:58       ` Spike Du
  2022-05-22  5:59       ` [RFC v2 7/7] app/testpmd: add LWM and Host Shaper command Spike Du
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:58 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

The host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to decide whether to enable this function.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper can configure the shaper rate and the lwm-triggered flag for
a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM (limit watermark) event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  26 +++++++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 ++++
 drivers/common/mlx5/mlx5_prm.h         |  25 ++++++
 drivers/net/mlx5/mlx5.h                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 103 +++++++++++++++++++++++++
 drivers/net/mlx5/rte_pmd_mlx5.h        |  30 +++++++
 drivers/net/mlx5/version.map           |   2 +
 8 files changed, 202 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 79f56018ef..3da6f5a03c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField 2.
+  - When the host shaper is configured with MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED set,
+    only rates 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,22 @@ LWM (Limit WaterMark) is a per Rx queue attribute, it should be configured as
 a percentage of the Rx queue size.
 When Rx queue fullness is above LWM, an event is sent to PMD.
 
+Host shaper introduction
+------------------------
+
+The host shaper register is a per-host-port register that sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to the same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from the host.
+The host shaper has two modes for setting the shaper: immediate, and deferred
+to an LWM event trigger. In immediate mode, the rate limit is applied to the
+host shaper immediately. In deferred mode, the shaper is not set until an
+LWM event is received by any Rx queue in a VF representor belonging to the host
+port. The only rate supported for deferred mode is 100Mbps (there is no limit
+on the supported rates for immediate mode). In deferred mode, the shaper is set
+on the host port by the firmware upon receiving the LWM event, which allows
+throttling host traffic on LWM events at minimum latency, preventing excess
+drops in the Rx queue.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 253bc7e381..21879bda41 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -81,6 +81,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b027..51c6e5dd2e 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
     ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+    libmtcr_ul_found = true
+    ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
         [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
             'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+    has_sym_args += [
+        [  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+            'mopen'],
+    ]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
     config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3b5e60532a..92d05a7368 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3771,6 +3771,7 @@ enum {
 	MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
 	MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
 	MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC005,
+	MLX5_QSHR_REGISTER_ID = 0x4030,
 };
 
 struct mlx5_ifc_register_mtutc_bits {
@@ -3785,6 +3786,30 @@ struct mlx5_ifc_register_mtutc_bits {
 	u8 time_adjustment[0x20];
 };
 
+struct mlx5_ifc_ets_global_config_register_bits {
+	u8 reserved_at_0[0x2];
+	u8 rate_limit_update[0x1];
+	u8 reserved_at_3[0x29];
+	u8 max_bw_units[0x4];
+	u8 reserved_at_48[0x8];
+	u8 max_bw_value[0x8];
+};
+
+#define ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED      0x0
+#define ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS 0x3
+#define ETS_GLOBAL_CONFIG_BW_UNIT_GBPS          0x4
+
+struct mlx5_ifc_register_qshr_bits {
+	u8 reserved_at_0[0x4];
+	u8 connected_host[0x1];
+	u8 vqos[0x1];
+	u8 fast_response[0x1];
+	u8 reserved_at_7[0x1];
+	u8 local_port[0x8];
+	u8 reserved_at_16[0x230];
+	struct mlx5_ifc_ets_global_config_register_bits global_config;
+};
+
 #define MLX5_MTUTC_TIMESTAMP_MODE_INTERNAL_TIMER 0
 #define MLX5_MTUTC_TIMESTAMP_MODE_REAL_TIME 1
 
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index a76f2fed3d..8af84aef50 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1271,6 +1271,8 @@ struct mlx5_dev_ctx_shared {
 	void *devx_channel_lwm;
 	struct rte_intr_handle *intr_handle_lwm;
 	pthread_mutex_t lwm_config_lock;
+	uint32_t host_shaper_rate:8;
+	uint32_t lwm_triggered:1;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index d30522e6df..3f7b2620df 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -28,6 +28,9 @@
 #include "mlx5_rxtx.h"
 #include "mlx5_devx.h"
 #include "mlx5_rx.h"
+#ifdef HAVE_MLX5_MSTFLINT
+#include <mstflint/mtcr.h>
+#endif
 
 
 static __rte_always_inline uint32_t
@@ -1376,3 +1379,103 @@ mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 	return ret;
 }
 
+/**
+ * Mlx5 access register function to configure host shaper.
+ * It calls API in libmtcr_ul to access QSHR(Qos Shaper Host Register)
+ * in firmware.
+ *
+ * @param dev
+ *   Pointer to rte_eth_dev.
+ * @param lwm_triggered
+ *   Flag to enable/disable lwm_triggered bit in QSHR.
+ * @param rate
+ *   Host shaper rate, unit is 100Mbps, set to 0 means disable the shaper.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+static int
+mlxreg_host_shaper_config(struct rte_eth_dev *dev,
+			  bool lwm_triggered, uint8_t rate)
+{
+#ifdef HAVE_MLX5_MSTFLINT
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint32_t data[MLX5_ST_SZ_DW(register_qshr)] = {0};
+	int rc, retry_count = 3;
+	mfile *mf = NULL;
+	int status;
+	void *ptr;
+
+	mf = mopen(priv->sh->ibdev_name);
+	if (!mf) {
+		DRV_LOG(WARNING, "mopen failed\n");
+		rte_errno = ENOENT;
+		return -rte_errno;
+	}
+	MLX5_SET(register_qshr, data, connected_host, 1);
+	MLX5_SET(register_qshr, data, fast_response, lwm_triggered ? 1 : 0);
+	MLX5_SET(register_qshr, data, local_port, 1);
+	ptr = MLX5_ADDR_OF(register_qshr, data, global_config);
+	MLX5_SET(ets_global_config_register, ptr, rate_limit_update, 1);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_units,
+		 rate ? ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS :
+		 ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_value, rate);
+	do {
+		rc = maccess_reg(mf,
+				 MLX5_QSHR_REGISTER_ID,
+				 MACCESS_REG_METHOD_SET,
+				 (u_int32_t *)&data[0],
+				 sizeof(data),
+				 sizeof(data),
+				 sizeof(data),
+				 &status);
+		if ((rc != ME_ICMD_STATUS_IFC_BUSY &&
+		     status != ME_REG_ACCESS_BAD_PARAM) ||
+		    !(mf->flags & MDEVS_REM)) {
+			break;
+		}
+		DRV_LOG(WARNING, "%s retry.", __func__);
+		usleep(10000);
+	} while (retry_count-- > 0);
+	mclose(mf);
+	rte_errno = (rc == ME_REG_ACCESS_DEV_BUSY) ? EBUSY : EIO;
+	return rc ? -rte_errno : 0;
+#else
+	(void)dev;
+	(void)lwm_triggered;
+	(void)rate;
+	return -1;
+#endif
+}
+
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate,
+				    uint32_t flags)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	struct mlx5_priv *priv = dev->data->dev_private;
+	bool lwm_triggered =
+		!!(flags & RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+
+	if (!lwm_triggered) {
+		priv->sh->host_shaper_rate = rate;
+	} else {
+		switch (rate) {
+		case 0:
+		/* Rate 0 means disable lwm_triggered. */
+			priv->sh->lwm_triggered = 0;
+			break;
+		case 1:
+		/* Rate 1 means enable lwm_triggered. */
+			priv->sh->lwm_triggered = 1;
+			break;
+		default:
+			return -ENOTSUP;
+		}
+	}
+	return mlxreg_host_shaper_config(dev, priv->sh->lwm_triggered,
+					 priv->sh->host_shaper_rate);
+}
diff --git a/drivers/net/mlx5/rte_pmd_mlx5.h b/drivers/net/mlx5/rte_pmd_mlx5.h
index 6e7907ee59..9964126df5 100644
--- a/drivers/net/mlx5/rte_pmd_mlx5.h
+++ b/drivers/net/mlx5/rte_pmd_mlx5.h
@@ -109,6 +109,36 @@ __rte_experimental
 int rte_pmd_mlx5_external_rx_queue_id_unmap(uint16_t port_id,
 					    uint16_t dpdk_idx);
 
+/**
+ * The rate of the host port shaper will be updated directly at the next
+ * LWM event to the rate that comes with this flag set; set rate 0
+ * to disable this rate update.
+ * Unset this flag to update the rate of the host port shaper directly in
+ * the API call; use rate 0 to disable the current shaper.
+ */
+#define MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED 0
+
+/**
+ * Configure a HW shaper to limit Tx rate for a host port.
+ * The configuration will affect all the ethdev ports belonging to
+ * the same rte_device.
+ *
+ * @param[in] port_id
+ *   The port identifier of the Ethernet device.
+ * @param[in] rate
+ *   Unit is 100Mbps, setting the rate to 0 disables the shaper.
+ * @param[in] flags
+ *   Host shaper flags.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+__rte_experimental
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate, uint32_t flags);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/mlx5/version.map b/drivers/net/mlx5/version.map
index 79cb79acc6..c97dfe440a 100644
--- a/drivers/net/mlx5/version.map
+++ b/drivers/net/mlx5/version.map
@@ -12,4 +12,6 @@ EXPERIMENTAL {
 	# added in 22.03
 	rte_pmd_mlx5_external_rx_queue_id_map;
 	rte_pmd_mlx5_external_rx_queue_id_unmap;
+	# added in 22.07
+	rte_pmd_mlx5_host_shaper_config;
 };
-- 
2.27.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [RFC v2 7/7] app/testpmd: add LWM and Host Shaper command
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
                         ` (5 preceding siblings ...)
  2022-05-22  5:58       ` [RFC v2 6/7] net/mlx5: add private API to config host port shaper Spike Du
@ 2022-05-22  5:59       ` Spike Du
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-22  5:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add testpmd commands to configure the per-rxq LWM and the host shaper.
- Command syntax:
  set port <port_id> rxq <rxq_id> lwm <lwm_num>
  mlx5 set port <port_id> host_shaper lwm_triggered <0|1> rate <rate_num>

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50

Add sample code to handle the rxq LWM event: it waits a while so that the rxq
empties, then disables the host shaper and rearms the LWM event.
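
The sample handler defers its work to a delayed callback that takes a single
pointer argument, so the port and queue identifiers have to travel together in
that one value. A minimal sketch of such packing is shown below; the helper names
`sketch_pack_ids`, `sketch_ids_to_arg` and `sketch_unpack_ids` are hypothetical
and exist only for this illustration. The explicit widening cast before the
shift is a choice made in this sketch to keep the shift well defined for
queue IDs of 0x8000 and above.

```c
#include <stdint.h>

/* Pack a 16-bit port ID (low half) and a 16-bit Rx queue ID (high half). */
static inline uint32_t
sketch_pack_ids(uint16_t port_id, uint16_t rxq_id)
{
	/* Widen before shifting so the shift happens on an unsigned 32-bit value. */
	return (uint32_t)port_id | ((uint32_t)rxq_id << 16);
}

/* Carry the packed value in a callback argument of type void *. */
static inline void *
sketch_ids_to_arg(uint32_t packed)
{
	return (void *)(uintptr_t)packed;
}

/* Recover both IDs inside the callback. */
static inline void
sketch_unpack_ids(void *arg, uint16_t *port_id, uint16_t *rxq_id)
{
	uint32_t packed = (uint32_t)(uintptr_t)arg;

	*port_id = (uint16_t)(packed & 0xffff);
	*rxq_id = (uint16_t)(packed >> 16);
}
```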

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 app/test-pmd/cmdline.c          |  74 +++++++++++++
 app/test-pmd/config.c           |  23 ++++
 app/test-pmd/meson.build        |   4 +
 app/test-pmd/testpmd.c          |  24 +++++
 app/test-pmd/testpmd.h          |   1 +
 doc/guides/nics/mlx5.rst        |  46 ++++++++
 drivers/net/mlx5/mlx5_testpmd.c | 184 ++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h |  27 +++++
 8 files changed, 383 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 91e4090582..e8663dd797 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -67,6 +67,9 @@
 #include "cmdline_mtr.h"
 #include "cmdline_tm.h"
 #include "bpf_cmd.h"
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 static struct cmdline *testpmd_cl;
 
@@ -17803,6 +17806,73 @@ cmdline_parse_inst_t cmd_show_port_flow_transfer_proxy = {
 	}
 };
 
+/* *** SET LIMIT WATERMARK FOR AN RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t rxq;
+	uint16_t rxq_num;
+	cmdline_fixed_string_t lwm;
+	uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_rxq_lwm_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->rxq, "rxq") == 0)
+	    && (strcmp(res->lwm, "lwm") == 0))
+		ret = set_rxq_lwm(res->port_num, res->rxq_num,
+				  res->lwm_num);
+	if (ret < 0)
+		printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+	.f = cmd_rxq_lwm_parsed,
+	.data = (void *)0,
+	.help_str = "set port <port_id> rxq <rxq_id> lwm <lwm_num>: "
+		"Set lwm for rxq on port_id",
+	.tokens = {
+		(void *)&cmd_rxq_lwm_set,
+		(void *)&cmd_rxq_lwm_port,
+		(void *)&cmd_rxq_lwm_portnum,
+		(void *)&cmd_rxq_lwm_rxq,
+		(void *)&cmd_rxq_lwm_rxqnum,
+		(void *)&cmd_rxq_lwm_lwm,
+		(void *)&cmd_rxq_lwm_lwmnum,
+		NULL,
+	},
+};
+
 /* ******************************************************************************** */
 
 /* list of instructions */
@@ -18089,6 +18159,10 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_show_capability,
 	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
 	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
+	(cmdline_parse_inst_t *)&cmd_rxq_lwm,
+#ifdef RTE_NET_MLX5
+	(cmdline_parse_inst_t *)&mlx5_test_cmd_port_host_shaper,
+#endif
 	NULL,
 };
 
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 1b1e738f83..cac5fa1cf7 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -6342,3 +6342,26 @@ show_mcast_macs(portid_t port_id)
 		printf("  %s\n", buf);
 	}
 }
+
+int
+set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm)
+{
+	struct rte_eth_link link;
+	int ret;
+
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+	ret = eth_link_get_nowait_print_err(port_id, &link);
+	if (ret < 0)
+		return -EINVAL;
+	if (lwm > 99)
+		return -EINVAL;
+	ret = rte_eth_rx_lwm_set(port_id, queue_idx, lwm);
+
+	if (ret)
+		return ret;
+	/* Save the input lwm. */
+	ports[port_id].rx_conf[queue_idx].lwm = lwm;
+	return 0;
+}
+
diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 43130c8856..c3577a02c1 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -73,3 +73,7 @@ endif
 if dpdk_conf.has('RTE_NET_DPAA')
     deps += ['bus_dpaa', 'mempool_dpaa', 'net_dpaa']
 endif
+if dpdk_conf.has('RTE_NET_MLX5')
+    deps += 'net_mlx5'
+    sources += files('../../drivers/net/mlx5/mlx5_testpmd.c')
+endif
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 79bb23264b..789b3150d5 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -66,6 +66,9 @@
 #ifdef RTE_EXEC_ENV_WINDOWS
 #include <process.h>
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -417,6 +420,7 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RX_LWM] = "rxq limit reached",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -3551,6 +3555,10 @@ static int
 eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		  void *ret_param)
 {
+	struct rte_eth_dev_info dev_info;
+	uint16_t rxq_id;
+	uint8_t lwm;
+	int ret;
 	RTE_SET_USED(param);
 	RTE_SET_USED(ret_param);
 
@@ -3582,6 +3590,22 @@ eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		ports[port_id].port_status = RTE_PORT_CLOSED;
 		printf("Port %u is closed\n", port_id);
 		break;
+	case RTE_ETH_EVENT_RX_LWM:
+		ret = rte_eth_dev_info_get(port_id, &dev_info);
+		if (ret != 0)
+			break;
+		/* LWM query API rewinds rxq_id, no need to check max rxq num. */
+		for (rxq_id = 0; ; rxq_id++) {
+			ret = rte_eth_rx_lwm_query(port_id, &rxq_id, &lwm);
+			if (ret <= 0)
+				break;
+			printf("Received LWM event, port:%d rxq_id:%d\n",
+			       port_id, rxq_id);
+#ifdef RTE_NET_MLX5
+			mlx5_test_lwm_event_handler(port_id, rxq_id);
+#endif
+		}
+		break;
 	default:
 		break;
 	}
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 31f766c965..b570ea7c69 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1163,6 +1163,7 @@ int update_jumbo_frame_offload(portid_t portid);
 void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
 void flex_item_destroy(portid_t port_id, uint16_t flex_id);
 void port_flex_item_flush(portid_t port_id);
+int set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm);
 
 extern int flow_parse(const char *src, void *result, unsigned int size,
 		      struct rte_flow_attr **attr,
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 3da6f5a03c..fb1c957544 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1718,3 +1718,49 @@ on the host port by the firmware upon receiving the LWM event, which allows
 throttling host traffic on LWM events at minimum latency, preventing excess
 drops in the Rx queue.
 
+How to use LWM and Host Shaper
+------------------------------
+
+Testpmd provides sample commands to configure LWM.
+It also contains sample logic to handle the LWM event.
+The typical workflow is: testpmd configures LWM for the Rx queues, enables
+lwm_triggered in the host shaper and registers a callback. When traffic from
+the host is too high and Rx queue fullness is above LWM, the PMD receives an
+event and firmware configures a 100Mbps shaper on the host port automatically.
+The PMD then calls the registered callback, which waits a while to let the Rx
+queue empty, then disables the host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is the uplink, port 1
+is a VF representor. Each port has 2 Rx queues.
+To control traffic from the host to ARM, we can enable LWM in testpmd with:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 lwm 70
+   testpmd> set port 1 rxq 1 lwm 70
+
+The first command disables the current host shaper and enables LWM-triggered mode.
+The remaining commands configure LWM to 70% of the Rx queue size for both Rx queues.
+When traffic from the host is too high, the testpmd console logs that an
+LWM event was received, and then the host shaper is disabled.
+The traffic rate from the host is controlled and fewer drops happen in the Rx queues.
+
+To disable LWM and lwm_triggered, we can invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 lwm 0
+   testpmd> set port 1 rxq 1 lwm 0
+
+It is recommended that an application disable LWM and lwm_triggered before
+exiting, if it enabled them earlier.
+
+We can also configure the shaper with a value. The rate unit is 100Mbps; the
+command below sets the current shaper to 5Gbps and disables lwm_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50
+
diff --git a/drivers/net/mlx5/mlx5_testpmd.c b/drivers/net/mlx5/mlx5_testpmd.c
new file mode 100644
index 0000000000..a452b3f6e5
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#include <rte_prefetch.h>
+#include <rte_common.h>
+#include <rte_branch_prediction.h>
+#include <rte_ether.h>
+#include <rte_alarm.h>
+#include <rte_pmd_mlx5.h>
+#include <rte_ethdev.h>
+#include "mlx5_testpmd.h"
+
+static uint8_t host_shaper_lwm_triggered[RTE_MAX_ETHPORTS];
+#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
+
+/**
+ * Disable the host shaper and re-arm LWM event.
+ *
+ * @param[in] args
+ *   uint32_t integer combining port_id and rxq_id.
+ */
+static void
+mlx5_test_host_shaper_disable(void *args)
+{
+	uint32_t port_rxq_id = (uint32_t)(uintptr_t)args;
+	uint16_t port_id = port_rxq_id & 0xffff;
+	uint16_t qid = (port_rxq_id >> 16) & 0xffff;
+	struct rte_eth_rxq_info qinfo;
+
+	printf("%s disable shaper\n", __func__);
+	if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
+		printf("rx_queue_info_get returns error\n");
+		return;
+	}
+	/* Rearm the LWM event. */
+	if (rte_eth_rx_lwm_set(port_id, qid, qinfo.conf.lwm)) {
+		printf("config lwm returns error\n");
+		return;
+	}
+	/* Only disable the shaper when lwm_triggered is set. */
+	if (host_shaper_lwm_triggered[port_id] &&
+	    rte_pmd_mlx5_host_shaper_config(port_id, 0, 0))
+		printf("%s disable shaper returns error\n", __func__);
+}
+
+void
+mlx5_test_lwm_event_handler(uint16_t port_id, uint16_t rxq_id)
+{
+	uint32_t port_rxq_id = port_id | ((uint32_t)rxq_id << 16);
+
+	rte_eal_alarm_set(SHAPER_DISABLE_DELAY_US,
+			  mlx5_test_host_shaper_disable,
+			  (void *)(uintptr_t)port_rxq_id);
+	printf("%s port_id:%u rxq_id:%u\n", __func__, port_id, rxq_id);
+}
+
+/**
+ * Configure host shaper's lwm_triggered and current rate.
+ *
+ * @param[in] lwm_triggered
+ *   Disable/enable lwm_triggered.
+ * @param[in] rate
+ *   Configure current host shaper rate.
+ * @return
+ *   On success, returns 0.
+ *   On failure, returns < 0.
+ */
+static int
+mlx5_test_set_port_host_shaper(uint16_t port_id, uint16_t lwm_triggered, uint8_t rate)
+{
+	struct rte_eth_link link;
+	bool port_id_valid = false;
+	uint16_t pid;
+	int ret;
+
+	RTE_ETH_FOREACH_DEV(pid)
+		if (port_id == pid) {
+			port_id_valid = true;
+			break;
+		}
+	if (!port_id_valid)
+		return -EINVAL;
+	ret = rte_eth_link_get_nowait(port_id, &link);
+	if (ret < 0)
+		return ret;
+	host_shaper_lwm_triggered[port_id] = lwm_triggered ? 1 : 0;
+	if (!lwm_triggered) {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 0,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	} else {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 1,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	}
+	if (ret)
+		return ret;
+	ret = rte_pmd_mlx5_host_shaper_config(port_id, rate, 0);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+/* *** SET HOST_SHAPER FOR A PORT *** */
+struct cmd_port_host_shaper_result {
+	cmdline_fixed_string_t mlx5;
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t host_shaper;
+	cmdline_fixed_string_t lwm_triggered;
+	uint16_t fr;
+	cmdline_fixed_string_t rate;
+	uint8_t rate_num;
+};
+
+static void cmd_port_host_shaper_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_port_host_shaper_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->mlx5, "mlx5") == 0) &&
+	    (strcmp(res->set, "set") == 0) &&
+	    (strcmp(res->port, "port") == 0) &&
+	    (strcmp(res->host_shaper, "host_shaper") == 0) &&
+	    (strcmp(res->lwm_triggered, "lwm_triggered") == 0) &&
+	    (strcmp(res->rate, "rate") == 0))
+		ret = mlx5_test_set_port_host_shaper(res->port_num, res->fr,
+					   res->rate_num);
+	if (ret < 0)
+		printf("cmd_port_host_shaper error: (%s)\n", strerror(-ret));
+}
+
+cmdline_parse_token_string_t cmd_port_host_shaper_mlx5 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				mlx5, "mlx5");
+cmdline_parse_token_string_t cmd_port_host_shaper_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_port_host_shaper_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_port_host_shaper_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_host_shaper =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 host_shaper, "host_shaper");
+cmdline_parse_token_string_t cmd_port_host_shaper_lwm_triggered =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 lwm_triggered, "lwm_triggered");
+cmdline_parse_token_num_t cmd_port_host_shaper_fr =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      fr, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_rate =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 rate, "rate");
+cmdline_parse_token_num_t cmd_port_host_shaper_rate_num =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      rate_num, RTE_UINT8);
+cmdline_parse_inst_t mlx5_test_cmd_port_host_shaper = {
+	.f = cmd_port_host_shaper_parsed,
+	.data = (void *)0,
+	.help_str = "mlx5 set port <port_id> host_shaper lwm_triggered <0|1> "
+	"rate <rate_num>: Set HOST_SHAPER lwm_triggered and rate with port_id",
+	.tokens = {
+		(void *)&cmd_port_host_shaper_mlx5,
+		(void *)&cmd_port_host_shaper_set,
+		(void *)&cmd_port_host_shaper_port,
+		(void *)&cmd_port_host_shaper_portnum,
+		(void *)&cmd_port_host_shaper_host_shaper,
+		(void *)&cmd_port_host_shaper_lwm_triggered,
+		(void *)&cmd_port_host_shaper_fr,
+		(void *)&cmd_port_host_shaper_rate,
+		(void *)&cmd_port_host_shaper_rate_num,
+		NULL,
+	}
+};
diff --git a/drivers/net/mlx5/mlx5_testpmd.h b/drivers/net/mlx5/mlx5_testpmd.h
new file mode 100644
index 0000000000..50f3cf0bf9
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_TEST_H_
+#define RTE_PMD_MLX5_TEST_H_
+
+#include <cmdline_parse.h>
+#include <cmdline_parse_num.h>
+#include <cmdline_parse_string.h>
+
+/**
+ * RTE_ETH_EVENT_RX_LWM handler sample code.
+ * It is called in testpmd; the workflow here is to wait a while until the
+ * Rx queue is empty, then disable the host shaper.
+ *
+ * @param[in] port_id
+ *   Port identifier.
+ * @param[in] rxq_id
+ *   Rx queue identifier.
+ */
+void
+mlx5_test_lwm_event_handler(uint16_t port_id, uint16_t rxq_id);
+
+extern cmdline_parse_inst_t mlx5_test_cmd_port_host_shaper;
+#endif
-- 
2.27.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-22  5:58       ` [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
@ 2022-05-22 15:23         ` Stephen Hemminger
  2022-05-23  3:01           ` Spike Du
  2022-05-22 15:24         ` Stephen Hemminger
  2022-05-23  6:07         ` Morten Brørup
  2 siblings, 1 reply; 131+ messages in thread
From: Stephen Hemminger @ 2022-05-22 15:23 UTC (permalink / raw)
  To: Spike Du; +Cc: matan, viacheslavo, orika, thomas, dev, rasland

On Sun, 22 May 2022 08:58:56 +0300
Spike Du <spiked@nvidia.com> wrote:

> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04cff8ee10..687ae5ff29 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>  	 */
>  	union rte_eth_rxseg *rx_seg;
>  
> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	/**
> +	 * Per-queue Rx limit watermark defined as percentage of Rx queue
> +	 * size. If Rx queue receives traffic higher than this percentage,
> +	 * the event RTE_ETH_EVENT_RX_LWM is triggered.
> +	 */
> +	uint8_t lwm;
> +
> +	uint8_t reserved_bits[3];
> +	uint32_t reserved_32s;
> +	uint64_t reserved_64s;
>  	void *reserved_ptrs[2];   /**< Reserved for future fields */
>  };
>  

OK, but there is an ABI risk here because the reserved fields were never required
to be zero before. Whenever a reserved field is introduced, the code (in this case
rte_ethdev_configure) should check that it is zero.

Best practice would have been to have the code require all reserved fields be 0
in earlier releases. In this case an application is likely to define a watermark
of zero; how will your code handle it?
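
To illustrate the reserved-field practice mentioned above, here is a minimal
sketch of a configure-time check that rejects non-zero reserved fields. The
struct `sketch_rxconf` and the function `sketch_rxconf_check` are hypothetical
and only mirror the layout quoted from the patch; this is not the ethdev code.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the tail of the Rx queue configuration struct. */
struct sketch_rxconf {
	uint8_t lwm;              /* Limit watermark, percent of queue size. */
	uint8_t reserved_bits[3]; /* Must be zero. */
	uint32_t reserved_32s;    /* Must be zero. */
	uint64_t reserved_64s;    /* Must be zero. */
};

/*
 * Return 0 if every reserved field is zero, -EINVAL otherwise.
 * Enforcing this from the first release means a reserved field can later
 * be given a meaning without misreading stale application data.
 */
static int
sketch_rxconf_check(const struct sketch_rxconf *conf)
{
	static const uint8_t zero[sizeof(conf->reserved_bits)];

	if (memcmp(conf->reserved_bits, zero, sizeof(zero)) != 0 ||
	    conf->reserved_32s != 0 || conf->reserved_64s != 0)
		return -EINVAL;
	return 0;
}
```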

Also, using 8 bits as a percentage is different from how other APIs handle this.
Since Rx queue size is in packets, why is this not in packets?
Also, document what the behavior of 0 is.
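
For reference, the relationship between the percentage and a descriptor count
can be sketched as below. This is only an illustration of the rules stated in
the patch description (valid range [0,99], 0 disables LWM, and the translated
count must be greater than 0 and less than the queue size); the function name
`sketch_lwm_to_desc` and the round-down behaviour are assumptions, not the
driver's actual conversion.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Translate an LWM percentage into a descriptor count for a queue of
 * nb_desc descriptors. On success, *desc holds the count (0 when LWM is
 * disabled) and 0 is returned; otherwise -EINVAL is returned.
 */
static int
sketch_lwm_to_desc(uint16_t nb_desc, uint8_t lwm_pct, uint16_t *desc)
{
	uint32_t n;

	if (lwm_pct > 99)
		return -EINVAL;
	if (lwm_pct == 0) {
		*desc = 0; /* LWM disabled. */
		return 0;
	}
	/* Integer division rounds down (assumed here). */
	n = (uint32_t)nb_desc * lwm_pct / 100;
	if (n == 0 || n >= nb_desc)
		return -EINVAL;
	*desc = (uint16_t)n;
	return 0;
}
```

With a small queue, a low percentage can translate to zero descriptors and be
rejected, which is one practical difference from specifying the value in
packets directly.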

Why introduce new query/set operations? This should just be part of the overall
device configuration. 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-22  5:58       ` [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
  2022-05-22 15:23         ` Stephen Hemminger
@ 2022-05-22 15:24         ` Stephen Hemminger
  2022-05-23  2:18           ` Spike Du
  2022-05-23  6:07         ` Morten Brørup
  2 siblings, 1 reply; 131+ messages in thread
From: Stephen Hemminger @ 2022-05-22 15:24 UTC (permalink / raw)
  To: Spike Du; +Cc: matan, viacheslavo, orika, thomas, dev, rasland

On Sun, 22 May 2022 08:58:56 +0300
Spike Du <spiked@nvidia.com> wrote:

> LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
> queue fullness is above LWM, the device will trigger the event
> RTE_ETH_EVENT_RX_LWM.
> LWM is defined as a percentage of Rx queue size with valid value of
> [0,99].
> Setting LWM to 0 means disable it, which is the default.
> When translating the percentage to a queue descriptor number, the number
> should be bigger than 0 and less than queue size.
> Add LWM's configuration and query driver callbacks in eth_dev_ops.
> 
> Signed-off-by: Spike Du <spiked@nvidia.com>

One other objection, please don't invent yet another event channel
for this. It should be part of existing Rx interrupt logic.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-22 15:24         ` Stephen Hemminger
@ 2022-05-23  2:18           ` Spike Du
  0 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-23  2:18 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dev, Raslan Darawsheh

Hi,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Sunday, May 22, 2022 11:25 PM
> To: Spike Du <spiked@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> On Sun, 22 May 2022 08:58:56 +0300
> Spike Du <spiked@nvidia.com> wrote:
> 
> > LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
> > queue fullness is above LWM, the device will trigger the event
> > RTE_ETH_EVENT_RX_LWM.
> > LWM is defined as a percentage of Rx queue size with valid value of
> > [0,99].
> > Setting LWM to 0 means disable it, which is the default.
> > When translating the percentage to a queue descriptor number, the number
> > should be bigger than 0 and less than queue size.
> > Add LWM's configuration and query driver callbacks in eth_dev_ops.
> >
> > Signed-off-by: Spike Du <spiked@nvidia.com>
> 
> One other objection, please don't invent yet another event channel for this.
> It should be part of existing Rx interrupt logic.

I think this is a misunderstanding: the "event channel" is a concept specific to the MLX5 PMD.
For DPDK common code such as testpmd and event registration/callback, I'm using the standard DPDK
interfaces.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-22 15:23         ` Stephen Hemminger
@ 2022-05-23  3:01           ` Spike Du
  2022-05-23 21:45             ` Thomas Monjalon
  2022-05-23 22:54             ` Stephen Hemminger
  0 siblings, 2 replies; 131+ messages in thread
From: Spike Du @ 2022-05-23  3:01 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dev, Raslan Darawsheh

Hi, please see below.

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Sunday, May 22, 2022 11:23 PM
> To: Spike Du <spiked@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> On Sun, 22 May 2022 08:58:56 +0300
> Spike Du <spiked@nvidia.com> wrote:
> 
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..687ae5ff29 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> >        */
> >       union rte_eth_rxseg *rx_seg;
> >
> > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > +     /**
> > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > +      * size. If Rx queue receives traffic higher than this percentage,
> > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > +      */
> > +     uint8_t lwm;
> > +
> > +     uint8_t reserved_bits[3];
> > +     uint32_t reserved_32s;
> > +     uint64_t reserved_64s;
> >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> >  };
> >
> 
> Ok but, this is an ABI risk about this because reserved stuff was never
> required before.
> Whenever is a reserved field is introduced the code (in this case
> rte_ethdev_configure).
> 
> Best practice would have been to have the code require all reserved fields be
> 0 in earlier releases. In this case an application is like to define a watermark of
> zero; how will your code handle it.
A watermark of 0 is intended; it is the default. An LWM of 0 means the Rx
queue's watermark is not monitored, hence no LWM event is generated.
> 
> Also, using 8 bits as percentage is different than how other API's handle this.
> Since Rx queue size is in packets, why is this not in packets?
The short answer is to simplify the LWM configuration.
Rx queue descriptors are complex nowadays.
For a normal queue, the user can easily configure LWM in terms of the queue descriptor number.
But for the queues below, it's not easy.
Take MPRQ as an example; the testpmd command options can be " -a 0000:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,mprq_log_stride_size=8,mprq_log_stride_num=3
-- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ".
In the MLX5 implementation, the minimum "unit" in the queue has 64 descriptors and the number of "units" is 16. If you configure according to the descriptor number (1024),
you may easily set LWM to something like 512, but HW doesn't allow it, because 512 > 16. If you want the watermark to be half, the correct value is 8.
The same issue arises with features like "Rx queue buffer split", where a packet can be split across multiple descriptors.
Using a percentage doesn't have such issues; the PMD will cover all the details.
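
As a rough illustration of the translation described above, here is a minimal sketch in C. The helper name lwm_percent_to_units() and its rounding rules are assumptions made for this example, not the mlx5 PMD code. It only shows how a driver could map a percentage onto whatever "unit" count the hardware uses (16 units for the MPRQ case above, 1024 descriptors for a plain queue).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from any driver: translate an LWM
 * percentage into a count of hardware queue "units".
 * Returns 0 when LWM is disabled (0), out of the [0,99] range,
 * or when the queue is too small to hold a meaningful threshold.
 */
static uint32_t
lwm_percent_to_units(uint8_t lwm_percent, uint32_t nb_units)
{
	uint32_t units;

	if (lwm_percent == 0 || lwm_percent > 99 || nb_units < 2)
		return 0;
	units = (uint32_t)(((uint64_t)nb_units * lwm_percent) / 100);
	/* A non-zero percentage must not round down to "disabled". */
	if (units == 0)
		units = 1;
	/* The result must stay bigger than 0 and less than the queue size. */
	assert(units > 0 && units < nb_units);
	return units;
}
```

With these assumptions, 50% of the 16-unit MPRQ queue gives 8, and 50% of a plain 1024-descriptor queue gives 512, so the application never needs to know which unit the hardware counts in.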

> Also document what behavior of 0 is.
Sure. The behavior is the same as before this feature existed, please see above.

> Why introduce new query/set operations? This should just be part of the
> overall device configuration.
Because the implementation is different. LWM can be a dynamic configuration, which helps the user design flexible flow control.
The user may be fine with an LWM of 80% to get high throughput, and later switch to 50% to throttle traffic more responsively by handling the LWM event, in order to reduce drops.
Some drivers, like mlx5, may implement the LWM event as one-shot. When you receive an LWM event, you need to reconfigure LWM in order to receive the event again, so you are
not likely to be overwhelmed by events.
All of these require a set operation.

As for the query operation: the event API rte_eth_dev_callback_process() is a per-port API; it doesn't carry much information when an event happens.
When an LWM event happens, we need to know on which Rx queue it happened and, optionally, the current LWM percentage of that queue.
The query operation serves this purpose.
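
The set/query flow described above can be sketched with a small mock. Everything here (struct mock_port and the mock_* functions) is hypothetical and stands in for the device and driver; it is not the proposed ethdev API. It only models a one-shot event that must be re-armed by "set", and a "query" that tells the handler which queue fired.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NB_RXQ 4

/* Hypothetical stand-in for a port; not a DPDK structure. */
struct mock_port {
	uint8_t lwm[NB_RXQ];   /* configured LWM percentage, 0 = disabled */
	bool armed[NB_RXQ];    /* one-shot: cleared when the event fires */
	bool pending[NB_RXQ];  /* event fired, not yet reported by a query */
};

/* "set": configure the watermark and (re)arm the one-shot event. */
static int
mock_lwm_set(struct mock_port *p, uint16_t q, uint8_t lwm)
{
	if (q >= NB_RXQ || lwm > 99)
		return -1;
	p->lwm[q] = lwm;
	p->armed[q] = (lwm != 0);
	return 0;
}

/* Device side: report queue fullness; returns true if an event fires. */
static bool
mock_rx_fullness(struct mock_port *p, uint16_t q, uint8_t percent)
{
	assert(q < NB_RXQ);
	if (!p->armed[q] || percent <= p->lwm[q])
		return false;
	p->armed[q] = false;   /* one-shot: silent until re-armed by "set" */
	p->pending[q] = true;
	return true;
}

/* "query": the per-port callback does not say which queue fired. */
static int
mock_lwm_query(struct mock_port *p, uint16_t *q, uint8_t *lwm)
{
	for (uint16_t i = 0; i < NB_RXQ; i++) {
		if (p->pending[i]) {
			p->pending[i] = false;
			*q = i;
			*lwm = p->lwm[i];
			return 1;
		}
	}
	return 0;
}
```

In this model a handler would call mock_lwm_query() to learn the queue, react (for example wait until the queue drains), and then call mock_lwm_set() again, possibly with a different percentage, to receive the next event.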


Regards,
Spike.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-22  5:58       ` [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
  2022-05-22 15:23         ` Stephen Hemminger
  2022-05-22 15:24         ` Stephen Hemminger
@ 2022-05-23  6:07         ` Morten Brørup
  2022-05-23 10:58           ` Thomas Monjalon
  2 siblings, 1 reply; 131+ messages in thread
From: Morten Brørup @ 2022-05-23  6:07 UTC (permalink / raw)
  To: Spike Du, matan, viacheslavo, orika, thomas; +Cc: dev, rasland

> From: Spike Du [mailto:spiked@nvidia.com]
> Sent: Sunday, 22 May 2022 07.59
> 
> LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
> queue fullness is above LWM, the device will trigger the event
> RTE_ETH_EVENT_RX_LWM.
> LWM is defined as a percentage of Rx queue size with valid value of
> [0,99].
> Setting LWM to 0 means disable it, which is the default.
> When translating the percentage to a queue descriptor number, the number
> should be bigger than 0 and less than queue size.
> Add LWM's configuration and query driver callbacks in eth_dev_ops.
> 
> Signed-off-by: Spike Du <spiked@nvidia.com>
> ---


> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>  	 */
>  	union rte_eth_rxseg *rx_seg;
> 
> -	uint64_t reserved_64s[2]; /**< Reserved for future fields */
> +	/**
> +	 * Per-queue Rx limit watermark defined as percentage of Rx queue
> +	 * size. If Rx queue receives traffic higher than this
> percentage,
> +	 * the event RTE_ETH_EVENT_RX_LWM is triggered.
> +	 */
> +	uint8_t lwm;

Why percentage, why not 1/128th, or 1/16th? 2^N seems more logical, and I wonder if such high granularity is really necessary. Just a thought, it's not important.

If you stick with percentage, it only needs 7 bits, and you can make the remaining one bit reserved.

Also, please add here that 0 means disable.

> +
> +	uint8_t reserved_bits[3];
> +	uint32_t reserved_32s;
> +	uint64_t reserved_64s;
>  	void *reserved_ptrs[2];   /**< Reserved for future fields */
>  };
> 



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23  6:07         ` Morten Brørup
@ 2022-05-23 10:58           ` Thomas Monjalon
  2022-05-23 14:10             ` Spike Du
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-05-23 10:58 UTC (permalink / raw)
  To: Spike Du, Morten Brørup; +Cc: matan, viacheslavo, orika, dev, rasland

23/05/2022 08:07, Morten Brørup:
> > +	uint8_t lwm;
> 
> Why percentage, why not 1/128th, or 1/16th? 2^N seems more logical, and I wonder if such high granularity is really necessary. Just a thought, it's not important.

I think percentage is the easiest to understand
and to share with other teams in design documents.

> If you stick with percentage, it only needs 7 bits, and you can make the remaining one bit reserved.
> 
> Also, please add here that 0 means disable.

Good idea.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23 10:58           ` Thomas Monjalon
@ 2022-05-23 14:10             ` Spike Du
  2022-05-23 14:39               ` Thomas Monjalon
  0 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-05-23 14:10 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon (EXTERNAL), Morten Brørup
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, dev, Raslan Darawsheh



> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Monday, May 23, 2022 6:59 PM
> To: Spike Du <spiked@nvidia.com>; Morten Brørup
> <mb@smartsharesystems.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> 23/05/2022 08:07, Morten Brørup:
> > > +   uint8_t lwm;
> >
> > Why percentage, why not 1/128th, or 1/16th? 2^N seems more logical, and
> I wonder if such high granularity is really necessary. Just a thought, it's not
> important.
> 
> I think percentage is the easiest to understand and to share with other teams
> in design documents.
> 
> > If you stick with percentage, it only needs 7 bits, and you can make the
> remaining one bit reserved.

Agree, will change to use 7 bits.
> >
> > Also, please add here that 0 means disable.

Sure.
> 
> Good idea.
> 

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23 14:10             ` Spike Du
@ 2022-05-23 14:39               ` Thomas Monjalon
  2022-05-24  6:35                 ` Andrew Rybchenko
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-05-23 14:39 UTC (permalink / raw)
  To: Morten Brørup, Spike Du
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, dev,
	Raslan Darawsheh, andrew.rybchenko, ferruh.yigit

23/05/2022 16:10, Spike Du:
> > From: Thomas Monjalon <thomas@monjalon.net>
> > > If you stick with percentage, it only needs 7 bits, and you can make the
> > remaining one bit reserved.
> 
> Agree, will change to use 7 bits.

I'm not sure it's worth introducing a bit field here.




^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23  3:01           ` Spike Du
@ 2022-05-23 21:45             ` Thomas Monjalon
  2022-05-24  2:50               ` Spike Du
  2022-05-23 22:54             ` Stephen Hemminger
  1 sibling, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-05-23 21:45 UTC (permalink / raw)
  To: Stephen Hemminger, Spike Du
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit, andrew.rybchenko

23/05/2022 05:01, Spike Du:
> From: Stephen Hemminger <stephen@networkplumber.org>
> > Spike Du <spiked@nvidia.com> wrote:
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > >        */
> > >       union rte_eth_rxseg *rx_seg;
> > >
> > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > +     /**
> > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > +      */
> > > +     uint8_t lwm;
> > > +
> > > +     uint8_t reserved_bits[3];
> > > +     uint32_t reserved_32s;
> > > +     uint64_t reserved_64s;
> > 
> > Ok but, this is an ABI risk about this because reserved stuff was never
> > required before.

An ABI compatibility issue would be for an application compiled
with an old DPDK, and loading a new DPDK at runtime.
Let's think what would happen in such a case.

> > Whenever is a reserved field is introduced the code (in this case
> > rte_ethdev_configure).

rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
Then the library and drivers may interpret a wrong value.
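
To make the failure mode concrete, here is a minimal sketch in C. The two structs are simplified, hypothetical tails of struct rte_eth_rxconf that keep only the fields quoted in the patch, and lwm_seen_by_new_lib() is an illustrative helper, not library code. It shows that the new lwm byte overlays the first byte of the old reserved_64s[0], so whatever an old application left there is what a new library would read.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified tail of the old struct: only the reserved fields. */
struct rxconf_tail_old {
	uint64_t reserved_64s[2];
	void *reserved_ptrs[2];
};

/* Simplified tail after the patch: lwm carved out of reserved_64s[0]. */
struct rxconf_tail_new {
	uint8_t lwm;
	uint8_t reserved_bits[3];
	uint32_t reserved_32s;
	uint64_t reserved_64s;
	void *reserved_ptrs[2];
};

/*
 * What a new library would read as lwm when handed a structure
 * that was filled in by an application built against the old layout.
 */
static uint8_t
lwm_seen_by_new_lib(const struct rxconf_tail_old *old_conf)
{
	struct rxconf_tail_new view;

	/* Same size, so the overall layout is unchanged... */
	assert(sizeof(view) == sizeof(*old_conf));
	/* ...but the first byte is now interpreted as a watermark. */
	memcpy(&view, old_conf, sizeof(view));
	return view.lwm;
}
```

If the old application zeroed the structure, the new library sees 0 (disabled). If it left the reserved area uninitialized, the library sees an arbitrary byte, which may even fall outside the valid [0,99] range.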

> > Best practice would have been to have the code require all reserved fields be
> > 0 in earlier releases. In this case an application is like to define a watermark of
> > zero; how will your code handle it.
> 
> Having watermark of 0 is desired, which is the default. LWM of 0 means the Rx
> Queue's watermark is not monitored, hence no LWM event is generated.

The problem is having a value that is not initialized.
I think the best approach is to not expose the LWM value
through this configuration structure.
If the need is to get the current value,
it would be better to add a field in the struct rte_eth_rxq_info.

[...]
> 
> > Why introduce new query/set operations? This should just be part of the
> > overall device configuration.

Thanks to the "set" function, we can avoid the ABI compat issue.

> Due to different implementation. LWM can be a dynamic configuration which can help user design a flexible flow control.
> User may feel ok with LWM of 80% to get high throughput, or later on with 50% to throttle the traffic responsively by handling LWM event in order to reduce drop.
> Some driver like mlx5 may implement LWM event as one-time shot. When you receive LWM event, you need to reconfigure LWM in order to receive the event again, thus you will
> not likely to be overwhelmed by the events.
> These all require set operation.

Yes it is better to allow dynamic watermark configuration,
not using the function rte_eth_rx_queue_setup().

> For the query operation. The rte_event API rte_eth_dev_callback_process() is per-port API, it doesn't carry much information when an event happens.
> When a LWM event happens, we need to know in which Rx queue it happens or optionally what's the current LWM percentage of this queue.
> The query operation serves this purpose.

Yes "query" has to be called in the event handler
because event structure is not specific to any event type.




^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23  3:01           ` Spike Du
  2022-05-23 21:45             ` Thomas Monjalon
@ 2022-05-23 22:54             ` Stephen Hemminger
  2022-05-24  3:46               ` Spike Du
  1 sibling, 1 reply; 131+ messages in thread
From: Stephen Hemminger @ 2022-05-23 22:54 UTC (permalink / raw)
  To: Spike Du
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dev, Raslan Darawsheh

On Mon, 23 May 2022 03:01:20 +0000
Spike Du <spiked@nvidia.com> wrote:

> Hi, pls see below.
> 
> > -----Original Message-----
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: Sunday, May 22, 2022 11:23 PM
> > To: Spike Du <spiked@nvidia.com>
> > Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> > Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> > Raslan Darawsheh <rasland@nvidia.com>
> > Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> > 
> > On Sun, 22 May 2022 08:58:56 +0300
> > Spike Du <spiked@nvidia.com> wrote:
> >   
> > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > > 04cff8ee10..687ae5ff29 100644
> > > --- a/lib/ethdev/rte_ethdev.h
> > > +++ b/lib/ethdev/rte_ethdev.h
> > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > >        */
> > >       union rte_eth_rxseg *rx_seg;
> > >
> > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > +     /**
> > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > +      */
> > > +     uint8_t lwm;
> > > +
> > > +     uint8_t reserved_bits[3];
> > > +     uint32_t reserved_32s;
> > > +     uint64_t reserved_64s;
> > >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> > >  };
> > >  
> > 
> > Ok but, this is an ABI risk about this because reserved stuff was never
> > required before.
> > Whenever is a reserved field is introduced the code (in this case
> > rte_ethdev_configure).
> > 
> > Best practice would have been to have the code require all reserved fields be
> > 0 in earlier releases. In this case an application is like to define a watermark of
> > zero; how will your code handle it.  
> Having watermark of 0 is desired, which is the default. LWM of 0 means the Rx
> Queue's watermark is not monitored, hence no LWM event is generated.
> > 
> > Also, using 8 bits as percentage is different than how other API's handle this.
> > Since Rx queue size is in packets, why is this not in packets?  
> The short answer is to simply the LWM configuration.
> Rx queue descriptor is complex nowadays. 
> For normal queue, user may configure LWM according to queue descriptor number easily.
> But for below queues, it's not easy:
> Take mprq as example, the testpmd cmd  options can be " -a 0000:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,mprq_log_stride_size=8,mprq_log_stride_num=3
> -- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ", 
> For MLX5 implementation,  the minimum "unit" in queue has 64 descriptors, the "unit" number is 16,  if you configure according to descriptor number(1024)
> Here, you may easily set LWM as something like 512, but HW doesn't allow it, because 512 > 16. If you want the watermark to be half, the correct value is 8.
> The same issue happens to feature like "Rx queue buffer split" where a packet can be split to multiple descriptors.
> Using percentage doesn't have such issues, PMD will cover all the details.
> 
> > Also document what behavior of 0 is.  
> Sure. The behavior is like the old days without this feature, pls see above.
> 
> > Why introduce new query/set operations? This should just be part of the
> > overall device configuration.  
> Due to different implementation. LWM can be a dynamic configuration which can help user design a flexible flow control.
> User may feel ok with LWM of 80% to get high throughput, or later on with 50% to throttle the traffic responsively by handling LWM event in order to reduce drop.
> Some driver like mlx5 may implement LWM event as one-time shot. When you receive LWM event, you need to reconfigure LWM in order to receive the event again, thus you will
> not likely to be overwhelmed by the events.
> These all require set operation.
> 
> For the query operation. The rte_event API rte_eth_dev_callback_process() is per-port API, it doesn't carry much information when an event happens.
> When a LWM event happens, we need to know in which Rx queue it happens or optionally what's the current LWM percentage of this queue.
> The query operation serves this purpose.
> 
> 
> Regards,
> Spike.
> 
> 

The bigger question is why does this have to be just MLX5 and why
can't it fit into the existing DPDK RX interrupt framework?

Linux and BSD have had this for years in their packet coalescing logic.
Ethtool provides the ability to set a lot of IRQ coalescing parameters, like:

       ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
              [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
              [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
              [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
              [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
              [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
              [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
              [cqe-mode-rx on|off] [cqe-mode-tx on|off]

It feels like this is just the DPDK version of a small subset of that.
Since many devices already support IRQ coalescing, it would be best to build
one new API that has most of these, rather than an MLX/Nvidia-only API for
a single parameter.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23 21:45             ` Thomas Monjalon
@ 2022-05-24  2:50               ` Spike Du
  2022-05-24  8:18                 ` Thomas Monjalon
  0 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-05-24  2:50 UTC (permalink / raw)
  To: NBU-Contact-Thomas Monjalon (EXTERNAL), Stephen Hemminger
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit, andrew.rybchenko



> -----Original Message-----
> From: Thomas Monjalon <thomas@monjalon.net>
> Sent: Tuesday, May 24, 2022 5:46 AM
> To: Stephen Hemminger <stephen@networkplumber.org>; Spike Du
> <spiked@nvidia.com>
> Cc: dev@dpdk.org; Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; Raslan Darawsheh
> <rasland@nvidia.com>; ferruh.yigit@amd.com;
> andrew.rybchenko@oktetlabs.ru
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> 23/05/2022 05:01, Spike Du:
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > > Spike Du <spiked@nvidia.com> wrote:
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > >        */
> > > >       union rte_eth_rxseg *rx_seg;
> > > >
> > > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > +     /**
> > > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > +      */
> > > > +     uint8_t lwm;
> > > > +
> > > > +     uint8_t reserved_bits[3];
> > > > +     uint32_t reserved_32s;
> > > > +     uint64_t reserved_64s;
> > >
> > > Ok but, this is an ABI risk about this because reserved stuff was
> > > never required before.
> 
> An ABI compatibility issue would be for an application compiled with an old
> DPDK, and loading a new DPDK at runtime.
> Let's think what would happen in such a case.
> 
> > > Whenever is a reserved field is introduced the code (in this case
> > > rte_ethdev_configure).
> 
> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
> Then the library and drivers may interpret a wrong value.
> 
> > > Best practice would have been to have the code require all reserved
> > > fields be
> > > 0 in earlier releases. In this case an application is like to define
> > > a watermark of zero; how will your code handle it.
> >
> > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > the Rx Queue's watermark is not monitored, hence no LWM event is
> generated.
> 
> The problem is to have a value not initialized.
> I think the best approach is to not expose the LWM value through this
> configuration structure.
> If the need is to get the current value, we should better add a field in the
> struct rte_eth_rxq_info.

At least in all the DPDK app/example code, rxconf is initialized to 0 before the
Rx queue is set up; if users follow these examples, we should not have an ABI issue.
Since many people are concerned about the rxconf change, it's OK to remove the LWM
field there.
Yes, I think we can add lwm to rte_eth_rxq_info. If we can set an Rx queue attribute,
we should have a way to get it.

> 
> [...]
> >
> > > Why introduce new query/set operations? This should just be part of
> > > the overall device configuration.
> 
> Thanks to the "set" function, we can avoid the ABI compat issue.
> 
> > Due to different implementation. LWM can be a dynamic configuration
> which can help user design a flexible flow control.
> > User may feel ok with LWM of 80% to get high throughput, or later on with
> 50% to throttle the traffic responsively by handling LWM event in order to
> reduce drop.
> > Some driver like mlx5 may implement LWM event as one-time shot. When
> > you receive LWM event, you need to reconfigure LWM in order to receive
> the event again, thus you will not likely to be overwhelmed by the events.
> > These all require set operation.
> 
> Yes it is better to allow dynamic watermark configuration, not using the
> function rte_eth_rx_queue_setup().
> 
> > For the query operation. The rte_event API
> rte_eth_dev_callback_process() is per-port API, it doesn't carry much
> information when an event happens.
> > When a LWM event happens, we need to know in which Rx queue it
> happens or optionally what's the current LWM percentage of this queue.
> > The query operation serves this purpose.
> 
> Yes "query" has to be called in the event handler because event structure is
> not specific to any event type.
> 
> 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23 22:54             ` Stephen Hemminger
@ 2022-05-24  3:46               ` Spike Du
  0 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24  3:46 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	dev, Raslan Darawsheh



> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, May 24, 2022 6:55 AM
> To: Spike Du <spiked@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> On Mon, 23 May 2022 03:01:20 +0000
> Spike Du <spiked@nvidia.com> wrote:
> 
> > Hi, pls see below.
> >
> > > -----Original Message-----
> > > From: Stephen Hemminger <stephen@networkplumber.org>
> > > Sent: Sunday, May 22, 2022 11:23 PM
> > > To: Spike Du <spiked@nvidia.com>
> > > Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> > > <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-
> > > Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>; dev@dpdk.org;
> > > Raslan Darawsheh <rasland@nvidia.com>
> > > Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit
> > > watermark
> > >
> > > On Sun, 22 May 2022 08:58:56 +0300
> > > Spike Du <spiked@nvidia.com> wrote:
> > >
> > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > index
> > > > 04cff8ee10..687ae5ff29 100644
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > >        */
> > > >       union rte_eth_rxseg *rx_seg;
> > > >
> > > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > +     /**
> > > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > +      */
> > > > +     uint8_t lwm;
> > > > +
> > > > +     uint8_t reserved_bits[3];
> > > > +     uint32_t reserved_32s;
> > > > +     uint64_t reserved_64s;
> > > >       void *reserved_ptrs[2];   /**< Reserved for future fields */
> > > >  };
> > > >
> > >
> > > Ok but, this is an ABI risk about this because reserved stuff was
> > > never required before.
> > > Whenever is a reserved field is introduced the code (in this case
> > > rte_ethdev_configure).
> > >
> > > Best practice would have been to have the code require all reserved
> > > fields be
> > > 0 in earlier releases. In this case an application is like to define
> > > a watermark of zero; how will your code handle it.
> > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > the Rx Queue's watermark is not monitored, hence no LWM event is
> generated.
> > >
> > > Also, using 8 bits as percentage is different than how other API's handle
> this.
> > > Since Rx queue size is in packets, why is this not in packets?
> > The short answer is to simply the LWM configuration.
> > Rx queue descriptor is complex nowadays.
> > For normal queue, user may configure LWM according to queue descriptor
> number easily.
> > But for below queues, it's not easy:
> > Take mprq as example, the testpmd cmd  options can be " -a
> >
> 0000:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,
> mprq_lo
> > g_stride_size=8,mprq_log_stride_num=3
> > -- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ", For MLX5
> > implementation,  the minimum "unit" in queue has 64 descriptors, the
> > "unit" number is 16,  if you configure according to descriptor number(1024)
> Here, you may easily set LWM as something like 512, but HW doesn't allow it,
> because 512 > 16. If you want the watermark to be half, the correct value is 8.
> > The same issue happens to feature like "Rx queue buffer split" where a
> packet can be split to multiple descriptors.
> > Using percentage doesn't have such issues, PMD will cover all the details.
> >
> > > Also document what behavior of 0 is.
> > Sure. The behavior is like the old days without this feature, pls see above.
> >
> > > Why introduce new query/set operations? This should just be part of
> > > the overall device configuration.
> > Due to different implementation. LWM can be a dynamic configuration
> which can help user design a flexible flow control.
> > User may feel ok with LWM of 80% to get high throughput, or later on with
> 50% to throttle the traffic responsively by handling LWM event in order to
> reduce drop.
> > Some driver like mlx5 may implement LWM event as one-time shot. When
> > you receive LWM event, you need to reconfigure LWM in order to receive
> the event again, thus you will not likely to be overwhelmed by the events.
> > These all require set operation.
> >
> > For the query operation. The rte_event API
> rte_eth_dev_callback_process() is per-port API, it doesn't carry much
> information when an event happens.
> > When a LWM event happens, we need to know in which Rx queue it
> happens or optionally what's the current LWM percentage of this queue.
> > The query operation serves this purpose.
> >
> >
> > Regards,
> > Spike.
> >
> >
> 
> The bigger question is why does this have to be just MLX5 and why can't it fit
> into the existing DPDK RX interrupt framework?
> 
> Linux and BSD have had this for years in their packet coalescing logic.
> Ethtool provides ability to set lot of irq coalescing parameters like:
> 
>        ethtool -C|--coalesce devname [adaptive-rx on|off] [adaptive-tx on|off]
>               [rx-usecs N] [rx-frames N] [rx-usecs-irq N] [rx-frames-irq N]
>               [tx-usecs N] [tx-frames N] [tx-usecs-irq N] [tx-frames-irq N]
>               [stats-block-usecs N] [pkt-rate-low N] [rx-usecs-low N]
>               [rx-frames-low N] [tx-usecs-low N] [tx-frames-low N]
>               [pkt-rate-high N] [rx-usecs-high N] [rx-frames-high N]
>               [tx-usecs-high N] [tx-frames-high N] [sample-interval N]
>               [cqe-mode-rx on|off] [cqe-mode-tx on|off]
> 
> It feels like this is just the DPDK version of a small subset of that.
> Since many device already support IRQ coalescing, it would be best to build
> one new API that has most of these. Rather than a MLX/Nvidia only API for a
> single parameter.

I use MLX5 as the example here because it is the only NIC I know in detail; I don't know
how other NICs work. That doesn't mean I am trying to change common code only to satisfy
mlx5 needs.

I think interrupt coalescing is different from LWM:
Interrupt coalescing delays the interrupt until a batch of packets has been received (or an interval has elapsed).
LWM is intended to notify the application when an Rx queue is running out of buffers. Delaying interrupts can't detect
a specific fullness value of the Rx queue, but LWM can, if the driver supports it.
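
To make the percentage argument from the quoted text concrete, here is a minimal sketch of how a PMD might translate a percentage into hardware units. The helper names are hypothetical and not part of any DPDK API, and the truncating division is an assumption about rounding; the numbers (1024 descriptors, 64 descriptors per unit, 50%) are the ones from the MPRQ example above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of hardware "units" in an Rx queue,
 * given the descriptor count and the descriptors per unit.
 */
static inline uint32_t
lwm_queue_units(uint32_t nb_desc, uint32_t desc_per_unit)
{
	assert(desc_per_unit != 0);
	return nb_desc / desc_per_unit;
}

/*
 * Hypothetical helper: convert a watermark percentage (0..100) into
 * hardware units. A result of 0 means the watermark is not monitored.
 * Truncation toward zero is an assumption, not the driver's rule.
 */
static inline uint32_t
lwm_percent_to_units(uint32_t units, uint8_t percent)
{
	assert(percent <= 100);
	return (units * percent) / 100;
}
```

With 1024 descriptors and 64 descriptors per unit the queue has 16 units, so a 50% watermark maps to 8 units rather than 512, which is the detail the application would otherwise have to know.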


Regards,
Spike.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-23 14:39               ` Thomas Monjalon
@ 2022-05-24  6:35                 ` Andrew Rybchenko
  2022-05-24  9:40                   ` Morten Brørup
  0 siblings, 1 reply; 131+ messages in thread
From: Andrew Rybchenko @ 2022-05-24  6:35 UTC (permalink / raw)
  To: Thomas Monjalon, Morten Brørup, Spike Du
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit

On 5/23/22 17:39, Thomas Monjalon wrote:
> 23/05/2022 16:10, Spike Du:
>>> From: Thomas Monjalon <thomas@monjalon.net>
>>>> If you stick with percentage, it only needs 7 bits, and you can make the
>>> remaining one bit reserved.
>>
>> Agree, will change to use 7 bits.
> 
> I'm not sure it's worth introducing a bit field here.

+1

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-24  2:50               ` Spike Du
@ 2022-05-24  8:18                 ` Thomas Monjalon
  2022-05-25 12:59                   ` Andrew Rybchenko
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-05-24  8:18 UTC (permalink / raw)
  To: Stephen Hemminger, Spike Du
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit, andrew.rybchenko

24/05/2022 04:50, Spike Du:
> From: Thomas Monjalon <thomas@monjalon.net>
> > 23/05/2022 05:01, Spike Du:
> > > From: Stephen Hemminger <stephen@networkplumber.org>
> > > > Spike Du <spiked@nvidia.com> wrote:
> > > > > --- a/lib/ethdev/rte_ethdev.h
> > > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > > >        */
> > > > >       union rte_eth_rxseg *rx_seg;
> > > > >
> > > > > -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > > +     /**
> > > > > +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > > +      * size. If Rx queue receives traffic higher than this percentage,
> > > > > +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > > +      */
> > > > > +     uint8_t lwm;
> > > > > +
> > > > > +     uint8_t reserved_bits[3];
> > > > > +     uint32_t reserved_32s;
> > > > > +     uint64_t reserved_64s;
> > > >
> > > > Ok but, this is an ABI risk about this because reserved stuff was
> > > > never required before.
> > 
> > An ABI compatibility issue would be for an application compiled with an old
> > DPDK, and loading a new DPDK at runtime.
> > Let's think what would happen in such a case.
> > 
> > > > Whenever is a reserved field is introduced the code (in this case
> > > > rte_ethdev_configure).
> > 
> > rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
> > Then the library and drivers may interpret a wrong value.
> > 
> > > > Best practice would have been to have the code require all reserved
> > > > fields be
> > > > 0 in earlier releases. In this case an application is like to define
> > > > a watermark of zero; how will your code handle it.
> > >
> > > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > > the Rx Queue's watermark is not monitored, hence no LWM event is
> > generated.
> > 
> > The problem is to have a value not initialized.
> > I think the best approach is to not expose the LWM value through this
> > configuration structure.
> > If the need is to get the current value, we should better add a field in the
> > struct rte_eth_rxq_info.
> 
> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
> The Rx queue, if user follows these examples we should not have ABI issue.
> Since many people are concerned about rxconf change, it's ok to remove the LWM
> Field there.
> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
> We should have a way to get it.

Unfortunately we cannot rely on examples for ABI compatibility.
My suggestion of moving the field to rte_eth_rxq_info
is not obviously safe because it could change the size of the struct.
But thanks to __rte_cache_min_aligned, it is OK.
Running pahole on this struct shows we have 50 bytes free:
        /* size: 128, cachelines: 2, members: 6 */
        /* padding: 50 */

The other option would be to get the LWM value with a "get" function.

What do others prefer?
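
The padding argument can be illustrated with a small sketch. The structs below are stand-ins, not the real rte_eth_rxq_info layout, and the GCC/Clang `aligned` attribute is used here as an assumed equivalent of a cache-line alignment macro. The member sizes are chosen only so that a typical 64-bit build ends up with the same 128-byte size and a similar amount of tail padding as the pahole output above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumed cache line size for this sketch */

/* Stand-in for the struct before the new field is added. */
struct demo_info_old {
	void *pool;
	uint8_t payload[70];
} __attribute__((aligned(DEMO_CACHE_LINE)));

/* Same struct with one extra byte placed in the former tail padding. */
struct demo_info_new {
	void *pool;
	uint8_t payload[70];
	uint8_t lwm; /* hypothetical new field */
} __attribute__((aligned(DEMO_CACHE_LINE)));

/* The alignment rounds both sizes up to two cache lines. */
static_assert(sizeof(struct demo_info_old) == 128,
	      "old struct occupies two cache lines");
static_assert(sizeof(struct demo_info_new) == sizeof(struct demo_info_old),
	      "new field fits in tail padding, size unchanged");
```

Because the size and the offsets of the existing members stay the same, code compiled against the old layout still reads the fields it knows about correctly; it simply never looks at the new byte.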



^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-24  6:35                 ` Andrew Rybchenko
@ 2022-05-24  9:40                   ` Morten Brørup
  0 siblings, 0 replies; 131+ messages in thread
From: Morten Brørup @ 2022-05-24  9:40 UTC (permalink / raw)
  To: Andrew Rybchenko, Thomas Monjalon, Spike Du
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit

> From: Andrew Rybchenko [mailto:andrew.rybchenko@oktetlabs.ru]
> Sent: Tuesday, 24 May 2022 08.35
> 
> On 5/23/22 17:39, Thomas Monjalon wrote:
> > 23/05/2022 16:10, Spike Du:
> >>> From: Thomas Monjalon <thomas@monjalon.net>
> >>>> If you stick with percentage, it only needs 7 bits, and you can
> make the
> >>> remaining one bit reserved.
> >>
> >> Agree, will change to use 7 bits.
> >
> > I'm not sure it's worth introducing a bit field here.
> 
> +1

It's not important for me either. Just stick with the full byte. I mainly mentioned it in case you considered reducing the parameter granularity to 1/16th instead of percent, so you would only need 4 bits.


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-22  5:58     ` [RFC v2 0/7] introduce per-queue limit watermark and host shaper Spike Du
                         ` (6 preceding siblings ...)
  2022-05-22  5:59       ` [RFC v2 7/7] app/testpmd: add LWM and Host Shaper command Spike Du
@ 2022-05-24 15:20       ` Spike Du
  2022-05-24 15:20         ` [PATCH v3 1/7] net/mlx5: add LWM support for Rxq Spike Du
                           ` (8 more replies)
  7 siblings, 9 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

LWM (limit watermark) is a per-Rx-queue attribute: when the Rx queue fullness reaches the LWM limit, the HW sends an event to the DPDK application.
The host shaper can configure the shaper rate and the lwm-triggered flag for a host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically when one of the host port's Rx queues receives an LWM event.

These two features can be combined to control traffic from the host port to the wire port.
The workflow is: configure LWM on the Rx queue and enable the lwm-triggered flag in the host shaper; after receiving an LWM event, wait until the Rx queue is empty, then disable the shaper. Repeating this workflow reduces Rx queue drops.

Add a new libethdev API to set LWM, and add the rte event RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle the LWM event. Because the host shaper doesn't align with the existing DPDK framework and is specific to Nvidia NICs, it uses a PMD private API.

For integration with testpmd, put the private cmdline function and the LWM event handler in the mlx5 PMD directory by adding a new file mlx5_test.c. Only minimal code is added to testpmd to invoke the interfaces from mlx5_test.c.
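
The workflow described above can be sketched as a small state model. This is only an illustration: the struct and function names are hypothetical, nothing here calls the ethdev or mlx5 API, and the one-shot arming of the event is an assumption taken from the earlier discussion in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Rx queue. */
struct demo_rxq {
	uint32_t size;    /* total descriptors */
	uint32_t used;    /* descriptors currently holding packets */
	uint8_t lwm_pct;  /* watermark in percent, 0 = not monitored */
	bool lwm_armed;   /* event is one-shot and must be re-armed */
};

/* Hypothetical model of the host port shaper. */
struct demo_port {
	bool lwm_triggered; /* enable the shaper automatically on LWM event */
	bool shaper_on;
};

/* Models the HW side: fire the event once when fullness reaches LWM. */
static bool
demo_hw_check_lwm(struct demo_port *p, struct demo_rxq *q)
{
	if (!q->lwm_armed || q->lwm_pct == 0)
		return false;
	if ((uint64_t)q->used * 100 < (uint64_t)q->size * q->lwm_pct)
		return false;
	q->lwm_armed = false;
	if (p->lwm_triggered)
		p->shaper_on = true;
	return true;
}

/* Models the application side: once drained, drop shaper and re-arm. */
static bool
demo_app_recover(struct demo_port *p, struct demo_rxq *q)
{
	if (q->used != 0)
		return false; /* still draining, try again later */
	p->shaper_on = false;
	q->lwm_armed = true;
	return true;
}
```

In this model the application calls demo_app_recover() periodically after an event until it returns true, which corresponds to the "delay until the Rx queue is empty, then disable the shaper" step.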

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based limit watermark
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based limit watermark
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c                       |  74 +++++
 app/test-pmd/config.c                        |  21 ++
 app/test-pmd/meson.build                     |   4 +
 app/test-pmd/testpmd.c                       |  24 ++
 app/test-pmd/testpmd.h                       |   1 +
 doc/guides/nics/mlx5.rst                     |  84 ++++++
 doc/guides/rel_notes/release_22_07.rst       |   2 +
 drivers/common/mlx5/linux/meson.build        |  13 +
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 +++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h               |  26 ++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 -----
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++-------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +---
 drivers/net/mlx5/mlx5.c                      |  68 +++++
 drivers/net/mlx5/mlx5.h                      |  12 +-
 drivers/net/mlx5/mlx5_devx.c                 |  60 +++-
 drivers/net/mlx5/mlx5_devx.h                 |   1 +
 drivers/net/mlx5/mlx5_rx.c                   | 292 +++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h                   |  13 +
 drivers/net/mlx5/mlx5_testpmd.c              | 184 ++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h              |  27 ++
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +-
 drivers/net/mlx5/rte_pmd_mlx5.h              |  30 ++
 drivers/net/mlx5/version.map                 |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 +--
 lib/ethdev/ethdev_driver.h                   |  22 ++
 lib/ethdev/rte_ethdev.c                      |  52 ++++
 lib/ethdev/rte_ethdev.h                      |  71 +++++
 lib/ethdev/version.map                       |   2 +
 33 files changed, 1299 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
2.27.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH v3 1/7] net/mlx5: add LWM support for Rxq
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
@ 2022-05-24 15:20         ` Spike Du
  2022-05-24 15:20         ` [PATCH v3 2/7] common/mlx5: share interrupt management Spike Du
                           ` (7 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add an lwm (Limit WaterMark) field to the Rxq object, which indicates the percentage
of the Rx queue size used by HW to raise an LWM event to the user.
Allow LWM setting in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.h      |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 ++++++++++++-
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee8cf..305edffe71 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
 	MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
 	MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
 	MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+	MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f9433a..c918a50ae9 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 	case MLX5_RXQ_MOD_RST2RDY:
 		rq_attr.rq_state = MLX5_RQC_STATE_RST;
 		rq_attr.state = MLX5_RQC_STATE_RDY;
+		if (rxq->lwm) {
+			rq_attr.modify_bitmask |=
+				MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+			rq_attr.lwm = rxq->lwm;
+		}
 		break;
 	case MLX5_RXQ_MOD_RDY2ERR:
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
 		rq_attr.state = MLX5_RQC_STATE_RST;
 		break;
+	case MLX5_RXQ_MOD_RDY2RDY:
+		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+		rq_attr.state = MLX5_RQC_STATE_RDY;
+		rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+		rq_attr.lwm = rxq->lwm;
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a6b9..ebd1da455a 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@ int mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 			 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6b62..25a5f2c1fa 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
+	uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
2.27.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH v3 2/7] common/mlx5: share interrupt management
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
  2022-05-24 15:20         ` [PATCH v3 1/7] net/mlx5: add LWM support for Rxq Spike Du
@ 2022-05-24 15:20         ` Spike Du
  2022-05-24 15:20         ` [PATCH v3 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
                           ` (6 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this, and replace all related PMD code with this
API.
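
One of the duplicated steps being factored out is switching the file descriptor to non-blocking mode. A minimal POSIX sketch of that step is shown here; the function name is hypothetical, and unlike the helper in this patch it also checks the result of F_GETFL, which is an addition made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: add O_NONBLOCK to the fd's status flags while
 * preserving the flags that are already set.
 * Returns 0 on success, -1 on failure with errno set by fcntl().
 */
static int
demo_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -1;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -1;
	return 0;
}
```

Reading the current flags first matters because F_SETFL replaces the whole set of status flags; passing only O_NONBLOCK would clear flags such as O_APPEND.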

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 ++++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 ----------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++---------------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +-------
 drivers/net/mlx5/mlx5.h                      |   2 -
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +---
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 ----
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 +------
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5cd1..f10a981a37 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include <dirent.h>
 #include <net/if.h>
+#include <fcntl.h>
 
 #include <rte_errno.h>
 #include <rte_string_fns.h>
@@ -964,3 +965,133 @@ mlx5_os_wrapped_mkey_destroy(struct mlx5_pmd_wrapped_mr *pmd_mr)
 		claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
 	memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	struct rte_intr_handle *tmp_intr_handle;
+	int ret, flags;
+
+	tmp_intr_handle = rte_intr_instance_alloc(mode);
+	if (!tmp_intr_handle) {
+		rte_errno = ENOMEM;
+		goto err;
+	}
+	if (set_fd_nonblock) {
+		flags = fcntl(fd, F_GETFL);
+		ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+		if (ret) {
+			rte_errno = errno;
+			goto err;
+		}
+	}
+	ret = rte_intr_fd_set(tmp_intr_handle, fd);
+	if (ret)
+		goto err;
+	ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+	if (ret)
+		goto err;
+	ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+	if (ret) {
+		rte_errno = -ret;
+		goto err;
+	}
+	return tmp_intr_handle;
+err:
+	if (tmp_intr_handle)
+		rte_intr_instance_free(tmp_intr_handle);
+	return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+			      rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+	uint64_t twait = 0;
+	uint64_t start = 0;
+
+	do {
+		int ret;
+
+		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+		if (ret >= 0)
+			return;
+		if (ret != -EAGAIN) {
+			DRV_LOG(INFO, "failed to unregister interrupt"
+				      " handler (error: %d)", ret);
+			MLX5_ASSERT(false);
+			return;
+		}
+		if (twait) {
+			struct timespec onems;
+
+			/* Wait one millisecond and try again. */
+			onems.tv_sec = 0;
+			onems.tv_nsec = NS_PER_S / MS_PER_S;
+			nanosleep(&onems, 0);
+			/* Check whether one second elapsed. */
+			if ((rte_get_timer_cycles() - start) <= twait)
+				continue;
+		} else {
+			/*
+			 * We get the amount of timer ticks for one second.
+			 * If this amount elapsed it means we spent one
+			 * second in waiting. This branch is executed once
+			 * on first iteration.
+			 */
+			twait = rte_get_timer_hz();
+			MLX5_ASSERT(twait);
+		}
+		/*
+		 * Timeout elapsed, show message (once a second) and retry.
+		 * We have no other acceptable option here, if we ignore
+		 * the unregistering return code the handler will not
+		 * be unregistered, fd will be closed and we may get the
+		 * crush. Hanging and messaging in the loop seems not to be
+		 * the worst choice.
+		 */
+		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+		start = rte_get_timer_cycles();
+	} while (true);
+}
+
+/**
+ * Rte_intr_handle destroy helper.
+ *
+ * @param[in] intr_handle
+ *   Rte_intr_handle to destroy.
+ * @param[in] cb
+ *   Callback which is registered to intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ */
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	if (rte_intr_fd_get(intr_handle) >= 0)
+		mlx5_intr_callback_unregister(intr_handle, cb, cb_arg);
+	rte_intr_instance_free(intr_handle);
+}
diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index 27f1192205..479bb3c7cb 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -15,6 +15,7 @@
 #include <rte_log.h>
 #include <rte_kvargs.h>
 #include <rte_devargs.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -299,4 +300,14 @@ __rte_internal
 int
 mlx5_get_device_guid(const struct rte_pci_addr *dev, uint8_t *guid, size_t len);
 
+__rte_internal
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg);
+
+__rte_internal
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg);
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a23a30a6c0..413dec14ab 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -153,5 +153,7 @@ INTERNAL {
 	mlx5_mr_mempool2mr_bh;
 	mlx5_mr_mempool_populate_cache;
 
+	mlx5_os_interrupt_handler_create; # WINDOWS_NO_EXPORT
+	mlx5_os_interrupt_handler_destroy; # WINDOWS_NO_EXPORT
 	local: *;
 };
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index ee7973f1ec..e9e9108127 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -9,6 +9,7 @@
 #include <sys/types.h>
 
 #include <rte_errno.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -253,4 +254,27 @@ void *mlx5_os_umem_reg(void *ctx, void *addr, size_t size, uint32_t access);
 __rte_internal
 int mlx5_os_umem_dereg(void *pumem);
 
+static inline struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)mode;
+	(void)set_fd_nonblock;
+	(void)fd;
+	(void)cb;
+	(void)cb_arg;
+	rte_errno = ENOTSUP;
+	return NULL;
+}
+
+static inline void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)intr_handle;
+	(void)cb;
+	(void)cb_arg;
+}
+
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 8fe73f1adb..a276b2ba4f 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -881,77 +881,6 @@ mlx5_dev_interrupt_handler(void *cb_arg)
 	}
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	/*
-	 * Try to reduce timeout management overhead by not calling
-	 * the timer related routines on the first iteration. If the
-	 * unregistering succeeds on first call there will be no
-	 * timer calls at all.
-	 */
-	uint64_t twait = 0;
-	uint64_t start = 0;
-
-	do {
-		int ret;
-
-		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
-		if (ret >= 0)
-			return;
-		if (ret != -EAGAIN) {
-			DRV_LOG(INFO, "failed to unregister interrupt"
-				      " handler (error: %d)", ret);
-			MLX5_ASSERT(false);
-			return;
-		}
-		if (twait) {
-			struct timespec onems;
-
-			/* Wait one millisecond and try again. */
-			onems.tv_sec = 0;
-			onems.tv_nsec = NS_PER_S / MS_PER_S;
-			nanosleep(&onems, 0);
-			/* Check whether one second elapsed. */
-			if ((rte_get_timer_cycles() - start) <= twait)
-				continue;
-		} else {
-			/*
-			 * We get the amount of timer ticks for one second.
-			 * If this amount elapsed it means we spent one
-			 * second in waiting. This branch is executed once
-			 * on first iteration.
-			 */
-			twait = rte_get_timer_hz();
-			MLX5_ASSERT(twait);
-		}
-		/*
-		 * Timeout elapsed, show message (once a second) and retry.
-		 * We have no other acceptable option here, if we ignore
-		 * the unregistering return code the handler will not
-		 * be unregistered, fd will be closed and we may get the
-		 * crush. Hanging and messaging in the loop seems not to be
-		 * the worst choice.
-		 */
-		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
-		start = rte_get_timer_cycles();
-	} while (true);
-}
-
 /**
  * Handle DEVX interrupts from the NIC.
  * This function is probably called from the DPDK host thread.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index a821153b35..0741028dab 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2494,40 +2494,6 @@ mlx5_os_net_cleanup(void)
 	mlx5_pmd_socket_uninit();
 }
 
-static int
-mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
-{
-	int nlsk_fd, flags, ret;
-
-	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
-	if (nlsk_fd < 0) {
-		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-	flags = fcntl(nlsk_fd, F_GETFL);
-	ret = fcntl(nlsk_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret != 0) {
-		DRV_LOG(ERR, "Failed to make Netlink event socket non-blocking: %s",
-			strerror(errno));
-		rte_errno = errno;
-		goto error;
-	}
-	rte_intr_type_set(sh->intr_handle_nl, RTE_INTR_HANDLE_EXT);
-	rte_intr_fd_set(sh->intr_handle_nl, nlsk_fd);
-	if (rte_intr_callback_register(sh->intr_handle_nl,
-				       mlx5_dev_interrupt_handler_nl,
-				       sh) != 0) {
-		DRV_LOG(ERR, "Failed to register Netlink events interrupt");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-		goto error;
-	}
-	return 0;
-error:
-	close(nlsk_fd);
-	return -1;
-}
-
 /**
  * Install shared asynchronous device events handler.
  * This function is implemented to support event sharing
@@ -2539,76 +2505,47 @@ mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
-	int ret;
-	int flags;
 	struct ibv_context *ctx = sh->cdev->ctx;
+	int nlsk_fd;
 
-	sh->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
+	sh->intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 ctx->async_fd, mlx5_dev_interrupt_handler, sh);
+	if (!sh->intr_handle) {
+		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle, -1);
-
-	flags = fcntl(ctx->async_fd, F_GETFL);
-	ret = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret) {
-		DRV_LOG(INFO, "failed to change file descriptor async event"
-			" queue");
-	} else {
-		rte_intr_fd_set(sh->intr_handle, ctx->async_fd);
-		rte_intr_type_set(sh->intr_handle, RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle,
-					mlx5_dev_interrupt_handler, sh)) {
-			DRV_LOG(INFO, "Fail to install the shared interrupt.");
-			rte_intr_fd_set(sh->intr_handle, -1);
-		}
+	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
+	if (nlsk_fd < 0) {
+		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
+			rte_strerror(rte_errno));
+		return;
 	}
-	sh->intr_handle_nl = rte_intr_instance_alloc
-						(RTE_INTR_INSTANCE_F_SHARED);
+	sh->intr_handle_nl = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 nlsk_fd, mlx5_dev_interrupt_handler_nl, sh);
 	if (sh->intr_handle_nl == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle_nl, -1);
-	if (mlx5_os_dev_shared_handler_install_lsc(sh) < 0) {
-		DRV_LOG(INFO, "Fail to install the shared Netlink event handler.");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-	}
 	if (sh->cdev->config.devx) {
 #ifdef HAVE_IBV_DEVX_ASYNC
-		sh->intr_handle_devx =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-		if (!sh->intr_handle_devx) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
-			rte_errno = ENOMEM;
-			return;
-		}
-		rte_intr_fd_set(sh->intr_handle_devx, -1);
+		struct mlx5dv_devx_cmd_comp *devx_comp;
+
 		sh->devx_comp = (void *)mlx5_glue->devx_create_cmd_comp(ctx);
-		struct mlx5dv_devx_cmd_comp *devx_comp = sh->devx_comp;
+		devx_comp = sh->devx_comp;
 		if (!devx_comp) {
 			DRV_LOG(INFO, "failed to allocate devx_comp.");
 			return;
 		}
-		flags = fcntl(devx_comp->fd, F_GETFL);
-		ret = fcntl(devx_comp->fd, F_SETFL, flags | O_NONBLOCK);
-		if (ret) {
-			DRV_LOG(INFO, "failed to change file descriptor"
-				" devx comp");
+		sh->intr_handle_devx = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 devx_comp->fd,
+			 mlx5_dev_interrupt_handler_devx, sh);
+		if (!sh->intr_handle_devx) {
+			DRV_LOG(ERR, "Failed to allocate intr_handle.");
 			return;
 		}
-		rte_intr_fd_set(sh->intr_handle_devx, devx_comp->fd);
-		rte_intr_type_set(sh->intr_handle_devx,
-					 RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle_devx,
-					mlx5_dev_interrupt_handler_devx, sh)) {
-			DRV_LOG(INFO, "Fail to install the devx shared"
-				" interrupt.");
-			rte_intr_fd_set(sh->intr_handle_devx, -1);
-		}
 #endif /* HAVE_IBV_DEVX_ASYNC */
 	}
 }
@@ -2624,24 +2561,13 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
-	int nlsk_fd;
-
-	if (rte_intr_fd_get(sh->intr_handle) >= 0)
-		mlx5_intr_callback_unregister(sh->intr_handle,
-					      mlx5_dev_interrupt_handler, sh);
-	rte_intr_instance_free(sh->intr_handle);
-	nlsk_fd = rte_intr_fd_get(sh->intr_handle_nl);
-	if (nlsk_fd >= 0) {
-		mlx5_intr_callback_unregister
-			(sh->intr_handle_nl, mlx5_dev_interrupt_handler_nl, sh);
-		close(nlsk_fd);
-	}
-	rte_intr_instance_free(sh->intr_handle_nl);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
+					  mlx5_dev_interrupt_handler, sh);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
+					  mlx5_dev_interrupt_handler_nl, sh);
 #ifdef HAVE_IBV_DEVX_ASYNC
-	if (rte_intr_fd_get(sh->intr_handle_devx) >= 0)
-		rte_intr_callback_unregister(sh->intr_handle_devx,
-				  mlx5_dev_interrupt_handler_devx, sh);
-	rte_intr_instance_free(sh->intr_handle_devx);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
+					  mlx5_dev_interrupt_handler_devx, sh);
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
diff --git a/drivers/net/mlx5/linux/mlx5_socket.c b/drivers/net/mlx5/linux/mlx5_socket.c
index 4882e5fa2f..0e01aff0e7 100644
--- a/drivers/net/mlx5/linux/mlx5_socket.c
+++ b/drivers/net/mlx5/linux/mlx5_socket.c
@@ -133,51 +133,6 @@ mlx5_pmd_socket_handle(void *cb __rte_unused)
 		fclose(file);
 }
 
-/**
- * Install interrupt handler.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @return
- *   0 on success, a negative errno value otherwise.
- */
-static int
-mlx5_pmd_interrupt_handler_install(void)
-{
-	MLX5_ASSERT(server_socket != -1);
-
-	server_intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
-	if (server_intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
-	if (rte_intr_fd_set(server_intr_handle, server_socket))
-		return -rte_errno;
-
-	if (rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	return rte_intr_callback_register(server_intr_handle,
-					  mlx5_pmd_socket_handle, NULL);
-}
-
-/**
- * Uninstall interrupt handler.
- */
-static void
-mlx5_pmd_interrupt_handler_uninstall(void)
-{
-	if (server_socket != -1) {
-		mlx5_intr_callback_unregister(server_intr_handle,
-					      mlx5_pmd_socket_handle,
-					      NULL);
-	}
-	rte_intr_fd_set(server_intr_handle, 0);
-	rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_UNKNOWN);
-	rte_intr_instance_free(server_intr_handle);
-}
-
 /**
  * Initialise the socket to communicate with external tools.
  *
@@ -224,7 +179,10 @@ mlx5_pmd_socket_init(void)
 			strerror(errno));
 		goto remove;
 	}
-	if (mlx5_pmd_interrupt_handler_install()) {
+	server_intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_PRIVATE, false,
+		 server_socket, mlx5_pmd_socket_handle, NULL);
+	if (server_intr_handle == NULL) {
 		DRV_LOG(WARNING, "cannot register interrupt handler for mlx5 socket: %s",
 			strerror(errno));
 		goto remove;
@@ -248,7 +206,8 @@ mlx5_pmd_socket_uninit(void)
 {
 	if (server_socket == -1)
 		return;
-	mlx5_pmd_interrupt_handler_uninstall();
+	mlx5_os_interrupt_handler_destroy(server_intr_handle,
+					  mlx5_pmd_socket_handle, NULL);
 	claim_zero(close(server_socket));
 	server_socket = -1;
 	MKSTR(path, MLX5_SOCKET_PATH, getpid());
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 305edffe71..7ebb2cc961 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1682,8 +1682,6 @@ int mlx5_sysfs_switch_info(unsigned int ifindex,
 			   struct mlx5_switch_info *info);
 void mlx5_translate_port_name(const char *port_name_in,
 			      struct mlx5_switch_info *port_info_out);
-void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-				   rte_intr_callback_fn cb_fn, void *cb_arg);
 int mlx5_sysfs_bond_info(unsigned int pf_ifindex, unsigned int *ifindex,
 			 char *ifname);
 int mlx5_get_module_info(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index fe74317fe8..f853a67f58 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -741,11 +741,8 @@ mlx5_txpp_interrupt_handler(void *cb_arg)
 static void
 mlx5_txpp_stop_service(struct mlx5_dev_ctx_shared *sh)
 {
-	if (!rte_intr_fd_get(sh->txpp.intr_handle))
-		return;
-	mlx5_intr_callback_unregister(sh->txpp.intr_handle,
-				      mlx5_txpp_interrupt_handler, sh);
-	rte_intr_instance_free(sh->txpp.intr_handle);
+	mlx5_os_interrupt_handler_destroy(sh->txpp.intr_handle,
+					  mlx5_txpp_interrupt_handler, sh);
 }
 
 /* Attach interrupt handler and fires first request to Rearm Queue. */
@@ -769,23 +766,12 @@ mlx5_txpp_start_service(struct mlx5_dev_ctx_shared *sh)
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	sh->txpp.intr_handle =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->txpp.intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
 	fd = mlx5_os_get_devx_channel_fd(sh->txpp.echan);
-	if (rte_intr_fd_set(sh->txpp.intr_handle, fd))
-		return -rte_errno;
-
-	if (rte_intr_type_set(sh->txpp.intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	if (rte_intr_callback_register(sh->txpp.intr_handle,
-				       mlx5_txpp_interrupt_handler, sh)) {
-		rte_intr_fd_set(sh->txpp.intr_handle, 0);
-		DRV_LOG(ERR, "Failed to register CQE interrupt %d.", rte_errno);
+	sh->txpp.intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, false,
+		 fd, mlx5_txpp_interrupt_handler, sh);
+	if (!sh->txpp.intr_handle) {
+		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		return -rte_errno;
 	}
 	/* Subscribe CQ event to the event channel controlled by the driver. */
diff --git a/drivers/net/mlx5/windows/mlx5_ethdev_os.c b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
index f97526580d..88d8213f55 100644
--- a/drivers/net/mlx5/windows/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
@@ -140,28 +140,6 @@ mlx5_set_mtu(struct rte_eth_dev *dev, uint16_t mtu)
 	return 0;
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	RTE_SET_USED(handle);
-	RTE_SET_USED(cb_fn);
-	RTE_SET_USED(cb_arg);
-}
-
 /**
  * DPDK callback to get flow control status.
  *
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
index e025be47d2..fd447cc650 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
@@ -93,22 +93,10 @@ mlx5_vdpa_virtqs_cleanup(struct mlx5_vdpa_priv *priv)
 static int
 mlx5_vdpa_virtq_unset(struct mlx5_vdpa_virtq *virtq)
 {
-	int ret = -EAGAIN;
-
-	if (rte_intr_fd_get(virtq->intr_handle) >= 0) {
-		while (ret == -EAGAIN) {
-			ret = rte_intr_callback_unregister(virtq->intr_handle,
-					mlx5_vdpa_virtq_kick_handler, virtq);
-			if (ret == -EAGAIN) {
-				DRV_LOG(DEBUG, "Try again to unregister fd %d of virtq %hu interrupt",
-					rte_intr_fd_get(virtq->intr_handle),
-					virtq->index);
-				usleep(MLX5_VDPA_INTR_RETRIES_USEC);
-			}
-		}
-		rte_intr_fd_set(virtq->intr_handle, -1);
-	}
-	rte_intr_instance_free(virtq->intr_handle);
+	int ret;
+
+	mlx5_os_interrupt_handler_destroy(virtq->intr_handle,
+					  mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->virtq) {
 		ret = mlx5_vdpa_virtq_stop(virtq->priv, virtq->index);
 		if (ret)
@@ -365,35 +353,13 @@ mlx5_vdpa_virtq_setup(struct mlx5_vdpa_priv *priv, int index)
 	virtq->priv = priv;
 	rte_write32(virtq->index, priv->virtq_db_addr);
 	/* Setup doorbell mapping. */
-	virtq->intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+	virtq->intr_handle = mlx5_os_interrupt_handler_create(
+				  RTE_INTR_INSTANCE_F_SHARED, false,
+				  vq.kickfd, mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->intr_handle == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		goto error;
 	}
-
-	if (rte_intr_fd_set(virtq->intr_handle, vq.kickfd))
-		goto error;
-
-	if (rte_intr_fd_get(virtq->intr_handle) == -1) {
-		DRV_LOG(WARNING, "Virtq %d kickfd is invalid.", index);
-	} else {
-		if (rte_intr_type_set(virtq->intr_handle, RTE_INTR_HANDLE_EXT))
-			goto error;
-
-		if (rte_intr_callback_register(virtq->intr_handle,
-					       mlx5_vdpa_virtq_kick_handler,
-					       virtq)) {
-			rte_intr_fd_set(virtq->intr_handle, -1);
-			DRV_LOG(ERR, "Failed to register virtq %d interrupt.",
-				index);
-			goto error;
-		} else {
-			DRV_LOG(DEBUG, "Register fd %d interrupt for virtq %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				index);
-		}
-	}
 	/* Subscribe virtq error event. */
 	virtq->version++;
 	cookie = ((uint64_t)virtq->version << 32) + index;
-- 
2.27.0


* [PATCH v3 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
  2022-05-24 15:20         ` [PATCH v3 1/7] net/mlx5: add LWM support for Rxq Spike Du
  2022-05-24 15:20         ` [PATCH v3 2/7] common/mlx5: share interrupt management Spike Du
@ 2022-05-24 15:20         ` Spike Du
  2022-05-24 15:20         ` [PATCH v3 4/7] net/mlx5: add LWM event handling support Spike Du
                           ` (5 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

LWM (limit watermark) describes the fullness of an Rx queue. When the
Rx queue fullness rises above the LWM, the device triggers the event
RTE_ETH_EVENT_RX_LWM.
LWM is defined as a percentage of the Rx queue size, with valid values
in the range [0,99].
Setting LWM to 0 disables it, which is the default.
Add driver callbacks for LWM configuration and query to eth_dev_ops.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
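Note, not part of the commit: a minimal sketch of the set/query semantics
described above, assuming a mock queue array instead of the real ethdev
structures. The names mock_rxq, mock_lwm_set, mock_lwm_query and MOCK_NB_RXQ
are hypothetical and exist only for this illustration; the actual API is
rte_eth_rx_lwm_set() / rte_eth_rx_lwm_query() as added by this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_NB_RXQ 4 /* hypothetical number of Rx queues on the port */

/* Hypothetical stand-in for the per-queue driver state. */
struct mock_rxq {
	uint8_t lwm;           /* watermark in percent, 0 means disabled */
	uint8_t event_pending; /* set when the watermark has been crossed */
};

/* Store a watermark after range checks on the queue id and percentage. */
static int
mock_lwm_set(struct mock_rxq *rxqs, uint16_t queue_id, uint8_t lwm)
{
	assert(rxqs != NULL);
	if (queue_id >= MOCK_NB_RXQ || lwm > 99)
		return -EINVAL;
	rxqs[queue_id].lwm = lwm;
	return 0;
}

/*
 * Scan the queues circularly, starting at *queue_id.
 * Returns 1 and updates *queue_id (and *lwm, if given) when a pending
 * event is found, 0 when none is pending, -EINVAL on a NULL queue_id.
 */
static int
mock_lwm_query(struct mock_rxq *rxqs, uint16_t *queue_id, uint8_t *lwm)
{
	uint16_t id, n;

	assert(rxqs != NULL);
	if (queue_id == NULL)
		return -EINVAL;
	/* An out-of-range start index is rewound to queue 0. */
	id = *queue_id >= MOCK_NB_RXQ ? 0 : *queue_id;
	for (n = 0; n < MOCK_NB_RXQ; n++) {
		if (rxqs[id].event_pending) {
			rxqs[id].event_pending = 0; /* consume the event */
			*queue_id = id;
			if (lwm != NULL)
				*lwm = rxqs[id].lwm;
			return 1;
		}
		id = (uint16_t)((id + 1) % MOCK_NB_RXQ);
	}
	return 0;
}
```

An application would typically call the query in a loop from its
RTE_ETH_EVENT_RX_LWM callback, incrementing queue_id between calls, until
the return value is 0.
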
 lib/ethdev/ethdev_driver.h | 22 ++++++++++++
 lib/ethdev/rte_ethdev.c    | 52 ++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 71 ++++++++++++++++++++++++++++++++++++++
 lib/ethdev/version.map     |  2 ++
 4 files changed, 147 insertions(+)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..49e4ef0fbb 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
 				    const struct rte_eth_rxconf *rx_conf,
 				    struct rte_mempool *mb_pool);
 
+/**
+ * @internal Set Rx queue limit watermark.
+ * @see rte_eth_rx_lwm_set()
+ */
+typedef int (*eth_rx_queue_lwm_set_t)(struct rte_eth_dev *dev,
+				      uint16_t rx_queue_id,
+				      uint8_t lwm);
+
+/**
+ * @internal Query queue limit watermark event.
+ * @see rte_eth_rx_lwm_query()
+ */
+
+typedef int (*eth_rx_queue_lwm_query_t)(struct rte_eth_dev *dev,
+					uint16_t *rx_queue_id,
+					uint8_t *lwm);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
 				    uint16_t tx_queue_id,
@@ -1168,6 +1185,11 @@ struct eth_dev_ops {
 	/** Priority flow control queue configure */
 	priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
 
+	/** Set Rx queue limit watermark. */
+	eth_rx_queue_lwm_set_t rx_queue_lwm_set;
+	/** Query Rx queue limit watermark event. */
+	eth_rx_queue_lwm_query_t rx_queue_lwm_query;
+
 	/** Set Unicast Table Array */
 	eth_uc_hash_table_set_t    uc_hash_table_set;
 	/** Set Unicast hash bitmap */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a175867651..e10e874aae 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4424,6 +4424,58 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
 							queue_idx, tx_rate));
 }
 
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id,
+		       uint8_t lwm)
+{
+	struct rte_eth_dev *dev;
+	struct rte_eth_dev_info dev_info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id >= dev_info.max_rx_queues) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue LWM: port %u: invalid queue ID=%u.\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (lwm > 99)
+		return -EINVAL;
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_set, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_set)(dev,
+							     queue_id, lwm));
+}
+
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+			 uint8_t *lwm)
+{
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_dev *dev;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id == NULL)
+		return -EINVAL;
+	if (*queue_id >= dev_info.max_rx_queues)
+		*queue_id = 0;
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_query, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_query)(dev,
+							     queue_id, lwm));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
 	uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04225bba4d..541178fa76 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1931,6 +1931,14 @@ struct rte_eth_rxq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 	uint16_t nb_desc;           /**< configured number of RXDs. */
 	uint16_t rx_buf_size;       /**< hardware receive buffer size. */
+	/**
+	 * Per-queue Rx limit watermark, defined as a percentage of the Rx
+	 * queue size. If the Rx queue fullness rises above this percentage,
+	 * the event RTE_ETH_EVENT_RX_LWM is triggered.
+	 * Value 0 means watermark monitoring is disabled and no event is
+	 * triggered.
+	 */
+	uint8_t lwm;
 } __rte_cache_min_aligned;
 
 /**
@@ -3672,6 +3680,64 @@ int rte_eth_dev_get_vlan_offload(uint16_t port_id);
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue based limit watermark.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The index of the receive queue.
+ * @param lwm
+ *  The limit watermark percentage of Rx queue size which describes
+ *  the fullness of Rx queue. If the Rx queue fullness is above LWM,
+ *  the device will trigger the event RTE_ETH_EVENT_RX_LWM.
+ *  [1-99] to set a new LWM.
+ *  0 to disable watermark monitoring.
+ *
+ * @return
+ *   - 0 if successful.
+ *   - negative if failed.
+ */
+__rte_experimental
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id, uint8_t lwm);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Query Rx queue based limit watermark event.
+ * The function scans all Rx queues of the port circularly until a
+ * pending LWM event is found or every queue has been checked once.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The API caller sets the starting Rx queue id in the pointer.
+ *  If the queue_id is bigger than the maximum queue id of the port,
+ *  it is rewound to 0 so that the application can keep calling
+ *  this function to handle all pending LWM events in the queues
+ *  with a simple increment between calls.
+ *  If a Rx queue has pending LWM event, the pointer is updated
+ *  with this Rx queue id; otherwise this pointer's content is
+ *  unchanged.
+ * @param lwm
+ *  The pointer to the limit watermark percentage of Rx queue.
+ *  If Rx queue with pending LWM event is found, the queue's LWM
+ *  percentage is stored in this pointer, otherwise the pointer's
+ *  content is unchanged.
+ *
+ * @return
+ *   - 1 if a Rx queue with pending LWM event is found.
+ *   - 0 if no Rx queue with pending LWM event is found.
+ *   - -EINVAL if queue_id is NULL.
+ */
+__rte_experimental
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+			 uint8_t *lwm);
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
@@ -3877,6 +3943,11 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	/**
+	 *  Watermark value is exceeded in a queue.
+	 *  @see rte_eth_rx_lwm_set()
+	 */
+	RTE_ETH_EVENT_RX_LWM,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index daca7851f2..2e60765bbd 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -285,6 +285,8 @@ EXPERIMENTAL {
 	rte_mtr_color_in_protocol_priority_get;
 	rte_mtr_color_in_protocol_set;
 	rte_mtr_meter_vlan_table_update;
+	rte_eth_rx_lwm_set;
+	rte_eth_rx_lwm_query;
 };
 
 INTERNAL {
-- 
2.27.0


* [PATCH v3 4/7] net/mlx5: add LWM event handling support
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
                           ` (2 preceding siblings ...)
  2022-05-24 15:20         ` [PATCH v3 3/7] ethdev: introduce Rx queue based limit watermark Spike Du
@ 2022-05-24 15:20         ` Spike Du
  2022-05-24 15:20         ` [PATCH v3 5/7] net/mlx5: support Rx queue based limit watermark Spike Du
                           ` (4 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

When the number of available RQ WQEs drops to the LWM, the kernel driver
raises an event to software.
Use a DevX event channel to catch this event and notify the user.
Allocate one such channel per shared device context.
The event cookie identifies the port and queue that raised the event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
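Note, not part of the commit: a minimal sketch of how the event cookie can
carry the port and queue, assuming the LWM_COOKIE_* layout this patch adds
to mlx5_rx.h (queue index in bits 0-15, port id in bits 16-31). The unpack
helper mirrors the decoding in mlx5_rx_devx_get_event_lwm(); the pack helper
and both function names are hypothetical, since the subscription side is not
part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cookie layout as defined by this patch in mlx5_rx.h. */
#define LWM_COOKIE_RXQID_OFFSET 0
#define LWM_COOKIE_RXQID_MASK 0xffff
#define LWM_COOKIE_PORTID_OFFSET 16
#define LWM_COOKIE_PORTID_MASK 0xffff

/* Hypothetical helper: build a cookie from a port id and an Rx queue index. */
static uint64_t
lwm_cookie_pack(uint16_t port_id, uint16_t rxq_idx)
{
	return ((uint64_t)(port_id & LWM_COOKIE_PORTID_MASK) <<
		LWM_COOKIE_PORTID_OFFSET) |
	       ((uint64_t)(rxq_idx & LWM_COOKIE_RXQID_MASK) <<
		LWM_COOKIE_RXQID_OFFSET);
}

/* Recover the port id and queue index from the low 32 bits of a cookie. */
static void
lwm_cookie_unpack(uint64_t cookie, int *rxq_idx, int *port_id)
{
	assert(rxq_idx != NULL && port_id != NULL);
	*port_id = (((uint32_t)cookie) >> LWM_COOKIE_PORTID_OFFSET) &
		   LWM_COOKIE_PORTID_MASK;
	*rxq_idx = (((uint32_t)cookie) >> LWM_COOKIE_RXQID_OFFSET) &
		   LWM_COOKIE_RXQID_MASK;
}
```

Because a single channel is shared by every Rx queue of the device context,
the cookie is the only information the interrupt handler has for routing the
event to the right port and queue.
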
 drivers/net/mlx5/mlx5.c      | 66 ++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h      |  7 ++++
 drivers/net/mlx5/mlx5_devx.c | 47 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h   |  7 ++++
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f0988712df..e04a66625e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <errno.h>
+#include <fcntl.h>
 
 #include <rte_malloc.h>
 #include <ethdev_driver.h>
@@ -22,6 +23,7 @@
 #include <rte_eal_paging.h>
 #include <rte_alarm.h>
 #include <rte_cycles.h>
+#include <rte_interrupts.h>
 
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
@@ -1524,6 +1526,69 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
 	return NULL;
 }
 
+/**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+	int fd_lwm;
+
+	pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+	priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+			(priv->sh->cdev->ctx,
+			 MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+	if (!priv->sh->devx_channel_lwm)
+		goto err;
+	fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+	priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+	if (!priv->sh->intr_handle_lwm)
+		goto err;
+	return 0;
+err:
+	if (priv->sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(priv->sh->devx_channel_lwm);
+		priv->sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+	return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before freeing this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+	if (sh->intr_handle_lwm) {
+		mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+			mlx5_dev_interrupt_handler_lwm, (void *)-1);
+		sh->intr_handle_lwm = NULL;
+	}
+	if (sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(sh->devx_channel_lwm);
+		sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
 /**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
@@ -1601,6 +1666,7 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
 		claim_zero(mlx5_devx_cmd_destroy(sh->td));
 	MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
 	pthread_mutex_destroy(&sh->txpp.mutex);
+	mlx5_lwm_unset(sh);
 	mlx5_free(sh);
 	return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc961..a76f2fed3d 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
 	unsigned int flow_max_priority;
 	enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+	void *devx_channel_lwm;
+	struct rte_intr_handle *intr_handle_lwm;
+	pthread_mutex_t lwm_config_lock;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
 	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
 	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+	int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1603,6 +1608,8 @@ int mlx5_net_remove(struct mlx5_common_device *cdev);
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *priv);
+void mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh);
 
 /* Macro to iterate over all valid ports for mlx5 driver. */
 #define MLX5_ETH_FOREACH_DEV(port_id, dev) \
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index c918a50ae9..6886ae1f22 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -232,6 +232,52 @@ mlx5_rx_devx_get_event(struct mlx5_rxq_obj *rxq_obj)
 #endif /* HAVE_IBV_DEVX_EVENT */
 }
 
+/**
+ * Get LWM event for shared context, return the correct port/rxq for this event.
+ *
+ * @param priv
+ *   Mlx5_priv object.
+ * @param rxq_idx [out]
+ *   Which rxq gets this event.
+ * @param port_id [out]
+ *   Which port gets this event.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_rx_devx_get_event_lwm(struct mlx5_priv *priv, int *rxq_idx, int *port_id)
+{
+#ifdef HAVE_IBV_DEVX_EVENT
+	union {
+		struct mlx5dv_devx_async_event_hdr event_resp;
+		uint8_t buf[sizeof(struct mlx5dv_devx_async_event_hdr) + 128];
+	} out;
+	int ret;
+
+	memset(&out, 0, sizeof(out));
+	ret = mlx5_glue->devx_get_event(priv->sh->devx_channel_lwm,
+					&out.event_resp,
+					sizeof(out.buf));
+	if (ret < 0) {
+		rte_errno = errno;
+		DRV_LOG(WARNING, "%s err\n", __func__);
+		return -rte_errno;
+	}
+	*port_id = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_PORTID_OFFSET) & LWM_COOKIE_PORTID_MASK;
+	*rxq_idx = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_RXQID_OFFSET) & LWM_COOKIE_RXQID_MASK;
+	return 0;
+#else
+	(void)priv;
+	(void)rxq_idx;
+	(void)port_id;
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+#endif /* HAVE_IBV_DEVX_EVENT */
+}
+
 /**
  * Create a RQ object using DevX.
  *
@@ -1421,6 +1467,7 @@ struct mlx5_obj_ops devx_obj_ops = {
 	.rxq_event_get = mlx5_rx_devx_get_event,
 	.rxq_obj_modify = mlx5_devx_modify_rq,
 	.rxq_obj_release = mlx5_rxq_devx_obj_release,
+	.rxq_event_get_lwm = mlx5_rx_devx_get_event_lwm,
 	.ind_table_new = mlx5_devx_ind_table_new,
 	.ind_table_modify = mlx5_devx_ind_table_modify,
 	.ind_table_destroy = mlx5_devx_ind_table_destroy,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index e5eea0ad94..7d556c2b45 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -1187,3 +1187,36 @@ mlx5_check_vec_rx_support(struct rte_eth_dev *dev __rte_unused)
 {
 	return -ENOTSUP;
 }
+
+/**
+ * Rte interrupt handler for LWM event.
+ * It first checks if the event arrives, if so process the callback for
+ * RTE_ETH_EVENT_RX_LWM.
+ *
+ * @param args
+ *   Generic pointer to mlx5_priv.
+ */
+void
+mlx5_dev_interrupt_handler_lwm(void *args)
+{
+	struct mlx5_priv *priv = args;
+	struct mlx5_rxq_priv *rxq;
+	struct rte_eth_dev *dev;
+	int ret, rxq_idx = 0, port_id = 0;
+
+	ret = priv->obj_ops.rxq_event_get_lwm(priv, &rxq_idx, &port_id);
+	if (unlikely(ret < 0)) {
+		DRV_LOG(WARNING, "Cannot get LWM event context.");
+		return;
+	}
+	DRV_LOG(INFO, "%s get LWM event, port_id:%d rxq_id:%d.", __func__,
+		port_id, rxq_idx);
+	dev = &rte_eth_devices[port_id];
+	rxq = mlx5_rxq_get(dev, rxq_idx);
+	if (rxq) {
+		pthread_mutex_lock(&priv->sh->lwm_config_lock);
+		rxq->lwm_event_pending = 1;
+		pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	}
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_LWM, NULL);
+}
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 25a5f2c1fa..068dff5863 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -176,6 +176,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	uint32_t lwm_event_pending:1;
 };
 
 /* External RX queue descriptor. */
@@ -295,6 +296,7 @@ void mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
+void mlx5_dev_interrupt_handler_lwm(void *args);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
@@ -675,4 +677,9 @@ mlx5_is_external_rxq(struct rte_eth_dev *dev, uint16_t queue_idx)
 	return !!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED);
 }
 
+#define LWM_COOKIE_RXQID_OFFSET 0
+#define LWM_COOKIE_RXQID_MASK 0xffff
+#define LWM_COOKIE_PORTID_OFFSET 16
+#define LWM_COOKIE_PORTID_MASK 0xffff
+
 #endif /* RTE_PMD_MLX5_RX_H_ */
-- 
2.27.0


* [PATCH v3 5/7] net/mlx5: support Rx queue based limit watermark
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
                           ` (3 preceding siblings ...)
  2022-05-24 15:20         ` [PATCH v3 4/7] net/mlx5: add LWM event handling support Spike Du
@ 2022-05-24 15:20         ` Spike Du
  2022-05-24 15:20         ` [PATCH v3 6/7] net/mlx5: add private API to config host port shaper Spike Du
                           ` (3 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add mlx5-specific LWM (limit watermark) configuration and query handlers.
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
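Note, not part of the commit: a minimal sketch of the unit conversion behind
mlx5_rxq_lwm_to_percentage(), assuming a queue of 2^elts_n WQEs. The ethdev
LWM is a fullness percentage while the mlx5 LWM is a count of remaining WQEs
(emptiness). hw_lwm_to_percent() follows the formula in this patch;
percent_to_hw_lwm() is a hypothetical inverse shown only to illustrate the
integer rounding, not the driver's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* WQE-count watermark (emptiness) to ethdev percentage (fullness). */
static uint8_t
hw_lwm_to_percent(uint16_t hw_lwm, unsigned int elts_n)
{
	uint32_t wqe_cnt;

	assert(elts_n < 32);
	wqe_cnt = UINT32_C(1) << elts_n;
	assert(hw_lwm <= wqe_cnt);
	/* 0 means the watermark is disabled. */
	return hw_lwm ? (uint8_t)(100 - hw_lwm * 100 / wqe_cnt) : 0;
}

/* Hypothetical inverse: ethdev percentage to WQE-count watermark. */
static uint16_t
percent_to_hw_lwm(uint8_t percent, unsigned int elts_n)
{
	uint32_t wqe_cnt;

	assert(elts_n <= 16 && percent <= 99);
	wqe_cnt = UINT32_C(1) << elts_n;
	return percent ? (uint16_t)(wqe_cnt * (100 - percent) / 100) : 0;
}
```

With 1024 WQEs, 75% maps to 256 WQEs and back to 75%, while 70% maps to 307
WQEs and reads back as 71% because of integer division, so a queried value
may differ slightly from the one that was set.
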
 doc/guides/nics/mlx5.rst               |  12 ++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h         |   1 +
 drivers/net/mlx5/mlx5.c                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 156 +++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h             |   5 +
 6 files changed, 177 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56de11..79f56018ef 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- LWM:
+
+  - Shared Rx queues and hairpin Rx queues are not supported.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
    $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+LWM introduction
+----------------
+
+LWM (Limit WaterMark) is a per-Rx-queue attribute that is configured as
+a percentage of the Rx queue size.
+When the Rx queue fullness rises above the LWM, an event is sent to the PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 0ed4f92820..34f86eaffa 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -89,6 +89,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue LWM (Limit WaterMark) support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5100..3b5e60532a 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
 	MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+	MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a66625e..35ae51b3af 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ const struct eth_dev_ops mlx5_dev_ops = {
 	.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
 	.vlan_filter_set = mlx5_vlan_filter_set,
 	.rx_queue_setup = mlx5_rx_queue_setup,
+	.rx_queue_lwm_set = mlx5_rx_queue_lwm_set,
+	.rx_queue_lwm_query = mlx5_rx_queue_lwm_query,
 	.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
 	.tx_queue_setup = mlx5_tx_queue_setup,
 	.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 7d556c2b45..406eae9b39 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
 #include <mlx5_common_mr.h>
+#include <rte_pmd_mlx5.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,17 @@ mlx5_rx_descriptor_status(void *rx_queue, uint16_t offset)
 	return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+	uint32_t wqe_cnt = 1 << rxq_data->elts_n;
+
+	/* ethdev LWM describes fullness, mlx5 LWM describes emptiness. */
+	return rxq->lwm ? (100 - rxq->lwm * 100 / wqe_cnt) : 0;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +163,7 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 {
 	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
 	struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+	struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
 	if (!rxq)
 		return;
@@ -169,6 +183,8 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 	qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
 		RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
 		RTE_BIT32(rxq->elts_n);
+	qinfo->lwm = rxq_priv ?
+		mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1204,34 @@ mlx5_check_vec_rx_support(struct rte_eth_dev *dev __rte_unused)
 	return -ENOTSUP;
 }
 
+int
+mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev,
+			uint16_t *queue_id, uint8_t *lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	unsigned int rxq_id, found = 0, n;
+	struct mlx5_rxq_priv *rxq;
+
+	if (!queue_id)
+		return -EINVAL;
+	/* Query all the Rx queues of the port in a circular way. */
+	for (rxq_id = *queue_id, n = 0; n < priv->rxqs_n; n++) {
+		rxq = mlx5_rxq_get(dev, rxq_id);
+		if (rxq && rxq->lwm_event_pending) {
+			pthread_mutex_lock(&priv->sh->lwm_config_lock);
+			rxq->lwm_event_pending = 0;
+			pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+			*queue_id = rxq_id;
+			found = 1;
+			if (lwm)
+				*lwm = mlx5_rxq_lwm_to_percentage(rxq);
+			break;
+		}
+		rxq_id = (rxq_id + 1) % priv->rxqs_n;
+	}
+	return found;
+}
+
 /**
  * Rte interrupt handler for LWM event.
  * It first checks if the event arrives, if so process the callback for
@@ -1220,3 +1264,115 @@ mlx5_dev_interrupt_handler_lwm(void *args)
 	}
 	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_LWM, NULL);
 }
+
+/**
+ * DPDK callback to arm an Rx queue LWM (limit watermark) event.
+ * When the Rx queue fullness reaches the LWM limit, the driver catches
+ * an HW event and invokes the user event callback.
+ * After the last event handling, the user needs to call this API again
+ * to arm an additional event.
+ *
+ * @param dev
+ *   Pointer to the device structure.
+ * @param[in] rx_queue_id
+ *   Rx queue identifier.
+ * @param[in] lwm
+ *   The LWM value, defined as a percentage of the Rx queue size.
+ *   [1-99] to set a new LWM (update the old value).
+ *   0 to unarm the event.
+ *
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOMEM - not enough memory to create LWM event channel.
+ *   - EINVAL - the input Rxq is not created by devx.
+ *   - E2BIG  - lwm is bigger than 99.
+ */
+int
+mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+		      uint8_t lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint16_t port_id = PORT_ID(priv);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	uint16_t event_nums[1] = {MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED};
+	struct mlx5_rxq_data *rxq_data;
+	uint32_t wqe_cnt;
+	uint64_t cookie;
+	int ret = 0;
+
+	if (!rxq) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	rxq_data = &rxq->ctrl->rxq;
+	/* Ensure the Rq is created by devx. */
+	if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (lwm > 99) {
+		DRV_LOG(WARNING, "Too big LWM configuration.");
+		rte_errno = E2BIG;
+		return -rte_errno;
+	}
+	/* Start config LWM. */
+	pthread_mutex_lock(&priv->sh->lwm_config_lock);
+	if (rxq->lwm == 0 && lwm == 0) {
+		/* Both old/new values are 0, do nothing. */
+		ret = 0;
+		goto end;
+	}
+	wqe_cnt = 1 << rxq_data->elts_n;
+	if (lwm) {
+		if (!priv->sh->devx_channel_lwm) {
+			ret = mlx5_lwm_setup(priv);
+			if (ret) {
+				DRV_LOG(WARNING,
+					"Failed to create shared_lwm.");
+				rte_errno = ENOMEM;
+				ret = -rte_errno;
+				goto end;
+			}
+		}
+		if (!rxq->lwm_devx_subscribed) {
+			cookie = ((uint32_t)
+				  (port_id << LWM_COOKIE_PORTID_OFFSET)) |
+				(rx_queue_id << LWM_COOKIE_RXQID_OFFSET);
+			ret = mlx5_os_devx_subscribe_devx_event
+				(priv->sh->devx_channel_lwm,
+				 rxq->devx_rq.rq->obj,
+				 sizeof(event_nums),
+				 event_nums,
+				 cookie);
+			if (ret) {
+				rte_errno = rte_errno ? rte_errno : EINVAL;
+				ret = -rte_errno;
+				goto end;
+			}
+			rxq->lwm_devx_subscribed = 1;
+		}
+	}
+	/* The ethdev LWM describes fullness, mlx5 lwm describes emptiness. */
+	if (lwm)
+		lwm = 100 - lwm;
+	/* Save LWM to rxq and send modify_rq devx command. */
+	rxq->lwm = lwm * wqe_cnt / 100;
+	/* Round up to avoid division loss when converting back to percentage. */
+	if (lwm && (lwm * wqe_cnt % 100)) {
+		rxq->lwm = ((uint32_t)(rxq->lwm + 1) >= wqe_cnt) ?
+			rxq->lwm : (rxq->lwm + 1);
+	}
+	if (lwm && !rxq->lwm) {
+		/* With mprq, wqe_cnt may be < 100. */
+		DRV_LOG(WARNING, "Too small LWM configuration.");
+		rte_errno = EINVAL;
+		ret = -rte_errno;
+		goto end;
+	}
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RDY);
+end:
+	pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	return ret;
+}
+
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 068dff5863..e078aaf3dc 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -177,6 +177,7 @@ struct mlx5_rxq_priv {
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
 	uint32_t lwm_event_pending:1;
+	uint32_t lwm_devx_subscribed:1;
 };
 
 /* External RX queue descriptor. */
@@ -297,6 +298,10 @@ int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 void mlx5_dev_interrupt_handler_lwm(void *args);
+int mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+			  uint8_t lwm);
+int mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev, uint16_t *rx_queue_id,
+			    uint8_t *lwm);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
-- 
2.27.0
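
Reviewer sketch, not part of the patch: the percentage handling in
mlx5_rx_queue_lwm_set() and mlx5_rxq_lwm_to_percentage() above can be modeled
in isolation. The ethdev value is a fullness percentage, while rxq->lwm stores
an emptiness count in WQEs, rounded up so that, for queues of 100 WQEs or more,
reading it back returns the percentage that was set. The helper names
lwm_percent_to_wqe() and lwm_wqe_to_percent() are hypothetical and exist only
for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ethdev fullness percentage -> mlx5 emptiness WQE count. */
static uint16_t
lwm_percent_to_wqe(uint8_t fullness_pct, uint32_t wqe_cnt)
{
	uint32_t emptiness, n;

	assert(fullness_pct <= 99 && wqe_cnt > 0);
	if (fullness_pct == 0)
		return 0; /* 0 means the event is not armed. */
	/* The ethdev LWM describes fullness, the mlx5 LWM describes emptiness. */
	emptiness = 100u - fullness_pct;
	n = emptiness * wqe_cnt / 100;
	/* Round up on a remainder, but never reach the full queue size. */
	if ((emptiness * wqe_cnt % 100) != 0 && n + 1 < wqe_cnt)
		n++;
	return (uint16_t)n;
}

/* Hypothetical helper: mlx5 emptiness WQE count -> ethdev fullness percentage. */
static uint8_t
lwm_wqe_to_percent(uint16_t lwm_wqe, uint32_t wqe_cnt)
{
	assert(wqe_cnt > 0);
	return lwm_wqe ? (uint8_t)(100u - lwm_wqe * 100u / wqe_cnt) : 0;
}
```

With a 256-entry queue, a 30% fullness setting becomes an emptiness count of
180 WQEs and reads back as 30%. Without the round-up it would be stored as 179
and read back as 31%.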



* [PATCH v3 6/7] net/mlx5: add private API to config host port shaper
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
                           ` (4 preceding siblings ...)
  2022-05-24 15:20         ` [PATCH v3 5/7] net/mlx5: support Rx queue based limit watermark Spike Du
@ 2022-05-24 15:20         ` Spike Du
  2022-05-24 15:20         ` [PATCH v3 7/7] app/testpmd: add LWM and Host Shaper command Spike Du
                           ` (2 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

The host port shaper can be configured through QSHR (QoS Shaper Host Register).
Add a check in the build files so that this feature is enabled only when
libmtcr_ul (mstflint) is found.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper configures a shaper rate and an lwm-triggered flag for a
host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM (Limit Watermark) event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
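Reviewer sketch, not part of the commit: a minimal, self-contained model of
how rte_pmd_mlx5_host_shaper_config() in this patch interprets its rate and
flags arguments. It covers only the bookkeeping of the two fields added to
struct mlx5_dev_ctx_shared; the QSHR register write is left out. The names
shaper_state, shaper_config_model and SHAPER_FLAG_LWM_TRIGGERED are
hypothetical stand-ins chosen for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED: a bit index, not a mask. */
#define SHAPER_FLAG_LWM_TRIGGERED 0

/* Hypothetical stand-in for the fields added to mlx5_dev_ctx_shared. */
struct shaper_state {
	uint8_t host_shaper_rate; /* Unit is 100Mbps, 0 disables the shaper. */
	bool lwm_triggered;       /* Firmware arms a 100Mbps shaper on LWM. */
};

static int
shaper_config_model(struct shaper_state *st, uint8_t rate, uint32_t flags)
{
	bool lwm_flag;

	assert(st != NULL);
	lwm_flag = !!(flags & (UINT32_C(1) << SHAPER_FLAG_LWM_TRIGGERED));
	if (!lwm_flag) {
		/* Immediate mode: any rate is accepted and stored. */
		st->host_shaper_rate = rate;
		return 0;
	}
	/* LWM-triggered mode: only rate 0 (off) and 1 (100Mbps) are valid. */
	switch (rate) {
	case 0:
		st->lwm_triggered = false;
		return 0;
	case 1:
		st->lwm_triggered = true;
		return 0;
	default:
		return -ENOTSUP;
	}
}
```

Calling the model with (rate 1, flag bit set) and then (rate 50, no flags)
leaves lwm_triggered enabled together with a 5Gbps immediate rate. The two
settings are stored independently, so one call never clears the other.
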
 doc/guides/nics/mlx5.rst               |  26 +++++++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 ++++
 drivers/common/mlx5/mlx5_prm.h         |  25 ++++++
 drivers/net/mlx5/mlx5.h                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 103 +++++++++++++++++++++++++
 drivers/net/mlx5/rte_pmd_mlx5.h        |  30 +++++++
 drivers/net/mlx5/version.map           |   2 +
 8 files changed, 202 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 79f56018ef..3da6f5a03c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField 2.
+  - When configuring the host shaper with the MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED
+    flag set, only rates 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,22 @@ LWM (Limit WaterMark) is a per Rx queue attribute, it should be configured as
 a percentage of the Rx queue size.
 When Rx queue fullness is above LWM, an event is sent to PMD.
 
+Host shaper introduction
+------------------------
+
+The host shaper register is a per-host-port register that sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to the same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from the host.
+The host shaper has two modes for setting the shaper: immediate, and deferred
+to an LWM event trigger. In immediate mode, the rate limit is configured
+immediately on the host shaper. In deferred mode, the shaper is not set until
+an LWM event is received by any Rx queue in a VF representor belonging to the
+host port. Deferred mode supports only a rate of 100Mbps (immediate mode has
+no limit on the supported rates). In deferred mode, the shaper is set
+on the host port by the firmware upon receiving the LWM event, which allows
+throttling host traffic on LWM events at minimum latency, preventing excess
+drops in the Rx queue.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 34f86eaffa..94720af3af 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -90,6 +90,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b027..51c6e5dd2e 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
     ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+    libmtcr_ul_found = true
+    ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
         [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
             'ibv_import_device' ],
 ]
+if libmtcr_ul_found
+    has_sym_args += [
+        [ 'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+            'mopen' ],
+    ]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
     config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3b5e60532a..92d05a7368 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3771,6 +3771,7 @@ enum {
 	MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
 	MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
 	MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC005,
+	MLX5_QSHR_REGISTER_ID = 0x4030,
 };
 
 struct mlx5_ifc_register_mtutc_bits {
@@ -3785,6 +3786,30 @@ struct mlx5_ifc_register_mtutc_bits {
 	u8 time_adjustment[0x20];
 };
 
+struct mlx5_ifc_ets_global_config_register_bits {
+	u8 reserved_at_0[0x2];
+	u8 rate_limit_update[0x1];
+	u8 reserved_at_3[0x29];
+	u8 max_bw_units[0x4];
+	u8 reserved_at_48[0x8];
+	u8 max_bw_value[0x8];
+};
+
+#define ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED      0x0
+#define ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS 0x3
+#define ETS_GLOBAL_CONFIG_BW_UNIT_GBPS          0x4
+
+struct mlx5_ifc_register_qshr_bits {
+	u8 reserved_at_0[0x4];
+	u8 connected_host[0x1];
+	u8 vqos[0x1];
+	u8 fast_response[0x1];
+	u8 reserved_at_7[0x1];
+	u8 local_port[0x8];
+	u8 reserved_at_16[0x230];
+	struct mlx5_ifc_ets_global_config_register_bits global_config;
+};
+
 #define MLX5_MTUTC_TIMESTAMP_MODE_INTERNAL_TIMER 0
 #define MLX5_MTUTC_TIMESTAMP_MODE_REAL_TIME 1
 
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index a76f2fed3d..8af84aef50 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1271,6 +1271,8 @@ struct mlx5_dev_ctx_shared {
 	void *devx_channel_lwm;
 	struct rte_intr_handle *intr_handle_lwm;
 	pthread_mutex_t lwm_config_lock;
+	uint32_t host_shaper_rate:8;
+	uint32_t lwm_triggered:1;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 406eae9b39..b503d89289 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -28,6 +28,9 @@
 #include "mlx5_rxtx.h"
 #include "mlx5_devx.h"
 #include "mlx5_rx.h"
+#ifdef HAVE_MLX5_MSTFLINT
+#include <mstflint/mtcr.h>
+#endif
 
 
 static __rte_always_inline uint32_t
@@ -1376,3 +1379,103 @@ mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 	return ret;
 }
 
+/**
+ * Mlx5 access register function to configure host shaper.
+ * It calls the libmtcr_ul API to access QSHR (QoS Shaper Host Register)
+ * in firmware.
+ *
+ * @param dev
+ *   Pointer to rte_eth_dev.
+ * @param lwm_triggered
+ *   Flag to enable/disable lwm_triggered bit in QSHR.
+ * @param rate
+ *   Host shaper rate in units of 100Mbps; 0 disables the shaper.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+static int
+mlxreg_host_shaper_config(struct rte_eth_dev *dev,
+			  bool lwm_triggered, uint8_t rate)
+{
+#ifdef HAVE_MLX5_MSTFLINT
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint32_t data[MLX5_ST_SZ_DW(register_qshr)] = {0};
+	int rc, retry_count = 3;
+	mfile *mf = NULL;
+	int status;
+	void *ptr;
+
+	mf = mopen(priv->sh->ibdev_name);
+	if (!mf) {
+		DRV_LOG(WARNING, "mopen failed");
+		rte_errno = ENOENT;
+		return -rte_errno;
+	}
+	MLX5_SET(register_qshr, data, connected_host, 1);
+	MLX5_SET(register_qshr, data, fast_response, lwm_triggered ? 1 : 0);
+	MLX5_SET(register_qshr, data, local_port, 1);
+	ptr = MLX5_ADDR_OF(register_qshr, data, global_config);
+	MLX5_SET(ets_global_config_register, ptr, rate_limit_update, 1);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_units,
+		 rate ? ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS :
+		 ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_value, rate);
+	do {
+		rc = maccess_reg(mf,
+				 MLX5_QSHR_REGISTER_ID,
+				 MACCESS_REG_METHOD_SET,
+				 (u_int32_t *)&data[0],
+				 sizeof(data),
+				 sizeof(data),
+				 sizeof(data),
+				 &status);
+		if ((rc != ME_ICMD_STATUS_IFC_BUSY &&
+		     status != ME_REG_ACCESS_BAD_PARAM) ||
+		    !(mf->flags & MDEVS_REM)) {
+			break;
+		}
+		DRV_LOG(WARNING, "%s retry.", __func__);
+		usleep(10000);
+	} while (retry_count-- > 0);
+	mclose(mf);
+	rte_errno = (rc == ME_REG_ACCESS_DEV_BUSY) ? EBUSY : EIO;
+	return rc ? -rte_errno : 0;
+#else
+	(void)dev;
+	(void)lwm_triggered;
+	(void)rate;
+	return -1;
+#endif
+}
+
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate,
+				    uint32_t flags)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	struct mlx5_priv *priv = dev->data->dev_private;
+	bool lwm_triggered =
+		!!(flags & RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+
+	if (!lwm_triggered) {
+		priv->sh->host_shaper_rate = rate;
+	} else {
+		switch (rate) {
+		case 0:
+		/* Rate 0 means disable lwm_triggered. */
+			priv->sh->lwm_triggered = 0;
+			break;
+		case 1:
+		/* Rate 1 means enable lwm_triggered. */
+			priv->sh->lwm_triggered = 1;
+			break;
+		default:
+			return -ENOTSUP;
+		}
+	}
+	return mlxreg_host_shaper_config(dev, priv->sh->lwm_triggered,
+					 priv->sh->host_shaper_rate);
+}
diff --git a/drivers/net/mlx5/rte_pmd_mlx5.h b/drivers/net/mlx5/rte_pmd_mlx5.h
index 6e7907ee59..9964126df5 100644
--- a/drivers/net/mlx5/rte_pmd_mlx5.h
+++ b/drivers/net/mlx5/rte_pmd_mlx5.h
@@ -109,6 +109,36 @@ __rte_experimental
 int rte_pmd_mlx5_external_rx_queue_id_unmap(uint16_t port_id,
 					    uint16_t dpdk_idx);
 
+/**
+ * The rate of the host port shaper will be updated directly at the next
+ * LWM event to the rate that comes with this flag set; set rate 0
+ * to disable this rate update.
+ * Unset this flag to update the rate of the host port shaper directly in
+ * the API call; use rate 0 to disable the current shaper.
+ */
+#define MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED 0
+
+/**
+ * Configure a HW shaper to limit Tx rate for a host port.
+ * The configuration will affect all the ethdev ports belonging to
+ * the same rte_device.
+ *
+ * @param[in] port_id
+ *   The port identifier of the Ethernet device.
+ * @param[in] rate
+ *   Unit is 100Mbps, setting the rate to 0 disables the shaper.
+ * @param[in] flags
+ *   Host shaper flags.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+__rte_experimental
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate, uint32_t flags);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/mlx5/version.map b/drivers/net/mlx5/version.map
index 79cb79acc6..c97dfe440a 100644
--- a/drivers/net/mlx5/version.map
+++ b/drivers/net/mlx5/version.map
@@ -12,4 +12,6 @@ EXPERIMENTAL {
 	# added in 22.03
 	rte_pmd_mlx5_external_rx_queue_id_map;
 	rte_pmd_mlx5_external_rx_queue_id_unmap;
+	# added in 22.07
+	rte_pmd_mlx5_host_shaper_config;
 };
-- 
2.27.0



* [PATCH v3 7/7] app/testpmd: add LWM and Host Shaper command
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
                           ` (5 preceding siblings ...)
  2022-05-24 15:20         ` [PATCH v3 6/7] net/mlx5: add private API to config host port shaper Spike Du
@ 2022-05-24 15:20         ` Spike Du
  2022-05-24 15:59         ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Thomas Monjalon
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-05-24 15:20 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas; +Cc: dev, rasland

Add testpmd commands to configure the per-Rx-queue LWM and the host shaper.
- Command syntax:
  set port <port_id> rxq <rxq_id> lwm <lwm_num>
  mlx5 set port <port_id> host_shaper lwm_triggered <0|1> rate <rate_num>

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50

Add sample code to handle the rxq LWM event: it waits a while so that the rxq
empties, then disables the host shaper and rearms the LWM event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
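Reviewer sketch, not part of the commit: the sample handler passes the port
and queue to the alarm callback as a single pointer-sized argument. The
packing can be tried in isolation as below. The helper names pack_port_rxq()
and unpack_port_rxq() are hypothetical. The explicit uint32_t cast before the
shift is an assumption about the intended behavior: it keeps queue IDs of
0x8000 and above from overflowing a signed int after integer promotion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: port_id in the low 16 bits, rxq_id in the high 16. */
static void *
pack_port_rxq(uint16_t port_id, uint16_t rxq_id)
{
	uint32_t v = (uint32_t)port_id | ((uint32_t)rxq_id << 16);

	return (void *)(uintptr_t)v;
}

/* Hypothetical helper: recover both IDs inside the alarm callback. */
static void
unpack_port_rxq(void *arg, uint16_t *port_id, uint16_t *rxq_id)
{
	uint32_t v = (uint32_t)(uintptr_t)arg;

	assert(port_id != NULL && rxq_id != NULL);
	*port_id = v & 0xffff;
	*rxq_id = (v >> 16) & 0xffff;
}
```

The round trip holds for the full 16-bit range of both fields, so the
callback does not need a heap-allocated context.
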
 app/test-pmd/cmdline.c          |  74 +++++++++++++
 app/test-pmd/config.c           |  21 ++++
 app/test-pmd/meson.build        |   4 +
 app/test-pmd/testpmd.c          |  24 +++++
 app/test-pmd/testpmd.h          |   1 +
 doc/guides/nics/mlx5.rst        |  46 ++++++++
 drivers/net/mlx5/mlx5_testpmd.c | 184 ++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h |  27 +++++
 8 files changed, 381 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 1e5b294ab3..86342f2ac6 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -67,6 +67,9 @@
 #include "cmdline_mtr.h"
 #include "cmdline_tm.h"
 #include "bpf_cmd.h"
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 static struct cmdline *testpmd_cl;
 
@@ -17804,6 +17807,73 @@ cmdline_parse_inst_t cmd_show_port_flow_transfer_proxy = {
 	}
 };
 
+/* *** SET LIMIT WATERMARK FOR AN RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t rxq;
+	uint16_t rxq_num;
+	cmdline_fixed_string_t lwm;
+	uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_rxq_lwm_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->rxq, "rxq") == 0)
+	    && (strcmp(res->lwm, "lwm") == 0))
+		ret = set_rxq_lwm(res->port_num, res->rxq_num,
+				  res->lwm_num);
+	if (ret < 0)
+		printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+				lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+	.f = cmd_rxq_lwm_parsed,
+	.data = (void *)0,
+	.help_str = "set port <port_id> rxq <rxq_id> lwm <lwm_num>: "
+		"Set lwm for rxq on port_id",
+	.tokens = {
+		(void *)&cmd_rxq_lwm_set,
+		(void *)&cmd_rxq_lwm_port,
+		(void *)&cmd_rxq_lwm_portnum,
+		(void *)&cmd_rxq_lwm_rxq,
+		(void *)&cmd_rxq_lwm_rxqnum,
+		(void *)&cmd_rxq_lwm_lwm,
+		(void *)&cmd_rxq_lwm_lwmnum,
+		NULL,
+	},
+};
+
 /* ******************************************************************************** */
 
 /* list of instructions */
@@ -18091,6 +18161,10 @@ cmdline_parse_ctx_t main_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_show_capability,
 	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
 	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
+	(cmdline_parse_inst_t *)&cmd_rxq_lwm,
+#ifdef RTE_NET_MLX5
+	(cmdline_parse_inst_t *)&mlx5_test_cmd_port_host_shaper,
+#endif
 	NULL,
 };
 
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 1b1e738f83..a752c6367f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -6342,3 +6342,24 @@ show_mcast_macs(portid_t port_id)
 		printf("  %s\n", buf);
 	}
 }
+
+int
+set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm)
+{
+	struct rte_eth_link link;
+	int ret;
+
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+	ret = eth_link_get_nowait_print_err(port_id, &link);
+	if (ret < 0)
+		return -EINVAL;
+	if (lwm > 99)
+		return -EINVAL;
+	ret = rte_eth_rx_lwm_set(port_id, queue_idx, lwm);
+
+	if (ret)
+		return ret;
+	return 0;
+}
+
diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index 43130c8856..c3577a02c1 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -73,3 +73,7 @@ endif
 if dpdk_conf.has('RTE_NET_DPAA')
     deps += ['bus_dpaa', 'mempool_dpaa', 'net_dpaa']
 endif
+if dpdk_conf.has('RTE_NET_MLX5')
+    deps += 'net_mlx5'
+    sources += files('../../drivers/net/mlx5/mlx5_testpmd.c')
+endif
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 777763f749..ee6693dddf 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include <rte_eth_bond.h>
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -420,6 +423,7 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RX_LWM] = "rxq limit reached",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -3616,6 +3620,10 @@ static int
 eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		  void *ret_param)
 {
+	struct rte_eth_dev_info dev_info;
+	uint16_t rxq_id;
+	uint8_t lwm;
+	int ret;
 	RTE_SET_USED(param);
 	RTE_SET_USED(ret_param);
 
@@ -3647,6 +3655,22 @@ eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		ports[port_id].port_status = RTE_PORT_CLOSED;
 		printf("Port %u is closed\n", port_id);
 		break;
+	case RTE_ETH_EVENT_RX_LWM:
+		ret = rte_eth_dev_info_get(port_id, &dev_info);
+		if (ret != 0)
+			break;
+		/* LWM query API rewinds rxq_id, no need to check max rxq num. */
+		for (rxq_id = 0; ; rxq_id++) {
+			ret = rte_eth_rx_lwm_query(port_id, &rxq_id, &lwm);
+			if (ret <= 0)
+				break;
+			printf("Received LWM event, port:%d rxq_id:%d\n",
+			       port_id, rxq_id);
+#ifdef RTE_NET_MLX5
+			mlx5_test_lwm_event_handler(port_id, rxq_id);
+#endif
+		}
+		break;
 	default:
 		break;
 	}
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f04a9a11b4..f2ecbe7013 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1166,6 +1166,7 @@ int update_jumbo_frame_offload(portid_t portid);
 void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
 void flex_item_destroy(portid_t port_id, uint16_t flex_id);
 void port_flex_item_flush(portid_t port_id);
+int set_rxq_lwm(portid_t port_id, uint16_t queue_idx, uint16_t lwm);
 
 extern int flow_parse(const char *src, void *result, unsigned int size,
 		      struct rte_flow_attr **attr,
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 3da6f5a03c..fb1c957544 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1718,3 +1718,49 @@ on the host port by the firmware upon receiving the LWM event, which allows
 throttling host traffic on LWM events at minimum latency, preventing excess
 drops in the Rx queue.
 
+How to use LWM and Host Shaper
+------------------------------
+
+Testpmd provides sample commands to configure LWM and sample logic to handle
+the LWM event.
+The typical workflow is as follows: testpmd configures LWM for the Rx queues,
+enables lwm_triggered in the host shaper, and registers a callback. When
+traffic from the host is too high and Rx queue fullness rises above LWM, the
+PMD receives an event and the firmware automatically configures a 100Mbps
+shaper on the host port. The PMD then calls the registered callback, which
+waits a while to let the Rx queue empty, then disables the host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is the uplink, port 1
+is a VF representor. Each port has 2 Rx queues.
+To control traffic from the host to ARM, we can enable LWM in testpmd with:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 lwm 70
+   testpmd> set port 1 rxq 1 lwm 70
+
+The first command disables the current host shaper and enables lwm_triggered mode.
+The remaining commands set LWM to 70% of the Rx queue size for both Rx queues.
+When traffic from the host is too high, the testpmd console logs the received
+LWM event, and the host shaper is then disabled.
+The host traffic rate is thus controlled and fewer drops happen in the Rx queues.
+
+To disable LWM and lwm_triggered, invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 lwm 0
+   testpmd> set port 1 rxq 1 lwm 0
+
+It is recommended that an application disable LWM and lwm_triggered before
+exiting, if it enabled them earlier.
+
+We can also configure the shaper with a value; the rate unit is 100Mbps. The
+command below sets the current shaper to 5Gbps and disables lwm_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50
+
diff --git a/drivers/net/mlx5/mlx5_testpmd.c b/drivers/net/mlx5/mlx5_testpmd.c
new file mode 100644
index 0000000000..122d6cbc4f
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.c
@@ -0,0 +1,184 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#include <rte_prefetch.h>
+#include <rte_common.h>
+#include <rte_branch_prediction.h>
+#include <rte_ether.h>
+#include <rte_alarm.h>
+#include <rte_pmd_mlx5.h>
+#include <rte_ethdev.h>
+#include "mlx5_testpmd.h"
+
+static uint8_t host_shaper_lwm_triggered[RTE_MAX_ETHPORTS];
+#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
+
+/**
+ * Disable the host shaper and re-arm LWM event.
+ *
+ * @param[in] args
+ *   uint32_t integer combining port_id and rxq_id.
+ */
+static void
+mlx5_test_host_shaper_disable(void *args)
+{
+	uint32_t port_rxq_id = (uint32_t)(uintptr_t)args;
+	uint16_t port_id = port_rxq_id & 0xffff;
+	uint16_t qid = (port_rxq_id >> 16) & 0xffff;
+	struct rte_eth_rxq_info qinfo;
+
+	printf("%s disable shaper\n", __func__);
+	if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
+		printf("rx_queue_info_get returns error\n");
+		return;
+	}
+	/* Rearm the LWM event. */
+	if (rte_eth_rx_lwm_set(port_id, qid, qinfo.lwm)) {
+		printf("config lwm returns error\n");
+		return;
+	}
+	/* Only disable the shaper when lwm_triggered is set. */
+	if (host_shaper_lwm_triggered[port_id] &&
+	    rte_pmd_mlx5_host_shaper_config(port_id, 0, 0))
+		printf("%s disable shaper returns error\n", __func__);
+}
+
+void
+mlx5_test_lwm_event_handler(uint16_t port_id, uint16_t rxq_id)
+{
+	uint32_t port_rxq_id = port_id | ((uint32_t)rxq_id << 16);
+
+	rte_eal_alarm_set(SHAPER_DISABLE_DELAY_US,
+			  mlx5_test_host_shaper_disable,
+			  (void *)(uintptr_t)port_rxq_id);
+	printf("%s port_id:%u rxq_id:%u\n", __func__, port_id, rxq_id);
+}
+
+/**
+ * Configure host shaper's lwm_triggered and current rate.
+ *
+ * @param[in] lwm_triggered
+ *   Disable/enable lwm_triggered.
+ * @param[in] rate
+ *   Configure current host shaper rate.
+ * @return
+ *   On success, returns 0.
+ *   On failure, returns < 0.
+ */
+static int
+mlx5_test_set_port_host_shaper(uint16_t port_id, uint16_t lwm_triggered, uint8_t rate)
+{
+	struct rte_eth_link link;
+	bool port_id_valid = false;
+	uint16_t pid;
+	int ret;
+
+	RTE_ETH_FOREACH_DEV(pid)
+		if (port_id == pid) {
+			port_id_valid = true;
+			break;
+		}
+	if (!port_id_valid)
+		return -EINVAL;
+	ret = rte_eth_link_get_nowait(port_id, &link);
+	if (ret < 0)
+		return ret;
+	host_shaper_lwm_triggered[port_id] = lwm_triggered ? 1 : 0;
+	if (!lwm_triggered) {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 0,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	} else {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 1,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED));
+	}
+	if (ret)
+		return ret;
+	ret = rte_pmd_mlx5_host_shaper_config(port_id, rate, 0);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+/* *** SET HOST_SHAPER FOR A PORT *** */
+struct cmd_port_host_shaper_result {
+	cmdline_fixed_string_t mlx5;
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t host_shaper;
+	cmdline_fixed_string_t lwm_triggered;
+	uint16_t fr;
+	cmdline_fixed_string_t rate;
+	uint8_t rate_num;
+};
+
+static void cmd_port_host_shaper_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_port_host_shaper_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->mlx5, "mlx5") == 0) &&
+	    (strcmp(res->set, "set") == 0) &&
+	    (strcmp(res->port, "port") == 0) &&
+	    (strcmp(res->host_shaper, "host_shaper") == 0) &&
+	    (strcmp(res->lwm_triggered, "lwm_triggered") == 0) &&
+	    (strcmp(res->rate, "rate") == 0))
+		ret = mlx5_test_set_port_host_shaper(res->port_num, res->fr,
+					   res->rate_num);
+	if (ret < 0)
+		printf("cmd_port_host_shaper error: (%s)\n", strerror(-ret));
+}
+
+cmdline_parse_token_string_t cmd_port_host_shaper_mlx5 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				mlx5, "mlx5");
+cmdline_parse_token_string_t cmd_port_host_shaper_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_port_host_shaper_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_port_host_shaper_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_host_shaper =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 host_shaper, "host_shaper");
+cmdline_parse_token_string_t cmd_port_host_shaper_lwm_triggered =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 lwm_triggered, "lwm_triggered");
+cmdline_parse_token_num_t cmd_port_host_shaper_fr =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      fr, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_rate =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 rate, "rate");
+cmdline_parse_token_num_t cmd_port_host_shaper_rate_num =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      rate_num, RTE_UINT8);
+cmdline_parse_inst_t mlx5_test_cmd_port_host_shaper = {
+	.f = cmd_port_host_shaper_parsed,
+	.data = (void *)0,
+	.help_str = "mlx5 set port <port_id> host_shaper lwm_triggered <0|1> "
+	"rate <rate_num>: Set HOST_SHAPER lwm_triggered and rate with port_id",
+	.tokens = {
+		(void *)&cmd_port_host_shaper_mlx5,
+		(void *)&cmd_port_host_shaper_set,
+		(void *)&cmd_port_host_shaper_port,
+		(void *)&cmd_port_host_shaper_portnum,
+		(void *)&cmd_port_host_shaper_host_shaper,
+		(void *)&cmd_port_host_shaper_lwm_triggered,
+		(void *)&cmd_port_host_shaper_fr,
+		(void *)&cmd_port_host_shaper_rate,
+		(void *)&cmd_port_host_shaper_rate_num,
+		NULL,
+	}
+};
diff --git a/drivers/net/mlx5/mlx5_testpmd.h b/drivers/net/mlx5/mlx5_testpmd.h
new file mode 100644
index 0000000000..50f3cf0bf9
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_TEST_H_
+#define RTE_PMD_MLX5_TEST_H_
+
+#include <cmdline_parse.h>
+#include <cmdline_parse_num.h>
+#include <cmdline_parse_string.h>
+
+/**
+ * RTE_ETH_EVENT_RX_LWM handler sample code.
+ * It is called in testpmd; the workflow here is to wait until the
+ * Rx queue is empty, then disable the host shaper.
+ *
+ * @param[in] port_id
+ *   Port identifier.
+ * @param[in] rxq_id
+ *   Rx queue identifier.
+ */
+void
+mlx5_test_lwm_event_handler(uint16_t port_id, uint16_t rxq_id);
+
+extern cmdline_parse_inst_t mlx5_test_cmd_port_host_shaper;
+#endif
-- 
2.27.0


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
                           ` (6 preceding siblings ...)
  2022-05-24 15:20         ` [PATCH v3 7/7] app/testpmd: add LWM and Host Shaper command Spike Du
@ 2022-05-24 15:59         ` Thomas Monjalon
  2022-05-24 19:00           ` Morten Brørup
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
  8 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-05-24 15:59 UTC (permalink / raw)
  To: Spike Du
  Cc: matan, viacheslavo, orika, dev, rasland, stephen,
	Morten Brørup, andrew.rybchenko, ferruh.yigit,
	david.marchand

+Cc people involved in previous versions

24/05/2022 17:20, Spike Du:
> LWM(limit watermark) is per RX queue attribute, when RX queue fullness reach the LWM limit, HW sends an event to dpdk application.
> Host shaper can configure shaper rate and lwm-triggered for a host port.
> The shaper limits the rate of traffic from host port to wire port.
> If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically when one of the host port's Rx queues receives LWM event.
> 
> These two features can combine to control traffic from host port to wire port.
> The work flow is configure LWM to RX queue and enable lwm-triggered flag in host shaper, after receiving LWM event, delay a while until RX queue is empty , then disable the shaper. We recycle this work flow to reduce RX queue drops.
> 
> Add new libethdev API to set LWM, add rte event RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle LWM event. For host shaper, because it doesn't align to existing DPDK framework and is specific to Nvidia NIC, use PMD private API.
> 
> For integration with testpmd, put the private cmdline function and LWM event handler in mlx5 PMD directory by adding a new file mlx5_test.c. Only add minimal code in testpmd to invoke interfaces from mlx5_test.c.
> 
> Spike Du (7):
>   net/mlx5: add LWM support for Rxq
>   common/mlx5: share interrupt management
>   ethdev: introduce Rx queue based limit watermark
>   net/mlx5: add LWM event handling support
>   net/mlx5: support Rx queue based limit watermark
>   net/mlx5: add private API to config host port shaper
>   app/testpmd: add LWM and Host Shaper command
> 
>  app/test-pmd/cmdline.c                       |  74 +++++
>  app/test-pmd/config.c                        |  21 ++
>  app/test-pmd/meson.build                     |   4 +
>  app/test-pmd/testpmd.c                       |  24 ++
>  app/test-pmd/testpmd.h                       |   1 +
>  doc/guides/nics/mlx5.rst                     |  84 ++++++
>  doc/guides/rel_notes/release_22_07.rst       |   2 +
>  drivers/common/mlx5/linux/meson.build        |  13 +
>  drivers/common/mlx5/linux/mlx5_common_os.c   | 131 +++++++++
>  drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
>  drivers/common/mlx5/mlx5_prm.h               |  26 ++
>  drivers/common/mlx5/version.map              |   2 +
>  drivers/common/mlx5/windows/mlx5_common_os.h |  24 ++
>  drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 -----
>  drivers/net/mlx5/linux/mlx5_os.c             | 132 ++-------
>  drivers/net/mlx5/linux/mlx5_socket.c         |  53 +---
>  drivers/net/mlx5/mlx5.c                      |  68 +++++
>  drivers/net/mlx5/mlx5.h                      |  12 +-
>  drivers/net/mlx5/mlx5_devx.c                 |  60 +++-
>  drivers/net/mlx5/mlx5_devx.h                 |   1 +
>  drivers/net/mlx5/mlx5_rx.c                   | 292 +++++++++++++++++++
>  drivers/net/mlx5/mlx5_rx.h                   |  13 +
>  drivers/net/mlx5/mlx5_testpmd.c              | 184 ++++++++++++
>  drivers/net/mlx5/mlx5_testpmd.h              |  27 ++
>  drivers/net/mlx5/mlx5_txpp.c                 |  28 +-
>  drivers/net/mlx5/rte_pmd_mlx5.h              |  30 ++
>  drivers/net/mlx5/version.map                 |   2 +
>  drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 --
>  drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 +--
>  lib/ethdev/ethdev_driver.h                   |  22 ++
>  lib/ethdev/rte_ethdev.c                      |  52 ++++
>  lib/ethdev/rte_ethdev.h                      |  71 +++++
>  lib/ethdev/version.map                       |   2 +
>  33 files changed, 1299 insertions(+), 308 deletions(-)
>  create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
>  create mode 100644 drivers/net/mlx5/mlx5_testpmd.h




^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-24 15:59         ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Thomas Monjalon
@ 2022-05-24 19:00           ` Morten Brørup
  2022-05-24 19:22             ` Thomas Monjalon
  2022-05-25 13:14             ` Spike Du
  0 siblings, 2 replies; 131+ messages in thread
From: Morten Brørup @ 2022-05-24 19:00 UTC (permalink / raw)
  To: Thomas Monjalon, Spike Du
  Cc: matan, viacheslavo, orika, dev, rasland, stephen,
	andrew.rybchenko, ferruh.yigit, david.marchand

> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Tuesday, 24 May 2022 17.59
> 
> +Cc people involved in previous versions
> 
> 24/05/2022 17:20, Spike Du:
> > LWM(limit watermark) is per RX queue attribute, when RX queue
> fullness reach the LWM limit, HW sends an event to dpdk application.
> > Host shaper can configure shaper rate and lwm-triggered for a host
> port.

Please ignore this comment, it is not important, but I had to get it out of my system: I assume that the "LWM" name is from the NIC datasheet; otherwise I would probably prefer something with "threshold"... LWM is easily confused with "low water mark", which is the opposite of what the LWM does. Names are always open for discussion, so I won't object to it.

> > The shaper limits the rate of traffic from host port to wire port.

From host to wire? It is RX, so you must mean from wire to host.

> > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> automatically when one of the host port's Rx queues receives LWM event.
> >
> > These two features can combine to control traffic from host port to
> wire port.

Again, you mean from wire to host?

> > The work flow is configure LWM to RX queue and enable lwm-triggered
> flag in host shaper, after receiving LWM event, delay a while until RX
> queue is empty , then disable the shaper. We recycle this work flow to
> reduce RX queue drops.

You delay while RX queue gets drained by some other threads, I assume.

Surely, the excess packets must be dropped somewhere, e.g. by the shaper?

> >
> > Add new libethdev API to set LWM, add rte event
> RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle LWM event.

Makes sense to make it public; could be usable for other purposes, similar to interrupt coalescing, as mentioned by Stephen.

> > For host shaper,
> because it doesn't align to existing DPDK framework and is specific to
> Nvidia NIC, use PMD private API.

Makes sense to keep it private.

> >
> > For integration with testpmd, put the private cmdline function and
> LWM event handler in mlx5 PMD directory by adding a new file
> mlx5_test.c. Only add minimal code in testpmd to invoke interfaces from
> mlx5_test.c.
> >
> > Spike Du (7):
> >   net/mlx5: add LWM support for Rxq
> >   common/mlx5: share interrupt management
> >   ethdev: introduce Rx queue based limit watermark
> >   net/mlx5: add LWM event handling support
> >   net/mlx5: support Rx queue based limit watermark
> >   net/mlx5: add private API to config host port shaper
> >   app/testpmd: add LWM and Host Shaper command
> >


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-24 19:00           ` Morten Brørup
@ 2022-05-24 19:22             ` Thomas Monjalon
  2022-05-25 14:11               ` Andrew Rybchenko
  2022-05-25 13:14             ` Spike Du
  1 sibling, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-05-24 19:22 UTC (permalink / raw)
  To: Spike Du, Morten Brørup
  Cc: matan, viacheslavo, orika, dev, rasland, stephen,
	andrew.rybchenko, ferruh.yigit, david.marchand

24/05/2022 21:00, Morten Brørup:
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > 24/05/2022 17:20, Spike Du:
> > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > fullness reach the LWM limit, HW sends an event to dpdk application.
> 
> Please ignore this comment, it is not important, but I had to get it out of my system: I assume that the "LWM" name is from the NIC datasheet; otherwise I would probably prefer something with "threshold"... LWM is easily confused with "low water mark", which is the opposite of what the LWM does. Names are always open for discussion, so I won't object to it.

Yes it is a threshold, and yes it is often called a watermark.
I think we can get more ideas and votes about the naming.
Please let's conclude on a short name which can be inserted
easily in function names.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-24  8:18                 ` Thomas Monjalon
@ 2022-05-25 12:59                   ` Andrew Rybchenko
  2022-05-25 13:58                     ` Thomas Monjalon
  0 siblings, 1 reply; 131+ messages in thread
From: Andrew Rybchenko @ 2022-05-25 12:59 UTC (permalink / raw)
  To: Thomas Monjalon, Stephen Hemminger, Spike Du
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit

On 5/24/22 11:18, Thomas Monjalon wrote:
> 24/05/2022 04:50, Spike Du:
>> From: Thomas Monjalon <thomas@monjalon.net>
>>> 23/05/2022 05:01, Spike Du:
>>>> From: Stephen Hemminger <stephen@networkplumber.org>
>>>>> Spike Du <spiked@nvidia.com> wrote:
>>>>>> --- a/lib/ethdev/rte_ethdev.h
>>>>>> +++ b/lib/ethdev/rte_ethdev.h
>>>>>> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>>>>>>         */
>>>>>>        union rte_eth_rxseg *rx_seg;
>>>>>>
>>>>>> -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
>>>>>> +     /**
>>>>>> +      * Per-queue Rx limit watermark defined as percentage of Rx queue
>>>>>> +      * size. If Rx queue receives traffic higher than this percentage,
>>>>>> +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
>>>>>> +      */
>>>>>> +     uint8_t lwm;
>>>>>> +
>>>>>> +     uint8_t reserved_bits[3];
>>>>>> +     uint32_t reserved_32s;
>>>>>> +     uint64_t reserved_64s;
>>>>>
>>>>> Ok but, this is an ABI risk about this because reserved stuff was
>>>>> never required before.
>>>
>>> An ABI compatibility issue would be for an application compiled with an old
>>> DPDK, and loading a new DPDK at runtime.
>>> Let's think what would happen in such a case.
>>>
>>>>> Whenever is a reserved field is introduced the code (in this case
>>>>> rte_ethdev_configure).
>>>
>>> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
>>> Then the library and drivers may interpret a wrong value.
>>>
>>>>> Best practice would have been to have the code require all reserved
>>>>> fields be
>>>>> 0 in earlier releases. In this case an application is like to define
>>>>> a watermark of zero; how will your code handle it.
>>>>
>>>> Having watermark of 0 is desired, which is the default. LWM of 0 means
>>>> the Rx Queue's watermark is not monitored, hence no LWM event is
>>> generated.
>>>
>>> The problem is to have a value not initialized.
>>> I think the best approach is to not expose the LWM value through this
>>> configuration structure.
>>> If the need is to get the current value, we should better add a field in the
>>> struct rte_eth_rxq_info.
>>
>> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
>> The Rx queue, if user follows these examples we should not have ABI issue.
>> Since many people are concerned about rxconf change, it's ok to remove the LWM
>> Field there.
>> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
>> We should have a way to get it.
> 
> Unfortunately we cannot rely on examples for ABI compatibility.
> My suggestion of moving the field in rte_eth_rxq_info
> is not obvious because it could change the size of the struct.
> But thanks to __rte_cache_min_aligned, it is OK.
> Running pahole on this struct shows we have 50 bytes free:
>          /* size: 128, cachelines: 2, members: 6 */
>          /* padding: 50 */
> 
> The other option would be to get the LWM value with a "get" function.
> 
> What others prefer?

If I'm not mistaken, the changeset breaks ABI in any case, since
it adds a new event and changes MAX. If so, I'd wait for the
next ABI-breaking release and not touch the reserved fields.
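
To make the reserved-field concern quoted above concrete, here is a minimal
sketch. The struct names are hypothetical stand-ins for the tail of
struct rte_eth_rxconf, not the real ethdev definitions. It shows that the
proposed layout keeps the size of the reserved area, and that the value a
newer library would read as lwm is simply whatever bytes an older
application happened to leave there.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in: tail of the config struct before the change. */
struct rxconf_tail_old {
	uint64_t reserved_64s[2];
};

/* Hypothetical stand-in: tail of the config struct after the change. */
struct rxconf_tail_new {
	uint8_t lwm;              /* watermark, percent of Rx queue size */
	uint8_t reserved_bits[3];
	uint32_t reserved_32s;
	uint64_t reserved_64s;
};

/* The reserved area keeps its size, so field offsets before it do not move. */
_Static_assert(sizeof(struct rxconf_tail_old) ==
	       sizeof(struct rxconf_tail_new),
	       "reserved area must keep its size");

/*
 * Read an old-layout tail through the new layout, as a newer library
 * would when called by an application built against the old header.
 */
static uint8_t
lwm_seen_by_new_library(const struct rxconf_tail_old *old_tail)
{
	struct rxconf_tail_new new_tail;

	memcpy(&new_tail, old_tail, sizeof(new_tail));
	return new_tail.lwm;
}
```

An application that zeroes the structure gets lwm == 0 (monitoring off), but
one that leaves the reserved bytes uninitialized gets an arbitrary
watermark, which is the risk discussed in the quoted text.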

^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-24 19:00           ` Morten Brørup
  2022-05-24 19:22             ` Thomas Monjalon
@ 2022-05-25 13:14             ` Spike Du
  2022-05-25 13:40               ` Morten Brørup
  1 sibling, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-05-25 13:14 UTC (permalink / raw)
  To: Morten Brørup, NBU-Contact-Thomas Monjalon (EXTERNAL)
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, dev, Raslan Darawsheh,
	stephen, andrew.rybchenko, ferruh.yigit, david.marchand



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, May 25, 2022 3:00 AM
> To: NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> Spike Du <spiked@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>; stephen@networkplumber.org;
> andrew.rybchenko@oktetlabs.ru; ferruh.yigit@amd.com;
> david.marchand@redhat.com
> Subject: RE: [PATCH v3 0/7] introduce per-queue limit watermark and host
> shaper
> 
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Tuesday, 24 May 2022 17.59
> >
> > +Cc people involved in previous versions
> >
> > 24/05/2022 17:20, Spike Du:
> > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > fullness reach the LWM limit, HW sends an event to dpdk application.
> > > Host shaper can configure shaper rate and lwm-triggered for a host
> > port.
> 
> Please ignore this comment, it is not important, but I had to get it out of my
> system: I assume that the "LWM" name is from the NIC datasheet; otherwise
> I would probably prefer something with "threshold"... LWM is easily
> confused with "low water mark", which is the opposite of what the LWM
> does. Names are always open for discussion, so I won't object to it.
> 
> > > The shaper limits the rate of traffic from host port to wire port.
> 
> From host to wire? It is RX, so you must mean from wire to host.

The host shaper is specific to Nvidia's BlueField 2 NIC. The NIC is inserted
in a server, which we call the host-system, and the NIC has an embedded Arm-system
which does the forwarding.
The traffic flows from host-system to wire like this:
the host-system generates traffic and sends it to the Arm-system, and the Arm-system sends it to the physical/wire port.
So the Rx happens between the host-system and the Arm-system, and the traffic is host to wire.
The shaper also works in a special way: you configure it on the Arm-system, but it takes effect
on the host-system's Tx side.

> 
> > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > automatically when one of the host port's Rx queues receives LWM event.
> > >
> > > These two features can combine to control traffic from host port to
> > wire port.
> 
> Again, you mean from wire to host?

Please see above.

> 
> > > The work flow is configure LWM to RX queue and enable lwm-triggered
> > flag in host shaper, after receiving LWM event, delay a while until RX
> > queue is empty , then disable the shaper. We recycle this work flow to
> > reduce RX queue drops.
> 
> You delay while RX queue gets drained by some other threads, I assume.

The PMD thread drains the Rx queue; it keeps receiving as normal, because the PMD
implementation uses the rte interrupt thread to handle the LWM event.

> 
> Surely, the excess packets must be dropped somewhere, e.g. by the shaper?
> 
> > >
> > > Add new libethdev API to set LWM, add rte event
> > RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle LWM event.
> 
> Makes sense to make it public; could be usable for other purposes, similar to
> interrupt coalescing, as mentioned by Stephen.
> 
> > > For host shaper,
> > because it doesn't align to existing DPDK framework and is specific to
> > Nvidia NIC, use PMD private API.
> 
> Makes sense to keep it private.
> 
> > >
> > > For integration with testpmd, put the private cmdline function and
> > LWM event handler in mlx5 PMD directory by adding a new file
> > mlx5_test.c. Only add minimal code in testpmd to invoke interfaces
> > from mlx5_test.c.
> > >
> > > Spike Du (7):
> > >   net/mlx5: add LWM support for Rxq
> > >   common/mlx5: share interrupt management
> > >   ethdev: introduce Rx queue based limit watermark
> > >   net/mlx5: add LWM event handling support
> > >   net/mlx5: support Rx queue based limit watermark
> > >   net/mlx5: add private API to config host port shaper
> > >   app/testpmd: add LWM and Host Shaper command
> > >


^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-25 13:14             ` Spike Du
@ 2022-05-25 13:40               ` Morten Brørup
  2022-05-25 13:59                 ` Spike Du
  0 siblings, 1 reply; 131+ messages in thread
From: Morten Brørup @ 2022-05-25 13:40 UTC (permalink / raw)
  To: Spike Du, NBU-Contact-Thomas Monjalon (EXTERNAL)
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, dev, Raslan Darawsheh,
	stephen, andrew.rybchenko, ferruh.yigit, david.marchand

> From: Spike Du [mailto:spiked@nvidia.com]
> Sent: Wednesday, 25 May 2022 15.15
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Wednesday, May 25, 2022 3:00 AM
> >
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Tuesday, 24 May 2022 17.59
> > >
> > > +Cc people involved in previous versions
> > >
> > > 24/05/2022 17:20, Spike Du:
> > > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > > fullness reach the LWM limit, HW sends an event to dpdk
> application.
> > > > Host shaper can configure shaper rate and lwm-triggered for a
> host
> > > port.
> >
> > Please ignore this comment, it is not important, but I had to get it
> out of my
> > system: I assume that the "LWM" name is from the NIC datasheet;
> otherwise
> > I would probably prefer something with "threshold"... LWM is easily
> > confused with "low water mark", which is the opposite of what the LWM
> > does. Names are always open for discussion, so I won't object to it.
> >
> > > > The shaper limits the rate of traffic from host port to wire
> port.
> >
> > From host to wire? It is RX, so you must mean from wire to host.
> 
> The host shaper is quite private to Nvidia's BlueField 2 NIC. The NIC
> is inserted
> In a server which we call it host-system, and the NIC has an embedded
> Arm-system
> Which does the forwarding.
> The traffic flows from host-system to wire like this:
> Host-system generates traffic, send it to Arm-system, Arm sends it to
> physical/wire port.
> So the RX happens between host-system and Arm-system, and the traffic
> is host to wire.
> The shaper also works in a special way: you configure it on Arm-system,
> but it takes effect
> On host-system's TX side.
> 
> >
> > > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > > automatically when one of the host port's Rx queues receives LWM
> event.
> > > >
> > > > These two features can combine to control traffic from host port
> to
> > > wire port.
> >
> > Again, you mean from wire to host?
> 
> Pls see above.
> 
> >
> > > > The work flow is configure LWM to RX queue and enable lwm-
> triggered
> > > flag in host shaper, after receiving LWM event, delay a while until
> RX
> > > queue is empty , then disable the shaper. We recycle this work flow
> to
> > > reduce RX queue drops.
> >
> > You delay while RX queue gets drained by some other threads, I
> assume.
> 
> The PMD thread drains the Rx queue, the PMD receiving  as normal, as
> the PMD
> Implementation uses rte interrupt thread to handle LWM event.
> 

Thank you for the explanation, Spike. It really clarifies a lot!

If this patch is intended for DPDK running on the host-system, then the LWM attribute is associated with a TX queue, not an RX queue. The packets are egressing from the host-system, so TX from the host-system's perspective.

Otherwise, if this patch is for DPDK running on the embedded ARM-system, it should be highlighted somewhere.

> >
> > Surely, the excess packets must be dropped somewhere, e.g. by the
> shaper?

I guess the shaper doesn't have to drop any packets, but the host-system will simply be unable to put more packets into the queue if it runs full.



^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-25 12:59                   ` Andrew Rybchenko
@ 2022-05-25 13:58                     ` Thomas Monjalon
  2022-05-25 14:23                       ` Andrew Rybchenko
  0 siblings, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-05-25 13:58 UTC (permalink / raw)
  To: Stephen Hemminger, Spike Du, Andrew Rybchenko
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit

25/05/2022 14:59, Andrew Rybchenko:
> On 5/24/22 11:18, Thomas Monjalon wrote:
> > 24/05/2022 04:50, Spike Du:
> >> From: Thomas Monjalon <thomas@monjalon.net>
> >>> 23/05/2022 05:01, Spike Du:
> >>>> From: Stephen Hemminger <stephen@networkplumber.org>
> >>>>> Spike Du <spiked@nvidia.com> wrote:
> >>>>>> --- a/lib/ethdev/rte_ethdev.h
> >>>>>> +++ b/lib/ethdev/rte_ethdev.h
> >>>>>> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> >>>>>>         */
> >>>>>>        union rte_eth_rxseg *rx_seg;
> >>>>>>
> >>>>>> -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
> >>>>>> +     /**
> >>>>>> +      * Per-queue Rx limit watermark defined as percentage of Rx queue
> >>>>>> +      * size. If Rx queue receives traffic higher than this percentage,
> >>>>>> +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
> >>>>>> +      */
> >>>>>> +     uint8_t lwm;
> >>>>>> +
> >>>>>> +     uint8_t reserved_bits[3];
> >>>>>> +     uint32_t reserved_32s;
> >>>>>> +     uint64_t reserved_64s;
> >>>>>
> >>>>> Ok but, this is an ABI risk about this because reserved stuff was
> >>>>> never required before.
> >>>
> >>> An ABI compatibility issue would be for an application compiled with an old
> >>> DPDK, and loading a new DPDK at runtime.
> >>> Let's think what would happen in such a case.
> >>>
> >>>>> Whenever is a reserved field is introduced the code (in this case
> >>>>> rte_ethdev_configure).
> >>>
> >>> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
> >>> Then the library and drivers may interpret a wrong value.
> >>>
> >>>>> Best practice would have been to have the code require all reserved
> >>>>> fields be
> >>>>> 0 in earlier releases. In this case an application is like to define
> >>>>> a watermark of zero; how will your code handle it.
> >>>>
> >>>> Having watermark of 0 is desired, which is the default. LWM of 0 means
> >>>> the Rx Queue's watermark is not monitored, hence no LWM event is
> >>> generated.
> >>>
> >>> The problem is to have a value not initialized.
> >>> I think the best approach is to not expose the LWM value through this
> >>> configuration structure.
> >>> If the need is to get the current value, we should better add a field in the
> >>> struct rte_eth_rxq_info.
> >>
> >> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
> >> The Rx queue, if user follows these examples we should not have ABI issue.
> >> Since many people are concerned about rxconf change, it's ok to remove the LWM
> >> Field there.
> >> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
> >> We should have a way to get it.
> > 
> > Unfortunately we cannot rely on examples for ABI compatibility.
> > My suggestion of moving the field in rte_eth_rxq_info
> > is not obvious because it could change the size of the struct.
> > But thanks to __rte_cache_min_aligned, it is OK.
> > Running pahole on this struct shows we have 50 bytes free:
> >          /* size: 128, cachelines: 2, members: 6 */
> >          /* padding: 50 */
> > 
> > The other option would be to get the LWM value with a "get" function.
> > 
> > What others prefer?
> 
> If I'm not mistaken the changeset breaks ABI in any case since
> it adds a new event and changes MAX.

I think we can consider it as not a breakage (a rule should be added).
Last time we had to update this enum, this was the conclusion:
from https://git.dpdk.org/dpdk/commit/?id=44bf3c796be3f
"
The new event type addition in the enum is flagged as an ABI breakage,
so an ignore rule is added for these reasons:
- It is not changing value of existing types (except MAX)
- The new value is not used by existing API if the event is not
  registered
In general, it is safe adding new ethdev event types at the end of the
enum, because of event callback registration mechanism.
"

> If so, I'd wait for the
> next ABI breaking release and do not touch reserved fields.

In any case, rte_eth_rxconf is not a good fit
because we have a separate function for configuration.
It should be either in rte_eth_rxq_info or a specific "get" function.
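
The reasoning in the quoted commit message can be illustrated with a small
self-contained sketch. Everything here (the enum, the demo_* helpers) is
hypothetical and only mimics the idea of a per-event callback table; it is
not the ethdev implementation. The point it shows: an event type appended
at the end is never delivered to an application that did not register for
it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event enum; names are illustrative, not the ethdev ones. */
enum demo_event {
	DEMO_EVENT_LINK = 0,
	DEMO_EVENT_RESET,
	DEMO_EVENT_RX_LWM, /* newly appended type */
	DEMO_EVENT_MAX     /* the only existing value that changes */
};

typedef void (*demo_cb_t)(enum demo_event ev, void *arg);

static demo_cb_t demo_cbs[DEMO_EVENT_MAX];
static void *demo_args[DEMO_EVENT_MAX];

/* Register a callback for one event type; returns 0 on success. */
static int
demo_register(enum demo_event ev, demo_cb_t cb, void *arg)
{
	if ((unsigned int)ev >= DEMO_EVENT_MAX || cb == NULL)
		return -1;
	demo_cbs[ev] = cb;
	demo_args[ev] = arg;
	return 0;
}

/* Deliver an event; returns 1 if a callback ran, 0 if none is registered. */
static int
demo_fire(enum demo_event ev)
{
	if ((unsigned int)ev >= DEMO_EVENT_MAX || demo_cbs[ev] == NULL)
		return 0;
	demo_cbs[ev](ev, demo_args[ev]);
	return 1;
}

/* Example callback: counts how many events were delivered. */
static void
demo_count_cb(enum demo_event ev, void *arg)
{
	(void)ev;
	++*(int *)arg;
}
```

An application that registers only for DEMO_EVENT_LINK keeps working
unchanged when DEMO_EVENT_RX_LWM is fired by a newer library.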



^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-25 13:40               ` Morten Brørup
@ 2022-05-25 13:59                 ` Spike Du
  2022-05-25 14:16                   ` Morten Brørup
  0 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-05-25 13:59 UTC (permalink / raw)
  To: Morten Brørup, NBU-Contact-Thomas Monjalon (EXTERNAL)
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, dev, Raslan Darawsheh,
	stephen, andrew.rybchenko, ferruh.yigit, david.marchand



> -----Original Message-----
> From: Morten Brørup <mb@smartsharesystems.com>
> Sent: Wednesday, May 25, 2022 9:40 PM
> To: Spike Du <spiked@nvidia.com>; NBU-Contact-Thomas Monjalon
> (EXTERNAL) <thomas@monjalon.net>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; dev@dpdk.org;
> Raslan Darawsheh <rasland@nvidia.com>; stephen@networkplumber.org;
> andrew.rybchenko@oktetlabs.ru; ferruh.yigit@amd.com;
> david.marchand@redhat.com
> Subject: RE: [PATCH v3 0/7] introduce per-queue limit watermark and host
> shaper
> 
> > From: Spike Du [mailto:spiked@nvidia.com]
> > Sent: Wednesday, 25 May 2022 15.15
> >
> > > From: Morten Brørup <mb@smartsharesystems.com>
> > > Sent: Wednesday, May 25, 2022 3:00 AM
> > >
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Tuesday, 24 May 2022 17.59
> > > >
> > > > +Cc people involved in previous versions
> > > >
> > > > 24/05/2022 17:20, Spike Du:
> > > > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > > > fullness reach the LWM limit, HW sends an event to dpdk
> > application.
> > > > > Host shaper can configure shaper rate and lwm-triggered for a
> > host
> > > > port.
> > >
> > > Please ignore this comment, it is not important, but I had to get it
> > out of my
> > > system: I assume that the "LWM" name is from the NIC datasheet;
> > otherwise
> > > I would probably prefer something with "threshold"... LWM is easily
> > > confused with "low water mark", which is the opposite of what the
> > > LWM does. Names are always open for discussion, so I won't object to it.
> > >
> > > > > The shaper limits the rate of traffic from host port to wire
> > port.
> > >
> > > From host to wire? It is RX, so you must mean from wire to host.
> >
> > The host shaper is quite private to Nvidia's BlueField 2 NIC. The NIC
> > is inserted In a server which we call it host-system, and the NIC has
> > an embedded Arm-system Which does the forwarding.
> > The traffic flows from host-system to wire like this:
> > Host-system generates traffic, send it to Arm-system, Arm sends it to
> > physical/wire port.
> > So the RX happens between host-system and Arm-system, and the traffic
> > is host to wire.
> > The shaper also works in a special way: you configure it on
> > Arm-system, but it takes effect On host-system's TX side.
> >
> > >
> > > > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > > > automatically when one of the host port's Rx queues receives LWM
> > event.
> > > > >
> > > > > These two features can combine to control traffic from host port
> > to
> > > > wire port.
> > >
> > > Again, you mean from wire to host?
> >
> > Pls see above.
> >
> > >
> > > > > The work flow is configure LWM to RX queue and enable lwm-
> > triggered
> > > > flag in host shaper, after receiving LWM event, delay a while
> > > > until
> > RX
> > > > queue is empty , then disable the shaper. We recycle this work
> > > > flow
> > to
> > > > reduce RX queue drops.
> > >
> > > You delay while RX queue gets drained by some other threads, I
> > assume.
> >
> > The PMD thread drains the Rx queue, the PMD receiving  as normal, as
> > the PMD Implementation uses rte interrupt thread to handle LWM event.
> >
> 
> Thank you for the explanation, Spike. It really clarifies a lot!
> 
> If this patch is intended for DPDK running on the host-system, then the LWM
> attribute is associated with a TX queue, not an RX queue. The packets are
> egressing from the host-system, so TX from the host-system's perspective.
> 
> Otherwise, if this patch is for DPDK running on the embedded ARM-system,
> it should be highlighted somewhere.

The host-shaper patch runs on the ARM-system; I think that patch already has some explanation in mlx5.rst.
The LWM patch is common and should work on any Rx queue (right now mlx5 doesn't support LWM on hairpin Rx queues or shared Rx queues).
On the ARM-system, we can use it to monitor traffic from the host (representor port) or from the wire (physical port).
LWM can also work on the host-system if DPDK is running there; for example, it can monitor traffic from the Arm-system to the host-system.

> 
> > >
> > > Surely, the excess packets must be dropped somewhere, e.g. by the
> > shaper?
> 
> I guess the shaper doesn't have to drop any packets, but the host-system will
> simply be unable to put more packets into the queue if it runs full.
> 

When an LWM event happens, the host shaper throttles traffic from the host-system to the Arm-system. Yes, the shaper doesn't drop packets.
Normally the shaper rate is small, and if the PMD thread on Arm keeps working, the Rx queue is dropless.
But if the PMD thread doesn't receive fast enough, or if the host-system sends a burst even with a small shaper rate, the Rx queue on Arm may still drop.
Anyway, even though drops can still happen sometimes, the cooperation of the host shaper and LWM greatly reduces Rx drops on Arm.
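
To make the interaction concrete, here is a rough, self-contained model of the loop described above. It is only a sketch: the names (simulate, sim_result) and the per-tick rates are made up for illustration and are not DPDK or mlx5 APIs, and real traffic is burstier than a fixed rate per tick.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Rx queue on Arm; not a DPDK or mlx5 API. */
struct sim_result {
	uint32_t drops;     /* packets dropped because the queue was full */
	uint32_t shaper_on; /* ticks spent with the shaper enabled */
};

static struct sim_result
simulate(uint32_t ticks, uint32_t qsize, uint32_t thresh_pct,
	 uint32_t tx_rate, uint32_t shaped_rate, uint32_t rx_rate,
	 bool use_shaper)
{
	struct sim_result res = {0, 0};
	uint32_t fill = 0;   /* descriptors currently in use */
	bool shaper = false; /* true while the host shaper is active */

	assert(qsize > 0 && thresh_pct <= 100);
	for (uint32_t t = 0; t < ticks; t++) {
		uint32_t in = shaper ? shaped_rate : tx_rate;
		uint32_t room = qsize - fill;

		if (shaper)
			res.shaper_on++;
		/* Host sends; anything that does not fit is dropped. */
		if (in > room) {
			res.drops += in - room;
			in = room;
		}
		fill += in;
		/* Fullness reached the threshold: event fires, shaper turns on. */
		if (use_shaper && thresh_pct != 0 && !shaper &&
		    fill * 100 >= qsize * thresh_pct)
			shaper = true;
		/* PMD thread drains up to rx_rate packets per tick. */
		fill -= (fill < rx_rate) ? fill : rx_rate;
		/* Application disables the shaper once the queue is empty. */
		if (shaper && fill == 0)
			shaper = false;
	}
	return res;
}
```

With a 256-entry queue, a 70% threshold, a sender at 64 packets per tick, a shaped rate of 8 and a receiver at 32, the unshaped run drops packets on almost every tick, while the shaped run keeps cycling between filling and draining.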


* Re: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-24 19:22             ` Thomas Monjalon
@ 2022-05-25 14:11               ` Andrew Rybchenko
  0 siblings, 0 replies; 131+ messages in thread
From: Andrew Rybchenko @ 2022-05-25 14:11 UTC (permalink / raw)
  To: Thomas Monjalon, Spike Du, Morten Brørup
  Cc: matan, viacheslavo, orika, dev, rasland, stephen, ferruh.yigit,
	david.marchand

On 5/24/22 22:22, Thomas Monjalon wrote:
> 24/05/2022 21:00, Morten Brørup:
>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>> 24/05/2022 17:20, Spike Du:
>>>> LWM(limit watermark) is per RX queue attribute, when RX queue
>>> fullness reach the LWM limit, HW sends an event to dpdk application.
>>
>> Please ignore this comment, it is not important, but I had to get it out of my system: I assume that the "LWM" name is from the NIC datasheet; otherwise I would probably prefer something with "threshold"... LWM is easily confused with "low water mark", which is the opposite of what the LWM does. Names are always open for discussion, so I won't object to it.
> 
> Yes it is a threshold, and yes it is often called a watermark.
> I think we can get more ideas and votes about the naming.
> Please let's conclude on a short name which can be inserted
> easily in function names.

As I understand it, it is an Rx queue fill (level) threshold.
"fill_thresh" or "flt" if the first one is too long.



* RE: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-25 13:59                 ` Spike Du
@ 2022-05-25 14:16                   ` Morten Brørup
  2022-05-25 14:30                     ` Andrew Rybchenko
  0 siblings, 1 reply; 131+ messages in thread
From: Morten Brørup @ 2022-05-25 14:16 UTC (permalink / raw)
  To: Spike Du, NBU-Contact-Thomas Monjalon (EXTERNAL)
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, dev, Raslan Darawsheh,
	stephen, andrew.rybchenko, ferruh.yigit, david.marchand

> From: Spike Du [mailto:spiked@nvidia.com]
> Sent: Wednesday, 25 May 2022 15.59
> 
> > From: Morten Brørup <mb@smartsharesystems.com>
> > Sent: Wednesday, May 25, 2022 9:40 PM
> >
> > > From: Spike Du [mailto:spiked@nvidia.com]
> > > Sent: Wednesday, 25 May 2022 15.15
> > >
> > > > From: Morten Brørup <mb@smartsharesystems.com>
> > > > Sent: Wednesday, May 25, 2022 3:00 AM
> > > >
> > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > Sent: Tuesday, 24 May 2022 17.59
> > > > >
> > > > > +Cc people involved in previous versions
> > > > >
> > > > > 24/05/2022 17:20, Spike Du:
> > > > > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > > > > fullness reach the LWM limit, HW sends an event to dpdk
> > > application.
> > > > > > Host shaper can configure shaper rate and lwm-triggered for a
> > > host
> > > > > port.
> > > >
> > > >
> > > > > > The shaper limits the rate of traffic from host port to wire
> > > port.
> > > >
> > > > From host to wire? It is RX, so you must mean from wire to host.
> > >
> > > The host shaper is quite private to Nvidia's BlueField 2 NIC. The
> NIC
> > > is inserted In a server which we call it host-system, and the NIC
> has
> > > an embedded Arm-system Which does the forwarding.
> > > The traffic flows from host-system to wire like this:
> > > Host-system generates traffic, send it to Arm-system, Arm sends it
> to
> > > physical/wire port.
> > > So the RX happens between host-system and Arm-system, and the
> traffic
> > > is host to wire.
> > > The shaper also works in a special way: you configure it on
> > > Arm-system, but it takes effect On host-sysmem's TX side.
> > >
> > > >
> > > > > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > > > > automatically when one of the host port's Rx queues receives
> LWM
> > > event.
> > > > > >
> > > > > > These two features can combine to control traffic from host
> port
> > > to
> > > > > wire port.
> > > >
> > > > Again, you mean from wire to host?
> > >
> > > Pls see above.
> > >
> > > >
> > > > > > The work flow is configure LWM to RX queue and enable lwm-
> > > triggered
> > > > > flag in host shaper, after receiving LWM event, delay a while
> > > > > until
> > > RX
> > > > > queue is empty , then disable the shaper. We recycle this work
> > > > > flow
> > > to
> > > > > reduce RX queue drops.
> > > >
> > > > You delay while RX queue gets drained by some other threads, I
> > > assume.
> > >
> > > The PMD thread drains the Rx queue, the PMD receiving  as normal,
> as
> > > the PMD Implementation uses rte interrupt thread to handle LWM
> event.
> > >
> >
> > Thank you for the explanation, Spike. It really clarifies a lot!
> >
> > If this patch is intended for DPDK running on the host-system, then
> the LWM
> > attribute is associated with a TX queue, not an RX queue. The packets
> are
> > egressing from the host-system, so TX from the host-system's
> perspective.
> >
> > Otherwise, if this patch is for DPDK running on the embedded ARM-
> system,
> > it should be highlighted somewhere.
> 
> The host-shaper patch is running on ARM-system, I think in that patch I
> have some explanation in mlx5.rst.
> The LWM patch is common and should work on any Rx queue(right now mlx5
> doesn't support Hairpin Rx queue and shared Rx queue).
> On ARM-system, we can use it to monitor traffic from host(representor
> port) or from wire(physical port).
> LWM can also work on host-system if there is DPDK running, for example
> it can monitor traffic from Arm-system to host-system.

OK. Then I get it! I was reading the patch description wearing my host-system glasses, and thus got very confused. :-)

> 
> >
> > > >
> > > > Surely, the excess packets must be dropped somewhere, e.g. by the
> > > shaper?
> >
> > I guess the shaper doesn't have to drop any packets, but the host-
> system will
> > simply be unable to put more packets into the queue if it runs full.
> >
> 
> When LWM event happens, the host-shaper throttles traffic from host-
> system to Arm-system. Yes, the shaper doesn't drop pkts.
> Normally the shaper is small and if PMD thread on Arm keeps working, Rx
> queue is dropless.
> But if PMD thread doesn't receive fast enough, or even with a small
> shaper but host-system is sending some burst,  Rx queue may still drop
> on Arm.
> Anyway even sometimes drop still happens, the cooperation of host-
> shaper and LWM greatly reduce the Rx drop on Arm.

Thanks for elaborating. And yes, shapers are excellent for many scenarios.



* Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
  2022-05-25 13:58                     ` Thomas Monjalon
@ 2022-05-25 14:23                       ` Andrew Rybchenko
  0 siblings, 0 replies; 131+ messages in thread
From: Andrew Rybchenko @ 2022-05-25 14:23 UTC (permalink / raw)
  To: Thomas Monjalon, Stephen Hemminger, Spike Du
  Cc: dev, Matan Azrad, Slava Ovsiienko, Ori Kam, Raslan Darawsheh,
	ferruh.yigit

On 5/25/22 16:58, Thomas Monjalon wrote:
> 25/05/2022 14:59, Andrew Rybchenko:
>> On 5/24/22 11:18, Thomas Monjalon wrote:
>>> 24/05/2022 04:50, Spike Du:
>>>> From: Thomas Monjalon <thomas@monjalon.net>
>>>>> 23/05/2022 05:01, Spike Du:
>>>>>> From: Stephen Hemminger <stephen@networkplumber.org>
>>>>>>> Spike Du <spiked@nvidia.com> wrote:
>>>>>>>> --- a/lib/ethdev/rte_ethdev.h
>>>>>>>> +++ b/lib/ethdev/rte_ethdev.h
>>>>>>>> @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
>>>>>>>>          */
>>>>>>>>         union rte_eth_rxseg *rx_seg;
>>>>>>>>
>>>>>>>> -     uint64_t reserved_64s[2]; /**< Reserved for future fields */
>>>>>>>> +     /**
>>>>>>>> +      * Per-queue Rx limit watermark defined as percentage of Rx queue
>>>>>>>> +      * size. If Rx queue receives traffic higher than this percentage,
>>>>>>>> +      * the event RTE_ETH_EVENT_RX_LWM is triggered.
>>>>>>>> +      */
>>>>>>>> +     uint8_t lwm;
>>>>>>>> +
>>>>>>>> +     uint8_t reserved_bits[3];
>>>>>>>> +     uint32_t reserved_32s;
>>>>>>>> +     uint64_t reserved_64s;
>>>>>>>
>>>>>>> Ok but, this is an ABI risk about this because reserved stuff was
>>>>>>> never required before.
>>>>>
>>>>> An ABI compatibility issue would be for an application compiled with an old
>>>>> DPDK, and loading a new DPDK at runtime.
>>>>> Let's think what would happen in such a case.
>>>>>
>>>>>>> Whenever is a reserved field is introduced the code (in this case
>>>>>>> rte_ethdev_configure).
>>>>>
>>>>> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
>>>>> Then the library and drivers may interpret a wrong value.
>>>>>
>>>>>>> Best practice would have been to have the code require all reserved
>>>>>>> fields be
>>>>>>> 0 in earlier releases. In this case an application is like to define
>>>>>>> a watermark of zero; how will your code handle it.
>>>>>>
>>>>>> Having watermark of 0 is desired, which is the default. LWM of 0 means
>>>>>> the Rx Queue's watermark is not monitored, hence no LWM event is
>>>>> generated.
>>>>>
>>>>> The problem is to have a value not initialized.
>>>>> I think the best approach is to not expose the LWM value through this
>>>>> configuration structure.
>>>>> If the need is to get the current value, we should better add a field in the
>>>>> struct rte_eth_rxq_info.
>>>>
>>>> At least from all the dpdk app/example code, rxconf is initialized to 0 then setup
>>>> The Rx queue, if user follows these examples we should not have ABI issue.
>>>> Since many people are concerned about rxconf change, it's ok to remove the LWM
>>>> Field there.
>>>> Yes, I think we can add lwm into rte_eth_rxq_info. If we can set Rx queue's attribute,
>>>> We should have a way to get it.
>>>
>>> Unfortunately we cannot rely on examples for ABI compatibility.
>>> My suggestion of moving the field in rte_eth_rxq_info
>>> is not obvious because it could change the size of the struct.
>>> But thanks to __rte_cache_min_aligned, it is OK.
>>> Running pahole on this struct shows we have 50 bytes free:
>>>           /* size: 128, cachelines: 2, members: 6 */
>>>           /* padding: 50 */
>>>
>>> The other option would be to get the LWM value with a "get" function.
>>>
>>> What others prefer?
>>
>> If I'm not mistaken the changeset breaks ABI in any case since
>> it adds a new event and changes MAX.
> 
> I think we can consider it as not a breakage (a rule should be added).
> Last time we had to update this enum, this was the conclusion:
> from https://git.dpdk.org/dpdk/commit/?id=44bf3c796be3f
> "
> The new event type addition in the enum is flagged as an ABI breakage,
> so an ignore rule is added for these reasons:
> - It is not changing value of existing types (except MAX)
> - The new value is not used by existing API if the event is not
>    registered
> In general, it is safe adding new ethdev event types at the end of the
> enum, because of event callback registration mechanism.
> "

I see. Makes sense. Thanks for the information.

>> If so, I'd wait for the
>> next ABI breaking release and do not touch reserved fields.
> 
> In any case, rte_eth_rxconf is not a good fit
> because we have a separate function for configuration.

Yes, it is better to avoid two ways to configure the same
thing.

> It should be either in rte_eth_rxq_info or a specific "get" function.

I see no point in introducing a specific get function for a single
value. I think that rte_eth_rxq_info is the right way to get the
current value.
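
For the record, the concern about reusing reserved fields can be illustrated with a small standalone sketch. The two structs below are hypothetical stand-ins for only the reserved tail of rte_eth_rxconf before and after the quoted change (they are not the real definitions), and lwm_seen_by_new_library is an invented helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the reserved tail before the change. */
struct rxconf_tail_old {
	uint64_t reserved_64s[2];
};

/* Hypothetical stand-in for the same bytes after the change. */
struct rxconf_tail_new {
	uint8_t lwm;
	uint8_t reserved_bits[3];
	uint32_t reserved_32s;
	uint64_t reserved_64s;
};

/* The size is unchanged, so the struct layout itself stays compatible. */
static_assert(sizeof(struct rxconf_tail_old) ==
	      sizeof(struct rxconf_tail_new), "size must not change");
static_assert(offsetof(struct rxconf_tail_new, lwm) == 0,
	      "lwm overlays the first reserved byte");

/*
 * What a new library would read as lwm when an application built
 * against the old header passes its structure.
 */
static uint8_t
lwm_seen_by_new_library(const struct rxconf_tail_old *app_conf)
{
	struct rxconf_tail_new lib_view;

	memcpy(&lib_view, app_conf, sizeof(lib_view));
	return lib_view.lwm;
}
```

An application that zeroes the structure gets lwm == 0, but one that leaves the reserved bytes uninitialized hands the library whatever happened to be there, which is why examples zeroing rxconf are not enough to rely on.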


* Re: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper
  2022-05-25 14:16                   ` Morten Brørup
@ 2022-05-25 14:30                     ` Andrew Rybchenko
  0 siblings, 0 replies; 131+ messages in thread
From: Andrew Rybchenko @ 2022-05-25 14:30 UTC (permalink / raw)
  To: Morten Brørup, Spike Du, NBU-Contact-Thomas Monjalon (EXTERNAL)
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, dev, Raslan Darawsheh,
	stephen, ferruh.yigit, david.marchand

On 5/25/22 17:16, Morten Brørup wrote:
>> From: Spike Du [mailto:spiked@nvidia.com]
>> Sent: Wednesday, 25 May 2022 15.59
>>
>>> From: Morten Brørup <mb@smartsharesystems.com>
>>> Sent: Wednesday, May 25, 2022 9:40 PM
>>>
>>>> From: Spike Du [mailto:spiked@nvidia.com]
>>>> Sent: Wednesday, 25 May 2022 15.15
>>>>
>>>>> From: Morten Brørup <mb@smartsharesystems.com>
>>>>> Sent: Wednesday, May 25, 2022 3:00 AM
>>>>>
>>>>>> From: Thomas Monjalon [mailto:thomas@monjalon.net]
>>>>>> Sent: Tuesday, 24 May 2022 17.59
>>>>>>
>>>>>> +Cc people involved in previous versions
>>>>>>
>>>>>> 24/05/2022 17:20, Spike Du:
>>>>>>> LWM(limit watermark) is per RX queue attribute, when RX queue
>>>>>> fullness reach the LWM limit, HW sends an event to dpdk
>>>> application.
>>>>>>> Host shaper can configure shaper rate and lwm-triggered for a
>>>> host
>>>>>> port.
>>>>>
>>>>>
>>>>>>> The shaper limits the rate of traffic from host port to wire
>>>> port.
>>>>>
>>>>>  From host to wire? It is RX, so you must mean from wire to host.
>>>>
>>>> The host shaper is quite private to Nvidia's BlueField 2 NIC. The
>> NIC
>>>> is inserted In a server which we call it host-system, and the NIC
>> has
>>>> an embedded Arm-system Which does the forwarding.
>>>> The traffic flows from host-system to wire like this:
>>>> Host-system generates traffic, send it to Arm-system, Arm sends it
>> to
>>>> physical/wire port.
>>>> So the RX happens between host-system and Arm-system, and the
>> traffic
>>>> is host to wire.
>>>> The shaper also works in a special way: you configure it on
>>>> Arm-system, but it takes effect On host-sysmem's TX side.
>>>>
>>>>>
>>>>>>> If lwm-triggered is enabled, a 100Mbps shaper is enabled
>>>>>> automatically when one of the host port's Rx queues receives
>> LWM
>>>> event.
>>>>>>>
>>>>>>> These two features can combine to control traffic from host
>> port
>>>> to
>>>>>> wire port.
>>>>>
>>>>> Again, you mean from wire to host?
>>>>
>>>> Pls see above.
>>>>
>>>>>
>>>>>>> The work flow is configure LWM to RX queue and enable lwm-
>>>> triggered
>>>>>> flag in host shaper, after receiving LWM event, delay a while
>>>>>> until
>>>> RX
>>>>>> queue is empty , then disable the shaper. We recycle this work
>>>>>> flow
>>>> to
>>>>>> reduce RX queue drops.
>>>>>
>>>>> You delay while RX queue gets drained by some other threads, I
>>>> assume.
>>>>
>>>> The PMD thread drains the Rx queue, the PMD receiving  as normal,
>> as
>>>> the PMD Implementation uses rte interrupt thread to handle LWM
>> event.
>>>>
>>>
>>> Thank you for the explanation, Spike. It really clarifies a lot!
>>>
>>> If this patch is intended for DPDK running on the host-system, then
>> the LWM
>>> attribute is associated with a TX queue, not an RX queue. The packets
>> are
>>> egressing from the host-system, so TX from the host-system's
>> perspective.
>>>
>>> Otherwise, if this patch is for DPDK running on the embedded ARM-
>> system,
>>> it should be highlighted somewhere.
>>
>> The host-shaper patch is running on ARM-system, I think in that patch I
>> have some explanation in mlx5.rst.
>> The LWM patch is common and should work on any Rx queue(right now mlx5
>> doesn't support Hairpin Rx queue and shared Rx queue).
>> On ARM-system, we can use it to monitor traffic from host(representor
>> port) or from wire(physical port).
>> LWM can also work on host-system if there is DPDK running, for example
>> it can monitor traffic from Arm-system to host-system.
> 
> OK. Then I get it! I was reading the patch description wearing my host-system glasses, and thus got very confused. :-)

The description in the cover letter was very misleading for me as
well. It is not a problem right now after the long detailed
explanations. Hopefully there is no such problem in the suggested
ethdev documentation. I'll reread it carefully before applying
when the time comes.

> 
>>
>>>
>>>>>
>>>>> Surely, the excess packets must be dropped somewhere, e.g. by the
>>>> shaper?
>>>
>>> I guess the shaper doesn't have to drop any packets, but the host-
>> system will
>>> simply be unable to put more packets into the queue if it runs full.
>>>
>>
>> When LWM event happens, the host-shaper throttles traffic from host-
>> system to Arm-system. Yes, the shaper doesn't drop pkts.
>> Normally the shaper is small and if PMD thread on Arm keeps working, Rx
>> queue is dropless.
>> But if PMD thread doesn't receive fast enough, or even with a small
>> shaper but host-system is sending some burst,  Rx queue may still drop
>> on Arm.
>> Anyway even sometimes drop still happens, the cooperation of host-
>> shaper and LWM greatly reduce the Rx drop on Arm.
> 
> Thanks for elaborating. And yes, shapers are excellent for many scenarios.
> 



* [PATCH v4 0/7] introduce per-queue fill threshold and host shaper
  2022-05-24 15:20       ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Spike Du
                           ` (7 preceding siblings ...)
  2022-05-24 15:59         ` [PATCH v3 0/7] introduce per-queue limit watermark and host shaper Thomas Monjalon
@ 2022-06-03 12:48         ` Spike Du
  2022-06-03 12:48           ` [PATCH v4 1/7] net/mlx5: add LWM support for Rxq Spike Du
                             ` (7 more replies)
  8 siblings, 8 replies; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Fill threshold is a per-Rx-queue attribute: when the Rx queue fullness reaches the fill threshold, HW sends an event to the application.
The host shaper can configure the shaper rate and the fill_thresh-triggered flag for a host port.
The shaper limits the rate of traffic from the host port to the embedded ARM Rx port on the Nvidia BlueField 2 NIC.
If fill_thresh-triggered is enabled, a 100Mbps shaper is enabled automatically when one of the host port's Rx queues receives a fill threshold event.

These two features can be combined to control traffic from the host port to the wire port on the BlueField 2 NIC.
The traffic flows from the host to the embedded ARM, then to the physical port.
The workflow, on the ARM system, is: configure the fill threshold on the Rx queue and enable the fill_thresh-triggered flag in the host shaper; after receiving a fill threshold event, wait until the Rx queue is empty, then disable the shaper. Repeating this workflow reduces Rx queue drops on the ARM system.

Add a new ethdev API to set the fill threshold, and add the event RTE_ETH_EVENT_RX_FILL_THRESH to handle the fill threshold event. Because the host shaper doesn't fit the existing DPDK framework and is specific to the Nvidia NIC, it uses a PMD private API.

For integration with testpmd, put the private cmdline function and the fill threshold event handler in the mlx5 PMD directory by adding a new file, mlx5_testpmd.c. Follow David Marchand's driver-specific commands framework to add mlx5-specific commands.

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based fill threshold
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based fill threshold
  net/mlx5: add private API to config host port shaper
  app/testpmd: add Host Shaper command

 app/test-pmd/cmdline.c                       |  68 +++++++
 app/test-pmd/config.c                        |  21 ++
 app/test-pmd/testpmd.c                       |  24 +++
 app/test-pmd/testpmd.h                       |   2 +
 doc/guides/nics/mlx5.rst                     |  93 +++++++++
 doc/guides/rel_notes/release_22_07.rst       |   2 +
 drivers/common/mlx5/linux/meson.build        |  13 ++
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h               |  26 +++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 -------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 +++---------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +----
 drivers/net/mlx5/meson.build                 |   4 +
 drivers/net/mlx5/mlx5.c                      |  68 +++++++
 drivers/net/mlx5/mlx5.h                      |  12 +-
 drivers/net/mlx5/mlx5_devx.c                 |  60 +++++-
 drivers/net/mlx5/mlx5_devx.h                 |   1 +
 drivers/net/mlx5/mlx5_rx.c                   | 292 +++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h                   |  13 ++
 drivers/net/mlx5/mlx5_testpmd.c              | 201 ++++++++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h              |  26 +++
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h              |  30 +++
 drivers/net/mlx5/version.map                 |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 +----
 lib/ethdev/ethdev_driver.h                   |  22 ++
 lib/ethdev/rte_ethdev.c                      |  52 +++++
 lib/ethdev/rte_ethdev.h                      |  72 +++++++
 lib/ethdev/version.map                       |   2 +
 33 files changed, 1320 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1



* [PATCH v4 1/7] net/mlx5: add LWM support for Rxq
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
@ 2022-06-03 12:48           ` Spike Du
  2022-06-03 12:48           ` [PATCH v4 2/7] common/mlx5: share interrupt management Spike Du
                             ` (6 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Shahaf Shuler
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Add an lwm (Limit WaterMark) field to the Rxq object; it indicates the percentage
of the RX queue size at which HW raises an LWM event to the user.
Allow LWM setting in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
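Note for readers: the switch in mlx5_devx_modify_rq() below can be summarized by the following standalone sketch. All names here (rq_state, rq_mod, rq_modify, rq_modify_prepare) are hypothetical and chosen only for illustration; they are not the driver's types. The point it shows is that RST2RDY programs the LWM only when it is non-zero, while RDY2RDY always programs it, so a value of 0 can be used to turn monitoring off at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names mirroring the queue states and modify types. */
enum rq_state { RQ_RST, RQ_RDY, RQ_ERR };
enum rq_mod { MOD_RST2RDY, MOD_RDY2ERR, MOD_RDY2RST, MOD_RDY2RDY };

struct rq_modify {
	enum rq_state cur;  /* state the queue is expected to be in */
	enum rq_state next; /* state requested by the command */
	bool set_lwm;       /* whether the LWM field is part of the command */
	uint16_t lwm;       /* LWM value to program when set_lwm is true */
};

static struct rq_modify
rq_modify_prepare(enum rq_mod type, uint16_t lwm)
{
	struct rq_modify m = { RQ_RST, RQ_RST, false, 0 };

	switch (type) {
	case MOD_RST2RDY:
		m.cur = RQ_RST;
		m.next = RQ_RDY;
		/* Only program the LWM at start-up if one was configured. */
		if (lwm != 0) {
			m.set_lwm = true;
			m.lwm = lwm;
		}
		break;
	case MOD_RDY2ERR:
		m.cur = RQ_RDY;
		m.next = RQ_ERR;
		break;
	case MOD_RDY2RST:
		m.cur = RQ_RDY;
		m.next = RQ_RST;
		break;
	case MOD_RDY2RDY:
		/* State is unchanged; the command exists to update the LWM. */
		m.cur = RQ_RDY;
		m.next = RQ_RDY;
		m.set_lwm = true;
		m.lwm = lwm;
		break;
	}
	return m;
}
```
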
 drivers/net/mlx5/mlx5.h      |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 ++++++++++++-
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee..305edff 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
 	MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
 	MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
 	MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+	MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f94..c918a50 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
 	case MLX5_RXQ_MOD_RST2RDY:
 		rq_attr.rq_state = MLX5_RQC_STATE_RST;
 		rq_attr.state = MLX5_RQC_STATE_RDY;
+		if (rxq->lwm) {
+			rq_attr.modify_bitmask |=
+				MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+			rq_attr.lwm = rxq->lwm;
+		}
 		break;
 	case MLX5_RXQ_MOD_RDY2ERR:
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
 		rq_attr.state = MLX5_RQC_STATE_RST;
 		break;
+	case MLX5_RXQ_MOD_RDY2RDY:
+		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+		rq_attr.state = MLX5_RQC_STATE_RDY;
+		rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+		rq_attr.lwm = rxq->lwm;
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 			 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
+	uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1



* [PATCH v4 2/7] common/mlx5: share interrupt management
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
  2022-06-03 12:48           ` [PATCH v4 1/7] net/mlx5: add LWM support for Rxq Spike Du
@ 2022-06-03 12:48           ` Spike Du
  2022-06-03 14:30             ` Ray Kinsella
  2022-06-03 12:48           ` [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold Spike Du
                             ` (5 subsequent siblings)
  7 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Shahaf Shuler, Ray Kinsella,
	Neil Horman
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this, and replace all the related PMD code with this
API.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
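Note for readers: the new helper below optionally switches the fd to non-blocking mode with fcntl(). As a standalone illustration of that step only, here is a sketch that checks the result of both fcntl() calls. The function name fd_set_nonblock is hypothetical and is not part of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch.
 * Returns 0 on success, -1 on failure with errno set by fcntl().
 */
static int
fd_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -1;
	/* Nothing to do if the descriptor is already non-blocking. */
	if (flags & O_NONBLOCK)
		return 0;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}
```
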
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 --------------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++++---------------------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 ++---------
 drivers/net/mlx5/mlx5.h                      |   2 -
 drivers/net/mlx5/mlx5_txpp.c                 |  28 ++----
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 -----
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 ++--------
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include <dirent.h>
 #include <net/if.h>
+#include <fcntl.h>
 
 #include <rte_errno.h>
 #include <rte_string_fns.h>
@@ -964,3 +965,133 @@
 		claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
 	memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	struct rte_intr_handle *tmp_intr_handle;
+	int ret, flags;
+
+	tmp_intr_handle = rte_intr_instance_alloc(mode);
+	if (!tmp_intr_handle) {
+		rte_errno = ENOMEM;
+		goto err;
+	}
+	if (set_fd_nonblock) {
+		flags = fcntl(fd, F_GETFL);
+		ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+		if (ret) {
+			rte_errno = errno;
+			goto err;
+		}
+	}
+	ret = rte_intr_fd_set(tmp_intr_handle, fd);
+	if (ret)
+		goto err;
+	ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+	if (ret)
+		goto err;
+	ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+	if (ret) {
+		rte_errno = -ret;
+		goto err;
+	}
+	return tmp_intr_handle;
+err:
+	if (tmp_intr_handle)
+		rte_intr_instance_free(tmp_intr_handle);
+	return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+			      rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+	uint64_t twait = 0;
+	uint64_t start = 0;
+
+	do {
+		int ret;
+
+		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+		if (ret >= 0)
+			return;
+		if (ret != -EAGAIN) {
+			DRV_LOG(INFO, "failed to unregister interrupt"
+				      " handler (error: %d)", ret);
+			MLX5_ASSERT(false);
+			return;
+		}
+		if (twait) {
+			struct timespec onems;
+
+			/* Wait one millisecond and try again. */
+			onems.tv_sec = 0;
+			onems.tv_nsec = NS_PER_S / MS_PER_S;
+			nanosleep(&onems, 0);
+			/* Check whether one second elapsed. */
+			if ((rte_get_timer_cycles() - start) <= twait)
+				continue;
+		} else {
+			/*
+			 * We get the amount of timer ticks for one second.
+			 * If this amount elapsed it means we spent one
+			 * second in waiting. This branch is executed once
+			 * on first iteration.
+			 */
+			twait = rte_get_timer_hz();
+			MLX5_ASSERT(twait);
+		}
+		/*
+		 * Timeout elapsed, show message (once a second) and retry.
+		 * We have no other acceptable option here, if we ignore
+		 * the unregistering return code the handler will not
+		 * be unregistered, fd will be closed and we may get the
+		 * crash. Hanging and messaging in the loop seems not to be
+		 * the worst choice.
+		 */
+		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+		start = rte_get_timer_cycles();
+	} while (true);
+}
+
+/**
+ * Rte_intr_handle destroy helper.
+ *
+ * @param[in] intr_handle
+ *   Rte_intr_handle to destroy.
+ * @param[in] cb
+ *   Callback which is registered to intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ */
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	if (rte_intr_fd_get(intr_handle) >= 0)
+		mlx5_intr_callback_unregister(intr_handle, cb, cb_arg);
+	rte_intr_instance_free(intr_handle);
+}
diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index 27f1192..479bb3c 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -15,6 +15,7 @@
 #include <rte_log.h>
 #include <rte_kvargs.h>
 #include <rte_devargs.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -299,4 +300,14 @@
 int
 mlx5_get_device_guid(const struct rte_pci_addr *dev, uint8_t *guid, size_t len);
 
+__rte_internal
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg);
+
+__rte_internal
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg);
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a23a30a..413dec1 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -153,5 +153,7 @@ INTERNAL {
 	mlx5_mr_mempool2mr_bh;
 	mlx5_mr_mempool_populate_cache;
 
+	mlx5_os_interrupt_handler_create; # WINDOWS_NO_EXPORT
+	mlx5_os_interrupt_handler_destroy; # WINDOWS_NO_EXPORT
 	local: *;
 };
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index ee7973f..e9e9108 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -9,6 +9,7 @@
 #include <sys/types.h>
 
 #include <rte_errno.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -253,4 +254,27 @@
 __rte_internal
 int mlx5_os_umem_dereg(void *pumem);
 
+static inline struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)mode;
+	(void)set_fd_nonblock;
+	(void)fd;
+	(void)cb;
+	(void)cb_arg;
+	rte_errno = ENOTSUP;
+	return NULL;
+}
+
+static inline void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)intr_handle;
+	(void)cb;
+	(void)cb_arg;
+}
+
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 8fe73f1..a276b2b 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -881,77 +881,6 @@ struct ethtool_link_settings {
 	}
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	/*
-	 * Try to reduce timeout management overhead by not calling
-	 * the timer related routines on the first iteration. If the
-	 * unregistering succeeds on first call there will be no
-	 * timer calls at all.
-	 */
-	uint64_t twait = 0;
-	uint64_t start = 0;
-
-	do {
-		int ret;
-
-		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
-		if (ret >= 0)
-			return;
-		if (ret != -EAGAIN) {
-			DRV_LOG(INFO, "failed to unregister interrupt"
-				      " handler (error: %d)", ret);
-			MLX5_ASSERT(false);
-			return;
-		}
-		if (twait) {
-			struct timespec onems;
-
-			/* Wait one millisecond and try again. */
-			onems.tv_sec = 0;
-			onems.tv_nsec = NS_PER_S / MS_PER_S;
-			nanosleep(&onems, 0);
-			/* Check whether one second elapsed. */
-			if ((rte_get_timer_cycles() - start) <= twait)
-				continue;
-		} else {
-			/*
-			 * We get the amount of timer ticks for one second.
-			 * If this amount elapsed it means we spent one
-			 * second in waiting. This branch is executed once
-			 * on first iteration.
-			 */
-			twait = rte_get_timer_hz();
-			MLX5_ASSERT(twait);
-		}
-		/*
-		 * Timeout elapsed, show message (once a second) and retry.
-		 * We have no other acceptable option here, if we ignore
-		 * the unregistering return code the handler will not
-		 * be unregistered, fd will be closed and we may get the
-		 * crush. Hanging and messaging in the loop seems not to be
-		 * the worst choice.
-		 */
-		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
-		start = rte_get_timer_cycles();
-	} while (true);
-}
-
 /**
  * Handle DEVX interrupts from the NIC.
  * This function is probably called from the DPDK host thread.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index a821153..0741028 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2494,40 +2494,6 @@
 	mlx5_pmd_socket_uninit();
 }
 
-static int
-mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
-{
-	int nlsk_fd, flags, ret;
-
-	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
-	if (nlsk_fd < 0) {
-		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-	flags = fcntl(nlsk_fd, F_GETFL);
-	ret = fcntl(nlsk_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret != 0) {
-		DRV_LOG(ERR, "Failed to make Netlink event socket non-blocking: %s",
-			strerror(errno));
-		rte_errno = errno;
-		goto error;
-	}
-	rte_intr_type_set(sh->intr_handle_nl, RTE_INTR_HANDLE_EXT);
-	rte_intr_fd_set(sh->intr_handle_nl, nlsk_fd);
-	if (rte_intr_callback_register(sh->intr_handle_nl,
-				       mlx5_dev_interrupt_handler_nl,
-				       sh) != 0) {
-		DRV_LOG(ERR, "Failed to register Netlink events interrupt");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-		goto error;
-	}
-	return 0;
-error:
-	close(nlsk_fd);
-	return -1;
-}
-
 /**
  * Install shared asynchronous device events handler.
  * This function is implemented to support event sharing
@@ -2539,76 +2505,47 @@
 void
 mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
-	int ret;
-	int flags;
 	struct ibv_context *ctx = sh->cdev->ctx;
+	int nlsk_fd;
 
-	sh->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
+	sh->intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 ctx->async_fd, mlx5_dev_interrupt_handler, sh);
+	if (!sh->intr_handle) {
+		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle, -1);
-
-	flags = fcntl(ctx->async_fd, F_GETFL);
-	ret = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret) {
-		DRV_LOG(INFO, "failed to change file descriptor async event"
-			" queue");
-	} else {
-		rte_intr_fd_set(sh->intr_handle, ctx->async_fd);
-		rte_intr_type_set(sh->intr_handle, RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle,
-					mlx5_dev_interrupt_handler, sh)) {
-			DRV_LOG(INFO, "Fail to install the shared interrupt.");
-			rte_intr_fd_set(sh->intr_handle, -1);
-		}
+	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
+	if (nlsk_fd < 0) {
+		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
+			rte_strerror(rte_errno));
+		return;
 	}
-	sh->intr_handle_nl = rte_intr_instance_alloc
-						(RTE_INTR_INSTANCE_F_SHARED);
+	sh->intr_handle_nl = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 nlsk_fd, mlx5_dev_interrupt_handler_nl, sh);
 	if (sh->intr_handle_nl == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle_nl, -1);
-	if (mlx5_os_dev_shared_handler_install_lsc(sh) < 0) {
-		DRV_LOG(INFO, "Fail to install the shared Netlink event handler.");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-	}
 	if (sh->cdev->config.devx) {
 #ifdef HAVE_IBV_DEVX_ASYNC
-		sh->intr_handle_devx =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-		if (!sh->intr_handle_devx) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
-			rte_errno = ENOMEM;
-			return;
-		}
-		rte_intr_fd_set(sh->intr_handle_devx, -1);
+		struct mlx5dv_devx_cmd_comp *devx_comp;
+
 		sh->devx_comp = (void *)mlx5_glue->devx_create_cmd_comp(ctx);
-		struct mlx5dv_devx_cmd_comp *devx_comp = sh->devx_comp;
+		devx_comp = sh->devx_comp;
 		if (!devx_comp) {
 			DRV_LOG(INFO, "failed to allocate devx_comp.");
 			return;
 		}
-		flags = fcntl(devx_comp->fd, F_GETFL);
-		ret = fcntl(devx_comp->fd, F_SETFL, flags | O_NONBLOCK);
-		if (ret) {
-			DRV_LOG(INFO, "failed to change file descriptor"
-				" devx comp");
+		sh->intr_handle_devx = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 devx_comp->fd,
+			 mlx5_dev_interrupt_handler_devx, sh);
+		if (!sh->intr_handle_devx) {
+			DRV_LOG(ERR, "Failed to allocate intr_handle.");
 			return;
 		}
-		rte_intr_fd_set(sh->intr_handle_devx, devx_comp->fd);
-		rte_intr_type_set(sh->intr_handle_devx,
-					 RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle_devx,
-					mlx5_dev_interrupt_handler_devx, sh)) {
-			DRV_LOG(INFO, "Fail to install the devx shared"
-				" interrupt.");
-			rte_intr_fd_set(sh->intr_handle_devx, -1);
-		}
 #endif /* HAVE_IBV_DEVX_ASYNC */
 	}
 }
@@ -2624,24 +2561,13 @@
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
-	int nlsk_fd;
-
-	if (rte_intr_fd_get(sh->intr_handle) >= 0)
-		mlx5_intr_callback_unregister(sh->intr_handle,
-					      mlx5_dev_interrupt_handler, sh);
-	rte_intr_instance_free(sh->intr_handle);
-	nlsk_fd = rte_intr_fd_get(sh->intr_handle_nl);
-	if (nlsk_fd >= 0) {
-		mlx5_intr_callback_unregister
-			(sh->intr_handle_nl, mlx5_dev_interrupt_handler_nl, sh);
-		close(nlsk_fd);
-	}
-	rte_intr_instance_free(sh->intr_handle_nl);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
+					  mlx5_dev_interrupt_handler, sh);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
+					  mlx5_dev_interrupt_handler_nl, sh);
 #ifdef HAVE_IBV_DEVX_ASYNC
-	if (rte_intr_fd_get(sh->intr_handle_devx) >= 0)
-		rte_intr_callback_unregister(sh->intr_handle_devx,
-				  mlx5_dev_interrupt_handler_devx, sh);
-	rte_intr_instance_free(sh->intr_handle_devx);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
+					  mlx5_dev_interrupt_handler_devx, sh);
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
diff --git a/drivers/net/mlx5/linux/mlx5_socket.c b/drivers/net/mlx5/linux/mlx5_socket.c
index 4882e5f..0e01aff 100644
--- a/drivers/net/mlx5/linux/mlx5_socket.c
+++ b/drivers/net/mlx5/linux/mlx5_socket.c
@@ -134,51 +134,6 @@
 }
 
 /**
- * Install interrupt handler.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @return
- *   0 on success, a negative errno value otherwise.
- */
-static int
-mlx5_pmd_interrupt_handler_install(void)
-{
-	MLX5_ASSERT(server_socket != -1);
-
-	server_intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
-	if (server_intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
-	if (rte_intr_fd_set(server_intr_handle, server_socket))
-		return -rte_errno;
-
-	if (rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	return rte_intr_callback_register(server_intr_handle,
-					  mlx5_pmd_socket_handle, NULL);
-}
-
-/**
- * Uninstall interrupt handler.
- */
-static void
-mlx5_pmd_interrupt_handler_uninstall(void)
-{
-	if (server_socket != -1) {
-		mlx5_intr_callback_unregister(server_intr_handle,
-					      mlx5_pmd_socket_handle,
-					      NULL);
-	}
-	rte_intr_fd_set(server_intr_handle, 0);
-	rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_UNKNOWN);
-	rte_intr_instance_free(server_intr_handle);
-}
-
-/**
  * Initialise the socket to communicate with external tools.
  *
  * @return
@@ -224,7 +179,10 @@
 			strerror(errno));
 		goto remove;
 	}
-	if (mlx5_pmd_interrupt_handler_install()) {
+	server_intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_PRIVATE, false,
+		 server_socket, mlx5_pmd_socket_handle, NULL);
+	if (server_intr_handle == NULL) {
 		DRV_LOG(WARNING, "cannot register interrupt handler for mlx5 socket: %s",
 			strerror(errno));
 		goto remove;
@@ -248,7 +206,8 @@
 {
 	if (server_socket == -1)
 		return;
-	mlx5_pmd_interrupt_handler_uninstall();
+	mlx5_os_interrupt_handler_destroy(server_intr_handle,
+					  mlx5_pmd_socket_handle, NULL);
 	claim_zero(close(server_socket));
 	server_socket = -1;
 	MKSTR(path, MLX5_SOCKET_PATH, getpid());
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 305edff..7ebb2cc 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1682,8 +1682,6 @@ int mlx5_sysfs_switch_info(unsigned int ifindex,
 			   struct mlx5_switch_info *info);
 void mlx5_translate_port_name(const char *port_name_in,
 			      struct mlx5_switch_info *port_info_out);
-void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-				   rte_intr_callback_fn cb_fn, void *cb_arg);
 int mlx5_sysfs_bond_info(unsigned int pf_ifindex, unsigned int *ifindex,
 			 char *ifname);
 int mlx5_get_module_info(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index fe74317..f853a67 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -741,11 +741,8 @@
 static void
 mlx5_txpp_stop_service(struct mlx5_dev_ctx_shared *sh)
 {
-	if (!rte_intr_fd_get(sh->txpp.intr_handle))
-		return;
-	mlx5_intr_callback_unregister(sh->txpp.intr_handle,
-				      mlx5_txpp_interrupt_handler, sh);
-	rte_intr_instance_free(sh->txpp.intr_handle);
+	mlx5_os_interrupt_handler_destroy(sh->txpp.intr_handle,
+					  mlx5_txpp_interrupt_handler, sh);
 }
 
 /* Attach interrupt handler and fires first request to Rearm Queue. */
@@ -769,23 +766,12 @@
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	sh->txpp.intr_handle =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->txpp.intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
 	fd = mlx5_os_get_devx_channel_fd(sh->txpp.echan);
-	if (rte_intr_fd_set(sh->txpp.intr_handle, fd))
-		return -rte_errno;
-
-	if (rte_intr_type_set(sh->txpp.intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	if (rte_intr_callback_register(sh->txpp.intr_handle,
-				       mlx5_txpp_interrupt_handler, sh)) {
-		rte_intr_fd_set(sh->txpp.intr_handle, 0);
-		DRV_LOG(ERR, "Failed to register CQE interrupt %d.", rte_errno);
+	sh->txpp.intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, false,
+		 fd, mlx5_txpp_interrupt_handler, sh);
+	if (!sh->txpp.intr_handle) {
+		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		return -rte_errno;
 	}
 	/* Subscribe CQ event to the event channel controlled by the driver. */
diff --git a/drivers/net/mlx5/windows/mlx5_ethdev_os.c b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
index f975265..88d8213 100644
--- a/drivers/net/mlx5/windows/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
@@ -140,28 +140,6 @@
 	return 0;
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	RTE_SET_USED(handle);
-	RTE_SET_USED(cb_fn);
-	RTE_SET_USED(cb_arg);
-}
-
 /**
  * DPDK callback to get flow control status.
  *
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
index e025be4..fd447cc 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
@@ -93,22 +93,10 @@
 static int
 mlx5_vdpa_virtq_unset(struct mlx5_vdpa_virtq *virtq)
 {
-	int ret = -EAGAIN;
-
-	if (rte_intr_fd_get(virtq->intr_handle) >= 0) {
-		while (ret == -EAGAIN) {
-			ret = rte_intr_callback_unregister(virtq->intr_handle,
-					mlx5_vdpa_virtq_kick_handler, virtq);
-			if (ret == -EAGAIN) {
-				DRV_LOG(DEBUG, "Try again to unregister fd %d of virtq %hu interrupt",
-					rte_intr_fd_get(virtq->intr_handle),
-					virtq->index);
-				usleep(MLX5_VDPA_INTR_RETRIES_USEC);
-			}
-		}
-		rte_intr_fd_set(virtq->intr_handle, -1);
-	}
-	rte_intr_instance_free(virtq->intr_handle);
+	int ret;
+
+	mlx5_os_interrupt_handler_destroy(virtq->intr_handle,
+					  mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->virtq) {
 		ret = mlx5_vdpa_virtq_stop(virtq->priv, virtq->index);
 		if (ret)
@@ -365,35 +353,13 @@
 	virtq->priv = priv;
 	rte_write32(virtq->index, priv->virtq_db_addr);
 	/* Setup doorbell mapping. */
-	virtq->intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+	virtq->intr_handle = mlx5_os_interrupt_handler_create(
+				  RTE_INTR_INSTANCE_F_SHARED, false,
+				  vq.kickfd, mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->intr_handle == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		goto error;
 	}
-
-	if (rte_intr_fd_set(virtq->intr_handle, vq.kickfd))
-		goto error;
-
-	if (rte_intr_fd_get(virtq->intr_handle) == -1) {
-		DRV_LOG(WARNING, "Virtq %d kickfd is invalid.", index);
-	} else {
-		if (rte_intr_type_set(virtq->intr_handle, RTE_INTR_HANDLE_EXT))
-			goto error;
-
-		if (rte_intr_callback_register(virtq->intr_handle,
-					       mlx5_vdpa_virtq_kick_handler,
-					       virtq)) {
-			rte_intr_fd_set(virtq->intr_handle, -1);
-			DRV_LOG(ERR, "Failed to register virtq %d interrupt.",
-				index);
-			goto error;
-		} else {
-			DRV_LOG(DEBUG, "Register fd %d interrupt for virtq %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				index);
-		}
-	}
 	/* Subscribe virtq error event. */
 	virtq->version++;
 	cookie = ((uint64_t)virtq->version << 32) + index;
-- 
1.8.3.1
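
The safe-unregister loop that this patch moves into the common driver can
be illustrated in isolation. Below is a minimal sketch, not the driver
code: mock_unregister() and busy_calls_left are hypothetical stand-ins
for rte_intr_callback_unregister() and for a callback that is still
executing, and the once-per-second log message of the real helper is
left out for brevity.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for nanosleep() under strict ISO C */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical: number of calls for which the callback is still "active". */
static int busy_calls_left;

/*
 * Hypothetical stand-in for rte_intr_callback_unregister(): returns
 * -EAGAIN while the callback is busy, then 1 (one callback removed).
 */
static int
mock_unregister(void)
{
	if (busy_calls_left > 0) {
		busy_calls_left--;
		return -EAGAIN;
	}
	return 1;
}

/*
 * Retry until unregistration succeeds or fails with an error other
 * than -EAGAIN. Returns the number of attempts on success, or the
 * negative error code on failure.
 */
static int
unregister_with_retry(int (*unregister)(void))
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000L };
	int attempts = 0;

	assert(unregister != NULL);
	for (;;) {
		int ret = unregister();

		attempts++;
		if (ret >= 0)
			return attempts;
		if (ret != -EAGAIN)
			return ret;
		/* Callback still running: wait one millisecond and retry. */
		nanosleep(&one_ms, NULL);
	}
}
```

With busy_calls_left set to 3, unregister_with_retry(mock_unregister)
makes four attempts before it returns. Like the helper in the patch, the
loop never gives up on -EAGAIN, because freeing the handle while the
callback is still registered would be worse than waiting.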



* [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
  2022-06-03 12:48           ` [PATCH v4 1/7] net/mlx5: add LWM support for Rxq Spike Du
  2022-06-03 12:48           ` [PATCH v4 2/7] common/mlx5: share interrupt management Spike Du
@ 2022-06-03 12:48           ` Spike Du
  2022-06-03 14:30             ` Ray Kinsella
                               ` (2 more replies)
  2022-06-03 12:48           ` [PATCH v4 4/7] net/mlx5: add LWM event handling support Spike Du
                             ` (4 subsequent siblings)
  7 siblings, 3 replies; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Wenzhuo Lu, Beilei Xing,
	Bernard Iremonger, Ray Kinsella, Neil Horman
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Fill threshold describes the fullness of an Rx queue. If the Rx
queue fullness is above the threshold, the device triggers the event
RTE_ETH_EVENT_RX_FILL_THRESH.
Fill threshold is defined as a percentage of the Rx queue size, with
valid values in the range [0,99].
Setting the fill threshold to 0 disables it, which is the default.
Add driver callbacks in eth_dev_ops to configure the fill threshold and
to query pending fill threshold events.
Add a testpmd command to configure fill_thresh per Rx queue.
- Command syntax:
  set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>

- Example commands:
To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 fill_thresh 30

To disable fill_thresh on port 1 rxq 0:
testpmd> set port 1 rxq 0 fill_thresh 0
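
The circular query semantics can be sketched without a device. The code
below is only an illustration under stated assumptions: NB_RXQ,
struct mock_rxq, mock_port and the mock_* functions are hypothetical
names that model the behaviour described for
rte_eth_rx_fill_thresh_set() and rte_eth_rx_fill_thresh_query(); they
are not part of ethdev. Clearing the pending flag when an event is
reported is also an assumption of this model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NB_RXQ 4 /* hypothetical number of Rx queues on the mock port */

/* Hypothetical per-queue state standing in for the device. */
struct mock_rxq {
	uint8_t fill_thresh; /* threshold in percent, 0 means disabled */
	uint8_t pending;     /* 1 if a fill_thresh event is pending */
};

static struct mock_rxq mock_port[NB_RXQ];

/* Model of the "set" call: 0 disables, [1,99] sets a new threshold. */
static int
mock_fill_thresh_set(uint16_t queue_id, uint8_t fill_thresh)
{
	if (queue_id >= NB_RXQ || fill_thresh > 99)
		return -1;
	mock_port[queue_id].fill_thresh = fill_thresh;
	return 0;
}

/*
 * Model of the "query" call: scan all queues circularly, starting at
 * *queue_id (an out-of-range start is treated as 0). Returns 1 and
 * updates both outputs when a pending event is found, 0 when none is
 * pending, -1 on a NULL argument.
 */
static int
mock_fill_thresh_query(uint16_t *queue_id, uint8_t *fill_thresh)
{
	uint16_t start, i;

	if (queue_id == NULL || fill_thresh == NULL)
		return -1;
	start = *queue_id >= NB_RXQ ? 0 : *queue_id;
	for (i = 0; i < NB_RXQ; i++) {
		uint16_t q = (uint16_t)((start + i) % NB_RXQ);

		if (mock_port[q].pending) {
			mock_port[q].pending = 0; /* assumption: report once */
			*queue_id = q;
			*fill_thresh = mock_port[q].fill_thresh;
			return 1;
		}
	}
	return 0;
}

/* Drain pending events with a simple increment between calls. */
static unsigned int
drain_events(uint16_t *seen, unsigned int max_seen)
{
	uint16_t rxq_id;
	uint8_t thresh;
	unsigned int n = 0;

	assert(seen != NULL);
	for (rxq_id = 0; n < max_seen; rxq_id++) {
		if (mock_fill_thresh_query(&rxq_id, &thresh) <= 0)
			break;
		seen[n++] = rxq_id;
	}
	return n;
}
```

The drain loop mirrors the handling of RTE_ETH_EVENT_RX_FILL_THRESH in
testpmd: because the query wraps around, the caller does not need to
know the number of Rx queues and simply stops when no event is reported.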

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 app/test-pmd/cmdline.c     | 68 +++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/config.c      | 21 ++++++++++++++
 app/test-pmd/testpmd.c     | 18 ++++++++++++
 app/test-pmd/testpmd.h     |  2 ++
 lib/ethdev/ethdev_driver.h | 22 ++++++++++++++
 lib/ethdev/rte_ethdev.c    | 52 +++++++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 72 ++++++++++++++++++++++++++++++++++++++++++++++
 lib/ethdev/version.map     |  2 ++
 8 files changed, 257 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 0410bad..918581e 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -17823,6 +17823,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	}
 };
 
+/* *** SET FILL THRESHOLD FOR A RXQ OF A PORT *** */
+struct cmd_rxq_fill_thresh_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t rxq;
+	uint16_t rxq_num;
+	cmdline_fixed_string_t fill_thresh;
+	uint16_t fill_thresh_num;
+};
+
+static void cmd_rxq_fill_thresh_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_rxq_fill_thresh_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->rxq, "rxq") == 0)
+	    && (strcmp(res->fill_thresh, "fill_thresh") == 0))
+		ret = set_rxq_fill_thresh(res->port_num, res->rxq_num,
+				  res->fill_thresh_num);
+	if (ret < 0)
+		printf("rxq_fill_thresh_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_rxq_fill_thresh_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_rxq =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+				rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_fill_thresh_rxqnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+				rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_fill_thresh =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+				fill_thresh, "fill_thresh");
+cmdline_parse_token_num_t cmd_rxq_fill_thresh_fill_threshnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+				fill_thresh_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_fill_thresh = {
+	.f = cmd_rxq_fill_thresh_parsed,
+	.data = (void *)0,
+	.help_str = "set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>: "
+		"Set fill_thresh for rxq on port_id",
+	.tokens = {
+		(void *)&cmd_rxq_fill_thresh_set,
+		(void *)&cmd_rxq_fill_thresh_port,
+		(void *)&cmd_rxq_fill_thresh_portnum,
+		(void *)&cmd_rxq_fill_thresh_rxq,
+		(void *)&cmd_rxq_fill_thresh_rxqnum,
+		(void *)&cmd_rxq_fill_thresh_fill_thresh,
+		(void *)&cmd_rxq_fill_thresh_fill_threshnum,
+		NULL,
+	},
+};
+
 /* ******************************************************************************** */
 
 /* list of instructions */
@@ -18110,6 +18177,7 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	(cmdline_parse_inst_t *)&cmd_show_capability,
 	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
 	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
+	(cmdline_parse_inst_t *)&cmd_rxq_fill_thresh,
 	NULL,
 };
 
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 1b1e738..d0c519b 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -6342,3 +6342,24 @@ struct igb_ring_desc_16_bytes {
 		printf("  %s\n", buf);
 	}
 }
+
+int
+set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx, uint16_t fill_thresh)
+{
+	struct rte_eth_link link;
+	int ret;
+
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+	ret = eth_link_get_nowait_print_err(port_id, &link);
+	if (ret < 0)
+		return -EINVAL;
+	if (fill_thresh > 99)
+		return -EINVAL;
+	ret = rte_eth_rx_fill_thresh_set(port_id, queue_idx, fill_thresh);
+
+	if (ret)
+		return ret;
+	return 0;
+}
+
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 767765d..1209230 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -420,6 +420,7 @@ struct fwd_engine * fwd_engines[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RX_FILL_THRESH] = "rxq fill threshold reached",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -3616,6 +3617,10 @@ struct pmd_test_command {
 eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		  void *ret_param)
 {
+	struct rte_eth_dev_info dev_info;
+	uint16_t rxq_id;
+	uint8_t fill_thresh;
+	int ret;
 	RTE_SET_USED(param);
 	RTE_SET_USED(ret_param);
 
@@ -3647,6 +3652,19 @@ struct pmd_test_command {
 		ports[port_id].port_status = RTE_PORT_CLOSED;
 		printf("Port %u is closed\n", port_id);
 		break;
+	case RTE_ETH_EVENT_RX_FILL_THRESH:
+		ret = rte_eth_dev_info_get(port_id, &dev_info);
+		if (ret != 0)
+			break;
+		/* fill_thresh query API rewinds rxq_id, no need to check max rxq num. */
+		for (rxq_id = 0; ; rxq_id++) {
+			ret = rte_eth_rx_fill_thresh_query(port_id, &rxq_id, &fill_thresh);
+			if (ret <= 0)
+				break;
+			printf("Received fill_thresh event, port:%d rxq_id:%d\n",
+			       port_id, rxq_id);
+		}
+		break;
 	default:
 		break;
 	}
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 78a5f4e..c7a144e 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1173,6 +1173,8 @@ uint16_t tx_pkt_set_dynf(uint16_t port_id, __rte_unused uint16_t queue,
 void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
 void flex_item_destroy(portid_t port_id, uint16_t flex_id);
 void port_flex_item_flush(portid_t port_id);
+int set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx,
+			uint16_t fill_thresh);
 
 extern int flow_parse(const char *src, void *result, unsigned int size,
 		      struct rte_flow_attr **attr,
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc2..7ef7dba 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
 				    const struct rte_eth_rxconf *rx_conf,
 				    struct rte_mempool *mb_pool);
 
+/**
+ * @internal Set Rx queue fill threshold.
+ * @see rte_eth_rx_fill_thresh_set()
+ */
+typedef int (*eth_rx_queue_fill_thresh_set_t)(struct rte_eth_dev *dev,
+				      uint16_t rx_queue_id,
+				      uint8_t fill_thresh);
+
+/**
+ * @internal Query queue fill threshold event.
+ * @see rte_eth_rx_fill_thresh_query()
+ */
+
+typedef int (*eth_rx_queue_fill_thresh_query_t)(struct rte_eth_dev *dev,
+					uint16_t *rx_queue_id,
+					uint8_t *fill_thresh);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
 				    uint16_t tx_queue_id,
@@ -1168,6 +1185,11 @@ struct eth_dev_ops {
 	/** Priority flow control queue configure */
 	priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
 
+	/** Set Rx queue fill threshold. */
+	eth_rx_queue_fill_thresh_set_t rx_queue_fill_thresh_set;
+	/** Query Rx queue fill threshold event. */
+	eth_rx_queue_fill_thresh_query_t rx_queue_fill_thresh_query;
+
 	/** Set Unicast Table Array */
 	eth_uc_hash_table_set_t    uc_hash_table_set;
 	/** Set Unicast hash bitmap */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a175867..69a1f75 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4424,6 +4424,58 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
 							queue_idx, tx_rate));
 }
 
+int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
+			       uint8_t fill_thresh)
+{
+	struct rte_eth_dev *dev;
+	struct rte_eth_dev_info dev_info;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id >= dev_info.max_rx_queues) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue fill thresh: port %u: invalid queue ID=%u.\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (fill_thresh > 99)
+		return -EINVAL;
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_fill_thresh_set, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_fill_thresh_set)(dev,
+							     queue_id, fill_thresh));
+}
+
+int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
+				 uint8_t *fill_thresh)
+{
+	struct rte_eth_dev_info dev_info;
+	struct rte_eth_dev *dev;
+	int ret;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	ret = rte_eth_dev_info_get(port_id, &dev_info);
+	if (ret != 0)
+		return ret;
+
+	if (queue_id == NULL)
+		return -EINVAL;
+	if (*queue_id >= dev_info.max_rx_queues)
+		*queue_id = 0;
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_fill_thresh_query, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_fill_thresh_query)(dev,
+							     queue_id, fill_thresh));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
 	uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04225bb..d44e5da 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1931,6 +1931,14 @@ struct rte_eth_rxq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 	uint16_t nb_desc;           /**< configured number of RXDs. */
 	uint16_t rx_buf_size;       /**< hardware receive buffer size. */
+	/**
+	 * Per-queue Rx fill threshold defined as percentage of Rx queue
+	 * size. If Rx queue receives traffic higher than this percentage,
+	 * the event RTE_ETH_EVENT_RX_FILL_THRESH is triggered.
+	 * Value 0 means threshold monitoring is disabled, no event is
+	 * triggered.
+	 */
+	uint8_t fill_thresh;
 } __rte_cache_min_aligned;
 
 /**
@@ -3672,6 +3680,65 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t port_id,
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue based fill threshold.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The index of the receive queue.
+ * @param fill_thresh
+ *  The fill threshold percentage of Rx queue size which describes
+ *  the fullness of Rx queue. If the Rx queue fullness is above it,
+ *  the device will trigger the event RTE_ETH_EVENT_RX_FILL_THRESH.
+ *  [1-99] to set a new fill threshold.
+ *  0 to disable threshold monitoring.
+ *
+ * @return
+ *   - 0 if successful.
+ *   - negative if failed.
+ */
+__rte_experimental
+int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
+			       uint8_t fill_thresh);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Query Rx queue based fill threshold event.
+ * The function queries all queues in the port circularly until one
+ * pending fill_thresh event is found or all queues have been checked.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The API caller sets the starting Rx queue id in the pointer.
+ *  If the queue_id is bigger than the maximum queue id of the port,
+ *  it's rewound to 0 so that the application can keep calling
+ *  this function to handle all pending fill_thresh events in the queues
+ *  with a simple increment between calls.
+ *  If an Rx queue has a pending fill_thresh event, the pointer is updated
+ *  with this Rx queue id; otherwise this pointer's content is
+ *  unchanged.
+ * @param fill_thresh
+ *  The pointer to the fill threshold percentage of Rx queue.
+ *  If an Rx queue with a pending fill_thresh event is found, its fill_thresh
+ *  percentage is stored in this pointer, otherwise the pointer's
+ *  content is unchanged.
+ *
+ * @return
+ *   - 1 if an Rx queue with a pending fill_thresh event is found.
+ *   - 0 if no Rx queue with a pending fill_thresh event is found.
+ *   - -EINVAL if queue_id is NULL.
+ */
+__rte_experimental
+int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
+				 uint8_t *fill_thresh);
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
@@ -3877,6 +3944,11 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	/**
+	 *  Fill threshold value is exceeded in a queue.
+	 *  @see rte_eth_rx_fill_thresh_set()
+	 */
+	RTE_ETH_EVENT_RX_FILL_THRESH,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index daca785..29b1fe8 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -285,6 +285,8 @@ EXPERIMENTAL {
 	rte_mtr_color_in_protocol_priority_get;
 	rte_mtr_color_in_protocol_set;
 	rte_mtr_meter_vlan_table_update;
+	rte_eth_rx_fill_thresh_set;
+	rte_eth_rx_fill_thresh_query;
 };
 
 INTERNAL {
-- 
1.8.3.1
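
Editor's note on the query contract documented in rte_ethdev.h above: the
circular scan and the "simple increment between calls" usage can be modelled
in a few lines of plain C. This is only a sketch under stated assumptions, not
the ethdev or PMD implementation: model_fill_thresh_query(),
model_drain_events() and the pending[] array are hypothetical names that stand
in for rte_eth_rx_fill_thresh_query() and for per-queue driver state.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the documented query contract (NOT the ethdev code).
 * pending[i] != 0 means queue i has an unhandled fill_thresh event.
 */
static int
model_fill_thresh_query(uint8_t *pending, uint16_t nb_queues,
			uint16_t *queue_id)
{
	uint16_t qid, n;

	assert(nb_queues > 0);
	if (queue_id == NULL)
		return -EINVAL;
	/* A start index past the last queue is rewound to 0. */
	qid = (*queue_id >= nb_queues) ? 0 : *queue_id;
	for (n = 0; n < nb_queues; n++) {
		if (pending[qid]) {
			pending[qid] = 0;	/* consume the event */
			*queue_id = qid;	/* report where it was found */
			return 1;
		}
		qid = (uint16_t)((qid + 1) % nb_queues);
	}
	return 0;	/* nothing pending: *queue_id is left unchanged */
}

/* Drain pending events, recording the queue ids in the order found. */
static unsigned int
model_drain_events(uint8_t *pending, uint16_t nb_queues, uint16_t start,
		   uint16_t *handled, unsigned int max_handled)
{
	uint16_t qid = start;
	unsigned int count = 0;

	while (count < max_handled &&
	       model_fill_thresh_query(pending, nb_queues, &qid) == 1) {
		handled[count++] = qid;	/* an application would re-arm here */
		qid++;			/* simple increment between calls */
	}
	return count;
}
```

Because the start index is rewound when it runs past the last queue, the
caller does not need to know the queue count to keep the scan fair; it only
increments the returned index before the next call.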



* [PATCH v4 4/7] net/mlx5: add LWM event handling support
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
                             ` (2 preceding siblings ...)
  2022-06-03 12:48           ` [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold Spike Du
@ 2022-06-03 12:48           ` Spike Du
  2022-06-03 12:48           ` [PATCH v4 5/7] net/mlx5: support Rx queue based fill threshold Spike Du
                             ` (3 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Shahaf Shuler
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

When the number of available RQ WQEs reaches the LWM, the kernel driver
raises an event to SW.
Use a devx event_channel to catch this event and notify the user.
Allocate one channel per shared device context.
Each event carries a cookie that identifies the port and queue it belongs to.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
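
Editor's note: the cookie layout used by this patch (port id in the upper
16 bits, Rx queue index in the lower 16 bits) can be illustrated with a small
standalone sketch. The LWM_COOKIE_* macros are copied from the mlx5_rx.h hunk
below; lwm_cookie_encode() and lwm_cookie_decode() are hypothetical helper
names used only for illustration, and the encode side is an assumed
counterpart of the decode done in the event getter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit layout copied from the mlx5_rx.h hunk in this patch. */
#define LWM_COOKIE_RXQID_OFFSET 0
#define LWM_COOKIE_RXQID_MASK 0xffff
#define LWM_COOKIE_PORTID_OFFSET 16
#define LWM_COOKIE_PORTID_MASK 0xffff

/* Hypothetical helper: pack port and queue ids into one cookie. */
static uint64_t
lwm_cookie_encode(uint16_t port_id, uint16_t rxq_idx)
{
	/* Widen before shifting so the port id never reaches a sign bit. */
	return ((uint32_t)port_id << LWM_COOKIE_PORTID_OFFSET) |
	       ((uint32_t)rxq_idx << LWM_COOKIE_RXQID_OFFSET);
}

/* Hypothetical helper: same shifts and masks as the event getter. */
static void
lwm_cookie_decode(uint64_t cookie, int *port_id, int *rxq_idx)
{
	assert(port_id != NULL && rxq_idx != NULL);
	*port_id = (((uint32_t)cookie) >> LWM_COOKIE_PORTID_OFFSET) &
		   LWM_COOKIE_PORTID_MASK;
	*rxq_idx = (((uint32_t)cookie) >> LWM_COOKIE_RXQID_OFFSET) &
		   LWM_COOKIE_RXQID_MASK;
}
```

Keeping both ids in one 32-bit value lets a single event channel serve every
Rx queue of every port that shares the device context.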
 drivers/net/mlx5/mlx5.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h      |  7 +++++
 drivers/net/mlx5/mlx5_devx.c | 47 +++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h   |  7 +++++
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f098871..e04a666 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <errno.h>
+#include <fcntl.h>
 
 #include <rte_malloc.h>
 #include <ethdev_driver.h>
@@ -22,6 +23,7 @@
 #include <rte_eal_paging.h>
 #include <rte_alarm.h>
 #include <rte_cycles.h>
+#include <rte_interrupts.h>
 
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
@@ -1525,6 +1527,69 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+	int fd_lwm;
+
+	pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+	priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+			(priv->sh->cdev->ctx,
+			 MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+	if (!priv->sh->devx_channel_lwm)
+		goto err;
+	fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+	priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+	if (!priv->sh->intr_handle_lwm)
+		goto err;
+	return 0;
+err:
+	if (priv->sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(priv->sh->devx_channel_lwm);
+		priv->sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+	return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before freeing this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+	if (sh->intr_handle_lwm) {
+		mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+			mlx5_dev_interrupt_handler_lwm, (void *)-1);
+		sh->intr_handle_lwm = NULL;
+	}
+	if (sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(sh->devx_channel_lwm);
+		sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1601,6 +1666,7 @@ struct mlx5_dev_ctx_shared *
 		claim_zero(mlx5_devx_cmd_destroy(sh->td));
 	MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
 	pthread_mutex_destroy(&sh->txpp.mutex);
+	mlx5_lwm_unset(sh);
 	mlx5_free(sh);
 	return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc..a76f2fe 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
 	unsigned int flow_max_priority;
 	enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+	void *devx_channel_lwm;
+	struct rte_intr_handle *intr_handle_lwm;
+	pthread_mutex_t lwm_config_lock;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
 	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
 	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+	int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1603,6 +1608,8 @@ int mlx5_udp_tunnel_port_add(struct rte_eth_dev *dev,
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *priv);
+void mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh);
 
 /* Macro to iterate over all valid ports for mlx5 driver. */
 #define MLX5_ETH_FOREACH_DEV(port_id, dev) \
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index c918a50..6886ae1 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -233,6 +233,52 @@
 }
 
 /**
+ * Get LWM event for shared context, return the correct port/rxq for this event.
+ *
+ * @param priv
+ *   Mlx5_priv object.
+ * @param rxq_idx [out]
+ *   Which rxq gets this event.
+ * @param port_id [out]
+ *   Which port gets this event.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_rx_devx_get_event_lwm(struct mlx5_priv *priv, int *rxq_idx, int *port_id)
+{
+#ifdef HAVE_IBV_DEVX_EVENT
+	union {
+		struct mlx5dv_devx_async_event_hdr event_resp;
+		uint8_t buf[sizeof(struct mlx5dv_devx_async_event_hdr) + 128];
+	} out;
+	int ret;
+
+	memset(&out, 0, sizeof(out));
+	ret = mlx5_glue->devx_get_event(priv->sh->devx_channel_lwm,
+					&out.event_resp,
+					sizeof(out.buf));
+	if (ret < 0) {
+		rte_errno = errno;
+		DRV_LOG(WARNING, "%s: failed to get LWM event.", __func__);
+		return -rte_errno;
+	}
+	*port_id = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_PORTID_OFFSET) & LWM_COOKIE_PORTID_MASK;
+	*rxq_idx = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_RXQID_OFFSET) & LWM_COOKIE_RXQID_MASK;
+	return 0;
+#else
+	(void)priv;
+	(void)rxq_idx;
+	(void)port_id;
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+#endif /* HAVE_IBV_DEVX_EVENT */
+}
+
+/**
  * Create a RQ object using DevX.
  *
  * @param rxq
@@ -1421,6 +1467,7 @@ struct mlx5_obj_ops devx_obj_ops = {
 	.rxq_event_get = mlx5_rx_devx_get_event,
 	.rxq_obj_modify = mlx5_devx_modify_rq,
 	.rxq_obj_release = mlx5_rxq_devx_obj_release,
+	.rxq_event_get_lwm = mlx5_rx_devx_get_event_lwm,
 	.ind_table_new = mlx5_devx_ind_table_new,
 	.ind_table_modify = mlx5_devx_ind_table_modify,
 	.ind_table_destroy = mlx5_devx_ind_table_destroy,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index e5eea0a..aacb43e 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -1187,3 +1187,36 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	return -ENOTSUP;
 }
+
+/**
+ * Rte interrupt handler for LWM event.
+ * It first checks if the event arrives, if so process the callback for
+ * RTE_ETH_EVENT_RX_FILL_THRESH.
+ *
+ * @param args
+ *   Generic pointer to mlx5_priv.
+ */
+void
+mlx5_dev_interrupt_handler_lwm(void *args)
+{
+	struct mlx5_priv *priv = args;
+	struct mlx5_rxq_priv *rxq;
+	struct rte_eth_dev *dev;
+	int ret, rxq_idx = 0, port_id = 0;
+
+	ret = priv->obj_ops.rxq_event_get_lwm(priv, &rxq_idx, &port_id);
+	if (unlikely(ret < 0)) {
+		DRV_LOG(WARNING, "Cannot get LWM event context.");
+		return;
+	}
+	DRV_LOG(INFO, "%s get LWM event, port_id:%d rxq_id:%d.", __func__,
+		port_id, rxq_idx);
+	dev = &rte_eth_devices[port_id];
+	rxq = mlx5_rxq_get(dev, rxq_idx);
+	if (rxq) {
+		pthread_mutex_lock(&priv->sh->lwm_config_lock);
+		rxq->lwm_event_pending = 1;
+		pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	}
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_FILL_THRESH, NULL);
+}
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 25a5f2c..068dff5 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -176,6 +176,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	uint32_t lwm_event_pending:1;
 };
 
 /* External RX queue descriptor. */
@@ -295,6 +296,7 @@ void mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
+void mlx5_dev_interrupt_handler_lwm(void *args);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
@@ -675,4 +677,9 @@ uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 	return !!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED);
 }
 
+#define LWM_COOKIE_RXQID_OFFSET 0
+#define LWM_COOKIE_RXQID_MASK 0xffff
+#define LWM_COOKIE_PORTID_OFFSET 16
+#define LWM_COOKIE_PORTID_MASK 0xffff
+
 #endif /* RTE_PMD_MLX5_RX_H_ */
-- 
1.8.3.1



* [PATCH v4 5/7] net/mlx5: support Rx queue based fill threshold
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
                             ` (3 preceding siblings ...)
  2022-06-03 12:48           ` [PATCH v4 4/7] net/mlx5: add LWM event handling support Spike Du
@ 2022-06-03 12:48           ` Spike Du
  2022-06-03 12:48           ` [PATCH v4 6/7] net/mlx5: add private API to config host port shaper Spike Du
                             ` (2 subsequent siblings)
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Shahaf Shuler
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Add mlx5 specific fill threshold configuration and query handlers.
In the mlx5 PMD, the fill threshold is also called LWM (limit watermark).
When the Rx queue fullness reaches the LWM limit, the driver catches
an HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
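
Editor's note: the conversion between the ethdev fill threshold (a fullness
percentage) and the mlx5 LWM (a WQE count describing emptiness) is easy to get
wrong, so here is a standalone sketch of the arithmetic used in the
mlx5_rx.c hunks below. fill_pct_to_lwm() and lwm_to_fill_pct() are
hypothetical names; the sketch assumes wqe_cnt is the number of WQEs in the
queue and ignores the locking and devx calls of the real driver.

```c
#include <assert.h>
#include <stdint.h>

/* Fullness percentage [0-99] -> WQE count used as the HW watermark. */
static uint16_t
fill_pct_to_lwm(uint8_t fill_pct, uint32_t wqe_cnt)
{
	uint32_t empty_pct, lwm;

	assert(fill_pct <= 99 && wqe_cnt > 0);
	if (fill_pct == 0)
		return 0;	/* monitoring disabled */
	/* ethdev describes fullness, the mlx5 LWM describes emptiness. */
	empty_pct = 100u - fill_pct;
	lwm = empty_pct * wqe_cnt / 100;
	/* Round up so converting back yields the original percentage. */
	if ((empty_pct * wqe_cnt % 100) != 0 && lwm + 1 < wqe_cnt)
		lwm++;
	return (uint16_t)lwm;
}

/* WQE-count watermark -> fullness percentage reported to the user. */
static uint8_t
lwm_to_fill_pct(uint16_t lwm, uint32_t wqe_cnt)
{
	assert(wqe_cnt > 0);
	return lwm ? (uint8_t)(100 - (uint32_t)lwm * 100 / wqe_cnt) : 0;
}
```

For example, with 256 WQEs a 70% fill threshold maps to an LWM of 77 WQEs and
converts back to 70%. A result of 0 for a non-zero percentage (possible only
for very small queues) corresponds to the case the driver rejects with the
"Too small LWM configuration" warning.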
 doc/guides/nics/mlx5.rst               |  12 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h         |   1 +
 drivers/net/mlx5/mlx5.c                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 156 +++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h             |   5 ++
 6 files changed, 177 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56d..ea393fb 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue fill threshold configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- Fill threshold:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
    $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+Fill threshold introduction
+---------------------------
+
+Fill threshold is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue fullness is above the threshold, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 0ed4f92..62a8874 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -89,6 +89,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue fill threshold support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5..3b5e605 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
 	MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+	MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a666..a4a39ab 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ struct mlx5_dev_ctx_shared *
 	.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
 	.vlan_filter_set = mlx5_vlan_filter_set,
 	.rx_queue_setup = mlx5_rx_queue_setup,
+	.rx_queue_fill_thresh_set = mlx5_rx_queue_lwm_set,
+	.rx_queue_fill_thresh_query = mlx5_rx_queue_lwm_query,
 	.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
 	.tx_queue_setup = mlx5_tx_queue_setup,
 	.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index aacb43e..4099496 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
 #include <mlx5_common_mr.h>
+#include <rte_pmd_mlx5.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,17 @@
 	return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+	uint32_t wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+
+	/* ethdev LWM describes fullness, mlx5 LWM describes emptiness. */
+	return rxq->lwm ? (100 - rxq->lwm * 100 / wqe_cnt) : 0;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +163,7 @@
 {
 	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
 	struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+	struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
 	if (!rxq)
 		return;
@@ -169,6 +183,8 @@
 	qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
 		RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
 		RTE_BIT32(rxq->elts_n);
+	qinfo->fill_thresh = rxq_priv ?
+		mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1204,34 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	return -ENOTSUP;
 }
 
+int
+mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev,
+			uint16_t *queue_id, uint8_t *lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	unsigned int rxq_id, found = 0, n;
+	struct mlx5_rxq_priv *rxq;
+
+	if (!queue_id)
+		return -EINVAL;
+	/* Query all the Rx queues of the port in a circular way. */
+	for (rxq_id = *queue_id, n = 0; n < priv->rxqs_n; n++) {
+		rxq = mlx5_rxq_get(dev, rxq_id);
+		if (rxq && rxq->lwm_event_pending) {
+			pthread_mutex_lock(&priv->sh->lwm_config_lock);
+			rxq->lwm_event_pending = 0;
+			pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+			*queue_id = rxq_id;
+			found = 1;
+			if (lwm)
+				*lwm =  mlx5_rxq_lwm_to_percentage(rxq);
+			break;
+		}
+		rxq_id = (rxq_id + 1) % priv->rxqs_n;
+	}
+	return found;
+}
+
 /**
  * Rte interrupt handler for LWM event.
  * It first checks if the event arrives, if so process the callback for
@@ -1220,3 +1264,115 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	}
 	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_FILL_THRESH, NULL);
 }
+
+/**
+ * DPDK callback to arm an Rx queue LWM (limit watermark) event.
+ * When the Rx queue fullness reaches the LWM limit, the driver catches
+ * an HW event and invokes the user event callback.
+ * After the last event handling, the user needs to call this API again
+ * to arm an additional event.
+ *
+ * @param dev
+ *   Pointer to the device structure.
+ * @param[in] rx_queue_id
+ *   Rx queue identifier.
+ * @param[in] lwm
+ *   The LWM value, defined as a percentage of the Rx queue size.
+ *   [1-99] to set a new LWM (update the old value).
+ *   0 to unarm the event.
+ *
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOMEM - not enough memory to create LWM event channel.
+ *   - EINVAL - the input Rxq is not created by devx.
+ *   - E2BIG  - lwm is bigger than 99.
+ */
+int
+mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+		      uint8_t lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint16_t port_id = PORT_ID(priv);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	uint16_t event_nums[1] = {MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED};
+	struct mlx5_rxq_data *rxq_data;
+	uint32_t wqe_cnt;
+	uint64_t cookie;
+	int ret = 0;
+
+	if (!rxq) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	rxq_data = &rxq->ctrl->rxq;
+	/* Ensure the Rq is created by devx. */
+	if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (lwm > 99) {
+		DRV_LOG(WARNING, "Too big LWM configuration.");
+		rte_errno = E2BIG;
+		return -rte_errno;
+	}
+	/* Start config LWM. */
+	pthread_mutex_lock(&priv->sh->lwm_config_lock);
+	if (rxq->lwm == 0 && lwm == 0) {
+		/* Both old/new values are 0, do nothing. */
+		ret = 0;
+		goto end;
+	}
+	wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+	if (lwm) {
+		if (!priv->sh->devx_channel_lwm) {
+			ret = mlx5_lwm_setup(priv);
+			if (ret) {
+				DRV_LOG(WARNING,
+					"Failed to create shared_lwm.");
+				rte_errno = ENOMEM;
+				ret = -rte_errno;
+				goto end;
+			}
+		}
+		if (!rxq->lwm_devx_subscribed) {
+			cookie = ((uint32_t)port_id <<
+				  LWM_COOKIE_PORTID_OFFSET) |
+				((uint32_t)rx_queue_id << LWM_COOKIE_RXQID_OFFSET);
+			ret = mlx5_os_devx_subscribe_devx_event
+				(priv->sh->devx_channel_lwm,
+				 rxq->devx_rq.rq->obj,
+				 sizeof(event_nums),
+				 event_nums,
+				 cookie);
+			if (ret) {
+				rte_errno = rte_errno ? rte_errno : EINVAL;
+				ret = -rte_errno;
+				goto end;
+			}
+			rxq->lwm_devx_subscribed = 1;
+		}
+	}
+	/* The ethdev LWM describes fullness, mlx5 lwm describes emptiness. */
+	if (lwm)
+		lwm = 100 - lwm;
+	/* Save LWM to rxq and send modify_rq devx command. */
+	rxq->lwm = lwm * wqe_cnt / 100;
+	/* Prevent integer division loss when converting lwm number to percentage. */
+	if (lwm && (lwm * wqe_cnt % 100)) {
+		rxq->lwm = ((uint32_t)(rxq->lwm + 1) >= wqe_cnt) ?
+			rxq->lwm : (rxq->lwm + 1);
+	}
+	if (lwm && !rxq->lwm) {
+		/* With mprq, wqe_cnt may be < 100. */
+		DRV_LOG(WARNING, "Too small LWM configuration.");
+		rte_errno = EINVAL;
+		ret = -rte_errno;
+		goto end;
+	}
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RDY);
+end:
+	pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	return ret;
+}
+
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 068dff5..e078aaf 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -177,6 +177,7 @@ struct mlx5_rxq_priv {
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
 	uint32_t lwm_event_pending:1;
+	uint32_t lwm_devx_subscribed:1;
 };
 
 /* External RX queue descriptor. */
@@ -297,6 +298,10 @@ int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 void mlx5_dev_interrupt_handler_lwm(void *args);
+int mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+			  uint8_t lwm);
+int mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev, uint16_t *rx_queue_id,
+			    uint8_t *lwm);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
-- 
1.8.3.1



* [PATCH v4 6/7] net/mlx5: add private API to config host port shaper
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
                             ` (4 preceding siblings ...)
  2022-06-03 12:48           ` [PATCH v4 5/7] net/mlx5: support Rx queue based fill threshold Spike Du
@ 2022-06-03 12:48           ` Spike Du
  2022-06-03 14:55             ` Ray Kinsella
  2022-06-03 12:48           ` [PATCH v4 7/7] app/testpmd: add Host Shaper command Spike Du
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
  7 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Shahaf Shuler, Ray Kinsella,
	Neil Horman
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to enable this function only when
libmtcr_ul is available.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper can configure the shaper rate and the lwm-triggered flag for
a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives a fill threshold event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
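
Editor's note: the flag and rate handling visible in
rte_pmd_mlx5_host_shaper_config() below can be summarised with a small
standalone model. This is a sketch only: model_host_shaper_config(),
struct model_shaper_state and MODEL_FLAG_FILL_THRESH_TRIGGERED are
hypothetical names, the flag's bit position (0) is an assumption, and the
register write done through libmtcr_ul is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit position; the real flag is defined in rte_pmd_mlx5.h. */
#define MODEL_FLAG_FILL_THRESH_TRIGGERED 0

struct model_shaper_state {
	uint8_t host_shaper_rate;	/* immediate rate, unit is 100Mbps */
	bool lwm_triggered;		/* deferred (event-triggered) mode */
};

static int
model_host_shaper_config(struct model_shaper_state *st, uint8_t rate,
			 uint32_t flags)
{
	bool deferred;

	assert(st != NULL);
	deferred = !!(flags &
		      (UINT32_C(1) << MODEL_FLAG_FILL_THRESH_TRIGGERED));
	if (!deferred) {
		st->host_shaper_rate = rate;	/* 0 disables the shaper */
		return 0;
	}
	switch (rate) {
	case 0:
		st->lwm_triggered = false;	/* disarm deferred mode */
		return 0;
	case 1:
		st->lwm_triggered = true;	/* 1 x 100Mbps only */
		return 0;
	default:
		return -ENOTSUP;
	}
}
```

The model shows why only rates 0 and 1 (that is, 0 and 100Mbps) are accepted
when the fill-threshold-triggered flag is set, while any rate is accepted in
immediate mode.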
 doc/guides/nics/mlx5.rst               |  35 +++++++++++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 +++++
 drivers/common/mlx5/mlx5_prm.h         |  25 ++++++++
 drivers/net/mlx5/mlx5.h                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 103 +++++++++++++++++++++++++++++++++
 drivers/net/mlx5/rte_pmd_mlx5.h        |  30 ++++++++++
 drivers/net/mlx5/version.map           |   2 +
 8 files changed, 211 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index ea393fb..39bfebb 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue fill threshold configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField-2.
+  - When configuring the host shaper with the MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED flag set,
+    only rates 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,31 @@ Fill threshold is a per Rx queue attribute, it should be configured as
 a percentage of the Rx queue size.
 When Rx queue fullness is above the threshold, an event is sent to PMD.
 
+Host shaper introduction
+------------------------
+
+The host shaper register is a per-host-port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from host.
+The host shaper has two modes for setting the shaper: immediate, and deferred
+to a fill threshold event trigger. In immediate mode, the rate limit is applied
+immediately to the host shaper. In deferred mode, the shaper
+is not set until a fill threshold event is received by any Rx queue in a VF
+representor belonging to the host port. The only rate supported for deferred
+mode is 100Mbps (there is no limit on the supported rates for immediate mode).
+In deferred mode, the shaper is set on the host port by the firmware upon
+receiving the fill threshold event, which allows throttling host traffic on
+fill threshold events at minimum latency, preventing excess drops in the
+Rx queue.
+
+Host shaper dependency for mstflint package
+-------------------------------------------
+
+In order to configure host shaper register, ``librte_net_mlx5`` depends on ``libmtcr_ul``
+which can be installed from OFED mstflint package.
+Meson detects ``libmtcr_ul`` existence at configure stage.
+If the library is detected, the application must link with ``-lmtcr_ul``,
+as done by the pkg-config file libdpdk.pc.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 62a8874..eaf074c 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -90,6 +90,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue fill threshold support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b..51c6e5d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
     ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+    libmtcr_ul_found = true
+    ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
         [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
             'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+    has_sym_args += [
+        [  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+            'mopen'],
+    ]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
     config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3b5e605..92d05a7 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3771,6 +3771,7 @@ enum {
 	MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
 	MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
 	MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC005,
+	MLX5_QSHR_REGISTER_ID = 0x4030,
 };
 
 struct mlx5_ifc_register_mtutc_bits {
@@ -3785,6 +3786,30 @@ struct mlx5_ifc_register_mtutc_bits {
 	u8 time_adjustment[0x20];
 };
 
+struct mlx5_ifc_ets_global_config_register_bits {
+	u8 reserved_at_0[0x2];
+	u8 rate_limit_update[0x1];
+	u8 reserved_at_3[0x29];
+	u8 max_bw_units[0x4];
+	u8 reserved_at_48[0x8];
+	u8 max_bw_value[0x8];
+};
+
+#define ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED      0x0
+#define ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS 0x3
+#define ETS_GLOBAL_CONFIG_BW_UNIT_GBPS          0x4
+
+struct mlx5_ifc_register_qshr_bits {
+	u8 reserved_at_0[0x4];
+	u8 connected_host[0x1];
+	u8 vqos[0x1];
+	u8 fast_response[0x1];
+	u8 reserved_at_7[0x1];
+	u8 local_port[0x8];
+	u8 reserved_at_16[0x230];
+	struct mlx5_ifc_ets_global_config_register_bits global_config;
+};
+
 #define MLX5_MTUTC_TIMESTAMP_MODE_INTERNAL_TIMER 0
 #define MLX5_MTUTC_TIMESTAMP_MODE_REAL_TIME 1
 
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index a76f2fe..8af84ae 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1271,6 +1271,8 @@ struct mlx5_dev_ctx_shared {
 	void *devx_channel_lwm;
 	struct rte_intr_handle *intr_handle_lwm;
 	pthread_mutex_t lwm_config_lock;
+	uint32_t host_shaper_rate:8;
+	uint32_t lwm_triggered:1;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 4099496..b908bf7 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -28,6 +28,9 @@
 #include "mlx5_rxtx.h"
 #include "mlx5_devx.h"
 #include "mlx5_rx.h"
+#ifdef HAVE_MLX5_MSTFLINT
+#include <mstflint/mtcr.h>
+#endif
 
 
 static __rte_always_inline uint32_t
@@ -1376,3 +1379,103 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	return ret;
 }
 
+/**
+ * Mlx5 access register function to configure host shaper.
+ * It calls API in libmtcr_ul to access QSHR (QoS Shaper Host Register)
+ * in firmware.
+ *
+ * @param dev
+ *   Pointer to rte_eth_dev.
+ * @param lwm_triggered
+ *   Flag to enable/disable lwm_triggered bit in QSHR.
+ * @param rate
+ *   Host shaper rate, unit is 100Mbps, set to 0 means disable the shaper.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+static int
+mlxreg_host_shaper_config(struct rte_eth_dev *dev,
+			  bool lwm_triggered, uint8_t rate)
+{
+#ifdef HAVE_MLX5_MSTFLINT
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint32_t data[MLX5_ST_SZ_DW(register_qshr)] = {0};
+	int rc, retry_count = 3;
+	mfile *mf = NULL;
+	int status;
+	void *ptr;
+
+	mf = mopen(priv->sh->ibdev_name);
+	if (!mf) {
+		DRV_LOG(WARNING, "mopen failed");
+		rte_errno = ENOENT;
+		return -rte_errno;
+	}
+	MLX5_SET(register_qshr, data, connected_host, 1);
+	MLX5_SET(register_qshr, data, fast_response, lwm_triggered ? 1 : 0);
+	MLX5_SET(register_qshr, data, local_port, 1);
+	ptr = MLX5_ADDR_OF(register_qshr, data, global_config);
+	MLX5_SET(ets_global_config_register, ptr, rate_limit_update, 1);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_units,
+		 rate ? ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS :
+		 ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_value, rate);
+	do {
+		rc = maccess_reg(mf,
+				 MLX5_QSHR_REGISTER_ID,
+				 MACCESS_REG_METHOD_SET,
+				 (u_int32_t *)&data[0],
+				 sizeof(data),
+				 sizeof(data),
+				 sizeof(data),
+				 &status);
+		if ((rc != ME_ICMD_STATUS_IFC_BUSY &&
+		     status != ME_REG_ACCESS_BAD_PARAM) ||
+		    !(mf->flags & MDEVS_REM)) {
+			break;
+		}
+		DRV_LOG(WARNING, "%s retry.", __func__);
+		usleep(10000);
+	} while (retry_count-- > 0);
+	mclose(mf);
+	rte_errno = (rc == ME_REG_ACCESS_DEV_BUSY) ? EBUSY : EIO;
+	return rc ? -rte_errno : 0;
+#else
+	(void)dev;
+	(void)lwm_triggered;
+	(void)rate;
+	return -1;
+#endif
+}
+
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate,
+				    uint32_t flags)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	struct mlx5_priv *priv = dev->data->dev_private;
+	bool lwm_triggered =
+	     !!(flags & RTE_BIT32(MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED));
+
+	if (!lwm_triggered) {
+		priv->sh->host_shaper_rate = rate;
+	} else {
+		switch (rate) {
+		case 0:
+		/* Rate 0 means disable lwm_triggered. */
+			priv->sh->lwm_triggered = 0;
+			break;
+		case 1:
+		/* Rate 1 means enable lwm_triggered. */
+			priv->sh->lwm_triggered = 1;
+			break;
+		default:
+			return -ENOTSUP;
+		}
+	}
+	return mlxreg_host_shaper_config(dev, priv->sh->lwm_triggered,
+					 priv->sh->host_shaper_rate);
+}
diff --git a/drivers/net/mlx5/rte_pmd_mlx5.h b/drivers/net/mlx5/rte_pmd_mlx5.h
index 6e7907e..d7cd149 100644
--- a/drivers/net/mlx5/rte_pmd_mlx5.h
+++ b/drivers/net/mlx5/rte_pmd_mlx5.h
@@ -109,6 +109,36 @@ int rte_pmd_mlx5_external_rx_queue_id_map(uint16_t port_id, uint16_t dpdk_idx,
 int rte_pmd_mlx5_external_rx_queue_id_unmap(uint16_t port_id,
 					    uint16_t dpdk_idx);
 
+/**
+ * The rate of the host port shaper will be updated directly at the next
+ * fill threshold event to the rate that comes with this flag set; set rate 0
+ * to disable this rate update.
+ * Unset this flag to update the rate of the host port shaper directly in
+ * the API call; use rate 0 to disable the current shaper.
+ */
+#define MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED 0
+
+/**
+ * Configure a HW shaper to limit Tx rate for a host port.
+ * The configuration will affect all the ethdev ports belonging to
+ * the same rte_device.
+ *
+ * @param[in] port_id
+ *   The port identifier of the Ethernet device.
+ * @param[in] rate
+ *   Unit is 100Mbps, setting the rate to 0 disables the shaper.
+ * @param[in] flags
+ *   Host shaper flags.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+__rte_experimental
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate, uint32_t flags);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/mlx5/version.map b/drivers/net/mlx5/version.map
index 79cb79a..c97dfe4 100644
--- a/drivers/net/mlx5/version.map
+++ b/drivers/net/mlx5/version.map
@@ -12,4 +12,6 @@ EXPERIMENTAL {
 	# added in 22.03
 	rte_pmd_mlx5_external_rx_queue_id_map;
 	rte_pmd_mlx5_external_rx_queue_id_unmap;
+	# added in 22.07
+	rte_pmd_mlx5_host_shaper_config;
 };
-- 
1.8.3.1
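
For readers following the flag semantics of rte_pmd_mlx5_host_shaper_config()
above, here is a minimal standalone sketch of how the two cached fields appear
to be updated. It is a model only, not the driver code: struct shaper_state,
shaper_state_update() and FLAG_FILL_THRESH_TRIGGERED_BIT are hypothetical names
that mirror host_shaper_rate, lwm_triggered and
MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED, and the QSHR register write is
left out entirely.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED (a bit index). */
#define FLAG_FILL_THRESH_TRIGGERED_BIT 0

/* Hypothetical stand-in for the two fields added to mlx5_dev_ctx_shared. */
struct shaper_state {
	uint8_t host_shaper_rate; /* Unit is 100Mbps, 0 disables the shaper. */
	bool lwm_triggered;       /* Arm the automatic shaper on threshold events. */
};

/*
 * Model of the state update only.
 * Flag clear: "rate" is the new shaper rate, the trigger is untouched.
 * Flag set: "rate" is reused as an on/off switch (0 or 1) for the trigger,
 * and the cached shaper rate is untouched.
 */
static int
shaper_state_update(struct shaper_state *st, uint8_t rate, uint32_t flags)
{
	bool triggered =
		!!(flags & (UINT32_C(1) << FLAG_FILL_THRESH_TRIGGERED_BIT));

	if (!triggered) {
		st->host_shaper_rate = rate;
		return 0;
	}
	if (rate > 1)
		return -ENOTSUP;
	st->lwm_triggered = (rate == 1);
	return 0;
}
```

In this model, a call with the flag set and rate 1 arms the trigger without
changing the cached rate, while a call with flags 0 and rate 50 caches a 5Gbps
shaper and leaves the trigger as it was.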


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH v4 7/7] app/testpmd: add Host Shaper command
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
                             ` (5 preceding siblings ...)
  2022-06-03 12:48           ` [PATCH v4 6/7] net/mlx5: add private API to config host port shaper Spike Du
@ 2022-06-03 12:48           ` Spike Du
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
  7 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-03 12:48 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Wenzhuo Lu, Beilei Xing,
	Bernard Iremonger, Shahaf Shuler
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Add command line options to support host shaper configuration.
- Command syntax:
  mlx5 set port <port_id> host_shaper fill_thresh_triggered <0|1> rate <rate_num>

- Example commands:
To enable fill_thresh_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 1 rate 0

To disable fill_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable fill_thresh_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 50

Add sample code to handle the rxq fill_thresh event: it waits briefly so
that the rxq drains, then disables the host shaper and rearms the event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
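Note on the rate unit: since the rate argument is a uint8_t counted in steps
of 100Mbps, the example "rate 50" corresponds to 5Gbps and the largest
expressible value is 255, i.e. 25.5Gbps. A minimal sketch of the conversion
follows. mbps_to_shaper_rate() and SHAPER_RATE_UNIT_MBPS are hypothetical
names used only for illustration; they are not part of this patch or of DPDK.

```c
#include <errno.h>
#include <stdint.h>

#define SHAPER_RATE_UNIT_MBPS 100u /* One rate step, per the commit message. */

/*
 * Hypothetical helper: convert a bandwidth in Mbps to the uint8_t rate
 * argument. Returns 0 on success, -EINVAL if mbps is not a multiple of
 * 100, -ERANGE if the result does not fit in 8 bits.
 */
static int
mbps_to_shaper_rate(uint32_t mbps, uint8_t *rate)
{
	if (mbps % SHAPER_RATE_UNIT_MBPS != 0)
		return -EINVAL;
	if (mbps / SHAPER_RATE_UNIT_MBPS > UINT8_MAX)
		return -ERANGE;
	*rate = (uint8_t)(mbps / SHAPER_RATE_UNIT_MBPS);
	return 0;
}
```

A result of 0 keeps the meaning it has in the commands above: the current
shaper is disabled.
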
 app/test-pmd/testpmd.c          |   6 ++
 doc/guides/nics/mlx5.rst        |  46 +++++++++
 drivers/net/mlx5/meson.build    |   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 201 ++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h |  26 ++++++
 5 files changed, 283 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 1209230..babbc94 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include <rte_eth_bond.h>
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3663,6 +3666,9 @@ struct pmd_test_command {
 				break;
 			printf("Received fill_thresh event, port:%d rxq_id:%d\n",
 			       port_id, rxq_id);
+#ifdef RTE_NET_MLX5
+			mlx5_test_fill_thresh_event_handler(port_id, rxq_id);
+#endif
 		}
 		break;
 	default:
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 39bfebb..cdeeef3 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use fill threshold and Host Shaper
+-----------------------------------------
+
+Testpmd provides sample commands to configure the fill threshold and
+sample logic to handle the fill threshold event.
+In the typical workflow, testpmd configures the fill threshold for Rx queues,
+enables fill_thresh_triggered in the host shaper and registers a callback.
+When traffic from the host is too high and Rx queue fullness rises above the
+fill threshold, the PMD receives an event and firmware automatically configures
+a 100Mbps shaper on the host port. The PMD then calls the registered callback,
+which waits for the Rx queue to drain and then disables the host shaper.
+
+Assume a simple BlueField-2 setup: port 0 is the uplink and port 1
+is a VF representor. Each port has 2 Rx queues.
+To control traffic from host to ARM, enable the fill threshold in testpmd with:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 fill_thresh 70
+   testpmd> set port 1 rxq 1 fill_thresh 70
+
+The first command disables the current host shaper and enables fill threshold triggered mode.
+The remaining commands set the fill threshold to 70% of the Rx queue size for both Rx queues.
+When traffic from the host is too high, the testpmd console logs the received
+fill threshold event, and the host shaper is subsequently disabled.
+The traffic rate from the host is thus controlled and fewer drops occur in the Rx queues.
+
+To disable fill threshold and fill_thresh_triggered, invoke the following commands in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 fill_thresh 0
+   testpmd> set port 1 rxq 1 fill_thresh 0
+
+An application that has enabled fill threshold and fill_thresh_triggered
+should disable them before exiting.
+
+The shaper can also be configured with an explicit rate in units of 100Mbps. The
+command below sets the current shaper to 5Gbps and disables fill_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..941642b 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,8 @@ if get_option('buildtype').contains('debug')
 else
     cflags += [ '-UPEDANTIC' ]
 endif
+
+testpmd_sources += files('mlx5_testpmd.c')
+testpmd_drivers_deps += 'net_mlx5'
+
 subdir(exec_env)
diff --git a/drivers/net/mlx5/mlx5_testpmd.c b/drivers/net/mlx5/mlx5_testpmd.c
new file mode 100644
index 0000000..7f7f4d8
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.c
@@ -0,0 +1,201 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#include <rte_prefetch.h>
+#include <rte_common.h>
+#include <rte_branch_prediction.h>
+#include <rte_ether.h>
+#include <rte_alarm.h>
+#include <rte_pmd_mlx5.h>
+#include <rte_ethdev.h>
+#include "mlx5_testpmd.h"
+#include "testpmd.h"
+
+static uint8_t host_shaper_fill_thresh_triggered[RTE_MAX_ETHPORTS];
+#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
+
+/**
+ * Disable the host shaper and re-arm fill threshold event.
+ *
+ * @param[in] args
+ *   uint32_t integer combining port_id and rxq_id.
+ */
+static void
+mlx5_test_host_shaper_disable(void *args)
+{
+	uint32_t port_rxq_id = (uint32_t)(uintptr_t)args;
+	uint16_t port_id = port_rxq_id & 0xffff;
+	uint16_t qid = (port_rxq_id >> 16) & 0xffff;
+	struct rte_eth_rxq_info qinfo;
+
+	printf("%s disable shaper\n", __func__);
+	if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
+		printf("rx_queue_info_get returns error\n");
+		return;
+	}
+	/* Rearm the fill threshold event. */
+	if (rte_eth_rx_fill_thresh_set(port_id, qid, qinfo.fill_thresh)) {
+		printf("config fill_thresh returns error\n");
+		return;
+	}
+	/* Only disable the shaper when fill_thresh_triggered is set. */
+	if (host_shaper_fill_thresh_triggered[port_id] &&
+	    rte_pmd_mlx5_host_shaper_config(port_id, 0, 0))
+		printf("%s disable shaper returns error\n", __func__);
+}
+
+void
+mlx5_test_fill_thresh_event_handler(uint16_t port_id, uint16_t rxq_id)
+{
+	uint32_t port_rxq_id = port_id | ((uint32_t)rxq_id << 16);
+
+	rte_eal_alarm_set(SHAPER_DISABLE_DELAY_US,
+			  mlx5_test_host_shaper_disable,
+			  (void *)(uintptr_t)port_rxq_id);
+	printf("%s port_id:%u rxq_id:%u\n", __func__, port_id, rxq_id);
+}
+
+/**
+ * Configure host shaper's fill_thresh_triggered and current rate.
+ *
+ * @param[in] fill_thresh_triggered
+ *   Disable/enable fill_thresh_triggered.
+ * @param[in] rate
+ *   Configure current host shaper rate.
+ * @return
+ *   On success, returns 0.
+ *   On failure, returns < 0.
+ */
+static int
+mlx5_test_set_port_host_shaper(uint16_t port_id, uint16_t fill_thresh_triggered, uint8_t rate)
+{
+	struct rte_eth_link link;
+	bool port_id_valid = false;
+	uint16_t pid;
+	int ret;
+
+	RTE_ETH_FOREACH_DEV(pid)
+		if (port_id == pid) {
+			port_id_valid = true;
+			break;
+		}
+	if (!port_id_valid)
+		return -EINVAL;
+	ret = rte_eth_link_get_nowait(port_id, &link);
+	if (ret < 0)
+		return ret;
+	host_shaper_fill_thresh_triggered[port_id] = fill_thresh_triggered ? 1 : 0;
+	if (!fill_thresh_triggered) {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 0,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED));
+	} else {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 1,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED));
+	}
+	if (ret)
+		return ret;
+	ret = rte_pmd_mlx5_host_shaper_config(port_id, rate, 0);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+/* *** SET HOST_SHAPER FOR A PORT *** */
+struct cmd_port_host_shaper_result {
+	cmdline_fixed_string_t mlx5;
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t host_shaper;
+	cmdline_fixed_string_t fill_thresh_triggered;
+	uint16_t fr;
+	cmdline_fixed_string_t rate;
+	uint8_t rate_num;
+};
+
+static void cmd_port_host_shaper_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_port_host_shaper_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->mlx5, "mlx5") == 0) &&
+	    (strcmp(res->set, "set") == 0) &&
+	    (strcmp(res->port, "port") == 0) &&
+	    (strcmp(res->host_shaper, "host_shaper") == 0) &&
+	    (strcmp(res->fill_thresh_triggered, "fill_thresh_triggered") == 0) &&
+	    (strcmp(res->rate, "rate") == 0))
+		ret = mlx5_test_set_port_host_shaper(res->port_num, res->fr,
+					   res->rate_num);
+	if (ret < 0)
+		printf("cmd_port_host_shaper error: (%s)\n", strerror(-ret));
+}
+
+cmdline_parse_token_string_t cmd_port_host_shaper_mlx5 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				mlx5, "mlx5");
+cmdline_parse_token_string_t cmd_port_host_shaper_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_port_host_shaper_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_port_host_shaper_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_host_shaper =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 host_shaper, "host_shaper");
+cmdline_parse_token_string_t cmd_port_host_shaper_fill_thresh_triggered =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 fill_thresh_triggered, "fill_thresh_triggered");
+cmdline_parse_token_num_t cmd_port_host_shaper_fr =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      fr, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_rate =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 rate, "rate");
+cmdline_parse_token_num_t cmd_port_host_shaper_rate_num =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      rate_num, RTE_UINT8);
+cmdline_parse_inst_t mlx5_test_cmd_port_host_shaper = {
+	.f = cmd_port_host_shaper_parsed,
+	.data = (void *)0,
+	.help_str = "mlx5 set port <port_id> host_shaper fill_thresh_triggered <0|1> "
+	"rate <rate_num>: Set HOST_SHAPER fill_thresh_triggered and rate with port_id",
+	.tokens = {
+		(void *)&cmd_port_host_shaper_mlx5,
+		(void *)&cmd_port_host_shaper_set,
+		(void *)&cmd_port_host_shaper_port,
+		(void *)&cmd_port_host_shaper_portnum,
+		(void *)&cmd_port_host_shaper_host_shaper,
+		(void *)&cmd_port_host_shaper_fill_thresh_triggered,
+		(void *)&cmd_port_host_shaper_fr,
+		(void *)&cmd_port_host_shaper_rate,
+		(void *)&cmd_port_host_shaper_rate_num,
+		NULL,
+	}
+};
+
+struct testpmd_driver_commands mlx5_driver_cmds = {
+	.commands = {
+		{
+			.ctx = &mlx5_test_cmd_port_host_shaper,
+			.help = "mlx5 set port (port_id) host_shaper fill_thresh_triggered (0|1) "
+			"rate (rate_num):\n"
+			"    Set HOST_SHAPER fill_thresh_triggered and rate with port_id\n\n",
+		},
+		{
+			.ctx = NULL,
+		},
+	}
+};
+TESTPMD_ADD_DRIVER_COMMANDS(mlx5_driver_cmds);
+
diff --git a/drivers/net/mlx5/mlx5_testpmd.h b/drivers/net/mlx5/mlx5_testpmd.h
new file mode 100644
index 0000000..e89b4c8
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_TEST_H_
+#define RTE_PMD_MLX5_TEST_H_
+
+#include <cmdline_parse.h>
+#include <cmdline_parse_num.h>
+#include <cmdline_parse_string.h>
+
+/**
+ * RTE_ETH_EVENT_RX_FILL_THRESH handler sample code.
+ * It is called from testpmd. The workflow is to wait until the
+ * Rx queue is empty, then disable the host shaper.
+ *
+ * @param[in] port_id
+ *   Port identifier.
+ * @param[in] rxq_id
+ *   Rx queue identifier.
+ */
+void
+mlx5_test_fill_thresh_event_handler(uint16_t port_id, uint16_t rxq_id);
+
+#endif
-- 
1.8.3.1
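
The alarm callback in mlx5_testpmd.c above receives a single void * argument,
so the handler folds port_id and rxq_id into one 32-bit value and the callback
splits it again. A minimal standalone sketch of that round trip follows.
pack_port_rxq() and unpack_port_rxq() are hypothetical names, not functions
from the patch, and the sketch assumes pointers are at least 32 bits wide.

```c
#include <stdint.h>

/* Hypothetical helper: port_id in the low 16 bits, rxq_id in the high 16. */
static inline void *
pack_port_rxq(uint16_t port_id, uint16_t rxq_id)
{
	/* Widen before shifting so that rxq_id >= 0x8000 cannot overflow int. */
	uint32_t v = (uint32_t)port_id | ((uint32_t)rxq_id << 16);

	return (void *)(uintptr_t)v;
}

/* Hypothetical helper: inverse of pack_port_rxq(). */
static inline void
unpack_port_rxq(void *arg, uint16_t *port_id, uint16_t *rxq_id)
{
	uint32_t v = (uint32_t)(uintptr_t)arg;

	*port_id = (uint16_t)(v & 0xffff);
	*rxq_id = (uint16_t)((v >> 16) & 0xffff);
}
```

Passing the IDs by value inside the pointer avoids allocating a context
object whose lifetime would have to outlive the 100ms alarm delay.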


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 2/7] common/mlx5: share interrupt management
  2022-06-03 12:48           ` [PATCH v4 2/7] common/mlx5: share interrupt management Spike Du
@ 2022-06-03 14:30             ` Ray Kinsella
  0 siblings, 0 replies; 131+ messages in thread
From: Ray Kinsella @ 2022-06-03 14:30 UTC (permalink / raw)
  To: Spike Du
  Cc: matan, viacheslavo, orika, thomas, Shahaf Shuler, Neil Horman,
	andrew.rybchenko, stephen, mb, dev, rasland


Spike Du <spiked@nvidia.com> writes:

> There is a lot of duplicated code for creating and initializing rte_intr_handle.
> Add a new mlx5_os API to do this and replace all PMD-related code with this
> API.
>
> Signed-off-by: Spike Du <spiked@nvidia.com>
> ---
>  drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++++++++++++
>  drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
>  drivers/common/mlx5/version.map              |   2 +
>  drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++++
>  drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 --------------
>  drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++++---------------------
>  drivers/net/mlx5/linux/mlx5_socket.c         |  53 ++---------
>  drivers/net/mlx5/mlx5.h                      |   2 -
>  drivers/net/mlx5/mlx5_txpp.c                 |  28 ++----
>  drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 -----
>  drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 ++--------
>  11 files changed, 217 insertions(+), 307 deletions(-)
>

Acked-by: Ray Kinsella <mdr@ashroe.eu>

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-03 12:48           ` [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold Spike Du
@ 2022-06-03 14:30             ` Ray Kinsella
  2022-06-04 12:46             ` Andrew Rybchenko
  2022-06-06 15:49             ` Stephen Hemminger
  2 siblings, 0 replies; 131+ messages in thread
From: Ray Kinsella @ 2022-06-03 14:30 UTC (permalink / raw)
  To: Spike Du
  Cc: matan, viacheslavo, orika, thomas, Wenzhuo Lu, Beilei Xing,
	Bernard Iremonger, Neil Horman, andrew.rybchenko, stephen, mb,
	dev, rasland


Spike Du <spiked@nvidia.com> writes:

> Fill threshold describes the fullness of a Rx queue. If the Rx
> queue fullness is above the threshold, the device will trigger the event
> RTE_ETH_EVENT_RX_FILL_THRESH.
> Fill threshold is defined as a percentage of Rx queue size with valid
> value of [0,99].
> Setting fill threshold to 0 means disable it, which is the default.
> Add fill threshold configuration and query driver callbacks in eth_dev_ops.
> Add command line options to support fill_thresh per-rxq configure.
> - Command syntax:
>   set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
>
> - Example commands:
> To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
> testpmd> set port 1 rxq 0 fill_thresh 30
>
> To disable fill_thresh on port 1 rxq 0:
> testpmd> set port 1 rxq 0 fill_thresh 0
>
> Signed-off-by: Spike Du <spiked@nvidia.com>
> ---
>  app/test-pmd/cmdline.c     | 68 +++++++++++++++++++++++++++++++++++++++++++
>  app/test-pmd/config.c      | 21 ++++++++++++++
>  app/test-pmd/testpmd.c     | 18 ++++++++++++
>  app/test-pmd/testpmd.h     |  2 ++
>  lib/ethdev/ethdev_driver.h | 22 ++++++++++++++
>  lib/ethdev/rte_ethdev.c    | 52 +++++++++++++++++++++++++++++++++
>  lib/ethdev/rte_ethdev.h    | 72 ++++++++++++++++++++++++++++++++++++++++++++++
>  lib/ethdev/version.map     |  2 ++
>  8 files changed, 257 insertions(+)
>
> diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
> index 0410bad..918581e 100644
> --- a/app/test-pmd/cmdline.c
> +++ b/app/test-pmd/cmdline.c
> @@ -17823,6 +17823,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
>  	}
>  };
>  
> +/* *** SET FILL THRESHOLD FOR A RXQ OF A PORT *** */
> +struct cmd_rxq_fill_thresh_result {
> +	cmdline_fixed_string_t set;
> +	cmdline_fixed_string_t port;
> +	uint16_t port_num;
> +	cmdline_fixed_string_t rxq;
> +	uint16_t rxq_num;
> +	cmdline_fixed_string_t fill_thresh;
> +	uint16_t fill_thresh_num;
> +};
> +
> +static void cmd_rxq_fill_thresh_parsed(void *parsed_result,
> +		__rte_unused struct cmdline *cl,
> +		__rte_unused void *data)
> +{
> +	struct cmd_rxq_fill_thresh_result *res = parsed_result;
> +	int ret = 0;
> +
> +	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
> +	    && (strcmp(res->rxq, "rxq") == 0)
> +	    && (strcmp(res->fill_thresh, "fill_thresh") == 0))
> +		ret = set_rxq_fill_thresh(res->port_num, res->rxq_num,
> +				  res->fill_thresh_num);
> +	if (ret < 0)
> +		printf("rxq_fill_thresh_cmd error: (%s)\n", strerror(-ret));
> +
> +}
> +
> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_set =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				set, "set");
> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_port =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				port, "port");
> +cmdline_parse_token_num_t cmd_rxq_fill_thresh_portnum =
> +	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				port_num, RTE_UINT16);
> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_rxq =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				rxq, "rxq");
> +cmdline_parse_token_num_t cmd_rxq_fill_thresh_rxqnum =
> +	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				rxq_num, RTE_UINT16);
> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_fill_thresh =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				fill_thresh, "fill_thresh");
> +cmdline_parse_token_num_t cmd_rxq_fill_thresh_fill_threshnum =
> +	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				fill_thresh_num, RTE_UINT16);
> +
> +cmdline_parse_inst_t cmd_rxq_fill_thresh = {
> +	.f = cmd_rxq_fill_thresh_parsed,
> +	.data = (void *)0,
> +	.help_str = "set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>: "
> +		"Set fill_thresh for rxq on port_id",
> +	.tokens = {
> +		(void *)&cmd_rxq_fill_thresh_set,
> +		(void *)&cmd_rxq_fill_thresh_port,
> +		(void *)&cmd_rxq_fill_thresh_portnum,
> +		(void *)&cmd_rxq_fill_thresh_rxq,
> +		(void *)&cmd_rxq_fill_thresh_rxqnum,
> +		(void *)&cmd_rxq_fill_thresh_fill_thresh,
> +		(void *)&cmd_rxq_fill_thresh_fill_threshnum,
> +		NULL,
> +	},
> +};
> +
>  /* ******************************************************************************** */
>  
>  /* list of instructions */
> @@ -18110,6 +18177,7 @@ struct cmd_show_port_flow_transfer_proxy_result {
>  	(cmdline_parse_inst_t *)&cmd_show_capability,
>  	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
>  	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
> +	(cmdline_parse_inst_t *)&cmd_rxq_fill_thresh,
>  	NULL,
>  };
>  
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> index 1b1e738..d0c519b 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -6342,3 +6342,24 @@ struct igb_ring_desc_16_bytes {
>  		printf("  %s\n", buf);
>  	}
>  }
> +
> +int
> +set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx, uint16_t fill_thresh)
> +{
> +	struct rte_eth_link link;
> +	int ret;
> +
> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> +		return -EINVAL;
> +	ret = eth_link_get_nowait_print_err(port_id, &link);
> +	if (ret < 0)
> +		return -EINVAL;
> +	if (fill_thresh > 99)
> +		return -EINVAL;
> +	ret = rte_eth_rx_fill_thresh_set(port_id, queue_idx, fill_thresh);
> +
> +	if (ret)
> +		return ret;
> +	return 0;
> +}
> +
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 767765d..1209230 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -420,6 +420,7 @@ struct fwd_engine * fwd_engines[] = {
>  	[RTE_ETH_EVENT_NEW] = "device probed",
>  	[RTE_ETH_EVENT_DESTROY] = "device released",
>  	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
> +	[RTE_ETH_EVENT_RX_FILL_THRESH] = "rxq fill threshold reached",
>  	[RTE_ETH_EVENT_MAX] = NULL,
>  };
>  
> @@ -3616,6 +3617,10 @@ struct pmd_test_command {
>  eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
>  		  void *ret_param)
>  {
> +	struct rte_eth_dev_info dev_info;
> +	uint16_t rxq_id;
> +	uint8_t fill_thresh;
> +	int ret;
>  	RTE_SET_USED(param);
>  	RTE_SET_USED(ret_param);
>  
> @@ -3647,6 +3652,19 @@ struct pmd_test_command {
>  		ports[port_id].port_status = RTE_PORT_CLOSED;
>  		printf("Port %u is closed\n", port_id);
>  		break;
> +	case RTE_ETH_EVENT_RX_FILL_THRESH:
> +		ret = rte_eth_dev_info_get(port_id, &dev_info);
> +		if (ret != 0)
> +			break;
> +		/* fill_thresh query API rewinds rxq_id, no need to check max rxq num. */
> +		for (rxq_id = 0; ; rxq_id++) {
> +			ret = rte_eth_rx_fill_thresh_query(port_id, &rxq_id, &fill_thresh);
> +			if (ret <= 0)
> +				break;
> +			printf("Received fill_thresh event, port:%d rxq_id:%d\n",
> +			       port_id, rxq_id);
> +		}
> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> index 78a5f4e..c7a144e 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -1173,6 +1173,8 @@ uint16_t tx_pkt_set_dynf(uint16_t port_id, __rte_unused uint16_t queue,
>  void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
>  void flex_item_destroy(portid_t port_id, uint16_t flex_id);
>  void port_flex_item_flush(portid_t port_id);
> +int set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx,
> +			uint16_t fill_thresh);
>  
>  extern int flow_parse(const char *src, void *result, unsigned int size,
>  		      struct rte_flow_attr **attr,
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc2..7ef7dba 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
>  				    const struct rte_eth_rxconf *rx_conf,
>  				    struct rte_mempool *mb_pool);
>  
> +/**
> + * @internal Set Rx queue fill threshold.
> + * @see rte_eth_rx_fill_thresh_set()
> + */
> +typedef int (*eth_rx_queue_fill_thresh_set_t)(struct rte_eth_dev *dev,
> +				      uint16_t rx_queue_id,
> +				      uint8_t fill_thresh);
> +
> +/**
> + * @internal Query queue fill threshold event.
> + * @see rte_eth_rx_fill_thresh_query()
> + */
> +
> +typedef int (*eth_rx_queue_fill_thresh_query_t)(struct rte_eth_dev *dev,
> +					uint16_t *rx_queue_id,
> +					uint8_t *fill_thresh);
> +
>  /** @internal Setup a transmit queue of an Ethernet device. */
>  typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
>  				    uint16_t tx_queue_id,
> @@ -1168,6 +1185,11 @@ struct eth_dev_ops {
>  	/** Priority flow control queue configure */
>  	priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
>  
> +	/** Set Rx queue fill threshold. */
> +	eth_rx_queue_fill_thresh_set_t rx_queue_fill_thresh_set;
> +	/** Query Rx queue fill threshold event. */
> +	eth_rx_queue_fill_thresh_query_t rx_queue_fill_thresh_query;
> +
>  	/** Set Unicast Table Array */
>  	eth_uc_hash_table_set_t    uc_hash_table_set;
>  	/** Set Unicast hash bitmap */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index a175867..69a1f75 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -4424,6 +4424,58 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
>  							queue_idx, tx_rate));
>  }
>  
> +int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
> +			       uint8_t fill_thresh)
> +{
> +	struct rte_eth_dev *dev;
> +	struct rte_eth_dev_info dev_info;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	if (queue_id >= dev_info.max_rx_queues) {
> +		RTE_ETHDEV_LOG(ERR,
> +			"Set queue fill thresh: port %u: invalid queue ID=%u.\n",
> +			port_id, queue_id);
> +		return -EINVAL;
> +	}
> +
> +	if (fill_thresh > 99)
> +		return -EINVAL;
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_fill_thresh_set, -ENOTSUP);
> +	return eth_err(port_id, (*dev->dev_ops->rx_queue_fill_thresh_set)(dev,
> +							     queue_id, fill_thresh));
> +}
> +
> +int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
> +				 uint8_t *fill_thresh)
> +{
> +	struct rte_eth_dev_info dev_info;
> +	struct rte_eth_dev *dev;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	if (queue_id == NULL)
> +		return -EINVAL;
> +	if (*queue_id >= dev_info.max_rx_queues)
> +		*queue_id = 0;
> +
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_fill_thresh_query, -ENOTSUP);
> +	return eth_err(port_id, (*dev->dev_ops->rx_queue_fill_thresh_query)(dev,
> +							     queue_id, fill_thresh));
> +}
> +
>  RTE_INIT(eth_dev_init_fp_ops)
>  {
>  	uint32_t i;
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04225bb..d44e5da 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1931,6 +1931,14 @@ struct rte_eth_rxq_info {
>  	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
>  	uint16_t nb_desc;           /**< configured number of RXDs. */
>  	uint16_t rx_buf_size;       /**< hardware receive buffer size. */
> +	/**
> +	 * Per-queue Rx fill threshold defined as percentage of Rx queue
> +	 * size. If Rx queue receives traffic higher than this percentage,
> +	 * the event RTE_ETH_EVENT_RX_FILL_THRESH is triggered.
> +	 * Value 0 means threshold monitoring is disabled, no event is
> +	 * triggered.
> +	 */
> +	uint8_t fill_thresh;
>  } __rte_cache_min_aligned;
>  
>  /**
> @@ -3672,6 +3680,65 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t port_id,
>   */
>  int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
>  
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Set Rx queue based fill threshold.
> + *
> + * @param port_id
> + *  The port identifier of the Ethernet device.
> + * @param queue_id
> + *  The index of the receive queue.
> + * @param fill_thresh
> + *  The fill threshold percentage of Rx queue size which describes
> + *  the fullness of Rx queue. If the Rx queue fullness is above it,
> + *  the device will trigger the event RTE_ETH_EVENT_RX_FILL_THRESH.
> + *  [1-99] to set a new fill threshold.
> + *  0 to disable threshold monitoring.
> + *
> + * @return
> + *   - 0 if successful.
> + *   - negative if failed.
> + */
> +__rte_experimental
> +int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
> +			       uint8_t fill_thresh);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Query Rx queue based fill threshold event.
> + * The function queries all queues in the port circularly until one
> + * pending fill_thresh event is found or no pending fill_thresh event is found.
> + *
> + * @param port_id
> + *  The port identifier of the Ethernet device.
> + * @param queue_id
> + *  The API caller sets the starting Rx queue id in the pointer.
> + *  If the queue_id is bigger than maximum queue id of the port,
> + *  it's rewinded to 0 so that application can keep calling
> + *  this function to handle all pending fill_thresh events in the queues
> + *  with a simple increment between calls.
> + *  If a Rx queue has pending fill_thresh event, the pointer is updated
> + *  with this Rx queue id; otherwise this pointer's content is
> + *  unchanged.
> + * @param fill_thresh
> + *  The pointer to the fill threshold percentage of Rx queue.
> + *  If Rx queue with pending fill_thresh event is found, the queue's fill_thresh
> + *  percentage is stored in this pointer, otherwise the pointer's
> + *  content is unchanged.
> + *
> + * @return
> + *   - 1 if a Rx queue with pending fill_thresh event is found.
> + *   - 0 if no Rx queue with pending fill_thresh event is found.
> + *   - -EINVAL if queue_id is NULL.
> + */
> +__rte_experimental
> +int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
> +				 uint8_t *fill_thresh);
> +
>  typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
>  		void *userdata);
>  
> @@ -3877,6 +3944,11 @@ enum rte_eth_event_type {
>  	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>  	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>  	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> +	/**
> +	 *  Fill threshold value is exceeded in a queue.
> +	 *  @see rte_eth_rx_fill_thresh_set()
> +	 */
> +	RTE_ETH_EVENT_RX_FILL_THRESH,
>  	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>  };
>  
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index daca785..29b1fe8 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -285,6 +285,8 @@ EXPERIMENTAL {
>  	rte_mtr_color_in_protocol_priority_get;
>  	rte_mtr_color_in_protocol_set;
>  	rte_mtr_meter_vlan_table_update;

Please add a "# added in 22.XX" comment here.

> +	rte_eth_rx_fill_thresh_set;
> +	rte_eth_rx_fill_thresh_query;
>  };
>  
>  INTERNAL {


-- 
Regards, Ray K

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 6/7] net/mlx5: add private API to config host port shaper
  2022-06-03 12:48           ` [PATCH v4 6/7] net/mlx5: add private API to config host port shaper Spike Du
@ 2022-06-03 14:55             ` Ray Kinsella
  0 siblings, 0 replies; 131+ messages in thread
From: Ray Kinsella @ 2022-06-03 14:55 UTC (permalink / raw)
  To: Spike Du
  Cc: matan, viacheslavo, orika, thomas, Shahaf Shuler, Neil Horman,
	andrew.rybchenko, stephen, mb, dev, rasland


Spike Du <spiked@nvidia.com> writes:

> Host port shaper can be configured with QSHR(QoS Shaper Host Register).
> Add check in build files to enable this function or not.
>
> The host shaper configuration affects all the ethdev ports belonging to the
> same host port.
>
> Host shaper can configure shaper rate and lwm-triggered for a host port.
> The shaper limits the rate of traffic from host port to wire port.
> If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
> when one of the host port's Rx queues receives fill threshold event.
>
> Signed-off-by: Spike Du <spiked@nvidia.com>
> ---
>  doc/guides/nics/mlx5.rst               |  35 +++++++++++
>  doc/guides/rel_notes/release_22_07.rst |   1 +
>  drivers/common/mlx5/linux/meson.build  |  13 +++++
>  drivers/common/mlx5/mlx5_prm.h         |  25 ++++++++
>  drivers/net/mlx5/mlx5.h                |   2 +
>  drivers/net/mlx5/mlx5_rx.c             | 103 +++++++++++++++++++++++++++++++++
>  drivers/net/mlx5/rte_pmd_mlx5.h        |  30 ++++++++++
>  drivers/net/mlx5/version.map           |   2 +
>  8 files changed, 211 insertions(+)
>

Acked-by: Ray Kinsella <mdr@ashroe.eu>

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-03 12:48           ` [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold Spike Du
  2022-06-03 14:30             ` Ray Kinsella
@ 2022-06-04 12:46             ` Andrew Rybchenko
  2022-06-06 13:16               ` Spike Du
  2022-06-06 15:49             ` Stephen Hemminger
  2 siblings, 1 reply; 131+ messages in thread
From: Andrew Rybchenko @ 2022-06-04 12:46 UTC (permalink / raw)
  To: Spike Du, matan, viacheslavo, orika, thomas, Wenzhuo Lu,
	Beilei Xing, Bernard Iremonger, Ray Kinsella, Neil Horman
  Cc: stephen, mb, dev, rasland

On 6/3/22 15:48, Spike Du wrote:
> Fill threshold describes the fullness of a Rx queue. If the Rx
> queue fullness is above the threshold, the device will trigger the event
> RTE_ETH_EVENT_RX_FILL_THRESH.

Sorry, I'm not sure that I understand. As far as I know, the process of
adding more Rx buffers to an Rx queue is called 'refill' in many drivers.
So fill level is a number (or percentage) of free buffers in an Rx queue.
If so, fill threshold should be a minimum fill level, and below that
level we should generate an event.

However, reading the first paragraph of the description, it looks like
you mean the opposite thing - a number (or percentage) of ready Rx
buffers with received packets.

I think that the term "fill threshold" was suggested by me, but I did it
with my understanding of the added feature. Now I'm confused.

Moreover, I don't understand how "fill threshold" could be in terms of
ready Rx buffers. HW simply doesn't know when ready Rx buffers are
processed by SW. So, HW can't say for sure how many ready Rx buffers are
pending. It could be calculated as Rx queue size minus number of free Rx
buffers, but it is imprecise. First of all, not all Rx descriptors could
be used. Second, HW ring size could differ from the queue size specified
in SW. Queue size specified in SW could just limit the maximum number of
free Rx buffers provided by the driver.

> Fill threshold is defined as a percentage of Rx queue size with valid
> value of [0,99].
> Setting fill threshold to 0 means disable it, which is the default.
> Add fill threshold configuration and query driver callbacks in eth_dev_ops.
> Add command line options to support fill_thresh per-rxq configure.
> - Command syntax:
>    set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
> 
> - Example commands:
> To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
> testpmd> set port 1 rxq 0 fill_thresh 30
> 
> To disable fill_thresh on port 1 rxq 0:
> testpmd> set port 1 rxq 0 fill_thresh 0
> 
> Signed-off-by: Spike Du <spiked@nvidia.com>
> ---
>   app/test-pmd/cmdline.c     | 68 +++++++++++++++++++++++++++++++++++++++++++
>   app/test-pmd/config.c      | 21 ++++++++++++++
>   app/test-pmd/testpmd.c     | 18 ++++++++++++
>   app/test-pmd/testpmd.h     |  2 ++
>   lib/ethdev/ethdev_driver.h | 22 ++++++++++++++
>   lib/ethdev/rte_ethdev.c    | 52 +++++++++++++++++++++++++++++++++
>   lib/ethdev/rte_ethdev.h    | 72 ++++++++++++++++++++++++++++++++++++++++++++++
>   lib/ethdev/version.map     |  2 ++
>   8 files changed, 257 insertions(+)
> 
> diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
> index 0410bad..918581e 100644
> --- a/app/test-pmd/cmdline.c
> +++ b/app/test-pmd/cmdline.c
> @@ -17823,6 +17823,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
>   	}
>   };
>   
> +/* *** SET FILL THRESHOLD FOR A RXQ OF A PORT *** */
> +struct cmd_rxq_fill_thresh_result {
> +	cmdline_fixed_string_t set;
> +	cmdline_fixed_string_t port;
> +	uint16_t port_num;
> +	cmdline_fixed_string_t rxq;
> +	uint16_t rxq_num;
> +	cmdline_fixed_string_t fill_thresh;
> +	uint16_t fill_thresh_num;

uint8_t to be consistent with ethdev

> +};
> +
> +static void cmd_rxq_fill_thresh_parsed(void *parsed_result,
> +		__rte_unused struct cmdline *cl,
> +		__rte_unused void *data)
> +{
> +	struct cmd_rxq_fill_thresh_result *res = parsed_result;
> +	int ret = 0;
> +
> +	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
> +	    && (strcmp(res->rxq, "rxq") == 0)
> +	    && (strcmp(res->fill_thresh, "fill_thresh") == 0))
> +		ret = set_rxq_fill_thresh(res->port_num, res->rxq_num,
> +				  res->fill_thresh_num);
> +	if (ret < 0)
> +		printf("rxq_fill_thresh_cmd error: (%s)\n", strerror(-ret));
> +
> +}
> +
> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_set =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				set, "set");
> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_port =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				port, "port");
> +cmdline_parse_token_num_t cmd_rxq_fill_thresh_portnum =
> +	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				port_num, RTE_UINT16);
> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_rxq =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				rxq, "rxq");
> +cmdline_parse_token_num_t cmd_rxq_fill_thresh_rxqnum =
> +	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				rxq_num, RTE_UINT8);

RTE_UINT16 since it is an Rx queue ID

> +cmdline_parse_token_string_t cmd_rxq_fill_thresh_fill_thresh =
> +	TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				fill_thresh, "fill_thresh");
> +cmdline_parse_token_num_t cmd_rxq_fill_thresh_fill_threshnum =
> +	TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> +				fill_thresh_num, RTE_UINT16);

RTE_UINT16 -> RTE_UINT8

> +
> +cmdline_parse_inst_t cmd_rxq_fill_thresh = {
> +	.f = cmd_rxq_fill_thresh_parsed,
> +	.data = (void *)0,
> +	.help_str = "set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>"
> +		"Set fill_thresh for rxq on port_id",
> +	.tokens = {
> +		(void *)&cmd_rxq_fill_thresh_set,
> +		(void *)&cmd_rxq_fill_thresh_port,
> +		(void *)&cmd_rxq_fill_thresh_portnum,
> +		(void *)&cmd_rxq_fill_thresh_rxq,
> +		(void *)&cmd_rxq_fill_thresh_rxqnum,
> +		(void *)&cmd_rxq_fill_thresh_fill_thresh,
> +		(void *)&cmd_rxq_fill_thresh_fill_threshnum,
> +		NULL,
> +	},
> +};

Please add the 'static' keyword to all of the above cmdline_parse_* variables.

> +
>   /* ******************************************************************************** */
>   
>   /* list of instructions */
> @@ -18110,6 +18177,7 @@ struct cmd_show_port_flow_transfer_proxy_result {
>   	(cmdline_parse_inst_t *)&cmd_show_capability,
>   	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
>   	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
> +	(cmdline_parse_inst_t *)&cmd_rxq_fill_thresh,
>   	NULL,
>   };
>   
> diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> index 1b1e738..d0c519b 100644
> --- a/app/test-pmd/config.c
> +++ b/app/test-pmd/config.c
> @@ -6342,3 +6342,24 @@ struct igb_ring_desc_16_bytes {
>   		printf("  %s\n", buf);
>   	}
>   }
> +
> +int
> +set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx, uint16_t fill_thresh)

uint8_t for fill_thresh since that type is used in the ethdev API

> +{
> +	struct rte_eth_link link;
> +	int ret;
> +
> +	if (port_id_is_invalid(port_id, ENABLED_WARN))
> +		return -EINVAL;
> +	ret = eth_link_get_nowait_print_err(port_id, &link);
> +	if (ret < 0)
> +		return -EINVAL;
> +	if (fill_thresh > 99)
> +		return -EINVAL;
> +	ret = rte_eth_rx_fill_thresh_set(port_id, queue_idx, fill_thresh);
> +
> +	if (ret)

Please remove extra empty line above and compare with 0 explicitly.

> +		return ret;
> +	return 0;
> +}
> +
> diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
> index 767765d..1209230 100644
> --- a/app/test-pmd/testpmd.c
> +++ b/app/test-pmd/testpmd.c
> @@ -420,6 +420,7 @@ struct fwd_engine * fwd_engines[] = {
>   	[RTE_ETH_EVENT_NEW] = "device probed",
>   	[RTE_ETH_EVENT_DESTROY] = "device released",
>   	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
> +	[RTE_ETH_EVENT_RX_FILL_THRESH] = "rxq fill threshold reached",
>   	[RTE_ETH_EVENT_MAX] = NULL,
>   };
>   
> @@ -3616,6 +3617,10 @@ struct pmd_test_command {
>   eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
>   		  void *ret_param)
>   {
> +	struct rte_eth_dev_info dev_info;
> +	uint16_t rxq_id;
> +	uint8_t fill_thresh;
> +	int ret;
>   	RTE_SET_USED(param);
>   	RTE_SET_USED(ret_param);
>   
> @@ -3647,6 +3652,19 @@ struct pmd_test_command {
>   		ports[port_id].port_status = RTE_PORT_CLOSED;
>   		printf("Port %u is closed\n", port_id);
>   		break;
> +	case RTE_ETH_EVENT_RX_FILL_THRESH:
> +		ret = rte_eth_dev_info_get(port_id, &dev_info);
> +		if (ret != 0)
> +			break;

What's the point of getting device info above? It is unused as far as I
can see.

> +		/* fill_thresh query API rewinds rxq_id, no need to check max rxq num. */
> +		for (rxq_id = 0; ; rxq_id++) {
> +			ret = rte_eth_rx_fill_thresh_query(port_id, &rxq_id, &fill_thresh);
> +			if (ret <= 0)
> +				break;
> +			printf("Received fill_thresh event, port:%d rxq_id:%d\n",
> +			       port_id, rxq_id);
> +		}
> +		break;
>   	default:
>   		break;
>   	}
> diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
> index 78a5f4e..c7a144e 100644
> --- a/app/test-pmd/testpmd.h
> +++ b/app/test-pmd/testpmd.h
> @@ -1173,6 +1173,8 @@ uint16_t tx_pkt_set_dynf(uint16_t port_id, __rte_unused uint16_t queue,
>   void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
>   void flex_item_destroy(portid_t port_id, uint16_t flex_id);
>   void port_flex_item_flush(portid_t port_id);
> +int set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx,

It is better to be consistent with variable naming:
queue_idx -> queue_id

> +			uint16_t fill_thresh);

uint16_t -> uint8_t since ethdev API uses uint8_t. It must be
consistent with ethdev.

>   
>   extern int flow_parse(const char *src, void *result, unsigned int size,
>   		      struct rte_flow_attr **attr,
> diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
> index 69d9dc2..7ef7dba 100644
> --- a/lib/ethdev/ethdev_driver.h
> +++ b/lib/ethdev/ethdev_driver.h
> @@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
>   				    const struct rte_eth_rxconf *rx_conf,
>   				    struct rte_mempool *mb_pool);
>   
> +/**
> + * @internal Set Rx queue fill threshold.
> + * @see rte_eth_rx_fill_thresh_set()
> + */
> +typedef int (*eth_rx_queue_fill_thresh_set_t)(struct rte_eth_dev *dev,
> +				      uint16_t rx_queue_id,
> +				      uint8_t fill_thresh);
> +
> +/**
> + * @internal Query queue fill threshold event.
> + * @see rte_eth_rx_fill_thresh_query()
> + */
> +
> +typedef int (*eth_rx_queue_fill_thresh_query_t)(struct rte_eth_dev *dev,
> +					uint16_t *rx_queue_id,
> +					uint8_t *fill_thresh);

Should fill_thresh be rounded UP or DOWN by the driver?
The order of prototype definitions must follow the order of ops in the
structure.
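
To illustrate why the rounding direction matters, here is a minimal
sketch, assuming a driver converts the percentage into a descriptor
count before programming the device. The helper names are hypothetical
and are not part of ethdev or of any PMD.

```c
#include <stdint.h>

/*
 * Hypothetical helpers (not ethdev or PMD code): convert a percentage
 * threshold into a number of descriptors for a queue of nb_desc entries.
 */

/* Round down: the event may fire slightly earlier than requested. */
static unsigned int
thresh_to_desc_floor(uint16_t nb_desc, uint8_t pct)
{
	return ((unsigned int)nb_desc * pct) / 100;
}

/* Round up: the event never fires before the requested percentage. */
static unsigned int
thresh_to_desc_ceil(uint16_t nb_desc, uint8_t pct)
{
	return ((unsigned int)nb_desc * pct + 99) / 100;
}
```

For example, 33% of a 128-descriptor queue is 42.24 descriptors, so the
two helpers return 42 and 43. With rounding down, 1% of a 64-descriptor
queue becomes 0 descriptors, which a driver could confuse with
"disabled", so the documentation should probably state the rule.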

> +
>   /** @internal Setup a transmit queue of an Ethernet device. */
>   typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
>   				    uint16_t tx_queue_id,
> @@ -1168,6 +1185,11 @@ struct eth_dev_ops {
>   	/** Priority flow control queue configure */
>   	priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
>   
> +	/** Set Rx queue fill threshold. */
> +	eth_rx_queue_fill_thresh_set_t rx_queue_fill_thresh_set;
> +	/** Query Rx queue fill threshold event. */
> +	eth_rx_queue_fill_thresh_query_t rx_queue_fill_thresh_query;
> +

Why is it added in the middle of the structure?
Isn't it better to add it at the end?

>   	/** Set Unicast Table Array */
>   	eth_uc_hash_table_set_t    uc_hash_table_set;
>   	/** Set Unicast hash bitmap */
> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
> index a175867..69a1f75 100644
> --- a/lib/ethdev/rte_ethdev.c
> +++ b/lib/ethdev/rte_ethdev.c
> @@ -4424,6 +4424,58 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
>   							queue_idx, tx_rate));
>   }
>   
> +int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
> +			       uint8_t fill_thresh)
> +{
> +	struct rte_eth_dev *dev;
> +	struct rte_eth_dev_info dev_info;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	if (queue_id > dev_info.max_rx_queues) {

We don't need dev_info to get number of Rx queues.
dev->data->nb_rx_queues does the job.

> +		RTE_ETHDEV_LOG(ERR,
> +			"Set queue fill thresh: port %u: invalid queue ID=%u.\n",
> +			port_id, queue_id);
> +		return -EINVAL;
> +	}
> +
> +	if (fill_thresh > 99)
> +		return -EINVAL;

Why do you have error log above, but don't have it here?

> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_fill_thresh_set, -ENOTSUP);
> +	return eth_err(port_id, (*dev->dev_ops->rx_queue_fill_thresh_set)(dev,
> +							     queue_id, fill_thresh));
> +}
> +
> +int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
> +				 uint8_t *fill_thresh)
> +{
> +	struct rte_eth_dev_info dev_info;
> +	struct rte_eth_dev *dev;
> +	int ret;
> +
> +	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> +	dev = &rte_eth_devices[port_id];
> +
> +	ret = rte_eth_dev_info_get(port_id, &dev_info);
> +	if (ret != 0)
> +		return ret;
> +
> +	if (queue_id == NULL)
> +		return -EINVAL;
> +	if (*queue_id >= dev_info.max_rx_queues)

Same here. You don't need dev_info.

> +		*queue_id = 0;
> +
> +	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_fill_thresh_query, -ENOTSUP);
> +	return eth_err(port_id, (*dev->dev_ops->rx_queue_fill_thresh_query)(dev,
> +							     queue_id, fill_thresh));
> +}
> +
>   RTE_INIT(eth_dev_init_fp_ops)
>   {
>   	uint32_t i;
> diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> index 04225bb..d44e5da 100644
> --- a/lib/ethdev/rte_ethdev.h
> +++ b/lib/ethdev/rte_ethdev.h
> @@ -1931,6 +1931,14 @@ struct rte_eth_rxq_info {
>   	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
>   	uint16_t nb_desc;           /**< configured number of RXDs. */
>   	uint16_t rx_buf_size;       /**< hardware receive buffer size. */
> +	/**
> +	 * Per-queue Rx fill threshold defined as percentage of Rx queue
> +	 * size. If Rx queue receives traffic higher than this percentage,
> +	 * the event RTE_ETH_EVENT_RX_FILL_THESH is triggered.
> +	 * Value 0 means threshold monitoring is disabled, no event is
> +	 * triggered.
> +	 */
> +	uint8_t fill_thresh;
>   } __rte_cache_min_aligned;
>   
>   /**
> @@ -3672,6 +3680,65 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t port_id,
>    */
>   int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
>   
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Set Rx queue based fill threshold.
> + *
> + * @param port_id
> + *  The port identifier of the Ethernet device.
> + * @param queue_id
> + *  The index of the receive queue.
> + * @param fill_thresh
> + *  The fill threshold percentage of Rx queue size which describes
> + *  the fullness of Rx queue. If the Rx queue fullness is above it,
> + *  the device will trigger the event RTE_ETH_EVENT_RX_FILL_THRESH.
> + *  [1-99] to set a new fill thresold.
> + *  0 to disable thresold monitoring.
> + *
> + * @return
> + *   - 0 if successful.
> + *   - negative if failed.
> + */
> +__rte_experimental
> +int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
> +			       uint8_t fill_thresh);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Query Rx queue based fill threshold event.
> + * The function queries all queues in the port circularly until one
> + * pending fill_thresh event is found or no pending fill_thresh event is found.
> + *
> + * @param port_id
> + *  The port identifier of the Ethernet device.
> + * @param queue_id
> + *  The API caller sets the starting Rx queue id in the pointer.
> + *  If the queue_id is bigger than maximum queue id of the port,
> + *  it's rewinded to 0 so that application can keep calling
> + *  this function to handle all pending fill_thresh events in the queues
> + *  with a simple increment between calls.
> + *  If a Rx queue has pending fill_thresh event, the pointer is updated
> + *  with this Rx queue id; otherwise this pointer's content is
> + *  unchanged.
> + * @param fill_thresh
> + *  The pointer to the fill threshold percentage of Rx queue.
> + *  If Rx queue with pending fill_thresh event is found, the queue's fill_thresh
> + *  percentage is stored in this pointer, otherwise the pointer's
> + *  content is unchanged.
> + *
> + * @return
> + *   - 1 if a Rx queue with pending fill_thresh event is found.
> + *   - 0 if no Rx queue with pending fill_thresh event is found.
> + *   - -EINVAL if queue_id is NULL.
> + */
> +__rte_experimental
> +int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
> +				 uint8_t *fill_thresh);
> +
>   typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
>   		void *userdata);
>   
> @@ -3877,6 +3944,11 @@ enum rte_eth_event_type {
>   	RTE_ETH_EVENT_DESTROY,  /**< port is released */
>   	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
>   	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> +	/**
> +	 *  Fill threshold value is exceeded in a queue.
> +	 *  @see rte_eth_rx_fill_thresh_set()
> +	 */
> +	RTE_ETH_EVENT_RX_FILL_THRESH,
>   	RTE_ETH_EVENT_MAX       /**< max value of this enum */
>   };
>   
> diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
> index daca785..29b1fe8 100644
> --- a/lib/ethdev/version.map
> +++ b/lib/ethdev/version.map
> @@ -285,6 +285,8 @@ EXPERIMENTAL {
>   	rte_mtr_color_in_protocol_priority_get;
>   	rte_mtr_color_in_protocol_set;
>   	rte_mtr_meter_vlan_table_update;
> +	rte_eth_rx_fill_thresh_set;
> +	rte_eth_rx_fill_thresh_query;
>   };
>   
>   INTERNAL {


^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-04 12:46             ` Andrew Rybchenko
@ 2022-06-06 13:16               ` Spike Du
  2022-06-06 17:15                 ` Andrew Rybchenko
  0 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-06-06 13:16 UTC (permalink / raw)
  To: Andrew Rybchenko, Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	Wenzhuo Lu, Beilei Xing, Bernard Iremonger, Ray Kinsella,
	Neil Horman
  Cc: stephen, mb, dev, Raslan Darawsheh

Hi Andrew,
	Please see below regarding the "fill threshold" concept. I'm OK with the other comments about the code.

Regards,
Spike.


> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Saturday, June 4, 2022 8:46 PM
> To: Spike Du <spiked@nvidia.com>; Matan Azrad <matan@nvidia.com>;
> Slava Ovsiienko <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>;
> NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> Wenzhuo Lu <wenzhuo.lu@intel.com>; Beilei Xing <beilei.xing@intel.com>;
> Bernard Iremonger <bernard.iremonger@intel.com>; Ray Kinsella
> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>
> Cc: stephen@networkplumber.org; mb@smartsharesystems.com;
> dev@dpdk.org; Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
> 
> External email: Use caution opening links or attachments
> 
> 
> On 6/3/22 15:48, Spike Du wrote:
> > Fill threshold describes the fullness of a Rx queue. If the Rx queue
> > fullness is above the threshold, the device will trigger the event
> > RTE_ETH_EVENT_RX_FILL_THRESH.
> 
> Sorry, I'm not sure that I understand. As far as I know, the process of
> adding more Rx buffers to an Rx queue is called 'refill' in many drivers.
> So fill level is a number (or percentage) of free buffers in an Rx queue.
> If so, fill threshold should be a minimum fill level, and below that
> level we should generate an event.
> 
> However, reading the first paragraph of the description, it looks like
> you mean the opposite thing - a number (or percentage) of ready Rx
> buffers with received packets.
> 
> I think that the term "fill threshold" was suggested by me, but I did it
> with my understanding of the added feature. Now I'm confused.
> 
> Moreover, I don't understand how "fill threshold" could be in terms of
> ready Rx buffers. HW simply doesn't know when ready Rx buffers are
> processed by SW. So, HW can't say for sure how many ready Rx buffers are
> pending. It could be calculated as Rx queue size minus number of free Rx
> buffers, but it is imprecise. First of all, not all Rx descriptors could
> be used. Second, HW ring size could differ from the queue size specified
> in SW. Queue size specified in SW could just limit the maximum number of
> free Rx buffers provided by the driver.
> 

Let me use other terms, because "fill"/"refill" is also ambiguous to me.
In an Rx ring, there are Rx buffers with received packets, which you call
"ready Rx buffers". There is an RTE API, rte_eth_rx_queue_count(), to get
their number. They are also called "used descriptors" in the code.
There are also Rx buffers provided by SW so that HW can "fill in" received
packets. We can call them "usable Rx buffers" (here "usable" means usable
by HW).
Let's define Rx queue "fullness":
	Fullness = ready-Rx-buffers/Rxq-size
Conversely, we have "emptiness":
	Emptiness = usable-Rx-buffers/Rxq-size
Here "fill threshold" describes "fullness". It is not the "refill" you
describe above. In your words, "refill" is the opposite: it fills
"usable/free Rx buffers", so it relates to "emptiness".

I can only briefly explain how mlx5 gets the LWM, because I'm not a
firmware guy.
An mlx5 Rx queue is basically an RDMA queue. It has two indexes: a producer
index, which increases when HW fills in a packet, and a consumer index,
which increases when SW consumes the packet.
The queue size is known when the queue is created. The fullness is
something like (producer_index - consumer_index) (ignoring wrap-around
here).
So mlx5 has a way to get the fullness or emptiness in HW or FW.
Another detail is that mlx5 uses the term "LWM" (limit watermark), which
describes "emptiness". When usable-Rx-buffers drops below the LWM, an event
is triggered.
But Thomas thinks "fullness" is easier to understand, so we use "fullness"
in the rte APIs and translate it to LWM in the mlx5 PMD.
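
The definitions above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the mlx5 PMD code: all function
names are hypothetical, the indexes are assumed to be free-running 16-bit
counters, and ready plus usable buffers are assumed to add up to the queue
size (which, as noted in the review, may not hold exactly on real HW).

```c
#include <stdint.h>

/* All names here are hypothetical; this is not the mlx5 PMD code. */

/*
 * Number of ready Rx buffers, assuming free-running 16-bit indexes.
 * The unsigned subtraction gives the right result across wrap-around.
 */
static inline uint16_t
rxq_ready(uint16_t producer_index, uint16_t consumer_index)
{
	return (uint16_t)(producer_index - consumer_index);
}

/* Fullness = ready-Rx-buffers / Rxq-size, expressed as a percentage. */
static inline unsigned int
rxq_fullness_pct(uint16_t producer_index, uint16_t consumer_index,
		 uint16_t rxq_size)
{
	return (unsigned int)rxq_ready(producer_index, consumer_index) *
	       100 / rxq_size;
}

/*
 * Translate a fullness threshold (percent) into an emptiness-style
 * watermark: the number of usable buffers below which the event fires.
 * A threshold of 0 means monitoring is disabled, returned here as 0.
 */
static inline unsigned int
fill_thresh_to_lwm(uint16_t rxq_size, uint8_t fill_thresh)
{
	if (fill_thresh == 0)
		return 0;
	return (unsigned int)rxq_size * (100 - fill_thresh) / 100;
}
```

With these assumptions, a 1000-entry queue with fill_thresh 30 maps to a
watermark of 700 usable buffers: the event fires once fewer than 700
buffers remain usable, that is, once more than 30% of the queue is ready.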


> > Fill threshold is defined as a percentage of Rx queue size with valid
> > value of [0,99].
> > Setting fill threshold to 0 means disable it, which is the default.
> > Add fill threshold configuration and query driver callbacks in eth_dev_ops.
> > Add command line options to support fill_thresh per-rxq configure.
> > - Command syntax:
> >    set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
> >
> > - Example commands:
> > To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
> > testpmd> set port 1 rxq 0 fill_thresh 30
> >
> > To disable fill_thresh on port 1 rxq 0:
> > testpmd> set port 1 rxq 0 fill_thresh 0
> >
> > Signed-off-by: Spike Du <spiked@nvidia.com>
> > ---
> >   app/test-pmd/cmdline.c     | 68 +++++++++++++++++++++++++++++++++++++++++++
> >   app/test-pmd/config.c      | 21 ++++++++++++++
> >   app/test-pmd/testpmd.c     | 18 ++++++++++++
> >   app/test-pmd/testpmd.h     |  2 ++
> >   lib/ethdev/ethdev_driver.h | 22 ++++++++++++++
> >   lib/ethdev/rte_ethdev.c    | 52 +++++++++++++++++++++++++++++++++
> >   lib/ethdev/rte_ethdev.h    | 72 ++++++++++++++++++++++++++++++++++++++++++++++
> >   lib/ethdev/version.map     |  2 ++
> >   8 files changed, 257 insertions(+)
> >
> > diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
> > index 0410bad..918581e 100644
> > --- a/app/test-pmd/cmdline.c
> > +++ b/app/test-pmd/cmdline.c
> > @@ -17823,6 +17823,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
> >       }
> >   };
> >
> > +/* *** SET FILL THRESHOLD FOR A RXQ OF A PORT *** */
> > +struct cmd_rxq_fill_thresh_result {
> > +     cmdline_fixed_string_t set;
> > +     cmdline_fixed_string_t port;
> > +     uint16_t port_num;
> > +     cmdline_fixed_string_t rxq;
> > +     uint16_t rxq_num;
> > +     cmdline_fixed_string_t fill_thresh;
> > +     uint16_t fill_thresh_num;
> 
> uint8_t to be consistent with ethdev
> 
> > +};
> > +
> > +static void cmd_rxq_fill_thresh_parsed(void *parsed_result,
> > +             __rte_unused struct cmdline *cl,
> > +             __rte_unused void *data)
> > +{
> > +     struct cmd_rxq_fill_thresh_result *res = parsed_result;
> > +     int ret = 0;
> > +
> > +     if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
> > +         && (strcmp(res->rxq, "rxq") == 0)
> > +         && (strcmp(res->fill_thresh, "fill_thresh") == 0))
> > +             ret = set_rxq_fill_thresh(res->port_num, res->rxq_num,
> > +                               res->fill_thresh_num);
> > +     if (ret < 0)
> > +             printf("rxq_fill_thresh_cmd error: (%s)\n", strerror(-ret));
> > +
> > +}
> > +
> > +cmdline_parse_token_string_t cmd_rxq_fill_thresh_set =
> > +     TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> > +                             set, "set");
> > +cmdline_parse_token_string_t cmd_rxq_fill_thresh_port =
> > +     TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> > +                             port, "port");
> > +cmdline_parse_token_num_t cmd_rxq_fill_thresh_portnum =
> > +     TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> > +                             port_num, RTE_UINT16);
> > +cmdline_parse_token_string_t cmd_rxq_fill_thresh_rxq =
> > +     TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> > +                             rxq, "rxq");
> > +cmdline_parse_token_num_t cmd_rxq_fill_thresh_rxqnum =
> > +     TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> > +                             rxq_num, RTE_UINT8);
> 
> RTE_UINT16 since it is an Rx queue ID
> 
> > +cmdline_parse_token_string_t cmd_rxq_fill_thresh_fill_thresh =
> > +     TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> > +                             fill_thresh, "fill_thresh");
> > +cmdline_parse_token_num_t cmd_rxq_fill_thresh_fill_threshnum =
> > +     TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
> > +                             fill_thresh_num, RTE_UINT16);
> 
> RTE_UINT16 -> RTE_UINT8
> 
> > +
> > +cmdline_parse_inst_t cmd_rxq_fill_thresh = {
> > +     .f = cmd_rxq_fill_thresh_parsed,
> > +     .data = (void *)0,
> > +     .help_str = "set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>"
> > +             "Set fill_thresh for rxq on port_id",
> > +     .tokens = {
> > +             (void *)&cmd_rxq_fill_thresh_set,
> > +             (void *)&cmd_rxq_fill_thresh_port,
> > +             (void *)&cmd_rxq_fill_thresh_portnum,
> > +             (void *)&cmd_rxq_fill_thresh_rxq,
> > +             (void *)&cmd_rxq_fill_thresh_rxqnum,
> > +             (void *)&cmd_rxq_fill_thresh_fill_thresh,
> > +             (void *)&cmd_rxq_fill_thresh_fill_threshnum,
> > +             NULL,
> > +     },
> > +};
> 
> Please add the 'static' keyword to all of the above cmdline_parse_* variables.
> 
> > +
> >   /* ******************************************************************************** */
> >
> >   /* list of instructions */
> > @@ -18110,6 +18177,7 @@ struct cmd_show_port_flow_transfer_proxy_result {
> >       (cmdline_parse_inst_t *)&cmd_show_capability,
> >       (cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
> >       (cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
> > +     (cmdline_parse_inst_t *)&cmd_rxq_fill_thresh,
> >       NULL,
> >   };
> >
> > diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
> > index 1b1e738..d0c519b 100644
> > --- a/app/test-pmd/config.c
> > +++ b/app/test-pmd/config.c
> > @@ -6342,3 +6342,24 @@ struct igb_ring_desc_16_bytes {
> >               printf("  %s\n", buf);
> >       }
> >   }
> > +
> > +int
> > +set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx, uint16_t fill_thresh)
> 
> uint8_t for fill_thresh since that type is used in the ethdev API
> 
> > +{
> > +     struct rte_eth_link link;
> > +     int ret;
> > +
> > +     if (port_id_is_invalid(port_id, ENABLED_WARN))
> > +             return -EINVAL;
> > +     ret = eth_link_get_nowait_print_err(port_id, &link);
> > +     if (ret < 0)
> > +             return -EINVAL;
> > +     if (fill_thresh > 99)
> > +             return -EINVAL;
> > +     ret = rte_eth_rx_fill_thresh_set(port_id, queue_idx,
> > + fill_thresh);
> > +
> > +     if (ret)
> 
> Please remove extra empty line above and compare with 0 explicitly.
> 
> > +             return ret;
> > +     return 0;
> > +}
> > +
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> > 767765d..1209230 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -420,6 +420,7 @@ struct fwd_engine * fwd_engines[] = {
> >       [RTE_ETH_EVENT_NEW] = "device probed",
> >       [RTE_ETH_EVENT_DESTROY] = "device released",
> >       [RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
> > +     [RTE_ETH_EVENT_RX_FILL_THRESH] = "rxq fill threshold reached",
> >       [RTE_ETH_EVENT_MAX] = NULL,
> >   };
> >
> > @@ -3616,6 +3617,10 @@ struct pmd_test_command {
> >   eth_event_callback(portid_t port_id, enum rte_eth_event_type type,
> void *param,
> >                 void *ret_param)
> >   {
> > +     struct rte_eth_dev_info dev_info;
> > +     uint16_t rxq_id;
> > +     uint8_t fill_thresh;
> > +     int ret;
> >       RTE_SET_USED(param);
> >       RTE_SET_USED(ret_param);
> >
> > @@ -3647,6 +3652,19 @@ struct pmd_test_command {
> >               ports[port_id].port_status = RTE_PORT_CLOSED;
> >               printf("Port %u is closed\n", port_id);
> >               break;
> > +     case RTE_ETH_EVENT_RX_FILL_THRESH:
> > +             ret = rte_eth_dev_info_get(port_id, &dev_info);
> > +             if (ret != 0)
> > +                     break;
> 
> What's the point of getting device info above? It is unused as far as I can see.
> 
> > +             /* fill_thresh query API rewinds rxq_id, no need to check max rxq
> num. */
> > +             for (rxq_id = 0; ; rxq_id++) {
> > +                     ret = rte_eth_rx_fill_thresh_query(port_id, &rxq_id,
> &fill_thresh);
> > +                     if (ret <= 0)
> > +                             break;
> > +                     printf("Received fill_thresh event, port:%d rxq_id:%d\n",
> > +                            port_id, rxq_id);
> > +             }
> > +             break;
> >       default:
> >               break;
> >       }
> > diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h index
> > 78a5f4e..c7a144e 100644
> > --- a/app/test-pmd/testpmd.h
> > +++ b/app/test-pmd/testpmd.h
> > @@ -1173,6 +1173,8 @@ uint16_t tx_pkt_set_dynf(uint16_t port_id,
> __rte_unused uint16_t queue,
> >   void flex_item_create(portid_t port_id, uint16_t flex_id, const char
> *filename);
> >   void flex_item_destroy(portid_t port_id, uint16_t flex_id);
> >   void port_flex_item_flush(portid_t port_id);
> > +int set_rxq_fill_thresh(portid_t port_id, uint16_t queue_idx,
> 
> It is better to be consistent with variable naming: queue_idx -> queue_id
> 
> > +                     uint16_t fill_thresh);
> 
> uint16_t -> uint8_t since ethdev API uses uint8_t. It must be consistent with
> ethdev.
> 
> >
> >   extern int flow_parse(const char *src, void *result, unsigned int size,
> >                     struct rte_flow_attr **attr, diff --git
> > a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h index
> > 69d9dc2..7ef7dba 100644
> > --- a/lib/ethdev/ethdev_driver.h
> > +++ b/lib/ethdev/ethdev_driver.h
> > @@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct
> rte_eth_dev *dev,
> >                                   const struct rte_eth_rxconf *rx_conf,
> >                                   struct rte_mempool *mb_pool);
> >
> > +/**
> > + * @internal Set Rx queue fill threshold.
> > + * @see rte_eth_rx_fill_thresh_set()
> > + */
> > +typedef int (*eth_rx_queue_fill_thresh_set_t)(struct rte_eth_dev *dev,
> > +                                   uint16_t rx_queue_id,
> > +                                   uint8_t fill_thresh);
> > +
> > +/**
> > + * @internal Query queue fill threshold event.
> > + * @see rte_eth_rx_fill_thresh_query()  */
> > +
> > +typedef int (*eth_rx_queue_fill_thresh_query_t)(struct rte_eth_dev
> *dev,
> > +                                     uint16_t *rx_queue_id,
> > +                                     uint8_t *fill_thresh);
> 
> Should fill_thresh be rounded UP or DOWN by the driver?
> Order of prototypes definition must follow order of ops in the structure.
> 
> > +
> >   /** @internal Setup a transmit queue of an Ethernet device. */
> >   typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
> >                                   uint16_t tx_queue_id, @@ -1168,6
> > +1185,11 @@ struct eth_dev_ops {
> >       /** Priority flow control queue configure */
> >       priority_flow_ctrl_queue_config_t
> > priority_flow_ctrl_queue_config;
> >
> > +     /** Set Rx queue fill threshold. */
> > +     eth_rx_queue_fill_thresh_set_t rx_queue_fill_thresh_set;
> > +     /** Query Rx queue fill threshold event. */
> > +     eth_rx_queue_fill_thresh_query_t rx_queue_fill_thresh_query;
> > +
> 
> Why is it added in the middle of the structure?
> Isn't it better to add at the end?
> 
> >       /** Set Unicast Table Array */
> >       eth_uc_hash_table_set_t    uc_hash_table_set;
> >       /** Set Unicast hash bitmap */
> > diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c index
> > a175867..69a1f75 100644
> > --- a/lib/ethdev/rte_ethdev.c
> > +++ b/lib/ethdev/rte_ethdev.c
> > @@ -4424,6 +4424,58 @@ int rte_eth_set_queue_rate_limit(uint16_t
> port_id, uint16_t queue_idx,
> >                                                       queue_idx, tx_rate));
> >   }
> >
> > +int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
> > +                            uint8_t fill_thresh) {
> > +     struct rte_eth_dev *dev;
> > +     struct rte_eth_dev_info dev_info;
> > +     int ret;
> > +
> > +     RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> > +     dev = &rte_eth_devices[port_id];
> > +
> > +     ret = rte_eth_dev_info_get(port_id, &dev_info);
> > +     if (ret != 0)
> > +             return ret;
> > +
> > +     if (queue_id > dev_info.max_rx_queues) {
> 
> We don't need dev_info to get number of Rx queues.
> dev->data->nb_rx_queues does the job.
> 
> > +             RTE_ETHDEV_LOG(ERR,
> > +                     "Set queue fill thresh: port %u: invalid queue ID=%u.\n",
> > +                     port_id, queue_id);
> > +             return -EINVAL;
> > +     }
> > +
> > +     if (fill_thresh > 99)
> > +             return -EINVAL;
> 
> Why do you have error log above, but don't have it here?
> 
> > +     RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops-
> >rx_queue_fill_thresh_set, -ENOTSUP);
> > +     return eth_err(port_id, (*dev->dev_ops-
> >rx_queue_fill_thresh_set)(dev,
> > +                                                          queue_id,
> > +fill_thresh)); }
> > +
> > +int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
> > +                              uint8_t *fill_thresh) {
> > +     struct rte_eth_dev_info dev_info;
> > +     struct rte_eth_dev *dev;
> > +     int ret;
> > +
> > +     RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
> > +     dev = &rte_eth_devices[port_id];
> > +
> > +     ret = rte_eth_dev_info_get(port_id, &dev_info);
> > +     if (ret != 0)
> > +             return ret;
> > +
> > +     if (queue_id == NULL)
> > +             return -EINVAL;
> > +     if (*queue_id >= dev_info.max_rx_queues)
> 
> Same here. You don't need dev_info.
> 
> > +             *queue_id = 0;
> > +
> > +     RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops-
> >rx_queue_fill_thresh_query, -ENOTSUP);
> > +     return eth_err(port_id, (*dev->dev_ops-
> >rx_queue_fill_thresh_query)(dev,
> > +                                                          queue_id,
> > +fill_thresh)); }
> > +
> >   RTE_INIT(eth_dev_init_fp_ops)
> >   {
> >       uint32_t i;
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04225bb..d44e5da 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1931,6 +1931,14 @@ struct rte_eth_rxq_info {
> >       uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
> >       uint16_t nb_desc;           /**< configured number of RXDs. */
> >       uint16_t rx_buf_size;       /**< hardware receive buffer size. */
> > +     /**
> > +      * Per-queue Rx fill threshold defined as percentage of Rx queue
> > +      * size. If Rx queue receives traffic higher than this percentage,
> > +      * the event RTE_ETH_EVENT_RX_FILL_THRESH is triggered.
> > +      * Value 0 means threshold monitoring is disabled, no event is
> > +      * triggered.
> > +      */
> > +     uint8_t fill_thresh;
> >   } __rte_cache_min_aligned;
> >
> >   /**
> > @@ -3672,6 +3680,65 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t
> port_id,
> >    */
> >   int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int
> > on);
> >
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice.
> > + *
> > + * Set Rx queue based fill threshold.
> > + *
> > + * @param port_id
> > + *  The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *  The index of the receive queue.
> > + * @param fill_thresh
> > + *  The fill threshold percentage of Rx queue size which describes
> > + *  the fullness of Rx queue. If the Rx queue fullness is above it,
> > + *  the device will trigger the event RTE_ETH_EVENT_RX_FILL_THRESH.
> > + *  [1-99] to set a new fill threshold.
> > + *  0 to disable threshold monitoring.
> > + *
> > + * @return
> > + *   - 0 if successful.
> > + *   - negative if failed.
> > + */
> > +__rte_experimental
> > +int rte_eth_rx_fill_thresh_set(uint16_t port_id, uint16_t queue_id,
> > +                            uint8_t fill_thresh);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice.
> > + *
> > + * Query Rx queue based fill threshold event.
> > + * The function queries all queues in the port circularly until one
> > + * pending fill_thresh event is found or no pending fill_thresh event is
> found.
> > + *
> > + * @param port_id
> > + *  The port identifier of the Ethernet device.
> > + * @param queue_id
> > + *  The API caller sets the starting Rx queue id in the pointer.
> > + *  If the queue_id is bigger than maximum queue id of the port,
> > + *  it's rewound to 0 so that application can keep calling
> > + *  this function to handle all pending fill_thresh events in the
> > +queues
> > + *  with a simple increment between calls.
> > + *  If a Rx queue has pending fill_thresh event, the pointer is
> > +updated
> > + *  with this Rx queue id; otherwise this pointer's content is
> > + *  unchanged.
> > + * @param fill_thresh
> > + *  The pointer to the fill threshold percentage of Rx queue.
> > + *  If Rx queue with pending fill_thresh event is found, the queue's
> > +fill_thresh
> > + *  percentage is stored in this pointer, otherwise the pointer's
> > + *  content is unchanged.
> > + *
> > + * @return
> > + *   - 1 if a Rx queue with pending fill_thresh event is found.
> > + *   - 0 if no Rx queue with pending fill_thresh event is found.
> > + *   - -EINVAL if queue_id is NULL.
> > + */
> > +__rte_experimental
> > +int rte_eth_rx_fill_thresh_query(uint16_t port_id, uint16_t *queue_id,
> > +                              uint8_t *fill_thresh);
> > +
> >   typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t
> count,
> >               void *userdata);
> >
> > @@ -3877,6 +3944,11 @@ enum rte_eth_event_type {
> >       RTE_ETH_EVENT_DESTROY,  /**< port is released */
> >       RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
> >       RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
> > +     /**
> > +      *  Fill threshold value is exceeded in a queue.
> > +      *  @see rte_eth_rx_fill_thresh_set()
> > +      */
> > +     RTE_ETH_EVENT_RX_FILL_THRESH,
> >       RTE_ETH_EVENT_MAX       /**< max value of this enum */
> >   };
> >
> > diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map index
> > daca785..29b1fe8 100644
> > --- a/lib/ethdev/version.map
> > +++ b/lib/ethdev/version.map
> > @@ -285,6 +285,8 @@ EXPERIMENTAL {
> >       rte_mtr_color_in_protocol_priority_get;
> >       rte_mtr_color_in_protocol_set;
> >       rte_mtr_meter_vlan_table_update;
> > +     rte_eth_rx_fill_thresh_set;
> > +     rte_eth_rx_fill_thresh_query;
> >   };
> >
> >   INTERNAL {


^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-03 12:48           ` [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold Spike Du
  2022-06-03 14:30             ` Ray Kinsella
  2022-06-04 12:46             ` Andrew Rybchenko
@ 2022-06-06 15:49             ` Stephen Hemminger
  2 siblings, 0 replies; 131+ messages in thread
From: Stephen Hemminger @ 2022-06-06 15:49 UTC (permalink / raw)
  To: Spike Du
  Cc: matan, viacheslavo, orika, thomas, Wenzhuo Lu, Beilei Xing,
	Bernard Iremonger, Ray Kinsella, Neil Horman, andrew.rybchenko,
	mb, dev, rasland

On Fri, 3 Jun 2022 15:48:17 +0300
Spike Du <spiked@nvidia.com> wrote:

> Fill threshold describes the fullness of a Rx queue. If the Rx
> queue fullness is above the threshold, the device will trigger the event
> RTE_ETH_EVENT_RX_FILL_THRESH.
> Fill threshold is defined as a percentage of Rx queue size with valid
> value of [0,99].
> Setting fill threshold to 0 means disable it, which is the default.
> Add fill threshold configuration and query driver callbacks in eth_dev_ops.
> Add command line options to support fill_thresh per-rxq configure.
> - Command syntax:
>   set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
> 
> - Example commands:
> To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
> testpmd> set port 1 rxq 0 fill_thresh 30  
> 
> To disable fill_thresh on port 1 rxq 0:
> testpmd> set port 1 rxq 0 fill_thresh 0  
> 
> Signed-off-by: Spike Du <spiked@nvidia.com>

Could the name be shortened to just rte_event_rx_thresh?

The eth_ and _fill_ parts are redundant.

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-06 13:16               ` Spike Du
@ 2022-06-06 17:15                 ` Andrew Rybchenko
  2022-06-06 21:30                   ` Thomas Monjalon
  2022-06-07  6:00                   ` Spike Du
  0 siblings, 2 replies; 131+ messages in thread
From: Andrew Rybchenko @ 2022-06-06 17:15 UTC (permalink / raw)
  To: Spike Du, Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	Wenzhuo Lu, Beilei Xing, Bernard Iremonger, Ray Kinsella,
	Neil Horman
  Cc: stephen, mb, dev, Raslan Darawsheh

On 6/6/22 16:16, Spike Du wrote:
> Hi Andrew,
> 	Please see below for "fill threshold" concept, I'm ok with other comments about code.
> 
> Regards,
> Spike.
> 
> 
>> -----Original Message-----
>> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>> Sent: Saturday, June 4, 2022 8:46 PM
>> To: Spike Du <spiked@nvidia.com>; Matan Azrad <matan@nvidia.com>;
>> Slava Ovsiienko <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>;
>> NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
>> Wenzhuo Lu <wenzhuo.lu@intel.com>; Beilei Xing <beilei.xing@intel.com>;
>> Bernard Iremonger <bernard.iremonger@intel.com>; Ray Kinsella
>> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>
>> Cc: stephen@networkplumber.org; mb@smartsharesystems.com;
>> dev@dpdk.org; Raslan Darawsheh <rasland@nvidia.com>
>> Subject: Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
>>
>> External email: Use caution opening links or attachments
>>
>>
>> On 6/3/22 15:48, Spike Du wrote:
>>> Fill threshold describes the fullness of a Rx queue. If the Rx queue
>>> fullness is above the threshold, the device will trigger the event
>>> RTE_ETH_EVENT_RX_FILL_THRESH.
>>
>> Sorry, I'm not sure that I understand. As far as I know the process to add
>> more Rx buffers to Rx queue is called 'refill' in many drivers. So fill level is a
>> number (or percentage) of free buffers in an Rx queue.
>> If so, fill threshold should be a minimum fill level and below the level we
>> should generate an event.
>>
>> However, reading the first paragraph of the description, it looks like you mean
>> the opposite thing - a number (or percentage) of ready Rx buffers with received
>> packets.
>>
>> I think that the term "fill threshold" was suggested by me, but I did it with my
>> understanding of the added feature. Now I'm confused.
>>
>> Moreover, I don't understand how "fill threshold" could be in terms of ready
>> Rx buffers. HW simply doesn't really know when ready Rx buffers are
>> processed by SW. So, HW can't say for sure how many ready Rx buffers are
>> pending. It could be calculated as Rx queue size minus number of free Rx
>> buffers, but it is imprecise. First of all, not all Rx descriptors could be used.
>> Second, HW ring size could differ from the queue size specified in SW.
>> Queue size specified in SW could just limit the maximum number of free Rx
>> buffers provided by the driver.
>>
> 
> Let me use other terms because "fill"/"refill" is also ambiguous to me.
> In a RX ring, there are Rx buffers with received packets, you call it "ready Rx buffers", there is a RTE api rte_eth_rx_queue_count() to get the number,
> It's also called "used descriptors" in the code.
> Also there are Rx buffers provided by SW to allow HW "fill in" received packets, we can call it "usable Rx buffers" (here "usable" means usable for HW).

Maybe it is better to stick to Rx descriptor status terminology?
Available - Rx descriptor available to HW to put a received packet into
Done - Rx descriptor with a received packet reported to SW
Unavailable - other (e.g. a gap which cannot be used, or a Done
descriptor that was processed but not yet refilled, i.e. made
available to HW again).

> Let's define Rx queue "fullness":
> 	Fullness = ready-Rx-buffers/Rxq-size

i.e. number of DONE descriptors divided by RxQ size

> On the opposite, we have "emptiness"
> 	Emptiness = usable-Rx-buffers/Rxq-size

i.e. number of AVAIL descriptors divided by RxQ size
Note that AVAIL != RxQ-size - DONE

HW inherently knows the number of available descriptors.
It is the space between the latest done descriptor and the latest
one received on refill.

HW does not know which descriptors are still DONE, since some that
were DONE earlier could have already been processed by SW but not
yet made available again.
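
To make the distinction concrete, here is a minimal sketch of a ring
tracked with three free-running counters. It is a hypothetical software
model, not code from DPDK or from any PMD; the struct, the counter names
and the helper functions are all assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical model of an Rx ring (illustration only, not DPDK code).
 * All counters are free-running and wrap naturally in uint16_t.
 */
struct rx_ring_model {
	uint16_t size;       /* number of descriptors in the ring */
	uint16_t refill_cnt; /* descriptors SW has posted to HW so far */
	uint16_t done_cnt;   /* descriptors HW has filled with packets so far */
	uint16_t read_cnt;   /* DONE descriptors SW has processed so far */
};

/* AVAIL: posted by SW and not yet used by HW. HW sees both counters. */
static inline uint16_t
rx_ring_avail(const struct rx_ring_model *r)
{
	return (uint16_t)(r->refill_cnt - r->done_cnt);
}

/* DONE and still pending: needs read_cnt, which only SW knows. */
static inline uint16_t
rx_ring_done_pending(const struct rx_ring_model *r)
{
	return (uint16_t)(r->done_cnt - r->read_cnt);
}

/* Unavailable: processed by SW but not yet refilled. */
static inline uint16_t
rx_ring_unavail(const struct rx_ring_model *r)
{
	return (uint16_t)(r->size - rx_ring_avail(r) - rx_ring_done_pending(r));
}
```

For example, with size 8, 8 descriptors posted, 5 filled by HW and 3
processed by SW, AVAIL is 3 and pending DONE is 2, so RxQ-size - DONE
would give 6, not 3.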


> Here "fill threshold" describes "fullness", it's not the "refill" described in your words above. Because in your words, "refill" is the opposite, it's filling "usable/free Rx buffers", or "emptiness".
> 
> I can only briefly explain how mlx5 works to get LWM, because I'm not a Firmware guy.
> Mlx5 Rx queue is basically an RDMA queue. It has two indexes: a producer index which increases when HW fills in a packet, and a consumer index which increases when SW consumes the packet.
> The queue size is known when it's created. The fullness is something like (producer_index - consumer_index) (I don't consider wrap-around here).
> So mlx5 has a way to get the fullness or emptiness in HW or FW.
> Another detail is mlx5 uses the term "LWM" (limit watermark), which describes "emptiness". When usable-Rx-buffers is below LWM, we trigger an event.
> But Thomas thinks "fullness" is easier to understand, so we use "fullness" in rte APIs and we'll translate it to LWM in mlx5 PMD.

HW simply does not know fullness and therefore can't generate any
events based on it. The problem on Rx is when there are no available
descriptors, i.e. emptiness.
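
As an illustration of the translation mentioned in the quoted text (a
fullness percentage in the rte API mapped to an LWM in the PMD), a
minimal sketch could look like the following. It is not the mlx5 PMD
code: the function name and the round-down choice are assumptions, and
it relies on the simplification that every descriptor is either
available or done, which does not hold in general as noted above.

```c
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): convert a "fullness"
 * percentage into an "emptiness" watermark expressed in descriptors.
 * Returns 0 when monitoring is disabled or the percentage is invalid.
 */
static inline uint16_t
fullness_pct_to_lwm(uint16_t nb_desc, uint8_t fill_thresh_pct)
{
	if (fill_thresh_pct == 0 || fill_thresh_pct > 99)
		return 0;
	/* Assumed rounding: integer division rounds the watermark down. */
	return (uint16_t)((uint32_t)nb_desc * (100 - fill_thresh_pct) / 100);
}
```

For a 512-descriptor queue, a fill threshold of 30% maps to an LWM of
358 descriptors under these assumptions.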

>>> Fill threshold is defined as a percentage of Rx queue size with valid
>>> value of [0,99].
>>> Setting fill threshold to 0 means disable it, which is the default.
>>> Add fill threshold configuration and query driver callbacks in eth_dev_ops.
>>> Add command line options to support fill_thresh per-rxq configure.
>>> - Command syntax:
>>>     set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
>>>
>>> - Example commands:
>>> To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
>>> testpmd> set port 1 rxq 0 fill_thresh 30
>>>
>>> To disable fill_thresh on port 1 rxq 0:
>>> testpmd> set port 1 rxq 0 fill_thresh 0
>>>
>>> Signed-off-by: Spike Du <spiked@nvidia.com>

[snip]

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-06 17:15                 ` Andrew Rybchenko
@ 2022-06-06 21:30                   ` Thomas Monjalon
  2022-06-07  8:02                     ` Andrew Rybchenko
  2022-06-07  6:00                   ` Spike Du
  1 sibling, 1 reply; 131+ messages in thread
From: Thomas Monjalon @ 2022-06-06 21:30 UTC (permalink / raw)
  To: Spike Du, Andrew Rybchenko
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, Wenzhuo Lu, Beilei Xing,
	Bernard Iremonger, Ray Kinsella, dev, stephen, mb,
	Raslan Darawsheh, ferruh.yigit

It seems we share a common understanding
and we need to agree on a good wording
for the most meaningful API.
Questions inline below:

06/06/2022 19:15, Andrew Rybchenko:
> On 6/6/22 16:16, Spike Du wrote:
> > Hi Andrew,
> > 	Please see below for "fill threshold" concept, I'm ok with other comments about code.
> > 
> > Regards,
> > Spike.
> > 
> > 
> > From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> >> On 6/3/22 15:48, Spike Du wrote:
> >>> Fill threshold describes the fullness of a Rx queue. If the Rx queue
> >>> fullness is above the threshold, the device will trigger the event
> >>> RTE_ETH_EVENT_RX_FILL_THRESH.
> >>
> >> Sorry, I'm not sure that I understand. As far as I know the process to add
> >> more Rx buffers to Rx queue is called 'refill' in many drivers. So fill level is a
> >> number (or percentage) of free buffers in an Rx queue.
> >> If so, fill threshold should be a minimum fill level and below the level we
> >> should generate an event.
> >>
> >> However, reading the first paragraph of the description, it looks like you mean
> >> the opposite thing - a number (or percentage) of ready Rx buffers with received
> >> packets.
> >>
> >> I think that the term "fill threshold" was suggested by me, but I did it with my
> >> understanding of the added feature. Now I'm confused.
> >>
> >> Moreover, I don't understand how "fill threshold" could be in terms of ready
> >> Rx buffers. HW simply doesn't really know when ready Rx buffers are
> >> processed by SW. So, HW can't say for sure how many ready Rx buffers are
> >> pending. It could be calculated as Rx queue size minus number of free Rx
> >> buffers, but it is imprecise. First of all, not all Rx descriptors could be used.
> >> Second, HW ring size could differ from the queue size specified in SW.
> >> Queue size specified in SW could just limit the maximum number of free Rx
> >> buffers provided by the driver.
> >>
> > 
> > Let me use other terms because "fill"/"refill" is also ambiguous to me.
> > In a RX ring, there are Rx buffers with received packets, you call it "ready Rx buffers", there is a RTE api rte_eth_rx_queue_count() to get the number,
> > It's also called "used descriptors" in the code.
> > Also there are Rx buffers provided by SW to allow HW "fill in" received packets, we can call it "usable Rx buffers" (here "usable" means usable for HW).
> 
> May be it is better to stick to Rx descriptor status terminology?
> Available - Rx descriptor available to HW to put received packet to
> Done - Rx descriptor with received packet reported to Sw
> Unavailable - other (e.g. gap which cannot be used or just processed
> Done, but not refilled (made available to HW).
> 
> > Let's define Rx queue "fullness":
> > 	Fullness = ready-Rx-buffers/Rxq-size
> 
> i.e. number of DONE descriptors divided by RxQ size
> 
> > On the opposite, we have "emptiness"
> > 	Emptiness = usable-Rx-buffers/Rxq-size
> 
> i.e. number of AVAIL descriptors divided by RxQ size
> Note, that AVAIL != RxQ-size - DONE
> 
> HW really knows number of available descriptors by its nature.
> It is a space between latest done and latest received on refill.
> 
> HW does not know which descriptors are DONE, since some which
> are DONE before could be already processed by SW, but not yet
> made available again.
> 
> 
> > Here "fill threshold" describes "fullness", it's not "refill" described in you above words. Because in your words, "refill" is the opposite, it's filling "usable/free Rx buffers", or "emptiness".
> > 
> > I can only briefly explain how mlx5 works to get LWM, because I'm not a Firmware guy.
> > Mlx5 Rx queue is basically RDMA queue. It has two indexes: producer index which increases when HW fills in packet, consumer index which increases when SW consumes the packet.
> > The queue size is known when it's created. The fullness is something like (producer_index - consumer_index) (I don't consider in wrap-around here).
> > So mlx5 has the way to get the fullness or emptiness in HW or FW.
> > Another detail is mlx5 uses the term "LWM"(limit watermark), which describes "emptiness". When usable-Rx-buffers is below LWM, we trigger an event.
> > But Thomas think "fullness" is easier to understand, so we use "fullness" in rte APIs and we'll translate it to LWM in mlx5 PMD.

I may be wrong :)

> HW simply does not know fullness and therefore can't generate any
> events based on it. The problem on Rx is when there are no available
> descriptors, i.e. emptiness.

So you think "empty_thresh" would be better?
Or "avail_thresh"?

> >>> Fill threshold is defined as a percentage of Rx queue size with valid
> >>> value of [0,99].
> >>> Setting fill threshold to 0 means disable it, which is the default.
> >>> Add fill threshold configuration and query driver callbacks in eth_dev_ops.
> >>> Add command line options to support fill_thresh per-rxq configure.
> >>> - Command syntax:
> >>>     set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
> >>>
> >>> - Example commands:
> >>> To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
> >>> testpmd> set port 1 rxq 0 fill_thresh 30
> >>>
> >>> To disable fill_thresh on port 1 rxq 0:
> >>> testpmd> set port 1 rxq 0 fill_thresh 0
> >>>
> >>> Signed-off-by: Spike Du <spiked@nvidia.com>




^ permalink raw reply	[flat|nested] 131+ messages in thread

* RE: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-06 17:15                 ` Andrew Rybchenko
  2022-06-06 21:30                   ` Thomas Monjalon
@ 2022-06-07  6:00                   ` Spike Du
  1 sibling, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07  6:00 UTC (permalink / raw)
  To: Andrew Rybchenko, Matan Azrad, Slava Ovsiienko, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	Wenzhuo Lu, Beilei Xing, Bernard Iremonger, Ray Kinsella,
	Neil Horman
  Cc: stephen, mb, dev, Raslan Darawsheh

If you think HW knows "emptiness", and we want to stick to DPDK descriptor terms,
can we use the name "avail_thresh"?
When HW sees that the number of available descriptors is under this threshold, an event is triggered.
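
A minimal sketch of that condition, assuming the threshold stays a
percentage of the Rx queue size with 0 meaning disabled (as in the
current fill_thresh definition). The function name and the round-down
choice are assumptions for illustration, not an existing DPDK API.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check (illustration only): should an event fire for a
 * queue with nb_avail available descriptors out of nb_desc?
 */
static inline bool
avail_thresh_reached(uint16_t nb_avail, uint16_t nb_desc,
		     uint8_t avail_thresh_pct)
{
	uint32_t limit;

	if (avail_thresh_pct == 0)
		return false; /* monitoring disabled */
	/* Assumed rounding: integer division rounds the limit down. */
	limit = (uint32_t)nb_desc * avail_thresh_pct / 100;
	return nb_avail < limit;
}
```

For a 512-descriptor queue and avail_thresh 30, the limit is 153
descriptors, so the condition holds at 152 available descriptors and
not at 153.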

> -----Original Message-----
> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> Sent: Tuesday, June 7, 2022 1:16 AM
> To: Spike Du <spiked@nvidia.com>; Matan Azrad <matan@nvidia.com>;
> Slava Ovsiienko <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>;
> NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> Wenzhuo Lu <wenzhuo.lu@intel.com>; Beilei Xing <beilei.xing@intel.com>;
> Bernard Iremonger <bernard.iremonger@intel.com>; Ray Kinsella
> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>
> Cc: stephen@networkplumber.org; mb@smartsharesystems.com;
> dev@dpdk.org; Raslan Darawsheh <rasland@nvidia.com>
> Subject: Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
> 
> External email: Use caution opening links or attachments
> 
> 
> On 6/6/22 16:16, Spike Du wrote:
> > Hi Andrew,
> >       Please see below for "fill threshold" concept, I'm ok with other
> comments about code.
> >
> > Regards,
> > Spike.
> >
> >
> >> -----Original Message-----
> >> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
> >> Sent: Saturday, June 4, 2022 8:46 PM
> >> To: Spike Du <spiked@nvidia.com>; Matan Azrad <matan@nvidia.com>;
> >> Slava Ovsiienko <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>;
> >> NBU-Contact-Thomas Monjalon (EXTERNAL) <thomas@monjalon.net>;
> Wenzhuo
> >> Lu <wenzhuo.lu@intel.com>; Beilei Xing <beilei.xing@intel.com>;
> >> Bernard Iremonger <bernard.iremonger@intel.com>; Ray Kinsella
> >> <mdr@ashroe.eu>; Neil Horman <nhorman@tuxdriver.com>
> >> Cc: stephen@networkplumber.org; mb@smartsharesystems.com;
> >> dev@dpdk.org; Raslan Darawsheh <rasland@nvidia.com>
> >> Subject: Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill
> >> threshold
> >>
> >> External email: Use caution opening links or attachments
> >>
> >>
> >> On 6/3/22 15:48, Spike Du wrote:
> >>> Fill threshold describes the fullness of a Rx queue. If the Rx queue
> >>> fullness is above the threshold, the device will trigger the event
> >>> RTE_ETH_EVENT_RX_FILL_THRESH.
> >>
> >> Sorry, I'm not sure that I understand. As far as I know the process
> >> to add more Rx buffers to Rx queue is called 'refill' in many
> >> drivers. So fill level is a number (or percentage) of free buffers in an Rx
> queue.
> >> If so, fill threshold should be a minimum fill level and below the
> >> level we should generate an event.
> >>
> >> However, reading the first paragraph of the description, it looks like
> >> you mean the opposite thing - a number (or percentage) of ready Rx buffers
> >> with received packets.
> >>
> >> I think that the term "fill threshold" was suggested by me, but I did
> >> it with my understanding of the added feature. Now I'm confused.
> >>
> >> Moreover, I don't understand how "fill threshold" could be in terms
> >> of ready Rx buffers. HW simply doesn't really know when ready Rx
> >> buffers are processed by SW. So, HW can't say for sure how many ready
> >> Rx buffers are pending. It could be calculated as Rx queue size minus
> >> number of free Rx buffers, but it is imprecise. First of all, not all Rx
> descriptors could be used.
> >> Second, HW ring size could differ from the queue size specified in SW.
> >> Queue size specified in SW could just limit the maximum number of free Rx
> >> buffers provided by the driver.
> >>
> >
> > Let me use other terms because "fill"/"refill" is also ambiguous to me.
> > In a RX ring, there are Rx buffers with received packets, you call it
> > "ready Rx buffers", there is a RTE api rte_eth_rx_queue_count() to get the
> number, It's also called "used descriptors" in the code.
> > Also there are Rx buffers provided by SW to allow HW "fill in" received
> packets, we can call it "usable Rx buffers" (here "usable" means usable for
> HW).
> 
> May be it is better to stick to Rx descriptor status terminology?
> Available - Rx descriptor available to HW to put received packet to Done - Rx
> descriptor with received packet reported to Sw Unavailable - other (e.g. gap
> which cannot be used or just processed Done, but not refilled (made
> available to HW).
> 
> > Let's define Rx queue "fullness":
> >       Fullness = ready-Rx-buffers/Rxq-size
> 
> i.e. number of DONE descriptors divided by RxQ size
> 
> > On the opposite, we have "emptiness"
> >       Emptiness = usable-Rx-buffers/Rxq-size
> 
> i.e. number of AVAIL descriptors divided by RxQ size Note, that AVAIL !=
> RxQ-size - DONE
> 
> HW really knows number of available descriptors by its nature.
> It is a space between latest done and latest received on refill.
> 
> HW does not know which descriptors are DONE, since some which are DONE
> before could be already processed by SW, but not yet made available again.
> 
> 
> > Here "fill threshold" describes "fullness", it's not "refill" described in you
> above words. Because in your words, "refill" is the opposite, it's filling
> "usable/free Rx buffers", or "emptiness".
> >
> > I can only briefly explain how mlx5 works to get LWM, because I'm not a
> Firmware guy.
> > Mlx5 Rx queue is basically RDMA queue. It has two indexes: producer index
> which increases when HW fills in packet, consumer index which increases
> when SW consumes the packet.
> > The queue size is known when it's created. The fullness is something like
> (producer_index - consumer_index) (I don't consider in wrap-around here).
> > So mlx5 has the way to get the fullness or emptiness in HW or FW.
> > Another detail is mlx5 uses the term "LWM"(limit watermark), which
> describes "emptiness". When usable-Rx-buffers is below LWM, we trigger an
> event.
> > But Thomas think "fullness" is easier to understand, so we use "fullness" in
> rte APIs and we'll translate it to LWM in mlx5 PMD.
> 
> HW simply does now know fullness and there can't generate any events
> based on it. It is a problem on Rx when there is now available descriptors.  I.e.
> emptiness.
> 
> >>> Fill threshold is defined as a percentage of Rx queue size with
> >>> valid value of [0,99].
> >>> Setting fill threshold to 0 means disable it, which is the default.
> >>> Add fill threshold configuration and query driver callbacks in
> eth_dev_ops.
> >>> Add command line options to support fill_thresh per-rxq configure.
> >>> - Command syntax:
> >>>     set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
> >>>
> >>> - Example commands:
> >>> To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
> >>> testpmd> set port 1 rxq 0 fill_thresh 30
> >>>
> >>> To disable fill_thresh on port 1 rxq 0:
> >>> testpmd> set port 1 rxq 0 fill_thresh 0
> >>>
> >>> Signed-off-by: Spike Du <spiked@nvidia.com>
> 
> [snip]

^ permalink raw reply	[flat|nested] 131+ messages in thread

* Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
  2022-06-06 21:30                   ` Thomas Monjalon
@ 2022-06-07  8:02                     ` Andrew Rybchenko
  0 siblings, 0 replies; 131+ messages in thread
From: Andrew Rybchenko @ 2022-06-07  8:02 UTC (permalink / raw)
  To: Thomas Monjalon, Spike Du
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, Wenzhuo Lu, Beilei Xing,
	Bernard Iremonger, Ray Kinsella, dev, stephen, mb,
	Raslan Darawsheh, ferruh.yigit

On 6/7/22 00:30, Thomas Monjalon wrote:
> It seems we share a common understanding
> and we need to agree on a good wording
> for the most meaningful API.
> Questions inline below:
> 
> 06/06/2022 19:15, Andrew Rybchenko:
>> On 6/6/22 16:16, Spike Du wrote:
>>> Hi Andrew,
>>> 	Please see below for "fill threshold" concept, I'm ok with other comments about code.
>>>
>>> Regards,
>>> Spike.
>>>
>>>
>>> From: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
>>>> On 6/3/22 15:48, Spike Du wrote:
>>>>> Fill threshold describes the fullness of a Rx queue. If the Rx queue
>>>>> fullness is above the threshold, the device will trigger the event
>>>>> RTE_ETH_EVENT_RX_FILL_THRESH.
>>>>
>>>> Sorry, I'm not sure that I understand. As far as I know the process to add
>>>> more Rx buffers to Rx queue is called 'refill' in many drivers. So fill level is a
>>>> number (or percentage) of free buffers in an Rx queue.
>>>> If so, fill threashold should be a minimum fill level and below the level we
>>>> should generate an event.
>>>>
>>>> However reading the first paragraph of the descrition it looks like you mean
>>>> oposite thing - a number (or percentage) of ready Rx buffers with received
>>>> packets.
>>>>
>>>> I think that the term "fill threshold" is suggested by me, but I did it with mine
>>>> understanding of the added feature. Now I'm confused.
>>>>
>>>> Moreover, I don't understand how "fill threshold" could be in terms of ready
>>>> Rx buffers. HW simply don't really know when ready Rx buffers are
>>>> processed by SW. So, HW can't say for sure how many ready Rx buffers are
>>>> pending. It could be calculated as Rx queue size minus number of free Rx
>>>> buffers, but it is imprecise. First of all not all Rx descriptors could be used.
>>>> Second, HW ring size could differ queue size specified in SW.
>>>> Queue size specified in SW could just limit maximum nubmer of free Rx
>>>> buffers provided by the driver.
>>>>
>>>
>>> Let me use other terms because "fill"/"refill" is also ambiguous to me.
>>> In a RX ring, there are Rx buffers with received packets, you call it "ready Rx buffers", there is a RTE api rte_eth_rx_queue_count() to get the number,
>>> It's also called "used descriptors" in the code.
>>> Also there are Rx buffers provided by SW to allow HW "fill in" received packets, we can call it "usable Rx buffers" (here "usable" means usable for HW).
>>
>> May be it is better to stick to Rx descriptor status terminology?
>> Available - Rx descriptor available to HW to put received packet to
>> Done - Rx descriptor with received packet reported to Sw
>> Unavailable - other (e.g. gap which cannot be used or just processed
>> Done, but not refilled (made available to HW).
>>
>>> Let's define Rx queue "fullness":
>>> 	Fullness = ready-Rx-buffers/Rxq-size
>>
>> i.e. number of DONE descriptors divided by RxQ size
>>
>>> On the opposite, we have "emptiness"
>>> 	Emptiness = usable-Rx-buffers/Rxq-size
>>
>> i.e. number of AVAIL descriptors divided by RxQ size
>> Note, that AVAIL != RxQ-size - DONE
>>
>> HW really knows number of available descriptors by its nature.
>> It is a space between latest done and latest received on refill.
>>
>> HW does not know which descriptors are DONE, since some which
>> are DONE before could be already processed by SW, but not yet
>> made available again.
>>
>>
>>> Here "fill threshold" describes "fullness", it's not "refill" described in you above words. Because in your words, "refill" is the opposite, it's filling "usable/free Rx buffers", or "emptiness".
>>>
>>> I can only briefly explain how mlx5 works to get LWM, because I'm not a Firmware guy.
>>> Mlx5 Rx queue is basically RDMA queue. It has two indexes: producer index which increases when HW fills in packet, consumer index which increases when SW consumes the packet.
>>> The queue size is known when it's created. The fullness is something like (producer_index - consumer_index) (I don't consider in wrap-around here).
>>> So mlx5 has the way to get the fullness or emptiness in HW or FW.
>>> Another detail is mlx5 uses the term "LWM"(limit watermark), which describes "emptiness". When usable-Rx-buffers is below LWM, we trigger an event.
>>> But Thomas think "fullness" is easier to understand, so we use "fullness" in rte APIs and we'll translate it to LWM in mlx5 PMD.
> 
> I may be wrong :)
> 
>> HW simply does now know fullness and there can't generate any events
>> based on it. It is a problem on Rx when there is now available
>> descriptors.  I.e. emptiness.
> 
> So you think "empty_thresh" would be better?
> Or "avail_thresh"?

I'd go for "avail_thresh" since it is consistent with the descriptor
status API terminology.

For a moment I thought there was a problem with rx_free_threshold, but I
finally realized that there is none, since it is expressed in terms of
free descriptors which could be made available (refilled).
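
To make the "available" accounting concrete, here is a minimal sketch,
assuming free-running 16-bit ring counters. The names rxq_idx, refill_idx,
done_idx and both helpers are hypothetical and are not taken from any
driver; the sketch only shows why the available count is cheap to derive
and how a percentage threshold could be checked against it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical free-running 16-bit ring counters (illustrative names). */
struct rxq_idx {
	uint16_t refill_idx; /* Descriptors made available to HW so far. */
	uint16_t done_idx;   /* Descriptors completed by HW so far. */
};

/*
 * Number of descriptors currently available to HW.
 * The subtraction is truncated to 16 bits on purpose, so the result
 * stays correct after either counter wraps around.
 */
static inline uint16_t
rxq_avail_count(const struct rxq_idx *q)
{
	return (uint16_t)(q->refill_idx - q->done_idx);
}

/*
 * True when the available count is below thresh_pct percent of
 * queue_size. A threshold of 0 means the check is disabled.
 */
static inline bool
rxq_avail_below_thresh(const struct rxq_idx *q, uint16_t queue_size,
		       uint8_t thresh_pct)
{
	if (thresh_pct == 0)
		return false;
	return (uint32_t)rxq_avail_count(q) * 100 <
	       (uint32_t)queue_size * thresh_pct;
}
```

Note that the sketch never needs the number of DONE descriptors, which is
the quantity HW cannot know.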

>>>>> Fill threshold is defined as a percentage of Rx queue size with valid
>>>>> value of [0,99].
>>>>> Setting fill threshold to 0 means disable it, which is the default.
>>>>> Add fill threshold configuration and query driver callbacks in eth_dev_ops.
>>>>> Add command line options to support fill_thresh per-rxq configure.
>>>>> - Command syntax:
>>>>>      set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num>
>>>>>
>>>>> - Example commands:
>>>>> To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
>>>>> testpmd> set port 1 rxq 0 fill_thresh 30
>>>>>
>>>>> To disable fill_thresh on port 1 rxq 0:
>>>>> testpmd> set port 1 rxq 0 fill_thresh 0
>>>>>
>>>>> Signed-off-by: Spike Du <spiked@nvidia.com>
> 
> 
> 


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper
  2022-06-03 12:48         ` [PATCH v4 0/7] introduce per-queue fill threshold " Spike Du
                             ` (6 preceding siblings ...)
  2022-06-03 12:48           ` [PATCH v4 7/7] app/testpmd: add Host Shaper command Spike Du
@ 2022-06-07 12:59           ` Spike Du
  2022-06-07 12:59             ` [PATCH v5 1/7] net/mlx5: add LWM support for Rxq Spike Du
                               ` (8 more replies)
  7 siblings, 9 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

The available descriptor threshold (ADT for short) is a per-Rx-queue attribute: when the number of Rx queue descriptors available to HW falls below the ADT, HW sends an event to the application.
The host shaper can configure the shaper rate and the avail_thresh-triggered flag for a host port.
The shaper limits the rate of traffic from the host port to the embedded ARM Rx port on the Nvidia BlueField 2 NIC.
If avail_thresh-triggered is enabled, a 100Mbps shaper is enabled automatically when one of the host port's Rx queues receives an available descriptor threshold event.

These two features can be combined to control traffic from the host port to the wire port on the BlueField 2 NIC.
The traffic flows from the host to the embedded ARM, then to the physical port.
The workflow, on the ARM system, is: configure the available descriptor threshold on the Rx queue and enable the avail_thresh-triggered flag in the host shaper; after receiving an available descriptor threshold event, wait until the Rx queue is empty, then disable the shaper. This workflow is repeated to reduce Rx queue drops on the ARM system.

Add a new ethdev API to set the available descriptor threshold, and add the event RTE_ETH_EVENT_RX_AVAIL_THRESH to handle the available descriptor threshold event. Because the host shaper does not fit the existing DPDK framework and is specific to Nvidia NICs, it uses a PMD private API.

For integration with testpmd, put the private cmdline function and the available descriptor threshold event handler in the mlx5 PMD directory by adding a new file, mlx5_testpmd.c. Follow David Marchand's driver-specific commands framework to add mlx5-specific commands.
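
The workflow above can be sketched as a small state model. This is only an illustration under stated assumptions: host_shaper_model and the two model_* functions are hypothetical names, they do not call any ethdev or mlx5 API, and the real event delivery and queue draining are replaced by plain function arguments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the host shaper state; not the mlx5 PMD structures. */
struct host_shaper_model {
	bool avail_thresh_triggered; /* Flag armed by the application. */
	bool shaper_active;          /* Set when the threshold event fires. */
};

/* HW side: the event activates the shaper only if the flag is armed. */
static void
model_avail_thresh_event(struct host_shaper_model *s)
{
	if (s->avail_thresh_triggered)
		s->shaper_active = true;
}

/*
 * Application side: called after the event while draining the queue.
 * Once no packets are pending, disable the shaper and return true;
 * otherwise leave the shaper on and return false.
 */
static bool
model_app_poll(struct host_shaper_model *s, uint16_t pending_pkts)
{
	if (!s->shaper_active || pending_pkts != 0)
		return false;
	s->shaper_active = false;
	return true;
}
```

In this model the application keeps calling model_app_poll() with the current number of pending packets until it returns true, after which the cycle can start again.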


Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based available descriptor threshold
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based available descriptor threshold
  net/mlx5: add private API to config host port shaper
  app/testpmd: add Host Shaper command

 app/test-pmd/cmdline.c                       |  68 +++++++
 app/test-pmd/config.c                        |  21 ++
 app/test-pmd/testpmd.c                       |  24 +++
 app/test-pmd/testpmd.h                       |   2 +
 doc/guides/nics/mlx5.rst                     |  93 +++++++++
 doc/guides/rel_notes/release_22_07.rst       |   2 +
 drivers/common/mlx5/linux/meson.build        |  13 ++
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h               |  26 +++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 -------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 +++---------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 +----
 drivers/net/mlx5/meson.build                 |   4 +
 drivers/net/mlx5/mlx5.c                      |  68 +++++++
 drivers/net/mlx5/mlx5.h                      |  12 +-
 drivers/net/mlx5/mlx5_devx.c                 |  60 +++++-
 drivers/net/mlx5/mlx5_devx.h                 |   1 +
 drivers/net/mlx5/mlx5_rx.c                   | 289 +++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h                   |  13 ++
 drivers/net/mlx5/mlx5_testpmd.c              | 201 +++++++++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h              |  26 +++
 drivers/net/mlx5/mlx5_txpp.c                 |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h              |  30 +++
 drivers/net/mlx5/version.map                 |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 +----
 lib/ethdev/ethdev_driver.h                   |  22 ++
 lib/ethdev/rte_ethdev.c                      |  52 +++++
 lib/ethdev/rte_ethdev.h                      |  73 +++++++
 lib/ethdev/version.map                       |   2 +
 33 files changed, 1318 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH v5 1/7] net/mlx5: add LWM support for Rxq
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
@ 2022-06-07 12:59             ` Spike Du
  2022-06-08 20:10               ` Matan Azrad
  2022-06-07 12:59             ` [PATCH v5 2/7] common/mlx5: share interrupt management Spike Du
                               ` (7 subsequent siblings)
  8 siblings, 1 reply; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Add an lwm (Limit WaterMark) field to the Rxq object, which indicates the
percentage of the Rx queue size used by HW to raise an LWM event to the user.
Allow setting the LWM in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.h      |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 ++++++++++++-
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)
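
The attribute selection done in the mlx5_devx.c hunk of this patch can be
sketched standalone as follows. This is a minimal model, not the driver
code: the enum values, RQ_MODIFY_BITMASK_LWM, struct rq_modify_attr and
rq_fill_modify_attr() are hypothetical stand-ins chosen for illustration.
It shows that the reset-to-ready transition carries the LWM only when one
is configured, while the ready-to-ready transition always carries it.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the driver definitions; values are illustrative. */
enum rq_state { RQ_STATE_RST, RQ_STATE_RDY };
enum rq_mod_type { RQ_MOD_RST2RDY, RQ_MOD_RDY2RDY };
#define RQ_MODIFY_BITMASK_LWM (UINT64_C(1) << 0)

struct rq_modify_attr {
	enum rq_state cur_state; /* State the queue is expected to be in. */
	enum rq_state new_state; /* State requested by the command. */
	uint64_t modify_bitmask; /* Which optional fields are valid. */
	uint16_t lwm;            /* Limit watermark to program. */
};

/* Fill the modify attributes for the given transition type. */
static void
rq_fill_modify_attr(struct rq_modify_attr *attr, enum rq_mod_type type,
		    uint16_t lwm)
{
	memset(attr, 0, sizeof(*attr));
	switch (type) {
	case RQ_MOD_RST2RDY:
		attr->cur_state = RQ_STATE_RST;
		attr->new_state = RQ_STATE_RDY;
		/* Program the LWM at start only if one is configured. */
		if (lwm != 0) {
			attr->modify_bitmask |= RQ_MODIFY_BITMASK_LWM;
			attr->lwm = lwm;
		}
		break;
	case RQ_MOD_RDY2RDY:
		attr->cur_state = RQ_STATE_RDY;
		attr->new_state = RQ_STATE_RDY;
		/* Always write the LWM, so that 0 can clear it at runtime. */
		attr->modify_bitmask |= RQ_MODIFY_BITMASK_LWM;
		attr->lwm = lwm;
		break;
	}
}
```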

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee..305edff 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
 	MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
 	MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
 	MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+	MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f94..c918a50 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
 	struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
 	case MLX5_RXQ_MOD_RST2RDY:
 		rq_attr.rq_state = MLX5_RQC_STATE_RST;
 		rq_attr.state = MLX5_RQC_STATE_RDY;
+		if (rxq->lwm) {
+			rq_attr.modify_bitmask |=
+				MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+			rq_attr.lwm = rxq->lwm;
+		}
 		break;
 	case MLX5_RXQ_MOD_RDY2ERR:
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
 		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
 		rq_attr.state = MLX5_RQC_STATE_RST;
 		break;
+	case MLX5_RXQ_MOD_RDY2RDY:
+		rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+		rq_attr.state = MLX5_RQC_STATE_RDY;
+		rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+		rq_attr.lwm = rxq->lwm;
+		break;
 	default:
 		break;
 	}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 			 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
 	struct mlx5_devx_rq devx_rq;
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
+	uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 131+ messages in thread

* [PATCH v5 2/7] common/mlx5: share interrupt management
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
  2022-06-07 12:59             ` [PATCH v5 1/7] net/mlx5: add LWM support for Rxq Spike Du
@ 2022-06-07 12:59             ` Spike Du
  2022-06-07 12:59             ` [PATCH v5 3/7] ethdev: introduce Rx queue based available descriptor threshold Spike Du
                               ` (6 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Ray Kinsella
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

There is a lot of duplicated code for creating and initializing
rte_intr_handle. Add a new mlx5_os API to do this, and replace all
related PMD code with this API.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++++++++++++++++++++++++++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map              |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  71 --------------
 drivers/net/mlx5/linux/mlx5_os.c             | 132 ++++++---------------------
 drivers/net/mlx5/linux/mlx5_socket.c         |  53 ++---------
 drivers/net/mlx5/mlx5.h                      |   2 -
 drivers/net/mlx5/mlx5_txpp.c                 |  28 ++----
 drivers/net/mlx5/windows/mlx5_ethdev_os.c    |  22 -----
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c          |  48 ++--------
 11 files changed, 217 insertions(+), 307 deletions(-)
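
As a side note on the set_fd_nonblock step of the new helper, a standalone
sketch of the O_NONBLOCK pattern is shown here. The function name
fd_set_nonblock() is hypothetical and not part of this patch; unlike the
helper in the patch, the sketch also checks the F_GETFL result before
using it, which is an assumption about what a stricter version could do.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Set O_NONBLOCK on fd while preserving the other status flags.
 * Returns 0 on success, a negative errno value otherwise.
 */
static int
fd_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags < 0)
		return -errno;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
		return -errno;
	return 0;
}
```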

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include <dirent.h>
 #include <net/if.h>
+#include <fcntl.h>
 
 #include <rte_errno.h>
 #include <rte_string_fns.h>
@@ -964,3 +965,133 @@
 		claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
 	memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	struct rte_intr_handle *tmp_intr_handle;
+	int ret, flags;
+
+	tmp_intr_handle = rte_intr_instance_alloc(mode);
+	if (!tmp_intr_handle) {
+		rte_errno = ENOMEM;
+		goto err;
+	}
+	if (set_fd_nonblock) {
+		flags = fcntl(fd, F_GETFL);
+		ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+		if (ret) {
+			rte_errno = errno;
+			goto err;
+		}
+	}
+	ret = rte_intr_fd_set(tmp_intr_handle, fd);
+	if (ret)
+		goto err;
+	ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+	if (ret)
+		goto err;
+	ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+	if (ret) {
+		rte_errno = -ret;
+		goto err;
+	}
+	return tmp_intr_handle;
+err:
+	if (tmp_intr_handle)
+		rte_intr_instance_free(tmp_intr_handle);
+	return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+			      rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+	uint64_t twait = 0;
+	uint64_t start = 0;
+
+	do {
+		int ret;
+
+		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+		if (ret >= 0)
+			return;
+		if (ret != -EAGAIN) {
+			DRV_LOG(INFO, "failed to unregister interrupt"
+				      " handler (error: %d)", ret);
+			MLX5_ASSERT(false);
+			return;
+		}
+		if (twait) {
+			struct timespec onems;
+
+			/* Wait one millisecond and try again. */
+			onems.tv_sec = 0;
+			onems.tv_nsec = NS_PER_S / MS_PER_S;
+			nanosleep(&onems, 0);
+			/* Check whether one second elapsed. */
+			if ((rte_get_timer_cycles() - start) <= twait)
+				continue;
+		} else {
+			/*
+			 * We get the amount of timer ticks for one second.
+			 * If this amount elapsed it means we spent one
+			 * second in waiting. This branch is executed once
+			 * on first iteration.
+			 */
+			twait = rte_get_timer_hz();
+			MLX5_ASSERT(twait);
+		}
+		/*
+		 * Timeout elapsed, show message (once a second) and retry.
+		 * We have no other acceptable option here, if we ignore
+		 * the unregistering return code the handler will not
+		 * be unregistered, fd will be closed and we may get the
+		 * crush. Hanging and messaging in the loop seems not to be
+		 * the worst choice.
+		 */
+		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
+		start = rte_get_timer_cycles();
+	} while (true);
+}
+
+/**
+ * Rte_intr_handle destroy helper.
+ *
+ * @param[in] intr_handle
+ *   Rte_intr_handle to destroy.
+ * @param[in] cb
+ *   Callback which is registered to intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ */
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	if (rte_intr_fd_get(intr_handle) >= 0)
+		mlx5_intr_callback_unregister(intr_handle, cb, cb_arg);
+	rte_intr_instance_free(intr_handle);
+}
diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index 27f1192..479bb3c 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -15,6 +15,7 @@
 #include <rte_log.h>
 #include <rte_kvargs.h>
 #include <rte_devargs.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -299,4 +300,14 @@
 int
 mlx5_get_device_guid(const struct rte_pci_addr *dev, uint8_t *guid, size_t len);
 
+__rte_internal
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg);
+
+__rte_internal
+void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg);
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a23a30a..413dec1 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -153,5 +153,7 @@ INTERNAL {
 	mlx5_mr_mempool2mr_bh;
 	mlx5_mr_mempool_populate_cache;
 
+	mlx5_os_interrupt_handler_create; # WINDOWS_NO_EXPORT
+	mlx5_os_interrupt_handler_destroy; # WINDOWS_NO_EXPORT
 	local: *;
 };
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index ee7973f..e9e9108 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -9,6 +9,7 @@
 #include <sys/types.h>
 
 #include <rte_errno.h>
+#include <rte_interrupts.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_glue.h"
@@ -253,4 +254,27 @@
 __rte_internal
 int mlx5_os_umem_dereg(void *pumem);
 
+static inline struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+				 rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)mode;
+	(void)set_fd_nonblock;
+	(void)fd;
+	(void)cb;
+	(void)cb_arg;
+	rte_errno = ENOTSUP;
+	return NULL;
+}
+
+static inline void
+mlx5_os_interrupt_handler_destroy(struct rte_intr_handle *intr_handle,
+				  rte_intr_callback_fn cb, void *cb_arg)
+{
+	(void)intr_handle;
+	(void)cb;
+	(void)cb_arg;
+}
+
+
 #endif /* RTE_PMD_MLX5_COMMON_OS_H_ */
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 8fe73f1..a276b2b 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -881,77 +881,6 @@ struct ethtool_link_settings {
 	}
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	/*
-	 * Try to reduce timeout management overhead by not calling
-	 * the timer related routines on the first iteration. If the
-	 * unregistering succeeds on first call there will be no
-	 * timer calls at all.
-	 */
-	uint64_t twait = 0;
-	uint64_t start = 0;
-
-	do {
-		int ret;
-
-		ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
-		if (ret >= 0)
-			return;
-		if (ret != -EAGAIN) {
-			DRV_LOG(INFO, "failed to unregister interrupt"
-				      " handler (error: %d)", ret);
-			MLX5_ASSERT(false);
-			return;
-		}
-		if (twait) {
-			struct timespec onems;
-
-			/* Wait one millisecond and try again. */
-			onems.tv_sec = 0;
-			onems.tv_nsec = NS_PER_S / MS_PER_S;
-			nanosleep(&onems, 0);
-			/* Check whether one second elapsed. */
-			if ((rte_get_timer_cycles() - start) <= twait)
-				continue;
-		} else {
-			/*
-			 * We get the amount of timer ticks for one second.
-			 * If this amount elapsed it means we spent one
-			 * second in waiting. This branch is executed once
-			 * on first iteration.
-			 */
-			twait = rte_get_timer_hz();
-			MLX5_ASSERT(twait);
-		}
-		/*
-		 * Timeout elapsed, show message (once a second) and retry.
-		 * We have no other acceptable option here, if we ignore
-		 * the unregistering return code the handler will not
-		 * be unregistered, fd will be closed and we may get the
-		 * crush. Hanging and messaging in the loop seems not to be
-		 * the worst choice.
-		 */
-		DRV_LOG(INFO, "Retrying to unregister interrupt handler");
-		start = rte_get_timer_cycles();
-	} while (true);
-}
-
 /**
  * Handle DEVX interrupts from the NIC.
  * This function is probably called from the DPDK host thread.
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index a821153..0741028 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2494,40 +2494,6 @@
 	mlx5_pmd_socket_uninit();
 }
 
-static int
-mlx5_os_dev_shared_handler_install_lsc(struct mlx5_dev_ctx_shared *sh)
-{
-	int nlsk_fd, flags, ret;
-
-	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
-	if (nlsk_fd < 0) {
-		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
-			rte_strerror(rte_errno));
-		return -1;
-	}
-	flags = fcntl(nlsk_fd, F_GETFL);
-	ret = fcntl(nlsk_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret != 0) {
-		DRV_LOG(ERR, "Failed to make Netlink event socket non-blocking: %s",
-			strerror(errno));
-		rte_errno = errno;
-		goto error;
-	}
-	rte_intr_type_set(sh->intr_handle_nl, RTE_INTR_HANDLE_EXT);
-	rte_intr_fd_set(sh->intr_handle_nl, nlsk_fd);
-	if (rte_intr_callback_register(sh->intr_handle_nl,
-				       mlx5_dev_interrupt_handler_nl,
-				       sh) != 0) {
-		DRV_LOG(ERR, "Failed to register Netlink events interrupt");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-		goto error;
-	}
-	return 0;
-error:
-	close(nlsk_fd);
-	return -1;
-}
-
 /**
  * Install shared asynchronous device events handler.
  * This function is implemented to support event sharing
@@ -2539,76 +2505,47 @@
 void
 mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
-	int ret;
-	int flags;
 	struct ibv_context *ctx = sh->cdev->ctx;
+	int nlsk_fd;
 
-	sh->intr_handle = rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
+	sh->intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 ctx->async_fd, mlx5_dev_interrupt_handler, sh);
+	if (!sh->intr_handle) {
+		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle, -1);
-
-	flags = fcntl(ctx->async_fd, F_GETFL);
-	ret = fcntl(ctx->async_fd, F_SETFL, flags | O_NONBLOCK);
-	if (ret) {
-		DRV_LOG(INFO, "failed to change file descriptor async event"
-			" queue");
-	} else {
-		rte_intr_fd_set(sh->intr_handle, ctx->async_fd);
-		rte_intr_type_set(sh->intr_handle, RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle,
-					mlx5_dev_interrupt_handler, sh)) {
-			DRV_LOG(INFO, "Fail to install the shared interrupt.");
-			rte_intr_fd_set(sh->intr_handle, -1);
-		}
+	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
+	if (nlsk_fd < 0) {
+		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
+			rte_strerror(rte_errno));
+		return;
 	}
-	sh->intr_handle_nl = rte_intr_instance_alloc
-						(RTE_INTR_INSTANCE_F_SHARED);
+	sh->intr_handle_nl = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 nlsk_fd, mlx5_dev_interrupt_handler_nl, sh);
 	if (sh->intr_handle_nl == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		rte_errno = ENOMEM;
 		return;
 	}
-	rte_intr_fd_set(sh->intr_handle_nl, -1);
-	if (mlx5_os_dev_shared_handler_install_lsc(sh) < 0) {
-		DRV_LOG(INFO, "Fail to install the shared Netlink event handler.");
-		rte_intr_fd_set(sh->intr_handle_nl, -1);
-	}
 	if (sh->cdev->config.devx) {
 #ifdef HAVE_IBV_DEVX_ASYNC
-		sh->intr_handle_devx =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-		if (!sh->intr_handle_devx) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
-			rte_errno = ENOMEM;
-			return;
-		}
-		rte_intr_fd_set(sh->intr_handle_devx, -1);
+		struct mlx5dv_devx_cmd_comp *devx_comp;
+
 		sh->devx_comp = (void *)mlx5_glue->devx_create_cmd_comp(ctx);
-		struct mlx5dv_devx_cmd_comp *devx_comp = sh->devx_comp;
+		devx_comp = sh->devx_comp;
 		if (!devx_comp) {
 			DRV_LOG(INFO, "failed to allocate devx_comp.");
 			return;
 		}
-		flags = fcntl(devx_comp->fd, F_GETFL);
-		ret = fcntl(devx_comp->fd, F_SETFL, flags | O_NONBLOCK);
-		if (ret) {
-			DRV_LOG(INFO, "failed to change file descriptor"
-				" devx comp");
+		sh->intr_handle_devx = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 devx_comp->fd,
+			 mlx5_dev_interrupt_handler_devx, sh);
+		if (!sh->intr_handle_devx) {
+			DRV_LOG(ERR, "Failed to allocate intr_handle.");
 			return;
 		}
-		rte_intr_fd_set(sh->intr_handle_devx, devx_comp->fd);
-		rte_intr_type_set(sh->intr_handle_devx,
-					 RTE_INTR_HANDLE_EXT);
-		if (rte_intr_callback_register(sh->intr_handle_devx,
-					mlx5_dev_interrupt_handler_devx, sh)) {
-			DRV_LOG(INFO, "Fail to install the devx shared"
-				" interrupt.");
-			rte_intr_fd_set(sh->intr_handle_devx, -1);
-		}
 #endif /* HAVE_IBV_DEVX_ASYNC */
 	}
 }
@@ -2624,24 +2561,13 @@
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
-	int nlsk_fd;
-
-	if (rte_intr_fd_get(sh->intr_handle) >= 0)
-		mlx5_intr_callback_unregister(sh->intr_handle,
-					      mlx5_dev_interrupt_handler, sh);
-	rte_intr_instance_free(sh->intr_handle);
-	nlsk_fd = rte_intr_fd_get(sh->intr_handle_nl);
-	if (nlsk_fd >= 0) {
-		mlx5_intr_callback_unregister
-			(sh->intr_handle_nl, mlx5_dev_interrupt_handler_nl, sh);
-		close(nlsk_fd);
-	}
-	rte_intr_instance_free(sh->intr_handle_nl);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
+					  mlx5_dev_interrupt_handler, sh);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
+					  mlx5_dev_interrupt_handler_nl, sh);
 #ifdef HAVE_IBV_DEVX_ASYNC
-	if (rte_intr_fd_get(sh->intr_handle_devx) >= 0)
-		rte_intr_callback_unregister(sh->intr_handle_devx,
-				  mlx5_dev_interrupt_handler_devx, sh);
-	rte_intr_instance_free(sh->intr_handle_devx);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
+					  mlx5_dev_interrupt_handler_devx, sh);
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
diff --git a/drivers/net/mlx5/linux/mlx5_socket.c b/drivers/net/mlx5/linux/mlx5_socket.c
index 4882e5f..0e01aff 100644
--- a/drivers/net/mlx5/linux/mlx5_socket.c
+++ b/drivers/net/mlx5/linux/mlx5_socket.c
@@ -134,51 +134,6 @@
 }
 
 /**
- * Install interrupt handler.
- *
- * @param dev
- *   Pointer to Ethernet device.
- * @return
- *   0 on success, a negative errno value otherwise.
- */
-static int
-mlx5_pmd_interrupt_handler_install(void)
-{
-	MLX5_ASSERT(server_socket != -1);
-
-	server_intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_PRIVATE);
-	if (server_intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
-	if (rte_intr_fd_set(server_intr_handle, server_socket))
-		return -rte_errno;
-
-	if (rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	return rte_intr_callback_register(server_intr_handle,
-					  mlx5_pmd_socket_handle, NULL);
-}
-
-/**
- * Uninstall interrupt handler.
- */
-static void
-mlx5_pmd_interrupt_handler_uninstall(void)
-{
-	if (server_socket != -1) {
-		mlx5_intr_callback_unregister(server_intr_handle,
-					      mlx5_pmd_socket_handle,
-					      NULL);
-	}
-	rte_intr_fd_set(server_intr_handle, 0);
-	rte_intr_type_set(server_intr_handle, RTE_INTR_HANDLE_UNKNOWN);
-	rte_intr_instance_free(server_intr_handle);
-}
-
-/**
  * Initialise the socket to communicate with external tools.
  *
  * @return
@@ -224,7 +179,10 @@
 			strerror(errno));
 		goto remove;
 	}
-	if (mlx5_pmd_interrupt_handler_install()) {
+	server_intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_PRIVATE, false,
+		 server_socket, mlx5_pmd_socket_handle, NULL);
+	if (server_intr_handle == NULL) {
 		DRV_LOG(WARNING, "cannot register interrupt handler for mlx5 socket: %s",
 			strerror(errno));
 		goto remove;
@@ -248,7 +206,8 @@
 {
 	if (server_socket == -1)
 		return;
-	mlx5_pmd_interrupt_handler_uninstall();
+	mlx5_os_interrupt_handler_destroy(server_intr_handle,
+					  mlx5_pmd_socket_handle, NULL);
 	claim_zero(close(server_socket));
 	server_socket = -1;
 	MKSTR(path, MLX5_SOCKET_PATH, getpid());
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 305edff..7ebb2cc 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1682,8 +1682,6 @@ int mlx5_sysfs_switch_info(unsigned int ifindex,
 			   struct mlx5_switch_info *info);
 void mlx5_translate_port_name(const char *port_name_in,
 			      struct mlx5_switch_info *port_info_out);
-void mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-				   rte_intr_callback_fn cb_fn, void *cb_arg);
 int mlx5_sysfs_bond_info(unsigned int pf_ifindex, unsigned int *ifindex,
 			 char *ifname);
 int mlx5_get_module_info(struct rte_eth_dev *dev,
diff --git a/drivers/net/mlx5/mlx5_txpp.c b/drivers/net/mlx5/mlx5_txpp.c
index fe74317..f853a67 100644
--- a/drivers/net/mlx5/mlx5_txpp.c
+++ b/drivers/net/mlx5/mlx5_txpp.c
@@ -741,11 +741,8 @@
 static void
 mlx5_txpp_stop_service(struct mlx5_dev_ctx_shared *sh)
 {
-	if (!rte_intr_fd_get(sh->txpp.intr_handle))
-		return;
-	mlx5_intr_callback_unregister(sh->txpp.intr_handle,
-				      mlx5_txpp_interrupt_handler, sh);
-	rte_intr_instance_free(sh->txpp.intr_handle);
+	mlx5_os_interrupt_handler_destroy(sh->txpp.intr_handle,
+					  mlx5_txpp_interrupt_handler, sh);
 }
 
 /* Attach interrupt handler and fires first request to Rearm Queue. */
@@ -769,23 +766,12 @@
 		rte_errno = errno;
 		return -rte_errno;
 	}
-	sh->txpp.intr_handle =
-			rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
-	if (sh->txpp.intr_handle == NULL) {
-		DRV_LOG(ERR, "Fail to allocate intr_handle");
-		return -ENOMEM;
-	}
 	fd = mlx5_os_get_devx_channel_fd(sh->txpp.echan);
-	if (rte_intr_fd_set(sh->txpp.intr_handle, fd))
-		return -rte_errno;
-
-	if (rte_intr_type_set(sh->txpp.intr_handle, RTE_INTR_HANDLE_EXT))
-		return -rte_errno;
-
-	if (rte_intr_callback_register(sh->txpp.intr_handle,
-				       mlx5_txpp_interrupt_handler, sh)) {
-		rte_intr_fd_set(sh->txpp.intr_handle, 0);
-		DRV_LOG(ERR, "Failed to register CQE interrupt %d.", rte_errno);
+	sh->txpp.intr_handle = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, false,
+		 fd, mlx5_txpp_interrupt_handler, sh);
+	if (!sh->txpp.intr_handle) {
+		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		return -rte_errno;
 	}
 	/* Subscribe CQ event to the event channel controlled by the driver. */
diff --git a/drivers/net/mlx5/windows/mlx5_ethdev_os.c b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
index f975265..88d8213 100644
--- a/drivers/net/mlx5/windows/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/windows/mlx5_ethdev_os.c
@@ -140,28 +140,6 @@
 	return 0;
 }
 
-/*
- * Unregister callback handler safely. The handler may be active
- * while we are trying to unregister it, in this case code -EAGAIN
- * is returned by rte_intr_callback_unregister(). This routine checks
- * the return code and tries to unregister handler again.
- *
- * @param handle
- *   interrupt handle
- * @param cb_fn
- *   pointer to callback routine
- * @cb_arg
- *   opaque callback parameter
- */
-void
-mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
-			      rte_intr_callback_fn cb_fn, void *cb_arg)
-{
-	RTE_SET_USED(handle);
-	RTE_SET_USED(cb_fn);
-	RTE_SET_USED(cb_arg);
-}
-
 /**
  * DPDK callback to get flow control status.
  *
diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
index e025be4..fd447cc 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
@@ -93,22 +93,10 @@
 static int
 mlx5_vdpa_virtq_unset(struct mlx5_vdpa_virtq *virtq)
 {
-	int ret = -EAGAIN;
-
-	if (rte_intr_fd_get(virtq->intr_handle) >= 0) {
-		while (ret == -EAGAIN) {
-			ret = rte_intr_callback_unregister(virtq->intr_handle,
-					mlx5_vdpa_virtq_kick_handler, virtq);
-			if (ret == -EAGAIN) {
-				DRV_LOG(DEBUG, "Try again to unregister fd %d of virtq %hu interrupt",
-					rte_intr_fd_get(virtq->intr_handle),
-					virtq->index);
-				usleep(MLX5_VDPA_INTR_RETRIES_USEC);
-			}
-		}
-		rte_intr_fd_set(virtq->intr_handle, -1);
-	}
-	rte_intr_instance_free(virtq->intr_handle);
+	int ret;
+
+	mlx5_os_interrupt_handler_destroy(virtq->intr_handle,
+					  mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->virtq) {
 		ret = mlx5_vdpa_virtq_stop(virtq->priv, virtq->index);
 		if (ret)
@@ -365,35 +353,13 @@
 	virtq->priv = priv;
 	rte_write32(virtq->index, priv->virtq_db_addr);
 	/* Setup doorbell mapping. */
-	virtq->intr_handle =
-		rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+	virtq->intr_handle = mlx5_os_interrupt_handler_create(
+				  RTE_INTR_INSTANCE_F_SHARED, false,
+				  vq.kickfd, mlx5_vdpa_virtq_kick_handler, virtq);
 	if (virtq->intr_handle == NULL) {
 		DRV_LOG(ERR, "Fail to allocate intr_handle");
 		goto error;
 	}
-
-	if (rte_intr_fd_set(virtq->intr_handle, vq.kickfd))
-		goto error;
-
-	if (rte_intr_fd_get(virtq->intr_handle) == -1) {
-		DRV_LOG(WARNING, "Virtq %d kickfd is invalid.", index);
-	} else {
-		if (rte_intr_type_set(virtq->intr_handle, RTE_INTR_HANDLE_EXT))
-			goto error;
-
-		if (rte_intr_callback_register(virtq->intr_handle,
-					       mlx5_vdpa_virtq_kick_handler,
-					       virtq)) {
-			rte_intr_fd_set(virtq->intr_handle, -1);
-			DRV_LOG(ERR, "Failed to register virtq %d interrupt.",
-				index);
-			goto error;
-		} else {
-			DRV_LOG(DEBUG, "Register fd %d interrupt for virtq %d.",
-				rte_intr_fd_get(virtq->intr_handle),
-				index);
-		}
-	}
 	/* Subscribe virtq error event. */
 	virtq->version++;
 	cookie = ((uint64_t)virtq->version << 32) + index;
-- 
1.8.3.1



* [PATCH v5 3/7] ethdev: introduce Rx queue based available descriptor threshold
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
  2022-06-07 12:59             ` [PATCH v5 1/7] net/mlx5: add LWM support for Rxq Spike Du
  2022-06-07 12:59             ` [PATCH v5 2/7] common/mlx5: share interrupt management Spike Du
@ 2022-06-07 12:59             ` Spike Du
  2022-06-07 12:59             ` [PATCH v5 4/7] net/mlx5: add LWM event handling support Spike Du
                               ` (5 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Xiaoyun Li, Aman Singh,
	Yuying Zhang, Ferruh Yigit, Andrew Rybchenko, Ray Kinsella
  Cc: stephen, mb, dev, rasland

The available descriptor threshold describes the availability of an Rx
queue for hardware.
If the availability drops below the threshold, the device triggers the
event RTE_ETH_EVENT_RX_AVAIL_THRESH.
The threshold is defined as a percentage of the Rx queue size, with
valid values in the range [0,99].
Setting the threshold to 0 disables it, which is the default.
Add driver callbacks in eth_dev_ops to configure and query the available
descriptor threshold.
Add a testpmd command to configure avail_thresh per Rx queue.
- Command syntax:
  set port <port_id> rxq <rxq_id> avail_thresh <avail_thresh_num>

- Example commands:
To configure avail_thresh as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 avail_thresh 30

To disable avail_thresh on port 1 rxq 0:
testpmd> set port 1 rxq 0 avail_thresh 0

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 app/test-pmd/cmdline.c     | 68 ++++++++++++++++++++++++++++++++++++++++++
 app/test-pmd/config.c      | 20 +++++++++++++
 app/test-pmd/testpmd.c     | 14 +++++++++
 app/test-pmd/testpmd.h     |  2 ++
 lib/ethdev/ethdev_driver.h | 22 ++++++++++++++
 lib/ethdev/rte_ethdev.c    | 44 ++++++++++++++++++++++++++++
 lib/ethdev/rte_ethdev.h    | 73 ++++++++++++++++++++++++++++++++++++++++++++++
 lib/ethdev/version.map     |  2 ++
 8 files changed, 245 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 0410bad..bbf5835 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -17823,6 +17823,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	}
 };
 
+/* *** SET AVAIL THRESHOLD FOR A RXQ OF A PORT *** */
+struct cmd_rxq_avail_thresh_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t rxq;
+	uint16_t rxq_num;
+	cmdline_fixed_string_t avail_thresh;
+	uint8_t avail_thresh_num;
+};
+
+static void cmd_rxq_avail_thresh_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_rxq_avail_thresh_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->rxq, "rxq") == 0)
+	    && (strcmp(res->avail_thresh, "avail_thresh") == 0))
+		ret = set_rxq_avail_thresh(res->port_num, res->rxq_num,
+				  res->avail_thresh_num);
+	if (ret < 0)
+		printf("rxq_avail_thresh_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+				set, "set");
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+				port, "port");
+static cmdline_parse_token_num_t cmd_rxq_avail_thresh_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+				port_num, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_rxq =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+				rxq, "rxq");
+static cmdline_parse_token_num_t cmd_rxq_avail_thresh_rxqnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+				rxq_num, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_avail_thresh =
+	TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+				avail_thresh, "avail_thresh");
+static cmdline_parse_token_num_t cmd_rxq_avail_thresh_avail_threshnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+				avail_thresh_num, RTE_UINT8);
+
+static cmdline_parse_inst_t cmd_rxq_avail_thresh = {
+	.f = cmd_rxq_avail_thresh_parsed,
+	.data = (void *)0,
+	.help_str = "set port <port_id> rxq <rxq_id> avail_thresh <avail_thresh_num>: "
+		"Set avail_thresh for rxq on port_id",
+	.tokens = {
+		(void *)&cmd_rxq_avail_thresh_set,
+		(void *)&cmd_rxq_avail_thresh_port,
+		(void *)&cmd_rxq_avail_thresh_portnum,
+		(void *)&cmd_rxq_avail_thresh_rxq,
+		(void *)&cmd_rxq_avail_thresh_rxqnum,
+		(void *)&cmd_rxq_avail_thresh_avail_thresh,
+		(void *)&cmd_rxq_avail_thresh_avail_threshnum,
+		NULL,
+	},
+};
+
 /* ******************************************************************************** */
 
 /* list of instructions */
@@ -18110,6 +18177,7 @@ struct cmd_show_port_flow_transfer_proxy_result {
 	(cmdline_parse_inst_t *)&cmd_show_capability,
 	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
 	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
+	(cmdline_parse_inst_t *)&cmd_rxq_avail_thresh,
 	NULL,
 };
 
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 1b1e738..b754091 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -6342,3 +6342,23 @@ struct igb_ring_desc_16_bytes {
 		printf("  %s\n", buf);
 	}
 }
+
+int
+set_rxq_avail_thresh(portid_t port_id, uint16_t queue_id, uint8_t avail_thresh)
+{
+	struct rte_eth_link link;
+	int ret;
+
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+	ret = eth_link_get_nowait_print_err(port_id, &link);
+	if (ret < 0)
+		return -EINVAL;
+	if (avail_thresh > 99)
+		return -EINVAL;
+	ret = rte_eth_rx_avail_thresh_set(port_id, queue_id, avail_thresh);
+	if (ret != 0)
+		return ret;
+	return 0;
+}
+
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 767765d..33d9b85 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -420,6 +420,7 @@ struct fwd_engine * fwd_engines[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RX_AVAIL_THRESH] = "rxq available threshold reached",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -3616,6 +3617,9 @@ struct pmd_test_command {
 eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		  void *ret_param)
 {
+	uint16_t rxq_id;
+	uint8_t avail_thresh;
+	int ret;
 	RTE_SET_USED(param);
 	RTE_SET_USED(ret_param);
 
@@ -3647,6 +3651,16 @@ struct pmd_test_command {
 		ports[port_id].port_status = RTE_PORT_CLOSED;
 		printf("Port %u is closed\n", port_id);
 		break;
+	case RTE_ETH_EVENT_RX_AVAIL_THRESH:
+		/* avail_thresh query API rewinds rxq_id, no need to check max rxq num. */
+		for (rxq_id = 0; ; rxq_id++) {
+			ret = rte_eth_rx_avail_thresh_query(port_id, &rxq_id, &avail_thresh);
+			if (ret <= 0)
+				break;
+			printf("Received avail_thresh event, port:%d rxq_id:%d\n",
+			       port_id, rxq_id);
+		}
+		break;
 	default:
 		break;
 	}
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index 78a5f4e..5b268f4 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1173,6 +1173,8 @@ uint16_t tx_pkt_set_dynf(uint16_t port_id, __rte_unused uint16_t queue,
 void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
 void flex_item_destroy(portid_t port_id, uint16_t flex_id);
 void port_flex_item_flush(portid_t port_id);
+int set_rxq_avail_thresh(portid_t port_id, uint16_t queue_id,
+			 uint8_t avail_thresh);
 
 extern int flow_parse(const char *src, void *result, unsigned int size,
 		      struct rte_flow_attr **attr,
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc2..847f86a 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -1074,6 +1074,23 @@ typedef int (*eth_ip_reassembly_conf_set_t)(struct rte_eth_dev *dev,
 typedef int (*eth_dev_priv_dump_t)(struct rte_eth_dev *dev, FILE *file);
 
 /**
+ * @internal Set Rx queue available descriptor threshold.
+ * @see rte_eth_rx_avail_thresh_set()
+ */
+typedef int (*eth_rx_queue_avail_thresh_set_t)(struct rte_eth_dev *dev,
+				      uint16_t rx_queue_id,
+				      uint8_t avail_thresh);
+
+/**
+ * @internal Query queue available descriptor threshold event.
+ * @see rte_eth_rx_avail_thresh_query()
+ */
+
+typedef int (*eth_rx_queue_avail_thresh_query_t)(struct rte_eth_dev *dev,
+					uint16_t *rx_queue_id,
+					uint8_t *avail_thresh);
+
+/**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
 struct eth_dev_ops {
@@ -1283,6 +1300,11 @@ struct eth_dev_ops {
 
 	/** Dump private info from device */
 	eth_dev_priv_dump_t eth_dev_priv_dump;
+
+	/** Set Rx queue available descriptor threshold. */
+	eth_rx_queue_avail_thresh_set_t rx_queue_avail_thresh_set;
+	/** Query Rx queue available descriptor threshold event. */
+	eth_rx_queue_avail_thresh_query_t rx_queue_avail_thresh_query;
 };
 
 /**
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a175867..92fb282 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4424,6 +4424,50 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
 							queue_idx, tx_rate));
 }
 
+int rte_eth_rx_avail_thresh_set(uint16_t port_id, uint16_t queue_id,
+			       uint8_t avail_thresh)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue avail thresh: port %u: invalid queue ID=%u.\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (avail_thresh > 99) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue avail thresh: port %u: threshold should be <= 99.\n",
+			port_id);
+		return -EINVAL;
+	}
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_avail_thresh_set, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_avail_thresh_set)(dev,
+							     queue_id, avail_thresh));
+}
+
+int rte_eth_rx_avail_thresh_query(uint16_t port_id, uint16_t *queue_id,
+				 uint8_t *avail_thresh)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id == NULL)
+		return -EINVAL;
+	if (*queue_id >= dev->data->nb_rx_queues)
+		*queue_id = 0;
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_avail_thresh_query, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_avail_thresh_query)(dev,
+							     queue_id, avail_thresh));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
 	uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04225bb..d01bfe4 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1931,6 +1931,14 @@ struct rte_eth_rxq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 	uint16_t nb_desc;           /**< configured number of RXDs. */
 	uint16_t rx_buf_size;       /**< hardware receive buffer size. */
+	/**
+	 * Per-queue Rx available descriptor threshold, defined as a percentage
+	 * of the Rx queue size. If the number of available Rx descriptors drops
+	 * below this percentage, the event RTE_ETH_EVENT_RX_AVAIL_THRESH is
+	 * triggered. Value 0 means threshold monitoring is disabled and no
+	 * event is triggered.
+	 */
+	uint8_t avail_thresh;
 } __rte_cache_min_aligned;
 
 /**
@@ -3672,6 +3680,66 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t port_id,
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue based available descriptor threshold.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The index of the receive queue.
+ * @param avail_thresh
+ *  The available descriptor threshold is a percentage of the Rx queue size
+ *  which describes the availability of the Rx queue for hardware. If the
+ *  Rx queue availability is below it, the device will trigger the event
+ *  RTE_ETH_EVENT_RX_AVAIL_THRESH.
+ *  [1-99] to set a new available descriptor threshold.
+ *  0 to disable threshold monitoring.
+ *
+ * @return
+ *   - 0 if successful.
+ *   - negative if failed.
+ */
+__rte_experimental
+int rte_eth_rx_avail_thresh_set(uint16_t port_id, uint16_t queue_id,
+			       uint8_t avail_thresh);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Query Rx queue based available descriptor threshold event.
+ * The function queries all queues in the port circularly until one
+ * pending avail_thresh event is found or no pending avail_thresh event is found.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The API caller sets the starting Rx queue id in the pointer.
+ *  If the queue_id is bigger than the maximum queue id of the port,
+ *  it is rewound to 0 so that the application can keep calling
+ *  this function to handle all pending avail_thresh events in the queues
+ *  with a simple increment between calls.
+ *  If a Rx queue has pending avail_thresh event, the pointer is updated
+ *  with this Rx queue id; otherwise this pointer's content is
+ *  unchanged.
+ * @param avail_thresh
+ *  The pointer to the available descriptor threshold percentage of Rx queue.
+ *  If Rx queue with pending avail_thresh event is found, the queue's avail_thresh
+ *  percentage is stored in this pointer, otherwise the pointer's
+ *  content is unchanged.
+ *
+ * @return
+ *   - 1 if a Rx queue with pending avail_thresh event is found.
+ *   - 0 if no Rx queue with pending avail_thresh event is found.
+ *   - -EINVAL if queue_id is NULL.
+ */
+__rte_experimental
+int rte_eth_rx_avail_thresh_query(uint16_t port_id, uint16_t *queue_id,
+				 uint8_t *avail_thresh);
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
 		void *userdata);
 
@@ -3877,6 +3945,11 @@ enum rte_eth_event_type {
 	RTE_ETH_EVENT_DESTROY,  /**< port is released */
 	RTE_ETH_EVENT_IPSEC,    /**< IPsec offload related event */
 	RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+	/**
+	 *  Available descriptors dropped below the threshold in an Rx queue.
+	 *  @see rte_eth_rx_avail_thresh_set()
+	 */
+	RTE_ETH_EVENT_RX_AVAIL_THRESH,
 	RTE_ETH_EVENT_MAX       /**< max value of this enum */
 };
 
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index daca785..2fd928f 100644
--- a/lib/ethdev/version.map
+++ b/lib/ethdev/version.map
@@ -285,6 +285,8 @@ EXPERIMENTAL {
 	rte_mtr_color_in_protocol_priority_get;
 	rte_mtr_color_in_protocol_set;
 	rte_mtr_meter_vlan_table_update;
+	rte_eth_rx_avail_thresh_set;
+	rte_eth_rx_avail_thresh_query;
 };
 
 INTERNAL {
-- 
1.8.3.1



* [PATCH v5 4/7] net/mlx5: add LWM event handling support
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
                               ` (2 preceding siblings ...)
  2022-06-07 12:59             ` [PATCH v5 3/7] ethdev: introduce Rx queue based available descriptor threshold Spike Du
@ 2022-06-07 12:59             ` Spike Du
  2022-06-07 12:59             ` [PATCH v5 5/7] net/mlx5: support Rx queue based available descriptor threshold Spike Du
                               ` (4 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

When the number of available RQ WQEs reaches the LWM, the kernel driver
raises an event to SW.
Use a devx event_channel to catch this event and notify the user.
Allocate this channel per shared device.
The event cookie identifies the port and queue that triggered the event.

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 drivers/net/mlx5/mlx5.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5.h      |  7 +++++
 drivers/net/mlx5/mlx5_devx.c | 47 +++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h   |  7 +++++
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f098871..e04a666 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include <stdint.h>
 #include <stdlib.h>
 #include <errno.h>
+#include <fcntl.h>
 
 #include <rte_malloc.h>
 #include <ethdev_driver.h>
@@ -22,6 +23,7 @@
 #include <rte_eal_paging.h>
 #include <rte_alarm.h>
 #include <rte_cycles.h>
+#include <rte_interrupts.h>
 
 #include <mlx5_glue.h>
 #include <mlx5_devx_cmds.h>
@@ -1525,6 +1527,69 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+	int fd_lwm;
+
+	pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+	priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+			(priv->sh->cdev->ctx,
+			 MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+	if (!priv->sh->devx_channel_lwm)
+		goto err;
+	fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+	priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+		(RTE_INTR_INSTANCE_F_SHARED, true,
+		 fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+	if (!priv->sh->intr_handle_lwm)
+		goto err;
+	return 0;
+err:
+	if (priv->sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(priv->sh->devx_channel_lwm);
+		priv->sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+	return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before freeing this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+	if (sh->intr_handle_lwm) {
+		mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+			mlx5_dev_interrupt_handler_lwm, (void *)-1);
+		sh->intr_handle_lwm = NULL;
+	}
+	if (sh->devx_channel_lwm) {
+		mlx5_os_devx_destroy_event_channel
+			(sh->devx_channel_lwm);
+		sh->devx_channel_lwm = NULL;
+	}
+	pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1601,6 +1666,7 @@ struct mlx5_dev_ctx_shared *
 		claim_zero(mlx5_devx_cmd_destroy(sh->td));
 	MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
 	pthread_mutex_destroy(&sh->txpp.mutex);
+	mlx5_lwm_unset(sh);
 	mlx5_free(sh);
 	return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc..a76f2fe 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
 	struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
 	unsigned int flow_max_priority;
 	enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+	void *devx_channel_lwm;
+	struct rte_intr_handle *intr_handle_lwm;
+	pthread_mutex_t lwm_config_lock;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
 	int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
 	int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
 	void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+	int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
 	int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 			     struct mlx5_ind_table_obj *ind_tbl);
 	int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1603,6 +1608,8 @@ int mlx5_udp_tunnel_port_add(struct rte_eth_dev *dev,
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *priv);
+void mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh);
 
 /* Macro to iterate over all valid ports for mlx5 driver. */
 #define MLX5_ETH_FOREACH_DEV(port_id, dev) \
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index c918a50..6886ae1 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -233,6 +233,52 @@
 }
 
 /**
+ * Get LWM event for shared context, return the correct port/rxq for this event.
+ *
+ * @param priv
+ *   Mlx5_priv object.
+ * @param[out] rxq_idx
+ *   Which rxq gets this event.
+ * @param[out] port_id
+ *   Which port gets this event.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+mlx5_rx_devx_get_event_lwm(struct mlx5_priv *priv, int *rxq_idx, int *port_id)
+{
+#ifdef HAVE_IBV_DEVX_EVENT
+	union {
+		struct mlx5dv_devx_async_event_hdr event_resp;
+		uint8_t buf[sizeof(struct mlx5dv_devx_async_event_hdr) + 128];
+	} out;
+	int ret;
+
+	memset(&out, 0, sizeof(out));
+	ret = mlx5_glue->devx_get_event(priv->sh->devx_channel_lwm,
+					&out.event_resp,
+					sizeof(out.buf));
+	if (ret < 0) {
+		rte_errno = errno;
+		DRV_LOG(WARNING, "%s err", __func__);
+		return -rte_errno;
+	}
+	*port_id = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_PORTID_OFFSET) & LWM_COOKIE_PORTID_MASK;
+	*rxq_idx = (((uint32_t)out.event_resp.cookie) >>
+		    LWM_COOKIE_RXQID_OFFSET) & LWM_COOKIE_RXQID_MASK;
+	return 0;
+#else
+	(void)priv;
+	(void)rxq_idx;
+	(void)port_id;
+	rte_errno = ENOTSUP;
+	return -rte_errno;
+#endif /* HAVE_IBV_DEVX_EVENT */
+}
+
+/**
  * Create a RQ object using DevX.
  *
  * @param rxq
@@ -1421,6 +1467,7 @@ struct mlx5_obj_ops devx_obj_ops = {
 	.rxq_event_get = mlx5_rx_devx_get_event,
 	.rxq_obj_modify = mlx5_devx_modify_rq,
 	.rxq_obj_release = mlx5_rxq_devx_obj_release,
+	.rxq_event_get_lwm = mlx5_rx_devx_get_event_lwm,
 	.ind_table_new = mlx5_devx_ind_table_new,
 	.ind_table_modify = mlx5_devx_ind_table_modify,
 	.ind_table_destroy = mlx5_devx_ind_table_destroy,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index e5eea0a..197d708 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -1187,3 +1187,36 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 {
 	return -ENOTSUP;
 }
+
+/**
+ * Rte interrupt handler for LWM event.
+ * It first checks whether an event has arrived; if so, it processes the
+ * callback for RTE_ETH_EVENT_RX_AVAIL_THRESH.
+ *
+ * @param args
+ *   Generic pointer to mlx5_priv.
+ */
+void
+mlx5_dev_interrupt_handler_lwm(void *args)
+{
+	struct mlx5_priv *priv = args;
+	struct mlx5_rxq_priv *rxq;
+	struct rte_eth_dev *dev;
+	int ret, rxq_idx = 0, port_id = 0;
+
+	ret = priv->obj_ops.rxq_event_get_lwm(priv, &rxq_idx, &port_id);
+	if (unlikely(ret < 0)) {
+		DRV_LOG(WARNING, "Cannot get LWM event context.");
+		return;
+	}
+	DRV_LOG(INFO, "%s get LWM event, port_id:%d rxq_id:%d.", __func__,
+		port_id, rxq_idx);
+	dev = &rte_eth_devices[port_id];
+	rxq = mlx5_rxq_get(dev, rxq_idx);
+	if (rxq) {
+		pthread_mutex_lock(&priv->sh->lwm_config_lock);
+		rxq->lwm_event_pending = 1;
+		pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	}
+	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_AVAIL_THRESH, NULL);
+}
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 25a5f2c..068dff5 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -176,6 +176,7 @@ struct mlx5_rxq_priv {
 	struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
+	uint32_t lwm_event_pending:1;
 };
 
 /* External RX queue descriptor. */
@@ -295,6 +296,7 @@ void mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
+void mlx5_dev_interrupt_handler_lwm(void *args);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
@@ -675,4 +677,9 @@ uint16_t mlx5_rx_burst_mprq_vec(void *dpdk_rxq, struct rte_mbuf **pkts,
 	return !!__atomic_load_n(&rxq->refcnt, __ATOMIC_RELAXED);
 }
 
+#define LWM_COOKIE_RXQID_OFFSET 0
+#define LWM_COOKIE_RXQID_MASK 0xffff
+#define LWM_COOKIE_PORTID_OFFSET 16
+#define LWM_COOKIE_PORTID_MASK 0xffff
+
 #endif /* RTE_PMD_MLX5_RX_H_ */
-- 
1.8.3.1



* [PATCH v5 5/7] net/mlx5: support Rx queue based available descriptor threshold
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
                               ` (3 preceding siblings ...)
  2022-06-07 12:59             ` [PATCH v5 4/7] net/mlx5: add LWM event handling support Spike Du
@ 2022-06-07 12:59             ` Spike Du
  2022-06-07 12:59             ` [PATCH v5 6/7] net/mlx5: add private API to config host port shaper Spike Du
                               ` (3 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Add the mlx5-specific available descriptor threshold configuration
and query handlers.
In the mlx5 PMD, the available descriptor threshold is also called
LWM (limit watermark).
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.
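
The threshold is given as a percentage but stored as a descriptor count,
so the conversion has to round up or reading the value back may return a
smaller percentage. A minimal standalone sketch of that arithmetic
follows; the helper names are hypothetical and only mirror the logic of
mlx5_rx_queue_lwm_set() and mlx5_rxq_lwm_to_percentage() in this patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a percentage [0-99] of the queue size into
 * a descriptor count. Rounds up when the division is inexact, unless that
 * would reach the queue size itself.
 */
static uint32_t
lwm_from_percent(uint8_t percent, uint32_t wqe_cnt)
{
	uint32_t lwm;

	assert(percent <= 99 && wqe_cnt > 0);
	lwm = (uint32_t)percent * wqe_cnt / 100;
	if (percent && ((uint32_t)percent * wqe_cnt % 100) &&
	    lwm + 1 < wqe_cnt)
		lwm++;
	return lwm;
}

/* Hypothetical helper: the reverse conversion, rounding down. */
static uint8_t
lwm_to_percent(uint32_t lwm, uint32_t wqe_cnt)
{
	assert(wqe_cnt > 0);
	return (uint8_t)(lwm * 100 / wqe_cnt);
}
```

With 1024 descriptors, 70% maps to 717 descriptors and reads back as 70;
without the round-up it would be 716 descriptors and read back as 69.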

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  12 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h         |   1 +
 drivers/net/mlx5/mlx5.c                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 151 +++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_rx.h             |   5 ++
 6 files changed, 172 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56d..9163b78 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue available descriptor threshold configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- Available descriptor threshold:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
    $ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+Available descriptor threshold introduction
+-------------------------------------------
+
+Available descriptor threshold is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 0ed4f92..46fd73a 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -89,6 +89,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue available descriptor threshold support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5..3b5e605 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
 	MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+	MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a666..998846a 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ struct mlx5_dev_ctx_shared *
 	.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
 	.vlan_filter_set = mlx5_vlan_filter_set,
 	.rx_queue_setup = mlx5_rx_queue_setup,
+	.rx_queue_avail_thresh_set = mlx5_rx_queue_lwm_set,
+	.rx_queue_avail_thresh_query = mlx5_rx_queue_lwm_query,
 	.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
 	.tx_queue_setup = mlx5_tx_queue_setup,
 	.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 197d708..2cb7006 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -25,6 +25,7 @@
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +129,16 @@
 	return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+	struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+	uint32_t wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+
+	return rxq->lwm * 100 / wqe_cnt;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +161,7 @@
 {
 	struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
 	struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+	struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
 	if (!rxq)
 		return;
@@ -169,6 +181,8 @@
 	qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
 		RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
 		RTE_BIT32(rxq->elts_n);
+	qinfo->avail_thresh = rxq_priv ?
+		mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1202,34 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	return -ENOTSUP;
 }
 
+int
+mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev,
+			uint16_t *queue_id, uint8_t *lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	unsigned int rxq_id, found = 0, n;
+	struct mlx5_rxq_priv *rxq;
+
+	if (!queue_id)
+		return -EINVAL;
+	/* Query all the Rx queues of the port in a circular way. */
+	for (rxq_id = *queue_id, n = 0; n < priv->rxqs_n; n++) {
+		rxq = mlx5_rxq_get(dev, rxq_id);
+		if (rxq && rxq->lwm_event_pending) {
+			pthread_mutex_lock(&priv->sh->lwm_config_lock);
+			rxq->lwm_event_pending = 0;
+			pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+			*queue_id = rxq_id;
+			found = 1;
+			if (lwm)
+				*lwm =  mlx5_rxq_lwm_to_percentage(rxq);
+			break;
+		}
+		rxq_id = (rxq_id + 1) % priv->rxqs_n;
+	}
+	return found;
+}
+
 /**
  * Rte interrupt handler for LWM event.
  * It first checks if the event arrives, if so process the callback for
@@ -1220,3 +1262,112 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	}
 	rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RX_AVAIL_THRESH, NULL);
 }
+
+/**
+ * DPDK callback to arm an Rx queue LWM (limit watermark) event.
+ * When the Rx queue fullness reaches the LWM limit, the driver catches
+ * a HW event and invokes the user event callback.
+ * After the last event handling, the user needs to call this API again
+ * to arm an additional event.
+ *
+ * @param dev
+ *   Pointer to the device structure.
+ * @param[in] rx_queue_id
+ *   Rx queue identifier.
+ * @param[in] lwm
+ *   The LWM value, defined as a percentage of the Rx queue size.
+ *   [1-99] to set a new LWM (update the old value).
+ *   0 to unarm the event.
+ *
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOMEM - not enough memory to create LWM event channel.
+ *   - EINVAL - the input Rxq is not created by devx.
+ *   - E2BIG  - lwm is bigger than 99.
+ */
+int
+mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+		      uint8_t lwm)
+{
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint16_t port_id = PORT_ID(priv);
+	struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+	uint16_t event_nums[1] = {MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED};
+	struct mlx5_rxq_data *rxq_data;
+	uint32_t wqe_cnt;
+	uint64_t cookie;
+	int ret = 0;
+
+	if (!rxq) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	rxq_data = &rxq->ctrl->rxq;
+	/* Ensure the Rq is created by devx. */
+	if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	if (lwm > 99) {
+		DRV_LOG(WARNING, "Too big LWM configuration.");
+		rte_errno = E2BIG;
+		return -rte_errno;
+	}
+	/* Start config LWM. */
+	pthread_mutex_lock(&priv->sh->lwm_config_lock);
+	if (rxq->lwm == 0 && lwm == 0) {
+		/* Both old/new values are 0, do nothing. */
+		ret = 0;
+		goto end;
+	}
+	wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+	if (lwm) {
+		if (!priv->sh->devx_channel_lwm) {
+			ret = mlx5_lwm_setup(priv);
+			if (ret) {
+				DRV_LOG(WARNING,
+					"Failed to create shared_lwm.");
+				rte_errno = ENOMEM;
+				ret = -rte_errno;
+				goto end;
+			}
+		}
+		if (!rxq->lwm_devx_subscribed) {
+			cookie = ((uint32_t)
+				  (port_id << LWM_COOKIE_PORTID_OFFSET)) |
+				(rx_queue_id << LWM_COOKIE_RXQID_OFFSET);
+			ret = mlx5_os_devx_subscribe_devx_event
+				(priv->sh->devx_channel_lwm,
+				 rxq->devx_rq.rq->obj,
+				 sizeof(event_nums),
+				 event_nums,
+				 cookie);
+			if (ret) {
+				rte_errno = rte_errno ? rte_errno : EINVAL;
+				ret = -rte_errno;
+				goto end;
+			}
+			rxq->lwm_devx_subscribed = 1;
+		}
+	}
+	/* Save LWM to rxq and send modify_rq devx command. */
+	rxq->lwm = lwm * wqe_cnt / 100;
+	/* Avoid integer division loss when converting LWM back to percentage. */
+	if (lwm && (lwm * wqe_cnt % 100)) {
+		rxq->lwm = ((uint32_t)(rxq->lwm + 1) >= wqe_cnt) ?
+			rxq->lwm : (rxq->lwm + 1);
+	}
+	if (lwm && !rxq->lwm) {
+		/* With mprq, wqe_cnt may be < 100. */
+		DRV_LOG(WARNING, "Too small LWM configuration.");
+		rte_errno = EINVAL;
+		ret = -rte_errno;
+		goto end;
+	}
+	ret = mlx5_devx_modify_rq(rxq, MLX5_RXQ_MOD_RDY2RDY);
+end:
+	pthread_mutex_unlock(&priv->sh->lwm_config_lock);
+	return ret;
+}
+
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index 068dff5..e078aaf 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -177,6 +177,7 @@ struct mlx5_rxq_priv {
 	uint32_t hairpin_status; /* Hairpin binding status. */
 	uint32_t lwm:16;
 	uint32_t lwm_event_pending:1;
+	uint32_t lwm_devx_subscribed:1;
 };
 
 /* External RX queue descriptor. */
@@ -297,6 +298,10 @@ int mlx5_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 			   struct rte_eth_burst_mode *mode);
 int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc);
 void mlx5_dev_interrupt_handler_lwm(void *args);
+int mlx5_rx_queue_lwm_set(struct rte_eth_dev *dev, uint16_t rx_queue_id,
+			  uint8_t lwm);
+int mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev, uint16_t *rx_queue_id,
+			    uint8_t *lwm);
 
 /* Vectorized version of mlx5_rx.c */
 int mlx5_rxq_check_vec_support(struct mlx5_rxq_data *rxq_data);
-- 
1.8.3.1



* [PATCH v5 6/7] net/mlx5: add private API to config host port shaper
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
                               ` (4 preceding siblings ...)
  2022-06-07 12:59             ` [PATCH v5 5/7] net/mlx5: support Rx queue based available descriptor threshold Spike Du
@ 2022-06-07 12:59             ` Spike Du
  2022-06-07 12:59             ` [PATCH v5 7/7] app/testpmd: add Host Shaper command Spike Du
                               ` (2 subsequent siblings)
  8 siblings, 0 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Ray Kinsella
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

The host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to enable this function only when the
required library is found.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

Host shaper can configure shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives available descriptor
threshold event.
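
The way the API splits the two settings can be illustrated with a small
standalone model. This is only a sketch under assumptions: the struct, the
flag macro and the function name are hypothetical, and the register write
is left out; it mirrors the argument handling of
rte_pmd_mlx5_host_shaper_config() in this patch:

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the shaper state shared by a host port. */
struct host_shaper_state {
	uint8_t rate;       /* Immediate rate in units of 100Mbps, 0 = off. */
	bool lwm_triggered; /* Firmware sets 100Mbps on a threshold event. */
};

/* Assumed flag layout: the "triggered" flag is bit 0. */
#define SHAPER_FLAG_TRIGGERED (UINT32_C(1) << 0)

static int
host_shaper_update(struct host_shaper_state *st, uint8_t rate, uint32_t flags)
{
	if (!(flags & SHAPER_FLAG_TRIGGERED)) {
		/* Immediate mode: any rate is stored as the current rate. */
		st->rate = rate;
		return 0;
	}
	/* Triggered mode: only 0 (disable) and 1 (100Mbps) are accepted. */
	if (rate > 1)
		return -ENOTSUP;
	st->lwm_triggered = (rate == 1);
	return 0;
}
```

Note that a call with the flag set leaves the immediate rate untouched,
and a call without it leaves the triggered bit untouched.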

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 doc/guides/nics/mlx5.rst               |  35 +++++++++++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 +++++
 drivers/common/mlx5/mlx5_prm.h         |  25 ++++++++
 drivers/net/mlx5/mlx5.h                |   2 +
 drivers/net/mlx5/mlx5_rx.c             | 104 +++++++++++++++++++++++++++++++++
 drivers/net/mlx5/rte_pmd_mlx5.h        |  30 ++++++++++
 drivers/net/mlx5/version.map           |   2 +
 8 files changed, 212 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 9163b78..a1e13e7 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue available descriptor threshold configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField 2.
+  - When configuring the host shaper with the MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED flag set,
+    only rates 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,31 @@ Available descriptor threshold is a per Rx queue attribute, it should be configu
 a percentage of the Rx queue size.
 When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
 
+Host shaper introduction
+------------------------
+
+The host shaper register is a per-host-port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to the same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from the host.
+The host shaper has two modes for setting the shaper: immediate, and deferred to
+the available descriptor threshold event trigger. In immediate mode, the rate limit is
+configured immediately in the host shaper. In deferred mode, the shaper
+is not set until an available descriptor threshold event is received by any Rx queue in a VF
+representor belonging to the host port. The only rate supported for deferred
+mode is 100Mbps (there is no limit on the supported rates for immediate mode).
+In deferred mode, the shaper is set on the host port by the firmware upon
+receiving the available descriptor threshold event, which allows throttling host traffic on
+available descriptor threshold events at minimum latency, preventing excess drops in the
+Rx queue.
+
+Host shaper dependency for mstflint package
+-------------------------------------------
+
+In order to configure host shaper register, ``librte_net_mlx5`` depends on ``libmtcr_ul``
+which can be installed from OFED mstflint package.
+Meson detects ``libmtcr_ul`` existence at configure stage.
+If the library is detected, the application must link with ``-lmtcr_ul``,
+as done by the pkg-config file libdpdk.pc.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 46fd73a..3349cda 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -90,6 +90,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue available descriptor threshold support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b..51c6e5d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
     ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+    libmtcr_ul_found = true
+    ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
         [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
             'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+    has_sym_args += [
+        [  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+            'mopen'],
+    ]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
     config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3b5e605..92d05a7 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3771,6 +3771,7 @@ enum {
 	MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
 	MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
 	MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC005,
+	MLX5_QSHR_REGISTER_ID = 0x4030,
 };
 
 struct mlx5_ifc_register_mtutc_bits {
@@ -3785,6 +3786,30 @@ struct mlx5_ifc_register_mtutc_bits {
 	u8 time_adjustment[0x20];
 };
 
+struct mlx5_ifc_ets_global_config_register_bits {
+	u8 reserved_at_0[0x2];
+	u8 rate_limit_update[0x1];
+	u8 reserved_at_3[0x29];
+	u8 max_bw_units[0x4];
+	u8 reserved_at_48[0x8];
+	u8 max_bw_value[0x8];
+};
+
+#define ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED      0x0
+#define ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS 0x3
+#define ETS_GLOBAL_CONFIG_BW_UNIT_GBPS          0x4
+
+struct mlx5_ifc_register_qshr_bits {
+	u8 reserved_at_0[0x4];
+	u8 connected_host[0x1];
+	u8 vqos[0x1];
+	u8 fast_response[0x1];
+	u8 reserved_at_7[0x1];
+	u8 local_port[0x8];
+	u8 reserved_at_16[0x230];
+	struct mlx5_ifc_ets_global_config_register_bits global_config;
+};
+
 #define MLX5_MTUTC_TIMESTAMP_MODE_INTERNAL_TIMER 0
 #define MLX5_MTUTC_TIMESTAMP_MODE_REAL_TIME 1
 
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index a76f2fe..8af84ae 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1271,6 +1271,8 @@ struct mlx5_dev_ctx_shared {
 	void *devx_channel_lwm;
 	struct rte_intr_handle *intr_handle_lwm;
 	pthread_mutex_t lwm_config_lock;
+	uint32_t host_shaper_rate:8;
+	uint32_t lwm_triggered:1;
 	/* Availability of mreg_c's. */
 	struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 2cb7006..bb3ccc3 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,6 +19,7 @@
 #include <mlx5_prm.h>
 #include <mlx5_common.h>
 #include <mlx5_common_mr.h>
+#include <rte_pmd_mlx5.h>
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
@@ -27,6 +28,9 @@
 #include "mlx5_rxtx.h"
 #include "mlx5_devx.h"
 #include "mlx5_rx.h"
+#ifdef HAVE_MLX5_MSTFLINT
+#include <mstflint/mtcr.h>
+#endif
 
 
 static __rte_always_inline uint32_t
@@ -1371,3 +1375,103 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 	return ret;
 }
 
+/**
+ * Mlx5 access register function to configure host shaper.
+ * It calls the libmtcr_ul API to access QSHR (QoS Shaper Host Register)
+ * in firmware.
+ *
+ * @param dev
+ *   Pointer to rte_eth_dev.
+ * @param lwm_triggered
+ *   Flag to enable/disable lwm_triggered bit in QSHR.
+ * @param rate
+ *   Host shaper rate, unit is 100Mbps, set to 0 means disable the shaper.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+static int
+mlxreg_host_shaper_config(struct rte_eth_dev *dev,
+			  bool lwm_triggered, uint8_t rate)
+{
+#ifdef HAVE_MLX5_MSTFLINT
+	struct mlx5_priv *priv = dev->data->dev_private;
+	uint32_t data[MLX5_ST_SZ_DW(register_qshr)] = {0};
+	int rc, retry_count = 3;
+	mfile *mf = NULL;
+	int status;
+	void *ptr;
+
+	mf = mopen(priv->sh->ibdev_name);
+	if (!mf) {
+		DRV_LOG(WARNING, "mopen failed");
+		rte_errno = ENOENT;
+		return -rte_errno;
+	}
+	MLX5_SET(register_qshr, data, connected_host, 1);
+	MLX5_SET(register_qshr, data, fast_response, lwm_triggered ? 1 : 0);
+	MLX5_SET(register_qshr, data, local_port, 1);
+	ptr = MLX5_ADDR_OF(register_qshr, data, global_config);
+	MLX5_SET(ets_global_config_register, ptr, rate_limit_update, 1);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_units,
+		 rate ? ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS :
+		 ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED);
+	MLX5_SET(ets_global_config_register, ptr, max_bw_value, rate);
+	do {
+		rc = maccess_reg(mf,
+				 MLX5_QSHR_REGISTER_ID,
+				 MACCESS_REG_METHOD_SET,
+				 (u_int32_t *)&data[0],
+				 sizeof(data),
+				 sizeof(data),
+				 sizeof(data),
+				 &status);
+		if ((rc != ME_ICMD_STATUS_IFC_BUSY &&
+		     status != ME_REG_ACCESS_BAD_PARAM) ||
+		    !(mf->flags & MDEVS_REM)) {
+			break;
+		}
+		DRV_LOG(WARNING, "%s retry.", __func__);
+		usleep(10000);
+	} while (retry_count-- > 0);
+	mclose(mf);
+	rte_errno = (rc == ME_REG_ACCESS_DEV_BUSY) ? EBUSY : EIO;
+	return rc ? -rte_errno : 0;
+#else
+	(void)dev;
+	(void)lwm_triggered;
+	(void)rate;
+	return -1;
+#endif
+}
+
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate,
+				    uint32_t flags)
+{
+	struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+	struct mlx5_priv *priv = dev->data->dev_private;
+	bool lwm_triggered =
+	     !!(flags & RTE_BIT32(MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED));
+
+	if (!lwm_triggered) {
+		priv->sh->host_shaper_rate = rate;
+	} else {
+		switch (rate) {
+		case 0:
+		/* Rate 0 means disable lwm_triggered. */
+			priv->sh->lwm_triggered = 0;
+			break;
+		case 1:
+		/* Rate 1 means enable lwm_triggered. */
+			priv->sh->lwm_triggered = 1;
+			break;
+		default:
+			return -ENOTSUP;
+		}
+	}
+	return mlxreg_host_shaper_config(dev, priv->sh->lwm_triggered,
+					 priv->sh->host_shaper_rate);
+}
diff --git a/drivers/net/mlx5/rte_pmd_mlx5.h b/drivers/net/mlx5/rte_pmd_mlx5.h
index 6e7907e..fbfdd97 100644
--- a/drivers/net/mlx5/rte_pmd_mlx5.h
+++ b/drivers/net/mlx5/rte_pmd_mlx5.h
@@ -109,6 +109,36 @@ int rte_pmd_mlx5_external_rx_queue_id_map(uint16_t port_id, uint16_t dpdk_idx,
 int rte_pmd_mlx5_external_rx_queue_id_unmap(uint16_t port_id,
 					    uint16_t dpdk_idx);
 
+/**
+ * The rate of the host port shaper will be updated directly at the next
+ * available descriptor threshold event to the rate that comes with this flag set;
+ * set rate 0 to disable this rate update.
+ * Unset this flag to update the rate of the host port shaper directly in
+ * the API call; use rate 0 to disable the current shaper.
+ */
+#define MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED 0
+
+/**
+ * Configure a HW shaper to limit Tx rate for a host port.
+ * The configuration will affect all the ethdev ports belonging to
+ * the same rte_device.
+ *
+ * @param[in] port_id
+ *   The port identifier of the Ethernet device.
+ * @param[in] rate
+ *   Unit is 100Mbps, setting the rate to 0 disables the shaper.
+ * @param[in] flags
+ *   Host shaper flags.
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOENT - no ibdev interface.
+ *   - EBUSY  - the register access unit is busy.
+ *   - EIO    - the register access command meets IO error.
+ */
+__rte_experimental
+int rte_pmd_mlx5_host_shaper_config(int port_id, uint8_t rate, uint32_t flags);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/drivers/net/mlx5/version.map b/drivers/net/mlx5/version.map
index 79cb79a..c97dfe4 100644
--- a/drivers/net/mlx5/version.map
+++ b/drivers/net/mlx5/version.map
@@ -12,4 +12,6 @@ EXPERIMENTAL {
 	# added in 22.03
 	rte_pmd_mlx5_external_rx_queue_id_map;
 	rte_pmd_mlx5_external_rx_queue_id_unmap;
+	# added in 22.07
+	rte_pmd_mlx5_host_shaper_config;
 };
-- 
1.8.3.1



* [PATCH v5 7/7] app/testpmd: add Host Shaper command
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
                               ` (5 preceding siblings ...)
  2022-06-07 12:59             ` [PATCH v5 6/7] net/mlx5: add private API to config host port shaper Spike Du
@ 2022-06-07 12:59             ` Spike Du
  2022-06-09  7:55               ` Andrew Rybchenko
  2022-06-13  2:50               ` [PATCH v6] " Spike Du
  2022-06-08  9:43             ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Andrew Rybchenko
  2022-06-08 16:35             ` [PATCH v6] ethdev: introduce available Rx descriptors threshold Andrew Rybchenko
  8 siblings, 2 replies; 131+ messages in thread
From: Spike Du @ 2022-06-07 12:59 UTC (permalink / raw)
  To: matan, viacheslavo, orika, thomas, Xiaoyun Li, Aman Singh, Yuying Zhang
  Cc: andrew.rybchenko, stephen, mb, dev, rasland

Add a testpmd command to configure the host shaper.
- Command syntax:
  mlx5 set port <port_id> host_shaper avail_thresh_triggered <0|1> rate <rate_num>

- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0

To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50

Add sample code to handle the rxq available descriptor threshold event: it
delays a while so that the rxq empties, then disables the host shaper and
rearms the available descriptor threshold event.
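
The delayed handler runs from an alarm callback that takes a single
void * argument, so the sample code folds the port and queue identifiers
into one 32-bit value. A minimal standalone sketch of that packing is
shown here; the helper names are hypothetical, and the layout (port in
the low 16 bits, queue in the high 16 bits) follows the sample code in
this patch:

```c
#include <stdint.h>

/*
 * Hypothetical helper: pack (port_id, rxq_id) into a pointer-sized
 * callback argument. port_id occupies the low 16 bits.
 */
static void *
port_rxq_pack(uint16_t port_id, uint16_t rxq_id)
{
	/* Widen before shifting to avoid shifting into the sign bit. */
	uint32_t v = (uint32_t)port_id | ((uint32_t)rxq_id << 16);

	return (void *)(uintptr_t)v;
}

/* Hypothetical helper: recover both identifiers in the callback. */
static void
port_rxq_unpack(void *arg, uint16_t *port_id, uint16_t *rxq_id)
{
	uint32_t v = (uint32_t)(uintptr_t)arg;

	*port_id = v & 0xffff;
	*rxq_id = (v >> 16) & 0xffff;
}
```

Passing the value by cast avoids allocating a context structure that
would have to outlive the event handler until the alarm fires.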

Signed-off-by: Spike Du <spiked@nvidia.com>
---
 app/test-pmd/testpmd.c          |   6 ++
 doc/guides/nics/mlx5.rst        |  46 +++++++++
 drivers/net/mlx5/meson.build    |   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 201 ++++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_testpmd.h |  26 ++++++
 5 files changed, 283 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 33d9b85..e15882d 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include <rte_eth_bond.h>
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3659,6 +3662,9 @@ struct pmd_test_command {
 				break;
 			printf("Received avail_thresh event, port:%d rxq_id:%d\n",
 			       port_id, rxq_id);
+#ifdef RTE_NET_MLX5
+			mlx5_test_avail_thresh_event_handler(port_id, rxq_id);
+#endif
 		}
 		break;
 	default:
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a1e13e7..b5a3ee3 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use available descriptor threshold and Host Shaper
+---------------------------------------------------------
+
+Testpmd provides sample commands to configure the available descriptor threshold,
+as well as sample logic to handle the available descriptor threshold event.
+The typical workflow is as follows: testpmd configures the available descriptor threshold
+for the Rx queues, enables avail_thresh_triggered in the host shaper and registers a callback.
+When traffic from the host is too high and the number of available Rx descriptors drops below
+the threshold, the PMD receives an event and the firmware automatically configures
+a 100Mbps shaper on the host port. The PMD then calls the registered callback, which waits
+for a while to let the Rx queue drain, then disables the host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is the uplink, port 1
+is a VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable the available descriptor threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 70
+   testpmd> set port 1 rxq 1 avail_thresh 70
+
+The first command disables the current host shaper and enables available descriptor threshold triggered mode.
+The remaining commands configure the available descriptor threshold to 70% of the Rx queue size for both Rx queues.
+When traffic from the host is too high, the testpmd console prints a log message
+about the received available descriptor threshold event, and then the host shaper is disabled.
+The traffic rate from the host is thus controlled and fewer drops happen in the Rx queues.
+
+To disable the available descriptor threshold and avail_thresh_triggered, invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 0
+   testpmd> set port 1 rxq 1 avail_thresh 0
+
+It is recommended that an application disables the available descriptor threshold and avail_thresh_triggered
+before exit, if it enabled them earlier.
+
+We can also configure the shaper with a rate value; the rate unit is 100Mbps. The command
+below sets the current shaper to 5Gbps and disables avail_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..941642b 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,8 @@ if get_option('buildtype').contains('debug')
 else
     cflags += [ '-UPEDANTIC' ]
 endif
+
+testpmd_sources += files('mlx5_testpmd.c')
+testpmd_drivers_deps += 'net_mlx5'
+
 subdir(exec_env)
diff --git a/drivers/net/mlx5/mlx5_testpmd.c b/drivers/net/mlx5/mlx5_testpmd.c
new file mode 100644
index 0000000..071254a
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.c
@@ -0,0 +1,201 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#include <stdint.h>
+#include <string.h>
+#include <stdlib.h>
+
+#include <rte_prefetch.h>
+#include <rte_common.h>
+#include <rte_branch_prediction.h>
+#include <rte_ether.h>
+#include <rte_alarm.h>
+#include <rte_pmd_mlx5.h>
+#include <rte_ethdev.h>
+#include "mlx5_testpmd.h"
+#include "testpmd.h"
+
+static uint8_t host_shaper_avail_thresh_triggered[RTE_MAX_ETHPORTS];
+#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
+
+/**
+ * Disable the host shaper and re-arm available descriptor threshold event.
+ *
+ * @param[in] args
+ *   uint32_t integer combining port_id and rxq_id.
+ */
+static void
+mlx5_test_host_shaper_disable(void *args)
+{
+	uint32_t port_rxq_id = (uint32_t)(uintptr_t)args;
+	uint16_t port_id = port_rxq_id & 0xffff;
+	uint16_t qid = (port_rxq_id >> 16) & 0xffff;
+	struct rte_eth_rxq_info qinfo;
+
+	printf("%s disable shaper\n", __func__);
+	if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
+		printf("rx_queue_info_get returns error\n");
+		return;
+	}
+	/* Rearm the available descriptor threshold event. */
+	if (rte_eth_rx_avail_thresh_set(port_id, qid, qinfo.avail_thresh)) {
+		printf("config avail_thresh returns error\n");
+		return;
+	}
+	/* Only disable the shaper when avail_thresh_triggered is set. */
+	if (host_shaper_avail_thresh_triggered[port_id] &&
+	    rte_pmd_mlx5_host_shaper_config(port_id, 0, 0))
+		printf("%s disable shaper returns error\n", __func__);
+}
+
+void
+mlx5_test_avail_thresh_event_handler(uint16_t port_id, uint16_t rxq_id)
+{
+	uint32_t port_rxq_id = port_id | ((uint32_t)rxq_id << 16);
+
+	rte_eal_alarm_set(SHAPER_DISABLE_DELAY_US,
+			  mlx5_test_host_shaper_disable,
+			  (void *)(uintptr_t)port_rxq_id);
+	printf("%s port_id:%u rxq_id:%u\n", __func__, port_id, rxq_id);
+}
+
+/**
+ * Configure host shaper's avail_thresh_triggered and current rate.
+ *
+ * @param[in] avail_thresh_triggered
+ *   Disable/enable avail_thresh_triggered.
+ * @param[in] rate
+ *   Configure current host shaper rate.
+ * @return
+ *   On success, returns 0.
+ *   On failure, returns < 0.
+ */
+static int
+mlx5_test_set_port_host_shaper(uint16_t port_id, uint16_t avail_thresh_triggered, uint8_t rate)
+{
+	struct rte_eth_link link;
+	bool port_id_valid = false;
+	uint16_t pid;
+	int ret;
+
+	RTE_ETH_FOREACH_DEV(pid)
+		if (port_id == pid) {
+			port_id_valid = true;
+			break;
+		}
+	if (!port_id_valid)
+		return -EINVAL;
+	ret = rte_eth_link_get_nowait(port_id, &link);
+	if (ret < 0)
+		return ret;
+	host_shaper_avail_thresh_triggered[port_id] = avail_thresh_triggered ? 1 : 0;
+	if (!avail_thresh_triggered) {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 0,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED));
+	} else {
+		ret = rte_pmd_mlx5_host_shaper_config(port_id, 1,
+		RTE_BIT32(MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED));
+	}
+	if (ret)
+		return ret;
+	ret = rte_pmd_mlx5_host_shaper_config(port_id, rate, 0);
+	if (ret)
+		return ret;
+	return 0;
+}
+
+/* *** SET HOST_SHAPER FOR A PORT *** */
+struct cmd_port_host_shaper_result {
+	cmdline_fixed_string_t mlx5;
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t host_shaper;
+	cmdline_fixed_string_t avail_thresh_triggered;
+	uint16_t fr;
+	cmdline_fixed_string_t rate;
+	uint8_t rate_num;
+};
+
+static void cmd_port_host_shaper_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_port_host_shaper_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->mlx5, "mlx5") == 0) &&
+	    (strcmp(res->set, "set") == 0) &&
+	    (strcmp(res->port, "port") == 0) &&
+	    (strcmp(res->host_shaper, "host_shaper") == 0) &&
+	    (strcmp(res->avail_thresh_triggered, "avail_thresh_triggered") == 0) &&
+	    (strcmp(res->rate, "rate") == 0))
+		ret = mlx5_test_set_port_host_shaper(res->port_num, res->fr,
+					   res->rate_num);
+	if (ret < 0)
+		printf("cmd_port_host_shaper error: (%s)\n", strerror(-ret));
+}
+
+cmdline_parse_token_string_t cmd_port_host_shaper_mlx5 =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				mlx5, "mlx5");
+cmdline_parse_token_string_t cmd_port_host_shaper_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				set, "set");
+cmdline_parse_token_string_t cmd_port_host_shaper_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				port, "port");
+cmdline_parse_token_num_t cmd_port_host_shaper_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+				port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_host_shaper =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 host_shaper, "host_shaper");
+cmdline_parse_token_string_t cmd_port_host_shaper_avail_thresh_triggered =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 avail_thresh_triggered, "avail_thresh_triggered");
+cmdline_parse_token_num_t cmd_port_host_shaper_fr =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      fr, RTE_UINT16);
+cmdline_parse_token_string_t cmd_port_host_shaper_rate =
+	TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
+				 rate, "rate");
+cmdline_parse_token_num_t cmd_port_host_shaper_rate_num =
+	TOKEN_NUM_INITIALIZER(struct cmd_port_host_shaper_result,
+			      rate_num, RTE_UINT8);
+cmdline_parse_inst_t mlx5_test_cmd_port_host_shaper = {
+	.f = cmd_port_host_shaper_parsed,
+	.data = (void *)0,
+	.help_str = "mlx5 set port <port_id> host_shaper avail_thresh_triggered <0|1> "
+	"rate <rate_num>: Set HOST_SHAPER avail_thresh_triggered and rate with port_id",
+	.tokens = {
+		(void *)&cmd_port_host_shaper_mlx5,
+		(void *)&cmd_port_host_shaper_set,
+		(void *)&cmd_port_host_shaper_port,
+		(void *)&cmd_port_host_shaper_portnum,
+		(void *)&cmd_port_host_shaper_host_shaper,
+		(void *)&cmd_port_host_shaper_avail_thresh_triggered,
+		(void *)&cmd_port_host_shaper_fr,
+		(void *)&cmd_port_host_shaper_rate,
+		(void *)&cmd_port_host_shaper_rate_num,
+		NULL,
+	}
+};
+
+struct testpmd_driver_commands mlx5_driver_cmds = {
+	.commands = {
+		{
+			.ctx = &mlx5_test_cmd_port_host_shaper,
+			.help = "mlx5 set port (port_id) host_shaper avail_thresh_triggered (0|1) "
+			"rate (rate_num):\n"
+			"    Set HOST_SHAPER avail_thresh_triggered and rate with port_id\n\n",
+		},
+		{
+			.ctx = NULL,
+		},
+	}
+};
+TESTPMD_ADD_DRIVER_COMMANDS(mlx5_driver_cmds);
+
diff --git a/drivers/net/mlx5/mlx5_testpmd.h b/drivers/net/mlx5/mlx5_testpmd.h
new file mode 100644
index 0000000..7a54658
--- /dev/null
+++ b/drivers/net/mlx5/mlx5_testpmd.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2021 6WIND S.A.
+ * Copyright 2021 Mellanox Technologies, Ltd
+ */
+
+#ifndef RTE_PMD_MLX5_TEST_H_
+#define RTE_PMD_MLX5_TEST_H_
+
+#include <cmdline_parse.h>
+#include <cmdline_parse_num.h>
+#include <cmdline_parse_string.h>
+
+/**
+ * RTE_ETH_EVENT_RX_AVAIL_THRESH handler sample code.
+ * It's called in testpmd; the workflow here is to delay a while until
+ * the Rx queue is empty, then disable the host shaper.
+ *
+ * @param[in] port_id
+ *   Port identifier.
+ * @param[in] rxq_id
+ *   Rx queue identifier.
+ */
+void
+mlx5_test_avail_thresh_event_handler(uint16_t port_id, uint16_t rxq_id);
+
+#endif
-- 
1.8.3.1



* Re: [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
                               ` (6 preceding siblings ...)
  2022-06-07 12:59             ` [PATCH v5 7/7] app/testpmd: add Host Shaper command Spike Du
@ 2022-06-08  9:43             ` Andrew Rybchenko
  2022-06-08 16:35             ` [PATCH v6] ethdev: introduce available Rx descriptors threshold Andrew Rybchenko
  8 siblings, 0 replies; 131+ messages in thread
From: Andrew Rybchenko @ 2022-06-08  9:43 UTC (permalink / raw)
  To: Spike Du, matan, viacheslavo, orika, thomas; +Cc: stephen, mb, dev, rasland

@Matan, @Viacheslav, could you review mlx5 patches of the series,
please.

On 6/7/22 15:59, Spike Du wrote:
> Available descriptor threshold (ADT for short) is a per-Rx-queue attribute: when the number of Rx queue descriptors available to HW falls below the ADT, HW sends an event to the application.
> Host shaper can configure shaper rate and avail_thresh-triggered for a host port.
> The shaper limits the rate of traffic from host port to embedded ARM rx port on Nvidia BlueField 2 NIC.
> If avail_thresh-triggered is enabled, a 100Mbps shaper is enabled automatically when one of the host port's Rx queues receives available descriptor threshold event.
> 
> These two features can combine to control traffic from host port to wire port for BlueField 2 NIC.
> The traffic flows from host to embedded ARM, then to the physical port.
> The workflow on the ARM system is: configure the available descriptor threshold on the Rx queue and enable the avail_thresh-triggered flag in the host shaper; after receiving an available descriptor threshold event, delay a while until the Rx queue is empty, then disable the shaper. We repeat this workflow to reduce Rx queue drops on the ARM system.
> 
> Add new libethdev API to set available descriptor threshold, add rte event RTE_ETH_EVENT_RX_AVAIL_THRESH to handle available descriptor threshold event. For host shaper, because it doesn't align to existing DPDK framework and is specific to Nvidia NIC, use PMD private API.
> 
> For integration with testpmd, put the private cmdline function and available descriptor threshold event handler in mlx5 PMD directory by adding a new file mlx5_testpmd.c. Follow David Marchand's driver specific commands framework to add mlx5 specific commands.
> 
> 
> Spike Du (7):
>    net/mlx5: add LWM support for Rxq
>    common/mlx5: share interrupt management
>    ethdev: introduce Rx queue based available descriptor threshold
>    net/mlx5: add LWM event handling support
>    net/mlx5: support Rx queue based available descriptor threshold
>    net/mlx5: add private API to config host port shaper
>    app/testpmd: add Host Shaper command




* [PATCH v6] ethdev: introduce available Rx descriptors threshold
  2022-06-07 12:59           ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Spike Du
                               ` (7 preceding siblings ...)
  2022-06-08  9:43             ` [PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper Andrew Rybchenko
@ 2022-06-08 16:35             ` Andrew Rybchenko
  2022-06-08 17:22               ` Thomas Monjalon
  8 siblings, 1 reply; 131+ messages in thread
From: Andrew Rybchenko @ 2022-06-08 16:35 UTC (permalink / raw)
  To: Xiaoyun Li, Aman Singh, Yuying Zhang, Thomas Monjalon,
	Ferruh Yigit, Ray Kinsella
  Cc: dev, Spike Du

From: Spike Du <spiked@nvidia.com>

A new event RTE_ETH_EVENT_RX_AVAIL_THRESH should be generated by HW
when number of available descriptors in Rx queue goes below the
threshold.

The threshold is defined as a percentage of an Rx queue size with valid
values from 0 to 99 (inclusive). Zero (default) value disables it.
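
As an illustration of the percentage rule, here is a minimal sketch of how a
driver might turn the percentage into a descriptor count. The helper name
avail_thresh_to_desc() is hypothetical and not part of this patch; the only
rule taken from the patch is that the conversion rounds down.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): convert an available
 * descriptors threshold given as a percentage (0..99) of the Rx queue
 * size into a number of descriptors, rounding down.
 */
static uint16_t
avail_thresh_to_desc(uint16_t nb_desc, uint8_t avail_thresh)
{
	/* Widen to 32 bits so nb_desc * avail_thresh cannot overflow. */
	uint32_t scaled = (uint32_t)nb_desc * avail_thresh;

	/* Integer division truncates, which is the required round-down. */
	return (uint16_t)(scaled / 100);
}
```

With this rule a 512-descriptor queue and a threshold of 70 gives 358
descriptors, and a threshold of 0 always gives 0, which matches "disabled".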

There is no capability reporting for the feature. Application should
simply try to set required threshold value and handle result.
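
Since there is no capability flag, the return value of
rte_eth_rx_avail_thresh_set() is the only signal the application gets. A
minimal sketch of how an application might interpret it, based on the return
codes documented in this patch; the enum and the helper
classify_avail_thresh_ret() are hypothetical names used for illustration only.

```c
#include <errno.h>

/* Hypothetical classification of the setter's result (not in the patch). */
enum avail_thresh_outcome {
	AVAIL_THRESH_OK,          /* threshold accepted by the driver */
	AVAIL_THRESH_UNSUPPORTED, /* driver has no such operation */
	AVAIL_THRESH_BAD_ARG,     /* bad port, queue or percentage */
	AVAIL_THRESH_DEV_GONE,    /* device was removed */
	AVAIL_THRESH_OTHER,       /* any other driver-specific error */
};

static enum avail_thresh_outcome
classify_avail_thresh_ret(int ret)
{
	switch (ret) {
	case 0:
		return AVAIL_THRESH_OK;
	case -ENOTSUP:
		return AVAIL_THRESH_UNSUPPORTED;
	case -ENODEV:
	case -EINVAL:
		return AVAIL_THRESH_BAD_ARG;
	case -EIO:
		return AVAIL_THRESH_DEV_GONE;
	default:
		return AVAIL_THRESH_OTHER;
	}
}
```

An application would pass the result of
rte_eth_rx_avail_thresh_set(port_id, queue_id, avail_thresh) to such a helper
and, for example, carry on without the event when the outcome is
AVAIL_THRESH_UNSUPPORTED.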

Add testpmd commands to control the threshold:
  set port <port_id> rxq <rxq_id> avail_thresh <avail_thresh_num>

Signed-off-by: Spike Du <spiked@nvidia.com>
Signed-off-by: Andrew Rybchenko <andrew.rybchenko@oktetlabs.ru>
---
v6:
    - try to make descriptor shorter and more useful
    - refine terminology to use "available descriptors threshold"
      everywhere (plural "descriptors")
    - fix ethdev API documentation
    - define negative return values
    - define rules to convert percentage to descriptors number
      in drivers
    - avoid extra checks in testpmd helper to allow ethdev API
      to do its job
    - minor fixes in testpmd variables naming
    - fix testpmd help to be human oriented
    - update testpmd users guide
    - add release notes

 app/test-pmd/cmdline.c                      | 72 +++++++++++++++++++++
 app/test-pmd/config.c                       |  9 +++
 app/test-pmd/testpmd.c                      | 16 +++++
 app/test-pmd/testpmd.h                      |  2 +
 doc/guides/rel_notes/release_22_07.rst      |  6 ++
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |  9 +++
 lib/ethdev/ethdev_driver.h                  | 25 +++++++
 lib/ethdev/rte_ethdev.c                     | 44 +++++++++++++
 lib/ethdev/rte_ethdev.h                     | 71 ++++++++++++++++++++
 lib/ethdev/version.map                      |  2 +
 10 files changed, 256 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index fdd0cada3b..3acdd33cd9 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -750,6 +750,9 @@ static void cmd_help_long_parsed(void *parsed_result,
 			"set port (port_id) fec_mode auto|off|rs|baser\n"
 			"    set fec mode for a specific port\n\n"
 
+			"set port (port_id) rxq (queue_id) avail_thresh (0..99)\n"
+			"    set available descriptors threshold for Rx queue\n\n"
+
 			, list_pkt_forwarding_modes()
 		);
 	}
@@ -17331,6 +17334,74 @@ static cmdline_parse_inst_t cmd_set_fec_mode = {
 	},
 };
 
+/* *** set available descriptors threshold for an RxQ of a port *** */
+struct cmd_set_rxq_avail_thresh_result {
+	cmdline_fixed_string_t set;
+	cmdline_fixed_string_t port;
+	uint16_t port_num;
+	cmdline_fixed_string_t rxq;
+	uint16_t rxq_num;
+	cmdline_fixed_string_t avail_thresh;
+	uint8_t avail_thresh_num;
+};
+
+static void cmd_set_rxq_avail_thresh_parsed(void *parsed_result,
+		__rte_unused struct cmdline *cl,
+		__rte_unused void *data)
+{
+	struct cmd_set_rxq_avail_thresh_result *res = parsed_result;
+	int ret = 0;
+
+	if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+	    && (strcmp(res->rxq, "rxq") == 0)
+	    && (strcmp(res->avail_thresh, "avail_thresh") == 0))
+		ret = set_rxq_avail_thresh(res->port_num, res->rxq_num,
+				  res->avail_thresh_num);
+	if (ret < 0)
+		printf("rxq_avail_thresh_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+static cmdline_parse_token_string_t cmd_set_rxq_avail_thresh_set =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_rxq_avail_thresh_result,
+				set, "set");
+static cmdline_parse_token_string_t cmd_set_rxq_avail_thresh_port =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_rxq_avail_thresh_result,
+				port, "port");
+static cmdline_parse_token_num_t cmd_set_rxq_avail_thresh_portnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_set_rxq_avail_thresh_result,
+				port_num, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_set_rxq_avail_thresh_rxq =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_rxq_avail_thresh_result,
+				rxq, "rxq");
+static cmdline_parse_token_num_t cmd_set_rxq_avail_thresh_rxqnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_set_rxq_avail_thresh_result,
+				rxq_num, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_set_rxq_avail_thresh_avail_thresh =
+	TOKEN_STRING_INITIALIZER(struct cmd_set_rxq_avail_thresh_result,
+				avail_thresh, "avail_thresh");
+static cmdline_parse_token_num_t cmd_set_rxq_avail_thresh_avail_threshnum =
+	TOKEN_NUM_INITIALIZER(struct cmd_set_rxq_avail_thresh_result,
+				avail_thresh_num, RTE_UINT8);
+
+static cmdline_parse_inst_t cmd_set_rxq_avail_thresh = {
+	.f = cmd_set_rxq_avail_thresh_parsed,
+	.data = (void *)0,
+	.help_str =
+		"set port <port_id> rxq <queue_id> avail_thresh <0..99>: "
+		"Set available descriptors threshold for Rx queue",
+	.tokens = {
+		(void *)&cmd_set_rxq_avail_thresh_set,
+		(void *)&cmd_set_rxq_avail_thresh_port,
+		(void *)&cmd_set_rxq_avail_thresh_portnum,
+		(void *)&cmd_set_rxq_avail_thresh_rxq,
+		(void *)&cmd_set_rxq_avail_thresh_rxqnum,
+		(void *)&cmd_set_rxq_avail_thresh_avail_thresh,
+		(void *)&cmd_set_rxq_avail_thresh_avail_threshnum,
+		NULL,
+	},
+};
+
 /* show port supported ptypes */
 
 /* Common result structure for show port ptypes */
@@ -18110,6 +18181,7 @@ static cmdline_parse_ctx_t builtin_ctx[] = {
 	(cmdline_parse_inst_t *)&cmd_config_tx_dynf_specific,
 	(cmdline_parse_inst_t *)&cmd_show_fec_mode,
 	(cmdline_parse_inst_t *)&cmd_set_fec_mode,
+	(cmdline_parse_inst_t *)&cmd_set_rxq_avail_thresh,
 	(cmdline_parse_inst_t *)&cmd_show_capability,
 	(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
 	(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index d6caa1f0b2..e490d7f921 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -6392,3 +6392,12 @@ show_mcast_macs(portid_t port_id)
 		printf("  %s\n", buf);
 	}
 }
+
+int
+set_rxq_avail_thresh(portid_t port_id, uint16_t queue_id, uint8_t avail_thresh)
+{
+	if (port_id_is_invalid(port_id, ENABLED_WARN))
+		return -EINVAL;
+
+	return rte_eth_rx_avail_thresh_set(port_id, queue_id, avail_thresh);
+}
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 9d6175e9a7..3e39584549 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -420,6 +420,7 @@ static const char * const eth_event_desc[] = {
 	[RTE_ETH_EVENT_NEW] = "device probed",
 	[RTE_ETH_EVENT_DESTROY] = "device released",
 	[RTE_ETH_EVENT_FLOW_AGED] = "flow aged",
+	[RTE_ETH_EVENT_RX_AVAIL_THRESH] = "RxQ available descriptors threshold reached",
 	[RTE_ETH_EVENT_MAX] = NULL,
 };
 
@@ -3672,6 +3673,21 @@ eth_event_callback(portid_t port_id, enum rte_eth_event_type type, void *param,
 		ports[port_id].port_status = RTE_PORT_CLOSED;
 		printf("Port %u is closed\n", port_id);
 		break;
+	case RTE_ETH_EVENT_RX_AVAIL_THRESH: {
+		uint16_t rxq_id;
+		int ret;
+
+		/* avail_thresh query API rewinds rxq_id, no need to check max RxQ num */
+		for (rxq_id = 0; ; rxq_id++) {
+			ret = rte_eth_rx_avail_thresh_query(port_id, &rxq_id,
+							    NULL);
+			if (ret <= 0)
+				break;
+			printf("Received avail_thresh event, port:%u rxq_id:%u\n",
+			       port_id, rxq_id);
+		}
+		break;
+	}
 	default:
 		break;
 	}
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index dd34b025e6..212d836a19 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -1190,6 +1190,8 @@ int update_jumbo_frame_offload(portid_t portid);
 void flex_item_create(portid_t port_id, uint16_t flex_id, const char *filename);
 void flex_item_destroy(portid_t port_id, uint16_t flex_id);
 void port_flex_item_flush(portid_t port_id);
+int set_rxq_avail_thresh(portid_t port_id, uint16_t queue_id,
+			 uint8_t avail_thresh);
 
 extern int flow_parse(const char *src, void *result, unsigned int size,
 		      struct rte_flow_attr **attr,
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 5551332bcb..913add127c 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -73,6 +73,12 @@ New Features
     * SFF-8472 revision 12.0
     * SFF-8636 revision 2.7
 
+* **Added Rx queue available descriptors threshold and event.**
+
+  Added ethdev API and corresponding driver operations to set Rx queue
+  available descriptors threshold and query for queues with reached
+  threshold when a new event RTE_ETH_EVENT_RX_AVAIL_THRESH is received.
+
 * **Added vhost API to get the number of in-flight packets.**
 
   Added an API which can get the number of in-flight packets in
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index bbeba554eb..fb7d2839fe 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -2021,6 +2021,15 @@ Set fec mode for a specific port::
 
   testpmd> set port (port_id) fec_mode auto|off|rs|baser
 
+Set Rx queue available descriptors threshold
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Set available descriptors threshold for a specific Rx queue of a port::
+
+  testpmd> set port (port_id) rxq (queue_id) avail_thresh (0..99)
+
+Use the value 0 to disable the threshold and the corresponding event.
+
 Config Sample actions list
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
 
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..5101868ea7 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -1073,6 +1073,26 @@ typedef int (*eth_ip_reassembly_conf_set_t)(struct rte_eth_dev *dev,
  */
 typedef int (*eth_dev_priv_dump_t)(struct rte_eth_dev *dev, FILE *file);
 
+/**
+ * @internal Set Rx queue available descriptors threshold.
+ * @see rte_eth_rx_avail_thresh_set()
+ *
+ * Driver should round down number of descriptors on conversion from
+ * percentage.
+ */
+typedef int (*eth_rx_queue_avail_thresh_set_t)(struct rte_eth_dev *dev,
+				      uint16_t rx_queue_id,
+				      uint8_t avail_thresh);
+
+/**
+ * @internal Query Rx queue available descriptors threshold event.
+ * @see rte_eth_rx_avail_thresh_query()
+ */
+
+typedef int (*eth_rx_queue_avail_thresh_query_t)(struct rte_eth_dev *dev,
+					uint16_t *rx_queue_id,
+					uint8_t *avail_thresh);
+
 /**
  * @internal A structure containing the functions exported by an Ethernet driver.
  */
@@ -1283,6 +1303,11 @@ struct eth_dev_ops {
 
 	/** Dump private info from device */
 	eth_dev_priv_dump_t eth_dev_priv_dump;
+
+	/** Set Rx queue available descriptors threshold */
+	eth_rx_queue_avail_thresh_set_t rx_queue_avail_thresh_set;
+	/** Query Rx queue available descriptors threshold event */
+	eth_rx_queue_avail_thresh_query_t rx_queue_avail_thresh_query;
 };
 
 /**
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index b9c5dea09f..90e50eb02b 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4432,6 +4432,50 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
 							queue_idx, tx_rate));
 }
 
+int rte_eth_rx_avail_thresh_set(uint16_t port_id, uint16_t queue_id,
+			       uint8_t avail_thresh)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id >= dev->data->nb_rx_queues) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue avail thresh: port %u: invalid queue ID=%u.\n",
+			port_id, queue_id);
+		return -EINVAL;
+	}
+
+	if (avail_thresh > 99) {
+		RTE_ETHDEV_LOG(ERR,
+			"Set queue avail thresh: port %u: threshold should be <= 99.\n",
+			port_id);
+		return -EINVAL;
+	}
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_avail_thresh_set, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_avail_thresh_set)(dev,
+							     queue_id, avail_thresh));
+}
+
+int rte_eth_rx_avail_thresh_query(uint16_t port_id, uint16_t *queue_id,
+				 uint8_t *avail_thresh)
+{
+	struct rte_eth_dev *dev;
+
+	RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+	dev = &rte_eth_devices[port_id];
+
+	if (queue_id == NULL)
+		return -EINVAL;
+	if (*queue_id >= dev->data->nb_rx_queues)
+		*queue_id = 0;
+
+	RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_avail_thresh_query, -ENOTSUP);
+	return eth_err(port_id, (*dev->dev_ops->rx_queue_avail_thresh_query)(dev,
+							     queue_id, avail_thresh));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
 	uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 7baea176e8..1975224796 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1931,6 +1931,13 @@ struct rte_eth_rxq_info {
 	uint8_t queue_state;        /**< one of RTE_ETH_QUEUE_STATE_*. */
 	uint16_t nb_desc;           /**< configured number of RXDs. */
 	uint16_t rx_buf_size;       /**< hardware receive buffer size. */
+	/**
+	 * Available Rx descriptors threshold defined as percentage
+	 * of Rx queue size. If number of available descriptors is lower,
+	 * the event RTE_ETH_EVENT_RX_AVAIL_THRESH is generated.
+	 * Value 0 means that the threshold monitoring is disabled.
+	 */
+	uint8_t avail_thresh;
 } __rte_cache_min_aligned;
 
 /**
@@ -3652,6 +3659,65 @@ int rte_eth_dev_get_vlan_offload(uint16_t port_id);
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue available descriptors threshold.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_id
+ *  The index of the receive queue.
+ * @param avail_thresh
+ *  The available descriptors threshold is a percentage of the Rx queue size
+ *  which describes the availability of the Rx queue for hardware. If the Rx
+ *  queue availability falls below it, the device will generate the event
+ *  RTE_ETH_EVENT_RX_AVAIL_THRESH.
+ *  [1-99] to set a new available descriptors threshold.
+ *  0 to disable threshold monitoring.
+ *
+ * @return
+ *   - 0 if successful.
+ *   - (-ENODEV) if @p port_id is invalid.
+ *   - (-EINVAL) if bad parameter.
+ *   - (-ENOTSUP) if available Rx descriptors threshold is not supported.
+ *   - (-EIO) if device is removed.
+ */
+__rte_experimental
+int rte_eth_rx_avail_thresh_set(uint16_t port_id, uint16_t queue_id,
+			       uint8_t avail_thresh);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find Rx queue with RTE_ETH_EVENT_RX_AVAIL_THRESH event pending.
+ *
+ * @param port_id
+ *  The port iden