DPDK patches and discussions
* [PATCH V1 0/7] port probe time optimization
@ 2024-10-16  8:38 Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 1/7] mailmap: update user name Minggang Li(Gavin)
                   ` (6 more replies)
  0 siblings, 7 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

This patch series reduces the time needed to probe VFs/SFs at large scale,
e.g. hundreds of VFs/SFs. The feature is controlled through the
"probe_opt_en" device argument; setting it to a non-zero value enables the
optimization when probing a device. It relies on an RDMA driver feature,
the RDMA monitor, which is to be released in the upcoming upstream kernel
6.13 or the equivalent in OFED 24.10. For further information on the
devargs limitations, see "doc/guides/nics/mlx5.rst".
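
As an illustration of how the device argument might be supplied, the
sketch below builds a device string of the form "<PCI>,probe_opt_en=1".
The helper name and the PCI address used with it are hypothetical and
not part of this series; only the "probe_opt_en" key comes from the
patches.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format "<pci>,probe_opt_en=<0|1>" into buf.
 * Returns 0 on success, -1 if the output was truncated or failed.
 */
static int
sketch_build_devargs(char *buf, size_t len, const char *pci, int enable)
{
	int n = snprintf(buf, len, "%s,probe_opt_en=%d", pci, enable ? 1 : 0);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

The resulting string would typically be passed to EAL after the "-a"
allow-list option.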

Minggang Li(Gavin) (5):
  mailmap: update user name
  common/mlx5: fix Netlink socket leak
  common/mlx5: add RDMA monitor event awareness
  mlx5: use RDMA Netlink to update port information
  mlx5: add backward compatibility for RDMA monitor

Rongwei Liu (2):
  net/mlx5: optimize device probing
  net/mlx5: add new devargs to control probe optimization

 .mailmap                                     |   2 +-
 doc/guides/nics/mlx5.rst                     |  13 +
 drivers/common/mlx5/linux/meson.build        |  10 +
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 +
 drivers/common/mlx5/linux/mlx5_nl.c          | 262 ++++++++++++++++---
 drivers/common/mlx5/linux/mlx5_nl.h          |  36 ++-
 drivers/common/mlx5/mlx5_common.c            |  20 ++
 drivers/common/mlx5/mlx5_common.h            |  15 ++
 drivers/common/mlx5/version.map              |   3 +
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      | 136 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 144 ++++++++--
 drivers/net/mlx5/linux/mlx5_os.h             |   6 -
 drivers/net/mlx5/mlx5.h                      |   3 +
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 15 files changed, 591 insertions(+), 75 deletions(-)

-- 
2.34.1



* [PATCH V1 1/7] mailmap: update user name
  2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
@ 2024-10-16  8:38 ` Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
---
 .mailmap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index ec12c42dd4..cb08c2c5d7 100644
--- a/.mailmap
+++ b/.mailmap
@@ -458,7 +458,6 @@ Gary Mussar <gmussar@ciena.com>
 Gaurav Singh <gaurav1086@gmail.com>
 Gautam Dawar <gdawar@solarflare.com>
 Gavin Hu <gavin.hu@arm.com> <gavin.hu@linaro.org>
-Gavin Li <gavinl@nvidia.com>
 Geoffrey Le Gourriérec <geoffrey.le_gourrierec@6wind.com>
 Geoffrey Lv <geoffrey.lv@gmail.com>
 Geoff Thorpe <geoff.thorpe@nxp.com>
@@ -1012,6 +1011,7 @@ Mike Ximing Chen <mike.ximing.chen@intel.com>
 Milena Olech <milena.olech@intel.com>
 Min Cao <min.cao@intel.com>
 Minghuan Lian <minghuan.lian@nxp.com>
+Minggang Li(Gavin) <gavinl@nvidia.com>
 Mingjin Ye <mingjinx.ye@intel.com>
 Mingshan Zhang <mingshan.zhang@intel.com>
 Mingxia Liu <mingxia.liu@intel.com>
-- 
2.34.1



* [PATCH V1 2/7] net/mlx5: optimize device probing
  2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 1/7] mailmap: update user name Minggang Li(Gavin)
@ 2024-10-16  8:38 ` Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

The current DPDK probing logic is:
1. Query the IB device index and the total number of ports.
2. Query each port by traversing the port indexes to get the
   port's ifindex, name, state, etc.
3. Compare the information with the devargs until a match is found.
4. For each device being probed, repeat steps 2 and 3.

Step 2 communicates with the kernel via Netlink and is time-consuming.
There is no need to repeat the Netlink communication for each probed
device: the PMD can traverse all ports once and save the information in
a cache structure.

Introduce device information caching in the mlx5 common device
handle and cache the number of ports, the ibindex and the port ifindex.
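
The caching idea can be sketched as follows. This is a simplified,
self-contained model and not the code in this patch: all sketch_* names,
the fixed-size array and the fake query function are assumptions made
for illustration. The slow query stands in for the Netlink round trip
and runs only on a cache miss.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_PORTS 8 /* hypothetical bound, for the sketch only */

/* Hypothetical per-port cache entry: ifindex 0 means "no netdev". */
struct sketch_port_entry {
	uint32_t ifindex;
	uint8_t valid;
};

/* Hypothetical per-device cache; ports are indexed from 1. */
struct sketch_dev_cache {
	uint32_t port_num;
	struct sketch_port_entry port[SKETCH_MAX_PORTS + 1];
};

/* Counts how many times the slow path runs. */
static unsigned int sketch_query_count;

/* Stand-in for the Netlink round trip; returns a made-up ifindex. */
static uint32_t
sketch_slow_query(uint32_t pindex)
{
	sketch_query_count++;
	return 100 + pindex;
}

/* Return the ifindex of a port, querying only on a cache miss. */
static uint32_t
sketch_ifindex(struct sketch_dev_cache *cache, uint32_t pindex)
{
	if (pindex == 0 || pindex > cache->port_num)
		return 0;
	if (!cache->port[pindex].valid) {
		cache->port[pindex].ifindex = sketch_slow_query(pindex);
		cache->port[pindex].valid = 1;
	}
	return cache->port[pindex].ifindex;
}
```

Probing the same port twice then costs a single query; the second call
is served from the cache.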

Dynamic interface changes are handled as follows:
1. Creating new VFs by toggling switchdev mode requires restarting DPDK,
   because the SR-IOV configuration has changed.
2. Changing the number of VFs without toggling switchdev mode triggers
   RTM_DELLINK and RTM_NEWLINK events. All cached information is
   cleared.
3. Creating an SF triggers an RTM_NEWLINK event that carries no port index
   information. All free entries (ifindex = 0) in the cache are invalidated.
4. Deleting an SF triggers an RTM_DELLINK event. The cache entries are
   traversed and the one with the same ifindex is invalidated.
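
The invalidation rules for SF creation and deletion can be sketched as
below. This is a simplified model under stated assumptions: the sketch_*
names and the event enum are hypothetical stand-ins for RTM_NEWLINK and
RTM_DELLINK, and the switch-info check that the patch performs before
flushing on a new link is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for RTM_NEWLINK / RTM_DELLINK. */
enum sketch_event { SKETCH_NEWLINK, SKETCH_DELLINK };

#define SKETCH_PORTS 4

/* Hypothetical cache entry; ports are indexed from 1. */
struct sketch_entry {
	uint32_t ifindex;
	uint8_t valid;
};

/* Return the port index of a valid entry with this ifindex, 0 if none. */
static uint32_t
sketch_find(const struct sketch_entry *port, uint32_t n, uint32_t ifindex)
{
	uint32_t i;

	for (i = 1; i <= n; i++)
		if (port[i].valid && port[i].ifindex == ifindex)
			return i;
	return 0;
}

/* Apply a link event to the cache. */
static void
sketch_on_link_event(struct sketch_entry *port, uint32_t n,
		     enum sketch_event ev, uint32_t ifindex)
{
	uint32_t hit = sketch_find(port, n, ifindex);
	uint32_t i;

	if (ev == SKETCH_NEWLINK && hit == 0) {
		/* Unknown new link: free entries must be queried again. */
		for (i = 1; i <= n; i++)
			if (port[i].ifindex == 0)
				port[i].valid = 0;
	} else if (ev == SKETCH_DELLINK && hit != 0) {
		/* Known link removed: drop only the matching entry. */
		port[hit].ifindex = 0;
		port[hit].valid = 0;
	}
}
```

Entries that already hold a non-zero ifindex survive a new-link event,
so only the unused slots are queried again on the next probe.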

Race conditions between the probing thread and the interrupt thread are
not handled.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 ++
 drivers/common/mlx5/linux/mlx5_nl.c          |  94 +++++++++++++----
 drivers/common/mlx5/linux/mlx5_nl.h          |   8 +-
 drivers/common/mlx5/mlx5_common.c            |   5 +
 drivers/common/mlx5/mlx5_common.h            |  13 +++
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  54 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 104 ++++++++++++++-----
 drivers/net/mlx5/linux/mlx5_os.h             |   6 --
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 10 files changed, 242 insertions(+), 58 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index e8aa1d46ec..2e2c54f1fa 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -22,6 +22,12 @@
 #include "mlx5_glue.h"
 #include "mlx5_malloc.h"
 
+/* verb enumerations translations to local enums. */
+enum {
+	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
+	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
+};
+
 /**
  * Get device name. Given an ibv_device pointer - return a
  * pointer to the corresponding device name.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index a5ac4dc543..e98073aafe 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1073,16 +1073,18 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret;
 
-	ret = mlx5_nl_send(nl, &req.nh, sn);
-	if (ret < 0)
-		return ret;
-	ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
-	if (ret < 0)
-		return ret;
-	if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
-	    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
-		goto error;
-	data->flags = 0;
+	if (data->ibindex == UINT32_MAX) {
+		ret = mlx5_nl_send(nl, &req.nh, sn);
+		if (ret < 0)
+			return ret;
+		ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
+		if (ret < 0)
+			return ret;
+		if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
+		    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
+			goto error;
+		data->flags = 0;
+	}
 	sn = MLX5_NL_SN_GENERATE;
 	req.nh.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
 					     RDMA_NLDEV_CMD_PORT_GET);
@@ -1109,7 +1111,7 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	    !(data->flags & MLX5_NL_CMD_GET_NET_INDEX) ||
 	    !data->ifindex)
 		goto error;
-	return 1;
+	return 0;
 error:
 	rte_errno = ENODEV;
 	return -rte_errno;
@@ -1128,21 +1130,48 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   A valid (nonzero) interface index on success, 0 otherwise and rte_errno
  *   is set.
  */
 unsigned int
-mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
+mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
+	int ret;
+
 	struct mlx5_nl_port_info data = {
 			.ifindex = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
+			.flags = 0,
 	};
 
-	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
-		return 0;
-	return data.ifindex;
+	if (!strcmp(name, dev_info->ibname)) {
+		if (dev_info->port_info && pindex <= dev_info->port_num &&
+		    dev_info->port_info[pindex].valid) {
+			if (!dev_info->port_info[pindex].ifindex)
+				rte_errno = ENODEV;
+			return dev_info->port_info[pindex].ifindex;
+		}
+		if (dev_info->port_num)
+			data.ibindex = dev_info->ibindex;
+	}
+
+	ret = mlx5_nl_port_info(nl, pindex, &data);
+
+	if (!strcmp(dev_info->ibname, name)) {
+		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
+		    pindex <= dev_info->port_num) {
+			if (!ret)
+				dev_info->port_info[pindex].ifindex = data.ifindex;
+			/* -ENODEV means the pindex is unused but still valid case */
+			dev_info->port_info[pindex].valid = 1;
+		}
+	}
+
+	return ret ? 0 : data.ifindex;
 }
 
 /**
@@ -1157,18 +1186,23 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   Port state (ibv_port_state) on success, negative on error
  *   and rte_errno is set.
  */
 int
-mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
+mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 			.state = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
 	};
 
+	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
 	if ((data.flags & MLX5_NL_CMD_GET_PORT_STATE) == 0) {
@@ -1185,13 +1219,15 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
  *   Netlink socket of the RDMA kind (NETLINK_RDMA).
  * @param[in] name
  *   IB device name.
+ * @param[in] dev_info
+ *   Cached mlx5 device info.
  *
  * @return
  *   A valid (nonzero) number of ports on success, 0 otherwise
  *   and rte_errno is set.
  */
 unsigned int
-mlx5_nl_portnum(int nl, const char *name)
+mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 		.flags = 0,
@@ -1206,7 +1242,10 @@ mlx5_nl_portnum(int nl, const char *name)
 		.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP,
 	};
 	uint32_t sn = MLX5_NL_SN_GENERATE;
-	int ret;
+	int ret, size;
+
+	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
 	if (ret < 0)
@@ -1220,8 +1259,25 @@ mlx5_nl_portnum(int nl, const char *name)
 		rte_errno = ENODEV;
 		return 0;
 	}
-	if (!data.portnum)
+	if (!data.portnum) {
 		rte_errno = EINVAL;
+		return 0;
+	}
+	MLX5_ASSERT(!strlen(dev_info->ibname));
+	dev_info->port_num = data.portnum;
+	dev_info->ibindex = data.ibindex;
+	snprintf(dev_info->ibname, MLX5_FS_NAME_MAX, "%s", name);
+	if (data.portnum > 1) {
+		size = (data.portnum + 1) * sizeof(struct mlx5_port_nl_info);
+		dev_info->port_info = mlx5_malloc(MLX5_MEM_ZERO | MLX5_MEM_RTE, size,
+						  RTE_CACHE_LINE_SIZE,
+						  SOCKET_ID_ANY);
+		if (dev_info->port_info == NULL) {
+			memset(dev_info, 0, sizeof(*dev_info));
+			rte_errno = ENOMEM;
+			return 0;
+		}
+	}
 	return data.portnum;
 }
 
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 580de3b769..396ffc98ce 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -11,6 +11,7 @@
 #include <rte_ether.h>
 
 #include "mlx5_common.h"
+#include "mlx5_common_utils.h"
 
 typedef void (mlx5_nl_event_cb)(struct nlmsghdr *hdr, void *user_data);
 
@@ -52,11 +53,12 @@ int mlx5_nl_promisc(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
 int mlx5_nl_allmulti(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
-unsigned int mlx5_nl_portnum(int nl, const char *name);
+unsigned int mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info);
 __rte_internal
-unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex);
+unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex,
+			     struct mlx5_dev_info *info);
 __rte_internal
-int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex);
+int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info);
 __rte_internal
 int mlx5_nl_vf_mac_addr_modify(int nlsk_fd, unsigned int iface_idx,
 			       struct rte_ether_addr *mac, int vf_index);
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index ca8543e36e..0aaae91c31 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -735,6 +735,11 @@ mlx5_common_dev_release(struct mlx5_common_device *cdev)
 		if (TAILQ_EMPTY(&devices_list))
 			rte_mem_event_callback_unregister("MLX5_MEM_EVENT_CB",
 							  NULL);
+		if (cdev->dev_info.port_info != NULL) {
+			mlx5_free(cdev->dev_info.port_info);
+			cdev->dev_info.port_info = NULL;
+		}
+		cdev->dev_info.port_num = 0;
 		mlx5_dev_mempool_unsubscribe(cdev);
 		mlx5_mr_release_cache(&cdev->mr_scache);
 		mlx5_dev_hw_global_release(cdev);
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 1abd1e8239..6cb40f54dd 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -174,6 +174,18 @@ enum mlx5_nl_phys_port_name_type {
 	MLX5_PHYS_PORT_NAME_TYPE_UNKNOWN, /* Unrecognized. */
 };
 
+struct mlx5_port_nl_info {
+	uint32_t ifindex;
+	uint8_t valid;
+};
+
+struct mlx5_dev_info {
+	uint32_t port_num;
+	uint32_t ibindex;
+	char ibname[MLX5_FS_NAME_MAX];
+	struct mlx5_port_nl_info *port_info;
+};
+
 /** Switch information returned by mlx5_nl_switch_info(). */
 struct mlx5_switch_info {
 	uint32_t master:1; /**< Master device. */
@@ -525,6 +537,7 @@ struct mlx5_common_device {
 	uint32_t classes_loaded;
 	void *ctx; /* Verbs/DV/DevX context. */
 	void *pd; /* Protection Domain. */
+	struct mlx5_dev_info dev_info; /* Device port info queried via netlink. */
 	uint32_t pdn; /* Protection Domain Number. */
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	struct mlx5_common_dev_config config; /* Device configuration. */
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index acee0c987f..65394035de 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -20,6 +20,11 @@
 
 #define MLX5_BF_OFFSET 0x800
 
+enum {
+	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
+	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
+};
+
 /**
  * This API allocates aligned or non-aligned memory.  The free can be on either
  * aligned or nonaligned memory.  To be protected - even though there may be no
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5d64984022..08ac6dd939 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -23,6 +23,7 @@
 #include <stdalign.h>
 #include <sys/un.h>
 #include <time.h>
+#include <linux/rtnetlink.h>
 
 #include <ethdev_linux_ethtool.h>
 #include <ethdev_driver.h>
@@ -673,6 +674,57 @@ mlx5_link_update_bond(struct rte_eth_dev *dev)
 		((ifr.ifr_flags & IFF_UP) && (ifr.ifr_flags & IFF_RUNNING));
 }
 
+static void
+mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
+			     uint16_t msg_type)
+{
+	struct mlx5_switch_info info = {
+		.master = 0,
+		.representor = 0,
+		.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET,
+		.port_name = 0,
+		.switch_id = 0,
+	};
+	uint32_t i;
+	int nl_route;
+
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].valid)
+			continue;
+		if (dev_info->port_info[i].ifindex == if_index)
+			break;
+	}
+	if (msg_type == RTM_NEWLINK && i > dev_info->port_num) {
+		nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
+		if  (nl_route < 0)
+			goto flush_all;
+
+		if (mlx5_nl_switch_info(nl_route, if_index, &info)) {
+			if (mlx5_sysfs_switch_info(if_index, &info))
+				goto flush_all;
+		}
+
+		if (info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFSF ||
+		    info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFVF)
+			goto flush_all;
+		close(nl_route);
+	} else if (msg_type == RTM_DELLINK && i <= dev_info->port_num) {
+		memset(dev_info->port_info + i, 0, sizeof(struct mlx5_port_nl_info));
+	}
+
+	return;
+flush_all:
+	if (nl_route >= 0)
+		close(nl_route);
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].ifindex)
+			dev_info->port_info[i].valid = 0;
+	}
+}
+
 static void
 mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 {
@@ -682,6 +734,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
+	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
 		struct rte_eth_dev *dev;
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 0a8de88759..dcf1ff917b 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1268,7 +1268,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		/* IB doesn't allow more than 255 ports, must be Ethernet. */
 		err = mlx5_nl_port_state(nl_rdma,
 			spawn->phys_dev_name,
-			spawn->phys_port);
+			spawn->phys_port, &spawn->cdev->dev_info);
 		if (err < 0) {
 			DRV_LOG(INFO, "Failed to get netlink port state: %s",
 				strerror(rte_errno));
@@ -1892,6 +1892,8 @@ mlx5_dev_spawn_data_cmp(const void *a, const void *b)
  *   Netlink RDMA group socket handle.
  * @param[in] owner
  *   Representor owner PF index.
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @param[out] bond_info
  *   Pointer to bonding information.
  *
@@ -1903,6 +1905,7 @@ static int
 mlx5_device_bond_pci_match(const char *ibdev_name,
 			   const struct rte_pci_addr *pci_dev,
 			   int nl_rdma, uint16_t owner,
+			   struct mlx5_dev_info *dev_info,
 			   struct mlx5_bond_info *bond_info)
 {
 	char ifname[IF_NAMESIZE + 1];
@@ -1923,7 +1926,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		return -1;
 	if (!strstr(ibdev_name, "bond"))
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibdev_name);
+	np = mlx5_nl_portnum(nl_rdma, ibdev_name, dev_info);
 	if (!np)
 		return -1;
 	if (mlx5_get_device_guid(pci_dev, cur_guid, sizeof(cur_guid)) < 0)
@@ -1935,7 +1938,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 	 */
 	for (i = 1; i <= np; ++i) {
 		/* Check whether Infiniband port is populated. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -1973,9 +1976,13 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (!file)
 			break;
 		info.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET;
-		if (fscanf(file, "%32s", tmp_str) == 1)
+		if (fscanf(file, "%32s", tmp_str) == 1) {
 			mlx5_translate_port_name(tmp_str, &info);
-		fclose(file);
+			fclose(file);
+		} else {
+			fclose(file);
+			break;
+		}
 		/* Only process PF ports. */
 		if (info.name_type != MLX5_PHYS_PORT_NAME_TYPE_LEGACY &&
 		    info.name_type != MLX5_PHYS_PORT_NAME_TYPE_UPLINK)
@@ -1998,8 +2005,8 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (ret != 1)
 			break;
 		/* Save bonding info. */
-		strncpy(bond_info->ports[info.port_name].ifname, ifname,
-			sizeof(bond_info->ports[0].ifname));
+		snprintf(bond_info->ports[info.port_name].ifname,
+			 sizeof(bond_info->ports[0].ifname), "%s", ifname);
 		bond_info->ports[info.port_name].pci_addr = pci_addr;
 		bond_info->ports[info.port_name].ifindex = ifindex;
 		bond_info->n_port++;
@@ -2028,6 +2035,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		      pci_addr.function == owner)))
 			pf = info.port_name;
 	}
+	fclose(bond_file);
 	if (pf >= 0) {
 		/* Get bond interface info */
 		ret = mlx5_sysfs_bond_info(ifindex, &bond_info->ifindex,
@@ -2079,7 +2087,8 @@ mlx5_nl_esw_multiport_get(struct rte_pci_addr *pci_addr, int *enabled)
 #define SYSFS_MPESW_PARAM_MAX_LEN 16
 
 static int
-mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled)
+mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled,
+			     struct mlx5_dev_info *dev_info)
 {
 	int nl_rdma;
 	unsigned int n_ports;
@@ -2091,7 +2100,7 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	if (nl_rdma < 0)
 		return nl_rdma;
-	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name);
+	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!n_ports) {
 		ret = -rte_errno;
 		goto close_nl_rdma;
@@ -2099,12 +2108,12 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	for (i = 1; i <= n_ports; ++i) {
 		unsigned int ifindex;
 		char ifname[IF_NAMESIZE + 1];
-		struct rte_pci_addr if_pci_addr;
+		struct rte_pci_addr if_pci_addr = { 0 };
 		char mpesw[SYSFS_MPESW_PARAM_MAX_LEN + 1];
 		FILE *sysfs;
 		int n;
 
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2146,7 +2155,8 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 }
 
 static int
-mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled)
+mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled,
+		      struct mlx5_dev_info *dev_info)
 {
 	/*
 	 * Try getting Multiport E-Switch state through netlink interface
@@ -2154,7 +2164,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 	 * assume that Multiport E-Switch is disabled and return an error.
 	 */
 	if (mlx5_nl_esw_multiport_get(ibv_pci_addr, enabled) >= 0 ||
-	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled) >= 0)
+	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled, dev_info) >= 0)
 		return 0;
 	DRV_LOG(DEBUG, "Unable to check MPESW state for IB device %s "
 		       "(PCI: " PCI_PRI_FMT ")",
@@ -2168,7 +2178,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 static int
 mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 			    const struct rte_pci_addr *owner_pci,
-			    int nl_rdma)
+			    int nl_rdma, struct mlx5_dev_info *dev_info)
 {
 	struct rte_pci_addr ibdev_pci_addr = { 0 };
 	char ifname[IF_NAMESIZE + 1] = { 0 };
@@ -2192,24 +2202,24 @@ mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 		return -1;
 	}
 	/* Check if IB device has MPESW enabled. */
-	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled))
+	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled, dev_info))
 		return -1;
 	if (!enabled)
 		return -1;
 	/* Iterate through IB ports to find MPESW master uplink port. */
 	if (nl_rdma < 0)
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibv->name);
+	np = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!np)
 		return -1;
 	for (i = 1; i <= np; ++i) {
-		struct rte_pci_addr pci_addr;
+		struct rte_pci_addr pci_addr = { 0 };
 		FILE *file;
 		char port_name[IF_NAMESIZE + 1];
 		struct mlx5_switch_info	info;
 
 		/* Check whether IB port has a corresponding netdev. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2316,16 +2326,30 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	 * matching ones, gathering into the list.
 	 */
 	struct ibv_device *ibv_match[ret + 1];
+	struct mlx5_dev_info *info, tmp_info[ret];
 	int nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
 	int nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	unsigned int i;
 
+	memset(tmp_info, 0, sizeof(tmp_info));
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
+		if (cdev->dev_info.port_num) {
+			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
+				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
+					cdev->dev_info.ibname, ibv_list[ret]->name);
+				continue;
+			}
+			info = &cdev->dev_info;
+		} else {
+			info = &tmp_info[ret];
+		}
 		DRV_LOG(DEBUG, "Checking device \"%s\"", ibv_list[ret]->name);
 		bd = mlx5_device_bond_pci_match(ibv_list[ret]->name, &owner_pci,
-						nl_rdma, owner_id, &bond_info);
+						nl_rdma, owner_id,
+						info,
+						&bond_info);
 		if (bd >= 0) {
 			/*
 			 * Bonding device detected. Only one match is allowed,
@@ -2351,7 +2375,8 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ibv_match[nd++] = ibv_list[ret];
 			break;
 		}
-		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma);
+		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma,
+						    info);
 		if (mpesw >= 0) {
 			/*
 			 * MPESW device detected. Only one matching IB device is allowed,
@@ -2375,10 +2400,18 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		}
 		/* Bonding or MPESW device was not found. */
 		if (mlx5_get_pci_addr(ibv_list[ret]->ibdev_path,
-					&pci_addr))
+					&pci_addr)) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
-		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0)
+		}
+		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
+		}
 		DRV_LOG(INFO, "PCI information matches for device \"%s\"",
 			ibv_list[ret]->name);
 		ibv_match[nd++] = ibv_list[ret];
@@ -2396,13 +2429,21 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		goto exit;
 	}
 	if (nd == 1) {
+		if (!cdev->dev_info.port_num) {
+			for (i = 0; i < RTE_DIM(tmp_info); i++) {
+				if (tmp_info[i].port_num) {
+					cdev->dev_info = tmp_info[i];
+					break;
+				}
+			}
+		}
 		/*
 		 * Found single matching device may have multiple ports.
 		 * Each port may be representor, we have to check the port
 		 * number and check the representors existence.
 		 */
 		if (nl_rdma >= 0)
-			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name);
+			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name, &cdev->dev_info);
 		if (!np)
 			DRV_LOG(WARNING,
 				"Cannot get IB device \"%s\" ports number.",
@@ -2419,6 +2460,14 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ret = -rte_errno;
 			goto exit;
 		}
+	} else {
+		/* Can't handle one common device with multiple IB devices caching */
+		for (i = 0; i < RTE_DIM(tmp_info); i++) {
+			if (tmp_info[i].port_info != NULL)
+				mlx5_free(tmp_info[i].port_info);
+			memset(&tmp_info[i], 0, sizeof(tmp_info[0]));
+		}
+		DRV_LOG(INFO, "Cannot handle multiple IB devices info caching in single common device.");
 	}
 	/* Now we can determine the maximal amount of devices to be spawned. */
 	list = mlx5_malloc(MLX5_MEM_ZERO,
@@ -2452,7 +2501,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			list[ns].mpesw_port = MLX5_MPESW_PORT_INVALID;
 			list[ns].ifindex = mlx5_nl_ifindex(nl_rdma,
 							   ibv_match[0]->name,
-							   i);
+							   i, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				/*
 				 * No network interface index found for the
@@ -2590,7 +2639,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 				list[ns].ifindex = mlx5_nl_ifindex
 							    (nl_rdma,
 							     ibv_match[i]->name,
-							     1);
+							     1, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				char ifname[IF_NAMESIZE];
 
@@ -2779,6 +2828,11 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		mlx5_free(list);
 	MLX5_ASSERT(ibv_list);
 	mlx5_glue->free_device_list(ibv_list);
+	if (ret) {
+		if (cdev->dev_info.port_info != NULL)
+			mlx5_free(cdev->dev_info.port_info);
+		memset(&cdev->dev_info, 0, sizeof(cdev->dev_info));
+	}
 	return ret;
 }
 
diff --git a/drivers/net/mlx5/linux/mlx5_os.h b/drivers/net/mlx5/linux/mlx5_os.h
index 80c70d713a..4ef0916173 100644
--- a/drivers/net/mlx5/linux/mlx5_os.h
+++ b/drivers/net/mlx5/linux/mlx5_os.h
@@ -8,12 +8,6 @@
 
 #include <net/if.h>
 
-/* verb enumerations translations to local enums. */
-enum {
-	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
-	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
-};
-
 /* Maximal data of sendmsg message(in bytes). */
 #define MLX5_SENDMSG_MAX 64
 
diff --git a/drivers/net/mlx5/windows/mlx5_os.h b/drivers/net/mlx5/windows/mlx5_os.h
index 8b58265687..fb7198c244 100644
--- a/drivers/net/mlx5/windows/mlx5_os.h
+++ b/drivers/net/mlx5/windows/mlx5_os.h
@@ -7,11 +7,6 @@
 
 #include "mlx5_win_ext.h"
 
-enum {
-	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
-	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
-};
-
 #define PCI_DRV_FLAGS 0
 
 #define MLX5_NAMESIZE MLX5_FS_NAME_MAX
-- 
2.34.1



* [PATCH V1 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 1/7] mailmap: update user name Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
@ 2024-10-16  8:38 ` Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

Add a new devarg, probe_opt_en, to control the probe optimization
in the PMD.

By default, the value is 0 and the behavior is unchanged.
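
A minimal sketch of how such a value could be interpreted is shown
below, assuming the devarg arrives as a string and any non-zero integer
enables the optimization. The function name is hypothetical and the
parsing shown is not necessarily what the PMD does.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical parser: accept an integer in any base understood by
 * strtoul() and report whether it is non-zero.
 * Returns 0 on success, -EINVAL on malformed input.
 */
static int
sketch_parse_probe_opt(const char *val, int *enabled)
{
	char *end = NULL;
	unsigned long tmp;

	if (val == NULL || *val == '\0')
		return -EINVAL;
	errno = 0;
	tmp = strtoul(val, &end, 0);
	if (errno != 0 || *end != '\0')
		return -EINVAL;
	*enabled = tmp != 0;
	return 0;
}
```

With this rule "0" keeps the default behavior, while "1" or any other
non-zero integer turns the optimization on.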

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/mlx5.rst                |  7 +++++++
 drivers/common/mlx5/linux/mlx5_nl.c     | 12 ++++++++----
 drivers/common/mlx5/mlx5_common.c       | 15 +++++++++++++++
 drivers/common/mlx5/mlx5_common.h       |  2 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  5 ++++-
 drivers/net/mlx5/linux/mlx5_os.c        |  2 +-
 6 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 1dccdaad50..b4a4e57cde 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1436,6 +1436,13 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 1.
 
+- ``probe_opt_en`` parameter [int]
+
+  A non-zero value optimizes the probe process, especially for large scale.
+  PMD will hold the IB device information internally and reuse it.
+
+  By default, the PMD will set this value to 0.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e98073aafe..745e443f8f 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1148,7 +1148,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 			.flags = 0,
 	};
 
-	if (!strcmp(name, dev_info->ibname)) {
+	if (dev_info->probe_opt && !strcmp(name, dev_info->ibname)) {
 		if (dev_info->port_info && pindex <= dev_info->port_num &&
 		    dev_info->port_info[pindex].valid) {
 			if (!dev_info->port_info[pindex].ifindex)
@@ -1161,7 +1161,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 
 	ret = mlx5_nl_port_info(nl, pindex, &data);
 
-	if (!strcmp(dev_info->ibname, name)) {
+	if (dev_info->probe_opt && !strcmp(dev_info->ibname, name)) {
 		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
 		    pindex <= dev_info->port_num) {
 			if (!ret)
@@ -1201,7 +1201,8 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_in
 			.ibindex = UINT32_MAX,
 	};
 
-	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+	if (dev_info && dev_info->probe_opt &&
+	    !strcmp(name, dev_info->ibname) && dev_info->port_num)
 		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
@@ -1244,7 +1245,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret, size;
 
-	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+	if (dev_info->probe_opt && dev_info->port_num &&
+	    !strcmp(name, dev_info->ibname))
 		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
@@ -1263,6 +1265,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 		rte_errno = EINVAL;
 		return 0;
 	}
+	if (!dev_info->probe_opt)
+		return data.portnum;
 	MLX5_ASSERT(!strlen(dev_info->ibname));
 	dev_info->port_num = data.portnum;
 	dev_info->ibindex = data.ibindex;
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index 0aaae91c31..9abae4a374 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -40,6 +40,9 @@ uint8_t haswell_broadwell_cpu;
 /* The default memory allocator used in PMD. */
 #define MLX5_SYS_MEM_EN "sys_mem_en"
 
+/* Probe optimization in PMD. */
+#define MLX5_PROBE_OPT "probe_opt_en"
+
 /*
  * Device parameter to force doorbell register mapping
  * to non-cached region eliminating the extra write memory barrier.
@@ -295,6 +298,8 @@ mlx5_common_args_check_handler(const char *key, const char *val, void *opaque)
 		config->device_fd = tmp;
 	} else if (strcmp(key, MLX5_PD_HANDLE) == 0) {
 		config->pd_handle = tmp;
+	} else if (strcmp(key, MLX5_PROBE_OPT) == 0) {
+		config->probe_opt = !!tmp;
 	}
 	return 0;
 }
@@ -324,6 +329,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 		MLX5_MR_MEMPOOL_REG_EN,
 		MLX5_DEVICE_FD,
 		MLX5_PD_HANDLE,
+		MLX5_PROBE_OPT,
 		NULL,
 	};
 	int ret = 0;
@@ -332,6 +338,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	config->mr_ext_memseg_en = 1;
 	config->mr_mempool_reg_en = 1;
 	config->sys_mem_en = 0;
+	config->probe_opt = 0;
 	config->dbnc = MLX5_ARG_UNSET;
 	config->device_fd = MLX5_ARG_UNSET;
 	config->pd_handle = MLX5_ARG_UNSET;
@@ -351,6 +358,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	DRV_LOG(DEBUG, "mr_ext_memseg_en is %u.", config->mr_ext_memseg_en);
 	DRV_LOG(DEBUG, "mr_mempool_reg_en is %u.", config->mr_mempool_reg_en);
 	DRV_LOG(DEBUG, "sys_mem_en is %u.", config->sys_mem_en);
+	DRV_LOG(DEBUG, "probe_opt_en is %u.", config->probe_opt);
 	DRV_LOG(DEBUG, "Send Queue doorbell mapping parameter is %d.",
 		config->dbnc);
 	return ret;
@@ -791,6 +799,7 @@ mlx5_common_dev_create(struct rte_device *eal_dev, uint32_t classes,
 	if (TAILQ_EMPTY(&devices_list))
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
+	cdev->dev_info.probe_opt = cdev->config.probe_opt;
 exit:
 	pthread_mutex_lock(&devices_list_lock);
 	TAILQ_INSERT_HEAD(&devices_list, cdev, next);
@@ -880,6 +889,12 @@ mlx5_common_probe_again_args_validate(struct mlx5_common_device *cdev,
 			cdev->dev->name);
 		goto error;
 	}
+	if (cdev->config.probe_opt != config->probe_opt) {
+		DRV_LOG(ERR, "\"" MLX5_PROBE_OPT"\" "
+			"configuration mismatch for device %s.",
+			cdev->dev->name);
+		goto error;
+	}
 	if (cdev->config.dbnc != config->dbnc) {
 		DRV_LOG(ERR, "\"" MLX5_SQ_DB_NC "\" "
 			"configuration mismatch for device %s.",
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 6cb40f54dd..f1b59d6f07 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -183,6 +183,7 @@ struct mlx5_dev_info {
 	uint32_t port_num;
 	uint32_t ibindex;
 	char ibname[MLX5_FS_NAME_MAX];
+	uint8_t probe_opt;
 	struct mlx5_port_nl_info *port_info;
 };
 
@@ -525,6 +526,7 @@ struct mlx5_common_dev_config {
 	int pd_handle; /* Protection Domain handle for importation.  */
 	unsigned int devx:1; /* Whether devx interface is available or not. */
 	unsigned int sys_mem_en:1; /* The default memory allocator. */
+	unsigned int probe_opt:1; /* Optimize probing . */
 	unsigned int mr_mempool_reg_en:1;
 	/* Allow/prevent implicit mempool memory registration. */
 	unsigned int mr_ext_memseg_en:1;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 08ac6dd939..88d3c57c6e 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -691,6 +691,8 @@ mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
 	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
 		return;
 
+	DRV_LOG(DEBUG, "IB device %s ifindex %u received netlink event %u",
+			dev_info->ibname, if_index, msg_type);
 	for (i = 1; i <= dev_info->port_num; i++) {
 		if (!dev_info->port_info[i].valid)
 			continue;
@@ -734,7 +736,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index dcf1ff917b..a408790d1e 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2335,7 +2335,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
-		if (cdev->dev_info.port_num) {
+		if (cdev->config.probe_opt && cdev->dev_info.port_num) {
 			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
 				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
 					cdev->dev_info.ibname, ibv_list[ret]->name);
-- 
2.34.1
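
The devarg handling in this patch can be illustrated with a minimal,
self-contained sketch. It is not the PMD code: probe_config and
probe_args_handler() are hypothetical stand-ins for
mlx5_common_dev_config and mlx5_common_args_check_handler(), and the
integer validation shown here is an assumption about how the value is
parsed. What it does take from the patch is the "!!tmp" normalization,
so that any non-zero value enables the optimization, and the default of 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the relevant part of mlx5_common_dev_config. */
struct probe_config {
	unsigned int probe_opt:1; /* 0 by default: probing behaves as before. */
};

/*
 * Handle one key/value pair. Returns 0 on success (unknown keys are
 * ignored in this sketch) and -EINVAL if the value is not an integer.
 */
int
probe_args_handler(const char *key, const char *val,
		   struct probe_config *config)
{
	char *end = NULL;
	long tmp;

	assert(key != NULL && val != NULL && config != NULL);
	errno = 0;
	tmp = strtol(val, &end, 0);
	if (errno != 0 || end == val || *end != '\0')
		return -EINVAL;
	if (strcmp(key, "probe_opt_en") == 0)
		config->probe_opt = !!tmp; /* Any non-zero value enables it. */
	return 0;
}
```

Because the flag is collapsed to a single bit, "probe_opt_en=1" and
"probe_opt_en=5" are equivalent, which is also what makes the
probe-again mismatch check in the patch a simple equality test.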


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V1 4/7] common/mlx5: fix Netlink socket leak
  2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
                   ` (2 preceding siblings ...)
  2024-10-16  8:38 ` [PATCH V1 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
@ 2024-10-16  8:38 ` Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, Spike Du
  Cc: dev, rasland, stable

Fixes: 72d7efe464b1 ("common/mlx5: share interrupt management")
Cc: stable@dpdk.org

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_os.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index a408790d1e..6609527934 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3073,10 +3073,15 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
+	int fd;
+
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
 					  mlx5_dev_interrupt_handler, sh);
+	fd = rte_intr_fd_get(sh->intr_handle_nl);
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
 					  mlx5_dev_interrupt_handler_nl, sh);
+	if (fd >= 0)
+		close(fd);
 #ifdef HAVE_IBV_DEVX_ASYNC
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
 					  mlx5_dev_interrupt_handler_devx, sh);
-- 
2.34.1
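
The ordering used by this fix can be sketched in isolation. The
sketch assumes, as the patch implies, that destroying the interrupt
handle does not close the descriptor it wraps, so the descriptor has
to be read before the handle is destroyed and closed afterwards. All
names below (intr_handle, handler_create, handler_fd_get,
handler_destroy, handler_uninstall) are hypothetical stand-ins, not
DPDK APIs.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for struct rte_intr_handle. */
struct intr_handle {
	int fd;
};

struct intr_handle *
handler_create(int fd)
{
	struct intr_handle *h = malloc(sizeof(*h));

	if (h != NULL)
		h->fd = fd;
	return h;
}

/* Returns -1 when there is no handle (assumed to mirror rte_intr_fd_get()). */
int
handler_fd_get(const struct intr_handle *h)
{
	return h != NULL ? h->fd : -1;
}

/* Frees the handle only; the descriptor stays open. */
void
handler_destroy(struct intr_handle *h)
{
	free(h);
}

/* Tear down the handle without leaking the socket it referred to. */
void
handler_uninstall(struct intr_handle **h)
{
	int fd;

	assert(h != NULL);
	fd = handler_fd_get(*h); /* Save it before the handle is gone. */
	handler_destroy(*h);
	*h = NULL;
	if (fd >= 0)
		close(fd);
}
```

Reading the descriptor after the destroy call would touch freed
memory in this sketch, which is why the lookup comes first.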


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V1 5/7] common/mlx5: add RDMA monitor event awareness
  2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
                   ` (3 preceding siblings ...)
  2024-10-16  8:38 ` [PATCH V1 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
@ 2024-10-16  8:38 ` Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Minggang(Gavin) Li

From: "Minggang(Gavin) Li" <gavinl@nvidia.com>

RDMA monitor is a new feature introduced by the kernel driver. This commit
adds backward compatibility for kernels that do not support it.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build | 10 ++++++++++
 drivers/common/mlx5/linux/mlx5_nl.c   | 17 +++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 82e8046e0c..58d0328c6d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -170,6 +170,16 @@ has_sym_args = [
             'RDMA_NLDEV_ATTR_PORT_STATE' ],
         [ 'HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX', 'rdma/rdma_netlink.h',
             'RDMA_NLDEV_ATTR_NDEV_INDEX' ],
+        [ 'HAVE_RDMA_NL_GROUP_NOTIFY', 'rdma/rdma_netlink.h',
+            'RDMA_NL_GROUP_NOTIFY' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_SYS_GET', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_SYS_GET' ],
+        [ 'HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_SYS_ATTR_MONITOR_MODE' ],
+        [ 'HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_ATTR_EVENT_TYPE' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_MONITOR', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_MONITOR' ],
         [ 'HAVE_MLX5_DR_FLOW_DUMP', 'infiniband/mlx5dv.h',
             'mlx5dv_dump_dr_domain'],
         [ 'HAVE_MLX5_DR_CREATE_ACTION_FLOW_SAMPLE', 'infiniband/mlx5dv.h',
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index 745e443f8f..e03db4f918 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -84,6 +84,23 @@
 #ifndef HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX
 #define RDMA_NLDEV_ATTR_NDEV_INDEX 50
 #endif
+#ifndef HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE
+#define RDMA_NLDEV_ATTR_EVENT_TYPE 102
+#define RDMA_NETDEV_ATTACH_EVENT 2
+#define RDMA_NETDEV_DETACH_EVENT 3
+#endif
+#ifndef HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE
+#define RDMA_NLDEV_SYS_ATTR_MONITOR_MODE 103
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_MONITOR
+#define RDMA_NLDEV_CMD_MONITOR 28
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_SYS_GET
+#define RDMA_NLDEV_CMD_SYS_GET 6
+#endif
+#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
+#define RDMA_NL_GROUP_NOTIFY 4
+#endif
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
-- 
2.34.1
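
A short sketch of how a fallback definition such as
RDMA_NL_GROUP_NOTIFY is consumed. The value 4 is taken from this
patch; the next patch in the series turns it into a socket group mask
with 1 << (RDMA_NL_GROUP_NOTIFY - 1). The helper nl_group_to_mask()
is a hypothetical name used only for illustration, and the sketch
deliberately does not include the kernel header.

```c
#include <assert.h>
#include <stdint.h>

/* Same pattern as the patch: only define the symbol if the build did not find it. */
#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
#define RDMA_NL_GROUP_NOTIFY 4
#endif

/*
 * Convert a 1-based Netlink multicast group number into the bitmask
 * form used when binding a socket: group N maps to bit N - 1.
 */
uint32_t
nl_group_to_mask(unsigned int group)
{
	assert(group >= 1 && group <= 32);
	return UINT32_C(1) << (group - 1);
}
```

With the fallback value, the notify group maps to mask 0x8.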


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V1 6/7] mlx5: use RDMA Netlink to update port information
  2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
                   ` (4 preceding siblings ...)
  2024-10-16  8:38 ` [PATCH V1 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
@ 2024-10-16  8:38 ` Minggang Li(Gavin)
  2024-10-16  8:38 ` [PATCH V1 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Minggang(Gavin) Li

From: "Minggang(Gavin) Li" <gavinl@nvidia.com>

Previously, port information was updated on port addition and deletion
via route Netlink. The events used were link up/down rather than the
exact port add/delete events, which does not perform well.

To improve performance, use RDMA monitor events to track port addition
and deletion and update the corresponding port information.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/mlx5.rst                |  6 ++
 drivers/common/mlx5/linux/mlx5_nl.c     | 74 ++++++++++++++++++-----
 drivers/common/mlx5/linux/mlx5_nl.h     | 28 +++++++++
 drivers/common/mlx5/version.map         |  2 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c | 79 +++++++++++++++++++++++++
 drivers/net/mlx5/linux/mlx5_os.c        | 20 +++++++
 drivers/net/mlx5/mlx5.h                 |  2 +
 7 files changed, 195 insertions(+), 16 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index b4a4e57cde..d40363fc05 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1443,6 +1443,12 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 0.
 
+  .. note::
+
+    There is a race condition in probing port if probe_opt_en is set to 1.
+    Port probe may fail with wrong ifindex in cache while the interrupt
+    thread is updating the cache. Please try again if port probe failed.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e03db4f918..ce1c2a8e75 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -101,6 +101,7 @@
 #ifndef HAVE_RDMA_NL_GROUP_NOTIFY
 #define RDMA_NL_GROUP_NOTIFY 4
 #endif
+#define RDMA_NL_GROUP_NOTIFICATION (1 << (RDMA_NL_GROUP_NOTIFY - 1))
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
@@ -176,22 +177,6 @@ struct mlx5_nl_mac_addr {
 	int mac_n; /**< Number of addresses in the array. */
 };
 
-#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
-#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
-#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
-#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
-#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
-
-/** Data structure used by mlx5_nl_cmdget_cb(). */
-struct mlx5_nl_port_info {
-	const char *name; /**< IB device name (in). */
-	uint32_t flags; /**< found attribute flags (out). */
-	uint32_t ibindex; /**< IB device index (out). */
-	uint32_t ifindex; /**< Network interface index (out). */
-	uint32_t portnum; /**< IB device max port number (out). */
-	uint16_t state; /**< IB device port state (out). */
-};
-
 RTE_ATOMIC(uint32_t) atomic_sn;
 
 /* Generate Netlink sequence number. */
@@ -2110,3 +2095,60 @@ mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id, const char *pci_ad
 		*enable ? "en" : "dis", pci_addr);
 	return ret;
 }
+
+int
+mlx5_nl_rdma_monitor_init(void)
+{
+	return mlx5_nl_init(NETLINK_RDMA, RDMA_NL_GROUP_NOTIFICATION);
+}
+
+void
+mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t event_type = 0;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_MONITOR))
+		goto error;
+
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_ATTR_EVENT_TYPE:
+			event_type = *(uint8_t *)payload;
+			if (event_type == RDMA_NETDEV_ATTACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_ATTACH_EVENT;
+			} else if (event_type == RDMA_NETDEV_DETACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_DETACH_EVENT;
+			}
+			break;
+		case RDMA_NLDEV_ATTR_DEV_INDEX:
+			data->ibindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_IB_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_PORT_INDEX:
+			data->portnum = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_PORT_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_NDEV_INDEX:
+			data->ifindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_NET_INDEX;
+			break;
+		default:
+			DRV_LOG(DEBUG, "Unknown attribute[%d] found", na->nla_type);
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return;
+
+error:
+	rte_errno = EINVAL;
+}
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 396ffc98ce..e32080fa63 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -32,6 +32,27 @@ struct mlx5_nl_vlan_vmwa_context {
 	struct mlx5_nl_vlan_dev vlan_dev[4096];
 };
 
+#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
+#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
+#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
+#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
+#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
+#define MLX5_NL_CMD_GET_EVENT_TYPE (1 << 5)
+
+/** Data structure used by mlx5_nl_cmdget_cb(). */
+struct mlx5_nl_port_info {
+	const char *name; /**< IB device name (in). */
+	uint32_t flags; /**< found attribute flags (out). */
+	uint32_t ibindex; /**< IB device index (out). */
+	uint32_t ifindex; /**< Network interface index (out). */
+	uint32_t portnum; /**< IB device max port number (out). */
+	uint16_t state; /**< IB device port state (out). */
+	uint8_t event_type; /**< IB RDMA event type (out). */
+};
+
+#define MLX5_NL_RDMA_NETDEV_ATTACH_EVENT (1)
+#define MLX5_NL_RDMA_NETDEV_DETACH_EVENT (2)
+
 __rte_internal
 int mlx5_nl_init(int protocol, int groups);
 __rte_internal
@@ -89,4 +110,11 @@ __rte_internal
 int mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id,
 				      const char *pci_addr, int *enable);
 
+__rte_internal
+int mlx5_nl_rdma_monitor_init(void);
+__rte_internal
+void mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data);
+__rte_internal
+int mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap);
+
 #endif /* RTE_PMD_MLX5_NL_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a2f72ef46a..5230576006 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -146,6 +146,8 @@ INTERNAL {
 	mlx5_nl_vf_mac_addr_modify; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_create; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 88d3c57c6e..5156d96b3a 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -894,6 +894,85 @@ mlx5_dev_interrupt_handler_devx(void *cb_arg)
 #endif /* HAVE_IBV_DEVX_ASYNC */
 }
 
+static void
+mlx5_dev_interrupt_ib_cb(struct nlmsghdr *hdr, void *cb_arg)
+{
+	mlx5_nl_rdma_monitor_info_get(hdr, (struct mlx5_nl_port_info *)cb_arg);
+}
+
+void
+mlx5_dev_interrupt_handler_ib(void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_nl_port_info data = {
+		.flags = 0,
+		.name = "",
+		.ifindex = 0,
+		.ibindex = 0,
+		.portnum = 0,
+	};
+	int nlsk_fd = rte_intr_fd_get(sh->intr_handle_ib);
+	struct mlx5_dev_info *dev_info;
+	uint32_t i;
+
+	dev_info = &sh->cdev->dev_info;
+	DRV_LOG(DEBUG, "IB device %s received RDMA monitor netlink event", dev_info->ibname);
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	if (nlsk_fd < 0)
+		return;
+
+	if (mlx5_nl_read_events(nlsk_fd, mlx5_dev_interrupt_ib_cb, &data) < 0)
+		DRV_LOG(ERR, "Failed to process Netlink events: %s",
+			rte_strerror(rte_errno));
+
+	if (!(data.flags & MLX5_NL_CMD_GET_EVENT_TYPE) ||
+		!(data.flags & MLX5_NL_CMD_GET_PORT_INDEX) ||
+		!(data.flags & MLX5_NL_CMD_GET_IB_INDEX))
+		return;
+
+	if (data.ibindex != dev_info->ibindex)
+		return;
+
+	if (data.event_type != MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+		data.event_type != MLX5_NL_RDMA_NETDEV_DETACH_EVENT)
+		return;
+
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+	    !(data.flags & MLX5_NL_CMD_GET_NET_INDEX))
+		return;
+
+	DRV_LOG(DEBUG, "Event info: type %d, ibindex %d, ifindex %d, portnum %d,",
+		data.event_type, data.ibindex, data.ifindex, data.portnum);
+
+	/* Changes found in number of SF/VF ports. All information is likely unreliable. */
+	if (data.portnum > dev_info->port_num) {
+		DRV_LOG(ERR, "Port[%d] exceeds maximum[%d]", data.portnum, dev_info->port_num);
+		goto flush_all;
+	}
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT) {
+		if (!dev_info->port_info[data.portnum].ifindex) {
+			dev_info->port_info[data.portnum].ifindex = data.ifindex;
+			dev_info->port_info[data.portnum].valid = 1;
+		} else {
+			DRV_LOG(WARNING, "Duplicate RDMA event for port[%d] ifindex[%d]",
+				data.portnum, data.ifindex);
+			if (data.ifindex != dev_info->port_info[data.portnum].ifindex)
+				goto flush_all;
+		}
+	} else if (data.event_type == MLX5_NL_RDMA_NETDEV_DETACH_EVENT) {
+		memset(dev_info->port_info + data.portnum, 0, sizeof(struct mlx5_port_nl_info));
+	}
+	return;
+
+flush_all:
+	for (i = 1; i <= dev_info->port_num; i++) {
+		dev_info->port_info[i].ifindex = 0;
+		dev_info->port_info[i].valid = 0;
+	}
+}
+
 /**
  * DPDK callback to bring the link DOWN.
  *
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 6609527934..2a93e994ff 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3027,6 +3027,21 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+		nlsk_fd = mlx5_nl_rdma_monitor_init();
+		if (nlsk_fd < 0) {
+			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
+				rte_strerror(rte_errno));
+			return;
+		}
+		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+		if (sh->intr_handle_ib == NULL) {
+			DRV_LOG(ERR, "Fail to allocate intr_handle");
+			return;
+		}
+	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
 		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
@@ -3088,6 +3103,11 @@ mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
+	fd = rte_intr_fd_get(sh->intr_handle_ib);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_ib,
+				  mlx5_dev_interrupt_handler_ib, sh);
+	if (fd >= 0)
+		close(fd);
 }
 
 /**
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 18b4c15a26..748d92c9cf 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1574,6 +1574,7 @@ struct mlx5_dev_ctx_shared {
 	struct rte_intr_handle *intr_handle; /* Interrupt handler for device. */
 	struct rte_intr_handle *intr_handle_devx; /* DEVX interrupt handler. */
 	struct rte_intr_handle *intr_handle_nl; /* Netlink interrupt handler. */
+	struct rte_intr_handle *intr_handle_ib; /* Interrupt handler for IB device. */
 	void *devx_comp; /* DEVX async comp obj. */
 	struct mlx5_devx_obj *tis[16]; /* TIS object. */
 	struct mlx5_devx_obj *td; /* Transport domain. */
@@ -2248,6 +2249,7 @@ int mlx5_dev_set_flow_ctrl(struct rte_eth_dev *dev,
 void mlx5_dev_interrupt_handler(void *arg);
 void mlx5_dev_interrupt_handler_devx(void *arg);
 void mlx5_dev_interrupt_handler_nl(void *arg);
+void mlx5_dev_interrupt_handler_ib(void *arg);
 int mlx5_set_link_down(struct rte_eth_dev *dev);
 int mlx5_set_link_up(struct rte_eth_dev *dev);
 int mlx5_is_removed(struct rte_eth_dev *dev);
-- 
2.34.1
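
The cache update rules applied by mlx5_dev_interrupt_handler_ib() can
be summarized in a small stand-alone sketch. It is a simplification
under stated assumptions: port_cache, port_entry, MAX_PORTS and
port_cache_update() are hypothetical names, the array is fixed-size
here, and rejecting port index 0 is an extra guard that the patch does
not have. The attach, detach and flush-all behavior follows the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PORTS 8 /* Hypothetical fixed size for the sketch. */

enum event_type { EV_ATTACH = 1, EV_DETACH = 2 };

struct port_entry {
	uint32_t ifindex;
	uint8_t valid;
};

struct port_cache {
	uint32_t port_num;                     /* Highest port index. */
	struct port_entry port[MAX_PORTS + 1]; /* Ports are 1-based. */
};

static void
port_cache_flush(struct port_cache *c)
{
	memset(c->port, 0, sizeof(c->port));
}

/* Apply one event. Returns 1 if the whole cache was flushed, else 0. */
int
port_cache_update(struct port_cache *c, enum event_type ev,
		  uint32_t portnum, uint32_t ifindex)
{
	assert(c != NULL && c->port_num <= MAX_PORTS);
	/* An out-of-range port means the cached layout is stale. */
	if (portnum == 0 || portnum > c->port_num) {
		port_cache_flush(c);
		return 1;
	}
	if (ev == EV_ATTACH) {
		if (c->port[portnum].ifindex == 0) {
			c->port[portnum].ifindex = ifindex;
			c->port[portnum].valid = 1;
		} else if (c->port[portnum].ifindex != ifindex) {
			/* Conflicting duplicate: trust nothing. */
			port_cache_flush(c);
			return 1;
		}
	} else if (ev == EV_DETACH) {
		memset(&c->port[portnum], 0, sizeof(c->port[portnum]));
	}
	return 0;
}
```

Flushing everything on an inconsistency trades a few extra Netlink
queries on the next probe for never serving a stale ifindex.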


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V1 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
                   ` (5 preceding siblings ...)
  2024-10-16  8:38 ` [PATCH V1 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
@ 2024-10-16  8:38 ` Minggang Li(Gavin)
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
  6 siblings, 1 reply; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-16  8:38 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Minggang(Gavin) Li

From: "Minggang(Gavin) Li" <gavinl@nvidia.com>

Fall back to the old way of updating port information if the kernel driver
does not support RDMA monitor.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_nl.c     | 73 +++++++++++++++++++++++++
 drivers/common/mlx5/version.map         |  1 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  2 +-
 drivers/net/mlx5/linux/mlx5_os.c        | 27 +++++++--
 drivers/net/mlx5/mlx5.h                 |  1 +
 5 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index ce1c2a8e75..12f1a620f3 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -2152,3 +2152,76 @@ mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *da
 error:
 	rte_errno = EINVAL;
 }
+
+static int
+mlx5_nl_rdma_monitor_cap_get_cb(struct nlmsghdr *hdr, void *arg)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t *cap = arg;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_SYS_GET))
+		goto error;
+
+	*cap = 0;
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_SYS_ATTR_MONITOR_MODE:
+			*cap = *(uint8_t *)payload;
+			return 0;
+		default:
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return 0;
+
+error:
+	return -EINVAL;
+}
+
+/**
+ * Get RDMA monitor support in driver.
+ *
+ *
+ * @param nl
+ *   Netlink socket of the RDMA kind (NETLINK_RDMA).
+ * @param[out] cap
+ *   Pointer to port info.
+ * @return
+ *   0 on success, negative on error and rte_errno is set.
+ */
+int
+mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap)
+{
+	union {
+		struct nlmsghdr nh;
+		uint8_t buf[NLMSG_HDRLEN];
+	} req = {
+		.nh = {
+			.nlmsg_len = NLMSG_LENGTH(0),
+			.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
+						       RDMA_NLDEV_CMD_SYS_GET),
+			.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
+		},
+	};
+	uint32_t sn = MLX5_NL_SN_GENERATE;
+	int ret;
+
+	ret = mlx5_nl_send(nl, &req.nh, sn);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	ret = mlx5_nl_recv(nl, sn, mlx5_nl_rdma_monitor_cap_get_cb, cap);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	return 0;
+}
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index 5230576006..8301485839 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -148,6 +148,7 @@ INTERNAL {
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_cap_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5156d96b3a..6b2c25a7c2 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -736,7 +736,7 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1 && !sh->rdma_monitor_supp)
 		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 2a93e994ff..fbe265ab70 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3019,6 +3019,7 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
 	struct ibv_context *ctx = sh->cdev->ctx;
 	int nlsk_fd;
+	uint8_t rdma_monitor_supp = 0;
 
 	sh->intr_handle = mlx5_os_interrupt_handler_create
 		(RTE_INTR_INSTANCE_F_SHARED, true,
@@ -3027,20 +3028,34 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+	if (sh->cdev->config.probe_opt &&
+	    sh->cdev->dev_info.port_num > 1 &&
+	    !sh->rdma_monitor_supp) {
 		nlsk_fd = mlx5_nl_rdma_monitor_init();
 		if (nlsk_fd < 0) {
 			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
 				rte_strerror(rte_errno));
 			return;
 		}
-		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
-			(RTE_INTR_INSTANCE_F_SHARED, true,
-			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
-		if (sh->intr_handle_ib == NULL) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
+		if (mlx5_nl_rdma_monitor_cap_get(nlsk_fd, &rdma_monitor_supp)) {
+			DRV_LOG(ERR, "Failed to query RDMA monitor support: %s",
+				rte_strerror(rte_errno));
+			close(nlsk_fd);
 			return;
 		}
+		sh->rdma_monitor_supp = rdma_monitor_supp;
+		if (sh->rdma_monitor_supp) {
+			sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+				(RTE_INTR_INSTANCE_F_SHARED, true,
+				 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+			if (sh->intr_handle_ib == NULL) {
+				DRV_LOG(ERR, "Fail to allocate intr_handle");
+				close(nlsk_fd);
+				return;
+			}
+		} else {
+			close(nlsk_fd);
+		}
 	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 748d92c9cf..ceb8e36ba5 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1517,6 +1517,7 @@ struct mlx5_dev_ctx_shared {
 	uint32_t lag_rx_port_affinity_en:1;
 	/* lag_rx_port_affinity is supported. */
 	uint32_t hws_max_log_bulk_sz:5;
+	uint32_t rdma_monitor_supp:1;
 	/* Log of minimal HWS counters created hard coded. */
 	uint32_t hws_max_nb_counters; /* Maximal number for HWS counters. */
 	uint32_t max_port; /* Maximal IB device port index. */
-- 
2.34.1



* [PATCH V2 0/7] port probe time optimization
  2024-10-16  8:38 ` [PATCH V1 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
@ 2024-10-28  9:18   ` Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 1/7] mailmap: update user name Minggang Li(Gavin)
                       ` (6 more replies)
  0 siblings, 7 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

This patch series reduces the time needed to probe VFs/SFs at large scale,
e.g. with hundreds of VFs/SFs. The feature is controlled through the
"probe_opt_en" device argument: setting it to a non-zero value enables the
optimization when probing a device. It relies on an RDMA driver feature,
RDMA monitor, which is to be released in the upcoming upstream kernel 6.13
or the equivalent in OFED 24.10. For further information on the devargs
limitation, see "doc/guides/nics/mlx5.rst".

Minggang Li(Gavin) (5):
  mailmap: update user name
  common/mlx5: fix Netlink socket leak
  common/mlx5: add RDMA monitor event awareness
  mlx5: use RDMA Netlink to update port information
  mlx5: add backward compatibility for RDMA monitor
---
changelog:
v1->v2
        - add feature doc and upstream kernel dependency in release notes
---

Rongwei Liu (2):
  net/mlx5: optimize device probing
  net/mlx5: add new devargs to control probe optimization

 .mailmap                                     |   2 +-
 doc/guides/nics/mlx5.rst                     |  13 +
 doc/guides/rel_notes/release_24_11.rst       |  14 +
 drivers/common/mlx5/linux/meson.build        |  10 +
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 +
 drivers/common/mlx5/linux/mlx5_nl.c          | 262 ++++++++++++++++---
 drivers/common/mlx5/linux/mlx5_nl.h          |  36 ++-
 drivers/common/mlx5/mlx5_common.c            |  20 ++
 drivers/common/mlx5/mlx5_common.h            |  15 ++
 drivers/common/mlx5/version.map              |   3 +
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      | 136 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 144 ++++++++--
 drivers/net/mlx5/linux/mlx5_os.h             |   6 -
 drivers/net/mlx5/mlx5.h                      |   3 +
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 16 files changed, 605 insertions(+), 75 deletions(-)

-- 
2.34.1



* [PATCH V2 1/7] mailmap: update user name
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
@ 2024-10-28  9:18     ` Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
---
 .mailmap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index 504c390f0f..0ce965084c 100644
--- a/.mailmap
+++ b/.mailmap
@@ -462,7 +462,6 @@ Gary Mussar <gmussar@ciena.com>
 Gaurav Singh <gaurav1086@gmail.com>
 Gautam Dawar <gdawar@solarflare.com>
 Gavin Hu <gavin.hu@arm.com> <gavin.hu@linaro.org>
-Gavin Li <gavinl@nvidia.com>
 Geoffrey Le Gourriérec <geoffrey.le_gourrierec@6wind.com>
 Geoffrey Lv <geoffrey.lv@gmail.com>
 Geoff Thorpe <geoff.thorpe@nxp.com>
@@ -1024,6 +1023,7 @@ Mike Ximing Chen <mike.ximing.chen@intel.com>
 Milena Olech <milena.olech@intel.com>
 Min Cao <min.cao@intel.com>
 Minghuan Lian <minghuan.lian@nxp.com>
+Minggang Li(Gavin) <gavinl@nvidia.com>
 Mingjin Ye <mingjinx.ye@intel.com>
 Mingshan Zhang <mingshan.zhang@intel.com>
 Mingxia Liu <mingxia.liu@intel.com>
-- 
2.34.1



* [PATCH V2 2/7] net/mlx5: optimize device probing
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 1/7] mailmap: update user name Minggang Li(Gavin)
@ 2024-10-28  9:18     ` Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

Current DPDK probing logic is:
1. Query IB device index and total port number.
2. Query each port information by traversing the port index and
   get the port's ifindex, name and state information etc.
3. Compare the information with devargs until getting matched.
4. For each probing device, repeat steps 2 and 3.

Step 2 communicates with the kernel via Netlink, which is time-consuming.
There is no need to repeat the Netlink communication for each probed device:
the PMD can traverse all ports once and save the information in a caching
structure.

Introduce device information caching in the mlx5 common device
handle and cache the port number, ibindex and port ifindex.
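
The caching idea described above can be illustrated with a minimal,
self-contained sketch. It is not the code of this patch: the names
port_entry, port_cache and slow_query_ifindex() are hypothetical stand-ins
for mlx5_port_nl_info, mlx5_dev_info and the Netlink query, and the
"kernel" answer is simulated.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-port cache entry. */
struct port_entry {
	uint32_t ifindex; /* 0 means the port has no netdev. */
	uint8_t valid;    /* 1 once the port has been queried. */
};

/* Hypothetical stand-in for the per-device cache. */
struct port_cache {
	uint32_t port_num;        /* Ports are numbered 1..port_num. */
	struct port_entry *ports; /* port_num + 1 entries, slot 0 unused. */
	unsigned int queries;     /* Counts simulated kernel round trips. */
};

/* Simulated expensive query: odd ports have a netdev, even ones do not. */
static uint32_t
slow_query_ifindex(struct port_cache *c, uint32_t pindex)
{
	c->queries++;
	return (pindex & 1) ? 100 + pindex : 0;
}

/* Allocate a zeroed cache, so that every entry starts as "not queried". */
static int
port_cache_init(struct port_cache *c, uint32_t port_num)
{
	assert(c != NULL);
	c->ports = calloc((size_t)port_num + 1, sizeof(*c->ports));
	if (c->ports == NULL)
		return -1;
	c->port_num = port_num;
	c->queries = 0;
	return 0;
}

static void
port_cache_free(struct port_cache *c)
{
	assert(c != NULL);
	free(c->ports);
	c->ports = NULL;
	c->port_num = 0;
}

/* Return the ifindex of a port, querying only on a cache miss. */
static uint32_t
port_cache_ifindex(struct port_cache *c, uint32_t pindex)
{
	assert(c != NULL && c->ports != NULL);
	if (pindex == 0 || pindex > c->port_num)
		return 0;
	if (!c->ports[pindex].valid) {
		c->ports[pindex].ifindex = slow_query_ifindex(c, pindex);
		/* "No netdev" is also an answer worth caching. */
		c->ports[pindex].valid = 1;
	}
	return c->ports[pindex].ifindex;
}
```

In this sketch, a second lookup of the same port is served from the cache
and does not increase the queries counter, including for ports that have no
netdev.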

Dynamic interface changes are handled as follows:
1. Creating new VFs by toggling switchdev mode requires restarting DPDK, as
   the SR-IOV configuration has changed.
2. Changing the number of VFs without toggling switchdev mode triggers
   RTM_DELLINK and RTM_NEWLINK events. All the cached information is
   cleared.
3. A new SF triggers an RTM_NEWLINK event whose message carries no port
   index information. All free entries (ifindex = 0) in the cache are
   invalidated.
4. Deleting an SF triggers an RTM_DELLINK event. The cache entries are
   traversed and the one with the same ifindex is invalidated.
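
Rules 3 and 4 above can be sketched as follows. This is a simplified
illustration under stated assumptions, not the handler added by this patch:
LINK_NEW and LINK_DEL are hypothetical stand-ins for RTM_NEWLINK and
RTM_DELLINK, port_entry stands in for mlx5_port_nl_info, and the check that
the new netdev is a VF/SF representor is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for RTM_NEWLINK / RTM_DELLINK. */
enum link_event { LINK_NEW, LINK_DEL };

/* Hypothetical stand-in for the per-port cache entry. */
struct port_entry {
	uint32_t ifindex; /* 0 means the port has no netdev. */
	uint8_t valid;    /* 1 once the port has been queried. */
};

/* Update the cache on a link event; ports[] is indexed 1..port_num. */
static void
cache_on_link_event(struct port_entry *ports, uint32_t port_num,
		    enum link_event ev, uint32_t ifindex)
{
	uint32_t i;

	assert(ports != NULL);
	/* Look for a cached port that already owns this ifindex. */
	for (i = 1; i <= port_num; i++) {
		if (ports[i].valid && ports[i].ifindex == ifindex)
			break;
	}
	if (ev == LINK_NEW && i > port_num) {
		/*
		 * Unknown netdev: the event does not say which port it
		 * belongs to, so every port cached as "empty" may now be
		 * populated and must be queried again.
		 */
		for (i = 1; i <= port_num; i++) {
			if (ports[i].ifindex == 0)
				ports[i].valid = 0;
		}
	} else if (ev == LINK_DEL && i <= port_num) {
		/* Known netdev removed: drop only the matching entry. */
		memset(&ports[i], 0, sizeof(ports[i]));
	}
}
```

Entries that hold a non-zero ifindex survive a LINK_NEW event in this
sketch, so only the ports previously seen as empty are queried again.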

The race condition between the probing thread and the interrupt thread
is not considered.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 ++
 drivers/common/mlx5/linux/mlx5_nl.c          |  94 +++++++++++++----
 drivers/common/mlx5/linux/mlx5_nl.h          |   8 +-
 drivers/common/mlx5/mlx5_common.c            |   5 +
 drivers/common/mlx5/mlx5_common.h            |  13 +++
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  54 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 104 ++++++++++++++-----
 drivers/net/mlx5/linux/mlx5_os.h             |   6 --
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 10 files changed, 242 insertions(+), 58 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index e8aa1d46ec..2e2c54f1fa 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -22,6 +22,12 @@
 #include "mlx5_glue.h"
 #include "mlx5_malloc.h"
 
+/* verb enumerations translations to local enums. */
+enum {
+	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
+	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
+};
+
 /**
  * Get device name. Given an ibv_device pointer - return a
  * pointer to the corresponding device name.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index a5ac4dc543..e98073aafe 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1073,16 +1073,18 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret;
 
-	ret = mlx5_nl_send(nl, &req.nh, sn);
-	if (ret < 0)
-		return ret;
-	ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
-	if (ret < 0)
-		return ret;
-	if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
-	    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
-		goto error;
-	data->flags = 0;
+	if (data->ibindex == UINT32_MAX) {
+		ret = mlx5_nl_send(nl, &req.nh, sn);
+		if (ret < 0)
+			return ret;
+		ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
+		if (ret < 0)
+			return ret;
+		if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
+		    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
+			goto error;
+		data->flags = 0;
+	}
 	sn = MLX5_NL_SN_GENERATE;
 	req.nh.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
 					     RDMA_NLDEV_CMD_PORT_GET);
@@ -1109,7 +1111,7 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	    !(data->flags & MLX5_NL_CMD_GET_NET_INDEX) ||
 	    !data->ifindex)
 		goto error;
-	return 1;
+	return 0;
 error:
 	rte_errno = ENODEV;
 	return -rte_errno;
@@ -1128,21 +1130,48 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   A valid (nonzero) interface index on success, 0 otherwise and rte_errno
  *   is set.
  */
 unsigned int
-mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
+mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
+	int ret;
+
 	struct mlx5_nl_port_info data = {
 			.ifindex = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
+			.flags = 0,
 	};
 
-	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
-		return 0;
-	return data.ifindex;
+	if (!strcmp(name, dev_info->ibname)) {
+		if (dev_info->port_info && pindex <= dev_info->port_num &&
+		    dev_info->port_info[pindex].valid) {
+			if (!dev_info->port_info[pindex].ifindex)
+				rte_errno = ENODEV;
+			return dev_info->port_info[pindex].ifindex;
+		}
+		if (dev_info->port_num)
+			data.ibindex = dev_info->ibindex;
+	}
+
+	ret = mlx5_nl_port_info(nl, pindex, &data);
+
+	if (!strcmp(dev_info->ibname, name)) {
+		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
+		    pindex <= dev_info->port_num) {
+			if (!ret)
+				dev_info->port_info[pindex].ifindex = data.ifindex;
+			/* -ENODEV means the pindex is unused but still valid case */
+			dev_info->port_info[pindex].valid = 1;
+		}
+	}
+
+	return ret ? 0 : data.ifindex;
 }
 
 /**
@@ -1157,18 +1186,23 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   Port state (ibv_port_state) on success, negative on error
  *   and rte_errno is set.
  */
 int
-mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
+mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 			.state = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
 	};
 
+	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
 	if ((data.flags & MLX5_NL_CMD_GET_PORT_STATE) == 0) {
@@ -1185,13 +1219,15 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
  *   Netlink socket of the RDMA kind (NETLINK_RDMA).
  * @param[in] name
  *   IB device name.
+ * @param[in] dev_info
+ *   Cached mlx5 device info.
  *
  * @return
  *   A valid (nonzero) number of ports on success, 0 otherwise
  *   and rte_errno is set.
  */
 unsigned int
-mlx5_nl_portnum(int nl, const char *name)
+mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 		.flags = 0,
@@ -1206,7 +1242,10 @@ mlx5_nl_portnum(int nl, const char *name)
 		.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP,
 	};
 	uint32_t sn = MLX5_NL_SN_GENERATE;
-	int ret;
+	int ret, size;
+
+	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
 	if (ret < 0)
@@ -1220,8 +1259,25 @@ mlx5_nl_portnum(int nl, const char *name)
 		rte_errno = ENODEV;
 		return 0;
 	}
-	if (!data.portnum)
+	if (!data.portnum) {
 		rte_errno = EINVAL;
+		return 0;
+	}
+	MLX5_ASSERT(!strlen(dev_info->ibname));
+	dev_info->port_num = data.portnum;
+	dev_info->ibindex = data.ibindex;
+	snprintf(dev_info->ibname, MLX5_FS_NAME_MAX, "%s", name);
+	if (data.portnum > 1) {
+		size = (data.portnum + 1) * sizeof(struct mlx5_port_nl_info);
+		dev_info->port_info = mlx5_malloc(MLX5_MEM_ZERO | MLX5_MEM_RTE, size,
+						  RTE_CACHE_LINE_SIZE,
+						  SOCKET_ID_ANY);
+		if (dev_info->port_info == NULL) {
+			memset(dev_info, 0, sizeof(*dev_info));
+			rte_errno = ENOMEM;
+			return 0;
+		}
+	}
 	return data.portnum;
 }
 
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 580de3b769..396ffc98ce 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -11,6 +11,7 @@
 #include <rte_ether.h>
 
 #include "mlx5_common.h"
+#include "mlx5_common_utils.h"
 
 typedef void (mlx5_nl_event_cb)(struct nlmsghdr *hdr, void *user_data);
 
@@ -52,11 +53,12 @@ int mlx5_nl_promisc(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
 int mlx5_nl_allmulti(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
-unsigned int mlx5_nl_portnum(int nl, const char *name);
+unsigned int mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info);
 __rte_internal
-unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex);
+unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex,
+			     struct mlx5_dev_info *info);
 __rte_internal
-int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex);
+int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info);
 __rte_internal
 int mlx5_nl_vf_mac_addr_modify(int nlsk_fd, unsigned int iface_idx,
 			       struct rte_ether_addr *mac, int vf_index);
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index ca8543e36e..0aaae91c31 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -735,6 +735,11 @@ mlx5_common_dev_release(struct mlx5_common_device *cdev)
 		if (TAILQ_EMPTY(&devices_list))
 			rte_mem_event_callback_unregister("MLX5_MEM_EVENT_CB",
 							  NULL);
+		if (cdev->dev_info.port_info != NULL) {
+			mlx5_free(cdev->dev_info.port_info);
+			cdev->dev_info.port_info = NULL;
+		}
+		cdev->dev_info.port_num = 0;
 		mlx5_dev_mempool_unsubscribe(cdev);
 		mlx5_mr_release_cache(&cdev->mr_scache);
 		mlx5_dev_hw_global_release(cdev);
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 1abd1e8239..6cb40f54dd 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -174,6 +174,18 @@ enum mlx5_nl_phys_port_name_type {
 	MLX5_PHYS_PORT_NAME_TYPE_UNKNOWN, /* Unrecognized. */
 };
 
+struct mlx5_port_nl_info {
+	uint32_t ifindex;
+	uint8_t valid;
+};
+
+struct mlx5_dev_info {
+	uint32_t port_num;
+	uint32_t ibindex;
+	char ibname[MLX5_FS_NAME_MAX];
+	struct mlx5_port_nl_info *port_info;
+};
+
 /** Switch information returned by mlx5_nl_switch_info(). */
 struct mlx5_switch_info {
 	uint32_t master:1; /**< Master device. */
@@ -525,6 +537,7 @@ struct mlx5_common_device {
 	uint32_t classes_loaded;
 	void *ctx; /* Verbs/DV/DevX context. */
 	void *pd; /* Protection Domain. */
+	struct mlx5_dev_info dev_info; /* Device port info queried via netlink. */
 	uint32_t pdn; /* Protection Domain Number. */
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	struct mlx5_common_dev_config config; /* Device configuration. */
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index acee0c987f..65394035de 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -20,6 +20,11 @@
 
 #define MLX5_BF_OFFSET 0x800
 
+enum {
+	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
+	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
+};
+
 /**
  * This API allocates aligned or non-aligned memory.  The free can be on either
  * aligned or nonaligned memory.  To be protected - even though there may be no
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5d64984022..08ac6dd939 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -23,6 +23,7 @@
 #include <stdalign.h>
 #include <sys/un.h>
 #include <time.h>
+#include <linux/rtnetlink.h>
 
 #include <ethdev_linux_ethtool.h>
 #include <ethdev_driver.h>
@@ -673,6 +674,57 @@ mlx5_link_update_bond(struct rte_eth_dev *dev)
 		((ifr.ifr_flags & IFF_UP) && (ifr.ifr_flags & IFF_RUNNING));
 }
 
+static void
+mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
+			     uint16_t msg_type)
+{
+	struct mlx5_switch_info info = {
+		.master = 0,
+		.representor = 0,
+		.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET,
+		.port_name = 0,
+		.switch_id = 0,
+	};
+	uint32_t i;
+	int nl_route;
+
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].valid)
+			continue;
+		if (dev_info->port_info[i].ifindex == if_index)
+			break;
+	}
+	if (msg_type == RTM_NEWLINK && i > dev_info->port_num) {
+		nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
+		if  (nl_route < 0)
+			goto flush_all;
+
+		if (mlx5_nl_switch_info(nl_route, if_index, &info)) {
+			if (mlx5_sysfs_switch_info(if_index, &info))
+				goto flush_all;
+		}
+
+		if (info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFSF ||
+		    info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFVF)
+			goto flush_all;
+		close(nl_route);
+	} else if (msg_type == RTM_DELLINK && i <= dev_info->port_num) {
+		memset(dev_info->port_info + i, 0, sizeof(struct mlx5_port_nl_info));
+	}
+
+	return;
+flush_all:
+	if (nl_route >= 0)
+		close(nl_route);
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].ifindex)
+			dev_info->port_info[i].valid = 0;
+	}
+}
+
 static void
 mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 {
@@ -682,6 +734,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
+	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
 		struct rte_eth_dev *dev;
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index c8d7fdb8dd..e7277d1e43 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1268,7 +1268,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		/* IB doesn't allow more than 255 ports, must be Ethernet. */
 		err = mlx5_nl_port_state(nl_rdma,
 			spawn->phys_dev_name,
-			spawn->phys_port);
+			spawn->phys_port, &spawn->cdev->dev_info);
 		if (err < 0) {
 			DRV_LOG(INFO, "Failed to get netlink port state: %s",
 				strerror(rte_errno));
@@ -1895,6 +1895,8 @@ mlx5_dev_spawn_data_cmp(const void *a, const void *b)
  *   Netlink RDMA group socket handle.
  * @param[in] owner
  *   Representor owner PF index.
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @param[out] bond_info
  *   Pointer to bonding information.
  *
@@ -1906,6 +1908,7 @@ static int
 mlx5_device_bond_pci_match(const char *ibdev_name,
 			   const struct rte_pci_addr *pci_dev,
 			   int nl_rdma, uint16_t owner,
+			   struct mlx5_dev_info *dev_info,
 			   struct mlx5_bond_info *bond_info)
 {
 	char ifname[IF_NAMESIZE + 1];
@@ -1926,7 +1929,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		return -1;
 	if (!strstr(ibdev_name, "bond"))
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibdev_name);
+	np = mlx5_nl_portnum(nl_rdma, ibdev_name, dev_info);
 	if (!np)
 		return -1;
 	if (mlx5_get_device_guid(pci_dev, cur_guid, sizeof(cur_guid)) < 0)
@@ -1938,7 +1941,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 	 */
 	for (i = 1; i <= np; ++i) {
 		/* Check whether Infiniband port is populated. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -1976,9 +1979,13 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (!file)
 			break;
 		info.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET;
-		if (fscanf(file, "%32s", tmp_str) == 1)
+		if (fscanf(file, "%32s", tmp_str) == 1) {
 			mlx5_translate_port_name(tmp_str, &info);
-		fclose(file);
+			fclose(file);
+		} else {
+			fclose(file);
+			break;
+		}
 		/* Only process PF ports. */
 		if (info.name_type != MLX5_PHYS_PORT_NAME_TYPE_LEGACY &&
 		    info.name_type != MLX5_PHYS_PORT_NAME_TYPE_UPLINK)
@@ -2001,8 +2008,8 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (ret != 1)
 			break;
 		/* Save bonding info. */
-		strncpy(bond_info->ports[info.port_name].ifname, ifname,
-			sizeof(bond_info->ports[0].ifname));
+		snprintf(bond_info->ports[info.port_name].ifname,
+			 sizeof(bond_info->ports[0].ifname), "%s", ifname);
 		bond_info->ports[info.port_name].pci_addr = pci_addr;
 		bond_info->ports[info.port_name].ifindex = ifindex;
 		bond_info->n_port++;
@@ -2031,6 +2038,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		      pci_addr.function == owner)))
 			pf = info.port_name;
 	}
+	fclose(bond_file);
 	if (pf >= 0) {
 		/* Get bond interface info */
 		ret = mlx5_sysfs_bond_info(ifindex, &bond_info->ifindex,
@@ -2082,7 +2090,8 @@ mlx5_nl_esw_multiport_get(struct rte_pci_addr *pci_addr, int *enabled)
 #define SYSFS_MPESW_PARAM_MAX_LEN 16
 
 static int
-mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled)
+mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled,
+			     struct mlx5_dev_info *dev_info)
 {
 	int nl_rdma;
 	unsigned int n_ports;
@@ -2094,7 +2103,7 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	if (nl_rdma < 0)
 		return nl_rdma;
-	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name);
+	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!n_ports) {
 		ret = -rte_errno;
 		goto close_nl_rdma;
@@ -2102,12 +2111,12 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	for (i = 1; i <= n_ports; ++i) {
 		unsigned int ifindex;
 		char ifname[IF_NAMESIZE + 1];
-		struct rte_pci_addr if_pci_addr;
+		struct rte_pci_addr if_pci_addr = { 0 };
 		char mpesw[SYSFS_MPESW_PARAM_MAX_LEN + 1];
 		FILE *sysfs;
 		int n;
 
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2149,7 +2158,8 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 }
 
 static int
-mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled)
+mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled,
+		      struct mlx5_dev_info *dev_info)
 {
 	/*
 	 * Try getting Multiport E-Switch state through netlink interface
@@ -2157,7 +2167,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 	 * assume that Multiport E-Switch is disabled and return an error.
 	 */
 	if (mlx5_nl_esw_multiport_get(ibv_pci_addr, enabled) >= 0 ||
-	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled) >= 0)
+	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled, dev_info) >= 0)
 		return 0;
 	DRV_LOG(DEBUG, "Unable to check MPESW state for IB device %s "
 		       "(PCI: " PCI_PRI_FMT ")",
@@ -2171,7 +2181,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 static int
 mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 			    const struct rte_pci_addr *owner_pci,
-			    int nl_rdma)
+			    int nl_rdma, struct mlx5_dev_info *dev_info)
 {
 	struct rte_pci_addr ibdev_pci_addr = { 0 };
 	char ifname[IF_NAMESIZE + 1] = { 0 };
@@ -2195,24 +2205,24 @@ mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 		return -1;
 	}
 	/* Check if IB device has MPESW enabled. */
-	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled))
+	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled, dev_info))
 		return -1;
 	if (!enabled)
 		return -1;
 	/* Iterate through IB ports to find MPESW master uplink port. */
 	if (nl_rdma < 0)
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibv->name);
+	np = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!np)
 		return -1;
 	for (i = 1; i <= np; ++i) {
-		struct rte_pci_addr pci_addr;
+		struct rte_pci_addr pci_addr = { 0 };
 		FILE *file;
 		char port_name[IF_NAMESIZE + 1];
 		struct mlx5_switch_info	info;
 
 		/* Check whether IB port has a corresponding netdev. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2319,16 +2329,30 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	 * matching ones, gathering into the list.
 	 */
 	struct ibv_device *ibv_match[ret + 1];
+	struct mlx5_dev_info *info, tmp_info[ret];
 	int nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
 	int nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	unsigned int i;
 
+	memset(tmp_info, 0, sizeof(tmp_info));
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
+		if (cdev->dev_info.port_num) {
+			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
+				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
+					cdev->dev_info.ibname, ibv_list[ret]->name);
+				continue;
+			}
+			info = &cdev->dev_info;
+		} else {
+			info = &tmp_info[ret];
+		}
 		DRV_LOG(DEBUG, "Checking device \"%s\"", ibv_list[ret]->name);
 		bd = mlx5_device_bond_pci_match(ibv_list[ret]->name, &owner_pci,
-						nl_rdma, owner_id, &bond_info);
+						nl_rdma, owner_id,
+						info,
+						&bond_info);
 		if (bd >= 0) {
 			/*
 			 * Bonding device detected. Only one match is allowed,
@@ -2354,7 +2378,8 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ibv_match[nd++] = ibv_list[ret];
 			break;
 		}
-		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma);
+		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma,
+						    info);
 		if (mpesw >= 0) {
 			/*
 			 * MPESW device detected. Only one matching IB device is allowed,
@@ -2378,10 +2403,18 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		}
 		/* Bonding or MPESW device was not found. */
 		if (mlx5_get_pci_addr(ibv_list[ret]->ibdev_path,
-					&pci_addr))
+					&pci_addr)) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
-		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0)
+		}
+		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
+		}
 		DRV_LOG(INFO, "PCI information matches for device \"%s\"",
 			ibv_list[ret]->name);
 		ibv_match[nd++] = ibv_list[ret];
@@ -2399,13 +2432,21 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		goto exit;
 	}
 	if (nd == 1) {
+		if (!cdev->dev_info.port_num) {
+			for (i = 0; i < RTE_DIM(tmp_info); i++) {
+				if (tmp_info[i].port_num) {
+					cdev->dev_info = tmp_info[i];
+					break;
+				}
+			}
+		}
 		/*
 		 * Found single matching device may have multiple ports.
 		 * Each port may be representor, we have to check the port
 		 * number and check the representors existence.
 		 */
 		if (nl_rdma >= 0)
-			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name);
+			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name, &cdev->dev_info);
 		if (!np)
 			DRV_LOG(WARNING,
 				"Cannot get IB device \"%s\" ports number.",
@@ -2422,6 +2463,14 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ret = -rte_errno;
 			goto exit;
 		}
+	} else {
+		/* Can't handle one common device with multiple IB devices caching */
+		for (i = 0; i < RTE_DIM(tmp_info); i++) {
+			if (tmp_info[i].port_info != NULL)
+				mlx5_free(tmp_info[i].port_info);
+			memset(&tmp_info[i], 0, sizeof(tmp_info[0]));
+		}
+		DRV_LOG(INFO, "Cannot handle multiple IB devices info caching in single common device.");
 	}
 	/* Now we can determine the maximal amount of devices to be spawned. */
 	list = mlx5_malloc(MLX5_MEM_ZERO,
@@ -2455,7 +2504,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			list[ns].mpesw_port = MLX5_MPESW_PORT_INVALID;
 			list[ns].ifindex = mlx5_nl_ifindex(nl_rdma,
 							   ibv_match[0]->name,
-							   i);
+							   i, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				/*
 				 * No network interface index found for the
@@ -2593,7 +2642,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 				list[ns].ifindex = mlx5_nl_ifindex
 							    (nl_rdma,
 							     ibv_match[i]->name,
-							     1);
+							     1, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				char ifname[IF_NAMESIZE];
 
@@ -2782,6 +2831,11 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		mlx5_free(list);
 	MLX5_ASSERT(ibv_list);
 	mlx5_glue->free_device_list(ibv_list);
+	if (ret) {
+		if (cdev->dev_info.port_info != NULL)
+			mlx5_free(cdev->dev_info.port_info);
+		memset(&cdev->dev_info, 0, sizeof(cdev->dev_info));
+	}
 	return ret;
 }
 
diff --git a/drivers/net/mlx5/linux/mlx5_os.h b/drivers/net/mlx5/linux/mlx5_os.h
index 80c70d713a..4ef0916173 100644
--- a/drivers/net/mlx5/linux/mlx5_os.h
+++ b/drivers/net/mlx5/linux/mlx5_os.h
@@ -8,12 +8,6 @@
 
 #include <net/if.h>
 
-/* verb enumerations translations to local enums. */
-enum {
-	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
-	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
-};
-
 /* Maximal data of sendmsg message(in bytes). */
 #define MLX5_SENDMSG_MAX 64
 
diff --git a/drivers/net/mlx5/windows/mlx5_os.h b/drivers/net/mlx5/windows/mlx5_os.h
index 8b58265687..fb7198c244 100644
--- a/drivers/net/mlx5/windows/mlx5_os.h
+++ b/drivers/net/mlx5/windows/mlx5_os.h
@@ -7,11 +7,6 @@
 
 #include "mlx5_win_ext.h"
 
-enum {
-	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
-	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
-};
-
 #define PCI_DRV_FLAGS 0
 
 #define MLX5_NAMESIZE MLX5_FS_NAME_MAX
-- 
2.34.1



* [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 1/7] mailmap: update user name Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
@ 2024-10-28  9:18     ` Minggang Li(Gavin)
  2024-10-28 15:47       ` Stephen Hemminger
  2024-10-28  9:18     ` [PATCH V2 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
                       ` (3 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

Add a new devarg, probe_opt_en, to control the probe optimization
in the PMD.

By default, the value is 0 and the behavior is unchanged.
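
The "absent means 0" semantics can be sketched with a small standalone
parser. This is only an illustration: devarg_flag() is a hypothetical
helper and does not reflect how the PMD actually parses its devargs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return 1 if "key=<non-zero integer>" appears in a comma-separated
 * devargs string, 0 if the key is absent or its value is zero.
 */
static int
devarg_flag(const char *devargs, const char *key)
{
	size_t klen;
	const char *p;

	assert(devargs != NULL && key != NULL);
	klen = strlen(key);
	p = devargs;
	while (*p != '\0') {
		const char *end = strchr(p, ',');
		size_t len = end ? (size_t)(end - p) : strlen(p);

		/* The token must be exactly "key=" followed by a value. */
		if (len > klen && strncmp(p, key, klen) == 0 &&
		    p[klen] == '=')
			return strtoul(p + klen + 1, NULL, 0) != 0;
		p += len;
		if (*p == ',')
			p++;
	}
	return 0; /* Key not given: the optimization stays disabled. */
}
```

For example, devarg_flag("0000:08:00.0,probe_opt_en=1", "probe_opt_en")
evaluates to 1 in this sketch (the PCI address is made up), while a string
that does not mention the key yields 0.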

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/mlx5.rst                |  7 +++++++
 drivers/common/mlx5/linux/mlx5_nl.c     | 12 ++++++++----
 drivers/common/mlx5/mlx5_common.c       | 15 +++++++++++++++
 drivers/common/mlx5/mlx5_common.h       |  2 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  5 ++++-
 drivers/net/mlx5/linux/mlx5_os.c        |  2 +-
 6 files changed, 37 insertions(+), 6 deletions(-)
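
As a rough illustration of the gating this patch adds, the sketch below
models the cached lookup with simplified stand-in types. The names
dev_cache, parse_probe_opt and cache_lookup_ifindex are hypothetical and
not part of the PMD; only the probe_opt flag, the "!!" normalization and
the IB device name comparison mirror the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_MAX_PORTS 8

/* Hypothetical, simplified stand-in for struct mlx5_dev_info. */
struct dev_cache {
	uint8_t probe_opt;                  /* mirrors the probe_opt_en devarg */
	uint32_t port_num;                  /* number of cached ports, 0 if unknown */
	char ibname[64];                    /* cached IB device name */
	uint32_t ifindex[CACHE_MAX_PORTS];  /* per-port ifindex, 0 if unknown */
};

/* Normalize a devarg string to 0 or 1, as "!!tmp" does in the patch. */
uint8_t
parse_probe_opt(const char *val)
{
	assert(val != NULL);
	return !!strtol(val, NULL, 0);
}

/*
 * Return the cached ifindex, or 0 when the cache must not be used:
 * optimization disabled, different IB device, or port out of range.
 */
uint32_t
cache_lookup_ifindex(const struct dev_cache *c, const char *name,
		     uint32_t pindex)
{
	assert(c != NULL && name != NULL);
	if (!c->probe_opt || strcmp(name, c->ibname) != 0)
		return 0;
	if (pindex == 0 || pindex > c->port_num || pindex >= CACHE_MAX_PORTS)
		return 0;
	return c->ifindex[pindex];
}
```

With probe_opt left at 0 the lookup always misses, so the caller takes
the uncached path, which matches the "behavior is unchanged" default.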

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index f82e2d75de..981401a9f2 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1436,6 +1436,13 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 1.
 
+- ``probe_opt_en`` parameter [int]
+
+  A non-zero value optimizes the probe process, especially for large scale.
+  PMD will hold the IB device information internally and reuse it.
+
+  By default, the PMD will set this value to 0.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e98073aafe..745e443f8f 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1148,7 +1148,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 			.flags = 0,
 	};
 
-	if (!strcmp(name, dev_info->ibname)) {
+	if (dev_info->probe_opt && !strcmp(name, dev_info->ibname)) {
 		if (dev_info->port_info && pindex <= dev_info->port_num &&
 		    dev_info->port_info[pindex].valid) {
 			if (!dev_info->port_info[pindex].ifindex)
@@ -1161,7 +1161,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 
 	ret = mlx5_nl_port_info(nl, pindex, &data);
 
-	if (!strcmp(dev_info->ibname, name)) {
+	if (dev_info->probe_opt && !strcmp(dev_info->ibname, name)) {
 		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
 		    pindex <= dev_info->port_num) {
 			if (!ret)
@@ -1201,7 +1201,8 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_in
 			.ibindex = UINT32_MAX,
 	};
 
-	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+	if (dev_info && dev_info->probe_opt &&
+	    !strcmp(name, dev_info->ibname) && dev_info->port_num)
 		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
@@ -1244,7 +1245,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret, size;
 
-	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+	if (dev_info->probe_opt && dev_info->port_num &&
+	    !strcmp(name, dev_info->ibname))
 		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
@@ -1263,6 +1265,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 		rte_errno = EINVAL;
 		return 0;
 	}
+	if (!dev_info->probe_opt)
+		return data.portnum;
 	MLX5_ASSERT(!strlen(dev_info->ibname));
 	dev_info->port_num = data.portnum;
 	dev_info->ibindex = data.ibindex;
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index 0aaae91c31..9abae4a374 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -40,6 +40,9 @@ uint8_t haswell_broadwell_cpu;
 /* The default memory allocator used in PMD. */
 #define MLX5_SYS_MEM_EN "sys_mem_en"
 
+/* Probe optimization in PMD. */
+#define MLX5_PROBE_OPT "probe_opt_en"
+
 /*
  * Device parameter to force doorbell register mapping
  * to non-cached region eliminating the extra write memory barrier.
@@ -295,6 +298,8 @@ mlx5_common_args_check_handler(const char *key, const char *val, void *opaque)
 		config->device_fd = tmp;
 	} else if (strcmp(key, MLX5_PD_HANDLE) == 0) {
 		config->pd_handle = tmp;
+	} else if (strcmp(key, MLX5_PROBE_OPT) == 0) {
+		config->probe_opt = !!tmp;
 	}
 	return 0;
 }
@@ -324,6 +329,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 		MLX5_MR_MEMPOOL_REG_EN,
 		MLX5_DEVICE_FD,
 		MLX5_PD_HANDLE,
+		MLX5_PROBE_OPT,
 		NULL,
 	};
 	int ret = 0;
@@ -332,6 +338,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	config->mr_ext_memseg_en = 1;
 	config->mr_mempool_reg_en = 1;
 	config->sys_mem_en = 0;
+	config->probe_opt = 0;
 	config->dbnc = MLX5_ARG_UNSET;
 	config->device_fd = MLX5_ARG_UNSET;
 	config->pd_handle = MLX5_ARG_UNSET;
@@ -351,6 +358,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	DRV_LOG(DEBUG, "mr_ext_memseg_en is %u.", config->mr_ext_memseg_en);
 	DRV_LOG(DEBUG, "mr_mempool_reg_en is %u.", config->mr_mempool_reg_en);
 	DRV_LOG(DEBUG, "sys_mem_en is %u.", config->sys_mem_en);
+	DRV_LOG(DEBUG, "probe_opt_en is %u.", config->probe_opt);
 	DRV_LOG(DEBUG, "Send Queue doorbell mapping parameter is %d.",
 		config->dbnc);
 	return ret;
@@ -791,6 +799,7 @@ mlx5_common_dev_create(struct rte_device *eal_dev, uint32_t classes,
 	if (TAILQ_EMPTY(&devices_list))
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
+	cdev->dev_info.probe_opt = cdev->config.probe_opt;
 exit:
 	pthread_mutex_lock(&devices_list_lock);
 	TAILQ_INSERT_HEAD(&devices_list, cdev, next);
@@ -880,6 +889,12 @@ mlx5_common_probe_again_args_validate(struct mlx5_common_device *cdev,
 			cdev->dev->name);
 		goto error;
 	}
+	if (cdev->config.probe_opt != config->probe_opt) {
+		DRV_LOG(ERR, "\"" MLX5_PROBE_OPT"\" "
+			"configuration mismatch for device %s.",
+			cdev->dev->name);
+		goto error;
+	}
 	if (cdev->config.dbnc != config->dbnc) {
 		DRV_LOG(ERR, "\"" MLX5_SQ_DB_NC "\" "
 			"configuration mismatch for device %s.",
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 6cb40f54dd..f1b59d6f07 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -183,6 +183,7 @@ struct mlx5_dev_info {
 	uint32_t port_num;
 	uint32_t ibindex;
 	char ibname[MLX5_FS_NAME_MAX];
+	uint8_t probe_opt;
 	struct mlx5_port_nl_info *port_info;
 };
 
@@ -525,6 +526,7 @@ struct mlx5_common_dev_config {
 	int pd_handle; /* Protection Domain handle for importation.  */
 	unsigned int devx:1; /* Whether devx interface is available or not. */
 	unsigned int sys_mem_en:1; /* The default memory allocator. */
+	unsigned int probe_opt:1; /* Optimize probing . */
 	unsigned int mr_mempool_reg_en:1;
 	/* Allow/prevent implicit mempool memory registration. */
 	unsigned int mr_ext_memseg_en:1;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 08ac6dd939..88d3c57c6e 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -691,6 +691,8 @@ mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
 	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
 		return;
 
+	DRV_LOG(DEBUG, "IB device %s ifindex %u received netlink event %u",
+			dev_info->ibname, if_index, msg_type);
 	for (i = 1; i <= dev_info->port_num; i++) {
 		if (!dev_info->port_info[i].valid)
 			continue;
@@ -734,7 +736,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index e7277d1e43..4c2caa21e9 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2338,7 +2338,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
-		if (cdev->dev_info.port_num) {
+		if (cdev->config.probe_opt && cdev->dev_info.port_num) {
 			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
 				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
 					cdev->dev_info.ibname, ibv_list[ret]->name);
-- 
2.34.1



* [PATCH V2 4/7] common/mlx5: fix Netlink socket leak
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
                       ` (2 preceding siblings ...)
  2024-10-28  9:18     ` [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
@ 2024-10-28  9:18     ` Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, Spike Du
  Cc: dev, rasland, stable

Fixes: 72d7efe464b1 ("common/mlx5: share interrupt management")
Cc: stable@dpdk.org

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_os.c | 5 +++++
 1 file changed, 5 insertions(+)
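
The fix below follows a common pattern: read the file descriptor out of
the handle before the handle is destroyed, then close the saved
descriptor. A minimal sketch of that ordering, assuming (as the patch
implies) that destroying the handle does not close the descriptor. The
intr_handle type and the handle_* functions are hypothetical stand-ins,
not DPDK APIs.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for an interrupt handle that only borrows an fd. */
struct intr_handle {
	int fd;
};

struct intr_handle *
handle_create(int fd)
{
	struct intr_handle *h;

	assert(fd >= 0);
	h = malloc(sizeof(*h));
	if (h != NULL)
		h->fd = fd;
	return h;
}

/* Frees the handle but leaves the descriptor open. */
void
handle_destroy(struct intr_handle *h)
{
	free(h);
}

/* Save the fd first, destroy the handle, then close the saved fd. */
int
handle_uninstall(struct intr_handle *h)
{
	int fd = (h != NULL) ? h->fd : -1;

	handle_destroy(h);
	if (fd >= 0)
		return close(fd);
	return 0;
}
```

Reading the descriptor after handle_destroy() would touch freed memory,
which is why the order of the three steps matters.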

diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 4c2caa21e9..8df45ef010 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3076,10 +3076,15 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
+	int fd;
+
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
 					  mlx5_dev_interrupt_handler, sh);
+	fd = rte_intr_fd_get(sh->intr_handle_nl);
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
 					  mlx5_dev_interrupt_handler_nl, sh);
+	if (fd >= 0)
+		close(fd);
 #ifdef HAVE_IBV_DEVX_ASYNC
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
 					  mlx5_dev_interrupt_handler_devx, sh);
-- 
2.34.1



* [PATCH V2 5/7] common/mlx5: add RDMA monitor event awareness
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
                       ` (3 preceding siblings ...)
  2024-10-28  9:18     ` [PATCH V2 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
@ 2024-10-28  9:18     ` Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

RDMA monitor is a new feature introduced by the kernel driver. This commit
adds backward compatibility for kernels that do not support it.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/meson.build | 10 ++++++++++
 drivers/common/mlx5/linux/mlx5_nl.c   | 17 +++++++++++++++++
 2 files changed, 27 insertions(+)
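
The compatibility scheme is a build-time check plus a fallback define:
meson sets a HAVE_* macro when the installed header provides a symbol,
and the C file defines the symbol itself otherwise. The sketch below
shows the pattern for one symbol. The helper nl_group_to_mask is a
hypothetical name; the mask derivation it performs is the one that
appears later in the series (patch 6/7) as RDMA_NL_GROUP_NOTIFICATION.

```c
#include <assert.h>
#include <stdint.h>

/*
 * The build system would define HAVE_RDMA_NL_GROUP_NOTIFY when the
 * installed rdma/rdma_netlink.h already provides the symbol. Without
 * it, fall back to the value used in the patch.
 */
#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
#define RDMA_NL_GROUP_NOTIFY 4
#endif

/* Convert a 1-based Netlink multicast group number to its bind mask bit. */
uint32_t
nl_group_to_mask(unsigned int group)
{
	assert(group >= 1 && group <= 32);
	return UINT32_C(1) << (group - 1);
}
```

The hard-coded numbers only work because Netlink attribute and command
values are part of the kernel UAPI and are not renumbered once released.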

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 82e8046e0c..58d0328c6d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -170,6 +170,16 @@ has_sym_args = [
             'RDMA_NLDEV_ATTR_PORT_STATE' ],
         [ 'HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX', 'rdma/rdma_netlink.h',
             'RDMA_NLDEV_ATTR_NDEV_INDEX' ],
+        [ 'HAVE_RDMA_NL_GROUP_NOTIFY', 'rdma/rdma_netlink.h',
+            'RDMA_NL_GROUP_NOTIFY' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_SYS_GET', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_SYS_GET' ],
+        [ 'HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_SYS_ATTR_MONITOR_MODE' ],
+        [ 'HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_ATTR_EVENT_TYPE' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_MONITOR', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_MONITOR' ],
         [ 'HAVE_MLX5_DR_FLOW_DUMP', 'infiniband/mlx5dv.h',
             'mlx5dv_dump_dr_domain'],
         [ 'HAVE_MLX5_DR_CREATE_ACTION_FLOW_SAMPLE', 'infiniband/mlx5dv.h',
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index 745e443f8f..e03db4f918 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -84,6 +84,23 @@
 #ifndef HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX
 #define RDMA_NLDEV_ATTR_NDEV_INDEX 50
 #endif
+#ifndef HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE
+#define RDMA_NLDEV_ATTR_EVENT_TYPE 102
+#define RDMA_NETDEV_ATTACH_EVENT 2
+#define RDMA_NETDEV_DETACH_EVENT 3
+#endif
+#ifndef HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE
+#define RDMA_NLDEV_SYS_ATTR_MONITOR_MODE 103
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_MONITOR
+#define RDMA_NLDEV_CMD_MONITOR 28
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_SYS_GET
+#define RDMA_NLDEV_CMD_SYS_GET 6
+#endif
+#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
+#define RDMA_NL_GROUP_NOTIFY 4
+#endif
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
-- 
2.34.1



* [PATCH V2 6/7] mlx5: use RDMA Netlink to update port information
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
                       ` (4 preceding siblings ...)
  2024-10-28  9:18     ` [PATCH V2 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
@ 2024-10-28  9:18     ` Minggang Li(Gavin)
  2024-10-28  9:18     ` [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

Previously, port information (ports being added or deleted) was updated via
route Netlink. The events used were link up/down rather than the exact events
for adding or deleting a port, which does not perform well.

To improve performance, use RDMA monitor events to track port addition
and deletion and update the corresponding port information.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/mlx5.rst                |  6 ++
 drivers/common/mlx5/linux/mlx5_nl.c     | 74 ++++++++++++++++++-----
 drivers/common/mlx5/linux/mlx5_nl.h     | 28 +++++++++
 drivers/common/mlx5/version.map         |  2 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c | 79 +++++++++++++++++++++++++
 drivers/net/mlx5/linux/mlx5_os.c        | 20 +++++++
 drivers/net/mlx5/mlx5.h                 |  2 +
 7 files changed, 195 insertions(+), 16 deletions(-)
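
The event parser added here walks a sequence of type-length-value
attributes. The sketch below shows that walk on a plain byte buffer,
using local stand-ins for struct nlattr and NLA_ALIGN so that it builds
without kernel headers. The attribute numbers, flag names and the
attrs_parse function are hypothetical. The check that rejects lengths
shorter than the attribute header is an extra guard added for
illustration and is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring the layout of struct nlattr. */
struct attr_hdr {
	uint16_t len;  /* header plus payload, excluding padding */
	uint16_t type;
};

#define ATTR_HDRLEN sizeof(struct attr_hdr)
#define ATTR_ALIGN(len) (((size_t)(len) + 3u) & ~(size_t)3u)

/* Hypothetical attribute types and result flags, for illustration only. */
enum { ATTR_DEV_INDEX = 1, ATTR_PORT_INDEX = 2 };
#define GOT_DEV_INDEX (1u << 0)
#define GOT_PORT_INDEX (1u << 1)

struct event_info {
	uint32_t flags;   /* which attributes were found */
	uint32_t ibindex; /* IB device index */
	uint32_t portnum; /* port index */
};

/* Walk the attributes in buf; return 0 on success, -1 if malformed. */
int
attrs_parse(const uint8_t *buf, size_t size, struct event_info *out)
{
	size_t off = 0;

	assert(buf != NULL && out != NULL);
	while (off < size) {
		struct attr_hdr na;
		const uint8_t *payload;
		size_t plen;

		if (size - off < ATTR_HDRLEN)
			return -1;
		memcpy(&na, buf + off, sizeof(na));
		/* A length below the header size would never advance. */
		if (na.len < ATTR_HDRLEN || na.len > size - off)
			return -1;
		payload = buf + off + ATTR_HDRLEN;
		plen = na.len - ATTR_HDRLEN;
		switch (na.type) {
		case ATTR_DEV_INDEX:
			if (plen < sizeof(uint32_t))
				return -1;
			memcpy(&out->ibindex, payload, sizeof(uint32_t));
			out->flags |= GOT_DEV_INDEX;
			break;
		case ATTR_PORT_INDEX:
			if (plen < sizeof(uint32_t))
				return -1;
			memcpy(&out->portnum, payload, sizeof(uint32_t));
			out->flags |= GOT_PORT_INDEX;
			break;
		default:
			/* Unknown attributes are skipped. */
			break;
		}
		off += ATTR_ALIGN(na.len);
	}
	return 0;
}
```

The caller then inspects the flags to decide whether the event carried
everything it needs, as the interrupt handler in this patch does.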

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 981401a9f2..1a9ec1bd62 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1443,6 +1443,12 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 0.
 
+  .. note::
+
+    There is a race condition in probing port if probe_opt_en is set to 1.
+    Port probe may fail with wrong ifindex in cache while the interrupt
+    thread is updating the cache. Please try again if port probe failed.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e03db4f918..ce1c2a8e75 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -101,6 +101,7 @@
 #ifndef HAVE_RDMA_NL_GROUP_NOTIFY
 #define RDMA_NL_GROUP_NOTIFY 4
 #endif
+#define RDMA_NL_GROUP_NOTIFICATION (1 << (RDMA_NL_GROUP_NOTIFY - 1))
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
@@ -176,22 +177,6 @@ struct mlx5_nl_mac_addr {
 	int mac_n; /**< Number of addresses in the array. */
 };
 
-#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
-#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
-#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
-#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
-#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
-
-/** Data structure used by mlx5_nl_cmdget_cb(). */
-struct mlx5_nl_port_info {
-	const char *name; /**< IB device name (in). */
-	uint32_t flags; /**< found attribute flags (out). */
-	uint32_t ibindex; /**< IB device index (out). */
-	uint32_t ifindex; /**< Network interface index (out). */
-	uint32_t portnum; /**< IB device max port number (out). */
-	uint16_t state; /**< IB device port state (out). */
-};
-
 RTE_ATOMIC(uint32_t) atomic_sn;
 
 /* Generate Netlink sequence number. */
@@ -2110,3 +2095,60 @@ mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id, const char *pci_ad
 		*enable ? "en" : "dis", pci_addr);
 	return ret;
 }
+
+int
+mlx5_nl_rdma_monitor_init(void)
+{
+	return mlx5_nl_init(NETLINK_RDMA, RDMA_NL_GROUP_NOTIFICATION);
+}
+
+void
+mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t event_type = 0;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_MONITOR))
+		goto error;
+
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_ATTR_EVENT_TYPE:
+			event_type = *(uint8_t *)payload;
+			if (event_type == RDMA_NETDEV_ATTACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_ATTACH_EVENT;
+			} else if (event_type == RDMA_NETDEV_DETACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_DETACH_EVENT;
+			}
+			break;
+		case RDMA_NLDEV_ATTR_DEV_INDEX:
+			data->ibindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_IB_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_PORT_INDEX:
+			data->portnum = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_PORT_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_NDEV_INDEX:
+			data->ifindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_NET_INDEX;
+			break;
+		default:
+			DRV_LOG(DEBUG, "Unknown attribute[%d] found", na->nla_type);
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return;
+
+error:
+	rte_errno = EINVAL;
+}
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 396ffc98ce..e32080fa63 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -32,6 +32,27 @@ struct mlx5_nl_vlan_vmwa_context {
 	struct mlx5_nl_vlan_dev vlan_dev[4096];
 };
 
+#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
+#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
+#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
+#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
+#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
+#define MLX5_NL_CMD_GET_EVENT_TYPE (1 << 5)
+
+/** Data structure used by mlx5_nl_cmdget_cb(). */
+struct mlx5_nl_port_info {
+	const char *name; /**< IB device name (in). */
+	uint32_t flags; /**< found attribute flags (out). */
+	uint32_t ibindex; /**< IB device index (out). */
+	uint32_t ifindex; /**< Network interface index (out). */
+	uint32_t portnum; /**< IB device max port number (out). */
+	uint16_t state; /**< IB device port state (out). */
+	uint8_t event_type; /**< IB RDMA event type (out). */
+};
+
+#define MLX5_NL_RDMA_NETDEV_ATTACH_EVENT (1)
+#define MLX5_NL_RDMA_NETDEV_DETACH_EVENT (2)
+
 __rte_internal
 int mlx5_nl_init(int protocol, int groups);
 __rte_internal
@@ -89,4 +110,11 @@ __rte_internal
 int mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id,
 				      const char *pci_addr, int *enable);
 
+__rte_internal
+int mlx5_nl_rdma_monitor_init(void);
+__rte_internal
+void mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data);
+__rte_internal
+int mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap);
+
 #endif /* RTE_PMD_MLX5_NL_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a2f72ef46a..5230576006 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -146,6 +146,8 @@ INTERNAL {
 	mlx5_nl_vf_mac_addr_modify; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_create; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 88d3c57c6e..5156d96b3a 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -894,6 +894,85 @@ mlx5_dev_interrupt_handler_devx(void *cb_arg)
 #endif /* HAVE_IBV_DEVX_ASYNC */
 }
 
+static void
+mlx5_dev_interrupt_ib_cb(struct nlmsghdr *hdr, void *cb_arg)
+{
+	mlx5_nl_rdma_monitor_info_get(hdr, (struct mlx5_nl_port_info *)cb_arg);
+}
+
+void
+mlx5_dev_interrupt_handler_ib(void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_nl_port_info data = {
+		.flags = 0,
+		.name = "",
+		.ifindex = 0,
+		.ibindex = 0,
+		.portnum = 0,
+	};
+	int nlsk_fd = rte_intr_fd_get(sh->intr_handle_ib);
+	struct mlx5_dev_info *dev_info;
+	uint32_t i;
+
+	dev_info = &sh->cdev->dev_info;
+	DRV_LOG(DEBUG, "IB device %s received RDMA monitor netlink event", dev_info->ibname);
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	if (nlsk_fd < 0)
+		return;
+
+	if (mlx5_nl_read_events(nlsk_fd, mlx5_dev_interrupt_ib_cb, &data) < 0)
+		DRV_LOG(ERR, "Failed to process Netlink events: %s",
+			rte_strerror(rte_errno));
+
+	if (!(data.flags & MLX5_NL_CMD_GET_EVENT_TYPE) ||
+		!(data.flags & MLX5_NL_CMD_GET_PORT_INDEX) ||
+		!(data.flags & MLX5_NL_CMD_GET_IB_INDEX))
+		return;
+
+	if (data.ibindex != dev_info->ibindex)
+		return;
+
+	if (data.event_type != MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+		data.event_type != MLX5_NL_RDMA_NETDEV_DETACH_EVENT)
+		return;
+
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+	    !(data.flags & MLX5_NL_CMD_GET_NET_INDEX))
+		return;
+
+	DRV_LOG(DEBUG, "Event info: type %d, ibindex %d, ifindex %d, portnum %d,",
+		data.event_type, data.ibindex, data.ifindex, data.portnum);
+
+	/* Changes found in number of SF/VF ports. All information is likely unreliable. */
+	if (data.portnum > dev_info->port_num) {
+		DRV_LOG(ERR, "Port[%d] exceeds maximum[%d]", data.portnum, dev_info->port_num);
+		goto flush_all;
+	}
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT) {
+		if (!dev_info->port_info[data.portnum].ifindex) {
+			dev_info->port_info[data.portnum].ifindex = data.ifindex;
+			dev_info->port_info[data.portnum].valid = 1;
+		} else {
+			DRV_LOG(WARNING, "Duplicate RDMA event for port[%d] ifindex[%d]",
+				data.portnum, data.ifindex);
+			if (data.ifindex != dev_info->port_info[data.portnum].ifindex)
+				goto flush_all;
+		}
+	} else if (data.event_type == MLX5_NL_RDMA_NETDEV_DETACH_EVENT) {
+		memset(dev_info->port_info + data.portnum, 0, sizeof(struct mlx5_port_nl_info));
+	}
+	return;
+
+flush_all:
+	for (i = 1; i <= dev_info->port_num; i++) {
+		dev_info->port_info[i].ifindex = 0;
+		dev_info->port_info[i].valid = 0;
+	}
+}
+
 /**
  * DPDK callback to bring the link DOWN.
  *
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 8df45ef010..47da00937b 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3030,6 +3030,21 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+		nlsk_fd = mlx5_nl_rdma_monitor_init();
+		if (nlsk_fd < 0) {
+			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
+				rte_strerror(rte_errno));
+			return;
+		}
+		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+		if (sh->intr_handle_ib == NULL) {
+			DRV_LOG(ERR, "Fail to allocate intr_handle");
+			return;
+		}
+	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
 		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
@@ -3091,6 +3106,11 @@ mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
+	fd = rte_intr_fd_get(sh->intr_handle_ib);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_ib,
+				  mlx5_dev_interrupt_handler_ib, sh);
+	if (fd >= 0)
+		close(fd);
 }
 
 /**
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 0e026f7bbb..fe56bc897a 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1574,6 +1574,7 @@ struct mlx5_dev_ctx_shared {
 	struct rte_intr_handle *intr_handle; /* Interrupt handler for device. */
 	struct rte_intr_handle *intr_handle_devx; /* DEVX interrupt handler. */
 	struct rte_intr_handle *intr_handle_nl; /* Netlink interrupt handler. */
+	struct rte_intr_handle *intr_handle_ib; /* Interrupt handler for IB device. */
 	void *devx_comp; /* DEVX async comp obj. */
 	struct mlx5_devx_obj *tis[16]; /* TIS object. */
 	struct mlx5_devx_obj *td; /* Transport domain. */
@@ -2274,6 +2275,7 @@ int mlx5_dev_set_flow_ctrl(struct rte_eth_dev *dev,
 void mlx5_dev_interrupt_handler(void *arg);
 void mlx5_dev_interrupt_handler_devx(void *arg);
 void mlx5_dev_interrupt_handler_nl(void *arg);
+void mlx5_dev_interrupt_handler_ib(void *arg);
 int mlx5_set_link_down(struct rte_eth_dev *dev);
 int mlx5_set_link_up(struct rte_eth_dev *dev);
 int mlx5_is_removed(struct rte_eth_dev *dev);
-- 
2.34.1
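
The cache update rules in mlx5_dev_interrupt_handler_ib() above can be
summarized as a small state machine: attach fills an empty slot, detach
clears it, and any inconsistency (a port index beyond the known maximum,
or an attach that conflicts with a cached ifindex) flushes the whole
cache. A simplified model under those assumptions follows; port_cache,
cache_apply and cache_flush are hypothetical names, and the fixed
MAX_PORTS bound exists only to keep the sketch self-contained.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PORTS 4

enum port_event { PORT_ATTACH = 1, PORT_DETACH = 2 };

struct port_entry {
	uint32_t ifindex; /* 0 means unknown */
	uint8_t valid;
};

struct port_cache {
	uint32_t port_num;                     /* highest known port index */
	struct port_entry port[MAX_PORTS + 1]; /* ports are 1-based */
};

/* Drop every cached entry so later lookups query the kernel again. */
void
cache_flush(struct port_cache *c)
{
	memset(c->port, 0, sizeof(c->port));
}

/* Apply one attach/detach event to the cache. */
void
cache_apply(struct port_cache *c, enum port_event ev, uint32_t portnum,
	    uint32_t ifindex)
{
	assert(c != NULL && c->port_num <= MAX_PORTS);
	if (portnum > c->port_num) {
		/* The number of ports changed; nothing cached is reliable. */
		cache_flush(c);
		return;
	}
	if (ev == PORT_ATTACH) {
		if (c->port[portnum].ifindex == 0) {
			c->port[portnum].ifindex = ifindex;
			c->port[portnum].valid = 1;
		} else if (c->port[portnum].ifindex != ifindex) {
			/* Conflicting attach: the cache is out of sync. */
			cache_flush(c);
		}
	} else if (ev == PORT_DETACH) {
		memset(&c->port[portnum], 0, sizeof(c->port[portnum]));
	}
}
```

Flushing on any inconsistency trades a few extra kernel queries for not
having to reason about which individual entries are still trustworthy.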



* [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
                       ` (5 preceding siblings ...)
  2024-10-28  9:18     ` [PATCH V2 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
@ 2024-10-28  9:18     ` Minggang Li(Gavin)
  2024-10-28 15:49       ` Stephen Hemminger
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
  6 siblings, 2 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-28  9:18 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

Fall back to the old way of updating port information if the kernel driver
does not support RDMA monitor.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/rel_notes/release_24_11.rst  | 14 +++++
 drivers/common/mlx5/linux/mlx5_nl.c     | 73 +++++++++++++++++++++++++
 drivers/common/mlx5/version.map         |  1 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  2 +-
 drivers/net/mlx5/linux/mlx5_os.c        | 27 +++++++--
 drivers/net/mlx5/mlx5.h                 |  1 +
 6 files changed, 111 insertions(+), 7 deletions(-)

diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..02827ff392 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -247,6 +247,20 @@ New Features
   Added ability for node to advertise and update multiple xstat counters,
   that can be retrieved using ``rte_graph_cluster_stats_get``.
 
+* **Updated NVIDIA mlx5 driver.**
+
+  Optimized port probe in large scale.
+  In previous releases, it took a long time to probe one VF/SF if
+  hundreds of VFs/SFs were created in the system. With this newly
+  introduced optimization, the time to probe a VF/SF is greatly reduced
+  at large scale, e.g. with hundreds of VFs/SFs. This feature is
+  controlled through the ``probe_opt_en`` device argument. Setting it to
+  a non-zero value enables this functionality when probing a device.
+  The feature relies on RDMA monitor, an RDMA driver feature to be
+  released in the upcoming upstream kernel 6.13 or the equivalent in
+  OFED 24.10. For further information on the devargs limitation, see
+  "doc/guides/nics/mlx5.rst".
+
 
 Removed Items
 -------------
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index ce1c2a8e75..12f1a620f3 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -2152,3 +2152,76 @@ mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *da
 error:
 	rte_errno = EINVAL;
 }
+
+static int
+mlx5_nl_rdma_monitor_cap_get_cb(struct nlmsghdr *hdr, void *arg)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t *cap = arg;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_SYS_GET))
+		goto error;
+
+	*cap = 0;
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_SYS_ATTR_MONITOR_MODE:
+			*cap = *(uint8_t *)payload;
+			return 0;
+		default:
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return 0;
+
+error:
+	return -EINVAL;
+}
+
+/**
+ * Get RDMA monitor support in driver.
+ *
+ *
+ * @param nl
+ *   Netlink socket of the RDMA kind (NETLINK_RDMA).
+ * @param[out] cap
+ *   Pointer to the RDMA monitor capability value.
+ * @return
+ *   0 on success, negative on error and rte_errno is set.
+ */
+int
+mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap)
+{
+	union {
+		struct nlmsghdr nh;
+		uint8_t buf[NLMSG_HDRLEN];
+	} req = {
+		.nh = {
+			.nlmsg_len = NLMSG_LENGTH(0),
+			.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
+						       RDMA_NLDEV_CMD_SYS_GET),
+			.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
+		},
+	};
+	uint32_t sn = MLX5_NL_SN_GENERATE;
+	int ret;
+
+	ret = mlx5_nl_send(nl, &req.nh, sn);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	ret = mlx5_nl_recv(nl, sn, mlx5_nl_rdma_monitor_cap_get_cb, cap);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	return 0;
+}
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index 5230576006..8301485839 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -148,6 +148,7 @@ INTERNAL {
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_cap_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5156d96b3a..6b2c25a7c2 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -736,7 +736,7 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1 && !sh->rdma_monitor_supp)
 		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 47da00937b..93556dc580 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3022,6 +3022,7 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
 	struct ibv_context *ctx = sh->cdev->ctx;
 	int nlsk_fd;
+	uint8_t rdma_monitor_supp = 0;
 
 	sh->intr_handle = mlx5_os_interrupt_handler_create
 		(RTE_INTR_INSTANCE_F_SHARED, true,
@@ -3030,20 +3031,34 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+	if (sh->cdev->config.probe_opt &&
+	    sh->cdev->dev_info.port_num > 1 &&
+	    !sh->rdma_monitor_supp) {
 		nlsk_fd = mlx5_nl_rdma_monitor_init();
 		if (nlsk_fd < 0) {
 			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
 				rte_strerror(rte_errno));
 			return;
 		}
-		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
-			(RTE_INTR_INSTANCE_F_SHARED, true,
-			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
-		if (sh->intr_handle_ib == NULL) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
+		if (mlx5_nl_rdma_monitor_cap_get(nlsk_fd, &rdma_monitor_supp)) {
+			DRV_LOG(ERR, "Failed to query RDMA monitor support: %s",
+				rte_strerror(rte_errno));
+			close(nlsk_fd);
 			return;
 		}
+		sh->rdma_monitor_supp = rdma_monitor_supp;
+		if (sh->rdma_monitor_supp) {
+			sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+				(RTE_INTR_INSTANCE_F_SHARED, true,
+				 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+			if (sh->intr_handle_ib == NULL) {
+				DRV_LOG(ERR, "Fail to allocate intr_handle");
+				close(nlsk_fd);
+				return;
+			}
+		} else {
+			close(nlsk_fd);
+		}
 	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index fe56bc897a..126b48ac61 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1517,6 +1517,7 @@ struct mlx5_dev_ctx_shared {
 	uint32_t lag_rx_port_affinity_en:1;
 	/* lag_rx_port_affinity is supported. */
 	uint32_t hws_max_log_bulk_sz:5;
+	uint32_t rdma_monitor_supp:1;
 	/* Log of minimal HWS counters created hard coded. */
 	uint32_t hws_max_nb_counters; /* Maximal number for HWS counters. */
 	uint32_t max_port; /* Maximal IB device port index. */
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-28  9:18     ` [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
@ 2024-10-28 15:47       ` Stephen Hemminger
  2024-10-29  8:27         ` Minggang(Gavin) Li
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen Hemminger @ 2024-10-28 15:47 UTC (permalink / raw)
  To: Minggang Li(Gavin)
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland, Rongwei Liu

On Mon, 28 Oct 2024 11:18:18 +0200
"Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:

> +- ``probe_opt_en`` parameter [int]
> +
> +  A non-zero value optimizes the probe process, especially for large scale.
> +  PMD will hold the IB device information internally and reuse it.
> +
> +  By default, the PMD will set this value to 0.
> +

Is there ever a case where this should not be used?

It would be better to just detect and use it if available.
This driver does not need more options...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-28  9:18     ` [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
@ 2024-10-28 15:49       ` Stephen Hemminger
  2024-10-29  8:31         ` Minggang(Gavin) Li
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
  1 sibling, 1 reply; 42+ messages in thread
From: Stephen Hemminger @ 2024-10-28 15:49 UTC (permalink / raw)
  To: Minggang Li(Gavin)
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland

On Mon, 28 Oct 2024 11:18:22 +0200
"Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:

> +* **Updated NVIDIA mlx5 driver.**
> +
> +  Optimized port probe in large scale.
> +  In previous release, it would take long time to probe one VF/SF if
> +  hundreds of VF/SF were created in the system. With this newly introduced
> +  feature optimization, the time to probe a VF/SF will be reduced greatly in
> +  large scale, eg hundreds of VF/SFs. This feature is controlled through the
> +  ``probe_opt_en`` device argument. Setting it to a non-zero value indicates
> +  the application will enable this functionality when probing a device. This
> +  feature relies on a feature of RDMA driver to be release in incoming
> +  upstream kernel 6.13 or the equivalent in OFED 24.10, ie. RDMA monitor.
> +  For further information on the devargs limitation, see
> +  "doc/guides/nics/mlx5.rst".

Too wordy. Many filler words and phrases.

And no clear description of when to use and when not to use.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-28 15:47       ` Stephen Hemminger
@ 2024-10-29  8:27         ` Minggang(Gavin) Li
  2024-10-29 16:07           ` Stephen Hemminger
  0 siblings, 1 reply; 42+ messages in thread
From: Minggang(Gavin) Li @ 2024-10-29  8:27 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland, Rongwei Liu


On 10/28/2024 11:47 PM, Stephen Hemminger wrote:
> On Mon, 28 Oct 2024 11:18:18 +0200
> "Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:
>
>> +- ``probe_opt_en`` parameter [int]
>> +
>> +  A non-zero value optimizes the probe process, especially for large scale.
>> +  PMD will hold the IB device information internally and reuse it.
>> +
>> +  By default, the PMD will set this value to 0.
>> +
> Is there ever a case where this should not be used?
>
> It would be better to just detect and use it if available.
> This driver does not need more options...
The new mechanism is required by only a few users. Keeping it behind an
option avoids breaking production, and we encourage only those who
actually need it to use the new way. Once we see that the new way is
reliable, we will change the default value.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-28 15:49       ` Stephen Hemminger
@ 2024-10-29  8:31         ` Minggang(Gavin) Li
  0 siblings, 0 replies; 42+ messages in thread
From: Minggang(Gavin) Li @ 2024-10-29  8:31 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland


On 10/28/2024 11:49 PM, Stephen Hemminger wrote:
> On Mon, 28 Oct 2024 11:18:22 +0200
> "Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:
>
>> +* **Updated NVIDIA mlx5 driver.**
>> +
>> +  Optimized port probe in large scale.
>> +  In previous release, it would take long time to probe one VF/SF if
>> +  hundreds of VF/SF were created in the system. With this newly introduced
>> +  feature optimization, the time to probe a VF/SF will be reduced greatly in
>> +  large scale, eg hundreds of VF/SFs. This feature is controlled through the
>> +  ``probe_opt_en`` device argument. Setting it to a non-zero value indicates
>> +  the application will enable this functionality when probing a device. This
>> +  feature relies on a feature of RDMA driver to be release in incoming
>> +  upstream kernel 6.13 or the equivalent in OFED 24.10, ie. RDMA monitor.
>> +  For further information on the devargs limitation, see
>> +  "doc/guides/nics/mlx5.rst".
> Too wordy. Many filler words and phrases.
>
> And no clear description of when to use and when not to use.
ACK

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V3 0/7] port probe time optimization
  2024-10-28  9:18     ` [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  2024-10-28 15:49       ` Stephen Hemminger
@ 2024-10-29 13:42       ` Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
                           ` (6 more replies)
  1 sibling, 7 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

This patch series greatly reduces the time to probe a VF/SF at large
scale, e.g. with hundreds of VFs/SFs. The feature is controlled through
the "probe_opt_en" device argument; setting it to a non-zero value enables
the optimization when probing a device. The feature relies on the RDMA
monitor, an RDMA driver feature to be released in the upcoming upstream
kernel 6.13 or the equivalent in OFED 24.10. For further information on
the devargs limitation, see "doc/guides/nics/mlx5.rst".

---
changelog:
v1->v2
        - add feature doc and upstream kernel dependency in release notes
v2->v3
        - revise release notes
---

Minggang Li(Gavin) (5):
  mailmap: update user name
  common/mlx5: fix Netlink socket leak
  common/mlx5: add RDMA monitor event awareness
  mlx5: use RDMA Netlink to update port information
  mlx5: add backward compatibility for RDMA monitor

Rongwei Liu (2):
  net/mlx5: optimize device probing
  net/mlx5: add new devargs to control probe optimization

 .mailmap                                     |   2 +-
 doc/guides/nics/mlx5.rst                     |  13 +
 doc/guides/rel_notes/release_24_11.rst       |  14 +
 drivers/common/mlx5/linux/meson.build        |  10 +
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 +
 drivers/common/mlx5/linux/mlx5_nl.c          | 262 ++++++++++++++++---
 drivers/common/mlx5/linux/mlx5_nl.h          |  36 ++-
 drivers/common/mlx5/mlx5_common.c            |  20 ++
 drivers/common/mlx5/mlx5_common.h            |  15 ++
 drivers/common/mlx5/version.map              |   3 +
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      | 136 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 144 ++++++++--
 drivers/net/mlx5/linux/mlx5_os.h             |   6 -
 drivers/net/mlx5/mlx5.h                      |   3 +
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 16 files changed, 605 insertions(+), 75 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V3 1/7] mailmap: update user name
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
@ 2024-10-29 13:42         ` Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
                           ` (5 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
---
 .mailmap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index 504c390f0f..0ce965084c 100644
--- a/.mailmap
+++ b/.mailmap
@@ -462,7 +462,6 @@ Gary Mussar <gmussar@ciena.com>
 Gaurav Singh <gaurav1086@gmail.com>
 Gautam Dawar <gdawar@solarflare.com>
 Gavin Hu <gavin.hu@arm.com> <gavin.hu@linaro.org>
-Gavin Li <gavinl@nvidia.com>
 Geoffrey Le Gourriérec <geoffrey.le_gourrierec@6wind.com>
 Geoffrey Lv <geoffrey.lv@gmail.com>
 Geoff Thorpe <geoff.thorpe@nxp.com>
@@ -1024,6 +1023,7 @@ Mike Ximing Chen <mike.ximing.chen@intel.com>
 Milena Olech <milena.olech@intel.com>
 Min Cao <min.cao@intel.com>
 Minghuan Lian <minghuan.lian@nxp.com>
+Minggang Li(Gavin) <gavinl@nvidia.com>
 Mingjin Ye <mingjinx.ye@intel.com>
 Mingshan Zhang <mingshan.zhang@intel.com>
 Mingxia Liu <mingxia.liu@intel.com>
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V3 2/7] net/mlx5: optimize device probing
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
@ 2024-10-29 13:42         ` Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
                           ` (4 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

The current DPDK probing logic is:
1. Query the IB device index and the total number of ports.
2. Query each port's information by traversing the port indexes to
   get the port's ifindex, name, state, etc.
3. Compare the information with the devargs until a match is found.
4. Repeat steps 2 and 3 for each probed device.

Step 2 communicates with the kernel via Netlink, which is time-consuming.
There is no need to repeat the Netlink communication for each probed
device: the PMD can traverse all ports once and save the information in
a cache structure.

Introduce device information caching in the mlx5 common device
handle and cache the number of ports, the ibindex and each port's ifindex.
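
For illustration only, the caching idea can be sketched in standalone C
as below. This is not the driver code: dev_cache, port_entry,
cached_ifindex() and slow_ifindex_query() are hypothetical names, and the
slow query merely stands in for the Netlink round trip. The sketch
assumes, as the patch appears to do, that a port found to have no netdev
is still cached (ifindex 0, valid 1) so that it is not queried again.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_NAME_MAX 64 /* hypothetical; the patch uses MLX5_FS_NAME_MAX */

struct port_entry {
	uint32_t ifindex; /* 0 means the port has no netdev */
	uint8_t valid;    /* 1 once this port has been queried */
};

struct dev_cache {
	uint32_t port_num;        /* total number of ports */
	uint32_t ibindex;         /* IB device index */
	char ibname[SKETCH_NAME_MAX];
	struct port_entry *ports; /* port_num + 1 entries, index 0 unused */
};

/* Counts how often the (pretend) expensive kernel query runs. */
static unsigned int slow_queries;

/* Hypothetical stand-in for the Netlink query: odd ports have a netdev. */
static uint32_t
slow_ifindex_query(uint32_t pindex)
{
	slow_queries++;
	return (pindex % 2) ? 100 + pindex : 0;
}

static int
dev_cache_init(struct dev_cache *c, const char *ibname, uint32_t ibindex,
	       uint32_t port_num)
{
	assert(port_num > 0);
	memset(c, 0, sizeof(*c));
	c->ports = calloc((size_t)port_num + 1, sizeof(*c->ports));
	if (c->ports == NULL)
		return -1;
	c->port_num = port_num;
	c->ibindex = ibindex;
	snprintf(c->ibname, sizeof(c->ibname), "%s", ibname);
	return 0;
}

static void
dev_cache_free(struct dev_cache *c)
{
	free(c->ports);
	memset(c, 0, sizeof(*c));
}

/* Return the ifindex of a port, querying the kernel only on a cache miss. */
static uint32_t
cached_ifindex(struct dev_cache *c, const char *ibname, uint32_t pindex)
{
	int cacheable = strcmp(ibname, c->ibname) == 0 && c->ports != NULL &&
			pindex >= 1 && pindex <= c->port_num;
	uint32_t ifindex;

	if (cacheable && c->ports[pindex].valid)
		return c->ports[pindex].ifindex;
	ifindex = slow_ifindex_query(pindex);
	if (cacheable) {
		/* An unused port (ifindex 0) is cached as well. */
		c->ports[pindex].ifindex = ifindex;
		c->ports[pindex].valid = 1;
	}
	return ifindex;
}
```

With such a cache, probing N devices that share one IB device costs one
query per port in total instead of one query per port per probed device.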

For dynamic interface changes:
1. Creating new VFs by toggling switchdev mode requires restarting DPDK,
   because the SR-IOV configuration changed.
2. Changing the number of VFs without toggling switchdev mode triggers
   RTM_DELLINK and RTM_NEWLINK events. All cached information is
   cleared.
3. A new SF triggers an RTM_NEWLINK event whose message carries no port
   index information. All free entries (ifindex = 0) in the cache are
   invalidated.
4. Deleting an SF triggers an RTM_DELLINK event. The cache entries are
   traversed and the one with the same ifindex is invalidated.
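
Rules 3 and 4 above can be sketched in standalone C as follows. This is
a simplified illustration, not the driver code: port_cache, port_slot,
link_event and port_cache_on_link_event() are hypothetical names, and
LINK_ADDED/LINK_REMOVED stand in for RTM_NEWLINK/RTM_DELLINK. The patch
additionally checks the switch info of the new interface before
flushing; that check is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for RTM_NEWLINK and RTM_DELLINK. */
enum link_event { LINK_ADDED, LINK_REMOVED };

struct port_slot {
	uint32_t ifindex; /* 0 means the port has no netdev (free entry) */
	uint8_t valid;    /* 1 if the cached value may be trusted */
};

struct port_cache {
	uint32_t port_num;
	struct port_slot *slots; /* port_num + 1 entries, index 0 unused */
};

static void
port_cache_on_link_event(struct port_cache *c, uint32_t ifindex,
			 enum link_event ev)
{
	uint32_t i;

	assert(c != NULL);
	if (c->port_num <= 1 || c->slots == NULL)
		return;
	/* Look for a valid entry that already holds this ifindex. */
	for (i = 1; i <= c->port_num; i++) {
		if (c->slots[i].valid && c->slots[i].ifindex == ifindex)
			break;
	}
	if (ev == LINK_REMOVED && i <= c->port_num) {
		/* Rule 4: drop only the entry of the deleted interface. */
		memset(&c->slots[i], 0, sizeof(c->slots[i]));
	} else if (ev == LINK_ADDED && i > c->port_num) {
		/*
		 * Rule 3: the event does not say which port the new
		 * interface belongs to, so every free entry must be
		 * queried again on next use.
		 */
		for (i = 1; i <= c->port_num; i++) {
			if (c->slots[i].ifindex == 0)
				c->slots[i].valid = 0;
		}
	}
}
```

Entries that already hold a non-zero ifindex survive an add event, so
only the previously unused ports pay for a new kernel query.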

Race conditions between the probing thread and the interrupt
thread are not considered.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 ++
 drivers/common/mlx5/linux/mlx5_nl.c          |  94 +++++++++++++----
 drivers/common/mlx5/linux/mlx5_nl.h          |   8 +-
 drivers/common/mlx5/mlx5_common.c            |   5 +
 drivers/common/mlx5/mlx5_common.h            |  13 +++
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  54 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 104 ++++++++++++++-----
 drivers/net/mlx5/linux/mlx5_os.h             |   6 --
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 10 files changed, 242 insertions(+), 58 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index e8aa1d46ec..2e2c54f1fa 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -22,6 +22,12 @@
 #include "mlx5_glue.h"
 #include "mlx5_malloc.h"
 
+/* verb enumerations translations to local enums. */
+enum {
+	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
+	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
+};
+
 /**
  * Get device name. Given an ibv_device pointer - return a
  * pointer to the corresponding device name.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index a5ac4dc543..e98073aafe 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1073,16 +1073,18 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret;
 
-	ret = mlx5_nl_send(nl, &req.nh, sn);
-	if (ret < 0)
-		return ret;
-	ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
-	if (ret < 0)
-		return ret;
-	if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
-	    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
-		goto error;
-	data->flags = 0;
+	if (data->ibindex == UINT32_MAX) {
+		ret = mlx5_nl_send(nl, &req.nh, sn);
+		if (ret < 0)
+			return ret;
+		ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
+		if (ret < 0)
+			return ret;
+		if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
+		    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
+			goto error;
+		data->flags = 0;
+	}
 	sn = MLX5_NL_SN_GENERATE;
 	req.nh.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
 					     RDMA_NLDEV_CMD_PORT_GET);
@@ -1109,7 +1111,7 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	    !(data->flags & MLX5_NL_CMD_GET_NET_INDEX) ||
 	    !data->ifindex)
 		goto error;
-	return 1;
+	return 0;
 error:
 	rte_errno = ENODEV;
 	return -rte_errno;
@@ -1128,21 +1130,48 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   A valid (nonzero) interface index on success, 0 otherwise and rte_errno
  *   is set.
  */
 unsigned int
-mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
+mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
+	int ret;
+
 	struct mlx5_nl_port_info data = {
 			.ifindex = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
+			.flags = 0,
 	};
 
-	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
-		return 0;
-	return data.ifindex;
+	if (!strcmp(name, dev_info->ibname)) {
+		if (dev_info->port_info && pindex <= dev_info->port_num &&
+		    dev_info->port_info[pindex].valid) {
+			if (!dev_info->port_info[pindex].ifindex)
+				rte_errno = ENODEV;
+			return dev_info->port_info[pindex].ifindex;
+		}
+		if (dev_info->port_num)
+			data.ibindex = dev_info->ibindex;
+	}
+
+	ret = mlx5_nl_port_info(nl, pindex, &data);
+
+	if (!strcmp(dev_info->ibname, name)) {
+		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
+		    pindex <= dev_info->port_num) {
+			if (!ret)
+				dev_info->port_info[pindex].ifindex = data.ifindex;
+			/* -ENODEV means the pindex is unused but still valid case */
+			dev_info->port_info[pindex].valid = 1;
+		}
+	}
+
+	return ret ? 0 : data.ifindex;
 }
 
 /**
@@ -1157,18 +1186,23 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   Port state (ibv_port_state) on success, negative on error
  *   and rte_errno is set.
  */
 int
-mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
+mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 			.state = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
 	};
 
+	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
 	if ((data.flags & MLX5_NL_CMD_GET_PORT_STATE) == 0) {
@@ -1185,13 +1219,15 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
  *   Netlink socket of the RDMA kind (NETLINK_RDMA).
  * @param[in] name
  *   IB device name.
+ * @param[in] dev_info
+ *   Cached mlx5 device info.
  *
  * @return
  *   A valid (nonzero) number of ports on success, 0 otherwise
  *   and rte_errno is set.
  */
 unsigned int
-mlx5_nl_portnum(int nl, const char *name)
+mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 		.flags = 0,
@@ -1206,7 +1242,10 @@ mlx5_nl_portnum(int nl, const char *name)
 		.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP,
 	};
 	uint32_t sn = MLX5_NL_SN_GENERATE;
-	int ret;
+	int ret, size;
+
+	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
 	if (ret < 0)
@@ -1220,8 +1259,25 @@ mlx5_nl_portnum(int nl, const char *name)
 		rte_errno = ENODEV;
 		return 0;
 	}
-	if (!data.portnum)
+	if (!data.portnum) {
 		rte_errno = EINVAL;
+		return 0;
+	}
+	MLX5_ASSERT(!strlen(dev_info->ibname));
+	dev_info->port_num = data.portnum;
+	dev_info->ibindex = data.ibindex;
+	snprintf(dev_info->ibname, MLX5_FS_NAME_MAX, "%s", name);
+	if (data.portnum > 1) {
+		size = (data.portnum + 1) * sizeof(struct mlx5_port_nl_info);
+		dev_info->port_info = mlx5_malloc(MLX5_MEM_ZERO | MLX5_MEM_RTE, size,
+						  RTE_CACHE_LINE_SIZE,
+						  SOCKET_ID_ANY);
+		if (dev_info->port_info == NULL) {
+			memset(dev_info, 0, sizeof(*dev_info));
+			rte_errno = ENOMEM;
+			return 0;
+		}
+	}
 	return data.portnum;
 }
 
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 580de3b769..396ffc98ce 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -11,6 +11,7 @@
 #include <rte_ether.h>
 
 #include "mlx5_common.h"
+#include "mlx5_common_utils.h"
 
 typedef void (mlx5_nl_event_cb)(struct nlmsghdr *hdr, void *user_data);
 
@@ -52,11 +53,12 @@ int mlx5_nl_promisc(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
 int mlx5_nl_allmulti(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
-unsigned int mlx5_nl_portnum(int nl, const char *name);
+unsigned int mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info);
 __rte_internal
-unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex);
+unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex,
+			     struct mlx5_dev_info *info);
 __rte_internal
-int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex);
+int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info);
 __rte_internal
 int mlx5_nl_vf_mac_addr_modify(int nlsk_fd, unsigned int iface_idx,
 			       struct rte_ether_addr *mac, int vf_index);
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index ca8543e36e..0aaae91c31 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -735,6 +735,11 @@ mlx5_common_dev_release(struct mlx5_common_device *cdev)
 		if (TAILQ_EMPTY(&devices_list))
 			rte_mem_event_callback_unregister("MLX5_MEM_EVENT_CB",
 							  NULL);
+		if (cdev->dev_info.port_info != NULL) {
+			mlx5_free(cdev->dev_info.port_info);
+			cdev->dev_info.port_info = NULL;
+		}
+		cdev->dev_info.port_num = 0;
 		mlx5_dev_mempool_unsubscribe(cdev);
 		mlx5_mr_release_cache(&cdev->mr_scache);
 		mlx5_dev_hw_global_release(cdev);
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 1abd1e8239..6cb40f54dd 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -174,6 +174,18 @@ enum mlx5_nl_phys_port_name_type {
 	MLX5_PHYS_PORT_NAME_TYPE_UNKNOWN, /* Unrecognized. */
 };
 
+struct mlx5_port_nl_info {
+	uint32_t ifindex;
+	uint8_t valid;
+};
+
+struct mlx5_dev_info {
+	uint32_t port_num;
+	uint32_t ibindex;
+	char ibname[MLX5_FS_NAME_MAX];
+	struct mlx5_port_nl_info *port_info;
+};
+
 /** Switch information returned by mlx5_nl_switch_info(). */
 struct mlx5_switch_info {
 	uint32_t master:1; /**< Master device. */
@@ -525,6 +537,7 @@ struct mlx5_common_device {
 	uint32_t classes_loaded;
 	void *ctx; /* Verbs/DV/DevX context. */
 	void *pd; /* Protection Domain. */
+	struct mlx5_dev_info dev_info; /* Device port info queried via netlink. */
 	uint32_t pdn; /* Protection Domain Number. */
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	struct mlx5_common_dev_config config; /* Device configuration. */
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index acee0c987f..65394035de 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -20,6 +20,11 @@
 
 #define MLX5_BF_OFFSET 0x800
 
+enum {
+	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
+	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
+};
+
 /**
  * This API allocates aligned or non-aligned memory.  The free can be on either
  * aligned or nonaligned memory.  To be protected - even though there may be no
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5d64984022..08ac6dd939 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -23,6 +23,7 @@
 #include <stdalign.h>
 #include <sys/un.h>
 #include <time.h>
+#include <linux/rtnetlink.h>
 
 #include <ethdev_linux_ethtool.h>
 #include <ethdev_driver.h>
@@ -673,6 +674,57 @@ mlx5_link_update_bond(struct rte_eth_dev *dev)
 		((ifr.ifr_flags & IFF_UP) && (ifr.ifr_flags & IFF_RUNNING));
 }
 
+static void
+mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
+			     uint16_t msg_type)
+{
+	struct mlx5_switch_info info = {
+		.master = 0,
+		.representor = 0,
+		.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET,
+		.port_name = 0,
+		.switch_id = 0,
+	};
+	uint32_t i;
+	int nl_route;
+
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].valid)
+			continue;
+		if (dev_info->port_info[i].ifindex == if_index)
+			break;
+	}
+	if (msg_type == RTM_NEWLINK && i > dev_info->port_num) {
+		nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
+		if  (nl_route < 0)
+			goto flush_all;
+
+		if (mlx5_nl_switch_info(nl_route, if_index, &info)) {
+			if (mlx5_sysfs_switch_info(if_index, &info))
+				goto flush_all;
+		}
+
+		if (info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFSF ||
+		    info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFVF)
+			goto flush_all;
+		close(nl_route);
+	} else if (msg_type == RTM_DELLINK && i <= dev_info->port_num) {
+		memset(dev_info->port_info + i, 0, sizeof(struct mlx5_port_nl_info));
+	}
+
+	return;
+flush_all:
+	if (nl_route >= 0)
+		close(nl_route);
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].ifindex)
+			dev_info->port_info[i].valid = 0;
+	}
+}
+
 static void
 mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 {
@@ -682,6 +734,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
+	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
 		struct rte_eth_dev *dev;
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 69a80b9ddc..8f6e584154 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1268,7 +1268,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		/* IB doesn't allow more than 255 ports, must be Ethernet. */
 		err = mlx5_nl_port_state(nl_rdma,
 			spawn->phys_dev_name,
-			spawn->phys_port);
+			spawn->phys_port, &spawn->cdev->dev_info);
 		if (err < 0) {
 			DRV_LOG(INFO, "Failed to get netlink port state: %s",
 				strerror(rte_errno));
@@ -1897,6 +1897,8 @@ mlx5_dev_spawn_data_cmp(const void *a, const void *b)
  *   Netlink RDMA group socket handle.
  * @param[in] owner
  *   Representor owner PF index.
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @param[out] bond_info
  *   Pointer to bonding information.
  *
@@ -1908,6 +1910,7 @@ static int
 mlx5_device_bond_pci_match(const char *ibdev_name,
 			   const struct rte_pci_addr *pci_dev,
 			   int nl_rdma, uint16_t owner,
+			   struct mlx5_dev_info *dev_info,
 			   struct mlx5_bond_info *bond_info)
 {
 	char ifname[IF_NAMESIZE + 1];
@@ -1928,7 +1931,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		return -1;
 	if (!strstr(ibdev_name, "bond"))
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibdev_name);
+	np = mlx5_nl_portnum(nl_rdma, ibdev_name, dev_info);
 	if (!np)
 		return -1;
 	if (mlx5_get_device_guid(pci_dev, cur_guid, sizeof(cur_guid)) < 0)
@@ -1940,7 +1943,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 	 */
 	for (i = 1; i <= np; ++i) {
 		/* Check whether Infiniband port is populated. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -1978,9 +1981,13 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (!file)
 			break;
 		info.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET;
-		if (fscanf(file, "%32s", tmp_str) == 1)
+		if (fscanf(file, "%32s", tmp_str) == 1) {
 			mlx5_translate_port_name(tmp_str, &info);
-		fclose(file);
+			fclose(file);
+		} else {
+			fclose(file);
+			break;
+		}
 		/* Only process PF ports. */
 		if (info.name_type != MLX5_PHYS_PORT_NAME_TYPE_LEGACY &&
 		    info.name_type != MLX5_PHYS_PORT_NAME_TYPE_UPLINK)
@@ -2003,8 +2010,8 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (ret != 1)
 			break;
 		/* Save bonding info. */
-		strncpy(bond_info->ports[info.port_name].ifname, ifname,
-			sizeof(bond_info->ports[0].ifname));
+		snprintf(bond_info->ports[info.port_name].ifname,
+			 sizeof(bond_info->ports[0].ifname), "%s", ifname);
 		bond_info->ports[info.port_name].pci_addr = pci_addr;
 		bond_info->ports[info.port_name].ifindex = ifindex;
 		bond_info->n_port++;
@@ -2033,6 +2040,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		      pci_addr.function == owner)))
 			pf = info.port_name;
 	}
+	fclose(bond_file);
 	if (pf >= 0) {
 		/* Get bond interface info */
 		ret = mlx5_sysfs_bond_info(ifindex, &bond_info->ifindex,
@@ -2084,7 +2092,8 @@ mlx5_nl_esw_multiport_get(struct rte_pci_addr *pci_addr, int *enabled)
 #define SYSFS_MPESW_PARAM_MAX_LEN 16
 
 static int
-mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled)
+mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled,
+			     struct mlx5_dev_info *dev_info)
 {
 	int nl_rdma;
 	unsigned int n_ports;
@@ -2096,7 +2105,7 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	if (nl_rdma < 0)
 		return nl_rdma;
-	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name);
+	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!n_ports) {
 		ret = -rte_errno;
 		goto close_nl_rdma;
@@ -2104,12 +2113,12 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	for (i = 1; i <= n_ports; ++i) {
 		unsigned int ifindex;
 		char ifname[IF_NAMESIZE + 1];
-		struct rte_pci_addr if_pci_addr;
+		struct rte_pci_addr if_pci_addr = { 0 };
 		char mpesw[SYSFS_MPESW_PARAM_MAX_LEN + 1];
 		FILE *sysfs;
 		int n;
 
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2151,7 +2160,8 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 }
 
 static int
-mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled)
+mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled,
+		      struct mlx5_dev_info *dev_info)
 {
 	/*
 	 * Try getting Multiport E-Switch state through netlink interface
@@ -2159,7 +2169,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 	 * assume that Multiport E-Switch is disabled and return an error.
 	 */
 	if (mlx5_nl_esw_multiport_get(ibv_pci_addr, enabled) >= 0 ||
-	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled) >= 0)
+	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled, dev_info) >= 0)
 		return 0;
 	DRV_LOG(DEBUG, "Unable to check MPESW state for IB device %s "
 		       "(PCI: " PCI_PRI_FMT ")",
@@ -2173,7 +2183,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 static int
 mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 			    const struct rte_pci_addr *owner_pci,
-			    int nl_rdma)
+			    int nl_rdma, struct mlx5_dev_info *dev_info)
 {
 	struct rte_pci_addr ibdev_pci_addr = { 0 };
 	char ifname[IF_NAMESIZE + 1] = { 0 };
@@ -2197,24 +2207,24 @@ mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 		return -1;
 	}
 	/* Check if IB device has MPESW enabled. */
-	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled))
+	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled, dev_info))
 		return -1;
 	if (!enabled)
 		return -1;
 	/* Iterate through IB ports to find MPESW master uplink port. */
 	if (nl_rdma < 0)
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibv->name);
+	np = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!np)
 		return -1;
 	for (i = 1; i <= np; ++i) {
-		struct rte_pci_addr pci_addr;
+		struct rte_pci_addr pci_addr = { 0 };
 		FILE *file;
 		char port_name[IF_NAMESIZE + 1];
 		struct mlx5_switch_info	info;
 
 		/* Check whether IB port has a corresponding netdev. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2321,16 +2331,30 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	 * matching ones, gathering into the list.
 	 */
 	struct ibv_device *ibv_match[ret + 1];
+	struct mlx5_dev_info *info, tmp_info[ret];
 	int nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
 	int nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	unsigned int i;
 
+	memset(tmp_info, 0, sizeof(tmp_info));
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
+		if (cdev->dev_info.port_num) {
+			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
+				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
+					cdev->dev_info.ibname, ibv_list[ret]->name);
+				continue;
+			}
+			info = &cdev->dev_info;
+		} else {
+			info = &tmp_info[ret];
+		}
 		DRV_LOG(DEBUG, "Checking device \"%s\"", ibv_list[ret]->name);
 		bd = mlx5_device_bond_pci_match(ibv_list[ret]->name, &owner_pci,
-						nl_rdma, owner_id, &bond_info);
+						nl_rdma, owner_id,
+						info,
+						&bond_info);
 		if (bd >= 0) {
 			/*
 			 * Bonding device detected. Only one match is allowed,
@@ -2356,7 +2380,8 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ibv_match[nd++] = ibv_list[ret];
 			break;
 		}
-		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma);
+		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma,
+						    info);
 		if (mpesw >= 0) {
 			/*
 			 * MPESW device detected. Only one matching IB device is allowed,
@@ -2380,10 +2405,18 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		}
 		/* Bonding or MPESW device was not found. */
 		if (mlx5_get_pci_addr(ibv_list[ret]->ibdev_path,
-					&pci_addr))
+					&pci_addr)) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
-		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0)
+		}
+		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
+		}
 		DRV_LOG(INFO, "PCI information matches for device \"%s\"",
 			ibv_list[ret]->name);
 		ibv_match[nd++] = ibv_list[ret];
@@ -2401,13 +2434,21 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		goto exit;
 	}
 	if (nd == 1) {
+		if (!cdev->dev_info.port_num) {
+			for (i = 0; i < RTE_DIM(tmp_info); i++) {
+				if (tmp_info[i].port_num) {
+					cdev->dev_info = tmp_info[i];
+					break;
+				}
+			}
+		}
 		/*
 		 * Found single matching device may have multiple ports.
 		 * Each port may be representor, we have to check the port
 		 * number and check the representors existence.
 		 */
 		if (nl_rdma >= 0)
-			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name);
+			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name, &cdev->dev_info);
 		if (!np)
 			DRV_LOG(WARNING,
 				"Cannot get IB device \"%s\" ports number.",
@@ -2424,6 +2465,14 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ret = -rte_errno;
 			goto exit;
 		}
+	} else {
+		/* Can't handle one common device with multiple IB devices caching */
+		for (i = 0; i < RTE_DIM(tmp_info); i++) {
+			if (tmp_info[i].port_info != NULL)
+				mlx5_free(tmp_info[i].port_info);
+			memset(&tmp_info[i], 0, sizeof(tmp_info[0]));
+		}
+		DRV_LOG(INFO, "Cannot handle multiple IB devices info caching in single common device.");
 	}
 	/* Now we can determine the maximal amount of devices to be spawned. */
 	list = mlx5_malloc(MLX5_MEM_ZERO,
@@ -2457,7 +2506,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			list[ns].mpesw_port = MLX5_MPESW_PORT_INVALID;
 			list[ns].ifindex = mlx5_nl_ifindex(nl_rdma,
 							   ibv_match[0]->name,
-							   i);
+							   i, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				/*
 				 * No network interface index found for the
@@ -2588,7 +2637,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 				list[ns].ifindex = mlx5_nl_ifindex
 							    (nl_rdma,
 							     ibv_match[i]->name,
-							     1);
+							     1, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				char ifname[IF_NAMESIZE];
 
@@ -2777,6 +2826,11 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		mlx5_free(list);
 	MLX5_ASSERT(ibv_list);
 	mlx5_glue->free_device_list(ibv_list);
+	if (ret) {
+		if (cdev->dev_info.port_info != NULL)
+			mlx5_free(cdev->dev_info.port_info);
+		memset(&cdev->dev_info, 0, sizeof(cdev->dev_info));
+	}
 	return ret;
 }
 
diff --git a/drivers/net/mlx5/linux/mlx5_os.h b/drivers/net/mlx5/linux/mlx5_os.h
index 80c70d713a..4ef0916173 100644
--- a/drivers/net/mlx5/linux/mlx5_os.h
+++ b/drivers/net/mlx5/linux/mlx5_os.h
@@ -8,12 +8,6 @@
 
 #include <net/if.h>
 
-/* verb enumerations translations to local enums. */
-enum {
-	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
-	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
-};
-
 /* Maximal data of sendmsg message(in bytes). */
 #define MLX5_SENDMSG_MAX 64
 
diff --git a/drivers/net/mlx5/windows/mlx5_os.h b/drivers/net/mlx5/windows/mlx5_os.h
index 8b58265687..fb7198c244 100644
--- a/drivers/net/mlx5/windows/mlx5_os.h
+++ b/drivers/net/mlx5/windows/mlx5_os.h
@@ -7,11 +7,6 @@
 
 #include "mlx5_win_ext.h"
 
-enum {
-	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
-	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
-};
-
 #define PCI_DRV_FLAGS 0
 
 #define MLX5_NAMESIZE MLX5_FS_NAME_MAX
-- 
2.34.1



* [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
@ 2024-10-29 13:42         ` Minggang Li(Gavin)
  2024-10-29 16:20           ` Stephen Hemminger
  2024-10-29 13:42         ` [PATCH V3 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
                           ` (3 subsequent siblings)
  6 siblings, 1 reply; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

Add a new devarg, probe_opt_en, to control probe optimization
in the PMD.

By default, the value is 0 and behavior is unchanged.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
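Reader note: the gating this patch adds to the cached lookups can be
sketched as below. This is a minimal illustration, not the driver code.
The struct is a reduced stand-in for struct mlx5_dev_info, and the names
dev_info_sketch, cached_portnum and IBNAME_MAX are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define IBNAME_MAX 64 /* hypothetical; the patch uses MLX5_FS_NAME_MAX */

/* Reduced stand-in for struct mlx5_dev_info (assumption of this sketch). */
struct dev_info_sketch {
	uint32_t port_num;       /* cached number of IB ports, 0 if unknown */
	char ibname[IBNAME_MAX]; /* IB device name the cache belongs to */
	uint8_t probe_opt;       /* copy of the probe_opt_en devarg (0 or 1) */
};

/*
 * Return the cached port count, or 0 when the caller has to fall back
 * to a Netlink query: cache disabled, cache empty, or name mismatch.
 */
static uint32_t
cached_portnum(const struct dev_info_sketch *info, const char *name)
{
	if (info == NULL || !info->probe_opt || info->port_num == 0)
		return 0;
	if (strcmp(name, info->ibname) != 0)
		return 0;
	return info->port_num;
}
```

With probe_opt left at its default of 0 the helper always returns 0, so
every lookup takes the pre-existing Netlink path, which matches the
"behavior is unchanged" statement in the commit message.
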
 doc/guides/nics/mlx5.rst                |  7 +++++++
 drivers/common/mlx5/linux/mlx5_nl.c     | 12 ++++++++----
 drivers/common/mlx5/mlx5_common.c       | 15 +++++++++++++++
 drivers/common/mlx5/mlx5_common.h       |  2 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  5 ++++-
 drivers/net/mlx5/linux/mlx5_os.c        |  2 +-
 6 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index f82e2d75de..981401a9f2 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1436,6 +1436,13 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 1.
 
+- ``probe_opt_en`` parameter [int]
+
+  A non-zero value optimizes the probe process, especially at large scale.
+  The PMD will hold the IB device information internally and reuse it.
+
+  By default, the PMD will set this value to 0.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e98073aafe..745e443f8f 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1148,7 +1148,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 			.flags = 0,
 	};
 
-	if (!strcmp(name, dev_info->ibname)) {
+	if (dev_info->probe_opt && !strcmp(name, dev_info->ibname)) {
 		if (dev_info->port_info && pindex <= dev_info->port_num &&
 		    dev_info->port_info[pindex].valid) {
 			if (!dev_info->port_info[pindex].ifindex)
@@ -1161,7 +1161,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 
 	ret = mlx5_nl_port_info(nl, pindex, &data);
 
-	if (!strcmp(dev_info->ibname, name)) {
+	if (dev_info->probe_opt && !strcmp(dev_info->ibname, name)) {
 		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
 		    pindex <= dev_info->port_num) {
 			if (!ret)
@@ -1201,7 +1201,8 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_in
 			.ibindex = UINT32_MAX,
 	};
 
-	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+	if (dev_info && dev_info->probe_opt &&
+	    !strcmp(name, dev_info->ibname) && dev_info->port_num)
 		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
@@ -1244,7 +1245,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret, size;
 
-	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+	if (dev_info->probe_opt && dev_info->port_num &&
+	    !strcmp(name, dev_info->ibname))
 		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
@@ -1263,6 +1265,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 		rte_errno = EINVAL;
 		return 0;
 	}
+	if (!dev_info->probe_opt)
+		return data.portnum;
 	MLX5_ASSERT(!strlen(dev_info->ibname));
 	dev_info->port_num = data.portnum;
 	dev_info->ibindex = data.ibindex;
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index 0aaae91c31..9abae4a374 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -40,6 +40,9 @@ uint8_t haswell_broadwell_cpu;
 /* The default memory allocator used in PMD. */
 #define MLX5_SYS_MEM_EN "sys_mem_en"
 
+/* Probe optimization in PMD. */
+#define MLX5_PROBE_OPT "probe_opt_en"
+
 /*
  * Device parameter to force doorbell register mapping
  * to non-cached region eliminating the extra write memory barrier.
@@ -295,6 +298,8 @@ mlx5_common_args_check_handler(const char *key, const char *val, void *opaque)
 		config->device_fd = tmp;
 	} else if (strcmp(key, MLX5_PD_HANDLE) == 0) {
 		config->pd_handle = tmp;
+	} else if (strcmp(key, MLX5_PROBE_OPT) == 0) {
+		config->probe_opt = !!tmp;
 	}
 	return 0;
 }
@@ -324,6 +329,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 		MLX5_MR_MEMPOOL_REG_EN,
 		MLX5_DEVICE_FD,
 		MLX5_PD_HANDLE,
+		MLX5_PROBE_OPT,
 		NULL,
 	};
 	int ret = 0;
@@ -332,6 +338,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	config->mr_ext_memseg_en = 1;
 	config->mr_mempool_reg_en = 1;
 	config->sys_mem_en = 0;
+	config->probe_opt = 0;
 	config->dbnc = MLX5_ARG_UNSET;
 	config->device_fd = MLX5_ARG_UNSET;
 	config->pd_handle = MLX5_ARG_UNSET;
@@ -351,6 +358,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	DRV_LOG(DEBUG, "mr_ext_memseg_en is %u.", config->mr_ext_memseg_en);
 	DRV_LOG(DEBUG, "mr_mempool_reg_en is %u.", config->mr_mempool_reg_en);
 	DRV_LOG(DEBUG, "sys_mem_en is %u.", config->sys_mem_en);
+	DRV_LOG(DEBUG, "probe_opt_en is %u.", config->probe_opt);
 	DRV_LOG(DEBUG, "Send Queue doorbell mapping parameter is %d.",
 		config->dbnc);
 	return ret;
@@ -791,6 +799,7 @@ mlx5_common_dev_create(struct rte_device *eal_dev, uint32_t classes,
 	if (TAILQ_EMPTY(&devices_list))
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
+	cdev->dev_info.probe_opt = cdev->config.probe_opt;
 exit:
 	pthread_mutex_lock(&devices_list_lock);
 	TAILQ_INSERT_HEAD(&devices_list, cdev, next);
@@ -880,6 +889,12 @@ mlx5_common_probe_again_args_validate(struct mlx5_common_device *cdev,
 			cdev->dev->name);
 		goto error;
 	}
+	if (cdev->config.probe_opt != config->probe_opt) {
+		DRV_LOG(ERR, "\"" MLX5_PROBE_OPT"\" "
+			"configuration mismatch for device %s.",
+			cdev->dev->name);
+		goto error;
+	}
 	if (cdev->config.dbnc != config->dbnc) {
 		DRV_LOG(ERR, "\"" MLX5_SQ_DB_NC "\" "
 			"configuration mismatch for device %s.",
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 6cb40f54dd..f1b59d6f07 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -183,6 +183,7 @@ struct mlx5_dev_info {
 	uint32_t port_num;
 	uint32_t ibindex;
 	char ibname[MLX5_FS_NAME_MAX];
+	uint8_t probe_opt;
 	struct mlx5_port_nl_info *port_info;
 };
 
@@ -525,6 +526,7 @@ struct mlx5_common_dev_config {
 	int pd_handle; /* Protection Domain handle for importation.  */
 	unsigned int devx:1; /* Whether devx interface is available or not. */
 	unsigned int sys_mem_en:1; /* The default memory allocator. */
+	unsigned int probe_opt:1; /* Optimize probing. */
 	unsigned int mr_mempool_reg_en:1;
 	/* Allow/prevent implicit mempool memory registration. */
 	unsigned int mr_ext_memseg_en:1;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 08ac6dd939..88d3c57c6e 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -691,6 +691,8 @@ mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
 	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
 		return;
 
+	DRV_LOG(DEBUG, "IB device %s ifindex %u received netlink event %u",
+			dev_info->ibname, if_index, msg_type);
 	for (i = 1; i <= dev_info->port_num; i++) {
 		if (!dev_info->port_info[i].valid)
 			continue;
@@ -734,7 +736,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 8f6e584154..695936f634 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2340,7 +2340,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
-		if (cdev->dev_info.port_num) {
+		if (cdev->config.probe_opt && cdev->dev_info.port_num) {
 			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
 				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
 					cdev->dev_info.ibname, ibv_list[ret]->name);
-- 
2.34.1



* [PATCH V3 4/7] common/mlx5: fix Netlink socket leak
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                           ` (2 preceding siblings ...)
  2024-10-29 13:42         ` [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
@ 2024-10-29 13:42         ` Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, Spike Du
  Cc: dev, rasland, stable

Fixes: 72d7efe464b1 ("common/mlx5: share interrupt management")
Cc: stable@dpdk.org

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_os.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 695936f634..4537ca0466 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3071,10 +3071,15 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
+	int fd;
+
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
 					  mlx5_dev_interrupt_handler, sh);
+	fd = rte_intr_fd_get(sh->intr_handle_nl);
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
 					  mlx5_dev_interrupt_handler_nl, sh);
+	if (fd >= 0)
+		close(fd);
 #ifdef HAVE_IBV_DEVX_ASYNC
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
 					  mlx5_dev_interrupt_handler_devx, sh);
-- 
2.34.1



* [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                           ` (3 preceding siblings ...)
  2024-10-29 13:42         ` [PATCH V3 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
@ 2024-10-29 13:42         ` Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

RDMA monitor is a new feature introduced by the kernel driver. This commit
adds backward compatibility for kernels that do not support it.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
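Reader note: the fallback pattern used here can be sketched as below.
The HAVE_* macros are expected to come from the meson has_sym_args
probes. The RDMA names appear to be enum constants rather than
preprocessor macros in rdma/rdma_netlink.h, which would explain why a
plain "#ifndef RDMA_..." test is not used. The sketch does not include
the kernel header, so the fallback branch is always taken.
nl_group_to_mask is a hypothetical helper, not part of the patch.

```c
#include <stdint.h>

/* Fallback values copied from this patch; used only when the header lacks them. */
#ifndef HAVE_RDMA_NLDEV_CMD_MONITOR
#define RDMA_NLDEV_CMD_MONITOR 28
#endif
#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
#define RDMA_NL_GROUP_NOTIFY 4
#endif

/*
 * Convert a 1-based Netlink multicast group number into the legacy
 * bitmask form accepted when binding a socket. Returns 0 for group
 * numbers that do not fit in the 32-bit mask.
 */
static inline uint32_t
nl_group_to_mask(unsigned int group)
{
	if (group == 0 || group > 32)
		return 0;
	return UINT32_C(1) << (group - 1);
}
```

For the notify group this gives the same value as the
RDMA_NL_GROUP_NOTIFICATION macro introduced in patch 6/7 of this series.
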
 drivers/common/mlx5/linux/meson.build | 10 ++++++++++
 drivers/common/mlx5/linux/mlx5_nl.c   | 17 +++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 82e8046e0c..58d0328c6d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -170,6 +170,16 @@ has_sym_args = [
             'RDMA_NLDEV_ATTR_PORT_STATE' ],
         [ 'HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX', 'rdma/rdma_netlink.h',
             'RDMA_NLDEV_ATTR_NDEV_INDEX' ],
+        [ 'HAVE_RDMA_NL_GROUP_NOTIFY', 'rdma/rdma_netlink.h',
+            'RDMA_NL_GROUP_NOTIFY' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_SYS_GET', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_SYS_GET' ],
+        [ 'HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_SYS_ATTR_MONITOR_MODE' ],
+        [ 'HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_ATTR_EVENT_TYPE' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_MONITOR', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_MONITOR' ],
         [ 'HAVE_MLX5_DR_FLOW_DUMP', 'infiniband/mlx5dv.h',
             'mlx5dv_dump_dr_domain'],
         [ 'HAVE_MLX5_DR_CREATE_ACTION_FLOW_SAMPLE', 'infiniband/mlx5dv.h',
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index 745e443f8f..e03db4f918 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -84,6 +84,23 @@
 #ifndef HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX
 #define RDMA_NLDEV_ATTR_NDEV_INDEX 50
 #endif
+#ifndef HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE
+#define RDMA_NLDEV_ATTR_EVENT_TYPE 102
+#define RDMA_NETDEV_ATTACH_EVENT 2
+#define RDMA_NETDEV_DETACH_EVENT 3
+#endif
+#ifndef HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE
+#define RDMA_NLDEV_SYS_ATTR_MONITOR_MODE 103
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_MONITOR
+#define RDMA_NLDEV_CMD_MONITOR 28
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_SYS_GET
+#define RDMA_NLDEV_CMD_SYS_GET 6
+#endif
+#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
+#define RDMA_NL_GROUP_NOTIFY 4
+#endif
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
-- 
2.34.1



* [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                           ` (4 preceding siblings ...)
  2024-10-29 13:42         ` [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
@ 2024-10-29 13:42         ` Minggang Li(Gavin)
  2024-10-29 13:42         ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

Previously, port information changes, such as port addition and deletion,
were tracked via route Netlink. The events used were link up/down rather
than dedicated port add/delete events, so this approach did not perform well.

To improve performance, use RDMA monitor events to track port addition
and deletion and update the corresponding port information.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
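Reader note: the cache update performed on attach and detach events can
be sketched as below. This is a simplified model of the logic in
mlx5_dev_interrupt_handler_ib(), not the driver code. All type and
function names (port_cache, cache_apply_event, MAX_PORTS) are
hypothetical, and treating port index 0 as out of range is a choice of
this sketch.

```c
#include <stdint.h>
#include <string.h>

#define MAX_PORTS 8 /* hypothetical bound for the sketch */

enum ev_type { EV_ATTACH = 1, EV_DETACH = 2 };

struct port_entry {
	uint32_t ifindex; /* netdev index bound to this IB port, 0 if none */
	uint8_t valid;    /* entry holds trustworthy data */
};

struct port_cache {
	uint32_t port_num;                     /* highest known port index */
	struct port_entry port[MAX_PORTS + 1]; /* 1-based, slot 0 unused */
};

/* Drop every entry so later lookups re-query the kernel. */
static void
cache_flush(struct port_cache *c)
{
	memset(c->port, 0, sizeof(c->port));
}

static void
cache_apply_event(struct port_cache *c, enum ev_type ev,
		  uint32_t portnum, uint32_t ifindex)
{
	/* An unknown port means the port layout changed: trust nothing. */
	if (portnum == 0 || portnum > c->port_num || portnum > MAX_PORTS) {
		cache_flush(c);
		return;
	}
	if (ev == EV_ATTACH) {
		if (c->port[portnum].ifindex == 0) {
			c->port[portnum].ifindex = ifindex;
			c->port[portnum].valid = 1;
		} else if (c->port[portnum].ifindex != ifindex) {
			/* Conflicting duplicate: the cache is stale. */
			cache_flush(c);
		}
	} else if (ev == EV_DETACH) {
		memset(&c->port[portnum], 0, sizeof(c->port[portnum]));
	}
}
```

Flushing the whole cache on any inconsistency trades a few extra
Netlink queries for never returning a stale ifindex from a slot that
can no longer be trusted.
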
 doc/guides/nics/mlx5.rst                |  6 ++
 drivers/common/mlx5/linux/mlx5_nl.c     | 74 ++++++++++++++++++-----
 drivers/common/mlx5/linux/mlx5_nl.h     | 28 +++++++++
 drivers/common/mlx5/version.map         |  2 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c | 79 +++++++++++++++++++++++++
 drivers/net/mlx5/linux/mlx5_os.c        | 20 +++++++
 drivers/net/mlx5/mlx5.h                 |  2 +
 7 files changed, 195 insertions(+), 16 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 981401a9f2..1a9ec1bd62 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1443,6 +1443,12 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 0.
 
+  .. note::
+
+    There is a race condition in port probing if probe_opt_en is set to 1.
+    Port probe may fail with a wrong ifindex in the cache while the interrupt
+    thread is updating the cache. Please try again if port probe fails.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e03db4f918..ce1c2a8e75 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -101,6 +101,7 @@
 #ifndef HAVE_RDMA_NL_GROUP_NOTIFY
 #define RDMA_NL_GROUP_NOTIFY 4
 #endif
+#define RDMA_NL_GROUP_NOTIFICATION (1 << (RDMA_NL_GROUP_NOTIFY - 1))
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
@@ -176,22 +177,6 @@ struct mlx5_nl_mac_addr {
 	int mac_n; /**< Number of addresses in the array. */
 };
 
-#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
-#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
-#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
-#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
-#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
-
-/** Data structure used by mlx5_nl_cmdget_cb(). */
-struct mlx5_nl_port_info {
-	const char *name; /**< IB device name (in). */
-	uint32_t flags; /**< found attribute flags (out). */
-	uint32_t ibindex; /**< IB device index (out). */
-	uint32_t ifindex; /**< Network interface index (out). */
-	uint32_t portnum; /**< IB device max port number (out). */
-	uint16_t state; /**< IB device port state (out). */
-};
-
 RTE_ATOMIC(uint32_t) atomic_sn;
 
 /* Generate Netlink sequence number. */
@@ -2110,3 +2095,60 @@ mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id, const char *pci_ad
 		*enable ? "en" : "dis", pci_addr);
 	return ret;
 }
+
+int
+mlx5_nl_rdma_monitor_init(void)
+{
+	return mlx5_nl_init(NETLINK_RDMA, RDMA_NL_GROUP_NOTIFICATION);
+}
+
+void
+mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t event_type = 0;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_MONITOR))
+		goto error;
+
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_ATTR_EVENT_TYPE:
+			event_type = *(uint8_t *)payload;
+			if (event_type == RDMA_NETDEV_ATTACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_ATTACH_EVENT;
+			} else if (event_type == RDMA_NETDEV_DETACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_DETACH_EVENT;
+			}
+			break;
+		case RDMA_NLDEV_ATTR_DEV_INDEX:
+			data->ibindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_IB_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_PORT_INDEX:
+			data->portnum = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_PORT_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_NDEV_INDEX:
+			data->ifindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_NET_INDEX;
+			break;
+		default:
+			DRV_LOG(DEBUG, "Unknown attribute[%d] found", na->nla_type);
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return;
+
+error:
+	rte_errno = EINVAL;
+}
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 396ffc98ce..e32080fa63 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -32,6 +32,27 @@ struct mlx5_nl_vlan_vmwa_context {
 	struct mlx5_nl_vlan_dev vlan_dev[4096];
 };
 
+#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
+#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
+#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
+#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
+#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
+#define MLX5_NL_CMD_GET_EVENT_TYPE (1 << 5)
+
+/** Data structure used by mlx5_nl_cmdget_cb(). */
+struct mlx5_nl_port_info {
+	const char *name; /**< IB device name (in). */
+	uint32_t flags; /**< found attribute flags (out). */
+	uint32_t ibindex; /**< IB device index (out). */
+	uint32_t ifindex; /**< Network interface index (out). */
+	uint32_t portnum; /**< IB device max port number (out). */
+	uint16_t state; /**< IB device port state (out). */
+	uint8_t event_type; /**< IB RDMA event type (out). */
+};
+
+#define MLX5_NL_RDMA_NETDEV_ATTACH_EVENT (1)
+#define MLX5_NL_RDMA_NETDEV_DETACH_EVENT (2)
+
 __rte_internal
 int mlx5_nl_init(int protocol, int groups);
 __rte_internal
@@ -89,4 +110,11 @@ __rte_internal
 int mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id,
 				      const char *pci_addr, int *enable);
 
+__rte_internal
+int mlx5_nl_rdma_monitor_init(void);
+__rte_internal
+void mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data);
+__rte_internal
+int mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap);
+
 #endif /* RTE_PMD_MLX5_NL_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a2f72ef46a..5230576006 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -146,6 +146,8 @@ INTERNAL {
 	mlx5_nl_vf_mac_addr_modify; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_create; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 88d3c57c6e..5156d96b3a 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -894,6 +894,85 @@ mlx5_dev_interrupt_handler_devx(void *cb_arg)
 #endif /* HAVE_IBV_DEVX_ASYNC */
 }
 
+static void
+mlx5_dev_interrupt_ib_cb(struct nlmsghdr *hdr, void *cb_arg)
+{
+	mlx5_nl_rdma_monitor_info_get(hdr, (struct mlx5_nl_port_info *)cb_arg);
+}
+
+void
+mlx5_dev_interrupt_handler_ib(void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_nl_port_info data = {
+		.flags = 0,
+		.name = "",
+		.ifindex = 0,
+		.ibindex = 0,
+		.portnum = 0,
+	};
+	int nlsk_fd = rte_intr_fd_get(sh->intr_handle_ib);
+	struct mlx5_dev_info *dev_info;
+	uint32_t i;
+
+	dev_info = &sh->cdev->dev_info;
+	DRV_LOG(DEBUG, "IB device %s received RDMA monitor netlink event", dev_info->ibname);
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	if (nlsk_fd < 0)
+		return;
+
+	if (mlx5_nl_read_events(nlsk_fd, mlx5_dev_interrupt_ib_cb, &data) < 0)
+		DRV_LOG(ERR, "Failed to process Netlink events: %s",
+			rte_strerror(rte_errno));
+
+	if (!(data.flags & MLX5_NL_CMD_GET_EVENT_TYPE) ||
+		!(data.flags & MLX5_NL_CMD_GET_PORT_INDEX) ||
+		!(data.flags & MLX5_NL_CMD_GET_IB_INDEX))
+		return;
+
+	if (data.ibindex != dev_info->ibindex)
+		return;
+
+	if (data.event_type != MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+		data.event_type != MLX5_NL_RDMA_NETDEV_DETACH_EVENT)
+		return;
+
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+	    !(data.flags & MLX5_NL_CMD_GET_NET_INDEX))
+		return;
+
+	DRV_LOG(DEBUG, "Event info: type %d, ibindex %u, ifindex %u, portnum %u",
+		data.event_type, data.ibindex, data.ifindex, data.portnum);
+
+	/* Changes found in number of SF/VF ports. All information is likely unreliable. */
+	if (data.portnum > dev_info->port_num) {
+		DRV_LOG(ERR, "Port[%d] exceeds maximum[%d]", data.portnum, dev_info->port_num);
+		goto flush_all;
+	}
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT) {
+		if (!dev_info->port_info[data.portnum].ifindex) {
+			dev_info->port_info[data.portnum].ifindex = data.ifindex;
+			dev_info->port_info[data.portnum].valid = 1;
+		} else {
+			DRV_LOG(WARNING, "Duplicate RDMA event for port[%d] ifindex[%d]",
+				data.portnum, data.ifindex);
+			if (data.ifindex != dev_info->port_info[data.portnum].ifindex)
+				goto flush_all;
+		}
+	} else if (data.event_type == MLX5_NL_RDMA_NETDEV_DETACH_EVENT) {
+		memset(dev_info->port_info + data.portnum, 0, sizeof(struct mlx5_port_nl_info));
+	}
+	return;
+
+flush_all:
+	for (i = 1; i <= dev_info->port_num; i++) {
+		dev_info->port_info[i].ifindex = 0;
+		dev_info->port_info[i].valid = 0;
+	}
+}
+
 /**
  * DPDK callback to bring the link DOWN.
  *
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 4537ca0466..16b275c71e 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3025,6 +3025,21 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+		nlsk_fd = mlx5_nl_rdma_monitor_init();
+		if (nlsk_fd < 0) {
+			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
+				rte_strerror(rte_errno));
+			return;
+		}
+		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+		if (sh->intr_handle_ib == NULL) {
+			DRV_LOG(ERR, "Fail to allocate intr_handle");
+			return;
+		}
+	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
 		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
@@ -3086,6 +3101,11 @@ mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
+	fd = rte_intr_fd_get(sh->intr_handle_ib);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_ib,
+				  mlx5_dev_interrupt_handler_ib, sh);
+	if (fd >= 0)
+		close(fd);
 }
 
 /**
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 503366580b..adc21c272b 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1574,6 +1574,7 @@ struct mlx5_dev_ctx_shared {
 	struct rte_intr_handle *intr_handle; /* Interrupt handler for device. */
 	struct rte_intr_handle *intr_handle_devx; /* DEVX interrupt handler. */
 	struct rte_intr_handle *intr_handle_nl; /* Netlink interrupt handler. */
+	struct rte_intr_handle *intr_handle_ib; /* Interrupt handler for IB device. */
 	void *devx_comp; /* DEVX async comp obj. */
 	struct mlx5_devx_obj *tis[16]; /* TIS object. */
 	struct mlx5_devx_obj *td; /* Transport domain. */
@@ -2274,6 +2275,7 @@ int mlx5_dev_set_flow_ctrl(struct rte_eth_dev *dev,
 void mlx5_dev_interrupt_handler(void *arg);
 void mlx5_dev_interrupt_handler_devx(void *arg);
 void mlx5_dev_interrupt_handler_nl(void *arg);
+void mlx5_dev_interrupt_handler_ib(void *arg);
 int mlx5_set_link_down(struct rte_eth_dev *dev);
 int mlx5_set_link_up(struct rte_eth_dev *dev);
 int mlx5_is_removed(struct rte_eth_dev *dev);
-- 
2.34.1



* [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                           ` (5 preceding siblings ...)
  2024-10-29 13:42         ` [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
@ 2024-10-29 13:42         ` Minggang Li(Gavin)
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-29 16:26           ` Stephen Hemminger
  6 siblings, 2 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 13:42 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

Fall back to the old way of updating port information if the kernel
driver does not support the RDMA monitor.
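
The fallback decision can be sketched as a small pure function. This is
only an illustration, assuming the three inputs the patch checks (the
probe_opt setting, the port count and the queried monitor capability);
the enum and function names are hypothetical and do not exist in the
driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names, used only for this illustration. */
enum port_info_source {
	PORT_INFO_NONE,          /* port cache is not maintained */
	PORT_INFO_RDMA_MONITOR,  /* updated from RDMA monitor events */
	PORT_INFO_ROUTE_NETLINK, /* old way: RTM_NEWLINK/RTM_DELLINK events */
};

/*
 * Pick the event source that keeps the cached port information fresh.
 * The cache is only used when probe optimization is enabled and the
 * IB device exposes more than one port.
 */
static enum port_info_source
select_port_info_source(bool probe_opt, uint32_t port_num,
			bool rdma_monitor_supp)
{
	if (!probe_opt || port_num <= 1)
		return PORT_INFO_NONE;
	return rdma_monitor_supp ? PORT_INFO_RDMA_MONITOR :
				   PORT_INFO_ROUTE_NETLINK;
}
```

With probe_opt set and 300 ports, the sketch returns
PORT_INFO_RDMA_MONITOR when the kernel reports the capability and
PORT_INFO_ROUTE_NETLINK otherwise.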

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/rel_notes/release_24_11.rst  | 14 +++++
 drivers/common/mlx5/linux/mlx5_nl.c     | 73 +++++++++++++++++++++++++
 drivers/common/mlx5/version.map         |  1 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  2 +-
 drivers/net/mlx5/linux/mlx5_os.c        | 27 +++++++--
 drivers/net/mlx5/mlx5.h                 |  1 +
 6 files changed, 111 insertions(+), 7 deletions(-)

diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..6fc32ff8a4 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -247,6 +247,20 @@ New Features
   Added ability for node to advertise and update multiple xstat counters,
   that can be retrieved using ``rte_graph_cluster_stats_get``.
 
+* **Updated NVIDIA mlx5 driver.**
+
+  Optimized port probe in large scale.
+  This feature enhances the efficiency of probing VF/SFs on a large scale
+  by significantly reducing the probing time. To activate this feature,
+  set ``probe_opt_en`` to a non-zero value during device probing. It
+  leverages a capability from the RDMA driver, expected to be released in
+  the upcoming kernel version 6.13 or its equivalent in OFED 24.10,
+  specifically the RDMA monitor. For additional details on the limitations
+  of devargs, refer to "doc/guides/nics/mlx5.rst".
+
+  If there are lots of VFs/SFs to be probed by the application, eg, 300
+  VFs/SFs, the option should be enabled to save probing time.
+
 
 Removed Items
 -------------
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index ce1c2a8e75..12f1a620f3 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -2152,3 +2152,76 @@ mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *da
 error:
 	rte_errno = EINVAL;
 }
+
+static int
+mlx5_nl_rdma_monitor_cap_get_cb(struct nlmsghdr *hdr, void *arg)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t *cap = arg;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_SYS_GET))
+		goto error;
+
+	*cap = 0;
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_SYS_ATTR_MONITOR_MODE:
+			*cap = *(uint8_t *)payload;
+			return 0;
+		default:
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return 0;
+
+error:
+	return -EINVAL;
+}
+
+/**
+ * Get RDMA monitor support in driver.
+ *
+ *
+ * @param nl
+ *   Netlink socket of the RDMA kind (NETLINK_RDMA).
+ * @param[out] cap
+ *   Pointer to the RDMA monitor capability (non-zero if supported).
+ * @return
+ *   0 on success, negative on error and rte_errno is set.
+ */
+int
+mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap)
+{
+	union {
+		struct nlmsghdr nh;
+		uint8_t buf[NLMSG_HDRLEN];
+	} req = {
+		.nh = {
+			.nlmsg_len = NLMSG_LENGTH(0),
+			.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
+						       RDMA_NLDEV_CMD_SYS_GET),
+			.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
+		},
+	};
+	uint32_t sn = MLX5_NL_SN_GENERATE;
+	int ret;
+
+	ret = mlx5_nl_send(nl, &req.nh, sn);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	ret = mlx5_nl_recv(nl, sn, mlx5_nl_rdma_monitor_cap_get_cb, cap);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	return 0;
+}
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index 5230576006..8301485839 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -148,6 +148,7 @@ INTERNAL {
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_cap_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5156d96b3a..6b2c25a7c2 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -736,7 +736,7 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1 && !sh->rdma_monitor_supp)
 		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 16b275c71e..d3fd77af58 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3017,6 +3017,7 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
 	struct ibv_context *ctx = sh->cdev->ctx;
 	int nlsk_fd;
+	uint8_t rdma_monitor_supp = 0;
 
 	sh->intr_handle = mlx5_os_interrupt_handler_create
 		(RTE_INTR_INSTANCE_F_SHARED, true,
@@ -3025,20 +3026,34 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+	if (sh->cdev->config.probe_opt &&
+	    sh->cdev->dev_info.port_num > 1 &&
+	    !sh->rdma_monitor_supp) {
 		nlsk_fd = mlx5_nl_rdma_monitor_init();
 		if (nlsk_fd < 0) {
 			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
 				rte_strerror(rte_errno));
 			return;
 		}
-		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
-			(RTE_INTR_INSTANCE_F_SHARED, true,
-			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
-		if (sh->intr_handle_ib == NULL) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
+		if (mlx5_nl_rdma_monitor_cap_get(nlsk_fd, &rdma_monitor_supp)) {
+			DRV_LOG(ERR, "Failed to query RDMA monitor support: %s",
+				rte_strerror(rte_errno));
+			close(nlsk_fd);
 			return;
 		}
+		sh->rdma_monitor_supp = rdma_monitor_supp;
+		if (sh->rdma_monitor_supp) {
+			sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+				(RTE_INTR_INSTANCE_F_SHARED, true,
+				 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+			if (sh->intr_handle_ib == NULL) {
+				DRV_LOG(ERR, "Fail to allocate intr_handle");
+				close(nlsk_fd);
+				return;
+			}
+		} else {
+			close(nlsk_fd);
+		}
 	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index adc21c272b..b6be4646ef 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1517,6 +1517,7 @@ struct mlx5_dev_ctx_shared {
 	uint32_t lag_rx_port_affinity_en:1;
 	/* lag_rx_port_affinity is supported. */
 	uint32_t hws_max_log_bulk_sz:5;
+	uint32_t rdma_monitor_supp:1;
 	/* Log of minimal HWS counters created hard coded. */
 	uint32_t hws_max_nb_counters; /* Maximal number for HWS counters. */
 	uint32_t max_port; /* Maximal IB device port index. */
-- 
2.34.1



* [PATCH V3 0/7] port probe time optimization
  2024-10-29 13:42         ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
@ 2024-10-29 14:31           ` Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
                               ` (6 more replies)
  2024-10-29 16:26           ` Stephen Hemminger
  1 sibling, 7 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

This patch series introduces a feature that greatly reduces the time
needed to probe VFs/SFs at large scale, e.g. hundreds of VFs/SFs. The
feature is controlled through the "probe_opt_en" device argument; setting
it to a non-zero value enables the optimization when probing a device.
It relies on an RDMA driver feature, the RDMA monitor, which is to be
released in the upcoming upstream kernel 6.12 or the equivalent in
OFED 24.10. For further information on the devargs limitations, see
"doc/guides/nics/mlx5.rst".

Minggang Li(Gavin) (5):
  mailmap: update user name
  common/mlx5: fix Netlink socket leak
  common/mlx5: add RDMA monitor event awareness
  mlx5: use RDMA Netlink to update port information
  mlx5: add backward compatibility for RDMA monitor
---
changelog:
v1->v2
        - add feature doc and upstream kernel dependency in release notes
v2->v3
        - revise release notes
v3->v4
        - change kernel dependency to 6.12
---


Rongwei Liu (2):
  net/mlx5: optimize device probing
  net/mlx5: add new devargs to control probe optimization

 .mailmap                                     |   2 +-
 doc/guides/nics/mlx5.rst                     |  13 +
 doc/guides/rel_notes/release_24_11.rst       |  14 +
 drivers/common/mlx5/linux/meson.build        |  10 +
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 +
 drivers/common/mlx5/linux/mlx5_nl.c          | 262 ++++++++++++++++---
 drivers/common/mlx5/linux/mlx5_nl.h          |  36 ++-
 drivers/common/mlx5/mlx5_common.c            |  20 ++
 drivers/common/mlx5/mlx5_common.h            |  15 ++
 drivers/common/mlx5/version.map              |   3 +
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      | 136 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 144 ++++++++--
 drivers/net/mlx5/linux/mlx5_os.h             |   6 -
 drivers/net/mlx5/mlx5.h                      |   3 +
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 16 files changed, 605 insertions(+), 75 deletions(-)

-- 
2.34.1



* [PATCH V3 1/7] mailmap: update user name
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
@ 2024-10-29 14:31             ` Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
                               ` (5 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas; +Cc: dev, rasland

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
---
 .mailmap | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index 504c390f0f..0ce965084c 100644
--- a/.mailmap
+++ b/.mailmap
@@ -462,7 +462,6 @@ Gary Mussar <gmussar@ciena.com>
 Gaurav Singh <gaurav1086@gmail.com>
 Gautam Dawar <gdawar@solarflare.com>
 Gavin Hu <gavin.hu@arm.com> <gavin.hu@linaro.org>
-Gavin Li <gavinl@nvidia.com>
 Geoffrey Le Gourriérec <geoffrey.le_gourrierec@6wind.com>
 Geoffrey Lv <geoffrey.lv@gmail.com>
 Geoff Thorpe <geoff.thorpe@nxp.com>
@@ -1024,6 +1023,7 @@ Mike Ximing Chen <mike.ximing.chen@intel.com>
 Milena Olech <milena.olech@intel.com>
 Min Cao <min.cao@intel.com>
 Minghuan Lian <minghuan.lian@nxp.com>
+Minggang Li(Gavin) <gavinl@nvidia.com>
 Mingjin Ye <mingjinx.ye@intel.com>
 Mingshan Zhang <mingshan.zhang@intel.com>
 Mingxia Liu <mingxia.liu@intel.com>
-- 
2.34.1



* [PATCH V3 2/7] net/mlx5: optimize device probing
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
@ 2024-10-29 14:31             ` Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
                               ` (4 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

Current DPDK probing logic is:
1. Query IB device index and total port number.
2. Query each port information by traversing the port index and
   get the port's ifindex, name and state information etc.
3. Compare the information with devargs until getting matched.
4. For each probing device, repeat steps 2 and 3.

Step 2 communicates with the kernel via netlink, which is time-consuming.
There is no need to repeat the netlink communication for each probed
device: the PMD can traverse all ports once and save the information in
a caching structure.

Introduce the device information caching in the mlx5 common device
handle and cache the port number, ibindex, port ifindex.
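
The caching idea can be sketched as follows. This is a simplified
illustration, not the driver code: the struct and function names are
hypothetical, and the netlink query is replaced by a caller-supplied
callback (counting_query is a stand-in that only counts its calls).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-port cache entry; ports are indexed 1..port_num. */
struct port_entry {
	uint32_t ifindex; /* 0 means the port index is unused */
	uint8_t valid;    /* non-zero once the port has been queried */
};

struct dev_cache {
	uint32_t port_num;
	struct port_entry *ports; /* port_num + 1 entries, slot 0 unused */
};

/* Stand-in for the expensive netlink query. */
typedef uint32_t (*port_query_fn)(uint32_t pindex, void *ctx);

static uint32_t
counting_query(uint32_t pindex, void *ctx)
{
	(*(unsigned int *)ctx)++; /* count how often the "kernel" is asked */
	return pindex + 100;      /* made-up ifindex for the demo */
}

/* Return the ifindex of a port, asking the kernel only on a cache miss. */
static uint32_t
cached_ifindex(struct dev_cache *c, uint32_t pindex,
	       port_query_fn query, void *ctx)
{
	int in_range = c->ports != NULL && pindex >= 1 &&
		       pindex <= c->port_num;
	uint32_t ifindex;

	if (in_range && c->ports[pindex].valid)
		return c->ports[pindex].ifindex;
	ifindex = query(pindex, ctx);
	if (in_range) {
		c->ports[pindex].ifindex = ifindex;
		c->ports[pindex].valid = 1;
	}
	return ifindex;
}
```

Calling cached_ifindex() twice for the same port invokes the query
callback only once; the second call is served from the cache.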

For dynamic interface changes:
1. Adding a VF by toggling switchdev mode requires restarting DPDK,
   because the SR-IOV configuration changes.
2. Changing the number of VFs without toggling switchdev mode triggers
   RTM_DELLINK and RTM_NEWLINK events. All cached information is
   cleared.
3. Adding an SF triggers an RTM_NEWLINK event whose message carries no
   port index information. All free entries (ifindex = 0) in the cache
   are invalidated.
4. Deleting an SF triggers an RTM_DELLINK event. The cache entries are
   traversed and the one with the same ifindex is invalidated.
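
Cases 3 and 4 above can be sketched like this. It is a simplified
illustration with hypothetical names; in particular it assumes every
unknown new link is an SF/VF, whereas the patch first checks the
interface type before invalidating anything.

```c
#include <stdint.h>
#include <string.h>

enum link_event { LINK_NEW, LINK_DEL }; /* stand-ins for RTM_NEWLINK/RTM_DELLINK */

/* Hypothetical per-port cache entry; ports are indexed 1..port_num. */
struct port_entry {
	uint32_t ifindex; /* 0 means no netdev is attached */
	uint8_t valid;    /* non-zero while the entry can be trusted */
};

static void
cache_on_link_event(struct port_entry *ports, uint32_t port_num,
		    uint32_t ifindex, enum link_event ev)
{
	uint32_t i, hit = 0;

	/* Look for a valid entry that already tracks this ifindex. */
	for (i = 1; i <= port_num; i++) {
		if (ports[i].valid && ports[i].ifindex == ifindex) {
			hit = i;
			break;
		}
	}
	if (ev == LINK_DEL && hit != 0) {
		/* Case 4: drop only the entry of the deleted interface. */
		memset(&ports[hit], 0, sizeof(ports[hit]));
	} else if (ev == LINK_NEW && hit == 0) {
		/*
		 * Case 3: the event does not say which port the new
		 * interface landed on, so every free entry must be
		 * queried again.
		 */
		for (i = 1; i <= port_num; i++) {
			if (ports[i].ifindex == 0)
				ports[i].valid = 0;
		}
	}
}
```

Entries that still hold a non-zero ifindex stay valid after an unknown
LINK_NEW event, so only the free slots cost another netlink query on the
next probe.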

Race conditions between the probing thread and the interrupt thread are
not considered.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.h   |   6 ++
 drivers/common/mlx5/linux/mlx5_nl.c          |  94 +++++++++++++----
 drivers/common/mlx5/linux/mlx5_nl.h          |   8 +-
 drivers/common/mlx5/mlx5_common.c            |   5 +
 drivers/common/mlx5/mlx5_common.h            |  13 +++
 drivers/common/mlx5/windows/mlx5_common_os.h |   5 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c      |  54 ++++++++++
 drivers/net/mlx5/linux/mlx5_os.c             | 104 ++++++++++++++-----
 drivers/net/mlx5/linux/mlx5_os.h             |   6 --
 drivers/net/mlx5/windows/mlx5_os.h           |   5 -
 10 files changed, 242 insertions(+), 58 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.h b/drivers/common/mlx5/linux/mlx5_common_os.h
index e8aa1d46ec..2e2c54f1fa 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.h
+++ b/drivers/common/mlx5/linux/mlx5_common_os.h
@@ -22,6 +22,12 @@
 #include "mlx5_glue.h"
 #include "mlx5_malloc.h"
 
+/* verb enumerations translations to local enums. */
+enum {
+	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
+	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
+};
+
 /**
  * Get device name. Given an ibv_device pointer - return a
  * pointer to the corresponding device name.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index a5ac4dc543..e98073aafe 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1073,16 +1073,18 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret;
 
-	ret = mlx5_nl_send(nl, &req.nh, sn);
-	if (ret < 0)
-		return ret;
-	ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
-	if (ret < 0)
-		return ret;
-	if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
-	    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
-		goto error;
-	data->flags = 0;
+	if (data->ibindex == UINT32_MAX) {
+		ret = mlx5_nl_send(nl, &req.nh, sn);
+		if (ret < 0)
+			return ret;
+		ret = mlx5_nl_recv(nl, sn, mlx5_nl_cmdget_cb, data);
+		if (ret < 0)
+			return ret;
+		if (!(data->flags & MLX5_NL_CMD_GET_IB_NAME) ||
+		    !(data->flags & MLX5_NL_CMD_GET_IB_INDEX))
+			goto error;
+		data->flags = 0;
+	}
 	sn = MLX5_NL_SN_GENERATE;
 	req.nh.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
 					     RDMA_NLDEV_CMD_PORT_GET);
@@ -1109,7 +1111,7 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
 	    !(data->flags & MLX5_NL_CMD_GET_NET_INDEX) ||
 	    !data->ifindex)
 		goto error;
-	return 1;
+	return 0;
 error:
 	rte_errno = ENODEV;
 	return -rte_errno;
@@ -1128,21 +1130,48 @@ mlx5_nl_port_info(int nl, uint32_t pindex, struct mlx5_nl_port_info *data)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   A valid (nonzero) interface index on success, 0 otherwise and rte_errno
  *   is set.
  */
 unsigned int
-mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
+mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
+	int ret;
+
 	struct mlx5_nl_port_info data = {
 			.ifindex = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
+			.flags = 0,
 	};
 
-	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
-		return 0;
-	return data.ifindex;
+	if (!strcmp(name, dev_info->ibname)) {
+		if (dev_info->port_info && pindex <= dev_info->port_num &&
+		    dev_info->port_info[pindex].valid) {
+			if (!dev_info->port_info[pindex].ifindex)
+				rte_errno = ENODEV;
+			return dev_info->port_info[pindex].ifindex;
+		}
+		if (dev_info->port_num)
+			data.ibindex = dev_info->ibindex;
+	}
+
+	ret = mlx5_nl_port_info(nl, pindex, &data);
+
+	if (!strcmp(dev_info->ibname, name)) {
+		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
+		    pindex <= dev_info->port_num) {
+			if (!ret)
+				dev_info->port_info[pindex].ifindex = data.ifindex;
+			/* -ENODEV means the pindex is unused but still valid case */
+			dev_info->port_info[pindex].valid = 1;
+		}
+	}
+
+	return ret ? 0 : data.ifindex;
 }
 
 /**
@@ -1157,18 +1186,23 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex)
  *   IB device name.
  * @param[in] pindex
  *   IB device port index, starting from 1
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @return
  *   Port state (ibv_port_state) on success, negative on error
  *   and rte_errno is set.
  */
 int
-mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
+mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 			.state = 0,
 			.name = name,
+			.ibindex = UINT32_MAX,
 	};
 
+	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
 	if ((data.flags & MLX5_NL_CMD_GET_PORT_STATE) == 0) {
@@ -1185,13 +1219,15 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex)
  *   Netlink socket of the RDMA kind (NETLINK_RDMA).
  * @param[in] name
  *   IB device name.
+ * @param[in] dev_info
+ *   Cached mlx5 device info.
  *
  * @return
  *   A valid (nonzero) number of ports on success, 0 otherwise
  *   and rte_errno is set.
  */
 unsigned int
-mlx5_nl_portnum(int nl, const char *name)
+mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 {
 	struct mlx5_nl_port_info data = {
 		.flags = 0,
@@ -1206,7 +1242,10 @@ mlx5_nl_portnum(int nl, const char *name)
 		.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP,
 	};
 	uint32_t sn = MLX5_NL_SN_GENERATE;
-	int ret;
+	int ret, size;
+
+	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
 	if (ret < 0)
@@ -1220,8 +1259,25 @@ mlx5_nl_portnum(int nl, const char *name)
 		rte_errno = ENODEV;
 		return 0;
 	}
-	if (!data.portnum)
+	if (!data.portnum) {
 		rte_errno = EINVAL;
+		return 0;
+	}
+	MLX5_ASSERT(!strlen(dev_info->ibname));
+	dev_info->port_num = data.portnum;
+	dev_info->ibindex = data.ibindex;
+	snprintf(dev_info->ibname, MLX5_FS_NAME_MAX, "%s", name);
+	if (data.portnum > 1) {
+		size = (data.portnum + 1) * sizeof(struct mlx5_port_nl_info);
+		dev_info->port_info = mlx5_malloc(MLX5_MEM_ZERO | MLX5_MEM_RTE, size,
+						  RTE_CACHE_LINE_SIZE,
+						  SOCKET_ID_ANY);
+		if (dev_info->port_info == NULL) {
+			memset(dev_info, 0, sizeof(*dev_info));
+			rte_errno = ENOMEM;
+			return 0;
+		}
+	}
 	return data.portnum;
 }
 
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 580de3b769..396ffc98ce 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -11,6 +11,7 @@
 #include <rte_ether.h>
 
 #include "mlx5_common.h"
+#include "mlx5_common_utils.h"
 
 typedef void (mlx5_nl_event_cb)(struct nlmsghdr *hdr, void *user_data);
 
@@ -52,11 +53,12 @@ int mlx5_nl_promisc(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
 int mlx5_nl_allmulti(int nlsk_fd, unsigned int iface_idx, int enable);
 __rte_internal
-unsigned int mlx5_nl_portnum(int nl, const char *name);
+unsigned int mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info);
 __rte_internal
-unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex);
+unsigned int mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex,
+			     struct mlx5_dev_info *info);
 __rte_internal
-int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex);
+int mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info *dev_info);
 __rte_internal
 int mlx5_nl_vf_mac_addr_modify(int nlsk_fd, unsigned int iface_idx,
 			       struct rte_ether_addr *mac, int vf_index);
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index ca8543e36e..0aaae91c31 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -735,6 +735,11 @@ mlx5_common_dev_release(struct mlx5_common_device *cdev)
 		if (TAILQ_EMPTY(&devices_list))
 			rte_mem_event_callback_unregister("MLX5_MEM_EVENT_CB",
 							  NULL);
+		if (cdev->dev_info.port_info != NULL) {
+			mlx5_free(cdev->dev_info.port_info);
+			cdev->dev_info.port_info = NULL;
+		}
+		cdev->dev_info.port_num = 0;
 		mlx5_dev_mempool_unsubscribe(cdev);
 		mlx5_mr_release_cache(&cdev->mr_scache);
 		mlx5_dev_hw_global_release(cdev);
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 1abd1e8239..6cb40f54dd 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -174,6 +174,18 @@ enum mlx5_nl_phys_port_name_type {
 	MLX5_PHYS_PORT_NAME_TYPE_UNKNOWN, /* Unrecognized. */
 };
 
+struct mlx5_port_nl_info {
+	uint32_t ifindex;
+	uint8_t valid;
+};
+
+struct mlx5_dev_info {
+	uint32_t port_num;
+	uint32_t ibindex;
+	char ibname[MLX5_FS_NAME_MAX];
+	struct mlx5_port_nl_info *port_info;
+};
+
 /** Switch information returned by mlx5_nl_switch_info(). */
 struct mlx5_switch_info {
 	uint32_t master:1; /**< Master device. */
@@ -525,6 +537,7 @@ struct mlx5_common_device {
 	uint32_t classes_loaded;
 	void *ctx; /* Verbs/DV/DevX context. */
 	void *pd; /* Protection Domain. */
+	struct mlx5_dev_info dev_info; /* Device port info queried via netlink. */
 	uint32_t pdn; /* Protection Domain Number. */
 	struct mlx5_mr_share_cache mr_scache; /* Global shared MR cache. */
 	struct mlx5_common_dev_config config; /* Device configuration. */
diff --git a/drivers/common/mlx5/windows/mlx5_common_os.h b/drivers/common/mlx5/windows/mlx5_common_os.h
index acee0c987f..65394035de 100644
--- a/drivers/common/mlx5/windows/mlx5_common_os.h
+++ b/drivers/common/mlx5/windows/mlx5_common_os.h
@@ -20,6 +20,11 @@
 
 #define MLX5_BF_OFFSET 0x800
 
+enum {
+	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
+	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
+};
+
 /**
  * This API allocates aligned or non-aligned memory.  The free can be on either
  * aligned or nonaligned memory.  To be protected - even though there may be no
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5d64984022..08ac6dd939 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -23,6 +23,7 @@
 #include <stdalign.h>
 #include <sys/un.h>
 #include <time.h>
+#include <linux/rtnetlink.h>
 
 #include <ethdev_linux_ethtool.h>
 #include <ethdev_driver.h>
@@ -673,6 +674,57 @@ mlx5_link_update_bond(struct rte_eth_dev *dev)
 		((ifr.ifr_flags & IFF_UP) && (ifr.ifr_flags & IFF_RUNNING));
 }
 
+static void
+mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
+			     uint16_t msg_type)
+{
+	struct mlx5_switch_info info = {
+		.master = 0,
+		.representor = 0,
+		.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET,
+		.port_name = 0,
+		.switch_id = 0,
+	};
+	uint32_t i;
+	int nl_route;
+
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].valid)
+			continue;
+		if (dev_info->port_info[i].ifindex == if_index)
+			break;
+	}
+	if (msg_type == RTM_NEWLINK && i > dev_info->port_num) {
+		nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
+		if  (nl_route < 0)
+			goto flush_all;
+
+		if (mlx5_nl_switch_info(nl_route, if_index, &info)) {
+			if (mlx5_sysfs_switch_info(if_index, &info))
+				goto flush_all;
+		}
+
+		if (info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFSF ||
+		    info.name_type == MLX5_PHYS_PORT_NAME_TYPE_PFVF)
+			goto flush_all;
+		close(nl_route);
+	} else if (msg_type == RTM_DELLINK && i <= dev_info->port_num) {
+		memset(dev_info->port_info + i, 0, sizeof(struct mlx5_port_nl_info));
+	}
+
+	return;
+flush_all:
+	if (nl_route >= 0)
+		close(nl_route);
+	for (i = 1; i <= dev_info->port_num; i++) {
+		if (!dev_info->port_info[i].ifindex)
+			dev_info->port_info[i].valid = 0;
+	}
+}
+
 static void
 mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 {
@@ -682,6 +734,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
+	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
 		struct rte_eth_dev *dev;
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 69a80b9ddc..8f6e584154 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1268,7 +1268,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 		/* IB doesn't allow more than 255 ports, must be Ethernet. */
 		err = mlx5_nl_port_state(nl_rdma,
 			spawn->phys_dev_name,
-			spawn->phys_port);
+			spawn->phys_port, &spawn->cdev->dev_info);
 		if (err < 0) {
 			DRV_LOG(INFO, "Failed to get netlink port state: %s",
 				strerror(rte_errno));
@@ -1897,6 +1897,8 @@ mlx5_dev_spawn_data_cmp(const void *a, const void *b)
  *   Netlink RDMA group socket handle.
  * @param[in] owner
  *   Representor owner PF index.
+ * @param[in] dev_info
+ *   Cached mlx5 device information.
  * @param[out] bond_info
  *   Pointer to bonding information.
  *
@@ -1908,6 +1910,7 @@ static int
 mlx5_device_bond_pci_match(const char *ibdev_name,
 			   const struct rte_pci_addr *pci_dev,
 			   int nl_rdma, uint16_t owner,
+			   struct mlx5_dev_info *dev_info,
 			   struct mlx5_bond_info *bond_info)
 {
 	char ifname[IF_NAMESIZE + 1];
@@ -1928,7 +1931,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		return -1;
 	if (!strstr(ibdev_name, "bond"))
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibdev_name);
+	np = mlx5_nl_portnum(nl_rdma, ibdev_name, dev_info);
 	if (!np)
 		return -1;
 	if (mlx5_get_device_guid(pci_dev, cur_guid, sizeof(cur_guid)) < 0)
@@ -1940,7 +1943,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 	 */
 	for (i = 1; i <= np; ++i) {
 		/* Check whether Infiniband port is populated. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibdev_name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -1978,9 +1981,13 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (!file)
 			break;
 		info.name_type = MLX5_PHYS_PORT_NAME_TYPE_NOTSET;
-		if (fscanf(file, "%32s", tmp_str) == 1)
+		if (fscanf(file, "%32s", tmp_str) == 1) {
 			mlx5_translate_port_name(tmp_str, &info);
-		fclose(file);
+			fclose(file);
+		} else {
+			fclose(file);
+			break;
+		}
 		/* Only process PF ports. */
 		if (info.name_type != MLX5_PHYS_PORT_NAME_TYPE_LEGACY &&
 		    info.name_type != MLX5_PHYS_PORT_NAME_TYPE_UPLINK)
@@ -2003,8 +2010,8 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		if (ret != 1)
 			break;
 		/* Save bonding info. */
-		strncpy(bond_info->ports[info.port_name].ifname, ifname,
-			sizeof(bond_info->ports[0].ifname));
+		snprintf(bond_info->ports[info.port_name].ifname,
+			 sizeof(bond_info->ports[0].ifname), "%s", ifname);
 		bond_info->ports[info.port_name].pci_addr = pci_addr;
 		bond_info->ports[info.port_name].ifindex = ifindex;
 		bond_info->n_port++;
@@ -2033,6 +2040,7 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 		      pci_addr.function == owner)))
 			pf = info.port_name;
 	}
+	fclose(bond_file);
 	if (pf >= 0) {
 		/* Get bond interface info */
 		ret = mlx5_sysfs_bond_info(ifindex, &bond_info->ifindex,
@@ -2084,7 +2092,8 @@ mlx5_nl_esw_multiport_get(struct rte_pci_addr *pci_addr, int *enabled)
 #define SYSFS_MPESW_PARAM_MAX_LEN 16
 
 static int
-mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled)
+mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled,
+			     struct mlx5_dev_info *dev_info)
 {
 	int nl_rdma;
 	unsigned int n_ports;
@@ -2096,7 +2105,7 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	if (nl_rdma < 0)
 		return nl_rdma;
-	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name);
+	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!n_ports) {
 		ret = -rte_errno;
 		goto close_nl_rdma;
@@ -2104,12 +2113,12 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	for (i = 1; i <= n_ports; ++i) {
 		unsigned int ifindex;
 		char ifname[IF_NAMESIZE + 1];
-		struct rte_pci_addr if_pci_addr;
+		struct rte_pci_addr if_pci_addr = { 0 };
 		char mpesw[SYSFS_MPESW_PARAM_MAX_LEN + 1];
 		FILE *sysfs;
 		int n;
 
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2151,7 +2160,8 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 }
 
 static int
-mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled)
+mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled,
+		      struct mlx5_dev_info *dev_info)
 {
 	/*
 	 * Try getting Multiport E-Switch state through netlink interface
@@ -2159,7 +2169,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 	 * assume that Multiport E-Switch is disabled and return an error.
 	 */
 	if (mlx5_nl_esw_multiport_get(ibv_pci_addr, enabled) >= 0 ||
-	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled) >= 0)
+	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled, dev_info) >= 0)
 		return 0;
 	DRV_LOG(DEBUG, "Unable to check MPESW state for IB device %s "
 		       "(PCI: " PCI_PRI_FMT ")",
@@ -2173,7 +2183,7 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 static int
 mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 			    const struct rte_pci_addr *owner_pci,
-			    int nl_rdma)
+			    int nl_rdma, struct mlx5_dev_info *dev_info)
 {
 	struct rte_pci_addr ibdev_pci_addr = { 0 };
 	char ifname[IF_NAMESIZE + 1] = { 0 };
@@ -2197,24 +2207,24 @@ mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
 		return -1;
 	}
 	/* Check if IB device has MPESW enabled. */
-	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled))
+	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled, dev_info))
 		return -1;
 	if (!enabled)
 		return -1;
 	/* Iterate through IB ports to find MPESW master uplink port. */
 	if (nl_rdma < 0)
 		return -1;
-	np = mlx5_nl_portnum(nl_rdma, ibv->name);
+	np = mlx5_nl_portnum(nl_rdma, ibv->name, dev_info);
 	if (!np)
 		return -1;
 	for (i = 1; i <= np; ++i) {
-		struct rte_pci_addr pci_addr;
+		struct rte_pci_addr pci_addr = { 0 };
 		FILE *file;
 		char port_name[IF_NAMESIZE + 1];
 		struct mlx5_switch_info	info;
 
 		/* Check whether IB port has a corresponding netdev. */
-		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i, dev_info);
 		if (!ifindex)
 			continue;
 		if (!if_indextoname(ifindex, ifname))
@@ -2321,16 +2331,30 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	 * matching ones, gathering into the list.
 	 */
 	struct ibv_device *ibv_match[ret + 1];
+	struct mlx5_dev_info *info, tmp_info[ret];
 	int nl_route = mlx5_nl_init(NETLINK_ROUTE, 0);
 	int nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
 	unsigned int i;
 
+	memset(tmp_info, 0, sizeof(tmp_info));
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
+		if (cdev->dev_info.port_num) {
+			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
+				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
+					cdev->dev_info.ibname, ibv_list[ret]->name);
+				continue;
+			}
+			info = &cdev->dev_info;
+		} else {
+			info = &tmp_info[ret];
+		}
 		DRV_LOG(DEBUG, "Checking device \"%s\"", ibv_list[ret]->name);
 		bd = mlx5_device_bond_pci_match(ibv_list[ret]->name, &owner_pci,
-						nl_rdma, owner_id, &bond_info);
+						nl_rdma, owner_id,
+						info,
+						&bond_info);
 		if (bd >= 0) {
 			/*
 			 * Bonding device detected. Only one match is allowed,
@@ -2356,7 +2380,8 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ibv_match[nd++] = ibv_list[ret];
 			break;
 		}
-		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma);
+		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma,
+						    info);
 		if (mpesw >= 0) {
 			/*
 			 * MPESW device detected. Only one matching IB device is allowed,
@@ -2380,10 +2405,18 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		}
 		/* Bonding or MPESW device was not found. */
 		if (mlx5_get_pci_addr(ibv_list[ret]->ibdev_path,
-					&pci_addr))
+					&pci_addr)) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
-		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0)
+		}
+		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0) {
+			if (tmp_info[ret].port_info != NULL)
+				mlx5_free(tmp_info[ret].port_info);
+			memset(&tmp_info[ret], 0, sizeof(tmp_info[0]));
 			continue;
+		}
 		DRV_LOG(INFO, "PCI information matches for device \"%s\"",
 			ibv_list[ret]->name);
 		ibv_match[nd++] = ibv_list[ret];
@@ -2401,13 +2434,21 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		goto exit;
 	}
 	if (nd == 1) {
+		if (!cdev->dev_info.port_num) {
+			for (i = 0; i < RTE_DIM(tmp_info); i++) {
+				if (tmp_info[i].port_num) {
+					cdev->dev_info = tmp_info[i];
+					break;
+				}
+			}
+		}
 		/*
 		 * Found single matching device may have multiple ports.
 		 * Each port may be representor, we have to check the port
 		 * number and check the representors existence.
 		 */
 		if (nl_rdma >= 0)
-			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name);
+			np = mlx5_nl_portnum(nl_rdma, ibv_match[0]->name, &cdev->dev_info);
 		if (!np)
 			DRV_LOG(WARNING,
 				"Cannot get IB device \"%s\" ports number.",
@@ -2424,6 +2465,14 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ret = -rte_errno;
 			goto exit;
 		}
+	} else {
+		/* Can't handle one common device with multiple IB devices caching */
+		for (i = 0; i < RTE_DIM(tmp_info); i++) {
+			if (tmp_info[i].port_info != NULL)
+				mlx5_free(tmp_info[i].port_info);
+			memset(&tmp_info[i], 0, sizeof(tmp_info[0]));
+		}
+		DRV_LOG(INFO, "Cannot handle multiple IB devices info caching in single common device.");
 	}
 	/* Now we can determine the maximal amount of devices to be spawned. */
 	list = mlx5_malloc(MLX5_MEM_ZERO,
@@ -2457,7 +2506,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			list[ns].mpesw_port = MLX5_MPESW_PORT_INVALID;
 			list[ns].ifindex = mlx5_nl_ifindex(nl_rdma,
 							   ibv_match[0]->name,
-							   i);
+							   i, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				/*
 				 * No network interface index found for the
@@ -2588,7 +2637,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 				list[ns].ifindex = mlx5_nl_ifindex
 							    (nl_rdma,
 							     ibv_match[i]->name,
-							     1);
+							     1, &cdev->dev_info);
 			if (!list[ns].ifindex) {
 				char ifname[IF_NAMESIZE];
 
@@ -2777,6 +2826,11 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		mlx5_free(list);
 	MLX5_ASSERT(ibv_list);
 	mlx5_glue->free_device_list(ibv_list);
+	if (ret) {
+		if (cdev->dev_info.port_info != NULL)
+			mlx5_free(cdev->dev_info.port_info);
+		memset(&cdev->dev_info, 0, sizeof(cdev->dev_info));
+	}
 	return ret;
 }
 
diff --git a/drivers/net/mlx5/linux/mlx5_os.h b/drivers/net/mlx5/linux/mlx5_os.h
index 80c70d713a..4ef0916173 100644
--- a/drivers/net/mlx5/linux/mlx5_os.h
+++ b/drivers/net/mlx5/linux/mlx5_os.h
@@ -8,12 +8,6 @@
 
 #include <net/if.h>
 
-/* verb enumerations translations to local enums. */
-enum {
-	MLX5_FS_NAME_MAX = IBV_SYSFS_NAME_MAX + 1,
-	MLX5_FS_PATH_MAX = IBV_SYSFS_PATH_MAX + 1
-};
-
 /* Maximal data of sendmsg message(in bytes). */
 #define MLX5_SENDMSG_MAX 64
 
diff --git a/drivers/net/mlx5/windows/mlx5_os.h b/drivers/net/mlx5/windows/mlx5_os.h
index 8b58265687..fb7198c244 100644
--- a/drivers/net/mlx5/windows/mlx5_os.h
+++ b/drivers/net/mlx5/windows/mlx5_os.h
@@ -7,11 +7,6 @@
 
 #include "mlx5_win_ext.h"
 
-enum {
-	MLX5_FS_NAME_MAX = MLX5_DEVX_DEVICE_NAME_SIZE + 1,
-	MLX5_FS_PATH_MAX = MLX5_DEVX_DEVICE_PNP_SIZE + 1
-};
-
 #define PCI_DRV_FLAGS 0
 
 #define MLX5_NAMESIZE MLX5_FS_NAME_MAX
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
@ 2024-10-29 14:31             ` Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
                               ` (3 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland, Rongwei Liu

From: Rongwei Liu <rongweil@nvidia.com>

Add a new devarg, probe_opt_en, to control probe optimization
in the PMD.

By default, the value is 0 and the behavior is unchanged.

Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
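A minimal sketch of the gating this patch introduces, not the PMD code
itself: the devarg value is normalized to 0/1 and the cached port count
is returned only when the option is on and the IB device name matches.
The struct and both helpers below are hypothetical stand-ins; the
Netlink fallback is left out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct mlx5_dev_info. */
struct dev_info_sketch {
	uint32_t port_num; /* cached port count, 0 when nothing is cached */
	char ibname[64];   /* IB device name the cache belongs to */
	uint8_t probe_opt; /* normalized probe_opt_en value */
};

/* Normalize a devarg value string to 0/1, as "!!tmp" does in the patch. */
static uint8_t
probe_opt_parse(const char *val)
{
	return !!strtol(val, NULL, 0);
}

/*
 * Return the cached port count, or 0 when the caller would have to fall
 * back to a Netlink query (not part of this sketch).
 */
static uint32_t
cached_portnum(const struct dev_info_sketch *info, const char *name)
{
	if (info->probe_opt && info->port_num && !strcmp(name, info->ibname))
		return info->port_num;
	return 0;
}
```

On the command line the option would be passed like any other device
argument, for example "-a 0000:08:00.0,probe_opt_en=1" (the PCI address
is only an illustration).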
 doc/guides/nics/mlx5.rst                |  7 +++++++
 drivers/common/mlx5/linux/mlx5_nl.c     | 12 ++++++++----
 drivers/common/mlx5/mlx5_common.c       | 15 +++++++++++++++
 drivers/common/mlx5/mlx5_common.h       |  2 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  5 ++++-
 drivers/net/mlx5/linux/mlx5_os.c        |  2 +-
 6 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index f82e2d75de..981401a9f2 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1436,6 +1436,13 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 1.
 
+- ``probe_opt_en`` parameter [int]
+
+  A non-zero value optimizes the probe process, especially at large scale.
+  The PMD caches the IB device information internally and reuses it.
+
+  By default, the PMD will set this value to 0.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e98073aafe..745e443f8f 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1148,7 +1148,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 			.flags = 0,
 	};
 
-	if (!strcmp(name, dev_info->ibname)) {
+	if (dev_info->probe_opt && !strcmp(name, dev_info->ibname)) {
 		if (dev_info->port_info && pindex <= dev_info->port_num &&
 		    dev_info->port_info[pindex].valid) {
 			if (!dev_info->port_info[pindex].ifindex)
@@ -1161,7 +1161,7 @@ mlx5_nl_ifindex(int nl, const char *name, uint32_t pindex, struct mlx5_dev_info
 
 	ret = mlx5_nl_port_info(nl, pindex, &data);
 
-	if (!strcmp(dev_info->ibname, name)) {
+	if (dev_info->probe_opt && !strcmp(dev_info->ibname, name)) {
 		if ((!ret || ret == -ENODEV) && dev_info->port_info &&
 		    pindex <= dev_info->port_num) {
 			if (!ret)
@@ -1201,7 +1201,8 @@ mlx5_nl_port_state(int nl, const char *name, uint32_t pindex, struct mlx5_dev_in
 			.ibindex = UINT32_MAX,
 	};
 
-	if (dev_info && !strcmp(name, dev_info->ibname) && dev_info->port_num)
+	if (dev_info && dev_info->probe_opt &&
+	    !strcmp(name, dev_info->ibname) && dev_info->port_num)
 		data.ibindex = dev_info->ibindex;
 	if (mlx5_nl_port_info(nl, pindex, &data) < 0)
 		return -rte_errno;
@@ -1244,7 +1245,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 	uint32_t sn = MLX5_NL_SN_GENERATE;
 	int ret, size;
 
-	if (dev_info->port_num && !strcmp(name, dev_info->ibname))
+	if (dev_info->probe_opt && dev_info->port_num &&
+	    !strcmp(name, dev_info->ibname))
 		return dev_info->port_num;
 
 	ret = mlx5_nl_send(nl, &req, sn);
@@ -1263,6 +1265,8 @@ mlx5_nl_portnum(int nl, const char *name, struct mlx5_dev_info *dev_info)
 		rte_errno = EINVAL;
 		return 0;
 	}
+	if (!dev_info->probe_opt)
+		return data.portnum;
 	MLX5_ASSERT(!strlen(dev_info->ibname));
 	dev_info->port_num = data.portnum;
 	dev_info->ibindex = data.ibindex;
diff --git a/drivers/common/mlx5/mlx5_common.c b/drivers/common/mlx5/mlx5_common.c
index 0aaae91c31..9abae4a374 100644
--- a/drivers/common/mlx5/mlx5_common.c
+++ b/drivers/common/mlx5/mlx5_common.c
@@ -40,6 +40,9 @@ uint8_t haswell_broadwell_cpu;
 /* The default memory allocator used in PMD. */
 #define MLX5_SYS_MEM_EN "sys_mem_en"
 
+/* Probe optimization in PMD. */
+#define MLX5_PROBE_OPT "probe_opt_en"
+
 /*
  * Device parameter to force doorbell register mapping
  * to non-cached region eliminating the extra write memory barrier.
@@ -295,6 +298,8 @@ mlx5_common_args_check_handler(const char *key, const char *val, void *opaque)
 		config->device_fd = tmp;
 	} else if (strcmp(key, MLX5_PD_HANDLE) == 0) {
 		config->pd_handle = tmp;
+	} else if (strcmp(key, MLX5_PROBE_OPT) == 0) {
+		config->probe_opt = !!tmp;
 	}
 	return 0;
 }
@@ -324,6 +329,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 		MLX5_MR_MEMPOOL_REG_EN,
 		MLX5_DEVICE_FD,
 		MLX5_PD_HANDLE,
+		MLX5_PROBE_OPT,
 		NULL,
 	};
 	int ret = 0;
@@ -332,6 +338,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	config->mr_ext_memseg_en = 1;
 	config->mr_mempool_reg_en = 1;
 	config->sys_mem_en = 0;
+	config->probe_opt = 0;
 	config->dbnc = MLX5_ARG_UNSET;
 	config->device_fd = MLX5_ARG_UNSET;
 	config->pd_handle = MLX5_ARG_UNSET;
@@ -351,6 +358,7 @@ mlx5_common_config_get(struct mlx5_kvargs_ctrl *mkvlist,
 	DRV_LOG(DEBUG, "mr_ext_memseg_en is %u.", config->mr_ext_memseg_en);
 	DRV_LOG(DEBUG, "mr_mempool_reg_en is %u.", config->mr_mempool_reg_en);
 	DRV_LOG(DEBUG, "sys_mem_en is %u.", config->sys_mem_en);
+	DRV_LOG(DEBUG, "probe_opt_en is %u.", config->probe_opt);
 	DRV_LOG(DEBUG, "Send Queue doorbell mapping parameter is %d.",
 		config->dbnc);
 	return ret;
@@ -791,6 +799,7 @@ mlx5_common_dev_create(struct rte_device *eal_dev, uint32_t classes,
 	if (TAILQ_EMPTY(&devices_list))
 		rte_mem_event_callback_register("MLX5_MEM_EVENT_CB",
 						mlx5_mr_mem_event_cb, NULL);
+	cdev->dev_info.probe_opt = cdev->config.probe_opt;
 exit:
 	pthread_mutex_lock(&devices_list_lock);
 	TAILQ_INSERT_HEAD(&devices_list, cdev, next);
@@ -880,6 +889,12 @@ mlx5_common_probe_again_args_validate(struct mlx5_common_device *cdev,
 			cdev->dev->name);
 		goto error;
 	}
+	if (cdev->config.probe_opt != config->probe_opt) {
+		DRV_LOG(ERR, "\"" MLX5_PROBE_OPT"\" "
+			"configuration mismatch for device %s.",
+			cdev->dev->name);
+		goto error;
+	}
 	if (cdev->config.dbnc != config->dbnc) {
 		DRV_LOG(ERR, "\"" MLX5_SQ_DB_NC "\" "
 			"configuration mismatch for device %s.",
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 6cb40f54dd..f1b59d6f07 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -183,6 +183,7 @@ struct mlx5_dev_info {
 	uint32_t port_num;
 	uint32_t ibindex;
 	char ibname[MLX5_FS_NAME_MAX];
+	uint8_t probe_opt;
 	struct mlx5_port_nl_info *port_info;
 };
 
@@ -525,6 +526,7 @@ struct mlx5_common_dev_config {
 	int pd_handle; /* Protection Domain handle for importation.  */
 	unsigned int devx:1; /* Whether devx interface is available or not. */
 	unsigned int sys_mem_en:1; /* The default memory allocator. */
+	unsigned int probe_opt:1; /* Optimize probing . */
 	unsigned int mr_mempool_reg_en:1;
 	/* Allow/prevent implicit mempool memory registration. */
 	unsigned int mr_ext_memseg_en:1;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 08ac6dd939..88d3c57c6e 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -691,6 +691,8 @@ mlx5_handle_port_info_update(struct mlx5_dev_info *dev_info, uint32_t if_index,
 	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
 		return;
 
+	DRV_LOG(DEBUG, "IB device %s ifindex %u received netlink event %u",
+			dev_info->ibname, if_index, msg_type);
 	for (i = 1; i <= dev_info->port_num; i++) {
 		if (!dev_info->port_info[i].valid)
 			continue;
@@ -734,7 +736,8 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
 		struct mlx5_dev_shared_port *port = &sh->port[i];
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 8f6e584154..695936f634 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -2340,7 +2340,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	while (ret-- > 0) {
 		struct rte_pci_addr pci_addr;
 
-		if (cdev->dev_info.port_num) {
+		if (cdev->config.probe_opt && cdev->dev_info.port_num) {
 			if (strcmp(ibv_list[ret]->name, cdev->dev_info.ibname)) {
 				DRV_LOG(INFO, "Unmatched caching device \"%s\" \"%s\"",
 					cdev->dev_info.ibname, ibv_list[ret]->name);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V3 4/7] common/mlx5: fix Netlink socket leak
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                               ` (2 preceding siblings ...)
  2024-10-29 14:31             ` [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
@ 2024-10-29 14:31             ` Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
                               ` (2 subsequent siblings)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, Spike Du
  Cc: dev, rasland, stable

Fixes: 72d7efe464b1 ("common/mlx5: share interrupt management")
Cc: stable@dpdk.org

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
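The ordering used by the fix can be sketched as follows, assuming (as
the fix implies) that destroying the interrupt handle releases the
handle but leaves the descriptor open. The handle type and helper names
are hypothetical; only the read-fd, destroy, close sequence mirrors the
patch.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for an interrupt handle wrapping a descriptor. */
struct intr_handle_sketch {
	int fd;
};

/* Assumption: destroying the handle frees it but does not close fd. */
static void
handle_destroy(struct intr_handle_sketch *h)
{
	free(h);
}

/* Read the descriptor first, destroy the handle, then close it. */
static void
handler_uninstall(struct intr_handle_sketch *h)
{
	int fd = h ? h->fd : -1;

	handle_destroy(h);
	if (fd >= 0)
		close(fd);
}

/* Return 1 when fd still refers to an open descriptor, else 0. */
static int
fd_is_open(int fd)
{
	return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}
```

Reading the descriptor after the handle is destroyed would touch freed
memory, which is why the value is saved up front.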
 drivers/net/mlx5/linux/mlx5_os.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 695936f634..4537ca0466 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3071,10 +3071,15 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 void
 mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 {
+	int fd;
+
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle,
 					  mlx5_dev_interrupt_handler, sh);
+	fd = rte_intr_fd_get(sh->intr_handle_nl);
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_nl,
 					  mlx5_dev_interrupt_handler_nl, sh);
+	if (fd >= 0)
+		close(fd);
 #ifdef HAVE_IBV_DEVX_ASYNC
 	mlx5_os_interrupt_handler_destroy(sh->intr_handle_devx,
 					  mlx5_dev_interrupt_handler_devx, sh);
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                               ` (3 preceding siblings ...)
  2024-10-29 14:31             ` [PATCH V3 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
@ 2024-10-29 14:31             ` Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

RDMA monitor is a new feature introduced by the kernel driver. This commit
adds backward compatibility for kernels that do not support it.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
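A small sketch of the fallback-definition pattern used here, plus the
group-to-mask conversion that the following patch in the series applies
to RDMA_NL_GROUP_NOTIFY. The helper nl_group_mask() is hypothetical;
the sketch does not include rdma/rdma_netlink.h, so the fallback branch
is always taken in it.

```c
#include <stdint.h>

/*
 * HAVE_RDMA_NL_GROUP_NOTIFY would be set by the build system when the
 * installed rdma/rdma_netlink.h already provides the symbol. That
 * header is not included in this sketch, so the fallback is used.
 */
#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
#define RDMA_NL_GROUP_NOTIFY 4
#endif

/* Netlink multicast group N (1..32) maps to bit N - 1 of the bind mask. */
static inline uint32_t
nl_group_mask(unsigned int group)
{
	if (group == 0 || group > 32)
		return 0;
	return UINT32_C(1) << (group - 1);
}
```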
 drivers/common/mlx5/linux/meson.build | 10 ++++++++++
 drivers/common/mlx5/linux/mlx5_nl.c   | 17 +++++++++++++++++
 2 files changed, 27 insertions(+)

diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 82e8046e0c..58d0328c6d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -170,6 +170,16 @@ has_sym_args = [
             'RDMA_NLDEV_ATTR_PORT_STATE' ],
         [ 'HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX', 'rdma/rdma_netlink.h',
             'RDMA_NLDEV_ATTR_NDEV_INDEX' ],
+        [ 'HAVE_RDMA_NL_GROUP_NOTIFY', 'rdma/rdma_netlink.h',
+            'RDMA_NL_GROUP_NOTIFY' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_SYS_GET', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_SYS_GET' ],
+        [ 'HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_SYS_ATTR_MONITOR_MODE' ],
+        [ 'HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_ATTR_EVENT_TYPE' ],
+        [ 'HAVE_RDMA_NLDEV_CMD_MONITOR', 'rdma/rdma_netlink.h',
+            'RDMA_NLDEV_CMD_MONITOR' ],
         [ 'HAVE_MLX5_DR_FLOW_DUMP', 'infiniband/mlx5dv.h',
             'mlx5dv_dump_dr_domain'],
         [ 'HAVE_MLX5_DR_CREATE_ACTION_FLOW_SAMPLE', 'infiniband/mlx5dv.h',
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index 745e443f8f..e03db4f918 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -84,6 +84,23 @@
 #ifndef HAVE_RDMA_NLDEV_ATTR_NDEV_INDEX
 #define RDMA_NLDEV_ATTR_NDEV_INDEX 50
 #endif
+#ifndef HAVE_RDMA_NLDEV_ATTR_EVENT_TYPE
+#define RDMA_NLDEV_ATTR_EVENT_TYPE 102
+#define RDMA_NETDEV_ATTACH_EVENT 2
+#define RDMA_NETDEV_DETACH_EVENT 3
+#endif
+#ifndef HAVE_RDMA_NLDEV_SYS_ATTR_MONITOR_MODE
+#define RDMA_NLDEV_SYS_ATTR_MONITOR_MODE 103
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_MONITOR
+#define RDMA_NLDEV_CMD_MONITOR 28
+#endif
+#ifndef HAVE_RDMA_NLDEV_CMD_SYS_GET
+#define RDMA_NLDEV_CMD_SYS_GET 6
+#endif
+#ifndef HAVE_RDMA_NL_GROUP_NOTIFY
+#define RDMA_NL_GROUP_NOTIFY 4
+#endif
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
-- 
2.34.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                               ` (4 preceding siblings ...)
  2024-10-29 14:31             ` [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
@ 2024-10-29 14:31             ` Minggang Li(Gavin)
  2024-10-29 14:31             ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

Previously, port information, such as port addition and deletion, was
updated via route Netlink. The events used were link up/down rather than
the exact events for port addition or deletion, which did not perform well.

To improve performance, use RDMA monitor events to track port addition
and deletion and update the corresponding port information.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
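The attribute walk performed on an RDMA monitor message can be sketched
as below. This is an illustration under assumptions, not the driver
code: the header struct only mirrors the layout idea of struct nlattr,
and the attribute IDs, flags, and names are hypothetical. The sketch
also rejects attributes shorter than their own header, which is a
choice of this sketch rather than something the patch does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute header, same layout idea as struct nlattr. */
struct attr_hdr {
	uint16_t len;  /* header plus payload, in bytes */
	uint16_t type;
};

#define ATTR_ALIGN(len) (((len) + 3u) & ~3u)
#define ATTR_HDRLEN ((size_t)ATTR_ALIGN(sizeof(struct attr_hdr)))

/* Illustrative attribute IDs and flags, not the kernel's values. */
enum { ATTR_PORT_INDEX = 1, ATTR_NDEV_INDEX = 2 };
#define GOT_PORT_INDEX (1u << 0)
#define GOT_NDEV_INDEX (1u << 1)

struct event_sketch {
	uint32_t flags;   /* which attributes were found */
	uint32_t portnum;
	uint32_t ifindex;
};

/* Walk the attributes in buf; return 0 on success, -1 if malformed. */
static int
attrs_parse(const uint8_t *buf, size_t size, struct event_sketch *ev)
{
	size_t off = 0;

	while (off < size) {
		struct attr_hdr na;
		const uint8_t *payload;

		if (size - off < sizeof(na))
			return -1;
		memcpy(&na, buf + off, sizeof(na));
		/* Reject attributes that are truncated or overrun the buffer. */
		if (na.len < ATTR_HDRLEN || na.len > size - off)
			return -1;
		payload = buf + off + ATTR_HDRLEN;
		switch (na.type) {
		case ATTR_PORT_INDEX:
			if (na.len < ATTR_HDRLEN + sizeof(uint32_t))
				return -1;
			memcpy(&ev->portnum, payload, sizeof(uint32_t));
			ev->flags |= GOT_PORT_INDEX;
			break;
		case ATTR_NDEV_INDEX:
			if (na.len < ATTR_HDRLEN + sizeof(uint32_t))
				return -1;
			memcpy(&ev->ifindex, payload, sizeof(uint32_t));
			ev->flags |= GOT_NDEV_INDEX;
			break;
		default:
			/* Unknown attributes are skipped. */
			break;
		}
		off += ATTR_ALIGN(na.len);
	}
	return 0;
}
```

The caller can then test the flags to decide whether the event carries
enough information to update a cached entry.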
 doc/guides/nics/mlx5.rst                |  6 ++
 drivers/common/mlx5/linux/mlx5_nl.c     | 74 ++++++++++++++++++-----
 drivers/common/mlx5/linux/mlx5_nl.h     | 28 +++++++++
 drivers/common/mlx5/version.map         |  2 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c | 79 +++++++++++++++++++++++++
 drivers/net/mlx5/linux/mlx5_os.c        | 20 +++++++
 drivers/net/mlx5/mlx5.h                 |  2 +
 7 files changed, 195 insertions(+), 16 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 981401a9f2..1a9ec1bd62 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1443,6 +1443,12 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 0.
 
+  .. note::
+
+    Probing a port is subject to a race condition when probe_opt_en is set
+    to 1: the probe may fail with a wrong ifindex in the cache while the
+    interrupt thread is updating it. Retry the probe if it fails.
+
 - ``lacp_by_user`` parameter [int]
 
   A nonzero value enables the control of LACP traffic by the user application.
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index e03db4f918..ce1c2a8e75 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -101,6 +101,7 @@
 #ifndef HAVE_RDMA_NL_GROUP_NOTIFY
 #define RDMA_NL_GROUP_NOTIFY 4
 #endif
+#define RDMA_NL_GROUP_NOTIFICATION (1 << (RDMA_NL_GROUP_NOTIFY - 1))
 
 /* These are normally found in linux/if_link.h. */
 #ifndef HAVE_IFLA_NUM_VF
@@ -176,22 +177,6 @@ struct mlx5_nl_mac_addr {
 	int mac_n; /**< Number of addresses in the array. */
 };
 
-#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
-#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
-#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
-#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
-#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
-
-/** Data structure used by mlx5_nl_cmdget_cb(). */
-struct mlx5_nl_port_info {
-	const char *name; /**< IB device name (in). */
-	uint32_t flags; /**< found attribute flags (out). */
-	uint32_t ibindex; /**< IB device index (out). */
-	uint32_t ifindex; /**< Network interface index (out). */
-	uint32_t portnum; /**< IB device max port number (out). */
-	uint16_t state; /**< IB device port state (out). */
-};
-
 RTE_ATOMIC(uint32_t) atomic_sn;
 
 /* Generate Netlink sequence number. */
@@ -2110,3 +2095,60 @@ mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id, const char *pci_ad
 		*enable ? "en" : "dis", pci_addr);
 	return ret;
 }
+
+int
+mlx5_nl_rdma_monitor_init(void)
+{
+	return mlx5_nl_init(NETLINK_RDMA, RDMA_NL_GROUP_NOTIFICATION);
+}
+
+void
+mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t event_type = 0;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_MONITOR))
+		goto error;
+
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_ATTR_EVENT_TYPE:
+			event_type = *(uint8_t *)payload;
+			if (event_type == RDMA_NETDEV_ATTACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_ATTACH_EVENT;
+			} else if (event_type == RDMA_NETDEV_DETACH_EVENT) {
+				data->flags |= MLX5_NL_CMD_GET_EVENT_TYPE;
+				data->event_type = MLX5_NL_RDMA_NETDEV_DETACH_EVENT;
+			}
+			break;
+		case RDMA_NLDEV_ATTR_DEV_INDEX:
+			data->ibindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_IB_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_PORT_INDEX:
+			data->portnum = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_PORT_INDEX;
+			break;
+		case RDMA_NLDEV_ATTR_NDEV_INDEX:
+			data->ifindex = *(uint32_t *)payload;
+			data->flags |= MLX5_NL_CMD_GET_NET_INDEX;
+			break;
+		default:
+			DRV_LOG(DEBUG, "Unknown attribute[%d] found", na->nla_type);
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return;
+
+error:
+	rte_errno = EINVAL;
+}
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index 396ffc98ce..e32080fa63 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -32,6 +32,27 @@ struct mlx5_nl_vlan_vmwa_context {
 	struct mlx5_nl_vlan_dev vlan_dev[4096];
 };
 
+#define MLX5_NL_CMD_GET_IB_NAME (1 << 0)
+#define MLX5_NL_CMD_GET_IB_INDEX (1 << 1)
+#define MLX5_NL_CMD_GET_NET_INDEX (1 << 2)
+#define MLX5_NL_CMD_GET_PORT_INDEX (1 << 3)
+#define MLX5_NL_CMD_GET_PORT_STATE (1 << 4)
+#define MLX5_NL_CMD_GET_EVENT_TYPE (1 << 5)
+
+/** Data structure used by mlx5_nl_cmdget_cb(). */
+struct mlx5_nl_port_info {
+	const char *name; /**< IB device name (in). */
+	uint32_t flags; /**< found attribute flags (out). */
+	uint32_t ibindex; /**< IB device index (out). */
+	uint32_t ifindex; /**< Network interface index (out). */
+	uint32_t portnum; /**< IB device max port number (out). */
+	uint16_t state; /**< IB device port state (out). */
+	uint8_t event_type; /**< IB RDMA event type (out). */
+};
+
+#define MLX5_NL_RDMA_NETDEV_ATTACH_EVENT (1)
+#define MLX5_NL_RDMA_NETDEV_DETACH_EVENT (2)
+
 __rte_internal
 int mlx5_nl_init(int protocol, int groups);
 __rte_internal
@@ -89,4 +110,11 @@ __rte_internal
 int mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id,
 				      const char *pci_addr, int *enable);
 
+__rte_internal
+int mlx5_nl_rdma_monitor_init(void);
+__rte_internal
+void mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *data);
+__rte_internal
+int mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap);
+
 #endif /* RTE_PMD_MLX5_NL_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index a2f72ef46a..5230576006 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -146,6 +146,8 @@ INTERNAL {
 	mlx5_nl_vf_mac_addr_modify; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_create; # WINDOWS_NO_EXPORT
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 88d3c57c6e..5156d96b3a 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -894,6 +894,85 @@ mlx5_dev_interrupt_handler_devx(void *cb_arg)
 #endif /* HAVE_IBV_DEVX_ASYNC */
 }
 
+static void
+mlx5_dev_interrupt_ib_cb(struct nlmsghdr *hdr, void *cb_arg)
+{
+	mlx5_nl_rdma_monitor_info_get(hdr, (struct mlx5_nl_port_info *)cb_arg);
+}
+
+void
+mlx5_dev_interrupt_handler_ib(void *arg)
+{
+	struct mlx5_dev_ctx_shared *sh = arg;
+	struct mlx5_nl_port_info data = {
+		.flags = 0,
+		.name = "",
+		.ifindex = 0,
+		.ibindex = 0,
+		.portnum = 0,
+	};
+	int nlsk_fd = rte_intr_fd_get(sh->intr_handle_ib);
+	struct mlx5_dev_info *dev_info;
+	uint32_t i;
+
+	dev_info = &sh->cdev->dev_info;
+	DRV_LOG(DEBUG, "IB device %s received RDMA monitor netlink event", dev_info->ibname);
+	if (dev_info->port_num <= 1 || dev_info->port_info == NULL)
+		return;
+
+	if (nlsk_fd < 0)
+		return;
+
+	if (mlx5_nl_read_events(nlsk_fd, mlx5_dev_interrupt_ib_cb, &data) < 0)
+		DRV_LOG(ERR, "Failed to process Netlink events: %s",
+			rte_strerror(rte_errno));
+
+	if (!(data.flags & MLX5_NL_CMD_GET_EVENT_TYPE) ||
+		!(data.flags & MLX5_NL_CMD_GET_PORT_INDEX) ||
+		!(data.flags & MLX5_NL_CMD_GET_IB_INDEX))
+		return;
+
+	if (data.ibindex != dev_info->ibindex)
+		return;
+
+	if (data.event_type != MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+		data.event_type != MLX5_NL_RDMA_NETDEV_DETACH_EVENT)
+		return;
+
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT &&
+	    !(data.flags & MLX5_NL_CMD_GET_NET_INDEX))
+		return;
+
+	DRV_LOG(DEBUG, "Event info: type %d, ibindex %d, ifindex %d, portnum %d",
+		data.event_type, data.ibindex, data.ifindex, data.portnum);
+
+	/* Changes found in number of SF/VF ports. All information is likely unreliable. */
+	if (data.portnum > dev_info->port_num) {
+		DRV_LOG(ERR, "Port[%d] exceeds maximum[%d]", data.portnum, dev_info->port_num);
+		goto flush_all;
+	}
+	if (data.event_type == MLX5_NL_RDMA_NETDEV_ATTACH_EVENT) {
+		if (!dev_info->port_info[data.portnum].ifindex) {
+			dev_info->port_info[data.portnum].ifindex = data.ifindex;
+			dev_info->port_info[data.portnum].valid = 1;
+		} else {
+			DRV_LOG(WARNING, "Duplicate RDMA event for port[%d] ifindex[%d]",
+				data.portnum, data.ifindex);
+			if (data.ifindex != dev_info->port_info[data.portnum].ifindex)
+				goto flush_all;
+		}
+	} else if (data.event_type == MLX5_NL_RDMA_NETDEV_DETACH_EVENT) {
+		memset(dev_info->port_info + data.portnum, 0, sizeof(struct mlx5_port_nl_info));
+	}
+	return;
+
+flush_all:
+	for (i = 1; i <= dev_info->port_num; i++) {
+		dev_info->port_info[i].ifindex = 0;
+		dev_info->port_info[i].valid = 0;
+	}
+}
+
 /**
  * DPDK callback to bring the link DOWN.
  *
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 4537ca0466..16b275c71e 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3025,6 +3025,21 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+		nlsk_fd = mlx5_nl_rdma_monitor_init();
+		if (nlsk_fd < 0) {
+			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
+				rte_strerror(rte_errno));
+			return;
+		}
+		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+			(RTE_INTR_INSTANCE_F_SHARED, true,
+			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+		if (sh->intr_handle_ib == NULL) {
+			DRV_LOG(ERR, "Fail to allocate intr_handle");
+			return;
+		}
+	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
 		DRV_LOG(ERR, "Failed to create a socket for Netlink events: %s",
@@ -3086,6 +3101,11 @@ mlx5_os_dev_shared_handler_uninstall(struct mlx5_dev_ctx_shared *sh)
 	if (sh->devx_comp)
 		mlx5_glue->devx_destroy_cmd_comp(sh->devx_comp);
 #endif
+	fd = rte_intr_fd_get(sh->intr_handle_ib);
+	mlx5_os_interrupt_handler_destroy(sh->intr_handle_ib,
+				  mlx5_dev_interrupt_handler_ib, sh);
+	if (fd >= 0)
+		close(fd);
 }
 
 /**
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 503366580b..adc21c272b 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1574,6 +1574,7 @@ struct mlx5_dev_ctx_shared {
 	struct rte_intr_handle *intr_handle; /* Interrupt handler for device. */
 	struct rte_intr_handle *intr_handle_devx; /* DEVX interrupt handler. */
 	struct rte_intr_handle *intr_handle_nl; /* Netlink interrupt handler. */
+	struct rte_intr_handle *intr_handle_ib; /* Interrupt handler for IB device. */
 	void *devx_comp; /* DEVX async comp obj. */
 	struct mlx5_devx_obj *tis[16]; /* TIS object. */
 	struct mlx5_devx_obj *td; /* Transport domain. */
@@ -2274,6 +2275,7 @@ int mlx5_dev_set_flow_ctrl(struct rte_eth_dev *dev,
 void mlx5_dev_interrupt_handler(void *arg);
 void mlx5_dev_interrupt_handler_devx(void *arg);
 void mlx5_dev_interrupt_handler_nl(void *arg);
+void mlx5_dev_interrupt_handler_ib(void *arg);
 int mlx5_set_link_down(struct rte_eth_dev *dev);
 int mlx5_set_link_up(struct rte_eth_dev *dev);
 int mlx5_is_removed(struct rte_eth_dev *dev);
-- 
2.34.1
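
For readers following the handler logic in patch 6/7, the cache update it
performs on attach and detach events can be sketched in isolation. The
following is a minimal, self-contained illustration and not the driver code:
port_cache, port_entry, ev_type, cache_flush and cache_handle_event are
hypothetical names, and the rejection of port number 0 is an added assumption
(the patch itself only checks the upper bound).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the driver's types; not the real definitions. */
enum ev_type { EV_ATTACH = 1, EV_DETACH = 2 };

struct port_entry {
	uint32_t ifindex; /* 0 means "unknown" */
	uint8_t valid;
};

struct port_cache {
	uint32_t port_num;        /* ports are numbered 1..port_num */
	struct port_entry *ports; /* array of port_num + 1 entries */
};

/* Drop everything: after an inconsistent event no entry can be trusted. */
static void
cache_flush(struct port_cache *c)
{
	uint32_t i;

	for (i = 1; i <= c->port_num; i++)
		memset(&c->ports[i], 0, sizeof(c->ports[i]));
}

/* Returns 0 if one entry was updated, -1 if the whole cache was flushed. */
static int
cache_handle_event(struct port_cache *c, enum ev_type type,
		   uint32_t portnum, uint32_t ifindex)
{
	struct port_entry *e;

	assert(c != NULL && c->ports != NULL);
	/* Assumption of this sketch: port 0 is never a valid index. */
	if (portnum == 0 || portnum > c->port_num) {
		cache_flush(c);
		return -1;
	}
	e = &c->ports[portnum];
	if (type == EV_ATTACH) {
		/* A second attach with a different ifindex is a conflict. */
		if (e->ifindex != 0 && e->ifindex != ifindex) {
			cache_flush(c);
			return -1;
		}
		e->ifindex = ifindex;
		e->valid = 1;
	} else {
		memset(e, 0, sizeof(*e));
	}
	return 0;
}
```

Flushing the whole table on any inconsistency trades a few extra lookups for
never serving a stale ifindex, which matches the "flush_all" path in the patch.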


* [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
                               ` (5 preceding siblings ...)
  2024-10-29 14:31             ` [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
@ 2024-10-29 14:31             ` Minggang Li(Gavin)
  6 siblings, 0 replies; 42+ messages in thread
From: Minggang Li(Gavin) @ 2024-10-29 14:31 UTC (permalink / raw)
  To: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou
  Cc: dev, rasland

Fall back to the old way of updating port information if the kernel
driver does not support the RDMA monitor.

Signed-off-by: Minggang Li(Gavin) <gavinl@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/rel_notes/release_24_11.rst  | 14 +++++
 drivers/common/mlx5/linux/mlx5_nl.c     | 73 +++++++++++++++++++++++++
 drivers/common/mlx5/version.map         |  1 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c |  2 +-
 drivers/net/mlx5/linux/mlx5_os.c        | 27 +++++++--
 drivers/net/mlx5/mlx5.h                 |  1 +
 6 files changed, 111 insertions(+), 7 deletions(-)

diff --git a/doc/guides/rel_notes/release_24_11.rst b/doc/guides/rel_notes/release_24_11.rst
index fa4822d928..bc868bb74a 100644
--- a/doc/guides/rel_notes/release_24_11.rst
+++ b/doc/guides/rel_notes/release_24_11.rst
@@ -247,6 +247,20 @@ New Features
   Added ability for node to advertise and update multiple xstat counters,
   that can be retrieved using ``rte_graph_cluster_stats_get``.
 
+* **Updated NVIDIA mlx5 driver.**
+
+  Optimized port probing at large scale.
+  This feature significantly reduces the time needed to probe a large
+  number of VFs/SFs. To activate it, set ``probe_opt_en`` to a non-zero
+  value when probing the device. It relies on the RDMA monitor, a
+  capability of the RDMA driver expected to be released in the upcoming
+  kernel version 6.12 or its equivalent in OFED 24.10. For additional
+  details on the limitations of this devarg, refer to
+  "doc/guides/nics/mlx5.rst".
+
+  If the application probes many VFs/SFs, e.g. 300, the option should be
+  enabled to save probing time.
+
 
 Removed Items
 -------------
diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index ce1c2a8e75..12f1a620f3 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -2152,3 +2152,76 @@ mlx5_nl_rdma_monitor_info_get(struct nlmsghdr *hdr, struct mlx5_nl_port_info *da
 error:
 	rte_errno = EINVAL;
 }
+
+static int
+mlx5_nl_rdma_monitor_cap_get_cb(struct nlmsghdr *hdr, void *arg)
+{
+	size_t off = NLMSG_HDRLEN;
+	uint8_t *cap = arg;
+
+	if (hdr->nlmsg_type != RDMA_NL_GET_TYPE(RDMA_NL_NLDEV, RDMA_NLDEV_CMD_SYS_GET))
+		goto error;
+
+	*cap = 0;
+	while (off < hdr->nlmsg_len) {
+		struct nlattr *na = (void *)((uintptr_t)hdr + off);
+		void *payload = (void *)((uintptr_t)na + NLA_HDRLEN);
+
+		if (na->nla_len > hdr->nlmsg_len - off)
+			goto error;
+		switch (na->nla_type) {
+		case RDMA_NLDEV_SYS_ATTR_MONITOR_MODE:
+			*cap = *(uint8_t *)payload;
+			return 0;
+		default:
+			break;
+		}
+		off += NLA_ALIGN(na->nla_len);
+	}
+
+	return 0;
+
+error:
+	return -EINVAL;
+}
+
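+/**
+ * Query whether the kernel driver supports the RDMA monitor.
+ *
+ * @param nl
+ *   Netlink socket of the RDMA kind (NETLINK_RDMA).
+ * @param[out] cap
+ *   Monitor mode reported by the kernel; 0 if the attribute is absent,
+ *   which is treated as "not supported".
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */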
+int
+mlx5_nl_rdma_monitor_cap_get(int nl, uint8_t *cap)
+{
+	union {
+		struct nlmsghdr nh;
+		uint8_t buf[NLMSG_HDRLEN];
+	} req = {
+		.nh = {
+			.nlmsg_len = NLMSG_LENGTH(0),
+			.nlmsg_type = RDMA_NL_GET_TYPE(RDMA_NL_NLDEV,
+						       RDMA_NLDEV_CMD_SYS_GET),
+			.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
+		},
+	};
+	uint32_t sn = MLX5_NL_SN_GENERATE;
+	int ret;
+
+	ret = mlx5_nl_send(nl, &req.nh, sn);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	ret = mlx5_nl_recv(nl, sn, mlx5_nl_rdma_monitor_cap_get_cb, cap);
+	if (ret < 0) {
+		rte_errno = -ret;
+		return ret;
+	}
+	return 0;
+}
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index 5230576006..8301485839 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -148,6 +148,7 @@ INTERNAL {
 	mlx5_nl_vlan_vmwa_delete; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_init; # WINDOWS_NO_EXPORT
 	mlx5_nl_rdma_monitor_info_get; # WINDOWS_NO_EXPORT
+	mlx5_nl_rdma_monitor_cap_get; # WINDOWS_NO_EXPORT
 
 	mlx5_os_umem_dereg;
 	mlx5_os_umem_reg;
diff --git a/drivers/net/mlx5/linux/mlx5_ethdev_os.c b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
index 5156d96b3a..6b2c25a7c2 100644
--- a/drivers/net/mlx5/linux/mlx5_ethdev_os.c
+++ b/drivers/net/mlx5/linux/mlx5_ethdev_os.c
@@ -736,7 +736,7 @@ mlx5_dev_interrupt_nl_cb(struct nlmsghdr *hdr, void *cb_arg)
 
 	if (mlx5_nl_parse_link_status_update(hdr, &if_index) < 0)
 		return;
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1)
+	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1 && !sh->rdma_monitor_supp)
 		mlx5_handle_port_info_update(&sh->cdev->dev_info, if_index, hdr->nlmsg_type);
 
 	for (i = 0; i < sh->max_port; i++) {
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 16b275c71e..d3fd77af58 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -3017,6 +3017,7 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 {
 	struct ibv_context *ctx = sh->cdev->ctx;
 	int nlsk_fd;
+	uint8_t rdma_monitor_supp = 0;
 
 	sh->intr_handle = mlx5_os_interrupt_handler_create
 		(RTE_INTR_INSTANCE_F_SHARED, true,
@@ -3025,20 +3026,34 @@ mlx5_os_dev_shared_handler_install(struct mlx5_dev_ctx_shared *sh)
 		DRV_LOG(ERR, "Failed to allocate intr_handle.");
 		return;
 	}
-	if (sh->cdev->config.probe_opt && sh->cdev->dev_info.port_num > 1) {
+	if (sh->cdev->config.probe_opt &&
+	    sh->cdev->dev_info.port_num > 1 &&
+	    !sh->rdma_monitor_supp) {
 		nlsk_fd = mlx5_nl_rdma_monitor_init();
 		if (nlsk_fd < 0) {
 			DRV_LOG(ERR, "Failed to create a socket for RDMA Netlink events: %s",
 				rte_strerror(rte_errno));
 			return;
 		}
-		sh->intr_handle_ib = mlx5_os_interrupt_handler_create
-			(RTE_INTR_INSTANCE_F_SHARED, true,
-			 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
-		if (sh->intr_handle_ib == NULL) {
-			DRV_LOG(ERR, "Fail to allocate intr_handle");
+		if (mlx5_nl_rdma_monitor_cap_get(nlsk_fd, &rdma_monitor_supp)) {
+			DRV_LOG(ERR, "Failed to query RDMA monitor support: %s",
+				rte_strerror(rte_errno));
+			close(nlsk_fd);
 			return;
 		}
+		sh->rdma_monitor_supp = rdma_monitor_supp;
+		if (sh->rdma_monitor_supp) {
+			sh->intr_handle_ib = mlx5_os_interrupt_handler_create
+				(RTE_INTR_INSTANCE_F_SHARED, true,
+				 nlsk_fd, mlx5_dev_interrupt_handler_ib, sh);
+			if (sh->intr_handle_ib == NULL) {
+				DRV_LOG(ERR, "Fail to allocate intr_handle");
+				close(nlsk_fd);
+				return;
+			}
+		} else {
+			close(nlsk_fd);
+		}
 	}
 	nlsk_fd = mlx5_nl_init(NETLINK_ROUTE, RTMGRP_LINK);
 	if (nlsk_fd < 0) {
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index adc21c272b..b6be4646ef 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1517,6 +1517,7 @@ struct mlx5_dev_ctx_shared {
 	uint32_t lag_rx_port_affinity_en:1;
 	/* lag_rx_port_affinity is supported. */
 	uint32_t hws_max_log_bulk_sz:5;
+	uint32_t rdma_monitor_supp:1;
 	/* Log of minimal HWS counters created hard coded. */
 	uint32_t hws_max_nb_counters; /* Maximal number for HWS counters. */
 	uint32_t max_port; /* Maximal IB device port index. */
-- 
2.34.1
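
The capability callback in this patch walks the Netlink attributes of the
reply by hand. A defensive version of such a walk might look like the sketch
below. find_u8_attr is a hypothetical helper, not part of mlx5_nl.c; it
assumes Linux's <linux/netlink.h> and, as an addition to what the patch does,
rejects attributes whose length is smaller than the attribute header, since a
zero length would otherwise keep the offset from advancing.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Look up a one-byte attribute of the given type in a Netlink message.
 * Returns 1 if found (value stored in *out), 0 if absent, -1 if malformed.
 * Hypothetical helper for illustration only.
 */
static int
find_u8_attr(const struct nlmsghdr *hdr, uint16_t type, uint8_t *out)
{
	size_t off = NLMSG_HDRLEN;

	assert(hdr != NULL && out != NULL);
	if (hdr->nlmsg_len < (size_t)NLMSG_HDRLEN)
		return -1;
	/* Only look at an attribute if its header fits in the message. */
	while (off + NLA_HDRLEN <= hdr->nlmsg_len) {
		struct nlattr na;

		memcpy(&na, (const uint8_t *)hdr + off, sizeof(na));
		/* Too short would stall the loop; too long overruns the message. */
		if (na.nla_len < NLA_HDRLEN || na.nla_len > hdr->nlmsg_len - off)
			return -1;
		if ((na.nla_type & NLA_TYPE_MASK) == type) {
			/* The payload must hold at least one byte. */
			if (na.nla_len < NLA_HDRLEN + sizeof(uint8_t))
				return -1;
			*out = *((const uint8_t *)hdr + off + NLA_HDRLEN);
			return 1;
		}
		off += NLA_ALIGN(na.nla_len);
	}
	return 0;
}
```

Distinguishing "absent" from "malformed" lets a caller treat a missing
attribute as "feature not supported" while still reporting a corrupt reply.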


* Re: [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-29  8:27         ` Minggang(Gavin) Li
@ 2024-10-29 16:07           ` Stephen Hemminger
  2024-10-30  8:16             ` Slava Ovsiienko
  0 siblings, 1 reply; 42+ messages in thread
From: Stephen Hemminger @ 2024-10-29 16:07 UTC (permalink / raw)
  To: Minggang(Gavin) Li
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland, Rongwei Liu

On Tue, 29 Oct 2024 16:27:25 +0800
"Minggang(Gavin) Li" <gavinl@nvidia.com> wrote:

> On 10/28/2024 11:47 PM, Stephen Hemminger wrote:
> > On Mon, 28 Oct 2024 11:18:18 +0200
> > "Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:
> >  
> >> +- ``probe_opt_en`` parameter [int]
> >> +
> >> +  A non-zero value optimizes the probe process, especially for large scale.
> >> +  PMD will hold the IB device information internally and reuse it.
> >> +
> >> +  By default, the PMD will set this value to 0.
> >> +  
> > Is there ever a case where this should not be used?
> >
> > It would be better to just detect and use it if available.
> > This driver does not need more options...  
> The new mechanism is required by only a few users. To avoid breaking
> production, the option limits the new way to those who actually need
> it. Once we see that the new way is reliable, we will change the
> default value.

I understand that philosophy but it leads to a maze of technical debt.
Has a full suite of tests been done with both settings of the option?
Have both values been tested on all combinations of platforms and OS releases?

My point is every option adds to the necessary test matrix geometrically!

* Re: [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-29 13:42         ` [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
@ 2024-10-29 16:20           ` Stephen Hemminger
  0 siblings, 0 replies; 42+ messages in thread
From: Stephen Hemminger @ 2024-10-29 16:20 UTC (permalink / raw)
  To: Minggang Li(Gavin)
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland, Rongwei Liu

On Tue, 29 Oct 2024 15:42:52 +0200
"Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:

> From: Rongwei Liu <rongweil@nvidia.com>
> 
> Add a new devarg probe_opt_en to control probe optimization
> in PMD.
> 
> By default, the value is 0 and no behavior changed.
> 
> Signed-off-by: Rongwei Liu <rongweil@nvidia.com>
> Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>

Once again, every option you introduce expands the test space by 2X.
"Do or Do not. There is no try"
Either it works all the time or it is a bad idea.

Sorry if I sound like a broken record, but the project I used to work on
had the same kind of "always add an option" policy. Every time
an option was changed, there was a 50/50 chance that it was broken because
that combination of options had not been tested since it was originally added
and was non-functional due to bit rot.


* Re: [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-29 13:42         ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
  2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
@ 2024-10-29 16:26           ` Stephen Hemminger
  2024-10-30  8:25             ` Minggang(Gavin) Li
  1 sibling, 1 reply; 42+ messages in thread
From: Stephen Hemminger @ 2024-10-29 16:26 UTC (permalink / raw)
  To: Minggang Li(Gavin)
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland

On Tue, 29 Oct 2024 15:42:56 +0200
"Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:

>  
> +* **Updated NVIDIA mlx5 driver.**
> +
> +  Optimized port probe in large scale.
> +  This feature enhances the efficiency of probing VF/SFs on a large scale
> +  by significantly reducing the probing time. To activate this feature,
> +  set ``probe_opt_en`` to a non-zero value during device probing. It
> +  leverages a capability from the RDMA driver, expected to be released in
> +  the upcoming kernel version 6.13 or its equivalent in OFED 24.10,
> +  specifically the RDMA monitor. For additional details on the limitations
> +  of devargs, refer to "doc/guides/nics/mlx5.rst".
> +
> +  If there are lots of VFs/SFs to be probed by the application, eg, 300
> +  VFs/SFs, the option should be enabled to save probing time.

IMHO the kernel parts have to be available in a released kernel version.
Otherwise the kernel API/ABI is not stable and there is a possibility of user confusion.

This needs to stay in the "awaiting upstream" state until the kernel is released.

* RE: [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-29 16:07           ` Stephen Hemminger
@ 2024-10-30  8:16             ` Slava Ovsiienko
  2024-10-30 19:05               ` Stephen Hemminger
  0 siblings, 1 reply; 42+ messages in thread
From: Slava Ovsiienko @ 2024-10-30  8:16 UTC (permalink / raw)
  To: Stephen Hemminger, Minggang(Gavin) Li
  Cc: Matan Azrad, Ori Kam, NBU-Contact-Thomas Monjalon (EXTERNAL),
	Dariusz Sosnowski, Bing Zhao, Suanming Mou, dev,
	Raslan Darawsheh, rongwei liu

Hi,


> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, October 29, 2024 6:07 PM
> To: Minggang(Gavin) Li <gavinl@nvidia.com>
> Cc: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> <matan@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-Thomas
> Monjalon (EXTERNAL) <thomas@monjalon.net>; Dariusz Sosnowski
> <dsosnowski@nvidia.com>; Bing Zhao <bingz@nvidia.com>; Suanming Mou
> <suanmingm@nvidia.com>; dev@dpdk.org; Raslan Darawsheh
> <rasland@nvidia.com>; rongwei liu <rongweil@nvidia.com>
> Subject: Re: [PATCH V2 3/7] net/mlx5: add new devargs to control probe
> optimization
> 
> On Tue, 29 Oct 2024 16:27:25 +0800
> "Minggang(Gavin) Li" <gavinl@nvidia.com> wrote:
> 
> > On 10/28/2024 11:47 PM, Stephen Hemminger wrote:
> > > On Mon, 28 Oct 2024 11:18:18 +0200
> > > "Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:
> > >
> > >> +- ``probe_opt_en`` parameter [int]
> > >> +
> > >> +  A non-zero value optimizes the probe process, especially for large
> scale.
> > >> +  PMD will hold the IB device information internally and reuse it.
> > >> +
> > >> +  By default, the PMD will set this value to 0.
> > >> +
> > > Is there ever a case where this should not be used?
> > >
> > > It would be better to just detect and use it if available.
> > > This driver does not need more options...
> > The new mechanism is required by only a few users. To avoid breaking
> > production, the option limits the new way to those who actually need
> > it. Once we see that the new way is reliable, we will change the
> > default value.
> 
> I understand that philosophy but it leads to a maze of technical debt.

This specific case is not about philosophy in general.

We have users with a huge number of SFs/VFs configured who are experiencing
extremely long probing times (literally tens of minutes). This story has lasted
a long time: we tried different approaches, then admitted we had to update the kernel,
etc., and eventually we got things done, which resulted in this series.

The new approach is event-driven and based on handling the new kernel-generated events.
So, it relies on the system-wide environment and might be problematic on some hosts (we do not
expect many such cases, though).

At the same time, the existing probe approach provides acceptable performance and
satisfies the vast majority of users. So, our main objective is not to break anything
in production (most users); the second objective is to resolve the issues of some users with
specific configurations (few users). That's why we would prefer to have the devarg
(with all its pros and cons) and set its default value to false. Later, once the new kernel
API spreads and we have good production statistics, we can consider changing the default
value to true or obsoleting the devarg altogether. Does this approach look reasonable?

> Has a full suite of tests been done with both settings of the option?
> Have both values been tested on all combinations of platforms and OS
> releases?

We cannot keep only the new approach; we have to maintain compatibility with legacy kernels.
So there will always be two branches of tests until the legacy kernels are retired. And having the devarg
might even simplify the testing: a single host can be used for both runs, with different devarg values.

> My point is every option adds to the necessary test matrix geometrically!

Once we added the new probing mechanics, the test matrix was ALREADY extended, regardless of the devarg
implementation. The devarg just makes our users' lives in the field easier.

With best regards,
Slava


* Re: [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor
  2024-10-29 16:26           ` Stephen Hemminger
@ 2024-10-30  8:25             ` Minggang(Gavin) Li
  0 siblings, 0 replies; 42+ messages in thread
From: Minggang(Gavin) Li @ 2024-10-30  8:25 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: viacheslavo, matan, orika, thomas, Dariusz Sosnowski, Bing Zhao,
	Suanming Mou, dev, rasland


On 10/30/2024 12:26 AM, Stephen Hemminger wrote:
> On Tue, 29 Oct 2024 15:42:56 +0200
> "Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:
>
>>   
>> +* **Updated NVIDIA mlx5 driver.**
>> +
>> +  Optimized port probe in large scale.
>> +  This feature enhances the efficiency of probing VF/SFs on a large scale
>> +  by significantly reducing the probing time. To activate this feature,
>> +  set ``probe_opt_en`` to a non-zero value during device probing. It
>> +  leverages a capability from the RDMA driver, expected to be released in
>> +  the upcoming kernel version 6.13 or its equivalent in OFED 24.10,
>> +  specifically the RDMA monitor. For additional details on the limitations
>> +  of devargs, refer to "doc/guides/nics/mlx5.rst".
>> +
>> +  If there are lots of VFs/SFs to be probed by the application, eg, 300
>> +  VFs/SFs, the option should be enabled to save probing time.
> IMHO the kernel parts have to be available in a released kernel version.
> Otherwise the kernel API/ABI is not stable and there is a possibility of user confusion.
>
> This needs to stay in the "awaiting upstream" state until the kernel is released.
Sorry, it's a typo. The required kernel is 6.12, which is currently in RC. Do you
think we should wait for it to be released before pushing the patch to DPDK
upstream?

* Re: [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization
  2024-10-30  8:16             ` Slava Ovsiienko
@ 2024-10-30 19:05               ` Stephen Hemminger
  0 siblings, 0 replies; 42+ messages in thread
From: Stephen Hemminger @ 2024-10-30 19:05 UTC (permalink / raw)
  To: Slava Ovsiienko
  Cc: Minggang(Gavin) Li, Matan Azrad, Ori Kam,
	NBU-Contact-Thomas Monjalon (EXTERNAL),
	Dariusz Sosnowski, Bing Zhao, Suanming Mou, dev,
	Raslan Darawsheh, rongwei liu

On Wed, 30 Oct 2024 08:16:58 +0000
Slava Ovsiienko <viacheslavo@nvidia.com> wrote:

> Hi,
> 
> 
> > -----Original Message-----
> > From: Stephen Hemminger <stephen@networkplumber.org>
> > Sent: Tuesday, October 29, 2024 6:07 PM
> > To: Minggang(Gavin) Li <gavinl@nvidia.com>
> > Cc: Slava Ovsiienko <viacheslavo@nvidia.com>; Matan Azrad
> > <matan@nvidia.com>; Ori Kam <orika@nvidia.com>; NBU-Contact-Thomas
> > Monjalon (EXTERNAL) <thomas@monjalon.net>; Dariusz Sosnowski
> > <dsosnowski@nvidia.com>; Bing Zhao <bingz@nvidia.com>; Suanming Mou
> > <suanmingm@nvidia.com>; dev@dpdk.org; Raslan Darawsheh
> > <rasland@nvidia.com>; rongwei liu <rongweil@nvidia.com>
> > Subject: Re: [PATCH V2 3/7] net/mlx5: add new devargs to control probe
> > optimization
> > 
> > On Tue, 29 Oct 2024 16:27:25 +0800
> > "Minggang(Gavin) Li" <gavinl@nvidia.com> wrote:
> >   
> > > On 10/28/2024 11:47 PM, Stephen Hemminger wrote:  
> > > > On Mon, 28 Oct 2024 11:18:18 +0200
> > > > "Minggang Li(Gavin)" <gavinl@nvidia.com> wrote:
> > > >  
> > > >> +- ``probe_opt_en`` parameter [int]
> > > >> +
> > > >> +  A non-zero value optimizes the probe process, especially for large  
> > scale.  
> > > >> +  PMD will hold the IB device information internally and reuse it.
> > > >> +
> > > >> +  By default, the PMD will set this value to 0.
> > > >> +  
> > > > Is there ever a case where this should not be used?
> > > >
> > > > It would be better to just detect and use it if available.
> > > > This driver does not need more options...  
> > > The new mechanism is required by only a few users. To avoid breaking
> > > production, the option limits the new way to those who actually need
> > > it. Once we see that the new way is reliable, we will change the
> > > default value.
> > 
> > I understand that philosophy but it leads to a maze of technical debt.  
> 
> This specific case is not about philosophy in general.
> 
> We have users with a huge number of SFs/VFs configured who are experiencing
> extremely long probing times (literally tens of minutes). This story has lasted
> a long time: we tried different approaches, then admitted we had to update the kernel,
> etc., and eventually we got things done, which resulted in this series.
> 
> The new approach is event-driven and based on handling the new kernel-generated events.
> So, it relies on the system-wide environment and might be problematic on some hosts (we do not
> expect many such cases, though).
> 
> At the same time, the existing probe approach provides acceptable performance and
> satisfies the vast majority of users. So, our main objective is not to break anything
> in production (most users); the second objective is to resolve the issues of some users with
> specific configurations (few users). That's why we would prefer to have the devarg
> (with all its pros and cons) and set its default value to false. Later, once the new kernel
> API spreads and we have good production statistics, we can consider changing the default
> value to true or obsoleting the devarg altogether. Does this approach look reasonable?


OK, I was just hoping that all this could be transparent to the users.
Ideally, the driver could detect whether the right versions of the components (rdma, kernel) are available
at run time and just do the fast thing.
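
The run-time detection suggested above could take roughly the following
shape. This is only a sketch under assumptions: cap_query_fn, probe_mode,
select_probe_mode and the three stub queries are hypothetical names, and
falling back to the legacy path when the query itself fails is a design choice
of the sketch (patch 7/7 instead logs an error and returns in that case).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: returns 0 and fills *cap on success. */
typedef int (*cap_query_fn)(void *ctx, uint8_t *cap);

enum probe_mode { PROBE_LEGACY, PROBE_EVENT_DRIVEN };

/* Pick the fast path only when the capability is positively reported. */
static enum probe_mode
select_probe_mode(cap_query_fn query, void *ctx)
{
	uint8_t cap = 0;

	assert(query != NULL);
	if (query(ctx, &cap) != 0 || cap == 0)
		return PROBE_LEGACY;
	return PROBE_EVENT_DRIVEN;
}

/* Stub queries standing in for a real kernel capability query. */
static int
query_supported(void *ctx, uint8_t *cap)
{
	(void)ctx;
	*cap = 1;
	return 0;
}

static int
query_unsupported(void *ctx, uint8_t *cap)
{
	(void)ctx;
	*cap = 0;
	return 0;
}

static int
query_failed(void *ctx, uint8_t *cap)
{
	(void)ctx;
	(void)cap;
	return -1;
}
```

With such a selector the choice is made once at install time, so the event
handlers only need to consult the stored mode.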

Thread overview: 42+ messages
2024-10-16  8:38 [PATCH V1 0/7] port probe time optimization Minggang Li(Gavin)
2024-10-16  8:38 ` [PATCH V1 1/7] mailmap: update user name Minggang Li(Gavin)
2024-10-16  8:38 ` [PATCH V1 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
2024-10-16  8:38 ` [PATCH V1 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
2024-10-16  8:38 ` [PATCH V1 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
2024-10-16  8:38 ` [PATCH V1 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
2024-10-16  8:38 ` [PATCH V1 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
2024-10-16  8:38 ` [PATCH V1 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
2024-10-28  9:18   ` [PATCH V2 0/7] port probe time optimization Minggang Li(Gavin)
2024-10-28  9:18     ` [PATCH V2 1/7] mailmap: update user name Minggang Li(Gavin)
2024-10-28  9:18     ` [PATCH V2 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
2024-10-28  9:18     ` [PATCH V2 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
2024-10-28 15:47       ` Stephen Hemminger
2024-10-29  8:27         ` Minggang(Gavin) Li
2024-10-29 16:07           ` Stephen Hemminger
2024-10-30  8:16             ` Slava Ovsiienko
2024-10-30 19:05               ` Stephen Hemminger
2024-10-28  9:18     ` [PATCH V2 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
2024-10-28  9:18     ` [PATCH V2 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
2024-10-28  9:18     ` [PATCH V2 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
2024-10-28  9:18     ` [PATCH V2 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
2024-10-28 15:49       ` Stephen Hemminger
2024-10-29  8:31         ` Minggang(Gavin) Li
2024-10-29 13:42       ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
2024-10-29 13:42         ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
2024-10-29 13:42         ` [PATCH V3 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
2024-10-29 13:42         ` [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
2024-10-29 16:20           ` Stephen Hemminger
2024-10-29 13:42         ` [PATCH V3 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
2024-10-29 13:42         ` [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
2024-10-29 13:42         ` [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
2024-10-29 13:42         ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
2024-10-29 14:31           ` [PATCH V3 0/7] port probe time optimization Minggang Li(Gavin)
2024-10-29 14:31             ` [PATCH V3 1/7] mailmap: update user name Minggang Li(Gavin)
2024-10-29 14:31             ` [PATCH V3 2/7] net/mlx5: optimize device probing Minggang Li(Gavin)
2024-10-29 14:31             ` [PATCH V3 3/7] net/mlx5: add new devargs to control probe optimization Minggang Li(Gavin)
2024-10-29 14:31             ` [PATCH V3 4/7] common/mlx5: fix Netlink socket leak Minggang Li(Gavin)
2024-10-29 14:31             ` [PATCH V3 5/7] common/mlx5: add RDMA monitor event awareness Minggang Li(Gavin)
2024-10-29 14:31             ` [PATCH V3 6/7] mlx5: use RDMA Netlink to update port information Minggang Li(Gavin)
2024-10-29 14:31             ` [PATCH V3 7/7] mlx5: add backward compatibility for RDMA monitor Minggang Li(Gavin)
2024-10-29 16:26           ` Stephen Hemminger
2024-10-30  8:25             ` Minggang(Gavin) Li
