DPDK patches and discussions
* [PATCH 0/8] net/mlx5: add Multiport E-Switch support
@ 2023-10-31 14:27 Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 1/8] net/mlx5/hws: fix leak in FT management Dariusz Sosnowski
                   ` (8 more replies)
  0 siblings, 9 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev, Raslan Darawsheh

This patchset adds support for probing ports of a Multiport
E-Switch device to mlx5 PMD.

Multiport E-Switch is a configuration of NVIDIA ConnectX/BlueField HCAs
where all connected entities (i.e. physical ports, VFs and SFs)
share the same switch domain.
In this mode, applications are allowed to create transfer flow rules
which explicitly match on the physical port on which traffic
arrives and/or on VFs and SFs, regardless of the root PF.
On top of that, forwarding to any of these entities is allowed.
Notably, applications are allowed to explicitly forward traffic
to any of the physical ports of the HCA.

Bing Zhao (1):
  net/mlx5: add support for vport match selection

Dariusz Sosnowski (6):
  common/mlx5: fix controller index parsing
  common/mlx5: add Netlink check for Multiport E-Switch
  net/mlx5: add sysfs check for Multiport E-Switch
  net/mlx5: add checking Multiport E-Switch state
  net/mlx5: support port probing of Multiport E-Switch device
  net/mlx5: sort port spawn data with uplink ports first

Itamar Gozlan (1):
  net/mlx5/hws: fix leak in FT management

 doc/guides/nics/mlx5.rst                   | 157 +++++++++
 doc/guides/rel_notes/release_23_11.rst     |   1 +
 drivers/common/mlx5/linux/mlx5_common_os.c |   5 +-
 drivers/common/mlx5/linux/mlx5_nl.c        |  70 ++++
 drivers/common/mlx5/linux/mlx5_nl.h        |   5 +
 drivers/common/mlx5/mlx5_common.h          |   1 +
 drivers/common/mlx5/version.map            |   2 +
 drivers/net/mlx5/hws/mlx5dr_matcher.c      |  41 +--
 drivers/net/mlx5/linux/mlx5_os.c           | 379 +++++++++++++++++++--
 drivers/net/mlx5/mlx5.c                    |  17 +
 drivers/net/mlx5/mlx5.h                    |  41 +++
 drivers/net/mlx5/mlx5_ethdev.c             |  53 ++-
 drivers/net/mlx5/mlx5_flow_dv.c            |   2 +-
 drivers/net/mlx5/mlx5_flow_hw.c            |   4 +-
 drivers/net/mlx5/mlx5_mac.c                |   8 +-
 drivers/net/mlx5/mlx5_trigger.c            |   5 +-
 16 files changed, 718 insertions(+), 73 deletions(-)

-- 
2.25.1



* [PATCH 1/8] net/mlx5/hws: fix leak in FT management
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 2/8] common/mlx5: fix controller index parsing Dariusz Sosnowski
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou, Itamar Gozlan
  Cc: dev, Raslan Darawsheh

From: Itamar Gozlan <igozlan@nvidia.com>

This commit fixes two leaks in flow table management.
The first leak was when the default miss table of a flow table was not
reset to the default action when setting a new first matcher.
The second leak was caused by a missing free for an RTC in the case of
disconnecting the last matcher in a table's matcher list.

Fixes: b81f95ca770d ("net/mlx5/hws: support default miss table")

Signed-off-by: Itamar Gozlan <igozlan@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/hws/mlx5dr_matcher.c | 41 +++++++--------------------
 1 file changed, 10 insertions(+), 31 deletions(-)

diff --git a/drivers/net/mlx5/hws/mlx5dr_matcher.c b/drivers/net/mlx5/hws/mlx5dr_matcher.c
index a82c182460..ebe42c44c6 100644
--- a/drivers/net/mlx5/hws/mlx5dr_matcher.c
+++ b/drivers/net/mlx5/hws/mlx5dr_matcher.c
@@ -253,15 +253,15 @@ static int mlx5dr_matcher_connect(struct mlx5dr_matcher *matcher)
 		goto remove_from_list;
 	}
 
-	if (prev) {
-		/* Reset next miss FT to default (drop refcount) */
-		ret = mlx5dr_table_ft_set_default_next_ft(tbl, prev->end_ft);
-		if (ret) {
-			DR_LOG(ERR, "Failed to reset matcher ft default miss");
-			goto remove_from_list;
-		}
-	} else {
-		/* Update tables missing to current table */
+	/* Reset next miss FT to default (drop refcount) */
+	ret = mlx5dr_table_ft_set_default_next_ft(tbl, prev ? prev->end_ft : tbl->ft);
+	if (ret) {
+		DR_LOG(ERR, "Failed to reset matcher ft default miss");
+		goto remove_from_list;
+	}
+
+	if (!prev) {
+		/* Update tables missing to current matcher in the table */
 		ret = mlx5dr_table_update_connected_miss_tables(tbl);
 		if (ret) {
 			DR_LOG(ERR, "Fatal error, failed to update connected miss table");
@@ -276,27 +276,6 @@ static int mlx5dr_matcher_connect(struct mlx5dr_matcher *matcher)
 	return ret;
 }
 
-static int mlx5dr_last_matcher_disconnect(struct mlx5dr_table *tbl,
-					  struct mlx5dr_devx_obj *prev_ft)
-{
-	struct mlx5dr_cmd_ft_modify_attr ft_attr = {0};
-
-	if (tbl->default_miss.miss_tbl) {
-		/* Connect new last matcher to next miss_tbl if exists */
-		return mlx5dr_table_connect_to_miss_table(tbl,
-							  tbl->default_miss.miss_tbl);
-	} else {
-		ft_attr.modify_fs = MLX5_IFC_MODIFY_FLOW_TABLE_RTC_ID;
-		ft_attr.type = tbl->fw_ft_type;
-		/* Matcher is last, point prev end FT to default miss */
-		mlx5dr_cmd_set_attr_connect_miss_tbl(tbl->ctx,
-						     tbl->fw_ft_type,
-						     tbl->type,
-						     &ft_attr);
-		return mlx5dr_cmd_flow_table_modify(prev_ft, &ft_attr);
-	}
-}
-
 static int mlx5dr_matcher_disconnect(struct mlx5dr_matcher *matcher)
 {
 	struct mlx5dr_matcher *tmp_matcher, *prev_matcher;
@@ -330,7 +309,7 @@ static int mlx5dr_matcher_disconnect(struct mlx5dr_matcher *matcher)
 			goto matcher_reconnect;
 		}
 	} else {
-		ret = mlx5dr_last_matcher_disconnect(tbl, prev_ft);
+		ret = mlx5dr_table_connect_to_miss_table(tbl, tbl->default_miss.miss_tbl);
 		if (ret) {
 			DR_LOG(ERR, "Failed to disconnect last matcher");
 			goto matcher_reconnect;
-- 
2.25.1



* [PATCH 2/8] common/mlx5: fix controller index parsing
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 1/8] net/mlx5/hws: fix leak in FT management Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 3/8] common/mlx5: add Netlink check for Multiport E-Switch Dariusz Sosnowski
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou, Xueming Li
  Cc: dev, Raslan Darawsheh, stable

When probing the Linux kernel network interfaces attached to E-Switch,
mlx5 PMD decides the representor type and represented entity
using phys_port_name exposed by the mlx5 kernel driver in sysfs.
mlx5 PMD first checks this name for multihost controller index.
In multihost scenarios, phys_port_name is prefixed with a "c[0-9]+" string,
where the integer is the controller index.

When phys_port_name contains a string representing a physical
port, i.e. a "p[0-9]+" string, the parsing logic is incorrect.
Both "p[0-9]+" and "c[0-9]+" match the format string used to parse
phys_port_name, so the controller index is filled in
even for a physical port name.

This patch fixes this behavior by storing the parsed index
in a temporary variable and setting controller index
if and only if phys_port_name matches multihost controller syntax.
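
A minimal standalone sketch of the idea follows. It is not the PMD code:
parse_ctrl_index() is a hypothetical helper written only to show why the
parsed number has to go through a temporary variable.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of mlx5 PMD.
 * Returns 1 and stores the controller index if name starts with
 * "c<number>". Otherwise returns 0 and leaves *ctrl_num untouched.
 */
static int
parse_ctrl_index(const char *name, int32_t *ctrl_num)
{
	char prefix = 0;
	int parsed = -1;

	/*
	 * "%c%d" matches both "c1pf0vf2" and "p0", so the prefix must be
	 * checked before the parsed number is committed to the output.
	 */
	if (sscanf(name, "%c%d", &prefix, &parsed) == 2 && prefix == 'c') {
		*ctrl_num = parsed;
		return 1;
	}
	return 0;
}
```

With this shape, parsing "p0" leaves the caller's controller index at
whatever value it had before the call.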

Fixes: 59df97f1a832 ("common/mlx5: support sub-function representor parsing")
Cc: xuemingl@nvidia.com
Cc: stable@dpdk.org

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_common_os.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index 7260c1a19f..41345e1597 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -96,10 +96,11 @@ mlx5_translate_port_name(const char *port_name_in,
 	char ctrl = 0, pf_c1, pf_c2, vf_c1, vf_c2, eol;
 	char *end;
 	int sc_items;
+	int32_t ctrl_num = -1;
 
-	sc_items = sscanf(port_name_in, "%c%d",
-			  &ctrl, &port_info_out->ctrl_num);
+	sc_items = sscanf(port_name_in, "%c%d", &ctrl, &ctrl_num);
 	if (sc_items == 2 && ctrl == 'c') {
+		port_info_out->ctrl_num = ctrl_num;
 		port_name_in++; /* 'c' */
 		port_name_in += snprintf(NULL, 0, "%d",
 					  port_info_out->ctrl_num);
-- 
2.25.1



* [PATCH 3/8] common/mlx5: add Netlink check for Multiport E-Switch
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 1/8] net/mlx5/hws: fix leak in FT management Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 2/8] common/mlx5: fix controller index parsing Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 4/8] net/mlx5: add sysfs " Dariusz Sosnowski
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev, Raslan Darawsheh

This patch implements checking if Multiport E-Switch is enabled
on a given PCI device using Devlink Linux kernel interface.
This facility will be used in follow up commits, which add support
for such configuration to mlx5 PMD.

If mlx5_core Linux kernel module supports Multiport E-Switch,
then it can be configured through a Devlink boolean parameter
"esw_multiport". Checking the value of this parameter
is implemented in mlx5_nl_devlink_esw_multiport_get() function.
If such parameter does not exist, this function returns -EINVAL.

To manually check if mlx5_core kernel module supports "esw_multiport"
parameter, and check if Multiport E-Switch is enabled,
one can use the following command:

  # pci/0000:08:00.0 should be substituted with the PCI device address
  # in format pci/<domain>:<bus>:<device>.<function>.
  $ devlink dev param show pci/0000:08:00.0 name esw_multiport
  pci/0000:08:00.0:
    name esw_multiport type driver-specific
      values:
        cmode runtime value true
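
For scripting around this manual check, the textual output can be
interpreted as in the sketch below. This is only an illustration:
the PMD itself queries Devlink over Netlink, the function name is
hypothetical, and the sketch assumes that a disabled parameter is printed
as "value false" and that no "value" line is printed on failure.

```c
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper, not part of mlx5 PMD.
 * Returns 1 if the text reports "value true", 0 for "value false"
 * (assumed spelling), and -EINVAL if no usable value is found.
 */
static int
esw_multiport_from_devlink_text(const char *text)
{
	static const char key[] = "value ";
	/* "values:" does not match, since the key ends with a space. */
	const char *val = strstr(text, key);

	if (val == NULL)
		return -EINVAL;
	val += sizeof(key) - 1;
	if (strncmp(val, "true", 4) == 0)
		return 1;
	if (strncmp(val, "false", 5) == 0)
		return 0;
	return -EINVAL;
}
```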

If the parameter is not supported, the Devlink command fails with
an "Invalid argument" error.

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/common/mlx5/linux/mlx5_nl.c | 70 +++++++++++++++++++++++++++++
 drivers/common/mlx5/linux/mlx5_nl.h |  5 +++
 drivers/common/mlx5/version.map     |  2 +
 3 files changed, 77 insertions(+)

diff --git a/drivers/common/mlx5/linux/mlx5_nl.c b/drivers/common/mlx5/linux/mlx5_nl.c
index 33670bb5c8..28a1f56dba 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.c
+++ b/drivers/common/mlx5/linux/mlx5_nl.c
@@ -1962,3 +1962,73 @@ mlx5_nl_read_events(int nlsk_fd, mlx5_nl_event_cb *cb, void *cb_arg)
 	}
 	return 0;
 }
+
+static int
+mlx5_nl_esw_multiport_cb(struct nlmsghdr *nh, void *arg)
+{
+
+	int ret = -EINVAL;
+	int *enable = arg;
+	struct nlattr *tail = RTE_PTR_ADD(nh, nh->nlmsg_len);
+	struct nlattr *nla = RTE_PTR_ADD(nh, NLMSG_ALIGN(sizeof(*nh)) +
+					NLMSG_ALIGN(sizeof(struct genlmsghdr)));
+
+	while (nla->nla_len && nla < tail) {
+		switch (nla->nla_type) {
+		/* Expected nested attributes case. */
+		case DEVLINK_ATTR_PARAM:
+		case DEVLINK_ATTR_PARAM_VALUES_LIST:
+		case DEVLINK_ATTR_PARAM_VALUE:
+			ret = 0;
+			nla += 1;
+			break;
+		case DEVLINK_ATTR_PARAM_VALUE_DATA:
+			*enable = 1;
+			return 0;
+		default:
+			nla = RTE_PTR_ADD(nla, NLMSG_ALIGN(nla->nla_len));
+		}
+	}
+	*enable = 0;
+	return ret;
+}
+
+#define NL_ESW_MULTIPORT_PARAM "esw_multiport"
+
+int
+mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id, const char *pci_addr, int *enable)
+{
+	struct nlmsghdr *nlh;
+	struct genlmsghdr *genl;
+	uint32_t sn = MLX5_NL_SN_GENERATE;
+	int ret;
+	uint8_t buf[NLMSG_ALIGN(sizeof(struct nlmsghdr)) +
+		    NLMSG_ALIGN(sizeof(struct genlmsghdr)) +
+		    NLMSG_ALIGN(sizeof(struct nlattr)) * 4 +
+		    NLMSG_ALIGN(MLX5_NL_MAX_ATTR_SIZE) * 4];
+
+	memset(buf, 0, sizeof(buf));
+	nlh = (struct nlmsghdr *)buf;
+	nlh->nlmsg_len = sizeof(struct nlmsghdr);
+	nlh->nlmsg_type = family_id;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+	genl = (struct genlmsghdr *)nl_msg_tail(nlh);
+	nlh->nlmsg_len += sizeof(struct genlmsghdr);
+	genl->cmd = DEVLINK_CMD_PARAM_GET;
+	genl->version = DEVLINK_GENL_VERSION;
+	nl_attr_put(nlh, DEVLINK_ATTR_BUS_NAME, "pci", 4);
+	nl_attr_put(nlh, DEVLINK_ATTR_DEV_NAME, pci_addr, strlen(pci_addr) + 1);
+	nl_attr_put(nlh, DEVLINK_ATTR_PARAM_NAME,
+		    NL_ESW_MULTIPORT_PARAM, sizeof(NL_ESW_MULTIPORT_PARAM));
+	ret = mlx5_nl_send(nlsk_fd, nlh, sn);
+	if (ret >= 0)
+		ret = mlx5_nl_recv(nlsk_fd, sn, mlx5_nl_esw_multiport_cb, enable);
+	if (ret < 0) {
+		DRV_LOG(DEBUG, "Failed to get Multiport E-Switch enable on device %s: %d.",
+			pci_addr, ret);
+		return ret;
+	}
+	DRV_LOG(DEBUG, "Multiport E-Switch is %sabled for device \"%s\".",
+		*enable ? "en" : "dis", pci_addr);
+	return ret;
+}
diff --git a/drivers/common/mlx5/linux/mlx5_nl.h b/drivers/common/mlx5/linux/mlx5_nl.h
index db01d7323e..580de3b769 100644
--- a/drivers/common/mlx5/linux/mlx5_nl.h
+++ b/drivers/common/mlx5/linux/mlx5_nl.h
@@ -71,6 +71,7 @@ __rte_internal
 uint32_t mlx5_nl_vlan_vmwa_create(struct mlx5_nl_vlan_vmwa_context *vmwa,
 				  uint32_t ifindex, uint16_t tag);
 
+__rte_internal
 int mlx5_nl_devlink_family_id_get(int nlsk_fd);
 int mlx5_nl_enable_roce_get(int nlsk_fd, int family_id, const char *pci_addr,
 			    int *enable);
@@ -82,4 +83,8 @@ int mlx5_nl_read_events(int nlsk_fd, mlx5_nl_event_cb *cb, void *cb_arg);
 __rte_internal
 int mlx5_nl_parse_link_status_update(struct nlmsghdr *hdr, uint32_t *ifindex);
 
+__rte_internal
+int mlx5_nl_devlink_esw_multiport_get(int nlsk_fd, int family_id,
+				      const char *pci_addr, int *enable);
+
 #endif /* RTE_PMD_MLX5_NL_H_ */
diff --git a/drivers/common/mlx5/version.map b/drivers/common/mlx5/version.map
index 0758ba76de..074eed46fd 100644
--- a/drivers/common/mlx5/version.map
+++ b/drivers/common/mlx5/version.map
@@ -125,6 +125,8 @@ INTERNAL {
 	mlx5_mr_addr2mr_bh;
 
 	mlx5_nl_allmulti; # WINDOWS_NO_EXPORT
+	mlx5_nl_devlink_esw_multiport_get; # WINDOWS_NO_EXPORT
+	mlx5_nl_devlink_family_id_get; # WINDOWS_NO_EXPORT
 	mlx5_nl_ifindex; # WINDOWS_NO_EXPORT
 	mlx5_nl_init; # WINDOWS_NO_EXPORT
 	mlx5_nl_mac_addr_add; # WINDOWS_NO_EXPORT
-- 
2.25.1



* [PATCH 4/8] net/mlx5: add sysfs check for Multiport E-Switch
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
                   ` (2 preceding siblings ...)
  2023-10-31 14:27 ` [PATCH 3/8] common/mlx5: add Netlink check for Multiport E-Switch Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 16:09   ` Stephen Hemminger
  2023-10-31 14:27 ` [PATCH 5/8] net/mlx5: add checking Multiport E-Switch state Dariusz Sosnowski
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev, Raslan Darawsheh

This patch implements checking if Multiport E-Switch is enabled
on a given PCI device, using sysfs Linux kernel interface.
This facility will be used in follow up commits,
which add support for such configuration to mlx5 PMD.

MLNX_OFED mlx5_core kernel module versions which support
Multiport E-Switch do not expose this configuration through Devlink,
but through sysfs interface.
If such a version is used, then Multiport E-Switch can be enabled
(or its state can be probed) through a sysfs file under path:

  # <ifname> should be substituted with Linux interface name.
  /sys/class/net/<ifname>/compat/devlink/lag_port_select_mode

Writing "multiport_esw" to this file enables Multiport E-Switch.
If "multiport_esw" is read from this file, then
Multiport E-Switch is enabled.

If this file does not exist, or writing "multiport_esw" to this file
raises an error, then Multiport E-Switch is not supported.
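
The read side of this check can be sketched in plain C as follows.
This is a simplified illustration, not the function added by this patch:
mpesw_enabled_from_file() is a hypothetical helper which takes the full
path of the sysfs file and skips the interface lookup done by the PMD.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of mlx5 PMD.
 * path is the full path of the lag_port_select_mode file.
 * Returns 1 if "multiport_esw" is read, 0 for any other value,
 * and a negative errno value if the file cannot be opened or parsed.
 */
static int
mpesw_enabled_from_file(const char *path)
{
	char mode[17] = "";
	FILE *f = fopen(path, "r");
	int n;

	if (f == NULL)
		return -errno;
	/* Read at most 16 characters, leaving room for the terminator. */
	n = fscanf(f, "%16s", mode);
	fclose(f);
	if (n != 1)
		return -EINVAL;
	return strcmp(mode, "multiport_esw") == 0;
}
```

A negative return value here means only that the state could not be read;
whether the feature is supported at all is a separate question.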

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_os.c | 69 ++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 2f08f2354e..7a656a7237 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1931,6 +1931,75 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 	return pf;
 }
 
+#define SYSFS_MPESW_PARAM_MAX_LEN 16
+
+static __rte_unused int
+mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled)
+{
+	int nl_rdma;
+	unsigned int n_ports;
+	unsigned int i;
+	int ret;
+
+	/* Provide correct value to have defined enabled state in case of an error. */
+	*enabled = 0;
+	nl_rdma = mlx5_nl_init(NETLINK_RDMA, 0);
+	if (nl_rdma < 0)
+		return nl_rdma;
+	n_ports = mlx5_nl_portnum(nl_rdma, ibv->name);
+	if (!n_ports) {
+		ret = -rte_errno;
+		goto close_nl_rdma;
+	}
+	for (i = 1; i <= n_ports; ++i) {
+		unsigned int ifindex;
+		char ifname[IF_NAMESIZE + 1];
+		struct rte_pci_addr if_pci_addr;
+		char mpesw[SYSFS_MPESW_PARAM_MAX_LEN + 1];
+		FILE *sysfs;
+		int n;
+
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		if (!ifindex)
+			continue;
+		if (!if_indextoname(ifindex, ifname))
+			continue;
+		MKSTR(sysfs_if_path, "/sys/class/net/%s", ifname);
+		if (mlx5_get_pci_addr(sysfs_if_path, &if_pci_addr))
+			continue;
+		if (pci_addr->domain != if_pci_addr.domain ||
+		    pci_addr->bus != if_pci_addr.bus ||
+		    pci_addr->devid != if_pci_addr.devid ||
+		    pci_addr->function != if_pci_addr.function)
+			continue;
+		MKSTR(sysfs_mpesw_path,
+		      "/sys/class/net/%s/compat/devlink/lag_port_select_mode", ifname);
+		sysfs = fopen(sysfs_mpesw_path, "r");
+		if (!sysfs)
+			continue;
+		n = fscanf(sysfs, "%" RTE_STR(SYSFS_MPESW_PARAM_MAX_LEN) "s", mpesw);
+		fclose(sysfs);
+		if (n != 1)
+			continue;
+		ret = 0;
+		if (strcmp(mpesw, "multiport_esw") == 0) {
+			*enabled = 1;
+			break;
+		}
+		*enabled = 0;
+		break;
+	}
+	if (i > n_ports) {
+		DRV_LOG(DEBUG, "Unable to get Multiport E-Switch state by sysfs.");
+		rte_errno = ENOENT;
+		ret = -rte_errno;
+	}
+
+close_nl_rdma:
+	close(nl_rdma);
+	return ret;
+}
+
 /**
  * Register a PCI device within bonding.
  *
-- 
2.25.1



* [PATCH 5/8] net/mlx5: add checking Multiport E-Switch state
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
                   ` (3 preceding siblings ...)
  2023-10-31 14:27 ` [PATCH 4/8] net/mlx5: add sysfs " Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 6/8] net/mlx5: support port probing of Multiport E-Switch device Dariusz Sosnowski
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev, Raslan Darawsheh

This patch implements checking if Multiport E-Switch is enabled
on a given PCI device. mlx5_is_mpesw_enabled() implements this
functionality and it will be used in a follow up commit.

mlx5_is_mpesw_enabled() first checks if E-Switch state can be probed
using Devlink device parameter. If it cannot be checked or
there was an error, then sysfs interface will be probed.
If that fails, mlx5_is_mpesw_enabled() assumes that Multiport E-Switch
is disabled and returns an error.
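
The fallback order described above can be sketched independently of the
PMD as a chain of probe callbacks. All names below are hypothetical, the
two stubs only stand in for the Devlink and sysfs checks, and the
-ENOTSUP return value is an assumption made for the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* A probe returns >= 0 on success and fills *enabled, or < 0 on error. */
typedef int (*mpesw_probe_fn)(int *enabled);

/* Stand-in for a Devlink check which is unavailable. */
static int
stub_devlink_unavailable(int *enabled)
{
	*enabled = 0;
	return -EINVAL;
}

/* Stand-in for a sysfs check which reports the enabled state. */
static int
stub_sysfs_enabled(int *enabled)
{
	*enabled = 1;
	return 0;
}

/*
 * Try each probe in order and stop at the first one which succeeds.
 * If all probes fail, report the feature as disabled and return an error.
 */
static int
mpesw_state_get(const mpesw_probe_fn *probes, size_t n, int *enabled)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (probes[i](enabled) >= 0)
			return 0;
	}
	*enabled = 0;
	return -ENOTSUP;
}
```

Each probe resets *enabled before doing any work, so a failed first probe
cannot leak a stale value into the result of the second one.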

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_os.c | 51 +++++++++++++++++++++++++++++++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 7a656a7237..8a57edc470 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1931,9 +1931,38 @@ mlx5_device_bond_pci_match(const char *ibdev_name,
 	return pf;
 }
 
+static int
+mlx5_nl_esw_multiport_get(struct rte_pci_addr *pci_addr, int *enabled)
+{
+	char pci_addr_str[PCI_PRI_STR_SIZE] = { 0 };
+	int nlsk_fd;
+	int devlink_id;
+	int ret;
+
+	/* Provide correct value to have defined enabled state in case of an error. */
+	*enabled = 0;
+	rte_pci_device_name(pci_addr, pci_addr_str, sizeof(pci_addr_str));
+	nlsk_fd = mlx5_nl_init(NETLINK_GENERIC, 0);
+	if (nlsk_fd < 0)
+		return nlsk_fd;
+	devlink_id = mlx5_nl_devlink_family_id_get(nlsk_fd);
+	if (devlink_id < 0) {
+		ret = devlink_id;
+		DRV_LOG(DEBUG, "Unable to get devlink family id for Multiport E-Switch checks "
+			       "by netlink, for PCI device %s", pci_addr_str);
+		goto close_nlsk_fd;
+	}
+	ret = mlx5_nl_devlink_esw_multiport_get(nlsk_fd, devlink_id, pci_addr_str, enabled);
+	if (ret < 0)
+		DRV_LOG(DEBUG, "Unable to get Multiport E-Switch state by Netlink.");
+close_nlsk_fd:
+	close(nlsk_fd);
+	return ret;
+}
+
 #define SYSFS_MPESW_PARAM_MAX_LEN 16
 
-static __rte_unused int
+static int
 mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_addr, int *enabled)
 {
 	int nl_rdma;
@@ -2000,6 +2029,26 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	return ret;
 }
 
+static __rte_unused int
+mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled)
+{
+	/*
+	 * Try getting Multiport E-Switch state through netlink interface.
+	 * If unable, try sysfs interface. If that is unable as well,
+	 * assume that Multiport E-Switch is disabled and return an error.
+	 */
+	if (mlx5_nl_esw_multiport_get(ibv_pci_addr, enabled) >= 0 ||
+	    mlx5_sysfs_esw_multiport_get(ibv, ibv_pci_addr, enabled) >= 0)
+		return 0;
+	DRV_LOG(DEBUG, "Unable to check MPESW state for IB device %s "
+		       "(PCI: " PCI_PRI_FMT ")",
+		       ibv->name,
+		       ibv_pci_addr->domain, ibv_pci_addr->bus,
+		       ibv_pci_addr->devid, ibv_pci_addr->function);
+	*enabled = 0;
+	return -rte_errno;
+}
+
 /**
  * Register a PCI device within bonding.
  *
-- 
2.25.1



* [PATCH 6/8] net/mlx5: support port probing of Multiport E-Switch device
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
                   ` (4 preceding siblings ...)
  2023-10-31 14:27 ` [PATCH 5/8] net/mlx5: add checking Multiport E-Switch state Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 7/8] net/mlx5: sort port spawn data with uplink ports first Dariusz Sosnowski
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev, Raslan Darawsheh, Bing Zhao

This patch adds support for probing ports of a Multiport
E-Switch device to mlx5 PMD.

Multiport E-Switch is a configuration of NVIDIA ConnectX/BlueField HCAs
where all connected entities (i.e. physical ports, VFs and SFs)
share the same switch domain.
In this mode, applications are allowed to create transfer flow rules
which explicitly match on the physical port on which traffic
arrives and/or on VFs and SFs, regardless of the root PF.
On top of that, forwarding to any of these entities is allowed.
Notably, applications are allowed to explicitly forward traffic
to any of the physical ports of the HCA.

This patch implements the following procedure for probing ports
of the device configured as Multiport E-Switch:

1. EAL calls mlx5 PMD to probe a certain PCI device (with address BDF).
2. mlx5 PMD iterates over all existing IB devices:
   2.1. Check if IB device has a PCI address which matches BDF.
   2.2. Check if IB device is configured as Multiport E-Switch device,
        using devlink interface.
   2.3. Iterate over all IB ports of this device to find a netdev with
        matching PCI address.
        If any is found, IB device is chosen to instantiate DPDK ports
        from it.
3. Iterate over all IB ports of the selected IB device,
   to choose which ports to instantiate:
   3.1. Choose IB ports which match the selected representor ports
        (selected through representor devarg).
        Instantiate DPDK ports based on those.
   3.2. If IB port represents an uplink port and this port corresponds
        to the probed PCI device, the instantiated DPDK port is selected
        as a switch master port.
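
Step 3.2 of the procedure above can be sketched with simplified data
structures. The types and the function below are hypothetical stand-ins
and do not correspond to the structures used in mlx5_os_pci_probe_pf().

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a PCI address, not the DPDK rte_pci_addr type. */
struct bdf_sketch {
	unsigned int domain, bus, devid, function;
};

/* Simplified stand-in for one IB port of the selected IB device. */
struct ib_port_sketch {
	bool is_uplink;               /* port represents a physical port */
	struct bdf_sketch netdev_bdf; /* PCI address of the related netdev */
};

static bool
bdf_equal(const struct bdf_sketch *a, const struct bdf_sketch *b)
{
	return a->domain == b->domain && a->bus == b->bus &&
	       a->devid == b->devid && a->function == b->function;
}

/*
 * Return the index of the port to be used as the switch master:
 * the uplink port whose netdev matches the probed PCI address.
 * Returns -1 if there is no such port.
 */
static int
find_switch_master(const struct ib_port_sketch *ports, size_t n,
		   const struct bdf_sketch *probed)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (ports[i].is_uplink &&
		    bdf_equal(&ports[i].netdev_bdf, probed))
			return (int)i;
	}
	return -1;
}
```

With two uplinks at 0000:08:00.0 and 0000:08:00.1, probing 0000:08:00.1
selects the second uplink, and non-uplink ports are never selected.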

The bulk of this work is done in mlx5_os_pci_probe_pf().

To properly enable support for Multiport E-Switch, this patch also
changes the following:

- Probing of representors of type RTE_ETH_REPRESENTOR_PF is allowed,
  but if and only if Multiport E-Switch is enabled.
- Uplink ports have a representor type NONE and have
  representor ID equal to UINT16_MAX.
  The rte_eth_dev_representor_info struct returned for an uplink port
  has the port index stored in its `pf` field.
- flow_hw_set_port_info() used by HWS steering layer sets `is_wire`
  field to true if a port is an uplink port,
  if Multiport E-Switch is enabled.
- Changing MAC address of a port marked as representor is done directly
  through its corresponding netdev if it is a Multiport E-Switch uplink.

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/mlx5.rst               | 141 ++++++++++++++
 doc/guides/rel_notes/release_23_11.rst |   1 +
 drivers/common/mlx5/mlx5_common.h      |   1 +
 drivers/net/mlx5/linux/mlx5_os.c       | 255 ++++++++++++++++++++++---
 drivers/net/mlx5/mlx5.h                |  39 ++++
 drivers/net/mlx5/mlx5_ethdev.c         |  53 ++++-
 drivers/net/mlx5/mlx5_flow_hw.c        |   4 +-
 drivers/net/mlx5/mlx5_mac.c            |   8 +-
 8 files changed, 464 insertions(+), 38 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index be5054e68a..584f592433 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1759,6 +1759,147 @@ behavior as librte_net_mlx4::
    > port config all rss all
    > port start all
 
+
+Multiport E-Switch
+------------------
+
+In standard deployments of NVIDIA ConnectX and BlueField HCAs, where embedded switch is enabled,
+each physical port is associated with a single switching domain.
+Only PFs, VFs and SFs related to that physical port are connected to this domain
+and offloaded flow rules are allowed to steer traffic only between the entities in the given domain.
+
+The following diagram shows a high-level overview of this architecture:
+
+::
+
+       .---. .------. .------. .---. .------. .------.
+       |PF0| |PF0VFi| |PF0SFi| |PF1| |PF1VFi| |PF1SFi|
+       .-+-. .--+---. .--+---. .-+-. .--+---. .--+---.
+         |      |        |       |      |        |
+     .---|------|--------|-------|------|--------|---------.
+     |   |      |        |       |      |        |      HCA|
+     | .-+------+--------+---. .-+------+--------+---.     |
+     | |                     | |                     |     |
+     | |      E-Switch       | |     E-Switch        |     |
+     | |         PF0         | |        PF1          |     |
+     | |                     | |                     |     |
+     | .---------+-----------. .--------+------------.     |
+     |           |                      |                  |
+     .--------+--+---+---------------+--+---+--------------.
+              |      |               |      |
+              | PHY0 |               | PHY1 |
+              |      |               |      |
+              .------.               .------.
+
+Multiport E-Switch is a deployment scenario where:
+
+- All physical ports, PFs, VFs and SFs share the same switching domain.
+- Each physical port gets a separate representor port.
+- Traffic can be matched or forwarded explicitly between any of the entities
+  connected to the domain.
+
+The following diagram shows a high-level overview of this architecture:
+
+::
+
+
+       .---. .------. .------. .---. .------. .------.
+       |PF0| |PF0VFi| |PF0SFi| |PF1| |PF1VFi| |PF1SFi|
+       .-+-. .--+---. .--+---. .-+-. .--+---. .--+---.
+         |      |        |       |      |        |
+     .---|------|--------|-------|------|--------|---------.
+     |   |      |        |       |      |        |      HCA|
+     | .-+------+--------+-------+------+--------+---.     |
+     | |                                             |     |
+     | |                   Shared                    |     |
+     | |                  E-Switch                   |     |
+     | |                                             |     |
+     | .---------+----------------------+------------.     |
+     |           |                      |                  |
+     .--------+--+---+---------------+--+---+--------------.
+              |      |               |      |
+              | PHY0 |               | PHY1 |
+              |      |               |      |
+              .------.               .------.
+
+
+In this deployment a single application can control the switching and forwarding behavior for all
+entities on the HCA.
+
+With this configuration, mlx5 PMD supports:
+
+- matching traffic coming from physical port, PF, VF or SF using REPRESENTED_PORT items;
+- forwarding traffic to physical port, PF, VF or SF using REPRESENTED_PORT actions;
+
+
+Requirements
+~~~~~~~~~~~~
+
+Supported HCAs:
+
+- ConnectX family: ConnectX-6 Dx and above.
+- BlueField family: BlueField-2 and above.
+- FW version: at least ``XX.37.1014``.
+
+Supported mlx5 kernel modules versions:
+
+- Upstream Linux - from version 6.3.
+- Modules packaged in MLNX_OFED - from version v23.04-0.5.3.3.
+
+
+Configuration
+~~~~~~~~~~~~~
+
+#. Apply required FW configuration::
+
+      sudo mlxconfig -d /dev/mst/mt4125_pciconf0 set LAG_RESOURCE_ALLOCATION=1
+
+#. Reset FW or cold reboot the host.
+#. Switch E-Switch mode on all of the PFs to ``switchdev`` mode::
+
+      sudo devlink dev eswitch set pci/0000:08:00.0 mode switchdev
+      sudo devlink dev eswitch set pci/0000:08:00.1 mode switchdev
+
+#. Enable Multiport E-Switch on all of the PFs::
+
+      sudo devlink dev param set pci/0000:08:00.0 name esw_multiport value true cmode runtime
+      sudo devlink dev param set pci/0000:08:00.1 name esw_multiport value true cmode runtime
+
+#. Configure required number of VFs/SFs::
+
+      echo 4 | sudo tee /sys/class/net/eth2/device/sriov_numvfs
+      echo 4 | sudo tee /sys/class/net/eth3/device/sriov_numvfs
+
+#. Start testpmd and verify that all ports are visible::
+
+      $ sudo dpdk-testpmd -a 08:00.0,dv_flow_en=2,representor=pf0-1vf0-3 -- -i
+      testpmd> show port summary all
+      Number of available ports: 10
+      Port MAC Address       Name         Driver         Status   Link
+      0    E8:EB:D5:18:22:BC 08:00.0_p0   mlx5_pci       up       200 Gbps
+      1    E8:EB:D5:18:22:BD 08:00.0_p1   mlx5_pci       up       200 Gbps
+      2    D2:F6:43:0B:9E:19 08:00.0_representor_c0pf0vf0 mlx5_pci       up       200 Gbps
+      3    E6:42:27:B7:68:BD 08:00.0_representor_c0pf0vf1 mlx5_pci       up       200 Gbps
+      4    A6:5B:7F:8B:B8:47 08:00.0_representor_c0pf0vf2 mlx5_pci       up       200 Gbps
+      5    12:93:50:45:89:02 08:00.0_representor_c0pf0vf3 mlx5_pci       up       200 Gbps
+      6    06:D3:B2:79:FE:AC 08:00.0_representor_c0pf1vf0 mlx5_pci       up       200 Gbps
+      7    12:FC:08:E4:C2:CA 08:00.0_representor_c0pf1vf1 mlx5_pci       up       200 Gbps
+      8    8E:A9:9A:D0:35:4C 08:00.0_representor_c0pf1vf2 mlx5_pci       up       200 Gbps
+      9    E6:35:83:1F:B0:A9 08:00.0_representor_c0pf1vf3 mlx5_pci       up       200 Gbps
+
+
+Limitations
+~~~~~~~~~~~
+
+- Multiport E-Switch is not supported on Windows.
+- Multiport E-Switch is supported only with HW Steering flow engine (``dv_flow_en=2``).
+- Matching traffic coming from a physical port and forwarding it to a physical port
+  (either the same or other one) is not supported.
+
+  To achieve such functionality, an application has to set up hairpin queues between
+  physical port representors and forward the traffic using hairpin queues.
+
+
 Usage example
 -------------
 
diff --git a/doc/guides/rel_notes/release_23_11.rst b/doc/guides/rel_notes/release_23_11.rst
index 93999893bd..f337db19f0 100644
--- a/doc/guides/rel_notes/release_23_11.rst
+++ b/doc/guides/rel_notes/release_23_11.rst
@@ -157,6 +157,7 @@ New Features
   * Added support for ``RTE_FLOW_ACTION_TYPE_INDIRECT_LIST`` flow action.
   * Added support for ``RTE_FLOW_ITEM_TYPE_PTYPE`` flow item.
   * Added support for ``RTE_FLOW_ACTION_TYPE_PORT_REPRESENTOR`` flow action and mirror.
+  * Added support for Multiport E-Switch.
 
 * **Updated Solarflare net driver.**
 
diff --git a/drivers/common/mlx5/mlx5_common.h b/drivers/common/mlx5/mlx5_common.h
index 28f9f41528..9c80277d74 100644
--- a/drivers/common/mlx5/mlx5_common.h
+++ b/drivers/common/mlx5/mlx5_common.h
@@ -169,6 +169,7 @@ struct mlx5_switch_info {
 	int32_t ctrl_num; /**< Controller number (valid for c#pf#vf# format). */
 	int32_t pf_num; /**< PF number (valid for pfxvfx format only). */
 	int32_t port_name; /**< Representor port name. */
+	int32_t mpesw_owner; /**< MPESW owner port number. */
 	uint64_t switch_id; /**< Switch identifier. */
 };
 
diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 8a57edc470..8ddf38288e 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -959,7 +959,30 @@ mlx5_representor_match(struct mlx5_dev_spawn_data *spawn,
 	uint16_t repr_id = mlx5_representor_id_encode(switch_info,
 						      eth_da->type);
 
+	/*
+	 * Assuming Multiport E-Switch device was detected,
+	 * if spawned port is an uplink, check if the port
+	 * was requested through representor devarg.
+	 */
+	if (mlx5_is_probed_port_on_mpesw_device(spawn) &&
+	    switch_info->name_type == MLX5_PHYS_PORT_NAME_TYPE_UPLINK) {
+		for (p = 0; p < eth_da->nb_ports; ++p)
+			if (switch_info->port_name == eth_da->ports[p])
+				return true;
+		rte_errno = EBUSY;
+		return false;
+	}
 	switch (eth_da->type) {
+	case RTE_ETH_REPRESENTOR_PF:
+		/*
+		 * PF representors provided in devargs translate to uplink ports, but
+		 * if and only if the device is a part of MPESW device.
+		 */
+		if (!mlx5_is_probed_port_on_mpesw_device(spawn)) {
+			rte_errno = EBUSY;
+			return false;
+		}
+		break;
 	case RTE_ETH_REPRESENTOR_SF:
 		if (!(spawn->info.port_name == -1 &&
 		      switch_info->name_type ==
@@ -989,7 +1012,7 @@ mlx5_representor_match(struct mlx5_dev_spawn_data *spawn,
 	}
 	/* Check representor ID: */
 	for (p = 0; p < eth_da->nb_ports; ++p) {
-		if (spawn->pf_bond < 0) {
+		if (!mlx5_is_probed_port_on_mpesw_device(spawn) && spawn->pf_bond < 0) {
 			/* For non-LAG mode, allow and ignore pf. */
 			switch_info->pf_num = eth_da->ports[p];
 			repr_id = mlx5_representor_id_encode(switch_info,
@@ -1051,17 +1074,7 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	    !mlx5_representor_match(spawn, eth_da))
 		return NULL;
 	/* Build device name. */
-	if (spawn->pf_bond < 0) {
-		/* Single device. */
-		if (!switch_info->representor)
-			strlcpy(name, dpdk_dev->name, sizeof(name));
-		else
-			err = snprintf(name, sizeof(name), "%s_representor_%s%u",
-				 dpdk_dev->name,
-				 switch_info->name_type ==
-				 MLX5_PHYS_PORT_NAME_TYPE_PFSF ? "sf" : "vf",
-				 switch_info->port_name);
-	} else {
+	if (spawn->pf_bond >= 0) {
 		/* Bonding device. */
 		if (!switch_info->representor) {
 			err = snprintf(name, sizeof(name), "%s_%s",
@@ -1075,6 +1088,30 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 				MLX5_PHYS_PORT_NAME_TYPE_PFSF ? "sf" : "vf",
 				switch_info->port_name);
 		}
+	} else if (mlx5_is_probed_port_on_mpesw_device(spawn)) {
+		/* MPESW device. */
+		if (switch_info->name_type == MLX5_PHYS_PORT_NAME_TYPE_UPLINK) {
+			err = snprintf(name, sizeof(name), "%s_p%d",
+				       dpdk_dev->name, spawn->mpesw_port);
+		} else {
+			err = snprintf(name, sizeof(name), "%s_representor_c%dpf%d%s%u",
+				dpdk_dev->name,
+				switch_info->ctrl_num,
+				switch_info->pf_num,
+				switch_info->name_type ==
+				MLX5_PHYS_PORT_NAME_TYPE_PFSF ? "sf" : "vf",
+				switch_info->port_name);
+		}
+	} else {
+		/* Single device. */
+		if (!switch_info->representor)
+			strlcpy(name, dpdk_dev->name, sizeof(name));
+		else
+			err = snprintf(name, sizeof(name), "%s_representor_%s%u",
+				 dpdk_dev->name,
+				 switch_info->name_type ==
+				 MLX5_PHYS_PORT_NAME_TYPE_PFSF ? "sf" : "vf",
+				 switch_info->port_name);
 	}
 	if (err >= (int)sizeof(name))
 		DRV_LOG(WARNING, "device name overflow %s", name);
@@ -1202,13 +1239,25 @@ mlx5_dev_spawn(struct rte_device *dpdk_dev,
 	priv->vport_meta_tag = 0;
 	priv->vport_meta_mask = 0;
 	priv->pf_bond = spawn->pf_bond;
+	priv->mpesw_port = spawn->mpesw_port;
+	priv->mpesw_uplink = false;
+	priv->mpesw_owner = spawn->info.mpesw_owner;
+	if (mlx5_is_port_on_mpesw_device(priv))
+		priv->mpesw_uplink = (spawn->info.name_type == MLX5_PHYS_PORT_NAME_TYPE_UPLINK);
 
 	DRV_LOG(DEBUG,
-		"dev_port=%u bus=%s pci=%s master=%d representor=%d pf_bond=%d\n",
+		"dev_port=%u bus=%s pci=%s master=%d representor=%d pf_bond=%d "
+		"mpesw_port=%d mpesw_uplink=%d",
 		priv->dev_port, dpdk_dev->bus->name,
 		priv->pci_dev ? priv->pci_dev->name : "NONE",
-		priv->master, priv->representor, priv->pf_bond);
+		priv->master, priv->representor, priv->pf_bond,
+		priv->mpesw_port, priv->mpesw_uplink);
 
+	if (mlx5_is_port_on_mpesw_device(priv) && priv->sh->config.dv_flow_en != 2) {
+		DRV_LOG(ERR, "MPESW device is supported only with HWS");
+		err = ENOTSUP;
+		goto error;
+	}
 	/*
 	 * If we have E-Switch we should determine the vport attributes.
 	 * E-Switch may use either source vport field or reg_c[0] metadata
@@ -2029,7 +2078,7 @@ mlx5_sysfs_esw_multiport_get(struct ibv_device *ibv, struct rte_pci_addr *pci_ad
 	return ret;
 }
 
-static __rte_unused int
+static int
 mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr, int *enabled)
 {
 	/*
@@ -2049,6 +2098,84 @@ mlx5_is_mpesw_enabled(struct ibv_device *ibv, struct rte_pci_addr *ibv_pci_addr,
 	return -rte_errno;
 }
 
+static int
+mlx5_device_mpesw_pci_match(struct ibv_device *ibv,
+			    const struct rte_pci_addr *owner_pci,
+			    int nl_rdma)
+{
+	struct rte_pci_addr ibdev_pci_addr = { 0 };
+	char ifname[IF_NAMESIZE + 1] = { 0 };
+	unsigned int ifindex;
+	unsigned int np;
+	unsigned int i;
+	int enabled = 0;
+	int ret;
+
+	/* Check if IB device's PCI address matches the probed PCI address. */
+	if (mlx5_get_pci_addr(ibv->ibdev_path, &ibdev_pci_addr)) {
+		DRV_LOG(DEBUG, "Skipping MPESW check for IB device %s since "
+			       "there is no underlying PCI device", ibv->name);
+		rte_errno = ENOENT;
+		return -rte_errno;
+	}
+	if (ibdev_pci_addr.domain != owner_pci->domain ||
+	    ibdev_pci_addr.bus != owner_pci->bus ||
+	    ibdev_pci_addr.devid != owner_pci->devid ||
+	    ibdev_pci_addr.function != owner_pci->function) {
+		return -1;
+	}
+	/* Check if IB device has MPESW enabled. */
+	if (mlx5_is_mpesw_enabled(ibv, &ibdev_pci_addr, &enabled))
+		return -1;
+	if (!enabled)
+		return -1;
+	/* Iterate through IB ports to find MPESW master uplink port. */
+	if (nl_rdma < 0)
+		return -1;
+	np = mlx5_nl_portnum(nl_rdma, ibv->name);
+	if (!np)
+		return -1;
+	for (i = 1; i <= np; ++i) {
+		struct rte_pci_addr pci_addr;
+		FILE *file;
+		char port_name[IF_NAMESIZE + 1];
+		struct mlx5_switch_info	info;
+
+		/* Check whether IB port has a corresponding netdev. */
+		ifindex = mlx5_nl_ifindex(nl_rdma, ibv->name, i);
+		if (!ifindex)
+			continue;
+		if (!if_indextoname(ifindex, ifname))
+			continue;
+		/* Read port name and determine its type. */
+		MKSTR(ifphysportname, "/sys/class/net/%s/phys_port_name", ifname);
+		file = fopen(ifphysportname, "rb");
+		if (!file)
+			continue;
+		ret = fscanf(file, "%16s", port_name);
+		fclose(file);
+		if (ret != 1)
+			continue;
+		memset(&info, 0, sizeof(info));
+		mlx5_translate_port_name(port_name, &info);
+		if (info.name_type != MLX5_PHYS_PORT_NAME_TYPE_UPLINK)
+			continue;
+		/* Fetch PCI address of the device to which the netdev is bound. */
+		MKSTR(ifpath, "/sys/class/net/%s", ifname);
+		if (mlx5_get_pci_addr(ifpath, &pci_addr))
+			continue;
+		if (pci_addr.domain == ibdev_pci_addr.domain &&
+		    pci_addr.bus == ibdev_pci_addr.bus &&
+		    pci_addr.devid == ibdev_pci_addr.devid &&
+		    pci_addr.function == ibdev_pci_addr.function) {
+			MLX5_ASSERT(info.port_name >= 0);
+			return info.port_name;
+		}
+	}
+	/* No matching MPESW uplink port was found. */
+	return -1;
+}
+
 /**
  * Register a PCI device within bonding.
  *
@@ -2097,6 +2224,12 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 	 *  >= 0 - bonding device (value is slave PF index)
 	 */
 	int bd = -1;
+	/*
+	 * Multiport E-Switch (MPESW) device:
+	 *   < 0 - no MPESW device or could not determine if it is MPESW device,
+	 *  >= 0 - MPESW device. Value is the port index of the MPESW owner.
+	 */
+	int mpesw = MLX5_MPESW_PORT_INVALID;
 	struct rte_pci_device *pci_dev = RTE_DEV_TO_PCI(cdev->dev);
 	struct mlx5_dev_spawn_data *list = NULL;
 	struct rte_eth_devargs eth_da = *req_eth_da;
@@ -2150,17 +2283,38 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 				bd, ibv_list[ret]->name);
 			ibv_match[nd++] = ibv_list[ret];
 			break;
-		} else {
-			/* Bonding device not found. */
-			if (mlx5_get_pci_addr(ibv_list[ret]->ibdev_path,
-					      &pci_addr))
-				continue;
-			if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0)
-				continue;
-			DRV_LOG(INFO, "PCI information matches for device \"%s\"",
+		}
+		mpesw = mlx5_device_mpesw_pci_match(ibv_list[ret], &owner_pci, nl_rdma);
+		if (mpesw >= 0) {
+			/*
+			 * MPESW device detected. Only one matching IB device is allowed,
+			 * so if any matches were found previously, fail gracefully.
+			 */
+			if (nd) {
+				DRV_LOG(ERR,
+					"PCI information matches MPESW device \"%s\", "
+					"but multiple matching PCI devices were found. "
+					"Probing failed.",
+					ibv_list[ret]->name);
+				rte_errno = ENOENT;
+				ret = -rte_errno;
+				goto exit;
+			}
+			DRV_LOG(INFO,
+				"PCI information matches MPESW device \"%s\"",
 				ibv_list[ret]->name);
 			ibv_match[nd++] = ibv_list[ret];
+			break;
 		}
+		/* Bonding or MPESW device was not found. */
+		if (mlx5_get_pci_addr(ibv_list[ret]->ibdev_path,
+					&pci_addr))
+			continue;
+		if (rte_pci_addr_cmp(&owner_pci, &pci_addr) != 0)
+			continue;
+		DRV_LOG(INFO, "PCI information matches for device \"%s\"",
+			ibv_list[ret]->name);
+		ibv_match[nd++] = ibv_list[ret];
 	}
 	ibv_match[nd] = NULL;
 	if (!nd) {
@@ -2192,6 +2346,12 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			ret = -rte_errno;
 			goto exit;
 		}
+		if (mpesw >= 0 && !np) {
+			DRV_LOG(ERR, "Cannot get ports for MPESW device.");
+			rte_errno = ENOENT;
+			ret = -rte_errno;
+			goto exit;
+		}
 	}
 	/* Now we can determine the maximal amount of devices to be spawned. */
 	list = mlx5_malloc(MLX5_MEM_ZERO,
@@ -2203,7 +2363,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 		ret = -rte_errno;
 		goto exit;
 	}
-	if (bd >= 0 || np > 1) {
+	if (bd >= 0 || mpesw >= 0 || np > 1) {
 		/*
 		 * Single IB device with multiple ports found,
 		 * it may be E-Switch master device and representors.
@@ -2222,6 +2382,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			list[ns].pci_dev = pci_dev;
 			list[ns].cdev = cdev;
 			list[ns].pf_bond = bd;
+			list[ns].mpesw_port = MLX5_MPESW_PORT_INVALID;
 			list[ns].ifindex = mlx5_nl_ifindex(nl_rdma,
 							   ibv_match[0]->name,
 							   i);
@@ -2278,6 +2439,46 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 				}
 				continue;
 			}
+			if (!ret && mpesw >= 0) {
+				switch (list[ns].info.name_type) {
+				case MLX5_PHYS_PORT_NAME_TYPE_UPLINK:
+					/* Owner port is treated as master port. */
+					if (list[ns].info.port_name == mpesw) {
+						list[ns].info.master = 1;
+						list[ns].info.representor = 0;
+					} else {
+						list[ns].info.master = 0;
+						list[ns].info.representor = 1;
+					}
+					/*
+					 * Ports of this type have uplink port index
+					 * encoded in the name. This index is also a PF index.
+					 */
+					list[ns].info.pf_num = list[ns].info.port_name;
+					list[ns].mpesw_port = list[ns].info.port_name;
+					list[ns].info.mpesw_owner = mpesw;
+					ns++;
+					break;
+				case MLX5_PHYS_PORT_NAME_TYPE_PFHPF:
+				case MLX5_PHYS_PORT_NAME_TYPE_PFVF:
+				case MLX5_PHYS_PORT_NAME_TYPE_PFSF:
+					/* Only spawn representors related to the probed PF. */
+					if (list[ns].info.pf_num == owner_id) {
+						/*
+						 * Ports of this type have PF index encoded in name,
+						 * which translates to the related uplink port index.
+						 */
+						list[ns].mpesw_port = list[ns].info.pf_num;
+						/* MPESW owner is also saved but not used now. */
+						list[ns].info.mpesw_owner = mpesw;
+						ns++;
+					}
+					break;
+				default:
+					break;
+				}
+				continue;
+			}
 			if (!ret && (list[ns].info.representor ^
 				     list[ns].info.master))
 				ns++;
@@ -2317,6 +2518,7 @@ mlx5_os_pci_probe_pf(struct mlx5_common_device *cdev,
 			list[ns].pci_dev = pci_dev;
 			list[ns].cdev = cdev;
 			list[ns].pf_bond = -1;
+			list[ns].mpesw_port = MLX5_MPESW_PORT_INVALID;
 			list[ns].ifindex = 0;
 			if (nl_rdma >= 0)
 				list[ns].ifindex = mlx5_nl_ifindex
@@ -2597,7 +2799,10 @@ mlx5_os_auxiliary_probe(struct mlx5_common_device *cdev,
 			struct mlx5_kvargs_ctrl *mkvlist)
 {
 	struct rte_eth_devargs eth_da = { .nb_ports = 0 };
-	struct mlx5_dev_spawn_data spawn = { .pf_bond = -1 };
+	struct mlx5_dev_spawn_data spawn = {
+		.pf_bond = -1,
+		.mpesw_port = MLX5_MPESW_PORT_INVALID,
+	};
 	struct rte_device *dev = cdev->dev;
 	struct rte_auxiliary_device *adev = RTE_DEV_TO_AUXILIARY(dev);
 	struct rte_eth_dev *eth_dev;
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index a20acb6ca8..484c5eb3df 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -186,12 +186,15 @@ struct mlx5_dev_cap {
 	char fw_ver[64]; /* Firmware version of this device. */
 };
 
+#define MLX5_MPESW_PORT_INVALID (-1)
+
 /** Data associated with devices to spawn. */
 struct mlx5_dev_spawn_data {
 	uint32_t ifindex; /**< Network interface index. */
 	uint32_t max_port; /**< Device maximal port index. */
 	uint32_t phys_port; /**< Device physical port index. */
 	int pf_bond; /**< bonding device PF index. < 0 - no bonding */
+	int mpesw_port; /**< MPESW uplink index. Valid if mpesw_owner_port >= 0. */
 	struct mlx5_switch_info info; /**< Switch information. */
 	const char *phys_dev_name; /**< Name of physical device. */
 	struct rte_eth_dev *eth_dev; /**< Associated Ethernet device. */
@@ -200,6 +203,23 @@ struct mlx5_dev_spawn_data {
 	struct mlx5_bond_info *bond_info;
 };
 
+/**
+ * Check if the port requested to be probed is MPESW physical device
+ * or a representor port.
+ *
+ * @param spawn
+ *   Parameters of the probed port.
+ *
+ * @return
+ *   True if the probed port is a physical device or representor in MPESW setup.
+ *   False otherwise or MPESW was not configured.
+ */
+static inline bool
+mlx5_is_probed_port_on_mpesw_device(struct mlx5_dev_spawn_data *spawn)
+{
+	return spawn->mpesw_port >= 0;
+}
+
 /** Data associated with socket messages. */
 struct mlx5_flow_dump_req  {
 	uint32_t port_id; /**< There are plans in DPDK to extend port_id. */
@@ -1768,6 +1788,9 @@ struct mlx5_priv {
 	uint32_t vport_meta_mask; /* Used for vport index field match mask. */
 	uint16_t representor_id; /* UINT16_MAX if not a representor. */
 	int32_t pf_bond; /* >=0, representor owner PF index in bonding. */
+	int32_t mpesw_owner; /* >=0, representor owner PF index in MPESW. */
+	int32_t mpesw_port; /* Related port index of MPESW device. < 0 - no MPESW. */
+	bool mpesw_uplink; /* If true, port is an uplink port. */
 	unsigned int if_index; /* Associated kernel network device index. */
 	/* RX/TX queues. */
 	unsigned int rxqs_n; /* RX queues array size. */
@@ -1933,6 +1956,22 @@ mlx5_devx_obj_ops_en(struct mlx5_dev_ctx_shared *sh)
 		sh->dev_cap.dest_tir);
 }
 
+/**
+ * Check if the port is either MPESW physical device or a representor port.
+ *
+ * @param priv
+ *   Pointer to port's private data.
+ *
+ * @return
+ *   True if the port is a physical device or representor in MPESW setup.
+ *   False otherwise or MPESW was not configured.
+ */
+static inline bool
+mlx5_is_port_on_mpesw_device(struct mlx5_priv *priv)
+{
+	return priv->mpesw_port >= 0;
+}
+
 /* mlx5.c */
 
 int mlx5_getenv_int(const char *);
diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 4a85415ff3..cd84960b7e 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -395,18 +395,30 @@ uint16_t
 mlx5_representor_id_encode(const struct mlx5_switch_info *info,
 			   enum rte_eth_representor_type hpf_type)
 {
-	enum rte_eth_representor_type type = RTE_ETH_REPRESENTOR_VF;
+	enum rte_eth_representor_type type;
 	uint16_t repr = info->port_name;
-
-	if (info->representor == 0)
-		return UINT16_MAX;
-	if (info->name_type == MLX5_PHYS_PORT_NAME_TYPE_PFSF)
+	int32_t pf = info->pf_num;
+
+	switch (info->name_type) {
+	case MLX5_PHYS_PORT_NAME_TYPE_UPLINK:
+		if (!info->representor)
+			return UINT16_MAX;
+		type = RTE_ETH_REPRESENTOR_PF;
+		pf = info->mpesw_owner;
+		break;
+	case MLX5_PHYS_PORT_NAME_TYPE_PFSF:
 		type = RTE_ETH_REPRESENTOR_SF;
-	if (info->name_type == MLX5_PHYS_PORT_NAME_TYPE_PFHPF) {
+		break;
+	case MLX5_PHYS_PORT_NAME_TYPE_PFHPF:
 		type = hpf_type;
 		repr = UINT16_MAX;
+		break;
+	case MLX5_PHYS_PORT_NAME_TYPE_PFVF:
+	default:
+		type = RTE_ETH_REPRESENTOR_VF;
+		break;
 	}
-	return MLX5_REPRESENTOR_ID(info->pf_num, type, repr);
+	return MLX5_REPRESENTOR_ID(pf, type, repr);
 }
 
 /**
@@ -430,7 +442,7 @@ mlx5_representor_info_get(struct rte_eth_dev *dev,
 			  struct rte_eth_representor_info *info)
 {
 	struct mlx5_priv *priv = dev->data->dev_private;
-	int n_type = 4; /* Representor types, VF, HPF@VF, SF and HPF@SF. */
+	int n_type = 5; /* Representor types: PF, VF, HPF@VF, SF and HPF@SF. */
 	int n_pf = 2; /* Number of PFs. */
 	int i = 0, pf;
 	int n_entries;
@@ -443,7 +455,30 @@ mlx5_representor_info_get(struct rte_eth_dev *dev,
 		n_entries = info->nb_ranges_alloc;
 
 	info->controller = 0;
-	info->pf = priv->pf_bond >= 0 ? priv->pf_bond : 0;
+	info->pf = 0;
+	if (mlx5_is_port_on_mpesw_device(priv)) {
+		info->pf = priv->mpesw_port;
+		/* PF range, both ports will show the same information. */
+		info->ranges[i].type = RTE_ETH_REPRESENTOR_PF;
+		info->ranges[i].controller = 0;
+		info->ranges[i].pf = priv->mpesw_owner + 1;
+		info->ranges[i].vf = 0;
+		/*
+		 * The representor indexes should be the values set of "priv->mpesw_port".
+		 * In the real case now, only 1 PF/UPLINK representor is supported.
+		 * The port index will always be the value of "owner + 1".
+		 */
+		info->ranges[i].id_base =
+			MLX5_REPRESENTOR_ID(priv->mpesw_owner, info->ranges[i].type,
+					    info->ranges[i].pf);
+		info->ranges[i].id_end =
+			MLX5_REPRESENTOR_ID(priv->mpesw_owner, info->ranges[i].type,
+					    info->ranges[i].pf);
+		snprintf(info->ranges[i].name, sizeof(info->ranges[i].name),
+			 "pf%d", info->ranges[i].pf);
+		i++;
+	} else if (priv->pf_bond >= 0)
+		info->pf = priv->pf_bond;
 	for (pf = 0; pf < n_pf; ++pf) {
 		/* VF range. */
 		info->ranges[i].type = RTE_ETH_REPRESENTOR_VF;
diff --git a/drivers/net/mlx5/mlx5_flow_hw.c b/drivers/net/mlx5/mlx5_flow_hw.c
index 977751394e..9700f0342a 100644
--- a/drivers/net/mlx5/mlx5_flow_hw.c
+++ b/drivers/net/mlx5/mlx5_flow_hw.c
@@ -1331,7 +1331,7 @@ flow_hw_represented_port_compile(struct rte_eth_dev *dev,
 	if (!priv->master)
 		return rte_flow_error_set(error, EINVAL,
 					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
-					  "represented_port acton must"
+					  "represented_port action must"
 					  " be used on proxy port");
 	if (m && !!m->port_id) {
 		struct mlx5_priv *port_priv;
@@ -9188,7 +9188,7 @@ flow_hw_set_port_info(struct rte_eth_dev *dev)
 	info = &mlx5_flow_hw_port_infos[port_id];
 	info->regc_mask = priv->vport_meta_mask;
 	info->regc_value = priv->vport_meta_tag;
-	info->is_wire = priv->master;
+	info->is_wire = mlx5_is_port_on_mpesw_device(priv) ? priv->mpesw_uplink : priv->master;
 }
 
 /* Clears vport tag and mask used for HWS rules. */
diff --git a/drivers/net/mlx5/mlx5_mac.c b/drivers/net/mlx5/mlx5_mac.c
index b9d1e33ac3..22a756a52b 100644
--- a/drivers/net/mlx5/mlx5_mac.c
+++ b/drivers/net/mlx5/mlx5_mac.c
@@ -157,9 +157,13 @@ mlx5_mac_addr_set(struct rte_eth_dev *dev, struct rte_ether_addr *mac_addr)
 
 	/*
 	 * Configuring the VF instead of its representor,
-	 * need to skip the special case of HPF on BlueField.
+	 * need to skip the special cases:
+	 * - HPF on BlueField,
+	 * - SF representors,
+	 * - uplink ports when running in MPESW mode.
 	 */
-	if (priv->representor && !mlx5_is_hpf(dev) && !mlx5_is_sf_repr(dev)) {
+	if (priv->representor && !mlx5_is_hpf(dev) && !mlx5_is_sf_repr(dev) &&
+	    !priv->mpesw_uplink) {
 		DRV_LOG(DEBUG, "VF represented by port %u setting primary MAC address",
 			dev->data->port_id);
 		if (priv->pf_bond >= 0) {
-- 
2.25.1



* [PATCH 7/8] net/mlx5: sort port spawn data with uplink ports first
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
                   ` (5 preceding siblings ...)
  2023-10-31 14:27 ` [PATCH 6/8] net/mlx5: support port probing of Multiport E-Switch device Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 14:27 ` [PATCH 8/8] net/mlx5: add support for vport match selection Dariusz Sosnowski
  2023-10-31 21:49 ` [PATCH 0/8] net/mlx5: add Multiport E-Switch support Raslan Darawsheh
  8 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev, Raslan Darawsheh

This patch changes the behavior of the comparator used to
sort mlx5_dev_spawn_data structures, to put them in a more
user-friendly order.

Before this patch, ports were sorted assuming there is
only a single master port. It resulted in an order where
the master port comes first, then representors in ascending
order of IDs.

This approach however is not desirable with devices configured
for Multiport E-Switch, since uplink ports which do not correspond
to the owning PCI device are representors as well and they will be
mixed with VF/SF representors.

To change that, this patch amends the comparator to force uplink ports
to be first. If there are many uplink ports, the master port will
come first and the rest will be sorted by port index.
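
The resulting ordering can be illustrated with a minimal, standalone
sketch. It is not the driver code: `struct spawn_port`, `enum port_type`
and both function names are hypothetical stand-ins for
mlx5_dev_spawn_data and its comparator, and the real comparator also
takes the representor flag into account.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the switch info of a spawned port. */
enum port_type { PORT_UPLINK, PORT_VF, PORT_SF };

struct spawn_port {
	enum port_type type; /* Stand-in for info.name_type. */
	int master;          /* 1 for the master (owner) port, 0 otherwise. */
	int port_name;       /* Uplink or representor index. */
};

/* Uplink ports first, then the master port, then ascending port index. */
static int
spawn_port_cmp(const void *a, const void *b)
{
	const struct spawn_port *pa = a;
	const struct spawn_port *pb = b;
	int uplink_a = pa->type == PORT_UPLINK;
	int uplink_b = pb->type == PORT_UPLINK;
	int ret;

	ret = uplink_b - uplink_a;
	if (ret)
		return ret;
	ret = pb->master - pa->master;
	if (ret)
		return ret;
	return pa->port_name - pb->port_name;
}

/* Sort an array of ports in place using the comparator above. */
static void
spawn_ports_sort(struct spawn_port *ports, size_t n)
{
	assert(ports != NULL || n == 0);
	if (n == 0)
		return;
	qsort(ports, n, sizeof(*ports), spawn_port_cmp);
}
```

With two uplinks and two VF representors as input, the master uplink
ends up first, the other uplink second, and the VF representors follow
in ascending index order.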

Signed-off-by: Dariusz Sosnowski <dsosnowski@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 drivers/net/mlx5/linux/mlx5_os.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/linux/mlx5_os.c b/drivers/net/mlx5/linux/mlx5_os.c
index 8ddf38288e..07f31de5ae 100644
--- a/drivers/net/mlx5/linux/mlx5_os.c
+++ b/drivers/net/mlx5/linux/mlx5_os.c
@@ -1797,9 +1797,15 @@ mlx5_dev_spawn_data_cmp(const void *a, const void *b)
 		&((const struct mlx5_dev_spawn_data *)a)->info;
 	const struct mlx5_switch_info *si_b =
 		&((const struct mlx5_dev_spawn_data *)b)->info;
+	int uplink_a = si_a->name_type == MLX5_PHYS_PORT_NAME_TYPE_UPLINK;
+	int uplink_b = si_b->name_type == MLX5_PHYS_PORT_NAME_TYPE_UPLINK;
 	int ret;
 
-	/* Master device first. */
+	/* Uplink ports first. */
+	ret = uplink_b - uplink_a;
+	if (ret)
+		return ret;
+	/* Then master devices. */
 	ret = si_b->master - si_a->master;
 	if (ret)
 		return ret;
-- 
2.25.1



* [PATCH 8/8] net/mlx5: add support for vport match selection
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
                   ` (6 preceding siblings ...)
  2023-10-31 14:27 ` [PATCH 7/8] net/mlx5: sort port spawn data with uplink ports first Dariusz Sosnowski
@ 2023-10-31 14:27 ` Dariusz Sosnowski
  2023-10-31 21:49 ` [PATCH 0/8] net/mlx5: add Multiport E-Switch support Raslan Darawsheh
  8 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 14:27 UTC (permalink / raw)
  To: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev, Raslan Darawsheh, Bing Zhao

From: Bing Zhao <bingz@nvidia.com>

A new devarg "vport_match" is introduced for the application to use.
If set to 1, then matching using REPRESENTED_PORT items on group 0
will be forced to use "misc.source_port", instead of matching on
the vport metadata (REG_C_0 bits) in HWS mode. It allows the user to
match on the traffic from E-Switch manager.

By default, this is set to 0. When it is set to 1, the default
FDB jump rule should be disabled by setting "fdb_def_rule_en=0".
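
The interaction between the two devargs can be sketched in isolation as
follows. This is only an illustration under stated assumptions:
`struct sketch_config` and `sketch_config_fixup()` are hypothetical
names, and the fields merely mirror the dv_flow_en, fdb_def_rule and
vport_match settings discussed in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical subset of the shared device configuration. */
struct sketch_config {
	int dv_flow_en;    /* 2 selects HW Steering. */
	bool fdb_def_rule; /* Default FDB jump rule enabled. */
	bool vport_match;  /* Requested source vport matching on root table. */
};

/*
 * Fall back to vport_match=0 when the default FDB jump rule is enabled
 * under HW Steering. Returns true if the setting was overridden.
 */
static bool
sketch_config_fixup(struct sketch_config *cfg)
{
	assert(cfg != NULL);
	if (cfg->dv_flow_en == 2 && cfg->fdb_def_rule && cfg->vport_match) {
		fprintf(stderr, "vport_match=1 needs fdb_def_rule_en=0, "
				"falling back to vport_match=0\n");
		cfg->vport_match = false;
		return true;
	}
	return false;
}
```

In this sketch the incompatible combination is corrected silently
rather than rejected, so an application that depends on source vport
matching would still have to pass "fdb_def_rule_en=0" explicitly.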

Signed-off-by: Bing Zhao <bingz@nvidia.com>
Acked-by: Viacheslav Ovsiienko <viacheslavo@nvidia.com>
---
 doc/guides/nics/mlx5.rst        | 16 ++++++++++++++++
 drivers/net/mlx5/mlx5.c         | 17 +++++++++++++++++
 drivers/net/mlx5/mlx5.h         |  2 ++
 drivers/net/mlx5/mlx5_flow_dv.c |  2 +-
 drivers/net/mlx5/mlx5_trigger.c |  5 ++++-
 5 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 584f592433..8c65f16db8 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1369,6 +1369,22 @@ for an additional list of options shared with other mlx5 drivers.
 
   By default, the PMD will set this value to 1.
 
+- ``vport_match`` parameter [int]
+
+  Controls the underlying matching mechanism for REPRESENTED_PORT items when they are used for
+  flow rules in E-Switch root flow table.
+
+  If set to 1, then ``source_vport`` matching is used. This allows applications to match all
+  traffic coming from the application by using REPRESENTED_PORT item with ``port_id == UINT16_MAX``.
+  As a side effect, flow rules in root flow table will not be able to match physical ports explicitly,
+  when running on Multiport E-Switch.
+  Matching in non-root flow tables (group bigger than 0) is not affected.
+
+  If set to 0, then ``vport_metadata`` matching is used. This is the default mechanism.
+
+  By default, the PMD will set this value to 0. Setting ``vport_match`` to 1 requires that
+  ``fdb_def_rule_en`` is set to 0, so that E-Switch root flow table is exposed to the application.
+
 
 Sub-Function
 ------------
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f929d6547c..c275cdfee8 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -184,6 +184,9 @@
 /* Device parameter to control representor matching in ingress/egress flows with HWS. */
 #define MLX5_REPR_MATCHING_EN "repr_matching_en"
 
+/* Representor matching field selection: 0 - meta_vport, 1 - misc.vport */
+#define MLX5_HWS_ROOT_VPORT_MATCH "vport_match"
+
 /* Shared memory between primary and secondary processes. */
 struct mlx5_shared_data *mlx5_shared_data;
 
@@ -1425,6 +1428,8 @@ mlx5_dev_args_check_handler(const char *key, const char *val, void *opaque)
 		config->cnt_svc.cycle_time = tmp;
 	} else if (strcmp(MLX5_REPR_MATCHING_EN, key) == 0) {
 		config->repr_matching = !!tmp;
+	} else if (strcmp(MLX5_HWS_ROOT_VPORT_MATCH, key) == 0) {
+		config->vport_match = !!tmp;
 	}
 	return 0;
 }
@@ -1464,6 +1469,7 @@ mlx5_shared_dev_ctx_args_config(struct mlx5_dev_ctx_shared *sh,
 		MLX5_HWS_CNT_SERVICE_CORE,
 		MLX5_HWS_CNT_CYCLE_TIME,
 		MLX5_REPR_MATCHING_EN,
+		MLX5_HWS_ROOT_VPORT_MATCH,
 		NULL,
 	};
 	int ret = 0;
@@ -1522,6 +1528,11 @@ mlx5_shared_dev_ctx_args_config(struct mlx5_dev_ctx_shared *sh,
 		rte_errno = ENODEV;
 		return -rte_errno;
 	}
+	if (config->dv_flow_en == 2 && config->fdb_def_rule && config->vport_match) {
+		DRV_LOG(DEBUG, "vport_match=1 is incompatible with FDB default rule "
+			       "(fdb_def_rule_en=1). Setting vport_match=0.");
+		config->vport_match = 0;
+	}
 	if (!config->tx_pp && config->tx_skew &&
 	    !sh->cdev->config.hca_attr.wait_on_time) {
 		DRV_LOG(WARNING,
@@ -1562,6 +1573,7 @@ mlx5_shared_dev_ctx_args_config(struct mlx5_dev_ctx_shared *sh,
 		config->allow_duplicate_pattern);
 	DRV_LOG(DEBUG, "\"fdb_def_rule_en\" is %u.", config->fdb_def_rule);
 	DRV_LOG(DEBUG, "\"repr_matching_en\" is %u.", config->repr_matching);
+	DRV_LOG(DEBUG, "\"vport_match\" is %u.", config->vport_match);
 	return 0;
 }
 
@@ -3003,6 +3015,11 @@ mlx5_probe_again_args_validate(struct mlx5_common_device *cdev,
 			sh->ibdev_name);
 		goto error;
 	}
+	if (sh->config.vport_match ^ config->vport_match) {
+		DRV_LOG(ERR, "\"vport_match\" configuration mismatch for shared %s context.",
+			sh->ibdev_name);
+		goto error;
+	}
 	mlx5_free(config);
 	return 0;
 error:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 484c5eb3df..5299b1321a 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -352,6 +352,7 @@ struct mlx5_sh_config {
 	/* Allow/Prevent the duplicate rules pattern. */
 	uint32_t fdb_def_rule:1; /* Create FDB default jump rule */
 	uint32_t repr_matching:1; /* Enable implicit vport matching in HWS FDB. */
+	uint32_t vport_match:1; /* Root table representor matching field selection. */
 };
 
 /* Structure for VF VLAN workaround. */
@@ -1782,6 +1783,7 @@ struct mlx5_priv {
 	uint32_t mark_enabled:1; /* If mark action is enabled on rxqs. */
 	uint32_t num_lag_ports:4; /* Number of ports can be bonded. */
 	uint32_t tunnel_enabled:1; /* If tunnel offloading is enabled on rxqs. */
+	uint32_t vport_match:1; /* vport match field. */
 	uint16_t domain_id; /* Switch domain identifier. */
 	uint16_t vport_id; /* Associated VF vport index (if any). */
 	uint32_t vport_meta_tag; /* Used for vport index match ove VF LAG. */
diff --git a/drivers/net/mlx5/mlx5_flow_dv.c b/drivers/net/mlx5/mlx5_flow_dv.c
index a39b4600e6..5b5716692c 100644
--- a/drivers/net/mlx5/mlx5_flow_dv.c
+++ b/drivers/net/mlx5/mlx5_flow_dv.c
@@ -10594,7 +10594,7 @@ flow_dv_translate_item_represented_port(struct rte_eth_dev *dev, void *key,
 	 * Kernel can use either misc.source_port or half of C0 metadata
 	 * register.
 	 */
-	if (priv->vport_meta_mask) {
+	if (priv->vport_meta_mask && !priv->vport_match) {
 		/*
 		 * Provide the hint for SW steering library
 		 * to insert the flow into ingress domain and
diff --git a/drivers/net/mlx5/mlx5_trigger.c b/drivers/net/mlx5/mlx5_trigger.c
index 7bdb897612..d28cbe1dfd 100644
--- a/drivers/net/mlx5/mlx5_trigger.c
+++ b/drivers/net/mlx5/mlx5_trigger.c
@@ -1515,7 +1515,10 @@ mlx5_traffic_enable_hws(struct rte_eth_dev *dev)
 				goto error;
 		}
 	} else {
-		DRV_LOG(INFO, "port %u FDB default rule is disabled", dev->data->port_id);
+		DRV_LOG(INFO, "port %u FDB default rule is disabled with vport_match %u",
+			dev->data->port_id, config->vport_match);
+		/* vport_match is only interesting in no default FDB rule mode. */
+		priv->vport_match = config->vport_match;
 	}
 	if (priv->isolated)
 		return 0;
-- 
2.25.1
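
The two hunks above interact: the option is copied into the port private data only when the FDB default rule is disabled, and the representor item translation then uses it to pick the match field. The following is a minimal, self-contained sketch of that logic, not the driver code. The `sketch_*` types and functions are hypothetical stand-ins for `mlx5_sh_config` and `mlx5_priv`. The sketch assumes that the unshown `if` condition in `mlx5_traffic_enable_hws()` tests `fdb_def_rule`, and that the unshown `else` branch in `flow_dv_translate_item_represented_port()` falls back to `misc.source_port`, as the comment in that hunk suggests.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for struct mlx5_sh_config. */
struct sketch_config {
	uint32_t fdb_def_rule:1; /* Create FDB default jump rule. */
	uint32_t vport_match:1;  /* Requested representor matching field selection. */
};

/* Hypothetical, trimmed-down stand-in for struct mlx5_priv. */
struct sketch_priv {
	uint32_t vport_meta_mask; /* Non-zero if metadata register C0 carries the vport tag. */
	uint32_t vport_match:1;   /* Effective selection, set when traffic is enabled. */
};

enum sketch_match_field {
	SKETCH_MATCH_META_REG_C0, /* Half of the C0 metadata register. */
	SKETCH_MATCH_SOURCE_PORT, /* misc.source_port. */
};

/*
 * Mirrors the mlx5_trigger.c hunk: the option is only propagated
 * when no default FDB rule is created (assumed condition).
 */
static void
sketch_traffic_enable(struct sketch_priv *priv, const struct sketch_config *config)
{
	if (!config->fdb_def_rule)
		priv->vport_match = config->vport_match;
}

/*
 * Mirrors the condition changed in the mlx5_flow_dv.c hunk: metadata
 * matching is used only if a metadata mask exists and the option is off.
 * The fallback to source_port is an assumption based on the code comment.
 */
static enum sketch_match_field
sketch_select_match_field(const struct sketch_priv *priv)
{
	if (priv->vport_meta_mask && !priv->vport_match)
		return SKETCH_MATCH_META_REG_C0;
	return SKETCH_MATCH_SOURCE_PORT;
}
```

In this sketch, a port with a non-zero `vport_meta_mask` keeps matching on the metadata register unless `vport_match` is requested and the default FDB rule is disabled. With the default rule enabled, the requested value never reaches the private data.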



* Re: [PATCH 4/8] net/mlx5: add sysfs check for Multiport E-Switch
  2023-10-31 14:27 ` [PATCH 4/8] net/mlx5: add sysfs " Dariusz Sosnowski
@ 2023-10-31 16:09   ` Stephen Hemminger
  2023-10-31 17:37     ` Dariusz Sosnowski
  0 siblings, 1 reply; 12+ messages in thread
From: Stephen Hemminger @ 2023-10-31 16:09 UTC (permalink / raw)
  To: Dariusz Sosnowski
  Cc: Matan Azrad, Viacheslav Ovsiienko, Ori Kam, Suanming Mou, dev,
	Raslan Darawsheh

On Tue, 31 Oct 2023 16:27:29 +0200
Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:

> +		MKSTR(sysfs_if_path, "/sys/class/net/%s", ifname);
> +		if (mlx5_get_pci_addr(sysfs_if_path, &if_pci_addr))
> +			continue;
> +		if (pci_addr->domain != if_pci_addr.domain ||
> +		    pci_addr->bus != if_pci_addr.bus ||
> +		    pci_addr->devid != if_pci_addr.devid ||
> +		    pci_addr->function != if_pci_addr.function)
> +			continue;
> +		MKSTR(sysfs_mpesw_path,
> +		      "/sys/class/net/%s/compat/devlink/lag_port_select_mode", ifname);

There is a lot of DPDK code that reads sysfs, but EAL and each driver end up
coding their own way of handling this. It would be good to have common helpers in EAL.


* RE: [PATCH 4/8] net/mlx5: add sysfs check for Multiport E-Switch
  2023-10-31 16:09   ` Stephen Hemminger
@ 2023-10-31 17:37     ` Dariusz Sosnowski
  0 siblings, 0 replies; 12+ messages in thread
From: Dariusz Sosnowski @ 2023-10-31 17:37 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Matan Azrad, Slava Ovsiienko, Ori Kam, Suanming Mou, dev,
	Raslan Darawsheh

Hi Stephen,

Thank you for your comment.

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, October 31, 2023 17:09
> To: Dariusz Sosnowski <dsosnowski@nvidia.com>
> Cc: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; Suanming Mou
> <suanmingm@nvidia.com>; dev@dpdk.org; Raslan Darawsheh
> <rasland@nvidia.com>
> Subject: Re: [PATCH 4/8] net/mlx5: add sysfs check for Multiport E-Switch
> 
> On Tue, 31 Oct 2023 16:27:29 +0200
> Dariusz Sosnowski <dsosnowski@nvidia.com> wrote:
> 
> > +             MKSTR(sysfs_if_path, "/sys/class/net/%s", ifname);
> > +             if (mlx5_get_pci_addr(sysfs_if_path, &if_pci_addr))
> > +                     continue;
> > +             if (pci_addr->domain != if_pci_addr.domain ||
> > +                 pci_addr->bus != if_pci_addr.bus ||
> > +                 pci_addr->devid != if_pci_addr.devid ||
> > +                 pci_addr->function != if_pci_addr.function)
> > +                     continue;
> > +             MKSTR(sysfs_mpesw_path,
> > +                   "/sys/class/net/%s/compat/devlink/lag_port_select_mode", ifname);
> 
> There is a lot of DPDK code that reads sysfs, but EAL and each driver end up
> coding their own way of handling this. It would be good to have common
> helpers in EAL.
Agreed.

From a quick glance, I see that there are a few sysfs paths with which several drivers interact, e.g.:
- /sys/class/net
- /sys/bus/pci/devices
- /sys/devices
I think that introducing common sysfs utilities in DPDK (for example, some way of interacting with such common paths, or just constructing sysfs paths) could be beneficial.
We can definitely look into it to see if it is viable.
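
As a rough illustration of what such utilities could look like, here is a minimal sketch of two helpers: one that constructs a path under a common sysfs base, and one that reads the first line of an attribute. Both `sketch_sysfs_path()` and `sketch_sysfs_read_line()` are hypothetical names invented for this example. They are not existing EAL or mlx5 APIs, and the error-code convention (negative errno values) is an assumption.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format "<base>/<name>/<attr>" into buf.
 * Returns 0 on success, -ENAMETOOLONG if the path does not fit,
 * -EINVAL on a formatting error.
 */
static int
sketch_sysfs_path(char *buf, size_t len, const char *base,
		  const char *name, const char *attr)
{
	int ret = snprintf(buf, len, "%s/%s/%s", base, name, attr);

	if (ret < 0)
		return -EINVAL;
	if ((size_t)ret >= len)
		return -ENAMETOOLONG;
	return 0;
}

/*
 * Hypothetical helper: read the first line of a sysfs attribute into buf
 * and strip the trailing newline. Returns 0 on success, negative errno
 * value otherwise.
 */
static int
sketch_sysfs_read_line(const char *path, char *buf, size_t len)
{
	FILE *f;
	int ret = 0;

	if (len == 0)
		return -EINVAL;
	f = fopen(path, "r");
	if (f == NULL)
		return -errno;
	if (fgets(buf, (int)len, f) == NULL)
		ret = -EIO;
	else
		buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
	return ret;
}
```

With helpers of this shape, the check quoted above could build its path by passing "/sys/class/net", the interface name and "compat/devlink/lag_port_select_mode", then read the attribute value with the second helper, instead of open-coding the string formatting in each driver.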

Best regards,
Dariusz Sosnowski


* RE: [PATCH 0/8] net/mlx5: add Multiport E-Switch support
  2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
                   ` (7 preceding siblings ...)
  2023-10-31 14:27 ` [PATCH 8/8] net/mlx5: add support for vport match selection Dariusz Sosnowski
@ 2023-10-31 21:49 ` Raslan Darawsheh
  8 siblings, 0 replies; 12+ messages in thread
From: Raslan Darawsheh @ 2023-10-31 21:49 UTC (permalink / raw)
  To: Dariusz Sosnowski, Matan Azrad, Slava Ovsiienko, Ori Kam, Suanming Mou
  Cc: dev

Hi,

> -----Original Message-----
> From: Dariusz Sosnowski <dsosnowski@nvidia.com>
> Sent: Tuesday, October 31, 2023 4:27 PM
> To: Matan Azrad <matan@nvidia.com>; Slava Ovsiienko
> <viacheslavo@nvidia.com>; Ori Kam <orika@nvidia.com>; Suanming Mou
> <suanmingm@nvidia.com>
> Cc: dev@dpdk.org; Raslan Darawsheh <rasland@nvidia.com>
> Subject: [PATCH 0/8] net/mlx5: add Multiport E-Switch support
> 
> This patchset adds support for probing ports of a Multiport
> E-Switch device to mlx5 PMD.
> 
> Multiport E-Switch is a configuration of NVIDIA ConnectX/BlueField HCAs
> where all connected entities (i.e. physical ports, VFs and SFs)
> share the same switch domain.
> In this mode, applications are allowed to create transfer flow rules
> which explicitly match on the physical port on which traffic
> arrives and/or on VFs and SFs, regardless of the root PF.
> On top of that, forwarding to any of these entities is allowed.
> Notably, applications are allowed to explicitly forward traffic
> to any of the physical ports of the HCA.
> 
> Bing Zhao (1):
>   net/mlx5: add support for vport match selection
> 
> Dariusz Sosnowski (6):
>   common/mlx5: fix controller index parsing
>   common/mlx5: add Netlink check for Multiport E-Switch
>   net/mlx5: add sysfs check for Multiport E-Switch
>   net/mlx5: add checking Multiport E-Switch state
>   net/mlx5: support port probing of Multiport E-Switch device
>   net/mlx5: sort port spawn data with uplink ports first
> 
> Itamar Gozlan (1):
>   net/mlx5/hws: fix leak in FT management
> 
>  doc/guides/nics/mlx5.rst                   | 157 +++++++++
>  doc/guides/rel_notes/release_23_11.rst     |   1 +
>  drivers/common/mlx5/linux/mlx5_common_os.c |   5 +-
>  drivers/common/mlx5/linux/mlx5_nl.c        |  70 ++++
>  drivers/common/mlx5/linux/mlx5_nl.h        |   5 +
>  drivers/common/mlx5/mlx5_common.h          |   1 +
>  drivers/common/mlx5/version.map            |   2 +
>  drivers/net/mlx5/hws/mlx5dr_matcher.c      |  41 +--
>  drivers/net/mlx5/linux/mlx5_os.c           | 379 +++++++++++++++++++--
>  drivers/net/mlx5/mlx5.c                    |  17 +
>  drivers/net/mlx5/mlx5.h                    |  41 +++
>  drivers/net/mlx5/mlx5_ethdev.c             |  53 ++-
>  drivers/net/mlx5/mlx5_flow_dv.c            |   2 +-
>  drivers/net/mlx5/mlx5_flow_hw.c            |   4 +-
>  drivers/net/mlx5/mlx5_mac.c                |   8 +-
>  drivers/net/mlx5/mlx5_trigger.c            |   5 +-
>  16 files changed, 718 insertions(+), 73 deletions(-)
> 
> --
> 2.25.1

Series applied to next-net-mlx.

Kindest regards,
Raslan Darawsheh


Thread overview: 12+ messages
2023-10-31 14:27 [PATCH 0/8] net/mlx5: add Multiport E-Switch support Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 1/8] net/mlx5/hws: fix leak in FT management Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 2/8] common/mlx5: fix controller index parsing Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 3/8] common/mlx5: add Netlink check for Multiport E-Switch Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 4/8] net/mlx5: add sysfs " Dariusz Sosnowski
2023-10-31 16:09   ` Stephen Hemminger
2023-10-31 17:37     ` Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 5/8] net/mlx5: add checking Multiport E-Switch state Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 6/8] net/mlx5: support port probing of Multiport E-Switch device Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 7/8] net/mlx5: sort port spawn data with uplink ports first Dariusz Sosnowski
2023-10-31 14:27 ` [PATCH 8/8] net/mlx5: add support for vport match selection Dariusz Sosnowski
2023-10-31 21:49 ` [PATCH 0/8] net/mlx5: add Multiport E-Switch support Raslan Darawsheh
