DPDK patches and discussions
* [dpdk-dev] [PATCH 1/5] net/mlx5: add VXLAN encap/decap support for e-switch
@ 2018-10-02  6:30 Slava Ovsiienko
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 2/5] net/mlx5: e-switch VXLAN netlink routines update Slava Ovsiienko
                   ` (4 more replies)
  0 siblings, 5 replies; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-02  6:30 UTC
  To: dev; +Cc: Shahaf Shuler, Slava Ovsiienko

This patchset adds support for RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP and
RTE_FLOW_ACTION_TYPE_VXLAN_DECAP to the mlx5 PMD. It is a refactored
version of the proposal 20180831092038.23051-2-adrien.mazarguil@6wind.com.

A typical use case is port representors in switchdev mode, with VXLAN
traffic encapsulation performed on traffic coming *from* a representor
and decapsulation on traffic going *to* that representor, in order
to transparently assign a given VXLAN to VF traffic.

Since these actions are supported at the switch level, the "transfer"
attribute must be set on such flow rules. They must also be combined
with a port redirection action to make sense.

Since only ingress is supported, encapsulation flow rules are normally
applied on a physical port and emit traffic to a port representor.
The opposite order is used for decapsulation.

Like other mlx5 switch flow rule actions, these are implemented through
Linux's TC flower API. Since the Linux interface for VXLAN encap/decap
involves virtual network devices (i.e. ip link add type vxlan [...]),
the PMD automatically spawns them as needed through Netlink calls.

VXLAN interfaces are dynamically created for each local UDP port of
outer networks and then used as targets for TC "flower" filters
in order to perform encapsulation. These VXLAN interfaces are
system-wide: only one device with a given local UDP port can exist
in the system (an attempt to create another device with the same
local UDP port returns EEXIST), so the PMD must maintain a database
of device instances shared between PMD instances. These implicitly
created VXLAN devices are called VTEPs (Virtual Tunnel End Points).
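
The sharing scheme described above could be sketched roughly as follows.
This is only an illustration, not the code of this patchset: the names
vtep, vtep_get and vtep_put are hypothetical, and the actual Netlink
device creation is replaced by a comment. It assumes one reference-counted
record per local UDP port, similar in spirit to the refcnt and port fields
of struct mlx5_flow_tcf_vtep introduced below.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical VTEP record: one per local UDP port, shared by all users. */
struct vtep {
	LIST_ENTRY(vtep) next;
	uint32_t refcnt; /* Number of users currently holding this device. */
	uint16_t port;   /* Local UDP port, unique system-wide. */
};

LIST_HEAD(vtep_list, vtep);

/* Look up the VTEP for a UDP port, creating the record on first use. */
static struct vtep *
vtep_get(struct vtep_list *list, uint16_t port)
{
	struct vtep *v;

	LIST_FOREACH(v, list, next) {
		if (v->port == port) {
			v->refcnt++;
			return v;
		}
	}
	v = calloc(1, sizeof(*v));
	if (v == NULL)
		return NULL;
	/* A real implementation would create the netdev via Netlink here. */
	v->port = port;
	v->refcnt = 1;
	LIST_INSERT_HEAD(list, v, next);
	return v;
}

/* Drop one reference and free the record when the last user is gone. */
static void
vtep_put(struct vtep *v)
{
	assert(v != NULL && v->refcnt > 0);
	if (--v->refcnt == 0) {
		LIST_REMOVE(v, next);
		free(v);
	}
}
```

With such a registry, two flow rules using the same UDP port would obtain
the same record from vtep_get(), so the second request never attempts to
create a duplicate device and cannot hit EEXIST.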

The first part of the patchset introduces the new data structures and
definitions needed to implement VXLAN support in the mlx5 PMD.

The history of the patch:

v1
Refactored code of the initial experimental proposal
20180831092038.23051-2-adrien.mazarguil@6wind.com; an unattached
VTEP is used in order to resolve the problem of VTEP UDP port sharing
between several PMD ports.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 app/test-pmd/config.c            |   3 +
 drivers/net/mlx5/Makefile        |  75 ++++++++++++++++++
 drivers/net/mlx5/mlx5_flow.h     |  11 +++
 drivers/net/mlx5/mlx5_flow_tcf.c | 167 +++++++++++++++++++++++++++++++++++++++
 drivers/net/mlx5/mlx5_nl.c       |  12 ++-
 5 files changed, 264 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/config.c b/app/test-pmd/config.c
index 794aa52..b088c9f 100644
--- a/app/test-pmd/config.c
+++ b/app/test-pmd/config.c
@@ -1172,6 +1172,9 @@ enum item_spec_type {
 		       sizeof(struct rte_flow_action_of_pop_mpls)),
 	MK_FLOW_ACTION(OF_PUSH_MPLS,
 		       sizeof(struct rte_flow_action_of_push_mpls)),
+	MK_FLOW_ACTION(VXLAN_ENCAP,
+		       sizeof(struct rte_flow_action_vxlan_encap)),
+	MK_FLOW_ACTION(VXLAN_DECAP, 0),
 };
 
 /** Compute storage space needed by action configuration and copy it. */
diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
index ca1de9f..63c7191 100644
--- a/drivers/net/mlx5/Makefile
+++ b/drivers/net/mlx5/Makefile
@@ -347,6 +347,81 @@ mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
 		enum TCA_VLAN_PUSH_VLAN_PRIORITY \
 		$(AUTOCONF_OUTPUT)
 	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_KEY_ID \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_KEY_ID \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_SRC \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_DST \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_SRC \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_DST \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TC_ACT_TUNNEL_KEY \
+		linux/tc_act/tc_tunnel_key.h \
+		define TCA_ACT_TUNNEL_KEY \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT \
+		linux/tc_act/tc_tunnel_key.h \
+		enum TCA_TUNNEL_KEY_ENC_DST_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
 		HAVE_SUPPORTED_40000baseKR4_Full \
 		/usr/include/linux/ethtool.h \
 		define SUPPORTED_40000baseKR4_Full \
diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index 10d700a..2d56ced 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -87,6 +87,8 @@
 #define MLX5_ACTION_OF_PUSH_VLAN (1u << 8)
 #define MLX5_ACTION_OF_SET_VLAN_VID (1u << 9)
 #define MLX5_ACTION_OF_SET_VLAN_PCP (1u << 10)
+#define MLX5_ACTION_VXLAN_ENCAP (1u << 11)
+#define MLX5_ACTION_VXLAN_DECAP (1u << 12)
 
 /* possible L3 layers protocols filtering. */
 #define MLX5_IP_PROTOCOL_TCP 6
@@ -178,8 +180,17 @@ struct mlx5_flow_dv {
 
 /** Linux TC flower driver for E-Switch flow. */
 struct mlx5_flow_tcf {
+	uint32_t nlsize; /**< Size of NL message buffer. */
+	uint32_t applied:1; /**< Whether rule is currently applied. */
+	uint64_t item_flags; /**< Item flags. */
+	uint64_t action_flags; /**< Action flags. */
 	struct nlmsghdr *nlh;
 	struct tcmsg *tcm;
+	union { /**< Tunnel encap/decap descriptor. */
+		struct mlx5_flow_tcf_tunnel_hdr *tunnel;
+		struct mlx5_flow_tcf_vxlan_decap *vxlan_decap;
+		struct mlx5_flow_tcf_vxlan_encap *vxlan_encap;
+	};
 };
 
 /* Verbs specification header. */
diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 1437618..5c93412 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -6,6 +6,29 @@
 #include <assert.h>
 #include <errno.h>
 #include <libmnl/libmnl.h>
+/*
+ * Older versions of linux/if.h do not have the required safeties to coexist
+ * with net/if.h. This causes a compilation failure due to symbol
+ * redefinitions even when including the latter first.
+ *
+ * One workaround is to prevent net/if.h from defining conflicting symbols
+ * by removing __USE_MISC, and maintaining it undefined while including
+ * linux/if.h.
+ *
+ * Alphabetical order cannot be preserved since net/if.h must always be
+ * included before linux/if.h regardless.
+ */
+#ifdef __USE_MISC
+#undef __USE_MISC
+#define RESTORE_USE_MISC
+#endif
+#include <net/if.h>
+#include <linux/if.h>
+#ifdef RESTORE_USE_MISC
+#undef RESTORE_USE_MISC
+#define __USE_MISC 1
+#endif
+#include <linux/if_arp.h>
 #include <linux/if_ether.h>
 #include <linux/netlink.h>
 #include <linux/pkt_cls.h>
@@ -53,6 +76,34 @@ struct tc_vlan {
 
 #endif /* HAVE_TC_ACT_VLAN */
 
+#ifdef HAVE_TC_ACT_TUNNEL_KEY
+
+#include <linux/tc_act/tc_tunnel_key.h>
+
+#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
+#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
+#endif
+
+#else /* HAVE_TC_ACT_TUNNEL_KEY */
+
+#define TCA_ACT_TUNNEL_KEY 17
+#define TCA_TUNNEL_KEY_ACT_SET 1
+#define TCA_TUNNEL_KEY_ACT_RELEASE 2
+#define TCA_TUNNEL_KEY_PARMS 2
+#define TCA_TUNNEL_KEY_ENC_IPV4_SRC 3
+#define TCA_TUNNEL_KEY_ENC_IPV4_DST 4
+#define TCA_TUNNEL_KEY_ENC_IPV6_SRC 5
+#define TCA_TUNNEL_KEY_ENC_IPV6_DST 6
+#define TCA_TUNNEL_KEY_ENC_KEY_ID 7
+#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
+
+struct tc_tunnel_key {
+	tc_gen;
+	int t_action;
+};
+
+#endif /* HAVE_TC_ACT_TUNNEL_KEY */
+
 /* Normally found in linux/netlink.h. */
 #ifndef NETLINK_CAP_ACK
 #define NETLINK_CAP_ACK 10
@@ -148,11 +199,118 @@ struct tc_vlan {
 #ifndef HAVE_TCA_FLOWER_KEY_VLAN_ETH_TYPE
 #define TCA_FLOWER_KEY_VLAN_ETH_TYPE 25
 #endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_KEY_ID
+#define TCA_FLOWER_KEY_ENC_KEY_ID 26
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC
+#define TCA_FLOWER_KEY_ENC_IPV4_SRC 27
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK
+#define TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK 28
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST
+#define TCA_FLOWER_KEY_ENC_IPV4_DST 29
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK
+#define TCA_FLOWER_KEY_ENC_IPV4_DST_MASK 30
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC
+#define TCA_FLOWER_KEY_ENC_IPV6_SRC 31
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK
+#define TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK 32
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST
+#define TCA_FLOWER_KEY_ENC_IPV6_DST 33
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK
+#define TCA_FLOWER_KEY_ENC_IPV6_DST_MASK 34
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT
+#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT 43
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK
+#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK 44
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT
+#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT 45
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK
+#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK 46
+#endif
 
 #ifndef IPV6_ADDR_LEN
 #define IPV6_ADDR_LEN 16
 #endif
 
+#define MLX5_VXLAN_DEFAULT_PORT	4789
+#define MLX5_VXLAN_DEVICE_PFX "vmlx_"
+
+/** Tunnel action type, used for @p type in header structure. */
+enum mlx5_flow_tcf_tunact_type {
+	MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP,
+	MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP,
+};
+
+/** Flags used for @p mask in tunnel action encap descriptors. */
+#define	MLX5_FLOW_TCF_ENCAP_ETH_SRC	(1u << 0)
+#define	MLX5_FLOW_TCF_ENCAP_ETH_DST	(1u << 1)
+#define	MLX5_FLOW_TCF_ENCAP_IPV4_SRC	(1u << 2)
+#define	MLX5_FLOW_TCF_ENCAP_IPV4_DST	(1u << 3)
+#define	MLX5_FLOW_TCF_ENCAP_IPV6_SRC	(1u << 4)
+#define	MLX5_FLOW_TCF_ENCAP_IPV6_DST	(1u << 5)
+#define	MLX5_FLOW_TCF_ENCAP_UDP_SRC	(1u << 6)
+#define	MLX5_FLOW_TCF_ENCAP_UDP_DST	(1u << 7)
+#define	MLX5_FLOW_TCF_ENCAP_VXLAN_VNI	(1u << 8)
+
+/** VXLAN virtual netdev. */
+struct mlx5_flow_tcf_vtep {
+	LIST_ENTRY(mlx5_flow_tcf_vtep) next;
+	uint32_t refcnt;
+	unsigned int ifindex;
+	uint16_t port;
+	uint8_t notcreated;
+};
+
+/** Tunnel descriptor header, common for all tunnel types. */
+struct mlx5_flow_tcf_tunnel_hdr {
+	uint32_t type; /**< Tunnel action type. */
+	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
+	unsigned int ifindex_org; /**< Original dst/src interface */
+	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
+};
+
+struct mlx5_flow_tcf_vxlan_decap {
+	struct mlx5_flow_tcf_tunnel_hdr hdr;
+	uint16_t udp_port;
+};
+
+struct mlx5_flow_tcf_vxlan_encap {
+	struct mlx5_flow_tcf_tunnel_hdr hdr;
+	uint32_t mask;
+	struct {
+		struct ether_addr dst;
+		struct ether_addr src;
+	} eth;
+	union {
+		struct {
+			rte_be32_t dst;
+			rte_be32_t src;
+		} ipv4;
+		struct {
+			uint8_t dst[16];
+			uint8_t src[16];
+		} ipv6;
+	};
+	struct {
+		rte_be16_t src;
+		rte_be16_t dst;
+	} udp;
+	struct {
+		uint8_t vni[3];
+	} vxlan;
+};
+
 /** Empty masks for known item types. */
 static const union {
 	struct rte_flow_item_port_id port_id;
@@ -162,6 +320,7 @@ struct tc_vlan {
 	struct rte_flow_item_ipv6 ipv6;
 	struct rte_flow_item_tcp tcp;
 	struct rte_flow_item_udp udp;
+	struct rte_flow_item_vxlan vxlan;
 } flow_tcf_mask_empty;
 
 /** Supported masks for known item types. */
@@ -173,6 +332,7 @@ struct tc_vlan {
 	struct rte_flow_item_ipv6 ipv6;
 	struct rte_flow_item_tcp tcp;
 	struct rte_flow_item_udp udp;
+	struct rte_flow_item_vxlan vxlan;
 } flow_tcf_mask_supported = {
 	.port_id = {
 		.id = 0xffffffff,
@@ -209,6 +369,9 @@ struct tc_vlan {
 		.src_port = RTE_BE16(0xffff),
 		.dst_port = RTE_BE16(0xffff),
 	},
+	.vxlan = {
+	       .vni = "\xff\xff\xff",
+	},
 };
 
 #define SZ_NLATTR_HDR MNL_ALIGN(sizeof(struct nlattr))
@@ -216,6 +379,10 @@ struct tc_vlan {
 #define SZ_NLATTR_DATA_OF(len) MNL_ALIGN(SZ_NLATTR_HDR + (len))
 #define SZ_NLATTR_TYPE_OF(typ) SZ_NLATTR_DATA_OF(sizeof(typ))
 #define SZ_NLATTR_STRZ_OF(str) SZ_NLATTR_DATA_OF(strlen(str) + 1)
+#define SZ_NLATTR_TYPE_OF_UINT8 SZ_NLATTR_TYPE_OF(uint8_t)
+#define SZ_NLATTR_TYPE_OF_UINT16 SZ_NLATTR_TYPE_OF(uint16_t)
+#define SZ_NLATTR_TYPE_OF_UINT32 SZ_NLATTR_TYPE_OF(uint32_t)
+#define SZ_NLATTR_TYPE_OF_STRUCT(typ) SZ_NLATTR_TYPE_OF(struct typ)
 
 #define PTOI_TABLE_SZ_MAX(dev) (mlx5_dev_to_port_id((dev)->device, NULL, 0) + 2)
 
diff --git a/drivers/net/mlx5/mlx5_nl.c b/drivers/net/mlx5/mlx5_nl.c
index d61826a..88e8e15 100644
--- a/drivers/net/mlx5/mlx5_nl.c
+++ b/drivers/net/mlx5/mlx5_nl.c
@@ -385,8 +385,10 @@ struct mlx5_nl_ifindex_data {
 	int ret;
 	uint32_t sn = priv->nl_sn++;
 
-	if (priv->nl_socket_route == -1)
-		return 0;
+	if (priv->nl_socket_route < 0) {
+		rte_errno = ENOENT;
+		goto error;
+	}
 	fd = priv->nl_socket_route;
 	ret = mlx5_nl_request(fd, &req.hdr, sn, &req.ifm,
 			      sizeof(struct ifinfomsg));
@@ -449,8 +451,10 @@ struct mlx5_nl_ifindex_data {
 	int ret;
 	uint32_t sn = priv->nl_sn++;
 
-	if (priv->nl_socket_route == -1)
-		return 0;
+	if (priv->nl_socket_route < 0) {
+		rte_errno = ENOENT;
+		goto error;
+	}
 	fd = priv->nl_socket_route;
 	memcpy(RTA_DATA(&req.rta), mac, ETHER_ADDR_LEN);
 	req.hdr.nlmsg_len = NLMSG_ALIGN(req.hdr.nlmsg_len) +
-- 
1.8.3.1

* [dpdk-dev] [PATCH 2/5] net/mlx5: e-switch VXLAN netlink routines update
  2018-10-02  6:30 [dpdk-dev] [PATCH 1/5] net/mlx5: add VXLAN encap/decap support for e-switch Slava Ovsiienko
@ 2018-10-02  6:30 ` Slava Ovsiienko
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 3/5] net/mlx5: e-switch VXLAN flow validation routine Slava Ovsiienko
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-02  6:30 UTC
  To: dev; +Cc: Shahaf Shuler, Slava Ovsiienko

This part of the patchset updates the Netlink exchange routines. Message
sequence numbers are no longer random, multipart reply messages are
supported, errors are no longer propagated to the following socket calls,
and the Netlink reply buffer size is increased to MNL_SOCKET_BUFFER_SIZE.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5.c          |  18 ++--
 drivers/net/mlx5/mlx5.h          |   7 +-
 drivers/net/mlx5/mlx5_flow.h     |   9 +-
 drivers/net/mlx5/mlx5_flow_tcf.c | 214 +++++++++++++++++++++++----------------
 4 files changed, 147 insertions(+), 101 deletions(-)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 4be6a1c..201a26e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -287,8 +287,7 @@
 		close(priv->nl_socket_route);
 	if (priv->nl_socket_rdma >= 0)
 		close(priv->nl_socket_rdma);
-	if (priv->mnl_socket)
-		mlx5_flow_tcf_socket_destroy(priv->mnl_socket);
+	mlx5_flow_tcf_socket_close(&priv->tcf_socket);
 	ret = mlx5_hrxq_ibv_verify(dev);
 	if (ret)
 		DRV_LOG(WARNING, "port %u some hash Rx queue still remain",
@@ -1138,8 +1137,9 @@
 	claim_zero(mlx5_mac_addr_add(eth_dev, &mac, 0, 0));
 	if (vf && config.vf_nl_en)
 		mlx5_nl_mac_addr_sync(eth_dev);
-	priv->mnl_socket = mlx5_flow_tcf_socket_create();
-	if (!priv->mnl_socket) {
+	/* Initialize Netlink socket for e-switch control */
+	err = mlx5_flow_tcf_socket_open(&priv->tcf_socket);
+	if (err) {
 		err = -rte_errno;
 		DRV_LOG(WARNING,
 			"flow rules relying on switch offloads will not be"
@@ -1154,16 +1154,15 @@
 			error.message =
 				"cannot retrieve network interface index";
 		} else {
-			err = mlx5_flow_tcf_init(priv->mnl_socket, ifindex,
-						&error);
+			err = mlx5_flow_tcf_ifindex_init(&priv->tcf_socket,
+							 ifindex, &error);
 		}
 		if (err) {
 			DRV_LOG(WARNING,
 				"flow rules relying on switch offloads will"
 				" not be supported: %s: %s",
 				error.message, strerror(rte_errno));
-			mlx5_flow_tcf_socket_destroy(priv->mnl_socket);
-			priv->mnl_socket = NULL;
+			mlx5_flow_tcf_socket_close(&priv->tcf_socket);
 		}
 	}
 	TAILQ_INIT(&priv->flows);
@@ -1218,8 +1217,7 @@
 			close(priv->nl_socket_route);
 		if (priv->nl_socket_rdma >= 0)
 			close(priv->nl_socket_rdma);
-		if (priv->mnl_socket)
-			mlx5_flow_tcf_socket_destroy(priv->mnl_socket);
+		mlx5_flow_tcf_socket_close(&priv->tcf_socket);
 		if (own_domain_id)
 			claim_zero(rte_eth_switch_domain_free(priv->domain_id));
 		rte_free(priv);
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 8de0d74..b327a39 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -160,6 +160,11 @@ struct mlx5_drop {
 
 struct mnl_socket;
 
+struct mlx5_tcf_socket {
+	uint32_t seq; /* Message sequence number. */
+	struct mnl_socket *nl; /* NETLINK_ROUTE libmnl socket. */
+};
+
 struct priv {
 	LIST_ENTRY(priv) mem_event_cb; /* Called by memory event callback. */
 	struct rte_eth_dev_data *dev_data;  /* Pointer to device data. */
@@ -220,12 +225,12 @@ struct priv {
 	int nl_socket_rdma; /* Netlink socket (NETLINK_RDMA). */
 	int nl_socket_route; /* Netlink socket (NETLINK_ROUTE). */
 	uint32_t nl_sn; /* Netlink message sequence number. */
+	struct mlx5_tcf_socket tcf_socket; /* Libmnl socket for tcf. */
 #ifndef RTE_ARCH_64
 	rte_spinlock_t uar_lock_cq; /* CQs share a common distinct UAR */
 	rte_spinlock_t uar_lock[MLX5_UAR_PAGE_NUM_MAX];
 	/* UAR same-page access control required in 32bit implementations. */
 #endif
-	struct mnl_socket *mnl_socket; /* Libmnl socket. */
 };
 
 #define PORT_ID(priv) ((priv)->dev_data->port_id)
diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index 2d56ced..fff905a 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -348,9 +348,10 @@ int mlx5_flow_validate_item_vxlan_gpe(const struct rte_flow_item *item,
 
 /* mlx5_flow_tcf.c */
 
-int mlx5_flow_tcf_init(struct mnl_socket *nl, unsigned int ifindex,
-		       struct rte_flow_error *error);
-struct mnl_socket *mlx5_flow_tcf_socket_create(void);
-void mlx5_flow_tcf_socket_destroy(struct mnl_socket *nl);
+int mlx5_flow_tcf_ifindex_init(struct mlx5_tcf_socket *tcf,
+			       unsigned int ifindex,
+			       struct rte_flow_error *error);
+int mlx5_flow_tcf_socket_open(struct mlx5_tcf_socket *tcf);
+void mlx5_flow_tcf_socket_close(struct mlx5_tcf_socket *tcf);
 
 #endif /* RTE_PMD_MLX5_FLOW_H_ */
diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 5c93412..15e250c 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -1552,8 +1552,8 @@ struct flow_tcf_ptoi {
 /**
  * Send Netlink message with acknowledgment.
  *
- * @param nl
- *   Libmnl socket to use.
+ * @param tcf
+ *   Libmnl socket context to use.
  * @param nlh
  *   Message to send. This function always raises the NLM_F_ACK flag before
  *   sending.
@@ -1562,26 +1562,108 @@ struct flow_tcf_ptoi {
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-flow_tcf_nl_ack(struct mnl_socket *nl, struct nlmsghdr *nlh)
+flow_tcf_nl_ack(struct mlx5_tcf_socket *tcf, struct nlmsghdr *nlh)
 {
 	alignas(struct nlmsghdr)
-	uint8_t ans[mnl_nlmsg_size(sizeof(struct nlmsgerr)) +
-		    nlh->nlmsg_len - sizeof(*nlh)];
-	uint32_t seq = random();
-	int ret;
-
+	uint8_t ans[MNL_SOCKET_BUFFER_SIZE];
+	unsigned int portid = mnl_socket_get_portid(tcf->nl);
+	uint32_t seq = tcf->seq++;
+	struct mnl_socket *nl = tcf->nl;
+	int err, ret;
+
+	assert(nl);
+	if (!seq)
+		seq = tcf->seq++;
 	nlh->nlmsg_flags |= NLM_F_ACK;
 	nlh->nlmsg_seq = seq;
 	ret = mnl_socket_sendto(nl, nlh, nlh->nlmsg_len);
-	if (ret != -1)
-		ret = mnl_socket_recvfrom(nl, ans, sizeof(ans));
-	if (ret != -1)
-		ret = mnl_cb_run
-			(ans, ret, seq, mnl_socket_get_portid(nl), NULL, NULL);
+	err = (ret <= 0) ? errno : 0;
+	nlh = (struct nlmsghdr *)ans;
+	/*
+	 * The following loop postpones non-fatal errors until multipart
+	 * messages are complete.
+	 */
 	if (ret > 0)
+		while (true) {
+			ret = mnl_socket_recvfrom(nl, ans, sizeof(ans));
+			if (ret < 0) {
+				err = errno;
+				if (err != ENOSPC)
+					break;
+			}
+			if (!err) {
+				ret = mnl_cb_run(nlh, ret, seq, portid,
+						 NULL, NULL);
+				if (ret < 0) {
+					err = errno;
+					break;
+				}
+			}
+			/* Will receive till end of multipart message */
+			if (!(nlh->nlmsg_flags & NLM_F_MULTI) ||
+			      nlh->nlmsg_type == NLMSG_DONE)
+				break;
+		}
+	if (!err)
 		return 0;
-	rte_errno = errno;
-	return -rte_errno;
+	rte_errno = err;
+	return -err;
+}
+
+/**
+ * Initialize ingress qdisc of a given network interface.
+ *
+ * @param tcf
+ *   Libmnl socket context object.
+ * @param ifindex
+ *   Index of network interface to initialize.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+int
+mlx5_flow_tcf_ifindex_init(struct mlx5_tcf_socket *tcf, unsigned int ifindex,
+		   struct rte_flow_error *error)
+{
+	struct nlmsghdr *nlh;
+	struct tcmsg *tcm;
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(sizeof(*tcm) + 128)];
+
+	/* Destroy existing ingress qdisc and everything attached to it. */
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = RTM_DELQDISC;
+	nlh->nlmsg_flags = NLM_F_REQUEST;
+	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
+	tcm->tcm_family = AF_UNSPEC;
+	tcm->tcm_ifindex = ifindex;
+	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
+	tcm->tcm_parent = TC_H_INGRESS;
+	/* Ignore errors when qdisc is already absent. */
+	if (flow_tcf_nl_ack(tcf, nlh) &&
+	    rte_errno != EINVAL && rte_errno != ENOENT)
+		return rte_flow_error_set(error, rte_errno,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  "netlink: failed to remove ingress"
+					  " qdisc");
+	/* Create fresh ingress qdisc. */
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = RTM_NEWQDISC;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
+	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
+	tcm->tcm_family = AF_UNSPEC;
+	tcm->tcm_ifindex = ifindex;
+	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
+	tcm->tcm_parent = TC_H_INGRESS;
+	mnl_attr_put_strz_check(nlh, sizeof(buf), TCA_KIND, "ingress");
+	if (flow_tcf_nl_ack(tcf, nlh))
+		return rte_flow_error_set(error, rte_errno,
+					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+					  "netlink: failed to create ingress"
+					  " qdisc");
+	return 0;
 }
 
 /**
@@ -1602,18 +1684,25 @@ struct flow_tcf_ptoi {
 	       struct rte_flow_error *error)
 {
 	struct priv *priv = dev->data->dev_private;
-	struct mnl_socket *nl = priv->mnl_socket;
+	struct mlx5_tcf_socket *tcf = &priv->tcf_socket;
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
+	int ret;
 
 	dev_flow = LIST_FIRST(&flow->dev_flows);
 	/* E-Switch flow can't be expanded. */
 	assert(!LIST_NEXT(dev_flow, next));
+	if (dev_flow->tcf.applied)
+		return 0;
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_NEWTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
-	if (!flow_tcf_nl_ack(nl, nlh))
+	ret = flow_tcf_nl_ack(tcf, nlh);
+	if (!ret) {
+		dev_flow->tcf.applied = 1;
 		return 0;
+	}
+	DRV_LOG(WARNING, "Failed to create TC rule (%d)", rte_errno);
 	return rte_flow_error_set(error, rte_errno,
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
 				  "netlink: failed to create TC flow rule");
@@ -1631,7 +1720,7 @@ struct flow_tcf_ptoi {
 flow_tcf_remove(struct rte_eth_dev *dev, struct rte_flow *flow)
 {
 	struct priv *priv = dev->data->dev_private;
-	struct mnl_socket *nl = priv->mnl_socket;
+	struct mlx5_tcf_socket *tcf = &priv->tcf_socket;
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
 
@@ -1645,7 +1734,8 @@ struct flow_tcf_ptoi {
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_DELTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST;
-	flow_tcf_nl_ack(nl, nlh);
+	flow_tcf_nl_ack(tcf, nlh);
+	dev_flow->tcf.applied = 0;
 }
 
 /**
@@ -1683,93 +1773,45 @@ struct flow_tcf_ptoi {
 };
 
 /**
- * Initialize ingress qdisc of a given network interface.
- *
- * @param nl
- *   Libmnl socket of the @p NETLINK_ROUTE kind.
- * @param ifindex
- *   Index of network interface to initialize.
- * @param[out] error
- *   Perform verbose error reporting if not NULL.
+ * Creates and configures a libmnl socket for Netlink flow rules.
  *
+ * @param tcf
+ *   tcf socket object to be initialized by function.
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 int
-mlx5_flow_tcf_init(struct mnl_socket *nl, unsigned int ifindex,
-		   struct rte_flow_error *error)
-{
-	struct nlmsghdr *nlh;
-	struct tcmsg *tcm;
-	alignas(struct nlmsghdr)
-	uint8_t buf[mnl_nlmsg_size(sizeof(*tcm) + 128)];
-
-	/* Destroy existing ingress qdisc and everything attached to it. */
-	nlh = mnl_nlmsg_put_header(buf);
-	nlh->nlmsg_type = RTM_DELQDISC;
-	nlh->nlmsg_flags = NLM_F_REQUEST;
-	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
-	tcm->tcm_family = AF_UNSPEC;
-	tcm->tcm_ifindex = ifindex;
-	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
-	tcm->tcm_parent = TC_H_INGRESS;
-	/* Ignore errors when qdisc is already absent. */
-	if (flow_tcf_nl_ack(nl, nlh) &&
-	    rte_errno != EINVAL && rte_errno != ENOENT)
-		return rte_flow_error_set(error, rte_errno,
-					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
-					  "netlink: failed to remove ingress"
-					  " qdisc");
-	/* Create fresh ingress qdisc. */
-	nlh = mnl_nlmsg_put_header(buf);
-	nlh->nlmsg_type = RTM_NEWQDISC;
-	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
-	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
-	tcm->tcm_family = AF_UNSPEC;
-	tcm->tcm_ifindex = ifindex;
-	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
-	tcm->tcm_parent = TC_H_INGRESS;
-	mnl_attr_put_strz_check(nlh, sizeof(buf), TCA_KIND, "ingress");
-	if (flow_tcf_nl_ack(nl, nlh))
-		return rte_flow_error_set(error, rte_errno,
-					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
-					  "netlink: failed to create ingress"
-					  " qdisc");
-	return 0;
-}
-
-/**
- * Create and configure a libmnl socket for Netlink flow rules.
- *
- * @return
- *   A valid libmnl socket object pointer on success, NULL otherwise and
- *   rte_errno is set.
- */
-struct mnl_socket *
-mlx5_flow_tcf_socket_create(void)
+mlx5_flow_tcf_socket_open(struct mlx5_tcf_socket *tcf)
 {
 	struct mnl_socket *nl = mnl_socket_open(NETLINK_ROUTE);
 
+	tcf->nl = NULL;
 	if (nl) {
 		mnl_socket_setsockopt(nl, NETLINK_CAP_ACK, &(int){ 1 },
 				      sizeof(int));
-		if (!mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID))
-			return nl;
+		if (!mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID)) {
+			tcf->nl = nl;
+			tcf->seq = random();
+			return 0;
+		}
 	}
 	rte_errno = errno;
 	if (nl)
 		mnl_socket_close(nl);
-	return NULL;
+	return -rte_errno;
 }
 
 /**
- * Destroy a libmnl socket.
+ * Destroys tcf object (closes MNL socket).
  *
- * @param nl
- *   Libmnl socket of the @p NETLINK_ROUTE kind.
+ * @param tcf
+ *   tcf socket object to be destroyed by function.
  */
 void
-mlx5_flow_tcf_socket_destroy(struct mnl_socket *nl)
+mlx5_flow_tcf_socket_close(struct mlx5_tcf_socket *tcf)
 {
-	mnl_socket_close(nl);
+	if (tcf->nl) {
+		mnl_socket_close(tcf->nl);
+		tcf->nl = NULL;
+	}
 }
-- 
1.8.3.1

* [dpdk-dev] [PATCH 3/5] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-02  6:30 [dpdk-dev] [PATCH 1/5] net/mlx5: add VXLAN encap/decap support for e-switch Slava Ovsiienko
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 2/5] net/mlx5: e-switch VXLAN netlink routines update Slava Ovsiienko
@ 2018-10-02  6:30 ` Slava Ovsiienko
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 4/5] net/mlx5: e-switch VXLAN flow translation routine Slava Ovsiienko
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-02  6:30 UTC
  To: dev; +Cc: Shahaf Shuler, Slava Ovsiienko

This part of the patchset adds support for validation of flow
item/action lists. The following entities are now supported:

- RTE_FLOW_ITEM_TYPE_VXLAN, contains the tunnel VNI

- RTE_FLOW_ACTION_TYPE_VXLAN_DECAP, if this action is specified,
  the items in the flow item list are treated as outer network
  parameters for the tunnel outer header match. The Ethernet layer
  addresses are always treated as inner ones.

- RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, contains the item list used to
  build the encapsulation header. In the current implementation the
  values are subject to some constraints:
	- the outer source IP should coincide with the address
	  assigned to the outer egress interface
	- the outer source MAC address is always unconditionally
	  set to one of the MAC addresses of the outer egress interface
	- there is no way to specify the source UDP port
	- all of the above parameters are ignored if specified
	  in the rule, and warning messages are sent to the log
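
The per-field mask handling used by the encapsulation item validators
can be summarized by the sketch below. It is a simplified illustration
with hypothetical names (mask_kind, mask_classify), not the code of this
patch: an empty mask means the field is not requested, a full mask means
the field is requested (and is either used or ignored with a warning),
and any other mask is a partial one, which the validators reject with
ENOTSUP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification of a field mask. */
enum mask_kind {
	MASK_EMPTY,   /* All bits clear: field is not requested. */
	MASK_FULL,    /* All bits set: field is requested as a whole. */
	MASK_PARTIAL, /* Anything else: not supported. */
};

/* Classify a byte-array mask of the given length. */
static enum mask_kind
mask_classify(const uint8_t *mask, size_t len)
{
	size_t zeros = 0;
	size_t ones = 0;
	size_t i;

	assert(mask != NULL);
	for (i = 0; i < len; i++) {
		if (mask[i] == 0x00)
			zeros++;
		else if (mask[i] == 0xff)
			ones++;
	}
	if (zeros == len)
		return MASK_EMPTY;
	if (ones == len)
		return MASK_FULL;
	return MASK_PARTIAL;
}
```

For a 6-byte MAC address mask, ff:ff:ff:ff:ff:ff would be classified as
MASK_FULL and ff:ff:ff:00:00:00 as MASK_PARTIAL.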

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 717 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 713 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 15e250c..97451bd 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -558,6 +558,630 @@ struct flow_tcf_ptoi {
 }
 
 /**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_ETH item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_eth(const struct rte_flow_item *item,
+				  struct rte_flow_error *error)
+{
+	const struct rte_flow_item_eth *spec = item->spec;
+	const struct rte_flow_item_eth *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for L2 addresses can be empty
+		 * because these ones are optional and not
+		 * required directly by tc rule.
+		 */
+		return 0;
+	if (!mask)
+		/* If mask is not specified use the default one. */
+		mask = &rte_flow_item_eth_mask;
+	if (memcmp(&mask->dst,
+		   &flow_tcf_mask_empty.eth.dst,
+		   sizeof(flow_tcf_mask_empty.eth.dst))) {
+		if (memcmp(&mask->dst,
+			   &rte_flow_item_eth_mask.dst,
+			   sizeof(rte_flow_item_eth_mask.dst)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"eth.dst\" field");
+		/*
+		 * Ethernet addresses are not supported by
+		 * tc as tunnel_key parameters. Destination
+		 * L2 address is needed to form encap packet
+		 * header and retrieved by kernel from implicit
+		 * sources (ARP table, etc), address masks are
+		 * not supported at all.
+		 */
+		DRV_LOG(WARNING,
+			"outer ethernet destination address "
+			"cannot be forced for VXLAN "
+			"encapsulation, parameter ignored");
+	}
+	if (memcmp(&mask->src,
+		   &flow_tcf_mask_empty.eth.src,
+		   sizeof(flow_tcf_mask_empty.eth.src))) {
+		if (memcmp(&mask->src,
+			   &rte_flow_item_eth_mask.src,
+			   sizeof(rte_flow_item_eth_mask.src)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"eth.src\" field");
+		DRV_LOG(WARNING,
+			"outer ethernet source address "
+			"cannot be forced for VXLAN "
+			"encapsulation, parameter ignored");
+	}
+	if (mask->type != RTE_BE16(0x0000)) {
+		if (mask->type != RTE_BE16(0xffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"eth.type\" field");
+		DRV_LOG(WARNING,
+			"outer ethernet type field "
+			"cannot be forced for VXLAN "
+			"encapsulation, parameter ignored");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV4 item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_ipv4(const struct rte_flow_item *item,
+				   struct rte_flow_error *error)
+{
+	const struct rte_flow_item_ipv4 *spec = item->spec;
+	const struct rte_flow_item_ipv4 *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for L3 addresses cannot be empty
+		 * because it is required by tunnel_key parameter.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL outer L3 address specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_ipv4_mask;
+	if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
+		if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv4.hdr.dst_addr\" field");
+		/* More L3 address validations can be put here. */
+	} else {
+		/*
+		 * Kernel uses the destination L3 address to determine
+		 * the routing path and obtain the L2 destination
+		 * address, so L3 destination address must be
+		 * specified in the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	if (mask->hdr.src_addr != RTE_BE32(0x00000000)) {
+		if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv4.hdr.src_addr\" field");
+		/* More L3 address validations can be put here. */
+	} else {
+		/*
+		 * Kernel uses the source L3 address to select the
+		 * interface for egress encapsulated traffic, so
+		 * it must be specified in the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 source address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV6 item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_ipv6(const struct rte_flow_item *item,
+				   struct rte_flow_error *error)
+{
+	const struct rte_flow_item_ipv6 *spec = item->spec;
+	const struct rte_flow_item_ipv6 *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for L3 addresses cannot be empty
+		 * because it is required by tunnel_key parameter.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL outer L3 address specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_ipv6_mask;
+	if (memcmp(&mask->hdr.dst_addr,
+		   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
+		   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
+		if (memcmp(&mask->hdr.dst_addr,
+		   &rte_flow_item_ipv6_mask.hdr.dst_addr,
+		   sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv6.hdr.dst_addr\" field");
+		/* More L3 address validations can be put here. */
+	} else {
+		/*
+		 * Kernel uses the destination L3 address to determine
+		 * the routing path and obtain the L2 destination
+		 * address (neighbor or gateway), so L3 destination
+		 * address must be specified within the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	if (memcmp(&mask->hdr.src_addr,
+		   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
+		   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
+		if (memcmp(&mask->hdr.src_addr,
+		   &rte_flow_item_ipv6_mask.hdr.src_addr,
+		   sizeof(rte_flow_item_ipv6_mask.hdr.src_addr)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv6.hdr.src_addr\" field");
+		/* More L3 address validation can be put here. */
+	} else {
+		/*
+		 * Kernel uses the source L3 address to select the
+		 * interface for egress encapsulated traffic, so
+		 * it must be specified in the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 source address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_UDP item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_udp(const struct rte_flow_item *item,
+				  struct rte_flow_error *error)
+{
+	const struct rte_flow_item_udp *spec = item->spec;
+	const struct rte_flow_item_udp *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for UDP ports cannot be empty
+		 * because it is required by tunnel_key parameter.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL UDP port specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_udp_mask;
+	if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
+		if (mask->hdr.dst_port != RTE_BE16(0xffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"udp.hdr.dst_port\" field");
+		if (!spec->hdr.dst_port)
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "zero encap remote UDP port");
+	} else {
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer UDP remote port must be "
+				 "specified for VXLAN encapsulation");
+	}
+	if (mask->hdr.src_port != RTE_BE16(0x0000)) {
+		if (mask->hdr.src_port != RTE_BE16(0xffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"udp.hdr.src_port\" field");
+		DRV_LOG(WARNING,
+			"outer UDP source port cannot be "
+			"forced for VXLAN encapsulation, "
+			"parameter ignored");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_VXLAN item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_vni(const struct rte_flow_item *item,
+				  struct rte_flow_error *error)
+{
+	const struct rte_flow_item_vxlan *spec = item->spec;
+	const struct rte_flow_item_vxlan *mask = item->mask;
+
+	if (!spec)
+		/* Outer VNI is required by tunnel_key parameter. */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL VNI specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_vxlan_mask;
+	if (mask->vni[0] != 0 ||
+	    mask->vni[1] != 0 ||
+	    mask->vni[2] != 0) {
+		if (mask->vni[0] != 0xff ||
+		    mask->vni[1] != 0xff ||
+		    mask->vni[2] != 0xff)
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"vxlan.vni\" field");
+		if (spec->vni[0] == 0 &&
+		    spec->vni[1] == 0 &&
+		    spec->vni[2] == 0)
+			return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ITEM, item,
+					  "VXLAN vni cannot be 0");
+	} else {
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM,
+				 item,
+				 "outer VNI must be specified "
+				 "for VXLAN encapsulation");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action item list for E-Switch.
+ *
+ * @param[in] action
+ *   Pointer to the VXLAN_ENCAP action structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap(const struct rte_flow_action *action,
+			      struct rte_flow_error *error)
+{
+	const struct rte_flow_item *items;
+	int ret;
+	uint32_t item_flags = 0;
+
+	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
+	if (!action->conf)
+		return rte_flow_error_set
+			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
+			 action, "Missing VXLAN tunnel "
+				 "action configuration");
+	items = ((const struct rte_flow_action_vxlan_encap *)
+					action->conf)->definition;
+	if (!items)
+		return rte_flow_error_set
+			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
+			 action, "Missing VXLAN tunnel "
+				 "encapsulation parameters");
+	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
+		switch (items->type) {
+		case RTE_FLOW_ITEM_TYPE_VOID:
+			break;
+		case RTE_FLOW_ITEM_TYPE_ETH:
+			ret = mlx5_flow_validate_item_eth(items, item_flags,
+							  error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_eth(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L2;
+			break;
+		break;
+		case RTE_FLOW_ITEM_TYPE_IPV4:
+			ret = mlx5_flow_validate_item_ipv4(items, item_flags,
+							   error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_ipv4(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
+			break;
+		case RTE_FLOW_ITEM_TYPE_IPV6:
+			ret = mlx5_flow_validate_item_ipv6(items, item_flags,
+							   error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_ipv6(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
+			break;
+		case RTE_FLOW_ITEM_TYPE_UDP:
+			ret = mlx5_flow_validate_item_udp(items, item_flags,
+							   0xFF, error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_udp(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
+			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			ret = mlx5_flow_validate_item_vxlan(items,
+							    item_flags, error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_vni(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_VXLAN;
+			break;
+		default:
+			return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_ITEM, items,
+					  "VXLAN encap item not supported");
+		}
+	}
+	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L3))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L3 layer found"
+					  " for VXLAN encapsulation");
+	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L4 layer found"
+					  " for VXLAN encapsulation");
+	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no VXLAN VNI found"
+					  " for VXLAN encapsulation");
+	return 0;
+}
+
+/**
+ * Validate VXLAN_DECAP action outer tunnel items for E-Switch.
+ *
+ * @param[in] item_flags
+ *   Mask of provided outer tunnel parameters.
+ * @param[in] ipv4
+ *   Outer IPv4 address item (if any, NULL otherwise).
+ * @param[in] ipv6
+ *   Outer IPv6 address item (if any, NULL otherwise).
+ * @param[in] udp
+ *   Outer UDP layer item (if any, NULL otherwise).
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_decap(uint32_t item_flags,
+			      const struct rte_flow_action *action,
+			      const struct rte_flow_item *ipv4,
+			      const struct rte_flow_item *ipv6,
+			      const struct rte_flow_item *udp,
+			      struct rte_flow_error *error)
+{
+	if (!ipv4 && !ipv6)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L3 layer found"
+					  " for VXLAN decapsulation");
+	if (ipv4) {
+		const struct rte_flow_item_ipv4 *spec = ipv4->spec;
+		const struct rte_flow_item_ipv4 *mask = ipv4->mask;
+
+		if (!spec)
+			/*
+			 * Specification for L3 addresses cannot be empty
+			 * because it is required as decap parameter.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
+				 "NULL outer L3 address specification "
+				 "for VXLAN decapsulation");
+		if (!mask)
+			mask = &rte_flow_item_ipv4_mask;
+		if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
+			if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"ipv4.hdr.dst_addr\" field");
+			/* More L3 address validations can be put here. */
+		} else {
+			/*
+			 * Kernel uses the destination L3 address
+			 * to determine the ingress network interface
+			 * for traffic being decapsulated.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN decapsulation");
+		}
+		/* Source L3 address is optional for decap. */
+		if (mask->hdr.src_addr != RTE_BE32(0x00000000))
+			if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"ipv4.hdr.src_addr\" field");
+	} else {
+		const struct rte_flow_item_ipv6 *spec = ipv6->spec;
+		const struct rte_flow_item_ipv6 *mask = ipv6->mask;
+
+		if (!spec)
+			/*
+			 * Specification for L3 addresses cannot be empty
+			 * because it is required as decap parameter.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
+				 "NULL outer L3 address specification "
+				 "for VXLAN decapsulation");
+		if (!mask)
+			mask = &rte_flow_item_ipv6_mask;
+		if (memcmp(&mask->hdr.dst_addr,
+			   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
+			   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
+			if (memcmp(&mask->hdr.dst_addr,
+				&rte_flow_item_ipv6_mask.hdr.dst_addr,
+				sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
+				return rte_flow_error_set(error, ENOTSUP,
+				       RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				       "no support for partial mask on"
+				       " \"ipv6.hdr.dst_addr\" field");
+			/* More L3 address validations can be put here. */
+		} else {
+			/*
+			 * Kernel uses the destination L3 address
+			 * to determine the ingress network interface
+			 * for traffic being decapsulated.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN decapsulation");
+		}
+		/* Source L3 address is optional for decap. */
+		if (memcmp(&mask->hdr.src_addr,
+			   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
+			   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
+			if (memcmp(&mask->hdr.src_addr,
+				   &rte_flow_item_ipv6_mask.hdr.src_addr,
+				   sizeof(mask->hdr.src_addr)))
+				return rte_flow_error_set(error, ENOTSUP,
+					RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					"no support for partial mask on"
+					" \"ipv6.hdr.src_addr\" field");
+		}
+	}
+	if (!udp) {
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L4 layer found"
+					  " for VXLAN decapsulation");
+	} else {
+		const struct rte_flow_item_udp *spec = udp->spec;
+		const struct rte_flow_item_udp *mask = udp->mask;
+
+		if (!spec)
+			/*
+			 * Specification for UDP ports cannot be empty
+			 * because it is required as decap parameter.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
+					 "NULL UDP port specification "
+					 "for VXLAN decapsulation");
+		if (!mask)
+			mask = &rte_flow_item_udp_mask;
+		if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
+			if (mask->hdr.dst_port != RTE_BE16(0xffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"udp.hdr.dst_port\" field");
+			if (!spec->hdr.dst_port)
+				return rte_flow_error_set(error, EINVAL,
+					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
+					 "zero decap local UDP port");
+		} else {
+			return rte_flow_error_set(error, EINVAL,
+					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
+					 "outer UDP destination port must be "
+					 "specified for VXLAN decapsulation");
+		}
+		if (mask->hdr.src_port != RTE_BE16(0x0000)) {
+			if (mask->hdr.src_port != RTE_BE16(0xffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"udp.hdr.src_port\" field");
+			DRV_LOG(WARNING,
+				"outer UDP source port cannot be "
+				"forced for VXLAN decapsulation, "
+				"parameter ignored");
+		}
+	}
+	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no VXLAN VNI found"
+					  " for VXLAN decapsulation");
+	/* VNI is already validated, extra check can be put here. */
+	return 0;
+}
+/**
  * Validate flow for E-Switch.
  *
  * @param[in] priv
@@ -589,6 +1213,7 @@ struct flow_tcf_ptoi {
 		const struct rte_flow_item_ipv6 *ipv6;
 		const struct rte_flow_item_tcp *tcp;
 		const struct rte_flow_item_udp *udp;
+		const struct rte_flow_item_vxlan *vxlan;
 	} spec, mask;
 	union {
 		const struct rte_flow_action_port_id *port_id;
@@ -597,7 +1222,11 @@ struct flow_tcf_ptoi {
 			of_set_vlan_vid;
 		const struct rte_flow_action_of_set_vlan_pcp *
 			of_set_vlan_pcp;
+		const struct rte_flow_action_vxlan_encap *vxlan_encap;
 	} conf;
+	const struct rte_flow_item *ipv4 = NULL; /* storage to check */
+	const struct rte_flow_item *ipv6 = NULL; /* outer tunnel. */
+	const struct rte_flow_item *udp = NULL;  /* parameters. */
 	uint32_t item_flags = 0;
 	uint32_t action_flags = 0;
 	uint8_t next_protocol = -1;
@@ -724,7 +1353,6 @@ struct flow_tcf_ptoi {
 							   error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
 			mask.ipv4 = flow_tcf_item_mask
 				(items, &rte_flow_item_ipv4_mask,
 				 &flow_tcf_mask_supported.ipv4,
@@ -745,13 +1373,22 @@ struct flow_tcf_ptoi {
 				next_protocol =
 					((const struct rte_flow_item_ipv4 *)
 					 (items->spec))->hdr.next_proto_id;
+			if (item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
+				/*
+				 * Multiple outer items are not allowed as
+				 * tunnel parameters.
+				 */
+				ipv4 = NULL;
+			} else {
+				ipv4 = items;
+				item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
+			}
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV6:
 			ret = mlx5_flow_validate_item_ipv6(items, item_flags,
 							   error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
 			mask.ipv6 = flow_tcf_item_mask
 				(items, &rte_flow_item_ipv6_mask,
 				 &flow_tcf_mask_supported.ipv6,
@@ -772,13 +1409,22 @@ struct flow_tcf_ptoi {
 				next_protocol =
 					((const struct rte_flow_item_ipv6 *)
 					 (items->spec))->hdr.proto;
+			if (item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV6) {
+				/*
+				 * Multiple outer items are not allowed as
+				 * tunnel parameters.
+				 */
+				ipv6 = NULL;
+			} else {
+				ipv6 = items;
+				item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
+			}
 			break;
 		case RTE_FLOW_ITEM_TYPE_UDP:
 			ret = mlx5_flow_validate_item_udp(items, item_flags,
 							  next_protocol, error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
 			mask.udp = flow_tcf_item_mask
 				(items, &rte_flow_item_udp_mask,
 				 &flow_tcf_mask_supported.udp,
@@ -787,13 +1433,18 @@ struct flow_tcf_ptoi {
 				 error);
 			if (!mask.udp)
 				return -rte_errno;
+			if (item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP) {
+				udp = NULL;
+			} else {
+				udp = items;
+				item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
+			}
 			break;
 		case RTE_FLOW_ITEM_TYPE_TCP:
 			ret = mlx5_flow_validate_item_tcp(items, item_flags,
 							  next_protocol, error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
 			mask.tcp = flow_tcf_item_mask
 				(items, &rte_flow_item_tcp_mask,
 				 &flow_tcf_mask_supported.tcp,
@@ -802,6 +1453,31 @@ struct flow_tcf_ptoi {
 				 error);
 			if (!mask.tcp)
 				return -rte_errno;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
+			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			ret = mlx5_flow_validate_item_vxlan(items,
+							    item_flags, error);
+			if (ret < 0)
+				return ret;
+			mask.vxlan = flow_tcf_item_mask
+				(items, &rte_flow_item_vxlan_mask,
+				 &flow_tcf_mask_supported.vxlan,
+				 &flow_tcf_mask_empty.vxlan,
+				 sizeof(flow_tcf_mask_supported.vxlan),
+				 error);
+			if (!mask.vxlan)
+				return -rte_errno;
+			if (mask.vxlan->vni[0] != 0xff ||
+			    mask.vxlan->vni[1] != 0xff ||
+			    mask.vxlan->vni[2] != 0xff)
+				return rte_flow_error_set
+					(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
+					 mask.vxlan,
+					 "no support for partial or "
+					 "empty mask on \"vxlan.vni\" field");
+			item_flags |= MLX5_FLOW_LAYER_VXLAN;
 			break;
 		default:
 			return rte_flow_error_set(error, ENOTSUP,
@@ -857,6 +1533,33 @@ struct flow_tcf_ptoi {
 		case RTE_FLOW_ACTION_TYPE_OF_SET_VLAN_PCP:
 			action_flags |= MLX5_ACTION_OF_SET_VLAN_PCP;
 			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
+			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
+					   | MLX5_ACTION_VXLAN_DECAP))
+				return rte_flow_error_set
+					(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ACTION, actions,
+					 "can't have multiple vxlan actions");
+			ret = flow_tcf_validate_vxlan_encap(actions, error);
+			if (ret < 0)
+				return ret;
+			action_flags |= MLX5_ACTION_VXLAN_ENCAP;
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
+			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
+					   | MLX5_ACTION_VXLAN_DECAP))
+				return rte_flow_error_set
+					(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ACTION, actions,
+					 "can't have multiple vxlan actions");
+			ret = flow_tcf_validate_vxlan_decap(item_flags,
+							    actions,
+							    ipv4, ipv6, udp,
+							    error);
+			if (ret < 0)
+				return ret;
+			action_flags |= MLX5_ACTION_VXLAN_DECAP;
+			break;
 		default:
 			return rte_flow_error_set(error, ENOTSUP,
 						  RTE_FLOW_ERROR_TYPE_ACTION,
@@ -864,6 +1567,12 @@ struct flow_tcf_ptoi {
 						  "action not supported");
 		}
 	}
+	if ((item_flags & MLX5_FLOW_LAYER_VXLAN) &&
+	    !(action_flags & MLX5_ACTION_VXLAN_DECAP))
+		return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ACTION, NULL,
+					 "VNI pattern should be followed "
+					 "by VXLAN_DECAP action");
 	return 0;
 }
 
-- 
1.8.3.1


* [dpdk-dev] [PATCH 4/5] net/mlx5: e-switch VXLAN flow translation routine
  2018-10-02  6:30 [dpdk-dev] [PATCH 1/5] net/mlx5: add VXLAN encap/decap support for e-switch Slava Ovsiienko
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 2/5] net/mlx5: e-switch VXLAN netlink routines update Slava Ovsiienko
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 3/5] net/mlx5: e-switch VXLAN flow validation routine Slava Ovsiienko
@ 2018-10-02  6:30 ` Slava Ovsiienko
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 5/5] net/mlx5: e-switch VXLAN tunnel devices management Slava Ovsiienko
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
  4 siblings, 0 replies; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-02  6:30 UTC (permalink / raw)
  To: dev; +Cc: Shahaf Shuler, Slava Ovsiienko

This part of the patchset adds support for VXLAN-related items and
actions to the flow translation routine. If any of them are
specified in the rule, extra space for the tunnel description
structure is allocated. Other tunnel types besides VXLAN (e.g. GRE)
can be added later. No VTEP devices are created at this point;
the flow rule is only translated, not yet applied.
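
The translation routine first estimates how much Netlink message space
the tunnel_key parameters need. A minimal sketch of that arithmetic
follows, assuming the standard Netlink attribute layout (4-byte header,
payload padded to a 4-byte boundary). The names nla_space and
vxlan_encap_space are hypothetical; the PMD uses its own SZ_NLATTR_*
macros for this.

```c
#include <stddef.h>

/* Netlink attribute header: 16-bit length plus 16-bit type. */
#define SKETCH_NLA_HDRLEN 4u
/* Attributes are padded to a 4-byte boundary. */
#define SKETCH_NLA_ALIGNTO 4u

/* Space taken by one attribute carrying payload_len bytes. */
static size_t
nla_space(size_t payload_len)
{
	size_t len = SKETCH_NLA_HDRLEN + payload_len;

	return (len + SKETCH_NLA_ALIGNTO - 1) &
	       ~(size_t)(SKETCH_NLA_ALIGNTO - 1);
}

/*
 * Space for the encap parameters the patch passes to tc:
 * outer source and destination address, destination UDP port
 * and VNI. MAC addresses take no space because tc does not
 * accept them as tunnel_key parameters.
 */
static size_t
vxlan_encap_space(int is_ipv6)
{
	size_t addr_len = is_ipv6 ? 16 : 4;

	return nla_space(addr_len) * 2 + /* src/dst IP address */
	       nla_space(2) +            /* destination UDP port */
	       nla_space(4);             /* VNI */
}
```

Note that a 2-byte port still occupies 8 bytes in the message because
of the header and padding, which is why sizes are computed per
attribute and not per payload.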

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 671 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 591 insertions(+), 80 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 97451bd..dfffc50 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -1597,7 +1597,7 @@ struct flow_tcf_ptoi {
 
 	size += SZ_NLATTR_STRZ_OF("flower") +
 		SZ_NLATTR_NEST + /* TCA_OPTIONS. */
-		SZ_NLATTR_TYPE_OF(uint32_t); /* TCA_CLS_FLAGS_SKIP_SW. */
+		SZ_NLATTR_TYPE_OF_UINT32; /* TCA_CLS_FLAGS_SKIP_SW. */
 	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
 		switch (items->type) {
 		case RTE_FLOW_ITEM_TYPE_VOID:
@@ -1605,45 +1605,49 @@ struct flow_tcf_ptoi {
 		case RTE_FLOW_ITEM_TYPE_PORT_ID:
 			break;
 		case RTE_FLOW_ITEM_TYPE_ETH:
-			size += SZ_NLATTR_TYPE_OF(uint16_t) + /* Ether type. */
+			size += SZ_NLATTR_TYPE_OF_UINT16 + /* Ether type. */
 				SZ_NLATTR_DATA_OF(ETHER_ADDR_LEN) * 4;
 				/* dst/src MAC addr and mask. */
 			flags |= MLX5_FLOW_LAYER_OUTER_L2;
 			break;
 		case RTE_FLOW_ITEM_TYPE_VLAN:
-			size += SZ_NLATTR_TYPE_OF(uint16_t) + /* Ether type. */
-				SZ_NLATTR_TYPE_OF(uint16_t) +
+			size += SZ_NLATTR_TYPE_OF_UINT16 + /* Ether type. */
+				SZ_NLATTR_TYPE_OF_UINT16 +
 				/* VLAN Ether type. */
-				SZ_NLATTR_TYPE_OF(uint8_t) + /* VLAN prio. */
-				SZ_NLATTR_TYPE_OF(uint16_t); /* VLAN ID. */
+				SZ_NLATTR_TYPE_OF_UINT8 + /* VLAN prio. */
+				SZ_NLATTR_TYPE_OF_UINT16; /* VLAN ID. */
 			flags |= MLX5_FLOW_LAYER_OUTER_VLAN;
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV4:
-			size += SZ_NLATTR_TYPE_OF(uint16_t) + /* Ether type. */
-				SZ_NLATTR_TYPE_OF(uint8_t) + /* IP proto. */
-				SZ_NLATTR_TYPE_OF(uint32_t) * 4;
+			size += SZ_NLATTR_TYPE_OF_UINT16 + /* Ether type. */
+				SZ_NLATTR_TYPE_OF_UINT8 + /* IP proto. */
+				SZ_NLATTR_TYPE_OF_UINT32 * 4;
 				/* dst/src IP addr and mask. */
 			flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV6:
-			size += SZ_NLATTR_TYPE_OF(uint16_t) + /* Ether type. */
-				SZ_NLATTR_TYPE_OF(uint8_t) + /* IP proto. */
+			size += SZ_NLATTR_TYPE_OF_UINT16 + /* Ether type. */
+				SZ_NLATTR_TYPE_OF_UINT8 + /* IP proto. */
 				SZ_NLATTR_TYPE_OF(IPV6_ADDR_LEN) * 4;
 				/* dst/src IP addr and mask. */
 			flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
 			break;
 		case RTE_FLOW_ITEM_TYPE_UDP:
-			size += SZ_NLATTR_TYPE_OF(uint8_t) + /* IP proto. */
-				SZ_NLATTR_TYPE_OF(uint16_t) * 4;
+			size += SZ_NLATTR_TYPE_OF_UINT8 + /* IP proto. */
+				SZ_NLATTR_TYPE_OF_UINT16 * 4;
 				/* dst/src port and mask. */
 			flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
 			break;
 		case RTE_FLOW_ITEM_TYPE_TCP:
-			size += SZ_NLATTR_TYPE_OF(uint8_t) + /* IP proto. */
-				SZ_NLATTR_TYPE_OF(uint16_t) * 4;
+			size += SZ_NLATTR_TYPE_OF_UINT8 + /* IP proto. */
+				SZ_NLATTR_TYPE_OF_UINT16 * 4;
 				/* dst/src port and mask. */
 			flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
 			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			size += SZ_NLATTR_TYPE_OF_UINT32;
+			flags |= MLX5_FLOW_LAYER_VXLAN;
+			break;
 		default:
 			DRV_LOG(WARNING,
 				"unsupported item %p type %d,"
@@ -1657,6 +1661,265 @@ struct flow_tcf_ptoi {
 }
 
 /**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_ETH entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the MAC address fields
+ * in the encapsulation parameters structure. The item must be prevalidated;
+ * the function performs no validation checks.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_ETH entry specification.
+ * @param[in] mask
+ *   RTE_FLOW_ITEM_TYPE_ETH entry mask.
+ * @param[out] encap
+ *   Structure to fill with the gathered MAC address data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_eth(const struct rte_flow_item_eth *spec,
+			       const struct rte_flow_item_eth *mask,
+			       struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	if (!mask || !memcmp(&mask->dst,
+			     &rte_flow_item_eth_mask.dst,
+			     sizeof(rte_flow_item_eth_mask.dst))) {
+		/*
+		 * Ethernet addresses are not supported by
+		 * tc as tunnel_key parameters. Destination
+		 * address is needed to form encap packet
+		 * header and retrieved by kernel from
+		 * implicit sources (ARP table, etc),
+		 * address masks are not supported at all.
+		 */
+		encap->eth.dst = spec->dst;
+		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_DST;
+	}
+	if (!mask || !memcmp(&mask->src,
+			     &rte_flow_item_eth_mask.src,
+			     sizeof(rte_flow_item_eth_mask.src))) {
+		/*
+		 * Ethernet addresses are not supported by
+		 * tc as tunnel_key parameters. Source ethernet
+		 * address is ignored anyway.
+		 */
+		encap->eth.src = spec->src;
+		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_SRC;
+	}
+	/*
+	 * No space allocated for ethernet addresses within Netlink
+	 * message tunnel_key record - these ones are not
+	 * supported by tc.
+	 */
+	return 0;
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_IPV4 entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV4 address fields
+ * in the encapsulation parameters structure. The item must be prevalidated;
+ * the function performs no validation checks.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_IPV4 entry specification.
+ * @param[out] encap
+ *   Structure to fill with the gathered IPV4 address data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_ipv4(const struct rte_flow_item_ipv4 *spec,
+				struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	encap->ipv4.dst = spec->hdr.dst_addr;
+	encap->ipv4.src = spec->hdr.src_addr;
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV4_SRC |
+		       MLX5_FLOW_TCF_ENCAP_IPV4_DST;
+	return SZ_NLATTR_TYPE_OF_UINT32 * 2;
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_IPV6 entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV6 address fields
+ * in the encapsulation parameters structure. The item must be prevalidated,
+ * no validation checks are performed by the function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_IPV6 entry specification.
+ * @param[out] encap
+ *   Structure to fill the gathered IPV6 address data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_ipv6(const struct rte_flow_item_ipv6 *spec,
+				struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	memcpy(encap->ipv6.dst, spec->hdr.dst_addr, sizeof(encap->ipv6.dst));
+	memcpy(encap->ipv6.src, spec->hdr.src_addr, sizeof(encap->ipv6.src));
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV6_SRC |
+		       MLX5_FLOW_TCF_ENCAP_IPV6_DST;
+	return SZ_NLATTR_TYPE_OF(IPV6_ADDR_LEN) * 2;
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_UDP entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the UDP port fields
+ * in the encapsulation parameters structure. The item must be prevalidated,
+ * no validation checks are performed by the function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_UDP entry specification.
+ * @param[in] mask
+ *   RTE_FLOW_ITEM_TYPE_UDP entry mask.
+ * @param[out] encap
+ *   Structure to fill the gathered UDP port data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_udp(const struct rte_flow_item_udp *spec,
+			       const struct rte_flow_item_udp *mask,
+			       struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	int size = SZ_NLATTR_TYPE_OF_UINT16;
+
+	assert(spec);
+	encap->udp.dst = spec->hdr.dst_port;
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_UDP_DST;
+	if (!mask || mask->hdr.src_port != RTE_BE16(0x0000)) {
+		encap->udp.src = spec->hdr.src_port;
+		size += SZ_NLATTR_TYPE_OF_UINT16;
+		encap->mask |= MLX5_FLOW_TCF_ENCAP_UDP_SRC;
+	}
+	return size;
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_VXLAN entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the VNI fields
+ * in the encapsulation parameters structure. The item must be prevalidated,
+ * no validation checks are performed by the function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_VXLAN entry specification.
+ * @param[out] encap
+ *   Structure to fill the gathered VNI address data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_vni(const struct rte_flow_item_vxlan *spec,
+			       struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	memcpy(encap->vxlan.vni, spec->vni, sizeof(encap->vxlan.vni));
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_VXLAN_VNI;
+	return SZ_NLATTR_TYPE_OF_UINT32;
+}
+
+/**
+ * Populate consolidated encapsulation object from list of pattern items.
+ *
+ * Helper function to process configuration of action such as
+ * RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. The item list should be
+ * validated, there is no way to return a meaningful error.
+ *
+ * @param[in] action
+ *   RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP action object.
+ *   List of pattern items to gather data from.
+ * @param[out] encap
+ *   Structure to fill gathered data.
+ *
+ * @return
+ *   The size of the Netlink message buffer part needed to store the
+ *   item attributes on success, zero otherwise. The mask field in the
+ *   result structure reflects the correctly parsed items.
+ */
+static int
+flow_tcf_vxlan_encap_parse(const struct rte_flow_action *action,
+			   struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	union {
+		const struct rte_flow_item_eth *eth;
+		const struct rte_flow_item_ipv4 *ipv4;
+		const struct rte_flow_item_ipv6 *ipv6;
+		const struct rte_flow_item_udp *udp;
+		const struct rte_flow_item_vxlan *vxlan;
+	} spec, mask;
+	const struct rte_flow_item *items;
+	int size = 0;
+
+	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
+	assert(action->conf);
+
+	items = ((const struct rte_flow_action_vxlan_encap *)
+					action->conf)->definition;
+	assert(items);
+	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
+		switch (items->type) {
+		case RTE_FLOW_ITEM_TYPE_VOID:
+			break;
+		case RTE_FLOW_ITEM_TYPE_ETH:
+			mask.eth = items->mask;
+			spec.eth = items->spec;
+			size += flow_tcf_parse_vxlan_encap_eth(spec.eth,
+							       mask.eth,
+							       encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_IPV4:
+			spec.ipv4 = items->spec;
+			size += flow_tcf_parse_vxlan_encap_ipv4(spec.ipv4,
+								encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_IPV6:
+			spec.ipv6 = items->spec;
+			size += flow_tcf_parse_vxlan_encap_ipv6(spec.ipv6,
+								encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_UDP:
+			mask.udp = items->mask;
+			spec.udp = items->spec;
+			size += flow_tcf_parse_vxlan_encap_udp(spec.udp,
+							       mask.udp,
+							       encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			spec.vxlan = items->spec;
+			size += flow_tcf_parse_vxlan_encap_vni(spec.vxlan,
+							       encap);
+			break;
+		default:
+			assert(false);
+			DRV_LOG(WARNING,
+				"unsupported item %p type %d,"
+				" items must be validated"
+				" before flow creation",
+				(const void *)items, items->type);
+			encap->mask = 0;
+			return 0;
+		}
+	}
+	return size;
+}
+
+/**
  * Calculate maximum size of memory for flow actions of Linux TC flower and
  * extract specified actions.
  *
@@ -1664,13 +1927,16 @@ struct flow_tcf_ptoi {
  *   Pointer to the list of actions.
  * @param[out] action_flags
  *   Pointer to the detected actions.
+ * @param[out] tunnel
+ *   Pointer to tunnel encapsulation parameters structure to fill.
  *
  * @return
  *   Maximum size of memory for actions.
  */
 static int
 flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
-			      uint64_t *action_flags)
+			      uint64_t *action_flags,
+			      void *tunnel)
 {
 	int size = 0;
 	uint64_t flags = 0;
@@ -1684,14 +1950,14 @@ struct flow_tcf_ptoi {
 			size += SZ_NLATTR_NEST + /* na_act_index. */
 				SZ_NLATTR_STRZ_OF("mirred") +
 				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
-				SZ_NLATTR_TYPE_OF(struct tc_mirred);
+				SZ_NLATTR_TYPE_OF_STRUCT(tc_mirred);
 			flags |= MLX5_ACTION_PORT_ID;
 			break;
 		case RTE_FLOW_ACTION_TYPE_DROP:
 			size += SZ_NLATTR_NEST + /* na_act_index. */
 				SZ_NLATTR_STRZ_OF("gact") +
 				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
-				SZ_NLATTR_TYPE_OF(struct tc_gact);
+				SZ_NLATTR_TYPE_OF_STRUCT(tc_gact);
 			flags |= MLX5_ACTION_DROP;
 			break;
 		case RTE_FLOW_ACTION_TYPE_OF_POP_VLAN:
@@ -1710,11 +1976,34 @@ struct flow_tcf_ptoi {
 			size += SZ_NLATTR_NEST + /* na_act_index. */
 				SZ_NLATTR_STRZ_OF("vlan") +
 				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
-				SZ_NLATTR_TYPE_OF(struct tc_vlan) +
-				SZ_NLATTR_TYPE_OF(uint16_t) +
+				SZ_NLATTR_TYPE_OF_STRUCT(tc_vlan) +
+				SZ_NLATTR_TYPE_OF_UINT16 +
 				/* VLAN protocol. */
-				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID. */
-				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio. */
+				SZ_NLATTR_TYPE_OF_UINT16 + /* VLAN ID. */
+				SZ_NLATTR_TYPE_OF_UINT8; /* VLAN prio. */
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
+			size += SZ_NLATTR_NEST + /* na_act_index. */
+				SZ_NLATTR_STRZ_OF("tunnel_key") +
+				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
+				SZ_NLATTR_TYPE_OF_UINT8 + /* no UDP sum */
+				SZ_NLATTR_TYPE_OF_STRUCT(tc_tunnel_key) +
+				flow_tcf_vxlan_encap_parse(actions, tunnel) +
+				RTE_ALIGN_CEIL /* preceding encap params. */
+				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
+				MNL_ALIGNTO);
+			flags |= MLX5_ACTION_VXLAN_ENCAP;
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
+			size += SZ_NLATTR_NEST + /* na_act_index. */
+				SZ_NLATTR_STRZ_OF("tunnel_key") +
+				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
+				SZ_NLATTR_TYPE_OF_UINT8 + /* no UDP sum */
+				SZ_NLATTR_TYPE_OF_STRUCT(tc_tunnel_key) +
+				RTE_ALIGN_CEIL /* preceding decap params. */
+				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
+				MNL_ALIGNTO);
+			flags |= MLX5_ACTION_VXLAN_DECAP;
 			break;
 		default:
 			DRV_LOG(WARNING,
@@ -1750,6 +2039,26 @@ struct flow_tcf_ptoi {
 }
 
 /**
+ * Convert VXLAN VNI to 32-bit integer.
+ *
+ * @param[in] vni
+ *   VXLAN VNI in 24-bit wire format.
+ *
+ * @return
+ *   VXLAN VNI as a 32-bit integer value in network endian.
+ */
+static rte_be32_t
+vxlan_vni_as_be32(const uint8_t vni[3])
+{
+	rte_be32_t ret;
+
+	ret = vni[0];
+	ret = (ret << 8) | vni[1];
+	ret = (ret << 8) | vni[2];
+	return RTE_BE32(ret);
+}
+
+/**
  * Prepare a flow object for Linux TC flower. It calculates the maximum size of
  * memory required, allocates the memory, initializes Netlink message headers
  * and set unique TC message handle.
@@ -1784,22 +2093,54 @@ struct flow_tcf_ptoi {
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
 	struct tcmsg *tcm;
+	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
+	uint8_t *sp, *tun = NULL;
 
 	size += flow_tcf_get_items_and_size(items, item_flags);
-	size += flow_tcf_get_actions_and_size(actions, action_flags);
-	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
+	size += flow_tcf_get_actions_and_size(actions, action_flags, &encap);
+	dev_flow = rte_zmalloc(__func__, size,
+			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
+				(size_t)MNL_ALIGNTO));
 	if (!dev_flow) {
 		rte_flow_error_set(error, ENOMEM,
 				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
 				   "not enough memory to create E-Switch flow");
 		return NULL;
 	}
-	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
+	sp = (uint8_t *)(dev_flow + 1);
+	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
+		tun = sp;
+		sp += RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
+			MNL_ALIGNTO);
+		size -= RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
+			MNL_ALIGNTO);
+		encap.hdr.type = MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
+		memcpy(tun, &encap,
+		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
+	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
+		tun = sp;
+		sp += RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
+			MNL_ALIGNTO);
+		size -= RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
+			MNL_ALIGNTO);
+		encap.hdr.type = MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
+		memcpy(tun, &encap,
+		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
+	}
+	nlh = mnl_nlmsg_put_header(sp);
 	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
 	*dev_flow = (struct mlx5_flow){
 		.tcf = (struct mlx5_flow_tcf){
+			.nlsize = size,
 			.nlh = nlh,
 			.tcm = tcm,
+			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
+			.item_flags = *item_flags,
+			.action_flags = *action_flags,
 		},
 	};
 	/*
@@ -1853,6 +2194,7 @@ struct flow_tcf_ptoi {
 		const struct rte_flow_item_ipv6 *ipv6;
 		const struct rte_flow_item_tcp *tcp;
 		const struct rte_flow_item_udp *udp;
+		const struct rte_flow_item_vxlan *vxlan;
 	} spec, mask;
 	union {
 		const struct rte_flow_action_port_id *port_id;
@@ -1862,6 +2204,14 @@ struct flow_tcf_ptoi {
 		const struct rte_flow_action_of_set_vlan_pcp *
 			of_set_vlan_pcp;
 	} conf;
+	union {
+		struct mlx5_flow_tcf_tunnel_hdr *hdr;
+		struct mlx5_flow_tcf_vxlan_decap *vxlan;
+	} decap;
+	union {
+		struct mlx5_flow_tcf_tunnel_hdr *hdr;
+		struct mlx5_flow_tcf_vxlan_encap *vxlan;
+	} encap;
 	struct flow_tcf_ptoi ptoi[PTOI_TABLE_SZ_MAX(dev)];
 	struct nlmsghdr *nlh = dev_flow->tcf.nlh;
 	struct tcmsg *tcm = dev_flow->tcf.tcm;
@@ -1877,6 +2227,12 @@ struct flow_tcf_ptoi {
 
 	claim_nonzero(flow_tcf_build_ptoi_table(dev, ptoi,
 						PTOI_TABLE_SZ_MAX(dev)));
+	encap.hdr = NULL;
+	decap.hdr = NULL;
+	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_ENCAP)
+		encap.vxlan = dev_flow->tcf.vxlan_encap;
+	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_DECAP)
+		decap.vxlan = dev_flow->tcf.vxlan_decap;
 	nlh = dev_flow->tcf.nlh;
 	tcm = dev_flow->tcf.tcm;
 	/* Prepare API must have been called beforehand. */
@@ -1892,7 +2248,6 @@ struct flow_tcf_ptoi {
 				  RTE_BE16(ETH_P_ALL));
 	mnl_attr_put_strz(nlh, TCA_KIND, "flower");
 	na_flower = mnl_attr_nest_start(nlh, TCA_OPTIONS);
-	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS, TCA_CLS_FLAGS_SKIP_SW);
 	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
 		unsigned int i;
 
@@ -1935,6 +2290,12 @@ struct flow_tcf_ptoi {
 						 spec.eth->type);
 				eth_type_set = 1;
 			}
+			/*
+			 * Send L2 addresses/masks anyway, including
+			 * VXLAN encap/decap cases: the kernel sometimes
+			 * returns an error if no L2 address is provided
+			 * and the skip_sw flag is set.
+			 */
 			if (!is_zero_ether_addr(&mask.eth->dst)) {
 				mnl_attr_put(nlh, TCA_FLOWER_KEY_ETH_DST,
 					     ETHER_ADDR_LEN,
@@ -1951,8 +2312,19 @@ struct flow_tcf_ptoi {
 					     ETHER_ADDR_LEN,
 					     mask.eth->src.addr_bytes);
 			}
+			if (decap.hdr) {
+				DRV_LOG(WARNING,
+				"ethernet addresses are treated "
+				"as inner ones for tunnel decapsulation");
+			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_VLAN:
+			if (encap.hdr || decap.hdr)
+				return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_ITEM, NULL,
+					  "outer VLAN is not "
+					  "supported for tunnels");
 			mask.vlan = flow_tcf_item_mask
 				(items, &rte_flow_item_vlan_mask,
 				 &flow_tcf_mask_supported.vlan,
@@ -1983,6 +2355,7 @@ struct flow_tcf_ptoi {
 						 rte_be_to_cpu_16
 						 (spec.vlan->tci &
 						  RTE_BE16(0x0fff)));
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV4:
 			mask.ipv4 = flow_tcf_item_mask
@@ -1992,36 +2365,53 @@ struct flow_tcf_ptoi {
 				 sizeof(flow_tcf_mask_supported.ipv4),
 				 error);
 			assert(mask.ipv4);
-			if (!eth_type_set || !vlan_eth_type_set)
-				mnl_attr_put_u16(nlh,
-						 vlan_present ?
-						 TCA_FLOWER_KEY_VLAN_ETH_TYPE :
-						 TCA_FLOWER_KEY_ETH_TYPE,
-						 RTE_BE16(ETH_P_IP));
-			eth_type_set = 1;
-			vlan_eth_type_set = 1;
-			if (mask.ipv4 == &flow_tcf_mask_empty.ipv4)
-				break;
 			spec.ipv4 = items->spec;
-			if (mask.ipv4->hdr.next_proto_id) {
-				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
+			if (!decap.vxlan) {
+				if (!eth_type_set || !vlan_eth_type_set) {
+					mnl_attr_put_u16(nlh,
+						vlan_present ?
+						TCA_FLOWER_KEY_VLAN_ETH_TYPE :
+						TCA_FLOWER_KEY_ETH_TYPE,
+						RTE_BE16(ETH_P_IP));
+				}
+				eth_type_set = 1;
+				vlan_eth_type_set = 1;
+				if (mask.ipv4 == &flow_tcf_mask_empty.ipv4)
+					break;
+				if (mask.ipv4->hdr.next_proto_id) {
+					mnl_attr_put_u8
+						(nlh, TCA_FLOWER_KEY_IP_PROTO,
 						spec.ipv4->hdr.next_proto_id);
-				ip_proto_set = 1;
+					ip_proto_set = 1;
+				}
+			} else {
+				assert(mask.ipv4 != &flow_tcf_mask_empty.ipv4);
 			}
 			if (mask.ipv4->hdr.src_addr) {
-				mnl_attr_put_u32(nlh, TCA_FLOWER_KEY_IPV4_SRC,
-						 spec.ipv4->hdr.src_addr);
-				mnl_attr_put_u32(nlh,
-						 TCA_FLOWER_KEY_IPV4_SRC_MASK,
-						 mask.ipv4->hdr.src_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_SRC :
+					 TCA_FLOWER_KEY_IPV4_SRC,
+					 spec.ipv4->hdr.src_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK :
+					 TCA_FLOWER_KEY_IPV4_SRC_MASK,
+					 mask.ipv4->hdr.src_addr);
 			}
 			if (mask.ipv4->hdr.dst_addr) {
-				mnl_attr_put_u32(nlh, TCA_FLOWER_KEY_IPV4_DST,
-						 spec.ipv4->hdr.dst_addr);
-				mnl_attr_put_u32(nlh,
-						 TCA_FLOWER_KEY_IPV4_DST_MASK,
-						 mask.ipv4->hdr.dst_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_DST :
+					 TCA_FLOWER_KEY_IPV4_DST,
+					 spec.ipv4->hdr.dst_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_DST_MASK :
+					 TCA_FLOWER_KEY_IPV4_DST_MASK,
+					 mask.ipv4->hdr.dst_addr);
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV6:
 			mask.ipv6 = flow_tcf_item_mask
@@ -2031,38 +2421,53 @@ struct flow_tcf_ptoi {
 				 sizeof(flow_tcf_mask_supported.ipv6),
 				 error);
 			assert(mask.ipv6);
-			if (!eth_type_set || !vlan_eth_type_set)
-				mnl_attr_put_u16(nlh,
+			spec.ipv6 = items->spec;
+			if (!decap.vxlan) {
+				if (!eth_type_set || !vlan_eth_type_set) {
+					mnl_attr_put_u16(nlh,
 						 vlan_present ?
 						 TCA_FLOWER_KEY_VLAN_ETH_TYPE :
 						 TCA_FLOWER_KEY_ETH_TYPE,
 						 RTE_BE16(ETH_P_IPV6));
-			eth_type_set = 1;
-			vlan_eth_type_set = 1;
-			if (mask.ipv6 == &flow_tcf_mask_empty.ipv6)
-				break;
-			spec.ipv6 = items->spec;
-			if (mask.ipv6->hdr.proto) {
-				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
-						spec.ipv6->hdr.proto);
-				ip_proto_set = 1;
+				}
+				eth_type_set = 1;
+				vlan_eth_type_set = 1;
+				if (mask.ipv6 == &flow_tcf_mask_empty.ipv6)
+					break;
+				if (mask.ipv6->hdr.proto) {
+					mnl_attr_put_u8
+						(nlh, TCA_FLOWER_KEY_IP_PROTO,
+						 spec.ipv6->hdr.proto);
+					ip_proto_set = 1;
+				}
+			} else {
+				assert(mask.ipv6 != &flow_tcf_mask_empty.ipv6);
 			}
 			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6->hdr.src_addr)) {
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_SRC,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_SRC :
+					     TCA_FLOWER_KEY_IPV6_SRC,
 					     sizeof(spec.ipv6->hdr.src_addr),
 					     spec.ipv6->hdr.src_addr);
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_SRC_MASK,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK :
+					     TCA_FLOWER_KEY_IPV6_SRC_MASK,
 					     sizeof(mask.ipv6->hdr.src_addr),
 					     mask.ipv6->hdr.src_addr);
 			}
 			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6->hdr.dst_addr)) {
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_DST,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_DST :
+					     TCA_FLOWER_KEY_IPV6_DST,
 					     sizeof(spec.ipv6->hdr.dst_addr),
 					     spec.ipv6->hdr.dst_addr);
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_DST_MASK,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_DST_MASK :
+					     TCA_FLOWER_KEY_IPV6_DST_MASK,
 					     sizeof(mask.ipv6->hdr.dst_addr),
 					     mask.ipv6->hdr.dst_addr);
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_UDP:
 			mask.udp = flow_tcf_item_mask
@@ -2072,27 +2477,45 @@ struct flow_tcf_ptoi {
 				 sizeof(flow_tcf_mask_supported.udp),
 				 error);
 			assert(mask.udp);
-			if (!ip_proto_set)
-				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
-						IPPROTO_UDP);
-			if (mask.udp == &flow_tcf_mask_empty.udp)
-				break;
 			spec.udp = items->spec;
+			if (!decap.vxlan) {
+				if (!ip_proto_set)
+					mnl_attr_put_u8
+						(nlh, TCA_FLOWER_KEY_IP_PROTO,
+						IPPROTO_UDP);
+				if (mask.udp == &flow_tcf_mask_empty.udp)
+					break;
+			} else {
+				assert(mask.udp != &flow_tcf_mask_empty.udp);
+				decap.vxlan->udp_port
+					= RTE_BE16(spec.udp->hdr.dst_port);
+			}
 			if (mask.udp->hdr.src_port) {
-				mnl_attr_put_u16(nlh, TCA_FLOWER_KEY_UDP_SRC,
-						 spec.udp->hdr.src_port);
-				mnl_attr_put_u16(nlh,
-						 TCA_FLOWER_KEY_UDP_SRC_MASK,
-						 mask.udp->hdr.src_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_SRC_PORT :
+					 TCA_FLOWER_KEY_UDP_SRC,
+					 spec.udp->hdr.src_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK :
+					 TCA_FLOWER_KEY_UDP_SRC_MASK,
+					 mask.udp->hdr.src_port);
 			}
 			if (mask.udp->hdr.dst_port) {
-				mnl_attr_put_u16(nlh, TCA_FLOWER_KEY_UDP_DST,
-						 spec.udp->hdr.dst_port);
-				mnl_attr_put_u16(nlh,
-						 TCA_FLOWER_KEY_UDP_DST_MASK,
-						 mask.udp->hdr.dst_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_DST_PORT :
+					 TCA_FLOWER_KEY_UDP_DST,
+					 spec.udp->hdr.dst_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK :
+					 TCA_FLOWER_KEY_UDP_DST_MASK,
+					 mask.udp->hdr.dst_port);
 			}
-			break;
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
 		case RTE_FLOW_ITEM_TYPE_TCP:
 			mask.tcp = flow_tcf_item_mask
 				(items, &rte_flow_item_tcp_mask,
@@ -2121,6 +2544,15 @@ struct flow_tcf_ptoi {
 						 TCA_FLOWER_KEY_TCP_DST_MASK,
 						 mask.tcp->hdr.dst_port);
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			assert(decap.vxlan);
+			spec.vxlan = items->spec;
+			mnl_attr_put_u32(nlh,
+					 TCA_FLOWER_KEY_ENC_KEY_ID,
+					 vxlan_vni_as_be32(spec.vxlan->vni));
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		default:
 			return rte_flow_error_set(error, ENOTSUP,
@@ -2154,6 +2586,14 @@ struct flow_tcf_ptoi {
 			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "mirred");
 			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
 			assert(na_act);
+			if (encap.hdr) {
+				assert(dev_flow->tcf.tunnel);
+				dev_flow->tcf.tunnel->ifindex_ptr =
+					&((struct tc_mirred *)
+					mnl_attr_get_payload
+					(mnl_nlmsg_get_payload_tail
+						(nlh)))->ifindex;
+			}
 			mnl_attr_put(nlh, TCA_MIRRED_PARMS,
 				     sizeof(struct tc_mirred),
 				     &(struct tc_mirred){
@@ -2163,6 +2603,7 @@ struct flow_tcf_ptoi {
 				     });
 			mnl_attr_nest_end(nlh, na_act);
 			mnl_attr_nest_end(nlh, na_act_index);
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ACTION_TYPE_DROP:
 			na_act_index =
@@ -2243,6 +2684,74 @@ struct flow_tcf_ptoi {
 					(na_vlan_priority) =
 					conf.of_set_vlan_pcp->vlan_pcp;
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
+			assert(decap.vxlan);
+			assert(dev_flow->tcf.tunnel);
+			dev_flow->tcf.tunnel->ifindex_ptr
+				= (unsigned int *)&tcm->tcm_ifindex;
+			na_act_index =
+				mnl_attr_nest_start(nlh, na_act_index_cur++);
+			assert(na_act_index);
+			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "tunnel_key");
+			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
+			assert(na_act);
+			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
+				sizeof(struct tc_tunnel_key),
+				&(struct tc_tunnel_key){
+					.action = TC_ACT_PIPE,
+					.t_action = TCA_TUNNEL_KEY_ACT_RELEASE,
+					});
+			mnl_attr_nest_end(nlh, na_act);
+			mnl_attr_nest_end(nlh, na_act_index);
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
+			assert(encap.vxlan);
+			na_act_index =
+				mnl_attr_nest_start(nlh, na_act_index_cur++);
+			assert(na_act_index);
+			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "tunnel_key");
+			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
+			assert(na_act);
+			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
+				sizeof(struct tc_tunnel_key),
+				&(struct tc_tunnel_key){
+					.action = TC_ACT_PIPE,
+					.t_action = TCA_TUNNEL_KEY_ACT_SET,
+					});
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_UDP_DST)
+				mnl_attr_put_u16(nlh,
+					 TCA_TUNNEL_KEY_ENC_DST_PORT,
+					 encap.vxlan->udp.dst);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
+				mnl_attr_put_u32(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
+					 encap.vxlan->ipv4.src);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST)
+				mnl_attr_put_u32(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
+					 encap.vxlan->ipv4.dst);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
+				mnl_attr_put(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
+					 sizeof(encap.vxlan->ipv6.src),
+					 &encap.vxlan->ipv6.src);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST)
+				mnl_attr_put(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
+					 sizeof(encap.vxlan->ipv6.dst),
+					 &encap.vxlan->ipv6.dst);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
+				mnl_attr_put_u32(nlh,
+					 TCA_TUNNEL_KEY_ENC_KEY_ID,
+					 vxlan_vni_as_be32
+						(encap.vxlan->vxlan.vni));
+			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM, 0);
+			mnl_attr_nest_end(nlh, na_act);
+			mnl_attr_nest_end(nlh, na_act_index);
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		default:
 			return rte_flow_error_set(error, ENOTSUP,
@@ -2254,7 +2763,9 @@ struct flow_tcf_ptoi {
 	assert(na_flower);
 	assert(na_flower_act);
 	mnl_attr_nest_end(nlh, na_flower_act);
+	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS, TCA_CLS_FLAGS_SKIP_SW);
 	mnl_attr_nest_end(nlh, na_flower);
+	assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 	return 0;
 }
 
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [dpdk-dev] [PATCH 5/5] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-02  6:30 [dpdk-dev] [PATCH 1/5] net/mlx5: add VXLAN encap/decap support for e-switch Slava Ovsiienko
                   ` (2 preceding siblings ...)
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 4/5] net/mlx5: e-switch VXLAN flow translation routine Slava Ovsiienko
@ 2018-10-02  6:30 ` Slava Ovsiienko
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
  4 siblings, 0 replies; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-02  6:30 UTC (permalink / raw)
  To: dev; +Cc: Shahaf Shuler, Slava Ovsiienko

VXLAN interfaces are dynamically created for each local UDP port
of outer networks and then used as targets for TC "flower" filters
in order to perform encapsulation. These VXLAN interfaces are
system-wide: only one device with a given local UDP port can exist
in the system (an attempt to create another device with the same
local UDP port returns EEXIST), so the PMD must maintain a database
of device instances shared between PMD instances. These implicitly
created VXLAN devices are called VTEPs (Virtual Tunnel End Points).
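
As an illustration of the shared-instance idea only (not the code in
this patch), a reference-counted registry keyed by local UDP port could
be sketched as follows. The names struct vtep, vtep_get() and vtep_put()
are hypothetical, the device creation/deletion through Netlink is left
out, and the locking that a shared list needs is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a VTEP descriptor. */
struct vtep {
	LIST_ENTRY(vtep) next;
	uint16_t port;       /* Local UDP port, host byte order. */
	unsigned int refcnt; /* Number of rules using the device. */
};

static LIST_HEAD(, vtep) vtep_list = LIST_HEAD_INITIALIZER(vtep_list);

/* Look up the descriptor for a port, creating it on first use. */
static struct vtep *
vtep_get(uint16_t port)
{
	struct vtep *v;

	LIST_FOREACH(v, &vtep_list, next)
		if (v->port == port)
			break;
	if (!v) {
		/* A real driver would create the network device here. */
		v = calloc(1, sizeof(*v));
		if (!v)
			return NULL;
		v->port = port;
		LIST_INSERT_HEAD(&vtep_list, v, next);
	}
	v->refcnt++;
	return v;
}

/* Drop one reference; the last user removes the descriptor. */
static void
vtep_put(struct vtep *v)
{
	assert(v && v->refcnt > 0);
	if (--v->refcnt == 0) {
		/* A real driver would delete the network device here. */
		LIST_REMOVE(v, next);
		free(v);
	}
}
```

Two rules that use the same port get the same descriptor back from
vtep_get(), so a second device for that port is never requested and
EEXIST is avoided within the process.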

A VTEP is created at the moment a rule is applied. The link is set
up and the root ingress qdisc is initialized as well. One VTEP is
shared by all encapsulation rules in the DPDK application instance.
For decapsulation, one VTEP is created per unique local UDP port to
accept tunnel traffic. The name of a created VTEP consists of the
prefix "vmlx_" followed by the UDP port number in decimal without
leading zeros (e.g. vmlx_4789). A VTEP can also be created in the
system before the application is launched, which allows UDP ports
to be shared between primary and secondary processes.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 344 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 343 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index dfffc50..0e62fe9 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -1482,7 +1482,7 @@ struct flow_tcf_ptoi {
 		default:
 			return rte_flow_error_set(error, ENOTSUP,
 						  RTE_FLOW_ERROR_TYPE_ITEM,
-						  NULL, "item not supported");
+						  items, "item not supported");
 		}
 	}
 	for (; actions->type != RTE_FLOW_ACTION_TYPE_END; actions++) {
@@ -2886,6 +2886,291 @@ struct flow_tcf_ptoi {
 	return 0;
 }
 
+/* VTEP device list is shared between PMD port instances. */
+static LIST_HEAD(, mlx5_flow_tcf_vtep)
+			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
+static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
+static struct mlx5_flow_tcf_vtep *vtep_encap;
+
+/**
+ * Deletes VTEP network device.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_socket_open().
+ * @param[in] vtep
+ *   Flow tcf object with tunnel device structure to delete.
+ */
+static void
+flow_tcf_delete_iface(struct mlx5_tcf_socket *tcf,
+		      struct mlx5_flow_tcf_vtep *vtep)
+{
+	struct nlmsghdr *nlh;
+	struct ifinfomsg *ifm;
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
+	int ret;
+
+	DRV_LOG(NOTICE, "VTEP delete (%d)", vtep->port);
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = RTM_DELLINK;
+	nlh->nlmsg_flags = NLM_F_REQUEST;
+	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+	ifm->ifi_family = AF_UNSPEC;
+	ifm->ifi_index = vtep->ifindex;
+	ret = flow_tcf_nl_ack(tcf, nlh);
+	if (ret)
+		DRV_LOG(DEBUG, "error deleting VXLAN encap/decap ifindex %u",
+			ifm->ifi_index);
+}
+
+/**
+ * Creates VTEP network device.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_socket_open().
+ * @param[in] port
+ *   UDP port of created VTEP device.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   Pointer to created device structure on success, NULL otherwise
+ *   and rte_errno is set.
+ */
+static struct mlx5_flow_tcf_vtep*
+flow_tcf_create_iface(struct mlx5_tcf_socket *tcf, uint16_t port,
+		      struct rte_flow_error *error)
+{
+	struct mlx5_flow_tcf_vtep *vtep;
+	struct nlmsghdr *nlh;
+	struct ifinfomsg *ifm;
+	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) +
+		       SZ_NLATTR_DATA_OF(sizeof(name)) +
+		       SZ_NLATTR_NEST * 2 +
+		       SZ_NLATTR_STRZ_OF("vxlan") +
+		       SZ_NLATTR_TYPE_OF_UINT32 +
+		       SZ_NLATTR_TYPE_OF_UINT16 +
+		       SZ_NLATTR_TYPE_OF_UINT8 + 128];
+	struct nlattr *na_info;
+	struct nlattr *na_vxlan;
+	rte_be16_t vxlan_port = RTE_BE16(port);
+	int ret;
+
+	vtep = rte_zmalloc(__func__, sizeof(*vtep),
+			alignof(struct mlx5_flow_tcf_vtep));
+	if (!vtep) {
+		rte_flow_error_set
+			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			 NULL, "unable to allocate memory for VTEP desc");
+		return NULL;
+	}
+	*vtep = (struct mlx5_flow_tcf_vtep){
+			.refcnt = 0,
+			.port = port,
+			.notcreated = 0,
+	};
+	memset(buf, 0, sizeof(buf));
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = RTM_NEWLINK;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
+	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+	ifm->ifi_family = AF_UNSPEC;
+	ifm->ifi_type = 0;
+	ifm->ifi_index = 0;
+	ifm->ifi_flags = IFF_UP;
+	ifm->ifi_change = 0xffffffff;
+	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, port);
+	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
+	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
+	assert(na_info);
+	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
+	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
+	assert(na_vxlan);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
+	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
+	mnl_attr_nest_end(nlh, na_vxlan);
+	mnl_attr_nest_end(nlh, na_info);
+	assert(sizeof(buf) >= nlh->nlmsg_len);
+	ret = flow_tcf_nl_ack(tcf, nlh);
+	if (ret) {
+		DRV_LOG(WARNING,
+			"VTEP %s create failure (%d)",
+			name, rte_errno);
+		vtep->notcreated = 1; /* Assume the device exists. */
+	}
+	ret = if_nametoindex(name);
+	if (ret) {
+		vtep->ifindex = ret;
+		memset(buf, 0, sizeof(buf));
+		nlh = mnl_nlmsg_put_header(buf);
+		nlh->nlmsg_type = RTM_NEWLINK;
+		nlh->nlmsg_flags = NLM_F_REQUEST;
+		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+		ifm->ifi_family = AF_UNSPEC;
+		ifm->ifi_type = 0;
+		ifm->ifi_index = vtep->ifindex;
+		ifm->ifi_flags = IFF_UP;
+		ifm->ifi_change = IFF_UP;
+		ret = flow_tcf_nl_ack(tcf, nlh);
+		if (ret) {
+			DRV_LOG(WARNING,
+				"VTEP %s set link up failure (%d)",
+				name, rte_errno);
+			rte_flow_error_set
+				(error, rte_errno,
+				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				 "netlink: failed to set VTEP link up");
+			/* vtep is released by the common cleanup below. */
+		} else {
+			ret = mlx5_flow_tcf_ifindex_init(tcf,
+							 vtep->ifindex, error);
+			if (ret)
+				DRV_LOG(WARNING,
+				"VTEP %s init failure (%d)", name, rte_errno);
+		}
+	} else {
+		DRV_LOG(WARNING,
+			"VTEP %s failed to get index (%d)", name, errno);
+		rte_flow_error_set
+			(error, -errno,
+			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			 vtep->notcreated ? "netlink: failed to create VTEP" :
+			 "netlink: failed to retrieve VTEP ifindex");
+		ret = 1;
+	}
+	if (ret) {
+		if (!vtep->notcreated && vtep->ifindex)
+			flow_tcf_delete_iface(tcf, vtep);
+		rte_free(vtep);
+		vtep = NULL;
+	}
+	DRV_LOG(NOTICE, "VTEP create (%d, %s)", port, vtep ? "OK" : "error");
+	return vtep;
+}
+
+/**
+ * Creates target interface index for tunneling.
+ *
+ * @param tcf
+ *   Context object initialized by mlx5_flow_tcf_socket_open().
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   Interface index on success, zero otherwise and rte_errno is set.
+ */
+static unsigned int
+flow_tcf_tunnel_vtep_create(struct mlx5_tcf_socket *tcf,
+			    struct mlx5_flow *dev_flow,
+			    struct rte_flow_error *error)
+{
+	unsigned int ret;
+
+	assert(dev_flow->tcf.tunnel);
+	pthread_mutex_lock(&vtep_list_mutex);
+	switch (dev_flow->tcf.tunnel->type) {
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
+		if (!vtep_encap) {
+			vtep_encap = flow_tcf_create_iface(tcf,
+				MLX5_VXLAN_DEFAULT_PORT, error);
+			if (!vtep_encap) {
+				ret = 0;
+				break;
+			}
+			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep_encap, next);
+		}
+		vtep_encap->refcnt++;
+		ret = vtep_encap->ifindex;
+		assert(ret);
+		break;
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP: {
+		struct mlx5_flow_tcf_vtep *vtep;
+		uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
+
+		LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
+			if (vtep->port == port)
+				break;
+		}
+		if (!vtep) {
+			vtep = flow_tcf_create_iface(tcf, port, error);
+			if (!vtep) {
+				ret = 0;
+				break;
+			}
+			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
+		}
+		vtep->refcnt++;
+		ret = vtep->ifindex;
+		assert(ret);
+		break;
+	}
+	default:
+		rte_flow_error_set(error, ENOTSUP,
+				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				"unsupported tunnel type");
+		ret = 0;
+		break;
+	}
+	pthread_mutex_unlock(&vtep_list_mutex);
+	return ret;
+}
+
+/**
+ * Deletes tunneling interface by UDP port.
+ *
+ * @param tcf
+ *   Context object initialized by mlx5_flow_tcf_socket_open().
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ */
+static void
+flow_tcf_tunnel_vtep_delete(struct mlx5_tcf_socket *tcf,
+			    struct mlx5_flow *dev_flow)
+{
+	struct mlx5_flow_tcf_vtep *vtep;
+	uint16_t port = MLX5_VXLAN_DEFAULT_PORT;
+
+	assert(dev_flow->tcf.tunnel);
+	pthread_mutex_lock(&vtep_list_mutex);
+	switch (dev_flow->tcf.tunnel->type) {
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
+		port = dev_flow->tcf.vxlan_decap->udp_port;
+		/* Fall through: the lookup below is shared with encap. */
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
+		LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
+			if (vtep->port == port)
+				break;
+		}
+		if (!vtep) {
+			DRV_LOG(WARNING,
+				"No VTEP device found in the list");
+			break;
+		}
+		assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
+		assert(vtep->refcnt);
+		if (vtep->refcnt && --vtep->refcnt)
+			break;
+		if (!vtep->notcreated)
+			flow_tcf_delete_iface(tcf, vtep);
+		LIST_REMOVE(vtep, next);
+		if (vtep_encap == vtep)
+			vtep_encap = NULL;
+		rte_free(vtep);
+		break;
+	default:
+		assert(false);
+		DRV_LOG(WARNING, "Unsupported tunnel type");
+		break;
+	}
+	pthread_mutex_unlock(&vtep_list_mutex);
+}
+
 /**
  * Apply flow to E-Switch by sending Netlink message.
  *
@@ -2917,12 +3202,45 @@ struct flow_tcf_ptoi {
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_NEWTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
+	if (dev_flow->tcf.tunnel) {
+		/*
+		 * Replace the interface index, target for
+		 * encapsulation, source for decapsulation
+		 */
+		assert(!dev_flow->tcf.tunnel->ifindex_tun);
+		assert(dev_flow->tcf.tunnel->ifindex_ptr);
+		/* Create actual VTEP device when rule is being applied. */
+		dev_flow->tcf.tunnel->ifindex_tun
+			= flow_tcf_tunnel_vtep_create(&priv->tcf_socket,
+						      dev_flow, error);
+		DRV_LOG(INFO, "Replace ifindex: %d->%d",
+			dev_flow->tcf.tunnel->ifindex_tun,
+			*dev_flow->tcf.tunnel->ifindex_ptr);
+		if (!dev_flow->tcf.tunnel->ifindex_tun)
+			return -rte_errno;
+		dev_flow->tcf.tunnel->ifindex_org
+			= *dev_flow->tcf.tunnel->ifindex_ptr;
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_tun;
+	}
 	ret = flow_tcf_nl_ack(tcf, nlh);
+	if (dev_flow->tcf.tunnel) {
+		DRV_LOG(INFO, "Restore ifindex: %d->%d",
+				dev_flow->tcf.tunnel->ifindex_org,
+				*dev_flow->tcf.tunnel->ifindex_ptr);
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_org;
+		dev_flow->tcf.tunnel->ifindex_org = 0;
+	}
 	if (!ret) {
 		dev_flow->tcf.applied = 1;
 		return 0;
 	}
 	DRV_LOG(WARNING, "Failed to create TC rule (%d)", rte_errno);
+	if (dev_flow->tcf.tunnel && dev_flow->tcf.tunnel->ifindex_tun) {
+		flow_tcf_tunnel_vtep_delete(&priv->tcf_socket, dev_flow);
+		dev_flow->tcf.tunnel->ifindex_tun = 0;
+	}
 	return rte_flow_error_set(error, rte_errno,
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
 				  "netlink: failed to create TC flow rule");
@@ -2951,10 +3269,34 @@ struct flow_tcf_ptoi {
 		return;
 	/* E-Switch flow can't be expanded. */
 	assert(!LIST_NEXT(dev_flow, next));
+	if (!dev_flow->tcf.applied)
+		return;
+	if (dev_flow->tcf.tunnel) {
+		/*
+		 * Replace the interface index, target for
+		 * encapsulation, source for decapsulation
+		 */
+		assert(dev_flow->tcf.tunnel->ifindex_tun);
+		assert(dev_flow->tcf.tunnel->ifindex_ptr);
+		dev_flow->tcf.tunnel->ifindex_org
+			= *dev_flow->tcf.tunnel->ifindex_ptr;
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_tun;
+	}
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_DELTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST;
 	flow_tcf_nl_ack(tcf, nlh);
+	if (dev_flow->tcf.tunnel) {
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_org;
+		dev_flow->tcf.tunnel->ifindex_org = 0;
+		if (dev_flow->tcf.tunnel->ifindex_tun) {
+			flow_tcf_tunnel_vtep_delete(&priv->tcf_socket,
+						    dev_flow);
+			dev_flow->tcf.tunnel->ifindex_tun = 0;
+		}
+	}
 	dev_flow->tcf.applied = 0;
 }
 
-- 
1.8.3.1


* [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload
  2018-10-02  6:30 [dpdk-dev] [PATCH 1/5] net/mlx5: add VXLAN encap/decap support for e-switch Slava Ovsiienko
                   ` (3 preceding siblings ...)
  2018-10-02  6:30 ` [dpdk-dev] [PATCH 5/5] net/mlx5: e-switch VXLAN tunnel devices management Slava Ovsiienko
@ 2018-10-15 14:13 ` Viacheslav Ovsiienko
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions Viacheslav Ovsiienko
                     ` (7 more replies)
  4 siblings, 8 replies; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

This patchset adds the VXLAN encapsulation/decapsulation hardware
offload feature for E-Switch.
 
A typical use case of tunneling infrastructure is port representors 
in switchdev mode, with VXLAN traffic encapsulation performed on
traffic coming *from* a representor and decapsulation on traffic
going *to* that representor, in order to transparently assign
a given VXLAN to VF traffic.

Since these actions are supported at the E-Switch level, the "transfer" 
attribute must be set on such flow rules. They must also be combined
with a port redirection action to make sense.

Since only ingress is supported, encapsulation flow rules are normally
applied on a physical port and emit traffic to a port representor. 
The opposite order is used for decapsulation.

Like other mlx5 E-Switch flow rule actions, these are implemented
through Linux's TC flower API. Since the Linux interface for VXLAN
encap/decap involves virtual network devices (i.e. ip link add type
vxlan [...]), the PMD creates them dynamically, as needed, through
Netlink calls. These implicitly created VXLAN devices are called
VTEPs (Virtual Tunnel End Points).

For encapsulation, VXLAN interfaces are dynamically created for each
local port of the outer networks and then used as targets for TC
"flower" filters. For decapsulation, a VXLAN device is created for
each unique UDP port. These VXLAN interfaces are system-wide: only
one device with a given UDP port can exist in the system (an attempt
to create another device with the same local UDP port returns
EEXIST), so the PMD should support a device database shared between
PMD instances.
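
The one-device-per-UDP-port rule suggests a reference-counted lookup
table keyed by the port. The following is only a minimal sketch of that
idea, not the PMD code: struct vtep_entry, vtep_acquire() and
vtep_release() are hypothetical names, and the places where a real
implementation would create or delete the netdev are marked by comments.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical, simplified VTEP record; not the PMD's structure. */
struct vtep_entry {
	LIST_ENTRY(vtep_entry) next;
	uint32_t refcnt;	/* Number of flow rules using this VTEP. */
	uint16_t port;		/* Local UDP port, unique system-wide. */
};

LIST_HEAD(vtep_list, vtep_entry);

/* Return the VTEP bound to the port, creating it on first use. */
struct vtep_entry *
vtep_acquire(struct vtep_list *list, uint16_t port)
{
	struct vtep_entry *vtep;

	LIST_FOREACH(vtep, list, next) {
		if (vtep->port == port)
			break;
	}
	if (!vtep) {
		vtep = calloc(1, sizeof(*vtep));
		if (!vtep)
			return NULL;
		vtep->port = port;
		/* A real implementation would create the netdev here. */
		LIST_INSERT_HEAD(list, vtep, next);
	}
	vtep->refcnt++;
	return vtep;
}

/* Drop one reference; return 1 if the VTEP was destroyed, 0 otherwise. */
int
vtep_release(struct vtep_entry *vtep)
{
	assert(vtep->refcnt > 0);
	if (--vtep->refcnt)
		return 0;
	/* A real implementation would delete the netdev here. */
	LIST_REMOVE(vtep, next);
	free(vtep);
	return 1;
}
```

Two rules using the same UDP port share one entry, and the entry goes
away only when the last rule releases it.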

Rule sample considerations:

$PF 		- physical device, outer network
$VF 		- representor for VF, outer/inner network
$VXLAN		- VTEP netdev name
$PF_OUTER_IP 	- $PF IP (v4 or v6) within outer network
$REMOTE_IP 	- remote peer IP (v4 or v6) within outer network
$LOCAL_PORT	- local UDP port
$REMOTE_PORT	- remote UDP port

VXLAN VTEP creation with iproute2 (PMD does the same via Netlink):

- for encapsulation:

ip link add $VXLAN type vxlan dstport $LOCAL_PORT external dev $PF
ip link set dev $VXLAN up
tc qdisc del dev $VXLAN ingress
tc qdisc add dev $VXLAN ingress

$LOCAL_PORT for egress encapsulated traffic is selected automatically
from the available UDP ports in the range 30000-60000. Note that this
is not the source UDP port in the VXLAN header; it is just the UDP
port assigned to the VTEP and has no practical use.
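
As an illustration of the automatic selection, here is a minimal sketch
that picks the lowest port in the range not already taken. The
lowest-free strategy, the helper name vtep_pick_port() and the used[]
array are assumptions made for the example; only the range bounds come
from this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Range bounds as described above (MLX5_VXLAN_PORT_RANGE_MIN/MAX). */
#define VXLAN_PORT_RANGE_MIN 30000
#define VXLAN_PORT_RANGE_MAX 60000

/*
 * Hypothetical helper: return the lowest port in the range that is
 * not listed in used[], or 0 if the whole range is exhausted.
 */
uint16_t
vtep_pick_port(const uint16_t *used, unsigned int n_used)
{
	unsigned int port;
	unsigned int i;

	for (port = VXLAN_PORT_RANGE_MIN;
	     port <= VXLAN_PORT_RANGE_MAX; port++) {
		bool busy = false;

		for (i = 0; i < n_used; i++) {
			if (used[i] == port) {
				busy = true;
				break;
			}
		}
		if (!busy)
			return (uint16_t)port;
	}
	return 0;
}
```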

- for decapsulation:

ip link add $VXLAN type vxlan dstport $LOCAL_PORT external
ip link set dev $VXLAN up
tc qdisc del dev $VXLAN ingress
tc qdisc add dev $VXLAN ingress

$LOCAL_PORT is the UDP port receiving the VXLAN traffic from outer networks.

All ingress UDP traffic with the given UDP destination port from ALL existing
netdevs is routed by the kernel to the $VXLAN net device. While applying the
rule, the kernel checks the IP parameters within the rule, determines the
appropriate underlying PF and tries to set up the rule hardware offload.

VXLAN encapsulation 

VXLAN encap rules are applied to the VF ingress traffic and have the
VTEP as the actual redirection destination instead of the outer PF.
The encapsulation rule should provide:
- redirection action VF->PF
- VF port ID
- some inner network parameters (MACs)
- the tunnel outer source IP (v4/v6) (IS A MUST)
- the tunnel outer destination IP (v4/v6) (IS A MUST)
- VNI - Virtual Network Identifier (IS A MUST)
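
The mandatory items above can be checked against a bitmask of the fields
found in the encapsulation item list. The sketch below is an assumption
about how such a check could look, not the validation code of this
series: the ENCAP_* values mirror the MLX5_FLOW_TCF_ENCAP_* flags defined
in patch 1/7, and encap_has_mandatory_fields() is a hypothetical helper.

```c
#include <stdint.h>

/* Values mirror the MLX5_FLOW_TCF_ENCAP_* flags from patch 1/7. */
#define ENCAP_ETH_SRC   (1u << 0)
#define ENCAP_ETH_DST   (1u << 1)
#define ENCAP_IPV4_SRC  (1u << 2)
#define ENCAP_IPV4_DST  (1u << 3)
#define ENCAP_IPV6_SRC  (1u << 4)
#define ENCAP_IPV6_DST  (1u << 5)
#define ENCAP_UDP_SRC   (1u << 6)
#define ENCAP_UDP_DST   (1u << 7)
#define ENCAP_VXLAN_VNI (1u << 8)

/*
 * Hypothetical helper: return 1 if the mask carries both outer IP
 * addresses of the same family and the VNI, 0 otherwise.
 */
int
encap_has_mandatory_fields(uint32_t mask)
{
	const uint32_t ipv4 = ENCAP_IPV4_SRC | ENCAP_IPV4_DST;
	const uint32_t ipv6 = ENCAP_IPV6_SRC | ENCAP_IPV6_DST;
	int has_ip = (mask & ipv4) == ipv4 || (mask & ipv6) == ipv6;

	return has_ip && (mask & ENCAP_VXLAN_VNI) != 0;
}
```

The MAC flags are deliberately not tested, since the L2 addresses are
optional for the rule.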

VXLAN encapsulation rule sample for tc utility:

tc filter add dev $VF protocol all parent ffff: flower skip_sw \
	action tunnel_key set dst_port $REMOTE_PORT \
	src_ip $PF_OUTER_IP dst_ip $REMOTE_IP id $VNI \
	action mirred egress redirect dev $VXLAN

VXLAN encapsulation rule sample for testpmd:

- Setting up outer properties of VXLAN tunnel:

  set vxlan ip-version ipv4 vni $VNI udp-src $IGNORED udp-dst $REMOTE_PORT
    ip-src $PF_OUTER_IP ip-dst $REMOTE_IP
    eth-src $IGNORED eth-dst $REMOTE_MAC

- Creating a flow rule on port ID 4 performing VXLAN encapsulation with the
  above properties and directing the resulting traffic to port ID 0:

  flow create 4 ingress transfer pattern eth src is $INNER_MAC /
     ipv4 / udp dst is 4789 / end actions vxlan_encap / port_id id 0 / end

No direct way was found to provide the kernel with all the required
encapsulation header parameters. The encapsulation VTEP is created
attached to the outer interface and is assumed to be the default path
for egress encapsulated traffic. The outer tunnel IP addresses are
assigned to the interface using Netlink, and an implicit route is
created like this:

    ip addr add <src_ip> peer <dst_ip> dev <outer> scope link

The peer address option provides the implicit route, and the scope link
attribute reduces the risk of conflicts. At initialization time all
local scope link addresses are flushed from the outer network device.

The destination MAC address is provided via a permanent neigh rule:

   ip neigh add dev <outer> lladdr <dst_mac> to <dst_ip> nud permanent

At initialization time all neigh rules of permanent type are flushed
from the outer network device.

VXLAN decapsulation

VXLAN decap rules are applied to the ingress traffic of the VTEP ($VXLAN)
device instead of the PF.
The decapsulation rule should provide:
- redirection action PF->VF
- VF port ID as redirection destination
- $VXLAN device as ingress traffic source
- the tunnel outer source IP (v4/v6), (optional)
- the tunnel outer destination IP (v4/v6), (IS A MUST)
- the tunnel local UDP port (IS A MUST, PMD looks for appropriate VTEP
  with given local UDP port)
- VNI - Virtual Network Identifier (IS A MUST)

VXLAN decap rule sample for tc utility: 

tc filter add dev $VXLAN protocol all parent ffff: flower skip_sw \
	enc_src_ip $REMOTE_IP enc_dst_ip $PF_OUTER_IP enc_key_id $VNI \
	enc_dst_port $LOCAL_PORT \
	action tunnel_key unset action mirred egress redirect dev $VF

VXLAN decap rule sample for testpmd: 

- Creating a flow on port ID 0 performing VXLAN decapsulation and directing
  the result to port ID 4 with checking inner properties:

  flow create 1 ingress transfer pattern eth src is 66:77:88:99:aa:bb
     dst is 00:11:22:33:44:55 / ipv4 src is $REMOTE_IP dst is $PF_LOCAL_IP /
     udp src is 9999 dst is $LOCAL_PORT / vxlan vni is $VNI / end
     actions vxlan_decap / port_id id 2 / end

The VXLAN encap/decap rule constraints (implied by current kernel support):

- VXLAN decapsulation is provided for the PF->VF direction only
- VXLAN encapsulation is provided for the VF->PF direction only
- the current implementation supports only a non-shared database of VTEPs
  (several DPDK application instances cannot use the same UDP port
  simultaneously)

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

---
v2:
* removed non-VXLAN related parts
* multipart Netlink messages support
* local IP and peer IP rules management
* neigh IP address to MAC address rules management
* rules cleanup at outer device initialization
* attached devices cleanup at outer device initialization

v1:
* http://patches.dpdk.org/patch/45800/
* Refactored code of initial experimental proposal

v0:
* http://patches.dpdk.org/cover/44080/
* Initial proposal by Adrien Mazarguil <adrien.mazarguil@6wind.com>

Viacheslav Ovsiienko (7):
  net/mlx5: e-switch VXLAN configuration and definitions
  net/mlx5: e-switch VXLAN flow validation routine
  net/mlx5: e-switch VXLAN flow translation routine
  net/mlx5: e-switch VXLAN netlink routines update
  net/mlx5: e-switch VXLAN tunnel devices management
  net/mlx5: e-switch VXLAN encapsulation rules management
  net/mlx5: e-switch VXLAN rule cleanup routines

 drivers/net/mlx5/Makefile        |   80 +
 drivers/net/mlx5/meson.build     |   32 +
 drivers/net/mlx5/mlx5_flow.h     |   11 +
 drivers/net/mlx5/mlx5_flow_tcf.c | 3285 +++++++++++++++++++++++++++++++++++---
 4 files changed, 3165 insertions(+), 243 deletions(-)

-- 
1.8.3.1


* [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
@ 2018-10-15 14:13   ` Viacheslav Ovsiienko
  2018-10-23 10:01     ` Yongseok Koh
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine Viacheslav Ovsiienko
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

This part of the patchset adds configuration changes to the Makefile and
meson.build for the Mellanox MLX5 PMD. It also adds the definitions
necessary for VXLAN support and introduces the appropriate data
structures.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/Makefile        |  80 ++++++++++++++++++
 drivers/net/mlx5/meson.build     |  32 +++++++
 drivers/net/mlx5/mlx5_flow.h     |  11 +++
 drivers/net/mlx5/mlx5_flow_tcf.c | 175 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 298 insertions(+)

diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
index 1e9c0b4..fec7779 100644
--- a/drivers/net/mlx5/Makefile
+++ b/drivers/net/mlx5/Makefile
@@ -207,6 +207,11 @@ mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
 		enum IFLA_PHYS_PORT_NAME \
 		$(AUTOCONF_OUTPUT)
 	$Q sh -- '$<' '$@' \
+		HAVE_IFLA_VXLAN_COLLECT_METADATA \
+		linux/if_link.h \
+		enum IFLA_VXLAN_COLLECT_METADATA \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
 		HAVE_TCA_CHAIN \
 		linux/rtnetlink.h \
 		enum TCA_CHAIN \
@@ -367,6 +372,81 @@ mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
 		enum TCA_VLAN_PUSH_VLAN_PRIORITY \
 		$(AUTOCONF_OUTPUT)
 	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_KEY_ID \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_KEY_ID \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_SRC \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_DST \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_SRC \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_DST \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TC_ACT_TUNNEL_KEY \
+		linux/tc_act/tc_tunnel_key.h \
+		define TCA_ACT_TUNNEL_KEY \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT \
+		linux/tc_act/tc_tunnel_key.h \
+		enum TCA_TUNNEL_KEY_ENC_DST_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
 		HAVE_TC_ACT_PEDIT \
 		linux/tc_act/tc_pedit.h \
 		enum TCA_PEDIT_KEY_EX_HDR_TYPE_UDP \
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index c192d44..43aabf2 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -126,6 +126,8 @@ if build
 		'IFLA_PHYS_SWITCH_ID' ],
 		[ 'HAVE_IFLA_PHYS_PORT_NAME', 'linux/if_link.h',
 		'IFLA_PHYS_PORT_NAME' ],
+		[ 'HAVE_IFLA_VXLAN_COLLECT_METADATA', 'linux/if_link.h',
+		'IFLA_VXLAN_COLLECT_METADATA' ],
 		[ 'HAVE_TCA_CHAIN', 'linux/rtnetlink.h',
 		'TCA_CHAIN' ],
 		[ 'HAVE_TCA_FLOWER_ACT', 'linux/pkt_cls.h',
@@ -190,6 +192,36 @@ if build
 		'TC_ACT_GOTO_CHAIN' ],
 		[ 'HAVE_TC_ACT_VLAN', 'linux/tc_act/tc_vlan.h',
 		'TCA_VLAN_PUSH_VLAN_PRIORITY' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_KEY_ID', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_KEY_ID' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_SRC' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_DST' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_DST_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_SRC' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_DST' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_DST_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK' ],
+		[ 'HAVE_TC_ACT_TUNNEL_KEY', 'linux/tc_act/tc_tunnel_key.h',
+		'TCA_ACT_TUNNEL_KEY' ],
+		[ 'HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT', 'linux/tc_act/tc_tunnel_key.h',
+		'TCA_TUNNEL_KEY_ENC_DST_PORT' ],
 		[ 'HAVE_TC_ACT_PEDIT', 'linux/tc_act/tc_pedit.h',
 		'TCA_PEDIT_KEY_EX_HDR_TYPE_UDP' ],
 		[ 'HAVE_RDMA_NL_NLDEV', 'rdma/rdma_netlink.h',
diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index 840d645..b838ab0 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -85,6 +85,8 @@
 #define MLX5_FLOW_ACTION_SET_TP_SRC (1u << 15)
 #define MLX5_FLOW_ACTION_SET_TP_DST (1u << 16)
 #define MLX5_FLOW_ACTION_JUMP (1u << 17)
+#define MLX5_ACTION_VXLAN_ENCAP (1u << 11)
+#define MLX5_ACTION_VXLAN_DECAP (1u << 12)
 
 #define MLX5_FLOW_FATE_ACTIONS \
 	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | MLX5_FLOW_ACTION_RSS)
@@ -182,8 +184,17 @@ struct mlx5_flow_dv {
 struct mlx5_flow_tcf {
 	struct nlmsghdr *nlh;
 	struct tcmsg *tcm;
+	uint32_t nlsize; /**< Size of NL message buffer. */
+	uint32_t applied:1; /**< Whether rule is currently applied. */
+	uint64_t item_flags; /**< Item flags. */
+	uint64_t action_flags; /**< Action flags. */
 	uint64_t hits;
 	uint64_t bytes;
+	union { /**< Tunnel encap/decap descriptor. */
+		struct mlx5_flow_tcf_tunnel_hdr *tunnel;
+		struct mlx5_flow_tcf_vxlan_decap *vxlan_decap;
+		struct mlx5_flow_tcf_vxlan_encap *vxlan_encap;
+	};
 };
 
 /* Verbs specification header. */
diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 5c46f35..8f9c78a 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -54,6 +54,37 @@ struct tc_vlan {
 
 #endif /* HAVE_TC_ACT_VLAN */
 
+#ifdef HAVE_TC_ACT_TUNNEL_KEY
+
+#include <linux/tc_act/tc_tunnel_key.h>
+
+#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
+#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
+#endif
+
+#else /* HAVE_TC_ACT_TUNNEL_KEY */
+
+#define TCA_ACT_TUNNEL_KEY 17
+#define TCA_TUNNEL_KEY_ACT_SET 1
+#define TCA_TUNNEL_KEY_ACT_RELEASE 2
+#define TCA_TUNNEL_KEY_PARMS 2
+#define TCA_TUNNEL_KEY_ENC_IPV4_SRC 3
+#define TCA_TUNNEL_KEY_ENC_IPV4_DST 4
+#define TCA_TUNNEL_KEY_ENC_IPV6_SRC 5
+#define TCA_TUNNEL_KEY_ENC_IPV6_DST 6
+#define TCA_TUNNEL_KEY_ENC_KEY_ID 7
+#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
+#define TCA_TUNNEL_KEY_NO_CSUM 10
+
+struct tc_tunnel_key {
+	tc_gen;
+	int t_action;
+};
+
+#endif /* HAVE_TC_ACT_TUNNEL_KEY */
+
+
+
 #ifdef HAVE_TC_ACT_PEDIT
 
 #include <linux/tc_act/tc_pedit.h>
@@ -210,6 +241,45 @@ struct tc_pedit_sel {
 #ifndef HAVE_TCA_FLOWER_KEY_VLAN_ETH_TYPE
 #define TCA_FLOWER_KEY_VLAN_ETH_TYPE 25
 #endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_KEY_ID
+#define TCA_FLOWER_KEY_ENC_KEY_ID 26
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC
+#define TCA_FLOWER_KEY_ENC_IPV4_SRC 27
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK
+#define TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK 28
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST
+#define TCA_FLOWER_KEY_ENC_IPV4_DST 29
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK
+#define TCA_FLOWER_KEY_ENC_IPV4_DST_MASK 30
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC
+#define TCA_FLOWER_KEY_ENC_IPV6_SRC 31
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK
+#define TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK 32
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST
+#define TCA_FLOWER_KEY_ENC_IPV6_DST 33
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK
+#define TCA_FLOWER_KEY_ENC_IPV6_DST_MASK 34
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT
+#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT 43
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK
+#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK 44
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT
+#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT 45
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK
+#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK 46
+#endif
 #ifndef HAVE_TCA_FLOWER_KEY_TCP_FLAGS
 #define TCA_FLOWER_KEY_TCP_FLAGS 71
 #endif
@@ -232,6 +302,111 @@ struct tc_pedit_sel {
 #define TP_PORT_LEN 2 /* Transport Port (UDP/TCP) Length */
 #endif
 
+#define MLX5_VXLAN_PORT_RANGE_MIN 30000
+#define MLX5_VXLAN_PORT_RANGE_MAX 60000
+#define MLX5_VXLAN_DEVICE_PFX "vmlx_"
+
+/** Tunnel action type, used for @p type in header structure. */
+enum mlx5_flow_tcf_tunact_type {
+	MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP,
+	MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP,
+};
+
+/** Flags used for @p mask in tunnel action encap descriptors. */
+#define	MLX5_FLOW_TCF_ENCAP_ETH_SRC (1u << 0)
+#define	MLX5_FLOW_TCF_ENCAP_ETH_DST (1u << 1)
+#define	MLX5_FLOW_TCF_ENCAP_IPV4_SRC (1u << 2)
+#define	MLX5_FLOW_TCF_ENCAP_IPV4_DST (1u << 3)
+#define	MLX5_FLOW_TCF_ENCAP_IPV6_SRC (1u << 4)
+#define	MLX5_FLOW_TCF_ENCAP_IPV6_DST (1u << 5)
+#define	MLX5_FLOW_TCF_ENCAP_UDP_SRC (1u << 6)
+#define	MLX5_FLOW_TCF_ENCAP_UDP_DST (1u << 7)
+#define	MLX5_FLOW_TCF_ENCAP_VXLAN_VNI (1u << 8)
+
+/** Neigh rule structure */
+struct tcf_neigh_rule {
+	LIST_ENTRY(tcf_neigh_rule) next;
+	uint32_t refcnt;
+	struct ether_addr eth;
+	uint16_t mask;
+	union {
+		struct {
+			rte_be32_t dst;
+		} ipv4;
+		struct {
+			uint8_t dst[16];
+		} ipv6;
+	};
+};
+
+/** Local rule structure */
+struct tcf_local_rule {
+	LIST_ENTRY(tcf_local_rule) next;
+	uint32_t refcnt;
+	uint16_t mask;
+	union {
+		struct {
+			rte_be32_t dst;
+			rte_be32_t src;
+		} ipv4;
+		struct {
+			uint8_t dst[16];
+			uint8_t src[16];
+		} ipv6;
+	};
+};
+
+/** VXLAN virtual netdev. */
+struct mlx5_flow_tcf_vtep {
+	LIST_ENTRY(mlx5_flow_tcf_vtep) next;
+	LIST_HEAD(, tcf_neigh_rule) neigh;
+	LIST_HEAD(, tcf_local_rule) local;
+	uint32_t refcnt;
+	unsigned int ifindex; /**< Own interface index. */
+	unsigned int ifouter; /**< Index of device attached to. */
+	uint16_t port;
+	uint8_t created;
+};
+
+/** Tunnel descriptor header, common for all tunnel types. */
+struct mlx5_flow_tcf_tunnel_hdr {
+	uint32_t type; /**< Tunnel action type. */
+	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
+	unsigned int ifindex_org; /**< Original dst/src interface */
+	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
+};
+
+struct mlx5_flow_tcf_vxlan_decap {
+	struct mlx5_flow_tcf_tunnel_hdr hdr;
+	uint16_t udp_port;
+};
+
+struct mlx5_flow_tcf_vxlan_encap {
+	struct mlx5_flow_tcf_tunnel_hdr hdr;
+	uint32_t mask;
+	struct {
+		struct ether_addr dst;
+		struct ether_addr src;
+	} eth;
+	union {
+		struct {
+			rte_be32_t dst;
+			rte_be32_t src;
+		} ipv4;
+		struct {
+			uint8_t dst[16];
+			uint8_t src[16];
+		} ipv6;
+	};
+	struct {
+		rte_be16_t src;
+		rte_be16_t dst;
+	} udp;
+	struct {
+		uint8_t vni[3];
+	} vxlan;
+};
+
 /**
  * Structure for holding netlink context.
  * Note the size of the message buffer which is MNL_SOCKET_BUFFER_SIZE.
-- 
1.8.3.1


* [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions Viacheslav Ovsiienko
@ 2018-10-15 14:13   ` Viacheslav Ovsiienko
  2018-10-23 10:04     ` Yongseok Koh
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine Viacheslav Ovsiienko
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

This part of the patchset adds support for validation of flow
item/action lists. The following entities are now supported:

- RTE_FLOW_ITEM_TYPE_VXLAN, contains the tunnel VNI

- RTE_FLOW_ACTION_TYPE_VXLAN_DECAP, if this action is specified,
  the items in the flow item list are treated as outer network
  parameters for the tunnel outer header match. The Ethernet layer
  addresses are always treated as inner ones.

- RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, contains the item list to
  build the encapsulation header. In the current implementation the
  values are subject to some constraints:
    - the outer source MAC address is always unconditionally
      set to one of the MAC addresses of the outer egress interface
    - there is no way to specify the source UDP port
    - all the above-mentioned parameters are ignored if specified
      in the rule, and warning messages are sent to the log
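
The encapsulation item validators in this patch accept a field mask only
when it is either empty or full and reject partial masks. A minimal,
generic sketch of that classification follows; enum mask_kind and
classify_mask() are hypothetical names used for illustration, whereas
the patch itself compares each field against fixed masks with memcmp().

```c
#include <stddef.h>
#include <stdint.h>

enum mask_kind { MASK_EMPTY, MASK_FULL, MASK_PARTIAL };

/*
 * Hypothetical helper: classify a byte mask as all zeros (field not
 * matched), all ones (exact match) or anything else (partial).
 */
enum mask_kind
classify_mask(const uint8_t *mask, size_t len)
{
	size_t zeros = 0;
	size_t ones = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (mask[i] == 0x00)
			zeros++;
		else if (mask[i] == 0xff)
			ones++;
	}
	if (zeros == len)
		return MASK_EMPTY;
	if (ones == len)
		return MASK_FULL;
	return MASK_PARTIAL;
}
```

A validator built on this would skip MASK_EMPTY fields, accept MASK_FULL
ones and report ENOTSUP for MASK_PARTIAL.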

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 711 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 705 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 8f9c78a..0055417 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -430,6 +430,7 @@ struct mlx5_flow_tcf_context {
 	struct rte_flow_item_ipv6 ipv6;
 	struct rte_flow_item_tcp tcp;
 	struct rte_flow_item_udp udp;
+	struct rte_flow_item_vxlan vxlan;
 } flow_tcf_mask_empty;
 
 /** Supported masks for known item types. */
@@ -441,6 +442,7 @@ struct mlx5_flow_tcf_context {
 	struct rte_flow_item_ipv6 ipv6;
 	struct rte_flow_item_tcp tcp;
 	struct rte_flow_item_udp udp;
+	struct rte_flow_item_vxlan vxlan;
 } flow_tcf_mask_supported = {
 	.port_id = {
 		.id = 0xffffffff,
@@ -478,6 +480,9 @@ struct mlx5_flow_tcf_context {
 		.src_port = RTE_BE16(0xffff),
 		.dst_port = RTE_BE16(0xffff),
 	},
+	.vxlan = {
+		.vni = "\xff\xff\xff",
+	},
 };
 
 #define SZ_NLATTR_HDR MNL_ALIGN(sizeof(struct nlattr))
@@ -943,6 +948,615 @@ struct pedit_parser {
 }
 
 /**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_ETH item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_eth(const struct rte_flow_item *item,
+				  struct rte_flow_error *error)
+{
+	const struct rte_flow_item_eth *spec = item->spec;
+	const struct rte_flow_item_eth *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for L2 addresses can be empty
+		 * because these ones are optional and not
+		 * required directly by tc rule.
+		 */
+		return 0;
+	if (!mask)
+		/* If mask is not specified use the default one. */
+		mask = &rte_flow_item_eth_mask;
+	if (memcmp(&mask->dst,
+		   &flow_tcf_mask_empty.eth.dst,
+		   sizeof(flow_tcf_mask_empty.eth.dst))) {
+		if (memcmp(&mask->dst,
+			   &rte_flow_item_eth_mask.dst,
+			   sizeof(rte_flow_item_eth_mask.dst)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"eth.dst\" field");
+	}
+	if (memcmp(&mask->src,
+		   &flow_tcf_mask_empty.eth.src,
+		   sizeof(flow_tcf_mask_empty.eth.src))) {
+		if (memcmp(&mask->src,
+			   &rte_flow_item_eth_mask.src,
+			   sizeof(rte_flow_item_eth_mask.src)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"eth.src\" field");
+	}
+	if (mask->type != RTE_BE16(0x0000)) {
+		if (mask->type != RTE_BE16(0xffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"eth.type\" field");
+		DRV_LOG(WARNING,
+			"outer ethernet type field "
+			"cannot be forced for VXLAN "
+			"encapsulation, parameter ignored");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV4 item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_ipv4(const struct rte_flow_item *item,
+				   struct rte_flow_error *error)
+{
+	const struct rte_flow_item_ipv4 *spec = item->spec;
+	const struct rte_flow_item_ipv4 *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for L3 addresses cannot be empty
+		 * because it is required by tunnel_key parameter.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL outer L3 address specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_ipv4_mask;
+	if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
+		if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv4.hdr.dst_addr\" field");
+		/* More L3 address validations can be put here. */
+	} else {
+		/*
+		 * Kernel uses the destination L3 address to determine
+		 * the routing path and obtain the L2 destination
+		 * address, so L3 destination address must be
+		 * specified in the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	if (mask->hdr.src_addr != RTE_BE32(0x00000000)) {
+		if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv4.hdr.src_addr\" field");
+		/* More L3 address validations can be put here. */
+	} else {
+		/*
+		 * Kernel uses the source L3 address to select the
+		 * interface for egress encapsulated traffic, so
+		 * it must be specified in the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 source address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV6 item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_ipv6(const struct rte_flow_item *item,
+				   struct rte_flow_error *error)
+{
+	const struct rte_flow_item_ipv6 *spec = item->spec;
+	const struct rte_flow_item_ipv6 *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for L3 addresses cannot be empty
+		 * because it is required by tunnel_key parameter.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL outer L3 address specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_ipv6_mask;
+	if (memcmp(&mask->hdr.dst_addr,
+		   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
+		   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
+		if (memcmp(&mask->hdr.dst_addr,
+		   &rte_flow_item_ipv6_mask.hdr.dst_addr,
+		   sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv6.hdr.dst_addr\" field");
+		/* More L3 address validations can be put here. */
+	} else {
+		/*
+		 * Kernel uses the destination L3 address to determine
+		 * the routing path and obtain the L2 destination
+		 * address (neighbour or gateway), so L3 destination address
+		 * must be specified within the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	if (memcmp(&mask->hdr.src_addr,
+		   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
+		   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
+		if (memcmp(&mask->hdr.src_addr,
+		   &rte_flow_item_ipv6_mask.hdr.src_addr,
+		   sizeof(rte_flow_item_ipv6_mask.hdr.src_addr)))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"ipv6.hdr.src_addr\" field");
+		/* More L3 address validation can be put here. */
+	} else {
+		/*
+		 * Kernel uses the source L3 address to select the
+		 * interface for egress encapsulated traffic, so
+		 * it must be specified in the tc rule.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer L3 source address must be "
+				 "specified for VXLAN encapsulation");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_UDP item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_udp(const struct rte_flow_item *item,
+				  struct rte_flow_error *error)
+{
+	const struct rte_flow_item_udp *spec = item->spec;
+	const struct rte_flow_item_udp *mask = item->mask;
+
+	if (!spec)
+		/*
+		 * Specification for UDP ports cannot be empty
+		 * because it is required by tunnel_key parameter.
+		 */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL UDP port specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_udp_mask;
+	if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
+		if (mask->hdr.dst_port != RTE_BE16(0xffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"udp.hdr.dst_port\" field");
+		if (!spec->hdr.dst_port)
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "zero encap remote UDP port");
+	} else {
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "outer UDP remote port must be "
+				 "specified for VXLAN encapsulation");
+	}
+	if (mask->hdr.src_port != RTE_BE16(0x0000)) {
+		if (mask->hdr.src_port != RTE_BE16(0xffff))
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"udp.hdr.src_port\" field");
+		DRV_LOG(WARNING,
+			"outer UDP source port cannot be "
+			"forced for VXLAN encapsulation, "
+			"parameter ignored");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_VXLAN item for E-Switch.
+ *
+ * @param[in] item
+ *   Pointer to the item structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap_vni(const struct rte_flow_item *item,
+				  struct rte_flow_error *error)
+{
+	const struct rte_flow_item_vxlan *spec = item->spec;
+	const struct rte_flow_item_vxlan *mask = item->mask;
+
+	if (!spec)
+		/* Outer VNI is required by tunnel_key parameter. */
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, item,
+				 "NULL VNI specification "
+				 "for VXLAN encapsulation");
+	if (!mask)
+		mask = &rte_flow_item_vxlan_mask;
+	if (mask->vni[0] != 0 ||
+	    mask->vni[1] != 0 ||
+	    mask->vni[2] != 0) {
+		if (mask->vni[0] != 0xff ||
+		    mask->vni[1] != 0xff ||
+		    mask->vni[2] != 0xff)
+			return rte_flow_error_set(error, ENOTSUP,
+				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				 "no support for partial mask on"
+				 " \"vxlan.vni\" field");
+		if (spec->vni[0] == 0 &&
+		    spec->vni[1] == 0 &&
+		    spec->vni[2] == 0)
+			return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ITEM, item,
+					  "VXLAN vni cannot be 0");
+	} else {
+		return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM,
+				 item,
+				 "outer VNI must be specified "
+				 "for VXLAN encapsulation");
+	}
+	return 0;
+}
+
+/**
+ * Validate VXLAN_ENCAP action item list for E-Switch.
+ *
+ * @param[in] action
+ *   Pointer to the VXLAN_ENCAP action structure.
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_encap(const struct rte_flow_action *action,
+			      struct rte_flow_error *error)
+{
+	const struct rte_flow_item *items;
+	int ret;
+	uint32_t item_flags = 0;
+
+	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
+	if (!action->conf)
+		return rte_flow_error_set
+			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
+			 action, "Missing VXLAN tunnel "
+				 "action configuration");
+	items = ((const struct rte_flow_action_vxlan_encap *)
+					action->conf)->definition;
+	if (!items)
+		return rte_flow_error_set
+			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
+			 action, "Missing VXLAN tunnel "
+				 "encapsulation parameters");
+	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
+		switch (items->type) {
+		case RTE_FLOW_ITEM_TYPE_VOID:
+			break;
+		case RTE_FLOW_ITEM_TYPE_ETH:
+			ret = mlx5_flow_validate_item_eth(items, item_flags,
+							  error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_eth(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L2;
+			break;
+		break;
+		case RTE_FLOW_ITEM_TYPE_IPV4:
+			ret = mlx5_flow_validate_item_ipv4(items, item_flags,
+							   error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_ipv4(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
+			break;
+		case RTE_FLOW_ITEM_TYPE_IPV6:
+			ret = mlx5_flow_validate_item_ipv6(items, item_flags,
+							   error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_ipv6(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
+			break;
+		case RTE_FLOW_ITEM_TYPE_UDP:
+			ret = mlx5_flow_validate_item_udp(items, item_flags,
+							   0xFF, error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_udp(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
+			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			ret = mlx5_flow_validate_item_vxlan(items,
+							    item_flags, error);
+			if (ret < 0)
+				return ret;
+			ret = flow_tcf_validate_vxlan_encap_vni(items, error);
+			if (ret < 0)
+				return ret;
+			item_flags |= MLX5_FLOW_LAYER_VXLAN;
+			break;
+		default:
+			return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_ITEM, items,
+					  "VXLAN encap item not supported");
+		}
+	}
+	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L3))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L3 layer found"
+					  " for VXLAN encapsulation");
+	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L4 layer found"
+					  " for VXLAN encapsulation");
+	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no VXLAN VNI found"
+					  " for VXLAN encapsulation");
+	return 0;
+}
+
+/**
+ * Validate VXLAN_DECAP action outer tunnel items for E-Switch.
+ *
+ * @param[in] item_flags
+ *   Mask of provided outer tunnel parameters.
+ * @param[in] ipv4
+ *   Outer IPv4 address item (if any, NULL otherwise).
+ * @param[in] ipv6
+ *   Outer IPv6 address item (if any, NULL otherwise).
+ * @param[in] udp
+ *   Outer UDP layer item (if any, NULL otherwise).
+ * @param[out] error
+ *   Pointer to the error structure.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ **/
+static int
+flow_tcf_validate_vxlan_decap(uint32_t item_flags,
+			      const struct rte_flow_action *action,
+			      const struct rte_flow_item *ipv4,
+			      const struct rte_flow_item *ipv6,
+			      const struct rte_flow_item *udp,
+			      struct rte_flow_error *error)
+{
+	if (!ipv4 && !ipv6)
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L3 layer found"
+					  " for VXLAN decapsulation");
+	if (ipv4) {
+		const struct rte_flow_item_ipv4 *spec = ipv4->spec;
+		const struct rte_flow_item_ipv4 *mask = ipv4->mask;
+
+		if (!spec)
+			/*
+			 * Specification for L3 addresses cannot be empty
+			 * because it is required as decap parameter.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
+				 "NULL outer L3 address specification "
+				 "for VXLAN decapsulation");
+		if (!mask)
+			mask = &rte_flow_item_ipv4_mask;
+		if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
+			if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"ipv4.hdr.dst_addr\" field");
+			/* More L3 address validations can be put here. */
+		} else {
+			/*
+			 * Kernel uses the destination L3 address
+			 * to determine the ingress network interface
+			 * for traffic being decapsulated.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN decapsulation");
+		}
+		/* Source L3 address is optional for decap. */
+		if (mask->hdr.src_addr != RTE_BE32(0x00000000))
+			if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"ipv4.hdr.src_addr\" field");
+	} else {
+		const struct rte_flow_item_ipv6 *spec = ipv6->spec;
+		const struct rte_flow_item_ipv6 *mask = ipv6->mask;
+
+		if (!spec)
+			/*
+			 * Specification for L3 addresses cannot be empty
+			 * because it is required as decap parameter.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
+				 "NULL outer L3 address specification "
+				 "for VXLAN decapsulation");
+		if (!mask)
+			mask = &rte_flow_item_ipv6_mask;
+		if (memcmp(&mask->hdr.dst_addr,
+			   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
+			   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
+			if (memcmp(&mask->hdr.dst_addr,
+				&rte_flow_item_ipv6_mask.hdr.dst_addr,
+				sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
+				return rte_flow_error_set(error, ENOTSUP,
+				       RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+				       "no support for partial mask on"
+				       " \"ipv6.hdr.dst_addr\" field");
+			/* More L3 address validations can be put here. */
+		} else {
+			/*
+			 * Kernel uses the destination L3 address
+			 * to determine the ingress network interface
+			 * for traffic being decapsulated.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
+				 "outer L3 destination address must be "
+				 "specified for VXLAN decapsulation");
+		}
+		/* Source L3 address is optional for decap. */
+		if (memcmp(&mask->hdr.src_addr,
+			   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
+			   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
+			if (memcmp(&mask->hdr.src_addr,
+				   &rte_flow_item_ipv6_mask.hdr.src_addr,
+				   sizeof(mask->hdr.src_addr)))
+				return rte_flow_error_set(error, ENOTSUP,
+					RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					"no support for partial mask on"
+					" \"ipv6.hdr.src_addr\" field");
+		}
+	}
+	if (!udp) {
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no outer L4 layer found"
+					  " for VXLAN decapsulation");
+	} else {
+		const struct rte_flow_item_udp *spec = udp->spec;
+		const struct rte_flow_item_udp *mask = udp->mask;
+
+		if (!spec)
+			/*
+			 * Specification for UDP ports cannot be empty
+			 * because it is required as decap parameter.
+			 */
+			return rte_flow_error_set(error, EINVAL,
+					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
+					 "NULL UDP port specification "
+					 "for VXLAN decapsulation");
+		if (!mask)
+			mask = &rte_flow_item_udp_mask;
+		if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
+			if (mask->hdr.dst_port != RTE_BE16(0xffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"udp.hdr.dst_port\" field");
+			if (!spec->hdr.dst_port)
+				return rte_flow_error_set(error, EINVAL,
+					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
+					 "zero decap local UDP port");
+		} else {
+			return rte_flow_error_set(error, EINVAL,
+					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
+					 "outer UDP destination port must be "
+					 "specified for VXLAN decapsulation");
+		}
+		if (mask->hdr.src_port != RTE_BE16(0x0000)) {
+			if (mask->hdr.src_port != RTE_BE16(0xffff))
+				return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
+					 "no support for partial mask on"
+					 " \"udp.hdr.src_port\" field");
+			DRV_LOG(WARNING,
+				"outer UDP source port cannot be "
+				"forced for VXLAN decapsulation, "
+				"parameter ignored");
+		}
+	}
+	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
+		return rte_flow_error_set(error, EINVAL,
+					  RTE_FLOW_ERROR_TYPE_ACTION, action,
+					  "no VXLAN VNI found"
+					  " for VXLAN decapsulation");
+	/* VNI is already validated, extra check can be put here. */
+	return 0;
+}
+
+/**
  * Validate flow for E-Switch.
  *
  * @param[in] priv
@@ -974,7 +1588,8 @@ struct pedit_parser {
 		const struct rte_flow_item_ipv6 *ipv6;
 		const struct rte_flow_item_tcp *tcp;
 		const struct rte_flow_item_udp *udp;
-	} spec, mask;
+		const struct rte_flow_item_vxlan *vxlan;
+	} spec, mask;
 	union {
 		const struct rte_flow_action_port_id *port_id;
 		const struct rte_flow_action_jump *jump;
@@ -983,9 +1598,13 @@ struct pedit_parser {
 			of_set_vlan_vid;
 		const struct rte_flow_action_of_set_vlan_pcp *
 			of_set_vlan_pcp;
+		const struct rte_flow_action_vxlan_encap *vxlan_encap;
 		const struct rte_flow_action_set_ipv4 *set_ipv4;
 		const struct rte_flow_action_set_ipv6 *set_ipv6;
 	} conf;
+	const struct rte_flow_item *ipv4 = NULL; /* storage to check */
+	const struct rte_flow_item *ipv6 = NULL; /* outer tunnel. */
+	const struct rte_flow_item *udp = NULL;  /* parameters. */
 	uint32_t item_flags = 0;
 	uint32_t action_flags = 0;
 	uint8_t next_protocol = -1;
@@ -1114,7 +1733,6 @@ struct pedit_parser {
 							   error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
 			mask.ipv4 = flow_tcf_item_mask
 				(items, &rte_flow_item_ipv4_mask,
 				 &flow_tcf_mask_supported.ipv4,
@@ -1135,13 +1753,22 @@ struct pedit_parser {
 				next_protocol =
 					((const struct rte_flow_item_ipv4 *)
 					 (items->spec))->hdr.next_proto_id;
+			if (item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
+				/*
+				 * Multiple outer items are not allowed as
+				 * tunnel parameters, will raise an error later.
+				 */
+				ipv4 = NULL;
+			} else {
+				ipv4 = items;
+				item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
+			}
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV6:
 			ret = mlx5_flow_validate_item_ipv6(items, item_flags,
 							   error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
 			mask.ipv6 = flow_tcf_item_mask
 				(items, &rte_flow_item_ipv6_mask,
 				 &flow_tcf_mask_supported.ipv6,
@@ -1162,13 +1789,22 @@ struct pedit_parser {
 				next_protocol =
 					((const struct rte_flow_item_ipv6 *)
 					 (items->spec))->hdr.proto;
+			if (item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV6) {
+				/*
+				 * Multiple outer items are not allowed as
+				 * tunnel parameters.
+				 */
+				ipv6 = NULL;
+			} else {
+				ipv6 = items;
+				item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
+			}
 			break;
 		case RTE_FLOW_ITEM_TYPE_UDP:
 			ret = mlx5_flow_validate_item_udp(items, item_flags,
 							  next_protocol, error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
 			mask.udp = flow_tcf_item_mask
 				(items, &rte_flow_item_udp_mask,
 				 &flow_tcf_mask_supported.udp,
@@ -1177,6 +1813,12 @@ struct pedit_parser {
 				 error);
 			if (!mask.udp)
 				return -rte_errno;
+			if (item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP) {
+				udp = NULL;
+			} else {
+				udp = items;
+				item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
+			}
 			break;
 		case RTE_FLOW_ITEM_TYPE_TCP:
 			ret = mlx5_flow_validate_item_tcp
@@ -1186,7 +1828,6 @@ struct pedit_parser {
 					      error);
 			if (ret < 0)
 				return ret;
-			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
 			mask.tcp = flow_tcf_item_mask
 				(items, &rte_flow_item_tcp_mask,
 				 &flow_tcf_mask_supported.tcp,
@@ -1195,11 +1836,36 @@ struct pedit_parser {
 				 error);
 			if (!mask.tcp)
 				return -rte_errno;
+			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
+			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			ret = mlx5_flow_validate_item_vxlan(items,
+							    item_flags, error);
+			if (ret < 0)
+				return ret;
+			mask.vxlan = flow_tcf_item_mask
+				(items, &rte_flow_item_vxlan_mask,
+				 &flow_tcf_mask_supported.vxlan,
+				 &flow_tcf_mask_empty.vxlan,
+				 sizeof(flow_tcf_mask_supported.vxlan),
+				 error);
+			if (!mask.vxlan)
+				return -rte_errno;
+			if (mask.vxlan->vni[0] != 0xff ||
+			    mask.vxlan->vni[1] != 0xff ||
+			    mask.vxlan->vni[2] != 0xff)
+				return rte_flow_error_set
+					(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
+					 mask.vxlan,
+					 "no support for partial or "
+					 "empty mask on \"vxlan.vni\" field");
+			item_flags |= MLX5_FLOW_LAYER_VXLAN;
 			break;
 		default:
 			return rte_flow_error_set(error, ENOTSUP,
 						  RTE_FLOW_ERROR_TYPE_ITEM,
-						  NULL, "item not supported");
+						  items, "item not supported");
 		}
 	}
 	for (; actions->type != RTE_FLOW_ACTION_TYPE_END; actions++) {
@@ -1271,6 +1937,33 @@ struct pedit_parser {
 					 " set action must follow push action");
 			current_action_flag = MLX5_FLOW_ACTION_OF_SET_VLAN_PCP;
 			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
+			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
+					   | MLX5_ACTION_VXLAN_DECAP))
+				return rte_flow_error_set
+					(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ACTION, actions,
+					 "can't have multiple vxlan actions");
+			ret = flow_tcf_validate_vxlan_encap(actions, error);
+			if (ret < 0)
+				return ret;
+			action_flags |= MLX5_ACTION_VXLAN_ENCAP;
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
+			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
+					   | MLX5_ACTION_VXLAN_DECAP))
+				return rte_flow_error_set
+					(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ACTION, actions,
+					 "can't have multiple vxlan actions");
+			ret = flow_tcf_validate_vxlan_decap(item_flags,
+							    actions,
+							    ipv4, ipv6, udp,
+							    error);
+			if (ret < 0)
+				return ret;
+			action_flags |= MLX5_ACTION_VXLAN_DECAP;
+			break;
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
 			current_action_flag = MLX5_FLOW_ACTION_SET_IPV4_SRC;
 			break;
@@ -1391,6 +2084,12 @@ struct pedit_parser {
 		return rte_flow_error_set(error, EINVAL,
 					  RTE_FLOW_ERROR_TYPE_ACTION, actions,
 					  "no fate action is found");
+	if ((item_flags & MLX5_FLOW_LAYER_VXLAN) &&
+	    !(action_flags & MLX5_ACTION_VXLAN_DECAP))
+		return rte_flow_error_set(error, ENOTSUP,
+					 RTE_FLOW_ERROR_TYPE_ACTION, NULL,
+					 "VNI pattern should be followed "
+					 "by VXLAN_DECAP action");
 	return 0;
 }
 
-- 
1.8.3.1


* [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions Viacheslav Ovsiienko
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine Viacheslav Ovsiienko
@ 2018-10-15 14:13   ` Viacheslav Ovsiienko
  2018-10-23 10:06     ` Yongseok Koh
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 4/7] net/mlx5: e-switch VXLAN netlink routines update Viacheslav Ovsiienko
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

This part of the patchset adds support for VXLAN-related items and
actions to the flow translation routine. If any of them are specified
in the rule, extra space for the tunnel description structure is
allocated. Other tunnel types besides VXLAN (for example, GRE) can be
added later. No VTEP devices are created at this point; the flow rule
is only translated, not yet applied.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 641 +++++++++++++++++++++++++++++++++++----
 1 file changed, 578 insertions(+), 63 deletions(-)
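
Reader's note on the diff below: each flow_tcf_parse_vxlan_encap_*()
helper returns the buffer space that its Netlink attributes need, so that
the caller can sum the results. The sketch below redoes that arithmetic in
isolation. It assumes the standard Netlink layout (a 4-byte attribute
header and 4-byte alignment) and that SZ_NLATTR_TYPE_OF() and
SZ_NLATTR_DATA_OF(), whose definitions are not shown in this message,
reduce to "aligned header plus payload". All sketch_* and SKETCH_* names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: struct nlattr holds two 16-bit fields, 4 bytes total. */
#define SKETCH_NLATTR_HDR 4u
#define SKETCH_IPV6_ADDR_LEN 16u

/* Hypothetical flags describing which encap items are present. */
enum {
	SKETCH_IPV4 = 1u << 0,    /* outer IPv4 source and destination */
	SKETCH_IPV6 = 1u << 1,    /* outer IPv6 source and destination */
	SKETCH_UDP_SRC = 1u << 2, /* explicit outer UDP source port */
};

/* Round len up to the 4-byte Netlink alignment. */
static size_t
sketch_align(size_t len)
{
	return (len + 3u) & ~(size_t)3u;
}

/* Space taken by one attribute: header plus payload, padded. */
static size_t
sketch_attr(size_t payload)
{
	return sketch_align(SKETCH_NLATTR_HDR + payload);
}

/* Sum the attribute sizes the way the helpers in the diff report them. */
static size_t
sketch_encap_size(unsigned int flags)
{
	size_t size = 0;

	/* The sketch assumes a single outer L3 header. */
	assert(!(flags & SKETCH_IPV4) || !(flags & SKETCH_IPV6));
	if (flags & SKETCH_IPV4)
		size += 2 * sketch_attr(sizeof(uint32_t));
	if (flags & SKETCH_IPV6)
		size += 2 * sketch_attr(SKETCH_IPV6_ADDR_LEN);
	size += sketch_attr(sizeof(uint16_t)); /* UDP destination port */
	if (flags & SKETCH_UDP_SRC)
		size += sketch_attr(sizeof(uint16_t));
	size += sketch_attr(sizeof(uint32_t)); /* VNI */
	return size;
}
```

Under these assumptions an IPv4 tunnel without an explicit source port
needs 16 + 8 + 8 = 32 bytes, and an IPv6 tunnel needs 40 + 8 + 8 = 56
bytes. The Ethernet item contributes nothing, which matches the helper in
the diff returning 0 for it.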

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 0055417..660d45e 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -2094,6 +2094,265 @@ struct pedit_parser {
 }
 
 /**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_ETH entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the MAC address fields
+ * in the encapsulation parameters structure. The item must be prevalidated;
+ * no validation checks are performed by this function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_ETH entry specification.
+ * @param[in] mask
+ *   RTE_FLOW_ITEM_TYPE_ETH entry mask.
+ * @param[out] encap
+ *   Structure to fill the gathered MAC address data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_eth(const struct rte_flow_item_eth *spec,
+			       const struct rte_flow_item_eth *mask,
+			       struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	if (!mask || !memcmp(&mask->dst,
+			     &rte_flow_item_eth_mask.dst,
+			     sizeof(rte_flow_item_eth_mask.dst))) {
+		/*
+		 * Ethernet addresses are not supported by
+		 * tc as tunnel_key parameters. Destination
+		 * address is needed to form encap packet
+		 * header and retrieved by kernel from
+		 * implicit sources (ARP table, etc),
+		 * address masks are not supported at all.
+		 */
+		encap->eth.dst = spec->dst;
+		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_DST;
+	}
+	if (!mask || !memcmp(&mask->src,
+			     &rte_flow_item_eth_mask.src,
+			     sizeof(rte_flow_item_eth_mask.src))) {
+		/*
+		 * Ethernet addresses are not supported by
+		 * tc as tunnel_key parameters. Source ethernet
+		 * address is ignored anyway.
+		 */
+		encap->eth.src = spec->src;
+		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_SRC;
+	}
+	/*
+	 * No space allocated for ethernet addresses within Netlink
+	 * message tunnel_key record - these ones are not
+	 * supported by tc.
+	 */
+	return 0;
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_IPV4 entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV4 address fields
+ * in the encapsulation parameters structure. The item must be prevalidated;
+ * no validation checks are performed by this function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_IPV4 entry specification.
+ * @param[out] encap
+ *   Structure to fill the gathered IPV4 address data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_ipv4(const struct rte_flow_item_ipv4 *spec,
+				struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	encap->ipv4.dst = spec->hdr.dst_addr;
+	encap->ipv4.src = spec->hdr.src_addr;
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV4_SRC |
+		       MLX5_FLOW_TCF_ENCAP_IPV4_DST;
+	return 2 * SZ_NLATTR_TYPE_OF(uint32_t);
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_IPV6 entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV6 address fields
+ * in the encapsulation parameters structure. The item must be prevalidated;
+ * no validation checks are performed by this function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_IPV6 entry specification.
+ * @param[out] encap
+ *   Structure to fill the gathered IPV6 address data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_ipv6(const struct rte_flow_item_ipv6 *spec,
+				struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	memcpy(encap->ipv6.dst, spec->hdr.dst_addr, sizeof(encap->ipv6.dst));
+	memcpy(encap->ipv6.src, spec->hdr.src_addr, sizeof(encap->ipv6.src));
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV6_SRC |
+		       MLX5_FLOW_TCF_ENCAP_IPV6_DST;
+	return SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN) * 2;
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_UDP entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the UDP port fields
+ * in the encapsulation parameters structure. The item must be prevalidated;
+ * no validation checks are performed by this function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_UDP entry specification.
+ * @param[in] mask
+ *   RTE_FLOW_ITEM_TYPE_UDP entry mask.
+ * @param[out] encap
+ *   Structure to fill the gathered UDP port data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_udp(const struct rte_flow_item_udp *spec,
+			       const struct rte_flow_item_udp *mask,
+			       struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	int size = SZ_NLATTR_TYPE_OF(uint16_t);
+
+	assert(spec);
+	encap->udp.dst = spec->hdr.dst_port;
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_UDP_DST;
+	if (!mask || mask->hdr.src_port != RTE_BE16(0x0000)) {
+		encap->udp.src = spec->hdr.src_port;
+		size += SZ_NLATTR_TYPE_OF(uint16_t);
+		encap->mask |= MLX5_FLOW_TCF_ENCAP_UDP_SRC;
+	}
+	return size;
+}
+
+/**
+ * Helper function to process RTE_FLOW_ITEM_TYPE_VXLAN entry in configuration
+ * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the VNI fields
+ * in the encapsulation parameters structure. The item must be prevalidated;
+ * no validation checks are performed by this function.
+ *
+ * @param[in] spec
+ *   RTE_FLOW_ITEM_TYPE_VXLAN entry specification.
+ * @param[out] encap
+ *   Structure to fill with the gathered VNI data.
+ *
+ * @return
+ *   The size needed in the Netlink message tunnel_key
+ *   parameter buffer to store the item attributes.
+ */
+static int
+flow_tcf_parse_vxlan_encap_vni(const struct rte_flow_item_vxlan *spec,
+			       struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	/* Item must be validated before. No redundant checks. */
+	assert(spec);
+	memcpy(encap->vxlan.vni, spec->vni, sizeof(encap->vxlan.vni));
+	encap->mask |= MLX5_FLOW_TCF_ENCAP_VXLAN_VNI;
+	return SZ_NLATTR_TYPE_OF(uint32_t);
+}
+
+/**
+ * Populate consolidated encapsulation object from list of pattern items.
+ *
+ * Helper function to process configuration of action such as
+ * RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. The item list should be
+ * validated beforehand, as there is no way to return a meaningful error.
+ *
+ * @param[in] action
+ *   RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP action object,
+ *   providing the list of pattern items to gather data from.
+ * @param[out] encap
+ *   Structure to fill with the gathered data.
+ *
+ * @return
+ *   The size of the part of the Netlink message buffer needed to store
+ *   the item attributes on success, zero otherwise. The mask field in
+ *   the result structure reflects the correctly parsed items.
+ */
+static int
+flow_tcf_vxlan_encap_parse(const struct rte_flow_action *action,
+			   struct mlx5_flow_tcf_vxlan_encap *encap)
+{
+	union {
+		const struct rte_flow_item_eth *eth;
+		const struct rte_flow_item_ipv4 *ipv4;
+		const struct rte_flow_item_ipv6 *ipv6;
+		const struct rte_flow_item_udp *udp;
+		const struct rte_flow_item_vxlan *vxlan;
+	} spec, mask;
+	const struct rte_flow_item *items;
+	int size = 0;
+
+	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
+	assert(action->conf);
+
+	items = ((const struct rte_flow_action_vxlan_encap *)
+					action->conf)->definition;
+	assert(items);
+	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
+		switch (items->type) {
+		case RTE_FLOW_ITEM_TYPE_VOID:
+			break;
+		case RTE_FLOW_ITEM_TYPE_ETH:
+			mask.eth = items->mask;
+			spec.eth = items->spec;
+			size += flow_tcf_parse_vxlan_encap_eth(spec.eth,
+							       mask.eth,
+							       encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_IPV4:
+			spec.ipv4 = items->spec;
+			size += flow_tcf_parse_vxlan_encap_ipv4(spec.ipv4,
+								encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_IPV6:
+			spec.ipv6 = items->spec;
+			size += flow_tcf_parse_vxlan_encap_ipv6(spec.ipv6,
+								encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_UDP:
+			mask.udp = items->mask;
+			spec.udp = items->spec;
+			size += flow_tcf_parse_vxlan_encap_udp(spec.udp,
+							       mask.udp,
+							       encap);
+			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			spec.vxlan = items->spec;
+			size += flow_tcf_parse_vxlan_encap_vni(spec.vxlan,
+							       encap);
+			break;
+		default:
+			assert(false);
+			DRV_LOG(WARNING,
+				"unsupported item %p type %d,"
+				" items must be validated"
+				" before flow creation",
+				(const void *)items, items->type);
+			encap->mask = 0;
+			return 0;
+		}
+	}
+	return size;
+}
+
+/**
  * Calculate maximum size of memory for flow items of Linux TC flower and
  * extract specified items.
  *
@@ -2148,7 +2407,7 @@ struct pedit_parser {
 		case RTE_FLOW_ITEM_TYPE_IPV6:
 			size += SZ_NLATTR_TYPE_OF(uint16_t) + /* Ether type. */
 				SZ_NLATTR_TYPE_OF(uint8_t) + /* IP proto. */
-				SZ_NLATTR_TYPE_OF(IPV6_ADDR_LEN) * 4;
+				SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN) * 4;
 				/* dst/src IP addr and mask. */
 			flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
 			break;
@@ -2164,6 +2423,10 @@ struct pedit_parser {
 				/* dst/src port and mask. */
 			flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
 			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			size += SZ_NLATTR_TYPE_OF(uint32_t);
+			flags |= MLX5_FLOW_LAYER_VXLAN;
+			break;
 		default:
 			DRV_LOG(WARNING,
 				"unsupported item %p type %d,"
@@ -2184,13 +2447,16 @@ struct pedit_parser {
  *   Pointer to the list of actions.
  * @param[out] action_flags
  *   Pointer to the detected actions.
+ * @param[out] tunnel
+ *   Pointer to tunnel encapsulation parameters structure to fill.
  *
  * @return
  *   Maximum size of memory for actions.
  */
 static int
 flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
-			      uint64_t *action_flags)
+			      uint64_t *action_flags,
+			      void *tunnel)
 {
 	int size = 0;
 	uint64_t flags = 0;
@@ -2246,6 +2512,29 @@ struct pedit_parser {
 				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID. */
 				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio. */
 			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
+			size += SZ_NLATTR_NEST + /* na_act_index. */
+				SZ_NLATTR_STRZ_OF("tunnel_key") +
+				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
+				SZ_NLATTR_TYPE_OF(uint8_t);
+			size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
+			size +=	flow_tcf_vxlan_encap_parse(actions, tunnel) +
+				RTE_ALIGN_CEIL /* preceding encap params. */
+				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
+				MNL_ALIGNTO);
+			flags |= MLX5_ACTION_VXLAN_ENCAP;
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
+			size += SZ_NLATTR_NEST + /* na_act_index. */
+				SZ_NLATTR_STRZ_OF("tunnel_key") +
+				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
+				SZ_NLATTR_TYPE_OF(uint8_t);
+			size +=	SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
+			size +=	RTE_ALIGN_CEIL /* preceding decap params. */
+				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
+				MNL_ALIGNTO);
+			flags |= MLX5_ACTION_VXLAN_DECAP;
+			break;
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
 		case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
@@ -2289,6 +2578,26 @@ struct pedit_parser {
 }
 
 /**
+ * Convert VXLAN VNI to 32-bit integer.
+ *
+ * @param[in] vni
+ *   VXLAN VNI in 24-bit wire format.
+ *
+ * @return
+ *   VXLAN VNI as a 32-bit integer value in network endian.
+ */
+static rte_be32_t
+vxlan_vni_as_be32(const uint8_t vni[3])
+{
+	rte_be32_t ret;
+
+	ret = vni[0];
+	ret = (ret << 8) | vni[1];
+	ret = (ret << 8) | vni[2];
+	return RTE_BE32(ret);
+}
+
+/**
  * Prepare a flow object for Linux TC flower. It calculates the maximum size of
  * memory required, allocates the memory, initializes Netlink message headers
  * and set unique TC message handle.
@@ -2323,22 +2632,54 @@ struct pedit_parser {
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
 	struct tcmsg *tcm;
+	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
+	uint8_t *sp, *tun = NULL;
 
 	size += flow_tcf_get_items_and_size(attr, items, item_flags);
-	size += flow_tcf_get_actions_and_size(actions, action_flags);
-	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
+	size += flow_tcf_get_actions_and_size(actions, action_flags, &encap);
+	dev_flow = rte_zmalloc(__func__, size,
+			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
+				(size_t)MNL_ALIGNTO));
 	if (!dev_flow) {
 		rte_flow_error_set(error, ENOMEM,
 				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
 				   "not enough memory to create E-Switch flow");
 		return NULL;
 	}
-	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
+	sp = (uint8_t *)(dev_flow + 1);
+	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
+		tun = sp;
+		sp += RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
+			MNL_ALIGNTO);
+		size -= RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
+			MNL_ALIGNTO);
+		encap.hdr.type = MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
+		memcpy(tun, &encap,
+		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
+	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
+		tun = sp;
+		sp += RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
+			MNL_ALIGNTO);
+		size -= RTE_ALIGN_CEIL
+			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
+			MNL_ALIGNTO);
+		encap.hdr.type = MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
+		memcpy(tun, &encap,
+		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
+	}
+	nlh = mnl_nlmsg_put_header(sp);
 	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
 	*dev_flow = (struct mlx5_flow){
 		.tcf = (struct mlx5_flow_tcf){
+			.nlsize = size,
 			.nlh = nlh,
 			.tcm = tcm,
+			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
+			.item_flags = *item_flags,
+			.action_flags = *action_flags,
 		},
 	};
 	/*
@@ -2392,6 +2733,7 @@ struct pedit_parser {
 		const struct rte_flow_item_ipv6 *ipv6;
 		const struct rte_flow_item_tcp *tcp;
 		const struct rte_flow_item_udp *udp;
+		const struct rte_flow_item_vxlan *vxlan;
 	} spec, mask;
 	union {
 		const struct rte_flow_action_port_id *port_id;
@@ -2402,6 +2744,14 @@ struct pedit_parser {
 		const struct rte_flow_action_of_set_vlan_pcp *
 			of_set_vlan_pcp;
 	} conf;
+	union {
+		struct mlx5_flow_tcf_tunnel_hdr *hdr;
+		struct mlx5_flow_tcf_vxlan_decap *vxlan;
+	} decap;
+	union {
+		struct mlx5_flow_tcf_tunnel_hdr *hdr;
+		struct mlx5_flow_tcf_vxlan_encap *vxlan;
+	} encap;
 	struct flow_tcf_ptoi ptoi[PTOI_TABLE_SZ_MAX(dev)];
 	struct nlmsghdr *nlh = dev_flow->tcf.nlh;
 	struct tcmsg *tcm = dev_flow->tcf.tcm;
@@ -2418,6 +2768,12 @@ struct pedit_parser {
 
 	claim_nonzero(flow_tcf_build_ptoi_table(dev, ptoi,
 						PTOI_TABLE_SZ_MAX(dev)));
+	encap.hdr = NULL;
+	decap.hdr = NULL;
+	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_ENCAP)
+		encap.vxlan = dev_flow->tcf.vxlan_encap;
+	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_DECAP)
+		decap.vxlan = dev_flow->tcf.vxlan_decap;
 	nlh = dev_flow->tcf.nlh;
 	tcm = dev_flow->tcf.tcm;
 	/* Prepare API must have been called beforehand. */
@@ -2435,7 +2791,6 @@ struct pedit_parser {
 		mnl_attr_put_u32(nlh, TCA_CHAIN, attr->group);
 	mnl_attr_put_strz(nlh, TCA_KIND, "flower");
 	na_flower = mnl_attr_nest_start(nlh, TCA_OPTIONS);
-	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS, TCA_CLS_FLAGS_SKIP_SW);
 	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
 		unsigned int i;
 
@@ -2479,6 +2834,12 @@ struct pedit_parser {
 						 spec.eth->type);
 				eth_type_set = 1;
 			}
+			/*
+			 * L2 addresses/masks should be sent anyway,
+			 * including VXLAN encap/decap cases, since the
+			 * kernel sometimes returns an error if no L2
+			 * address is provided and the skip_sw flag is set.
+			 */
 			if (!is_zero_ether_addr(&mask.eth->dst)) {
 				mnl_attr_put(nlh, TCA_FLOWER_KEY_ETH_DST,
 					     ETHER_ADDR_LEN,
@@ -2495,8 +2856,19 @@ struct pedit_parser {
 					     ETHER_ADDR_LEN,
 					     mask.eth->src.addr_bytes);
 			}
-			break;
+			if (decap.hdr) {
+				DRV_LOG(INFO,
+				"ethernet addresses are treated "
+				"as inner ones for tunnel decapsulation");
+			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
 		case RTE_FLOW_ITEM_TYPE_VLAN:
+			if (encap.hdr || decap.hdr)
+				return rte_flow_error_set(error, ENOTSUP,
+					  RTE_FLOW_ERROR_TYPE_ITEM, NULL,
+					  "outer VLAN is not "
+					  "supported for tunnels");
 			item_flags |= MLX5_FLOW_LAYER_OUTER_VLAN;
 			mask.vlan = flow_tcf_item_mask
 				(items, &rte_flow_item_vlan_mask,
@@ -2528,6 +2900,7 @@ struct pedit_parser {
 						 rte_be_to_cpu_16
 						 (spec.vlan->tci &
 						  RTE_BE16(0x0fff)));
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV4:
 			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
@@ -2538,36 +2911,53 @@ struct pedit_parser {
 				 sizeof(flow_tcf_mask_supported.ipv4),
 				 error);
 			assert(mask.ipv4);
-			if (!eth_type_set || !vlan_eth_type_set)
-				mnl_attr_put_u16(nlh,
-						 vlan_present ?
-						 TCA_FLOWER_KEY_VLAN_ETH_TYPE :
-						 TCA_FLOWER_KEY_ETH_TYPE,
-						 RTE_BE16(ETH_P_IP));
-			eth_type_set = 1;
-			vlan_eth_type_set = 1;
-			if (mask.ipv4 == &flow_tcf_mask_empty.ipv4)
-				break;
 			spec.ipv4 = items->spec;
-			if (mask.ipv4->hdr.next_proto_id) {
-				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
+			if (!decap.vxlan) {
+				if (!eth_type_set || !vlan_eth_type_set) {
+					mnl_attr_put_u16(nlh,
+						vlan_present ?
+						TCA_FLOWER_KEY_VLAN_ETH_TYPE :
+						TCA_FLOWER_KEY_ETH_TYPE,
+						RTE_BE16(ETH_P_IP));
+				}
+				eth_type_set = 1;
+				vlan_eth_type_set = 1;
+				if (mask.ipv4 == &flow_tcf_mask_empty.ipv4)
+					break;
+				if (mask.ipv4->hdr.next_proto_id) {
+					mnl_attr_put_u8
+						(nlh, TCA_FLOWER_KEY_IP_PROTO,
 						spec.ipv4->hdr.next_proto_id);
-				ip_proto_set = 1;
+					ip_proto_set = 1;
+				}
+			} else {
+				assert(mask.ipv4 != &flow_tcf_mask_empty.ipv4);
 			}
 			if (mask.ipv4->hdr.src_addr) {
-				mnl_attr_put_u32(nlh, TCA_FLOWER_KEY_IPV4_SRC,
-						 spec.ipv4->hdr.src_addr);
-				mnl_attr_put_u32(nlh,
-						 TCA_FLOWER_KEY_IPV4_SRC_MASK,
-						 mask.ipv4->hdr.src_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_SRC :
+					 TCA_FLOWER_KEY_IPV4_SRC,
+					 spec.ipv4->hdr.src_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK :
+					 TCA_FLOWER_KEY_IPV4_SRC_MASK,
+					 mask.ipv4->hdr.src_addr);
 			}
 			if (mask.ipv4->hdr.dst_addr) {
-				mnl_attr_put_u32(nlh, TCA_FLOWER_KEY_IPV4_DST,
-						 spec.ipv4->hdr.dst_addr);
-				mnl_attr_put_u32(nlh,
-						 TCA_FLOWER_KEY_IPV4_DST_MASK,
-						 mask.ipv4->hdr.dst_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_DST :
+					 TCA_FLOWER_KEY_IPV4_DST,
+					 spec.ipv4->hdr.dst_addr);
+				mnl_attr_put_u32
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_IPV4_DST_MASK :
+					 TCA_FLOWER_KEY_IPV4_DST_MASK,
+					 mask.ipv4->hdr.dst_addr);
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_IPV6:
 			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
@@ -2578,38 +2968,53 @@ struct pedit_parser {
 				 sizeof(flow_tcf_mask_supported.ipv6),
 				 error);
 			assert(mask.ipv6);
-			if (!eth_type_set || !vlan_eth_type_set)
-				mnl_attr_put_u16(nlh,
-						 vlan_present ?
-						 TCA_FLOWER_KEY_VLAN_ETH_TYPE :
-						 TCA_FLOWER_KEY_ETH_TYPE,
-						 RTE_BE16(ETH_P_IPV6));
-			eth_type_set = 1;
-			vlan_eth_type_set = 1;
-			if (mask.ipv6 == &flow_tcf_mask_empty.ipv6)
-				break;
 			spec.ipv6 = items->spec;
-			if (mask.ipv6->hdr.proto) {
-				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
-						spec.ipv6->hdr.proto);
-				ip_proto_set = 1;
+			if (!decap.vxlan) {
+				if (!eth_type_set || !vlan_eth_type_set) {
+					mnl_attr_put_u16(nlh,
+						vlan_present ?
+						TCA_FLOWER_KEY_VLAN_ETH_TYPE :
+						TCA_FLOWER_KEY_ETH_TYPE,
+						RTE_BE16(ETH_P_IPV6));
+				}
+				eth_type_set = 1;
+				vlan_eth_type_set = 1;
+				if (mask.ipv6 == &flow_tcf_mask_empty.ipv6)
+					break;
+				if (mask.ipv6->hdr.proto) {
+					mnl_attr_put_u8
+						(nlh, TCA_FLOWER_KEY_IP_PROTO,
+						 spec.ipv6->hdr.proto);
+					ip_proto_set = 1;
+				}
+			} else {
+				assert(mask.ipv6 != &flow_tcf_mask_empty.ipv6);
 			}
 			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6->hdr.src_addr)) {
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_SRC,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_SRC :
+					     TCA_FLOWER_KEY_IPV6_SRC,
 					     sizeof(spec.ipv6->hdr.src_addr),
 					     spec.ipv6->hdr.src_addr);
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_SRC_MASK,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK :
+					     TCA_FLOWER_KEY_IPV6_SRC_MASK,
 					     sizeof(mask.ipv6->hdr.src_addr),
 					     mask.ipv6->hdr.src_addr);
 			}
 			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6->hdr.dst_addr)) {
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_DST,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_DST :
+					     TCA_FLOWER_KEY_IPV6_DST,
 					     sizeof(spec.ipv6->hdr.dst_addr),
 					     spec.ipv6->hdr.dst_addr);
-				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_DST_MASK,
+				mnl_attr_put(nlh, decap.vxlan ?
+					     TCA_FLOWER_KEY_ENC_IPV6_DST_MASK :
+					     TCA_FLOWER_KEY_IPV6_DST_MASK,
 					     sizeof(mask.ipv6->hdr.dst_addr),
 					     mask.ipv6->hdr.dst_addr);
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_UDP:
 			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
@@ -2620,26 +3025,44 @@ struct pedit_parser {
 				 sizeof(flow_tcf_mask_supported.udp),
 				 error);
 			assert(mask.udp);
-			if (!ip_proto_set)
-				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
-						IPPROTO_UDP);
-			if (mask.udp == &flow_tcf_mask_empty.udp)
-				break;
 			spec.udp = items->spec;
+			if (!decap.vxlan) {
+				if (!ip_proto_set)
+					mnl_attr_put_u8
+						(nlh, TCA_FLOWER_KEY_IP_PROTO,
+						IPPROTO_UDP);
+				if (mask.udp == &flow_tcf_mask_empty.udp)
+					break;
+			} else {
+				assert(mask.udp != &flow_tcf_mask_empty.udp);
+				decap.vxlan->udp_port
+					= RTE_BE16(spec.udp->hdr.dst_port);
+			}
 			if (mask.udp->hdr.src_port) {
-				mnl_attr_put_u16(nlh, TCA_FLOWER_KEY_UDP_SRC,
-						 spec.udp->hdr.src_port);
-				mnl_attr_put_u16(nlh,
-						 TCA_FLOWER_KEY_UDP_SRC_MASK,
-						 mask.udp->hdr.src_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_SRC_PORT :
+					 TCA_FLOWER_KEY_UDP_SRC,
+					 spec.udp->hdr.src_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK :
+					 TCA_FLOWER_KEY_UDP_SRC_MASK,
+					 mask.udp->hdr.src_port);
 			}
 			if (mask.udp->hdr.dst_port) {
-				mnl_attr_put_u16(nlh, TCA_FLOWER_KEY_UDP_DST,
-						 spec.udp->hdr.dst_port);
-				mnl_attr_put_u16(nlh,
-						 TCA_FLOWER_KEY_UDP_DST_MASK,
-						 mask.udp->hdr.dst_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_DST_PORT :
+					 TCA_FLOWER_KEY_UDP_DST,
+					 spec.udp->hdr.dst_port);
+				mnl_attr_put_u16
+					(nlh, decap.vxlan ?
+					 TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK :
+					 TCA_FLOWER_KEY_UDP_DST_MASK,
+					 mask.udp->hdr.dst_port);
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ITEM_TYPE_TCP:
 			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
@@ -2682,7 +3105,16 @@ struct pedit_parser {
 					 rte_cpu_to_be_16
 						(mask.tcp->hdr.tcp_flags));
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
+		case RTE_FLOW_ITEM_TYPE_VXLAN:
+			assert(decap.vxlan);
+			spec.vxlan = items->spec;
+			mnl_attr_put_u32(nlh,
+					 TCA_FLOWER_KEY_ENC_KEY_ID,
+					 vxlan_vni_as_be32(spec.vxlan->vni));
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
 		default:
 			return rte_flow_error_set(error, ENOTSUP,
 						  RTE_FLOW_ERROR_TYPE_ITEM,
@@ -2715,6 +3146,14 @@ struct pedit_parser {
 			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "mirred");
 			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
 			assert(na_act);
+			if (encap.hdr) {
+				assert(dev_flow->tcf.tunnel);
+				dev_flow->tcf.tunnel->ifindex_ptr =
+					&((struct tc_mirred *)
+					mnl_attr_get_payload
+					(mnl_nlmsg_get_payload_tail
+						(nlh)))->ifindex;
+			}
 			mnl_attr_put(nlh, TCA_MIRRED_PARMS,
 				     sizeof(struct tc_mirred),
 				     &(struct tc_mirred){
@@ -2724,6 +3163,7 @@ struct pedit_parser {
 				     });
 			mnl_attr_nest_end(nlh, na_act);
 			mnl_attr_nest_end(nlh, na_act_index);
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ACTION_TYPE_JUMP:
 			conf.jump = actions->conf;
@@ -2741,6 +3181,7 @@ struct pedit_parser {
 				     });
 			mnl_attr_nest_end(nlh, na_act);
 			mnl_attr_nest_end(nlh, na_act_index);
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ACTION_TYPE_DROP:
 			na_act_index =
@@ -2827,6 +3268,76 @@ struct pedit_parser {
 					(na_vlan_priority) =
 					conf.of_set_vlan_pcp->vlan_pcp;
 			}
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
+			assert(decap.vxlan);
+			assert(dev_flow->tcf.tunnel);
+			dev_flow->tcf.tunnel->ifindex_ptr
+				= (unsigned int *)&tcm->tcm_ifindex;
+			na_act_index =
+				mnl_attr_nest_start(nlh, na_act_index_cur++);
+			assert(na_act_index);
+			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "tunnel_key");
+			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
+			assert(na_act);
+			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
+				sizeof(struct tc_tunnel_key),
+				&(struct tc_tunnel_key){
+					.action = TC_ACT_PIPE,
+					.t_action = TCA_TUNNEL_KEY_ACT_RELEASE,
+					});
+			mnl_attr_nest_end(nlh, na_act);
+			mnl_attr_nest_end(nlh, na_act_index);
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
+			break;
+		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
+			assert(encap.vxlan);
+			na_act_index =
+				mnl_attr_nest_start(nlh, na_act_index_cur++);
+			assert(na_act_index);
+			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "tunnel_key");
+			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
+			assert(na_act);
+			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
+				sizeof(struct tc_tunnel_key),
+				&(struct tc_tunnel_key){
+					.action = TC_ACT_PIPE,
+					.t_action = TCA_TUNNEL_KEY_ACT_SET,
+					});
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_UDP_DST)
+				mnl_attr_put_u16(nlh,
+					 TCA_TUNNEL_KEY_ENC_DST_PORT,
+					 encap.vxlan->udp.dst);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
+				mnl_attr_put_u32(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
+					 encap.vxlan->ipv4.src);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST)
+				mnl_attr_put_u32(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
+					 encap.vxlan->ipv4.dst);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
+				mnl_attr_put(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
+					 sizeof(encap.vxlan->ipv6.src),
+					 &encap.vxlan->ipv6.src);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST)
+				mnl_attr_put(nlh,
+					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
+					 sizeof(encap.vxlan->ipv6.dst),
+					 &encap.vxlan->ipv6.dst);
+			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
+				mnl_attr_put_u32(nlh,
+					 TCA_TUNNEL_KEY_ENC_KEY_ID,
+					 vxlan_vni_as_be32
+						(encap.vxlan->vxlan.vni));
+#ifdef TCA_TUNNEL_KEY_NO_CSUM
+			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM, 0);
+#endif
+			mnl_attr_nest_end(nlh, na_act);
+			mnl_attr_nest_end(nlh, na_act_index);
+			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 			break;
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
 		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
@@ -2850,7 +3361,11 @@ struct pedit_parser {
 	assert(na_flower);
 	assert(na_flower_act);
 	mnl_attr_nest_end(nlh, na_flower_act);
+	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS,
+			 dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_DECAP
+			 ? 0 : TCA_CLS_FLAGS_SKIP_SW);
 	mnl_attr_nest_end(nlh, na_flower);
+	assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
 	return 0;
 }
 
-- 
1.8.3.1
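
The 24-bit VNI conversion done by vxlan_vni_as_be32() in the patch above
can be checked in isolation. The following is a minimal standalone sketch,
not the driver code: the name vni_to_be32() is hypothetical and a plain
uint32_t stands in for rte_be32_t. It builds the network-order value byte
by byte, so it does not depend on a byte-swap macro or on host endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for vxlan_vni_as_be32(): pack a 24-bit VNI given
 * in wire order into a 32-bit value whose in-memory bytes are
 * 00 vni[0] vni[1] vni[2], i.e. network byte order.
 */
static uint32_t
vni_to_be32(const uint8_t vni[3])
{
	uint8_t bytes[4];
	uint32_t ret;

	assert(vni);
	bytes[0] = 0;
	bytes[1] = vni[0];
	bytes[2] = vni[1];
	bytes[3] = vni[2];
	/* Copy the bytes as they are, no host-order arithmetic involved. */
	memcpy(&ret, bytes, sizeof(ret));
	return ret;
}
```

The returned value is meant to be passed unchanged to a Netlink attribute
that expects a big-endian 32-bit key identifier.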


* [dpdk-dev] [PATCH v2 4/7] net/mlx5: e-switch VXLAN netlink routines update
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
                     ` (2 preceding siblings ...)
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine Viacheslav Ovsiienko
@ 2018-10-15 14:13   ` Viacheslav Ovsiienko
  2018-10-23 10:07     ` Yongseok Koh
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management Viacheslav Ovsiienko
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

This part of the patchset updates the Netlink exchange routine. Message
sequence numbers are no longer random, multipart reply messages are
supported, and errors are not propagated to the following socket calls.
The Netlink reply buffer size is increased to MNL_SOCKET_BUFFER_SIZE and
the buffer is now preallocated at context creation time instead of being
placed on the stack. This update is needed to support Netlink query
operations.
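
The two rules this update relies on, sequence numbering and multipart
termination, can be sketched without libmnl. This is an illustration only:
next_nl_seq() and nl_reply_complete() are hypothetical names, and only the
struct nlmsghdr layout and the NLM_F_MULTI/NLMSG_DONE constants come from
the kernel headers. Zero is skipped as a sequence number in the same way
the patch does, presumably so that reply sequence checking stays enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/netlink.h>

/* Hypothetical helper: return the next non-zero sequence number. */
static uint32_t
next_nl_seq(uint32_t *counter)
{
	uint32_t seq;

	assert(counter);
	seq = (*counter)++;
	if (!seq) /* Zero is skipped, take the following value instead. */
		seq = (*counter)++;
	return seq;
}

/*
 * Hypothetical helper: true when no further datagram is expected, i.e.
 * the reply is not multipart or the NLMSG_DONE terminator was received.
 */
static bool
nl_reply_complete(const struct nlmsghdr *nlh)
{
	assert(nlh);
	return !(nlh->nlmsg_flags & NLM_F_MULTI) ||
	       nlh->nlmsg_type == NLMSG_DONE;
}
```

A receive loop would call nl_reply_complete() on the first header of each
datagram and keep reading while it returns false.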

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 82 +++++++++++++++++++++++++++++-----------
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 660d45e..d6840d5 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -3372,37 +3372,75 @@ struct pedit_parser {
 /**
  * Send Netlink message with acknowledgment.
  *
- * @param ctx
+ * @param tcf
  *   Flow context to use.
  * @param nlh
  *   Message to send. This function always raises the NLM_F_ACK flag before
  *   sending.
+ * @param[in] msglen
+ *   Message length. Message buffer may contain multiple commands and
+ *   nlmsg_len field does not always correspond to actual message length.
+ *   If 0 is specified, the nlmsg_len field in header is used as length.
+ * @param[in] cb
+ *   Callback handler for received message.
+ * @param[in] arg
+ *   Context pointer for callback handler.
  *
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
 static int
-flow_tcf_nl_ack(struct mlx5_flow_tcf_context *ctx, struct nlmsghdr *nlh)
+flow_tcf_nl_ack(struct mlx5_flow_tcf_context *tcf,
+		struct nlmsghdr *nlh,
+		uint32_t msglen,
+		mnl_cb_t cb, void *arg)
 {
-	alignas(struct nlmsghdr)
-	uint8_t ans[mnl_nlmsg_size(sizeof(struct nlmsgerr)) +
-		    nlh->nlmsg_len - sizeof(*nlh)];
-	uint32_t seq = ctx->seq++;
-	struct mnl_socket *nl = ctx->nl;
-	int ret;
-
-	nlh->nlmsg_flags |= NLM_F_ACK;
+	unsigned int portid = mnl_socket_get_portid(tcf->nl);
+	uint32_t seq = tcf->seq++;
+	int err, ret;
+
+	assert(tcf->nl);
+	assert(tcf->buf);
+	if (!seq)
+		seq = tcf->seq++;
 	nlh->nlmsg_seq = seq;
-	ret = mnl_socket_sendto(nl, nlh, nlh->nlmsg_len);
-	if (ret != -1)
-		ret = mnl_socket_recvfrom(nl, ans, sizeof(ans));
-	if (ret != -1)
-		ret = mnl_cb_run
-			(ans, ret, seq, mnl_socket_get_portid(nl), NULL, NULL);
+	if (!msglen) {
+		msglen = nlh->nlmsg_len;
+		nlh->nlmsg_flags |= NLM_F_ACK;
+	}
+	ret = mnl_socket_sendto(tcf->nl, nlh, msglen);
+	err = (ret <= 0) ? errno : 0;
+	nlh = (struct nlmsghdr *)(tcf->buf);
+	/*
+	 * The following loop postpones non-fatal errors until multipart
+	 * messages are complete.
+	 */
 	if (ret > 0)
+		while (true) {
+			ret = mnl_socket_recvfrom(tcf->nl, tcf->buf,
+						  tcf->buf_size);
+			if (ret < 0) {
+				err = errno;
+				if (err != ENOSPC)
+					break;
+			}
+			if (!err) {
+				ret = mnl_cb_run(nlh, ret, seq, portid,
+						 cb, arg);
+				if (ret < 0) {
+					err = errno;
+					break;
+				}
+			}
+			/* Receive until the end of the multipart message. */
+			if (!(nlh->nlmsg_flags & NLM_F_MULTI) ||
+			      nlh->nlmsg_type == NLMSG_DONE)
+				break;
+		}
+	if (!err)
 		return 0;
-	rte_errno = errno;
-	return -rte_errno;
+	rte_errno = err;
+	return -err;
 }
 
 /**
@@ -3433,7 +3471,7 @@ struct pedit_parser {
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_NEWTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
-	if (!flow_tcf_nl_ack(nl, nlh))
+	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
 		return 0;
 	return rte_flow_error_set(error, rte_errno,
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
@@ -3466,7 +3504,7 @@ struct pedit_parser {
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_DELTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST;
-	flow_tcf_nl_ack(nl, nlh);
+	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
 }
 
 /**
@@ -3842,7 +3880,7 @@ struct pedit_parser {
 	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
 	tcm->tcm_parent = TC_H_INGRESS;
 	/* Ignore errors when qdisc is already absent. */
-	if (flow_tcf_nl_ack(nl, nlh) &&
+	if (flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL) &&
 	    rte_errno != EINVAL && rte_errno != ENOENT)
 		return rte_flow_error_set(error, rte_errno,
 					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
@@ -3858,7 +3896,7 @@ struct pedit_parser {
 	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
 	tcm->tcm_parent = TC_H_INGRESS;
 	mnl_attr_put_strz_check(nlh, sizeof(buf), TCA_KIND, "ingress");
-	if (flow_tcf_nl_ack(nl, nlh))
+	if (flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
 		return rte_flow_error_set(error, rte_errno,
 					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
 					  "netlink: failed to create ingress"
-- 
1.8.3.1


* [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
                     ` (3 preceding siblings ...)
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 4/7] net/mlx5: e-switch VXLAN netlink routines update Viacheslav Ovsiienko
@ 2018-10-15 14:13   ` Viacheslav Ovsiienko
  2018-10-25  0:28     ` Yongseok Koh
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 6/7] net/mlx5: e-switch VXLAN encapsulation rules management Viacheslav Ovsiienko
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

VXLAN interfaces are dynamically created for each local UDP port
of outer networks and then used as targets for TC "flower" filters
in order to perform encapsulation. These VXLAN interfaces are
system-wide: only one device with a given UDP port can exist
in the system (an attempt to create another device with the
same local UDP port returns EEXIST), so the PMD should maintain a
database of device instances shared between PMD instances. These
implicitly created VXLAN devices are called VTEPs (Virtual Tunnel
End Points).
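
The shared database described above amounts to a reference-counted lookup
keyed by the local UDP port. The following is a simplified sketch under
stated assumptions: struct vtep, vtep_get() and vtep_put() are hypothetical
names, the actual Netlink device creation and deletion is left out, and the
locking that the patch does with a mutex is omitted for brevity.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct vtep {			/* Hypothetical, simplified record. */
	struct vtep *next;
	uint16_t port;		/* Local UDP port, unique system-wide. */
	unsigned int refcnt;	/* Number of rules using this device. */
};

static struct vtep *vtep_list;	/* Shared between all port instances. */

/* Return the existing entry for the port or create a new one. */
static struct vtep *
vtep_get(uint16_t port)
{
	struct vtep *v;

	for (v = vtep_list; v; v = v->next) {
		if (v->port == port) {
			v->refcnt++;
			return v;
		}
	}
	v = calloc(1, sizeof(*v));
	if (!v)
		return NULL;
	v->port = port;
	v->refcnt = 1;
	v->next = vtep_list;
	vtep_list = v;
	return v;
}

/* Drop one reference and free the entry when it is no longer used. */
static void
vtep_put(struct vtep *v)
{
	struct vtep **pp;

	assert(v && v->refcnt);
	if (--v->refcnt)
		return;
	for (pp = &vtep_list; *pp; pp = &(*pp)->next) {
		if (*pp == v) {
			*pp = v->next;
			break;
		}
	}
	free(v);
}
```

Two rules that use the same UDP port get the same entry, so the second
request never tries to create a duplicate device and never sees EEXIST.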

The VTEP is created at the moment the rule is applied. The
link is set up and the root ingress qdisc is also initialized.

Encapsulation VTEPs are created on a per-port basis: a single
VTEP is attached to the outer interface and is shared by all
encapsulation rules on this interface. The source UDP port is
automatically selected in the range 30000-60000.

For decapsulation, one VTEP is created for every unique local
UDP port to accept tunnel traffic. The name of the created
VTEP consists of the prefix "vmlx_" and the UDP port number in
decimal digits without leading zeros (vmlx_4789). The VTEP
can be created in the system before the application is
launched, which allows UDP ports to be shared between primary
and secondary processes.
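
The naming rule can be illustrated with a few lines of C. This is a sketch
only: VTEP_PFX and vtep_name() are hypothetical names, and only the "vmlx_"
prefix and the decimal port suffix are taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define VTEP_PFX "vmlx_" /* Prefix quoted in the commit message. */

/*
 * Hypothetical helper: format the VTEP device name for a UDP port.
 * Returns false if the name does not fit into the buffer.
 */
static bool
vtep_name(char *buf, size_t len, uint16_t port)
{
	int n;

	assert(buf);
	/* %u prints the port in decimal without leading zeros. */
	n = snprintf(buf, len, VTEP_PFX "%u", (unsigned int)port);
	return n > 0 && (size_t)n < len;
}
```

With this rule, port 4789 maps to the name vmlx_4789, so a device created
before the application starts can be found by name.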

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 503 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 499 insertions(+), 4 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index d6840d5..efa9c3b 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -3443,6 +3443,432 @@ struct pedit_parser {
 	return -err;
 }
 
+/* VTEP device list is shared between PMD port instances. */
+static LIST_HEAD(, mlx5_flow_tcf_vtep)
+			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
+static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
+
+/**
+ * Deletes VTEP network device.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] vtep
+ *   Object representing the network device to delete. Memory
+ *   allocated for this object is freed by the routine.
+ */
+static void
+flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
+		      struct mlx5_flow_tcf_vtep *vtep)
+{
+	struct nlmsghdr *nlh;
+	struct ifinfomsg *ifm;
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
+	int ret;
+
+	assert(!vtep->refcnt);
+	if (vtep->created && vtep->ifindex) {
+		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
+		nlh = mnl_nlmsg_put_header(buf);
+		nlh->nlmsg_type = RTM_DELLINK;
+		nlh->nlmsg_flags = NLM_F_REQUEST;
+		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+		ifm->ifi_family = AF_UNSPEC;
+		ifm->ifi_index = vtep->ifindex;
+		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+		if (ret)
+			DRV_LOG(WARNING, "netlink: error deleting VXLAN "
+					 "encap/decap ifindex %u",
+					 ifm->ifi_index);
+	}
+	rte_free(vtep);
+}
+
+/**
+ * Creates VTEP network device.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifouter
+ *   Outer interface to attach the newly created VXLAN device to.
+ *   If zero, the VXLAN device will not be attached to any device.
+ * @param[in] port
+ *   UDP port of created VTEP device.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ * Pointer to created device structure on success, NULL otherwise
+ * and rte_errno is set.
+ */
+#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
+static struct mlx5_flow_tcf_vtep*
+flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
+		      unsigned int ifouter __rte_unused,
+		      uint16_t port __rte_unused,
+		      struct rte_flow_error *error)
+{
+	rte_flow_error_set(error, ENOTSUP,
+			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			 "netlink: failed to create VTEP, "
+			 "VXLAN metadata is not supported by kernel");
+	return NULL;
+}
+#else
+static struct mlx5_flow_tcf_vtep*
+flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
+		      unsigned int ifouter,
+		      uint16_t port, struct rte_flow_error *error)
+{
+	struct mlx5_flow_tcf_vtep *vtep;
+	struct nlmsghdr *nlh;
+	struct ifinfomsg *ifm;
+	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
+		       SZ_NLATTR_DATA_OF(sizeof(name)) +
+		       SZ_NLATTR_NEST * 2 +
+		       SZ_NLATTR_STRZ_OF("vxlan") +
+		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
+		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
+		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
+		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
+	struct nlattr *na_info;
+	struct nlattr *na_vxlan;
+	rte_be16_t vxlan_port = rte_cpu_to_be_16(port);
+	int ret;
+
+	vtep = rte_zmalloc(__func__, sizeof(*vtep),
+			alignof(struct mlx5_flow_tcf_vtep));
+	if (!vtep) {
+		rte_flow_error_set
+			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			 NULL, "unable to allocate memory for VTEP desc");
+		return NULL;
+	}
+	*vtep = (struct mlx5_flow_tcf_vtep){
+			.refcnt = 0,
+			.port = port,
+			.created = 0,
+			.ifouter = 0,
+			.ifindex = 0,
+			.local = LIST_HEAD_INITIALIZER(),
+			.neigh = LIST_HEAD_INITIALIZER(),
+	};
+	memset(buf, 0, sizeof(buf));
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = RTM_NEWLINK;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  | NLM_F_EXCL;
+	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+	ifm->ifi_family = AF_UNSPEC;
+	ifm->ifi_type = 0;
+	ifm->ifi_index = 0;
+	ifm->ifi_flags = IFF_UP;
+	ifm->ifi_change = 0xffffffff;
+	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, port);
+	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
+	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
+	assert(na_info);
+	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
+	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
+	if (ifouter)
+		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
+	assert(na_vxlan);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
+	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
+	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
+	mnl_attr_nest_end(nlh, na_vxlan);
+	mnl_attr_nest_end(nlh, na_info);
+	assert(sizeof(buf) >= nlh->nlmsg_len);
+	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+	if (ret)
+		DRV_LOG(WARNING,
+			"netlink: VTEP %s create failure (%d)",
+			name, rte_errno);
+	else
+		vtep->created = 1;
+	if (ret && ifouter)
+		ret = 0;
+	else
+		ret = if_nametoindex(name);
+	if (ret) {
+		vtep->ifindex = ret;
+		vtep->ifouter = ifouter;
+		memset(buf, 0, sizeof(buf));
+		nlh = mnl_nlmsg_put_header(buf);
+		nlh->nlmsg_type = RTM_NEWLINK;
+		nlh->nlmsg_flags = NLM_F_REQUEST;
+		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+		ifm->ifi_family = AF_UNSPEC;
+		ifm->ifi_type = 0;
+		ifm->ifi_index = vtep->ifindex;
+		ifm->ifi_flags = IFF_UP;
+		ifm->ifi_change = IFF_UP;
+		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+		if (ret) {
+			DRV_LOG(WARNING,
+				"netlink: VTEP %s set link up failure (%d)",
+				name, rte_errno);
+			/* The VTEP is deleted and freed by the common
+			 * cleanup path below, so it is not freed here. */
+			rte_flow_error_set
+				(error, rte_errno,
+				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				 "netlink: failed to set VTEP link up");
+		} else {
+			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
+			if (ret)
+				DRV_LOG(WARNING,
+				"VTEP %s init failure (%d)", name, rte_errno);
+		}
+	} else {
+		DRV_LOG(WARNING,
+			"VTEP %s failed to get index (%d)", name, errno);
+		rte_flow_error_set
+			(error, vtep->created ? errno : rte_errno,
+			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+			 !vtep->created ? "netlink: failed to create VTEP" :
+			 "netlink: failed to retrieve VTEP ifindex");
+		ret = 1;
+	}
+	if (ret) {
+		flow_tcf_delete_iface(tcf, vtep);
+		vtep = NULL;
+	}
+	DRV_LOG(INFO, "VTEP create (%d, %s)", port, vtep ? "OK" : "error");
+	return vtep;
+}
+#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
+
+/**
+ * Creates target interface index for VXLAN tunneling decapsulation.
+ * In order to share the UDP port with other interfaces, the VXLAN
+ * device is created unattached to any interface (if created at all).
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ * @return
+ *   Interface index on success, zero otherwise and rte_errno is set.
+ */
+static unsigned int
+flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
+			   struct mlx5_flow *dev_flow,
+			   struct rte_flow_error *error)
+{
+	struct mlx5_flow_tcf_vtep *vtep, *vlst;
+	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
+
+	vtep = NULL;
+	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+		if (vlst->port == port) {
+			vtep = vlst;
+			break;
+		}
+	}
+	if (!vtep) {
+		vtep = flow_tcf_create_iface(tcf, 0, port, error);
+		if (vtep)
+			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
+	} else {
+		if (vtep->ifouter) {
+			rte_flow_error_set(error, EEXIST,
+				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				"Failed to create decap VTEP, attached "
+				"device with the same UDP port exists");
+			vtep = NULL;
+		}
+	}
+	if (vtep) {
+		vtep->refcnt++;
+		assert(vtep->ifindex);
+		return vtep->ifindex;
+	} else {
+		return 0;
+	}
+}
+
+/**
+ * Creates target interface index for VXLAN tunneling encapsulation.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifouter
+ *   Network interface index to attach VXLAN encap device to.
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ * @return
+ *   Interface index on success, zero otherwise and rte_errno is set.
+ */
+static unsigned int
+flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifouter,
+			    struct mlx5_flow *dev_flow __rte_unused,
+			    struct rte_flow_error *error)
+{
+	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
+	struct mlx5_flow_tcf_vtep *vtep, *vlst;
+
+	assert(ifouter);
+	/* Look whether the attached VTEP for encap is created. */
+	vtep = NULL;
+	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+		if (vlst->ifouter == ifouter) {
+			vtep = vlst;
+			break;
+		}
+	}
+	if (!vtep) {
+		uint16_t pcnt;
+
+		/* Not found, we should create the new attached VTEP. */
+/*
+ * TODO: not implemented yet
+ * flow_tcf_encap_iface_cleanup(tcf, ifouter);
+ * flow_tcf_encap_local_cleanup(tcf, ifouter);
+ * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
+ */
+		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
+				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
+			encap_port++;
+			/* Wrap around the UDP port index. */
+			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN ||
+			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
+				encap_port = MLX5_VXLAN_PORT_RANGE_MIN;
+			/* Check whether the UDP port is already in use. */
+			vtep = NULL;
+			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+				if (vlst->port == encap_port) {
+					vtep = vlst;
+					break;
+				}
+			}
+			if (vtep) {
+				vtep = NULL;
+				continue;
+			}
+			vtep = flow_tcf_create_iface(tcf, ifouter,
+						     encap_port, error);
+			if (vtep) {
+				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
+				break;
+			}
+			if (rte_errno != EEXIST)
+				break;
+		}
+	}
+	if (!vtep)
+		return 0;
+	vtep->refcnt++;
+	assert(vtep->ifindex);
+	return vtep->ifindex;
+}
+
+/**
+ * Creates target interface index for tunneling of any type.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifouter
+ *   Network interface index to attach VXLAN encap device to.
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ * @return
+ *   Interface index on success, zero otherwise and rte_errno is set.
+ */
+static unsigned int
+flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifouter,
+			    struct mlx5_flow *dev_flow,
+			    struct rte_flow_error *error)
+{
+	unsigned int ret;
+
+	assert(dev_flow->tcf.tunnel);
+	pthread_mutex_lock(&vtep_list_mutex);
+	switch (dev_flow->tcf.tunnel->type) {
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
+		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
+						 dev_flow, error);
+		break;
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
+		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
+		break;
+	default:
+		rte_flow_error_set(error, ENOTSUP,
+				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+				"unsupported tunnel type");
+		ret = 0;
+		break;
+	}
+	pthread_mutex_unlock(&vtep_list_mutex);
+	return ret;
+}
+
+/**
+ * Deletes tunneling interface by interface index.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifindex
+ *   Network interface index of VXLAN device.
+ * @param[in] dev_flow
+ *   Flow tcf object with tunnel structure pointer set.
+ */
+static void
+flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifindex,
+			    struct mlx5_flow *dev_flow)
+{
+	struct mlx5_flow_tcf_vtep *vtep, *vlst;
+
+	assert(dev_flow->tcf.tunnel);
+	pthread_mutex_lock(&vtep_list_mutex);
+	vtep = NULL;
+	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
+		if (vlst->ifindex == ifindex) {
+			vtep = vlst;
+			break;
+		}
+	}
+	if (!vtep) {
+		DRV_LOG(WARNING, "No VTEP device found in the list");
+		goto exit;
+	}
+	switch (dev_flow->tcf.tunnel->type) {
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
+		break;
+	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
+/*
+ * TODO: Remove the encap ancillary rules first.
+ * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
+ * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
+ */
+		break;
+	default:
+		assert(false);
+		DRV_LOG(WARNING, "Unsupported tunnel type");
+		break;
+	}
+	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
+	assert(vtep->refcnt);
+	if (!vtep->refcnt || !--vtep->refcnt) {
+		LIST_REMOVE(vtep, next);
+		flow_tcf_delete_iface(tcf, vtep);
+	}
+exit:
+	pthread_mutex_unlock(&vtep_list_mutex);
+}
+
 /**
  * Apply flow to E-Switch by sending Netlink message.
  *
@@ -3461,18 +3887,61 @@ struct pedit_parser {
 	       struct rte_flow_error *error)
 {
 	struct priv *priv = dev->data->dev_private;
-	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
+	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
+	int ret;
 
 	dev_flow = LIST_FIRST(&flow->dev_flows);
 	/* E-Switch flow can't be expanded. */
 	assert(!LIST_NEXT(dev_flow, next));
+	if (dev_flow->tcf.applied)
+		return 0;
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_NEWTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
-	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
+	if (dev_flow->tcf.tunnel) {
+		/*
+		 * Replace the interface index, target for
+		 * encapsulation, source for decapsulation.
+		 */
+		assert(!dev_flow->tcf.tunnel->ifindex_tun);
+		assert(dev_flow->tcf.tunnel->ifindex_ptr);
+		/* Create actual VTEP device when rule is being applied. */
+		dev_flow->tcf.tunnel->ifindex_tun
+			= flow_tcf_tunnel_vtep_create(tcf,
+					*dev_flow->tcf.tunnel->ifindex_ptr,
+					dev_flow, error);
+		DRV_LOG(INFO, "Replace ifindex: %d->%d",
+			*dev_flow->tcf.tunnel->ifindex_ptr,
+			dev_flow->tcf.tunnel->ifindex_tun);
+		if (!dev_flow->tcf.tunnel->ifindex_tun)
+			return -rte_errno;
+		dev_flow->tcf.tunnel->ifindex_org
+			= *dev_flow->tcf.tunnel->ifindex_ptr;
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_tun;
+	}
+	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+	if (dev_flow->tcf.tunnel) {
+		DRV_LOG(INFO, "Restore ifindex: %d->%d",
+			*dev_flow->tcf.tunnel->ifindex_ptr,
+			dev_flow->tcf.tunnel->ifindex_org);
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_org;
+		dev_flow->tcf.tunnel->ifindex_org = 0;
+	}
+	if (!ret) {
+		dev_flow->tcf.applied = 1;
 		return 0;
+	}
+	DRV_LOG(WARNING, "netlink: failed to create TC rule (%d)", rte_errno);
+	if (dev_flow->tcf.tunnel && dev_flow->tcf.tunnel->ifindex_tun) {
+		flow_tcf_tunnel_vtep_delete(tcf,
+					    dev_flow->tcf.tunnel->ifindex_tun,
+					    dev_flow);
+		dev_flow->tcf.tunnel->ifindex_tun = 0;
+	}
 	return rte_flow_error_set(error, rte_errno,
 				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
 				  "netlink: failed to create TC flow rule");
@@ -3490,7 +3959,7 @@ struct pedit_parser {
 flow_tcf_remove(struct rte_eth_dev *dev, struct rte_flow *flow)
 {
 	struct priv *priv = dev->data->dev_private;
-	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
+	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
 	struct mlx5_flow *dev_flow;
 	struct nlmsghdr *nlh;
 
@@ -3501,10 +3970,36 @@ struct pedit_parser {
 		return;
 	/* E-Switch flow can't be expanded. */
 	assert(!LIST_NEXT(dev_flow, next));
+	if (!dev_flow->tcf.applied)
+		return;
+	if (dev_flow->tcf.tunnel) {
+		/*
+		 * Replace the interface index, target for
+		 * encapsulation, source for decapsulation.
+		 */
+		assert(dev_flow->tcf.tunnel->ifindex_tun);
+		assert(dev_flow->tcf.tunnel->ifindex_ptr);
+		dev_flow->tcf.tunnel->ifindex_org
+			= *dev_flow->tcf.tunnel->ifindex_ptr;
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_tun;
+	}
 	nlh = dev_flow->tcf.nlh;
 	nlh->nlmsg_type = RTM_DELTFILTER;
 	nlh->nlmsg_flags = NLM_F_REQUEST;
-	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
+	flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
+	if (dev_flow->tcf.tunnel) {
+		*dev_flow->tcf.tunnel->ifindex_ptr
+			= dev_flow->tcf.tunnel->ifindex_org;
+		dev_flow->tcf.tunnel->ifindex_org = 0;
+		if (dev_flow->tcf.tunnel->ifindex_tun) {
+			flow_tcf_tunnel_vtep_delete(tcf,
+					dev_flow->tcf.tunnel->ifindex_tun,
+					dev_flow);
+			dev_flow->tcf.tunnel->ifindex_tun = 0;
+		}
+	}
+	dev_flow->tcf.applied = 0;
 }
 
 /**
-- 
1.8.3.1

* [dpdk-dev] [PATCH v2 6/7] net/mlx5: e-switch VXLAN encapsulation rules management
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
                     ` (4 preceding siblings ...)
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management Viacheslav Ovsiienko
@ 2018-10-15 14:13   ` Viacheslav Ovsiienko
  2018-10-25  0:33     ` Yongseok Koh
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines Viacheslav Ovsiienko
  2018-11-01 12:19   ` [dpdk-dev] [PATCH v3 00/13] net/mlx5: e-switch VXLAN encap/decap hardware offload Slava Ovsiienko
  7 siblings, 1 reply; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

VXLAN encap rules are applied to the VF ingress traffic and use the
VTEP as the actual redirection destination instead of the outer PF.
The encapsulation rule should provide:
- redirection action VF->PF
- VF port ID
- some inner network parameters (MACs/IP)
- the tunnel outer source IP (v4/v6)
- the tunnel outer destination IP (v4/v6). Current
- VNI - Virtual Network Identifier

No direct way was found to provide the kernel with all required
encapsulation header parameters. The encapsulation VTEP is created
attached to the outer interface and is assumed to be the default path
for egress encapsulated traffic. The outer tunnel IP addresses are
assigned to the interface using Netlink, and the implicit route is
created like this:

  ip addr add <src_ip> peer <dst_ip> dev <outer> scope link

The peer address provides the implicit route, and scope link reduces
the risk of conflicts. At initialization time all local scope
link addresses are flushed from the device (see the next part of
the patchset).
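
As a rough illustration of the address request described above, the
fixed part of the RTM_NEWADDR message could be filled as in the
sketch below. The helper name encap_fill_ifaddrmsg() is hypothetical
and not part of the patch. The source and peer addresses themselves
travel as IFA_LOCAL and IFA_ADDRESS attributes appended after this
header.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_addr.h>
#include <linux/rtnetlink.h>

/*
 * Fill the fixed ifaddrmsg header for a host address (/32 for IPv4,
 * /128 for IPv6) with link scope on the given interface.
 * Illustrative helper only, assuming Linux UAPI headers are present.
 */
static void
encap_fill_ifaddrmsg(struct ifaddrmsg *ifa, int family, unsigned int ifindex)
{
	memset(ifa, 0, sizeof(*ifa));
	ifa->ifa_family = family;
	ifa->ifa_prefixlen = (family == AF_INET) ? 32 : 128;
	ifa->ifa_flags = IFA_F_PERMANENT;
	ifa->ifa_scope = RT_SCOPE_LINK; /* "scope link" */
	ifa->ifa_index = ifindex;       /* "dev <outer>" */
}
```

Keeping the header setup separate from the attributes makes it easy
to reuse the same code for both the add and the remove request, which
differ only in the Netlink message type and flags.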

The destination MAC address is provided via a permanent neigh rule:

  ip neigh add dev <outer> lladdr <dst_mac> to <dst_ip> nud permanent

At initialization time all neigh rules of this type are flushed
from the device (see the next part of the patchset).
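
Several flows may share the same outer destination, so the neigh
rules are reference counted and a rule with a conflicting MAC for
the same destination IP is rejected. A simplified sketch of that
bookkeeping follows. It assumes IPv4 only and a fixed-size table
instead of the per-VTEP list used by the patch, and all names in it
(neigh_entry, neigh_get, neigh_put, NEIGH_MAX) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define NEIGH_MAX 8 /* Arbitrary table size for this sketch. */

struct neigh_entry {
	uint32_t dst_ip;     /* Outer destination IPv4, network order. */
	uint8_t dst_mac[6];  /* Outer destination MAC. */
	unsigned int refcnt; /* Number of flows using the entry. */
};

static struct neigh_entry neigh_tbl[NEIGH_MAX];

/*
 * Take a reference on the (dst_ip, dst_mac) entry. Returns 1 when the
 * entry is new (the caller would then issue "ip neigh add"), 0 when an
 * existing entry is reused, -EEXIST on a MAC conflict and -ENOMEM when
 * the table is full.
 */
static int
neigh_get(uint32_t dst_ip, const uint8_t dst_mac[6])
{
	struct neigh_entry *free_slot = NULL;
	unsigned int i;

	for (i = 0; i < NEIGH_MAX; i++) {
		struct neigh_entry *e = &neigh_tbl[i];

		if (!e->refcnt) {
			if (!free_slot)
				free_slot = e;
			continue;
		}
		if (e->dst_ip != dst_ip)
			continue;
		if (memcmp(e->dst_mac, dst_mac, 6))
			return -EEXIST;
		e->refcnt++;
		return 0;
	}
	if (!free_slot)
		return -ENOMEM;
	free_slot->dst_ip = dst_ip;
	memcpy(free_slot->dst_mac, dst_mac, 6);
	free_slot->refcnt = 1;
	return 1;
}

/*
 * Drop a reference. Returns 1 when the last reference is gone (the
 * caller would then issue "ip neigh del"), 0 otherwise and -ENOENT
 * when no such entry exists.
 */
static int
neigh_put(uint32_t dst_ip)
{
	unsigned int i;

	for (i = 0; i < NEIGH_MAX; i++) {
		struct neigh_entry *e = &neigh_tbl[i];

		if (e->refcnt && e->dst_ip == dst_ip)
			return --e->refcnt == 0;
	}
	return -ENOENT;
}
```

Only the first reference and the last release touch the kernel, so
the number of Netlink requests stays proportional to the number of
distinct destinations rather than the number of flows.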

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 394 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 389 insertions(+), 5 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index efa9c3b..a1d7733 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -3443,6 +3443,376 @@ struct pedit_parser {
 	return -err;
 }
 
+/**
+ * Emit Netlink message to add/remove local address to the outer device.
+ * The address being added is visible within the link only (scope link).
+ *
+ * Note that an implicit route is maintained by the kernel due to the
+ * presence of a peer address (IFA_ADDRESS).
+ *
+ * These rules are used for encapsulation only and allow assigning
+ * the outer tunnel source IP address.
+ *
+ * @param[in] tcf
+ *   Libmnl socket context object.
+ * @param[in] encap
+ *   Encapsulation properties (source address and its peer).
+ * @param[in] ifindex
+ *   Network interface to apply rule.
+ * @param[in] enable
+ *   Toggle between add and remove.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+flow_tcf_rule_local(struct mlx5_flow_tcf_context *tcf,
+		    const struct mlx5_flow_tcf_vxlan_encap *encap,
+		    unsigned int ifindex,
+		    bool enable,
+		    struct rte_flow_error *error)
+{
+	struct nlmsghdr *nlh;
+	struct ifaddrmsg *ifa;
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(sizeof(*ifa) + 128)];
+
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = enable ? RTM_NEWADDR : RTM_DELADDR;
+	nlh->nlmsg_flags =
+		NLM_F_REQUEST | (enable ? NLM_F_CREATE | NLM_F_REPLACE : 0);
+	nlh->nlmsg_seq = 0;
+	ifa = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifa));
+	ifa->ifa_flags = IFA_F_PERMANENT;
+	ifa->ifa_scope = RT_SCOPE_LINK;
+	ifa->ifa_index = ifindex;
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC) {
+		ifa->ifa_family = AF_INET;
+		ifa->ifa_prefixlen = 32;
+		mnl_attr_put_u32(nlh, IFA_LOCAL, encap->ipv4.src);
+		if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST)
+			mnl_attr_put_u32(nlh, IFA_ADDRESS,
+					      encap->ipv4.dst);
+	} else {
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC);
+		ifa->ifa_family = AF_INET6;
+		ifa->ifa_prefixlen = 128;
+		mnl_attr_put(nlh, IFA_LOCAL,
+				  sizeof(encap->ipv6.src),
+				  &encap->ipv6.src);
+		if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST)
+			mnl_attr_put(nlh, IFA_ADDRESS,
+					  sizeof(encap->ipv6.dst),
+					  &encap->ipv6.dst);
+	}
+	if (!flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL))
+		return 0;
+	return rte_flow_error_set
+		(error, rte_errno, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+		 "netlink: cannot complete IFA request (ip addr add)");
+}
+
+/**
+ * Emit Netlink message to add/remove neighbor.
+ *
+ * @param[in] tcf
+ *   Libmnl socket context object.
+ * @param[in] encap
+ *   Encapsulation properties (destination address).
+ * @param[in] ifindex
+ *   Network interface.
+ * @param[in] enable
+ *   Toggle between add and remove.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+flow_tcf_rule_neigh(struct mlx5_flow_tcf_context *tcf,
+		     const struct mlx5_flow_tcf_vxlan_encap *encap,
+		     unsigned int ifindex,
+		     bool enable,
+		     struct rte_flow_error *error)
+{
+	struct nlmsghdr *nlh;
+	struct ndmsg *ndm;
+	alignas(struct nlmsghdr)
+	uint8_t buf[mnl_nlmsg_size(sizeof(*ndm) + 128)];
+
+	nlh = mnl_nlmsg_put_header(buf);
+	nlh->nlmsg_type = enable ? RTM_NEWNEIGH : RTM_DELNEIGH;
+	nlh->nlmsg_flags =
+		NLM_F_REQUEST | (enable ? NLM_F_CREATE | NLM_F_REPLACE : 0);
+	nlh->nlmsg_seq = 0;
+	ndm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ndm));
+	ndm->ndm_ifindex = ifindex;
+	ndm->ndm_state = NUD_PERMANENT;
+	ndm->ndm_flags = 0;
+	ndm->ndm_type = 0;
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST) {
+		ndm->ndm_family = AF_INET;
+		mnl_attr_put_u32(nlh, NDA_DST, encap->ipv4.dst);
+	} else {
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST);
+		ndm->ndm_family = AF_INET6;
+		mnl_attr_put(nlh, NDA_DST, sizeof(encap->ipv6.dst),
+						 &encap->ipv6.dst);
+	}
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_ETH_SRC && enable)
+		DRV_LOG(WARNING,
+			"Outer ethernet source address cannot be "
+			"forced for VXLAN encapsulation");
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_ETH_DST)
+		mnl_attr_put(nlh, NDA_LLADDR, sizeof(encap->eth.dst),
+						    &encap->eth.dst);
+	if (!flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL))
+		return 0;
+	return rte_flow_error_set
+		(error, rte_errno, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
+		 "netlink: cannot complete ND request (ip neigh)");
+}
+
+/**
+ * Manage the local IP addresses and their peer IP addresses on the
+ * outer interface for encapsulation purposes. The kernel searches for
+ * the appropriate device for tunnel egress traffic using the outer
+ * source IP. This IP must be assigned to the outer network device,
+ * otherwise the kernel rejects the rule.
+ *
+ * Adds or removes the addresses using the Netlink command like this:
+ *   ip addr add <src_ip> peer <dst_ip> scope link dev <ifouter>
+ *
+ * The addresses are local to the netdev ("scope link"), this reduces
+ * the risk of conflicts. Note that an implicit route is maintained by
+ * the kernel due to the presence of a peer address (IFA_ADDRESS).
+ *
+ * @param[in] tcf
+ *   Libmnl socket context object.
+ * @param[in] vtep
+ *   VTEP object, contains rule database and ifouter index.
+ * @param[in] dev_flow
+ *   Flow object, contains the tunnel parameters (for encap only).
+ * @param[in] enable
+ *   Toggle between add and remove.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+flow_tcf_encap_local(struct mlx5_flow_tcf_context *tcf,
+		     struct mlx5_flow_tcf_vtep *vtep,
+		     struct mlx5_flow *dev_flow,
+		     bool enable,
+		     struct rte_flow_error *error)
+{
+	const struct mlx5_flow_tcf_vxlan_encap *encap =
+						dev_flow->tcf.vxlan_encap;
+	struct tcf_local_rule *rule;
+	bool found = false;
+	int ret;
+
+	assert(encap);
+	assert(encap->hdr.type == MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP);
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC) {
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST);
+		LIST_FOREACH(rule, &vtep->local, next) {
+			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC &&
+			    encap->ipv4.src == rule->ipv4.src &&
+			    encap->ipv4.dst == rule->ipv4.dst) {
+				found = true;
+				break;
+			}
+		}
+	} else {
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC);
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST);
+		LIST_FOREACH(rule, &vtep->local, next) {
+			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC &&
+			    !memcmp(&encap->ipv6.src, &rule->ipv6.src,
+					    sizeof(encap->ipv6.src)) &&
+			    !memcmp(&encap->ipv6.dst, &rule->ipv6.dst,
+					    sizeof(encap->ipv6.dst))) {
+				found = true;
+				break;
+			}
+		}
+	}
+	if (found) {
+		if (enable) {
+			rule->refcnt++;
+			return 0;
+		}
+		if (!rule->refcnt || !--rule->refcnt) {
+			LIST_REMOVE(rule, next);
+			return flow_tcf_rule_local(tcf, encap,
+					vtep->ifouter, false, error);
+		}
+		return 0;
+	}
+	if (!enable) {
+		DRV_LOG(WARNING, "Disabling not existing local rule");
+		rte_flow_error_set
+			(error, ENOENT, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			 NULL, "Disabling not existing local rule");
+		return -ENOENT;
+	}
+	rule = rte_zmalloc(__func__, sizeof(struct tcf_local_rule),
+				alignof(struct tcf_local_rule));
+	if (!rule) {
+		rte_flow_error_set
+			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			 NULL, "unable to allocate memory for local rule");
+		return -rte_errno;
+	}
+	*rule = (struct tcf_local_rule){.refcnt = 0,
+					.mask = 0,
+					};
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC) {
+		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV4_SRC
+			   | MLX5_FLOW_TCF_ENCAP_IPV4_DST;
+		rule->ipv4.src = encap->ipv4.src;
+		rule->ipv4.dst = encap->ipv4.dst;
+	} else {
+		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV6_SRC
+			   | MLX5_FLOW_TCF_ENCAP_IPV6_DST;
+		memcpy(&rule->ipv6.src, &encap->ipv6.src,
+				sizeof(rule->ipv6.src));
+		memcpy(&rule->ipv6.dst, &encap->ipv6.dst,
+				sizeof(rule->ipv6.dst));
+	}
+	ret = flow_tcf_rule_local(tcf, encap, vtep->ifouter, true, error);
+	if (ret) {
+		rte_free(rule);
+		return ret;
+	}
+	rule->refcnt++;
+	LIST_INSERT_HEAD(&vtep->local, rule, next);
+	return 0;
+}
+
+/**
+ * Manage the neigh database of destination MAC/IP addresses. The kernel
+ * uses it to determine the destination MAC address of the encapsulation
+ * header. Adds or removes the entries using the Netlink command like this:
+ *   ip neigh add dev <ifouter> lladdr <dst_mac> to <dst_ip> nud permanent
+ *
+ * @param[in] tcf
+ *   Libmnl socket context object.
+ * @param[in] vtep
+ *   VTEP object, contains rule database and ifouter index.
+ * @param[in] dev_flow
+ *   Flow object, contains the tunnel parameters (for encap only).
+ * @param[in] enable
+ *   Toggle between add and remove.
+ * @param[out] error
+ *   Perform verbose error reporting if not NULL.
+ *
+ * @return
+ *   0 on success, a negative errno value otherwise and rte_errno is set.
+ */
+static int
+flow_tcf_encap_neigh(struct mlx5_flow_tcf_context *tcf,
+		     struct mlx5_flow_tcf_vtep *vtep,
+		     struct mlx5_flow *dev_flow,
+		     bool enable,
+		     struct rte_flow_error *error)
+{
+	const struct mlx5_flow_tcf_vxlan_encap *encap =
+						dev_flow->tcf.vxlan_encap;
+	struct tcf_neigh_rule *rule;
+	bool found = false;
+	int ret;
+
+	assert(encap);
+	assert(encap->hdr.type == MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP);
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST) {
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC);
+		LIST_FOREACH(rule, &vtep->neigh, next) {
+			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST &&
+			    encap->ipv4.dst == rule->ipv4.dst) {
+				found = true;
+				break;
+			}
+		}
+	} else {
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC);
+		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST);
+		LIST_FOREACH(rule, &vtep->neigh, next) {
+			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST &&
+			    !memcmp(&encap->ipv6.dst, &rule->ipv6.dst,
+						sizeof(encap->ipv6.dst))) {
+				found = true;
+				break;
+			}
+		}
+	}
+	if (found) {
+		if (memcmp(&encap->eth.dst, &rule->eth,
+			   sizeof(encap->eth.dst))) {
+			DRV_LOG(WARNING, "Destination MAC differs"
+					 " in neigh rule");
+			rte_flow_error_set(error, EEXIST,
+					   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+					   NULL, "Different MAC address"
+					   " neigh rule for the same"
+					   " destination IP");
+			return -EEXIST;
+		}
+		if (enable) {
+			rule->refcnt++;
+			return 0;
+		}
+		if (!rule->refcnt || !--rule->refcnt) {
+			LIST_REMOVE(rule, next);
+			return flow_tcf_rule_neigh(tcf, encap,
+						   vtep->ifouter,
+						   false, error);
+		}
+		return 0;
+	}
+	if (!enable) {
+		DRV_LOG(WARNING, "Disabling not existing neigh rule");
+		rte_flow_error_set
+			(error, ENOENT, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			 NULL, "Disabling not existing neigh rule");
+		return -ENOENT;
+	}
+	rule = rte_zmalloc(__func__, sizeof(struct tcf_neigh_rule),
+				alignof(struct tcf_neigh_rule));
+	if (!rule) {
+		rte_flow_error_set
+			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
+			 NULL, "unable to allocate memory for neigh rule");
+		return -rte_errno;
+	}
+	*rule = (struct tcf_neigh_rule){.refcnt = 0,
+					.mask = 0,
+					};
+	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST) {
+		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV4_DST;
+		rule->ipv4.dst = encap->ipv4.dst;
+	} else {
+		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV6_DST;
+		memcpy(&rule->ipv6.dst, &encap->ipv6.dst,
+					sizeof(rule->ipv6.dst));
+	}
+	memcpy(&rule->eth, &encap->eth.dst, sizeof(rule->eth));
+	ret = flow_tcf_rule_neigh(tcf, encap, vtep->ifouter, true, error);
+	if (ret) {
+		rte_free(rule);
+		return ret;
+	}
+	rule->refcnt++;
+	LIST_INSERT_HEAD(&vtep->neigh, rule, next);
+	return 0;
+}
+
 /* VTEP device list is shared between PMD port instances. */
 static LIST_HEAD(, mlx5_flow_tcf_vtep)
 			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
@@ -3715,6 +4085,7 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
 {
 	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
 	struct mlx5_flow_tcf_vtep *vtep, *vlst;
+	int ret;
 
 	assert(ifouter);
 	/* Look whether the attached VTEP for encap is created. */
@@ -3766,6 +4137,21 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
 	}
 	if (!vtep)
 		return 0;
+	/* Create local ipaddr with peer to specify the outer IPs. */
+	ret = flow_tcf_encap_local(tcf, vtep, dev_flow, true, error);
+	if (ret) {
+		if (!vtep->refcnt)
+			flow_tcf_delete_iface(tcf, vtep);
+		return 0;
+	}
+	/* Create neigh rule to specify outer destination MAC. */
+	ret = flow_tcf_encap_neigh(tcf, vtep, dev_flow, true, error);
+	if (ret) {
+		flow_tcf_encap_local(tcf, vtep, dev_flow, false, error);
+		if (!vtep->refcnt)
+			flow_tcf_delete_iface(tcf, vtep);
+		return 0;
+	}
 	vtep->refcnt++;
 	assert(vtep->ifindex);
 	return vtep->ifindex;
@@ -3848,11 +4234,9 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
 	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
 		break;
 	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
-/*
- * TODO: Remove the encap ancillary rules first.
- * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
- * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
- */
+		/* Remove the encap ancillary rules first. */
+		flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
+		flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
 		break;
 	default:
 		assert(false);
-- 
1.8.3.1

* [dpdk-dev] [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
                     ` (5 preceding siblings ...)
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 6/7] net/mlx5: e-switch VXLAN encapsulation rules management Viacheslav Ovsiienko
@ 2018-10-15 14:13   ` Viacheslav Ovsiienko
  2018-10-25  0:36     ` Yongseok Koh
  2018-11-01 12:19   ` [dpdk-dev] [PATCH v3 00/13] net/mlx5: e-switch VXLAN encap/decap hardware offload Slava Ovsiienko
  7 siblings, 1 reply; 110+ messages in thread
From: Viacheslav Ovsiienko @ 2018-10-15 14:13 UTC (permalink / raw)
  To: shahafs, yskoh; +Cc: dev, Viacheslav Ovsiienko

The last part of the patchset contains the rule cleanup routines.
These are part of the outer interface initialization performed
at the moment a VXLAN VTEP is attached. The routines query
the list of attached VXLAN devices, the list of local IP
addresses with peer and link scope attributes, and the list
of permanent neigh rules; all such items found on the
specified outer device are then flushed.
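
As a rough illustration of the collect-then-flush approach described
above, the sketch below packs variable-length commands back to back into
one buffer and sets an acknowledgement flag only on the last one. It is
not the code from this patch: struct cmd_hdr, CMD_F_ACK, CMD_ALIGN and
the batch_*() helpers are hypothetical stand-ins for struct nlmsghdr,
NLM_F_ACK, NLMSG_ALIGN and the flow_tcf_*_nlcmd() routines, and nothing
is actually sent over netlink.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for NLMSG_ALIGN() and NLM_F_ACK. */
#define CMD_ALIGN(len) (((len) + 3u) & ~3u)
#define CMD_F_ACK 0x4u
#define BATCH_CAPACITY 256u

/* Simplified command header (stand-in for struct nlmsghdr). */
struct cmd_hdr {
	uint32_t len;   /* Aligned length of header plus payload. */
	uint16_t type;
	uint16_t flags;
};

/* One buffer holding several commands back to back. */
struct cmd_batch {
	uint32_t used; /* Bytes occupied so far. */
	uint8_t msg[BATCH_CAPACITY];
};

/* Append a zeroed command; return its offset or -1 if it does not fit. */
static int
batch_append(struct cmd_batch *b, uint16_t type, uint32_t payload_len)
{
	struct cmd_hdr h = { .len = 0, .type = type, .flags = 0 };
	uint32_t size;

	if (payload_len > BATCH_CAPACITY)
		return -1;
	size = CMD_ALIGN((uint32_t)sizeof(h) + payload_len);
	if (size > BATCH_CAPACITY - b->used)
		return -1;
	h.len = size;
	memset(&b->msg[b->used], 0, size);
	memcpy(&b->msg[b->used], &h, sizeof(h));
	b->used += size;
	return (int)(b->used - size);
}

/* Walk the batch by length and flag only the last command for ack. */
static int
batch_mark_last(struct cmd_batch *b)
{
	struct cmd_hdr h;
	uint32_t off = 0;
	uint32_t last = 0;

	if (!b->used)
		return -1;
	while (off < b->used) {
		memcpy(&h, &b->msg[off], sizeof(h));
		last = off;
		off += h.len;
	}
	h.flags |= CMD_F_ACK;
	memcpy(&b->msg[last], &h, sizeof(h));
	return (int)last;
}

/* Read back the flags of the command stored at the given offset. */
static uint16_t
batch_flags_at(const struct cmd_batch *b, uint32_t off)
{
	struct cmd_hdr h;

	memcpy(&h, &b->msg[off], sizeof(h));
	return h.flags;
}
```

Appending a command with a 5-byte payload and then one with no payload
places them at offsets 0 and 16; batch_mark_last() then flags only the
command at offset 16, so a single acknowledgement covers the whole batch.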

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow_tcf.c | 505 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 499 insertions(+), 6 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index a1d7733..a3348ea 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -4012,6 +4012,502 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
 }
 #endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
 
+#define MNL_REQUEST_SIZE_MIN 256
+#define MNL_REQUEST_SIZE_MAX 2048
+#define MNL_REQUEST_SIZE RTE_MIN(RTE_MAX(sysconf(_SC_PAGESIZE), \
+				 MNL_REQUEST_SIZE_MIN), MNL_REQUEST_SIZE_MAX)
+
+/* Data structures used by flow_tcf_xxx_cb() routines. */
+struct tcf_nlcb_buf {
+	LIST_ENTRY(tcf_nlcb_buf) next;
+	uint32_t size;
+	alignas(struct nlmsghdr)
+	uint8_t msg[]; /**< Netlink message data. */
+};
+
+struct tcf_nlcb_context {
+	unsigned int ifindex; /**< Base interface index. */
+	uint32_t bufsize;
+	LIST_HEAD(, tcf_nlcb_buf) nlbuf;
+};
+
+/**
+ * Allocate space for netlink command in buffer list
+ *
+ * @param[in, out] ctx
+ *   Pointer to callback context with command buffers list.
+ * @param[in] size
+ *   Required size of data buffer to be allocated.
+ *
+ * @return
+ *   Pointer to allocated memory, aligned as message header.
+ *   NULL if some error occurred.
+ */
+static struct nlmsghdr *
+flow_tcf_alloc_nlcmd(struct tcf_nlcb_context *ctx, uint32_t size)
+{
+	struct tcf_nlcb_buf *buf;
+	struct nlmsghdr *nlh;
+
+	size = NLMSG_ALIGN(size);
+	buf = LIST_FIRST(&ctx->nlbuf);
+	if (buf && (buf->size + size) <= ctx->bufsize) {
+		nlh = (struct nlmsghdr *)&buf->msg[buf->size];
+		buf->size += size;
+		return nlh;
+	}
+	if (size > ctx->bufsize) {
+		DRV_LOG(WARNING, "netlink: too long command buffer requested");
+		return NULL;
+	}
+	buf = rte_malloc(__func__,
+			ctx->bufsize + sizeof(struct tcf_nlcb_buf),
+			alignof(struct tcf_nlcb_buf));
+	if (!buf) {
+		DRV_LOG(WARNING, "netlink: no memory for command buffer");
+		return NULL;
+	}
+	LIST_INSERT_HEAD(&ctx->nlbuf, buf, next);
+	buf->size = size;
+	nlh = (struct nlmsghdr *)&buf->msg[0];
+	return nlh;
+}
+
+/**
+ * Set NLM_F_ACK flags in the last netlink command in buffer.
+ * Only last command in the buffer will be acked by system.
+ *
+ * @param[in, out] buf
+ *   Pointer to buffer with netlink commands.
+ */
+static void
+flow_tcf_setack_nlcmd(struct tcf_nlcb_buf *buf)
+{
+	struct nlmsghdr *nlh;
+	uint32_t size = 0;
+
+	assert(buf->size);
+	do {
+		nlh = (struct nlmsghdr *)&buf->msg[size];
+		size += NLMSG_ALIGN(nlh->nlmsg_len);
+		if (size >= buf->size) {
+			nlh->nlmsg_flags |= NLM_F_ACK;
+			break;
+		}
+	} while (true);
+}
+
+/**
+ * Send the buffers with prepared netlink commands. Scans the list and
+ * sends all found buffers. Buffers are freed whether or not sending
+ * succeeds, in order to prevent memory leaks.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in, out] ctx
+ *   Pointer to callback context with command buffers list.
+ *
+ * @return
+ *   Zero value on success, negative errno value otherwise
+ *   and rte_errno is set.
+ */
+static int
+flow_tcf_send_nlcmd(struct mlx5_flow_tcf_context *tcf,
+		    struct tcf_nlcb_context *ctx)
+{
+	struct tcf_nlcb_buf *bc, *bn;
+	struct nlmsghdr *nlh;
+	int ret = 0;
+
+	bc = LIST_FIRST(&ctx->nlbuf);
+	while (bc) {
+		int rc;
+
+		bn = LIST_NEXT(bc, next);
+		if (bc->size) {
+			flow_tcf_setack_nlcmd(bc);
+			nlh = (struct nlmsghdr *)&bc->msg;
+			rc = flow_tcf_nl_ack(tcf, nlh, bc->size, NULL, NULL);
+			if (rc && !ret)
+				ret = rc;
+		}
+		rte_free(bc);
+		bc = bn;
+	}
+	LIST_INIT(&ctx->nlbuf);
+	return ret;
+}
+
+/**
+ * Collect local IP address rules with scope link attribute on specified
+ * network device. This is callback routine called by libmnl mnl_cb_run()
+ * in loop for every message in received packet.
+ *
+ * @param[in] nlh
+ *   Pointer to reply header.
+ * @param[in, out] arg
+ *   Opaque data pointer for this callback.
+ *
+ * @return
+ *   A positive, nonzero value on success, negative errno value otherwise
+ *   and rte_errno is set.
+ */
+static int
+flow_tcf_collect_local_cb(const struct nlmsghdr *nlh, void *arg)
+{
+	struct tcf_nlcb_context *ctx = arg;
+	struct nlmsghdr *cmd;
+	struct ifaddrmsg *ifa;
+	struct nlattr *na;
+	struct nlattr *na_local = NULL;
+	struct nlattr *na_peer = NULL;
+	unsigned char family;
+
+	if (nlh->nlmsg_type != RTM_NEWADDR) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	ifa = mnl_nlmsg_get_payload(nlh);
+	family = ifa->ifa_family;
+	if (ifa->ifa_index != ctx->ifindex ||
+	    ifa->ifa_scope != RT_SCOPE_LINK ||
+	    !(ifa->ifa_flags & IFA_F_PERMANENT) ||
+	    (family != AF_INET && family != AF_INET6))
+		return 1;
+	mnl_attr_for_each(na, nlh, sizeof(*ifa)) {
+		switch (mnl_attr_get_type(na)) {
+		case IFA_LOCAL:
+			na_local = na;
+			break;
+		case IFA_ADDRESS:
+			na_peer = na;
+			break;
+		}
+		if (na_local && na_peer)
+			break;
+	}
+	if (!na_local || !na_peer)
+		return 1;
+	/* Local rule found with scope link, permanent and assigned peer. */
+	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
+					MNL_ALIGN(sizeof(struct ifaddrmsg)) +
+					(family == AF_INET6
+					? 2 * SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN)
+					: 2 * SZ_NLATTR_TYPE_OF(uint32_t)));
+	if (!cmd) {
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	cmd = mnl_nlmsg_put_header(cmd);
+	cmd->nlmsg_type = RTM_DELADDR;
+	cmd->nlmsg_flags = NLM_F_REQUEST;
+	ifa = mnl_nlmsg_put_extra_header(cmd, sizeof(*ifa));
+	ifa->ifa_flags = IFA_F_PERMANENT;
+	ifa->ifa_scope = RT_SCOPE_LINK;
+	ifa->ifa_index = ctx->ifindex;
+	if (family == AF_INET) {
+		ifa->ifa_family = AF_INET;
+		ifa->ifa_prefixlen = 32;
+		mnl_attr_put_u32(cmd, IFA_LOCAL, mnl_attr_get_u32(na_local));
+		mnl_attr_put_u32(cmd, IFA_ADDRESS, mnl_attr_get_u32(na_peer));
+	} else {
+		ifa->ifa_family = AF_INET6;
+		ifa->ifa_prefixlen = 128;
+		mnl_attr_put(cmd, IFA_LOCAL, IPV6_ADDR_LEN,
+			mnl_attr_get_payload(na_local));
+		mnl_attr_put(cmd, IFA_ADDRESS, IPV6_ADDR_LEN,
+			mnl_attr_get_payload(na_peer));
+	}
+	return 1;
+}
+
+/**
+ * Cleanup the local IP addresses on outer interface.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifindex
+ *   Network interface index to perform cleanup.
+ */
+static void
+flow_tcf_encap_local_cleanup(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifindex)
+{
+	struct nlmsghdr *nlh;
+	struct ifaddrmsg *ifa;
+	struct tcf_nlcb_context ctx = {
+		.ifindex = ifindex,
+		.bufsize = MNL_REQUEST_SIZE,
+		.nlbuf = LIST_HEAD_INITIALIZER(),
+	};
+	int ret;
+
+	assert(ifindex);
+	/*
+	 * Seek and destroy leftovers of local IP addresses with
+	 * matching properties "scope link".
+	 */
+	nlh = mnl_nlmsg_put_header(tcf->buf);
+	nlh->nlmsg_type = RTM_GETADDR;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
+	ifa = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifa));
+	ifa->ifa_family = AF_UNSPEC;
+	ifa->ifa_index = ifindex;
+	ifa->ifa_scope = RT_SCOPE_LINK;
+	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_local_cb, &ctx);
+	if (ret)
+		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
+	ret = flow_tcf_send_nlcmd(tcf, &ctx);
+	if (ret)
+		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
+}
+
+/**
+ * Collect permanent neigh rules on specified network device.
+ * This is callback routine called by libmnl mnl_cb_run() in loop for
+ * every message in received packet.
+ *
+ * @param[in] nlh
+ *   Pointer to reply header.
+ * @param[in, out] arg
+ *   Opaque data pointer for this callback.
+ *
+ * @return
+ *   A positive, nonzero value on success, negative errno value otherwise
+ *   and rte_errno is set.
+ */
+static int
+flow_tcf_collect_neigh_cb(const struct nlmsghdr *nlh, void *arg)
+{
+	struct tcf_nlcb_context *ctx = arg;
+	struct nlmsghdr *cmd;
+	struct ndmsg *ndm;
+	struct nlattr *na;
+	struct nlattr *na_ip = NULL;
+	struct nlattr *na_mac = NULL;
+	unsigned char family;
+
+	if (nlh->nlmsg_type != RTM_NEWNEIGH) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	ndm = mnl_nlmsg_get_payload(nlh);
+	family = ndm->ndm_family;
+	if (ndm->ndm_ifindex != (int)ctx->ifindex ||
+	   !(ndm->ndm_state & NUD_PERMANENT) ||
+	   (family != AF_INET && family != AF_INET6))
+		return 1;
+	mnl_attr_for_each(na, nlh, sizeof(*ndm)) {
+		switch (mnl_attr_get_type(na)) {
+		case NDA_DST:
+			na_ip = na;
+			break;
+		case NDA_LLADDR:
+			na_mac = na;
+			break;
+		}
+		if (na_mac && na_ip)
+			break;
+	}
+	if (!na_mac || !na_ip)
+		return 1;
+	/* Neigh rule with permanent attribute found. */
+	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
+					MNL_ALIGN(sizeof(struct ndmsg)) +
+					SZ_NLATTR_DATA_OF(ETHER_ADDR_LEN) +
+					(family == AF_INET6
+					? SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN)
+					: SZ_NLATTR_TYPE_OF(uint32_t)));
+	if (!cmd) {
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	cmd = mnl_nlmsg_put_header(cmd);
+	cmd->nlmsg_type = RTM_DELNEIGH;
+	cmd->nlmsg_flags = NLM_F_REQUEST;
+	ndm = mnl_nlmsg_put_extra_header(cmd, sizeof(*ndm));
+	ndm->ndm_ifindex = ctx->ifindex;
+	ndm->ndm_state = NUD_PERMANENT;
+	ndm->ndm_flags = 0;
+	ndm->ndm_type = 0;
+	if (family == AF_INET) {
+		ndm->ndm_family = AF_INET;
+		mnl_attr_put_u32(cmd, NDA_DST, mnl_attr_get_u32(na_ip));
+	} else {
+		ndm->ndm_family = AF_INET6;
+		mnl_attr_put(cmd, NDA_DST, IPV6_ADDR_LEN,
+			     mnl_attr_get_payload(na_ip));
+	}
+	mnl_attr_put(cmd, NDA_LLADDR, ETHER_ADDR_LEN,
+		     mnl_attr_get_payload(na_mac));
+	return 1;
+}
+
+/**
+ * Cleanup the neigh rules on outer interface.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifindex
+ *   Network interface index to perform cleanup.
+ */
+static void
+flow_tcf_encap_neigh_cleanup(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifindex)
+{
+	struct nlmsghdr *nlh;
+	struct ndmsg *ndm;
+	struct tcf_nlcb_context ctx = {
+		.ifindex = ifindex,
+		.bufsize = MNL_REQUEST_SIZE,
+		.nlbuf = LIST_HEAD_INITIALIZER(),
+	};
+	int ret;
+
+	assert(ifindex);
+	/* Seek and destroy leftovers of neigh rules. */
+	nlh = mnl_nlmsg_put_header(tcf->buf);
+	nlh->nlmsg_type = RTM_GETNEIGH;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
+	ndm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ndm));
+	ndm->ndm_family = AF_UNSPEC;
+	ndm->ndm_ifindex = ifindex;
+	ndm->ndm_state = NUD_PERMANENT;
+	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_neigh_cb, &ctx);
+	if (ret)
+		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
+	ret = flow_tcf_send_nlcmd(tcf, &ctx);
+	if (ret)
+		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
+}
+
+/**
+ * Collect indices of VXLAN encap/decap interfaces associated with device.
+ * This is callback routine called by libmnl mnl_cb_run() in loop for
+ * every message in received packet.
+ *
+ * @param[in] nlh
+ *   Pointer to reply header.
+ * @param[in, out] arg
+ *   Opaque data pointer for this callback.
+ *
+ * @return
+ *   A positive, nonzero value on success, negative errno value otherwise
+ *   and rte_errno is set.
+ */
+static int
+flow_tcf_collect_vxlan_cb(const struct nlmsghdr *nlh, void *arg)
+{
+	struct tcf_nlcb_context *ctx = arg;
+	struct nlmsghdr *cmd;
+	struct ifinfomsg *ifm;
+	struct nlattr *na;
+	struct nlattr *na_info = NULL;
+	struct nlattr *na_vxlan = NULL;
+	bool found = false;
+	unsigned int vxindex;
+
+	if (nlh->nlmsg_type != RTM_NEWLINK) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	ifm = mnl_nlmsg_get_payload(nlh);
+	if (!ifm->ifi_index) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+	mnl_attr_for_each(na, nlh, sizeof(*ifm))
+		if (mnl_attr_get_type(na) == IFLA_LINKINFO) {
+			na_info = na;
+			break;
+		}
+	if (!na_info)
+		return 1;
+	mnl_attr_for_each_nested(na, na_info) {
+		switch (mnl_attr_get_type(na)) {
+		case IFLA_INFO_KIND:
+			if (!strncmp("vxlan", mnl_attr_get_str(na),
+				     mnl_attr_get_len(na)))
+				found = true;
+			break;
+		case IFLA_INFO_DATA:
+			na_vxlan = na;
+			break;
+		}
+		if (found && na_vxlan)
+			break;
+	}
+	if (!found || !na_vxlan)
+		return 1;
+	found = false;
+	mnl_attr_for_each_nested(na, na_vxlan) {
+		if (mnl_attr_get_type(na) == IFLA_VXLAN_LINK &&
+		    mnl_attr_get_u32(na) == ctx->ifindex) {
+			found = true;
+			break;
+		}
+	}
+	if (!found)
+		return 1;
+	/* Attached VXLAN device found, store the command to delete. */
+	vxindex = ifm->ifi_index;
+	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
+					MNL_ALIGN(sizeof(struct ifinfomsg)));
+	if (!cmd) {
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	cmd = mnl_nlmsg_put_header(cmd);
+	cmd->nlmsg_type = RTM_DELLINK;
+	cmd->nlmsg_flags = NLM_F_REQUEST;
+	ifm = mnl_nlmsg_put_extra_header(cmd, sizeof(*ifm));
+	ifm->ifi_family = AF_UNSPEC;
+	ifm->ifi_index = vxindex;
+	return 1;
+}
+
+/**
+ * Cleanup the outer interface. Removes all found vxlan devices
+ * attached to specified index, flushes the neigh and local IP
+ * database.
+ *
+ * @param[in] tcf
+ *   Context object initialized by mlx5_flow_tcf_context_create().
+ * @param[in] ifindex
+ *   Network interface index to perform cleanup.
+ */
+static void
+flow_tcf_encap_iface_cleanup(struct mlx5_flow_tcf_context *tcf,
+			    unsigned int ifindex)
+{
+	struct nlmsghdr *nlh;
+	struct ifinfomsg *ifm;
+	struct tcf_nlcb_context ctx = {
+		.ifindex = ifindex,
+		.bufsize = MNL_REQUEST_SIZE,
+		.nlbuf = LIST_HEAD_INITIALIZER(),
+	};
+	int ret;
+
+	assert(ifindex);
+	/*
+	 * Seek and destroy leftover VXLAN encap/decap interfaces with
+	 * matching properties.
+	 */
+	nlh = mnl_nlmsg_put_header(tcf->buf);
+	nlh->nlmsg_type = RTM_GETLINK;
+	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
+	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
+	ifm->ifi_family = AF_UNSPEC;
+	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_vxlan_cb, &ctx);
+	if (ret)
+		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
+	ret = flow_tcf_send_nlcmd(tcf, &ctx);
+	if (ret)
+		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
+}
+
+
 /**
  * Create target interface index for VXLAN tunneling decapsulation.
  * In order to share the UDP port within the other interfaces the
@@ -4100,12 +4596,9 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
 		uint16_t pcnt;
 
 		/* Not found, we should create the new attached VTEP. */
-/*
- * TODO: not implemented yet
- * flow_tcf_encap_iface_cleanup(tcf, ifouter);
- * flow_tcf_encap_local_cleanup(tcf, ifouter);
- * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
- */
+		flow_tcf_encap_iface_cleanup(tcf, ifouter);
+		flow_tcf_encap_local_cleanup(tcf, ifouter);
+		flow_tcf_encap_neigh_cleanup(tcf, ifouter);
 		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
 				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
 			encap_port++;
-- 
1.8.3.1


* Re: [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions Viacheslav Ovsiienko
@ 2018-10-23 10:01     ` Yongseok Koh
  2018-10-25 12:50       ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-23 10:01 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 15, 2018 at 02:13:29PM +0000, Viacheslav Ovsiienko wrote:
> This part of patchset adds configuration changes in makefile and
> meson.build for Mellanox MLX5 PMD. Also necessary definitions
> for VXLAN support are made and appropriate data structures
> are presented.
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  drivers/net/mlx5/Makefile        |  80 ++++++++++++++++++
>  drivers/net/mlx5/meson.build     |  32 +++++++
>  drivers/net/mlx5/mlx5_flow.h     |  11 +++
>  drivers/net/mlx5/mlx5_flow_tcf.c | 175 +++++++++++++++++++++++++++++++++++++++
>  4 files changed, 298 insertions(+)
> 
> diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
> index 1e9c0b4..fec7779 100644
> --- a/drivers/net/mlx5/Makefile
> +++ b/drivers/net/mlx5/Makefile
> @@ -207,6 +207,11 @@ mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
>  		enum IFLA_PHYS_PORT_NAME \
>  		$(AUTOCONF_OUTPUT)
>  	$Q sh -- '$<' '$@' \
> +		HAVE_IFLA_VXLAN_COLLECT_METADATA \
> +		linux/if_link.h \
> +		enum IFLA_VXLAN_COLLECT_METADATA \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
>  		HAVE_TCA_CHAIN \
>  		linux/rtnetlink.h \
>  		enum TCA_CHAIN \
> @@ -367,6 +372,81 @@ mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
>  		enum TCA_VLAN_PUSH_VLAN_PRIORITY \
>  		$(AUTOCONF_OUTPUT)
>  	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_KEY_ID \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_KEY_ID \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV4_SRC \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV4_DST \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV6_SRC \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV6_DST \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
> +		linux/pkt_cls.h \
> +		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TC_ACT_TUNNEL_KEY \
> +		linux/tc_act/tc_tunnel_key.h \
> +		define TCA_ACT_TUNNEL_KEY \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
> +		HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT \
> +		linux/tc_act/tc_tunnel_key.h \
> +		enum TCA_TUNNEL_KEY_ENC_DST_PORT \
> +		$(AUTOCONF_OUTPUT)
> +	$Q sh -- '$<' '$@' \
>  		HAVE_TC_ACT_PEDIT \
>  		linux/tc_act/tc_pedit.h \
>  		enum TCA_PEDIT_KEY_EX_HDR_TYPE_UDP \
> diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
> index c192d44..43aabf2 100644
> --- a/drivers/net/mlx5/meson.build
> +++ b/drivers/net/mlx5/meson.build
> @@ -126,6 +126,8 @@ if build
>  		'IFLA_PHYS_SWITCH_ID' ],
>  		[ 'HAVE_IFLA_PHYS_PORT_NAME', 'linux/if_link.h',
>  		'IFLA_PHYS_PORT_NAME' ],
> +		[ 'HAVE_IFLA_VXLAN_COLLECT_METADATA', 'linux/if_link.h',
> +		'IFLA_VXLAN_COLLECT_METADATA' ],
>  		[ 'HAVE_TCA_CHAIN', 'linux/rtnetlink.h',
>  		'TCA_CHAIN' ],
>  		[ 'HAVE_TCA_FLOWER_ACT', 'linux/pkt_cls.h',
> @@ -190,6 +192,36 @@ if build
>  		'TC_ACT_GOTO_CHAIN' ],
>  		[ 'HAVE_TC_ACT_VLAN', 'linux/tc_act/tc_vlan.h',
>  		'TCA_VLAN_PUSH_VLAN_PRIORITY' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_KEY_ID', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_KEY_ID' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV4_SRC' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV4_DST' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV4_DST_MASK' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV6_SRC' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV6_DST' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_IPV6_DST_MASK' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT' ],
> +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK', 'linux/pkt_cls.h',
> +		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK' ],
> +		[ 'HAVE_TC_ACT_TUNNEL_KEY', 'linux/tc_act/tc_tunnel_key.h',
> +		'TCA_ACT_TUNNEL_KEY' ],
> +		[ 'HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT', 'linux/tc_act/tc_tunnel_key.h',
> +		'TCA_TUNNEL_KEY_ENC_DST_PORT' ],
>  		[ 'HAVE_TC_ACT_PEDIT', 'linux/tc_act/tc_pedit.h',
>  		'TCA_PEDIT_KEY_EX_HDR_TYPE_UDP' ],
>  		[ 'HAVE_RDMA_NL_NLDEV', 'rdma/rdma_netlink.h',
> diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
> index 840d645..b838ab0 100644
> --- a/drivers/net/mlx5/mlx5_flow.h
> +++ b/drivers/net/mlx5/mlx5_flow.h
> @@ -85,6 +85,8 @@
>  #define MLX5_FLOW_ACTION_SET_TP_SRC (1u << 15)
>  #define MLX5_FLOW_ACTION_SET_TP_DST (1u << 16)
>  #define MLX5_FLOW_ACTION_JUMP (1u << 17)
> +#define MLX5_ACTION_VXLAN_ENCAP (1u << 11)
> +#define MLX5_ACTION_VXLAN_DECAP (1u << 12)

MLX5_ACTION_* has been changed to MLX5_FLOW_ACTION_* as you can see above.
Also, could you put them in alphabetical order, decap first and then encap?
Or at least make the order consistent: the order of the case clauses differs
among validate, prepare and translate.

>  #define MLX5_FLOW_FATE_ACTIONS \
>  	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | MLX5_FLOW_ACTION_RSS)
> @@ -182,8 +184,17 @@ struct mlx5_flow_dv {
>  struct mlx5_flow_tcf {
>  	struct nlmsghdr *nlh;
>  	struct tcmsg *tcm;
> +	uint32_t nlsize; /**< Size of NL message buffer. */

It is used only for assert(), but if prepare() is trusted, why do we need to
keep it? I don't think it is needed.

> +	uint32_t applied:1; /**< Whether rule is currently applied. */
> +	uint64_t item_flags; /**< Item flags. */

This isn't used at all.

> +	uint64_t action_flags; /**< Action flags. */

I checked following patches and it doesn't seem necessary. Please refer to the
comment on the translation func. But if you think it is really needed, you
could've used actions field of struct rte_flow and layers field of struct
mlx5_flow in mlx5_flow.h

>  	uint64_t hits;
>  	uint64_t bytes;
> +	union { /**< Tunnel encap/decap descriptor. */
> +		struct mlx5_flow_tcf_tunnel_hdr *tunnel;
> +		struct mlx5_flow_tcf_vxlan_decap *vxlan_decap;
> +		struct mlx5_flow_tcf_vxlan_encap *vxlan_encap;
> +	};

What is the reason for keeping pointer even though the actual structure follows
after mlx5_flow_tcf? Maybe you don't want to waste memory, as the size of
encap/decap struct differs a lot?
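
To make the layout under discussion concrete, here is a small sketch of a
union of typed pointers that all refer to a descriptor starting with a
common header. The names (tun_hdr, tun_encap, tun_decap, flow_desc,
flow_tunnel_ifindex) are simplified stand-ins invented for illustration,
not the definitions from the patch, and the sketch says nothing about
where the descriptor memory actually lives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tunnel action types. */
enum tun_type {
	TUN_VXLAN_ENCAP,
	TUN_VXLAN_DECAP,
};

/* Common header placed first in every tunnel descriptor. */
struct tun_hdr {
	uint32_t type;            /* One of enum tun_type. */
	unsigned int ifindex_tun; /* Tunnel endpoint interface. */
};

/* Small descriptor: header plus one field. */
struct tun_decap {
	struct tun_hdr hdr;
	uint16_t udp_port;
};

/* Larger descriptor: header plus encapsulation data. */
struct tun_encap {
	struct tun_hdr hdr;
	uint32_t mask;
	uint8_t vni[3];
};

/* Flow object keeping a single pointer, viewable under three types. */
struct flow_desc {
	union {
		struct tun_hdr *tunnel;
		struct tun_decap *decap;
		struct tun_encap *encap;
	};
};

/* Generic code only touches the common header through 'tunnel'. */
static unsigned int
flow_tunnel_ifindex(const struct flow_desc *f)
{
	return f->tunnel ? f->tunnel->ifindex_tun : 0;
}
```

Code that only needs the common fields can go through the tunnel member,
while encap- or decap-specific code uses the typed member after checking
the type field. The flow object itself stays one pointer wide regardless
of how much the descriptor sizes differ.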

>  };
>  
>  /* Verbs specification header. */
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index 5c46f35..8f9c78a 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -54,6 +54,37 @@ struct tc_vlan {
>  
>  #endif /* HAVE_TC_ACT_VLAN */
>  
> +#ifdef HAVE_TC_ACT_TUNNEL_KEY
> +
> +#include <linux/tc_act/tc_tunnel_key.h>
> +
> +#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
> +#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
> +#endif
> +
> +#else /* HAVE_TC_ACT_TUNNEL_KEY */
> +
> +#define TCA_ACT_TUNNEL_KEY 17
> +#define TCA_TUNNEL_KEY_ACT_SET 1
> +#define TCA_TUNNEL_KEY_ACT_RELEASE 2
> +#define TCA_TUNNEL_KEY_PARMS 2
> +#define TCA_TUNNEL_KEY_ENC_IPV4_SRC 3
> +#define TCA_TUNNEL_KEY_ENC_IPV4_DST 4
> +#define TCA_TUNNEL_KEY_ENC_IPV6_SRC 5
> +#define TCA_TUNNEL_KEY_ENC_IPV6_DST 6
> +#define TCA_TUNNEL_KEY_ENC_KEY_ID 7
> +#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
> +#define TCA_TUNNEL_KEY_NO_CSUM 10
> +
> +struct tc_tunnel_key {
> +	tc_gen;
> +	int t_action;
> +};
> +
> +#endif /* HAVE_TC_ACT_TUNNEL_KEY */
> +
> +
> +
>  #ifdef HAVE_TC_ACT_PEDIT
>  
>  #include <linux/tc_act/tc_pedit.h>
> @@ -210,6 +241,45 @@ struct tc_pedit_sel {
>  #ifndef HAVE_TCA_FLOWER_KEY_VLAN_ETH_TYPE
>  #define TCA_FLOWER_KEY_VLAN_ETH_TYPE 25
>  #endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_KEY_ID
> +#define TCA_FLOWER_KEY_ENC_KEY_ID 26
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC
> +#define TCA_FLOWER_KEY_ENC_IPV4_SRC 27
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK
> +#define TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK 28
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST
> +#define TCA_FLOWER_KEY_ENC_IPV4_DST 29
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK
> +#define TCA_FLOWER_KEY_ENC_IPV4_DST_MASK 30
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC
> +#define TCA_FLOWER_KEY_ENC_IPV6_SRC 31
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK
> +#define TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK 32
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST
> +#define TCA_FLOWER_KEY_ENC_IPV6_DST 33
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK
> +#define TCA_FLOWER_KEY_ENC_IPV6_DST_MASK 34
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT
> +#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT 43
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK
> +#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK 44
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT
> +#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT 45
> +#endif
> +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK
> +#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK 46
> +#endif
>  #ifndef HAVE_TCA_FLOWER_KEY_TCP_FLAGS
>  #define TCA_FLOWER_KEY_TCP_FLAGS 71
>  #endif
> @@ -232,6 +302,111 @@ struct tc_pedit_sel {
>  #define TP_PORT_LEN 2 /* Transport Port (UDP/TCP) Length */
>  #endif
>  
> +#define MLX5_VXLAN_PORT_RANGE_MIN 30000
> +#define MLX5_VXLAN_PORT_RANGE_MAX 60000
> +#define MLX5_VXLAN_DEVICE_PFX "vmlx_"
> +
> +/** Tunnel action type, used for @p type in header structure. */
> +enum mlx5_flow_tcf_tunact_type {
> +	MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP,
> +	MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP,
> +};
> +
> +/** Flags used for @p mask in tunnel action encap descriptors. */
> +#define	MLX5_FLOW_TCF_ENCAP_ETH_SRC (1u << 0)
> +#define	MLX5_FLOW_TCF_ENCAP_ETH_DST (1u << 1)
> +#define	MLX5_FLOW_TCF_ENCAP_IPV4_SRC (1u << 2)
> +#define	MLX5_FLOW_TCF_ENCAP_IPV4_DST (1u << 3)
> +#define	MLX5_FLOW_TCF_ENCAP_IPV6_SRC (1u << 4)
> +#define	MLX5_FLOW_TCF_ENCAP_IPV6_DST (1u << 5)
> +#define	MLX5_FLOW_TCF_ENCAP_UDP_SRC (1u << 6)
> +#define	MLX5_FLOW_TCF_ENCAP_UDP_DST (1u << 7)
> +#define	MLX5_FLOW_TCF_ENCAP_VXLAN_VNI (1u << 8)
> +
> +/** Neigh rule structure */
> +struct tcf_neigh_rule {
> +	LIST_ENTRY(tcf_neigh_rule) next;
> +	uint32_t refcnt;
> +	struct ether_addr eth;
> +	uint16_t mask;
> +	union {
> +		struct {
> +			rte_be32_t dst;
> +		} ipv4;
> +		struct {
> +			uint8_t dst[16];
> +		} ipv6;
> +	};
> +};
> +
> +/** Local rule structure */
> +struct tcf_local_rule {
> +	LIST_ENTRY(tcf_neigh_rule) next;
> +	uint32_t refcnt;
> +	uint16_t mask;
> +	union {
> +		struct {
> +			rte_be32_t dst;
> +			rte_be32_t src;
> +		} ipv4;
> +		struct {
> +			uint8_t dst[16];
> +			uint8_t src[16];
> +		} ipv6;
> +	};
> +};
> +
> +/** VXLAN virtual netdev. */
> +struct mlx5_flow_tcf_vtep {
> +	LIST_ENTRY(mlx5_flow_tcf_vtep) next;
> +	LIST_HEAD(, tcf_neigh_rule) neigh;
> +	LIST_HEAD(, tcf_local_rule) local;
> +	uint32_t refcnt;
> +	unsigned int ifindex; /**< Own interface index. */
> +	unsigned int ifouter; /**< Index of device attached to. */
> +	uint16_t port;
> +	uint8_t created;
> +};
> +
> +/** Tunnel descriptor header, common for all tunnel types. */
> +struct mlx5_flow_tcf_tunnel_hdr {
> +	uint32_t type; /**< Tunnel action type. */
> +	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
> +	unsigned int ifindex_org; /**< Original dst/src interface */
> +	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
> +};
> +
> +struct mlx5_flow_tcf_vxlan_decap {
> +	struct mlx5_flow_tcf_tunnel_hdr hdr;
> +	uint16_t udp_port;
> +};
> +
> +struct mlx5_flow_tcf_vxlan_encap {
> +	struct mlx5_flow_tcf_tunnel_hdr hdr;
> +	uint32_t mask;
> +	struct {
> +		struct ether_addr dst;
> +		struct ether_addr src;
> +	} eth;
> +	union {
> +		struct {
> +			rte_be32_t dst;
> +			rte_be32_t src;
> +		} ipv4;
> +		struct {
> +			uint8_t dst[16];
> +			uint8_t src[16];
> +		} ipv6;
> +	};
> 	struct {
> +		rte_be16_t src;
> +		rte_be16_t dst;
> +	} udp;
> +	struct {
> +		uint8_t vni[3];
> +	} vxlan;
> +};
> +
>  /**
>   * Structure for holding netlink context.
>   * Note the size of the message buffer which is MNL_SOCKET_BUFFER_SIZE.
> 


* Re: [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine Viacheslav Ovsiienko
@ 2018-10-23 10:04     ` Yongseok Koh
  2018-10-25 13:53       ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-23 10:04 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 15, 2018 at 02:13:30PM +0000, Viacheslav Ovsiienko wrote:
> This part of patchset adds support for flow item/action lists
> validation. The following entities are now supported:
> 
> - RTE_FLOW_ITEM_TYPE_VXLAN, contains the tunnel VNI
> 
> - RTE_FLOW_ACTION_TYPE_VXLAN_DECAP, if this action is specified
>   the items in the flow items list are treated as outer network
>   parameters for tunnel outer header match. The ethernet layer
>   addresses are always treated as inner ones.
> 
> - RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, contains the item list to
>   build the encapsulation header. In the current implementation the
>   values are subject to some constraints:
>     - the outer source MAC address is always unconditionally
>       set to one of the MAC addresses of the outer egress interface
>     - there is no way to specify the source UDP port
>     - all abovementioned parameters are ignored if specified
>       in the rule, and warning messages are sent to the log
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_flow_tcf.c | 711 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 705 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index 8f9c78a..0055417 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -430,6 +430,7 @@ struct mlx5_flow_tcf_context {
>  	struct rte_flow_item_ipv6 ipv6;
>  	struct rte_flow_item_tcp tcp;
>  	struct rte_flow_item_udp udp;
> +	struct rte_flow_item_vxlan vxlan;
>  } flow_tcf_mask_empty;
>  
>  /** Supported masks for known item types. */
> @@ -441,6 +442,7 @@ struct mlx5_flow_tcf_context {
>  	struct rte_flow_item_ipv6 ipv6;
>  	struct rte_flow_item_tcp tcp;
>  	struct rte_flow_item_udp udp;
> +	struct rte_flow_item_vxlan vxlan;
>  } flow_tcf_mask_supported = {
>  	.port_id = {
>  		.id = 0xffffffff,
> @@ -478,6 +480,9 @@ struct mlx5_flow_tcf_context {
>  		.src_port = RTE_BE16(0xffff),
>  		.dst_port = RTE_BE16(0xffff),
>  	},
> +	.vxlan = {
> +	       .vni = "\xff\xff\xff",
> +	},
>  };
>  
>  #define SZ_NLATTR_HDR MNL_ALIGN(sizeof(struct nlattr))
> @@ -943,6 +948,615 @@ struct pedit_parser {
>  }
>  
>  /**
> + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_ETH item for E-Switch.

How about mentioning it is to validate items of the "encap header"? Same for the
rest.

> + *
> + * @param[in] item
> + *   Pointer to the itemn structure.

Typo. Same for the rest.

> + * @param[out] error
> + *   Pointer to the error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + **/
> +static int
> +flow_tcf_validate_vxlan_encap_eth(const struct rte_flow_item *item,
> +				  struct rte_flow_error *error)
> +{
> +	const struct rte_flow_item_eth *spec = item->spec;
> +	const struct rte_flow_item_eth *mask = item->mask;
> +
> +	if (!spec)
> +		/*
> +		 * Specification for L2 addresses can be empty
> +		 * because these ones are optional and not
> +		 * required directly by tc rule.
> +		 */
> +		return 0;
> +	if (!mask)
> +		/* If mask is not specified use the default one. */
> +		mask = &rte_flow_item_eth_mask;
> +	if (memcmp(&mask->dst,
> +		   &flow_tcf_mask_empty.eth.dst,
> +		   sizeof(flow_tcf_mask_empty.eth.dst))) {
> +		if (memcmp(&mask->dst,
> +			   &rte_flow_item_eth_mask.dst,
> +			   sizeof(rte_flow_item_eth_mask.dst)))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"eth.dst\" field");
> +	}
> +	if (memcmp(&mask->src,
> +		   &flow_tcf_mask_empty.eth.src,
> +		   sizeof(flow_tcf_mask_empty.eth.src))) {
> +		if (memcmp(&mask->src,
> +			   &rte_flow_item_eth_mask.src,
> +			   sizeof(rte_flow_item_eth_mask.src)))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"eth.src\" field");
> +	}
> +	if (mask->type != RTE_BE16(0x0000)) {
> +		if (mask->type != RTE_BE16(0xffff))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"eth.type\" field");
> +		DRV_LOG(WARNING,
> +			"outer ethernet type field "
> +			"cannot be forced for VXLAN "
> +			"encapsulation, parameter ignored");
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV4 item for E-Switch.
> + *
> + * @param[in] item
> + *   Pointer to the itemn structure.
> + * @param[out] error
> + *   Pointer to the error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + **/
> +static int
> +flow_tcf_validate_vxlan_encap_ipv4(const struct rte_flow_item *item,
> +				   struct rte_flow_error *error)
> +{
> +	const struct rte_flow_item_ipv4 *spec = item->spec;
> +	const struct rte_flow_item_ipv4 *mask = item->mask;
> +
> +	if (!spec)
> +		/*
> +		 * Specification for L3 addresses cannot be empty
> +		 * because it is required by tunnel_key parameter.
> +		 */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "NULL outer L3 address specification "
> +				 " for VXLAN encapsulation");
> +	if (!mask)
> +		mask = &rte_flow_item_ipv4_mask;
> +	if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
> +		if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"ipv4.hdr.dst_addr\" field");
> +		/* More L3 address validations can be put here. */
> +	} else {
> +		/*
> +		 * Kernel uses the destination L3 address to determine
> +		 * the routing path and obtain the L2 destination
> +		 * address, so L3 destination address must be
> +		 * specified in the tc rule.
> +		 */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "outer L3 destination address must be "
> +				 "specified for VXLAN encapsulation");
> +	}
> +	if (mask->hdr.src_addr != RTE_BE32(0x00000000)) {
> +		if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"ipv4.hdr.src_addr\" field");
> +		/* More L3 address validations can be put here. */
> +	} else {
> +		/*
> +		 * Kernel uses the source L3 address to select the
> +		 * interface for egress encapsulated traffic, so
> +		 * it must be specified in the tc rule.
> +		 */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "outer L3 source address must be "
> +				 "specified for VXLAN encapsulation");
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV6 item for E-Switch.
> + *
> + * @param[in] item
> + *   Pointer to the itemn structure.
> + * @param[out] error
> + *   Pointer to the error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + **/
> +static int
> +flow_tcf_validate_vxlan_encap_ipv6(const struct rte_flow_item *item,
> +				   struct rte_flow_error *error)
> +{
> +	const struct rte_flow_item_ipv6 *spec = item->spec;
> +	const struct rte_flow_item_ipv6 *mask = item->mask;
> +
> +	if (!spec)
> +		/*
> +		 * Specification for L3 addresses cannot be empty
> +		 * because it is required by tunnel_key parameter.
> +		 */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "NULL outer L3 address specification "
> +				 " for VXLAN encapsulation");
> +	if (!mask)
> +		mask = &rte_flow_item_ipv6_mask;
> +	if (memcmp(&mask->hdr.dst_addr,
> +		   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
> +		   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
> +		if (memcmp(&mask->hdr.dst_addr,
> +		   &rte_flow_item_ipv6_mask.hdr.dst_addr,
> +		   sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"ipv6.hdr.dst_addr\" field");
> +		/* More L3 address validations can be put here. */
> +	} else {
> +		/*
> +		 * Kernel uses the destination L3 address to determine
> +		 * the routing path and obtain the L2 destination
> +		 * address (heigh or gate), so L3 destination address
> +		 * must be specified within the tc rule.
> +		 */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "outer L3 destination address must be "
> +				 "specified for VXLAN encapsulation");
> +	}
> +	if (memcmp(&mask->hdr.src_addr,
> +		   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
> +		   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
> +		if (memcmp(&mask->hdr.src_addr,
> +		   &rte_flow_item_ipv6_mask.hdr.src_addr,
> +		   sizeof(rte_flow_item_ipv6_mask.hdr.src_addr)))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"ipv6.hdr.src_addr\" field");
> +		/* More L3 address validation can be put here. */
> +	} else {
> +		/*
> +		 * Kernel uses the source L3 address to select the
> +		 * interface for egress encapsulated traffic, so
> +		 * it must be specified in the tc rule.
> +		 */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "outer L3 source address must be "
> +				 "specified for VXLAN encapsulation");
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_UDP item for E-Switch.
> + *
> + * @param[in] item
> + *   Pointer to the itemn structure.
> + * @param[out] error
> + *   Pointer to the error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + **/
> +static int
> +flow_tcf_validate_vxlan_encap_udp(const struct rte_flow_item *item,
> +				  struct rte_flow_error *error)
> +{
> +	const struct rte_flow_item_udp *spec = item->spec;
> +	const struct rte_flow_item_udp *mask = item->mask;
> +
> +	if (!spec)
> +		/*
> +		 * Specification for UDP ports cannot be empty
> +		 * because it is required by tunnel_key parameter.
> +		 */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "NULL UDP port specification "
> +				 " for VXLAN encapsulation");
> +	if (!mask)
> +		mask = &rte_flow_item_udp_mask;
> +	if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
> +		if (mask->hdr.dst_port != RTE_BE16(0xffff))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"udp.hdr.dst_port\" field");
> +		if (!spec->hdr.dst_port)
> +			return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "zero encap remote UDP port");
> +	} else {
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "outer UDP remote port must be "
> +				 "specified for VXLAN encapsulation");
> +	}
> +	if (mask->hdr.src_port != RTE_BE16(0x0000)) {
> +		if (mask->hdr.src_port != RTE_BE16(0xffff))
> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"udp.hdr.src_port\" field");
> +		DRV_LOG(WARNING,
> +			"outer UDP source port cannot be "
> +			"forced for VXLAN encapsulation, "
> +			"parameter ignored");
> +	}
> +	return 0;
> +}
> +
> +/**
> + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_VXLAN item for E-Switch.
> + *
> + * @param[in] item
> + *   Pointer to the itemn structure.
> + * @param[out] error
> + *   Pointer to the error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + **/
> +static int
> +flow_tcf_validate_vxlan_encap_vni(const struct rte_flow_item *item,
> +				  struct rte_flow_error *error)
> +{
> +	const struct rte_flow_item_vxlan *spec = item->spec;
> +	const struct rte_flow_item_vxlan *mask = item->mask;
> +
> +	if (!spec)
> +		/* Outer VNI is required by tunnel_key parameter. */
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> +				 "NULL VNI specification "
> +				 " for VXLAN encapsulation");
> +	if (!mask)
> +		mask = &rte_flow_item_vxlan_mask;
> +	if (mask->vni[0] != 0 ||
> +	    mask->vni[1] != 0 ||
> +	    mask->vni[2] != 0) {

can be one line.

> +		if (mask->vni[0] != 0xff ||
> +		    mask->vni[1] != 0xff ||
> +		    mask->vni[2] != 0xff)

same here.
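
As an illustration only (this is not code from the patch or from the mlx5
tree), both three-line VNI checks could also be folded into a single call.
The helper name vni_mask_classify() and the enum below are hypothetical; the
sketch assumes the VNI mask is a plain 3-byte array, as in
struct rte_flow_item_vxlan.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of the patch: classify a 24-bit VNI mask. */
enum vni_mask_kind {
	VNI_MASK_EMPTY,   /* all three bytes are 0x00 */
	VNI_MASK_FULL,    /* all three bytes are 0xff */
	VNI_MASK_PARTIAL, /* anything else */
};

static enum vni_mask_kind
vni_mask_classify(const uint8_t vni[3])
{
	static const uint8_t empty[3] = { 0x00, 0x00, 0x00 };
	static const uint8_t full[3] = { 0xff, 0xff, 0xff };

	assert(vni != NULL);
	if (!memcmp(vni, empty, sizeof(empty)))
		return VNI_MASK_EMPTY;
	if (!memcmp(vni, full, sizeof(full)))
		return VNI_MASK_FULL;
	return VNI_MASK_PARTIAL;
}
```

With such a helper each condition fits on one line, e.g.
"if (vni_mask_classify(mask->vni) == VNI_MASK_PARTIAL)".
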

> +			return rte_flow_error_set(error, ENOTSUP,
> +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				 "no support for partial mask on"
> +				 " \"vxlan.vni\" field");
> +		if (spec->vni[0] == 0 &&
> +		    spec->vni[1] == 0 &&
> +		    spec->vni[2] == 0)
> +			return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ITEM, item,
> +					  "VXLAN vni cannot be 0");

It is already checked by mlx5_flow_validate_item_vxlan().

> +	} else {
> +		return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM,
> +				 item,
> +				 "outer VNI must be specified "
> +				 "for VXLAN encapsulation");
> +	}

Already checked in mlx5_flow_validate_item_vxlan().

> +	return 0;
> +}
> +
> +/**
> + * Validate VXLAN_ENCAP action item list for E-Switch.
> + *
> + * @param[in] action
> + *   Pointer to the VXLAN_ENCAP action structure.
> + * @param[out] error
> + *   Pointer to the error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + **/
> +static int
> +flow_tcf_validate_vxlan_encap(const struct rte_flow_action *action,
> +			      struct rte_flow_error *error)
> +{
> +	const struct rte_flow_item *items;
> +	int ret;
> +	uint32_t item_flags = 0;
> +
> +	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
> +	if (!action->conf)
> +		return rte_flow_error_set
> +			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
> +			 action, "Missing VXLAN tunnel "
> +				 "action configuration");
> +	items = ((const struct rte_flow_action_vxlan_encap *)
> +					action->conf)->definition;
> +	if (!items)
> +		return rte_flow_error_set
> +			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
> +			 action, "Missing VXLAN tunnel "
> +				 "encapsulation parameters");
> +	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
> +		switch (items->type) {
> +		case RTE_FLOW_ITEM_TYPE_VOID:
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_ETH:
> +			ret = mlx5_flow_validate_item_eth(items, item_flags,
> +							  error);
> +			if (ret < 0)
> +				return ret;
> +			ret = flow_tcf_validate_vxlan_encap_eth(items, error);
> +			if (ret < 0)
> +				return ret;
> +			item_flags |= MLX5_FLOW_LAYER_OUTER_L2;
> +			break;
> +		break;
> +		case RTE_FLOW_ITEM_TYPE_IPV4:
> +			ret = mlx5_flow_validate_item_ipv4(items, item_flags,
> +							   error);
> +			if (ret < 0)
> +				return ret;
> +			ret = flow_tcf_validate_vxlan_encap_ipv4(items, error);
> +			if (ret < 0)
> +				return ret;
> +			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_IPV6:
> +			ret = mlx5_flow_validate_item_ipv6(items, item_flags,
> +							   error);
> +			if (ret < 0)
> +				return ret;
> +			ret = flow_tcf_validate_vxlan_encap_ipv6(items, error);
> +			if (ret < 0)
> +				return ret;
> +			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_UDP:
> +			ret = mlx5_flow_validate_item_udp(items, item_flags,
> +							   0xFF, error);
> +			if (ret < 0)
> +				return ret;
> +			ret = flow_tcf_validate_vxlan_encap_udp(items, error);
> +			if (ret < 0)
> +				return ret;
> +			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> +			ret = mlx5_flow_validate_item_vxlan(items,
> +							    item_flags, error);
> +			if (ret < 0)
> +				return ret;
> +			ret = flow_tcf_validate_vxlan_encap_vni(items, error);
> +			if (ret < 0)
> +				return ret;
> +			item_flags |= MLX5_FLOW_LAYER_VXLAN;
> +			break;
> +		default:
> +			return rte_flow_error_set(error, ENOTSUP,
> +					  RTE_FLOW_ERROR_TYPE_ITEM, items,
> +					  "VXLAN encap item not supported");
> +		}
> +	}
> +	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L3))
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "no outer L3 layer found"
> +					  " for VXLAN encapsulation");
> +	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP))
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "no outer L4 layer found"

L4 -> UDP?

> +					  " for VXLAN encapsulation");
> +	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "no VXLAN VNI found"
> +					  " for VXLAN encapsulation");
> +	return 0;
> +}
> +
> +/**
> + * Validate VXLAN_DECAP action outer tunnel items for E-Switch.
> + *
> + * @param[in] item_flags
> + *   Mask of provided outer tunnel parameters
> + * @param[in] ipv4
> + *   Outer IPv4 address item (if any, NULL otherwise).
> + * @param[in] ipv6
> + *   Outer IPv6 address item (if any, NULL otherwise).
> + * @param[in] udp
> + *   Outer UDP layer item (if any, NULL otherwise).
> + * @param[out] error
> + *   Pointer to the error structure.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + **/
> +static int
> +flow_tcf_validate_vxlan_decap(uint32_t item_flags,
> +			      const struct rte_flow_action *action,
> +			      const struct rte_flow_item *ipv4,
> +			      const struct rte_flow_item *ipv6,
> +			      const struct rte_flow_item *udp,
> +			      struct rte_flow_error *error)
> +{
> +	if (!ipv4 && !ipv6)
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "no outer L3 layer found"
> +					  " for VXLAN decapsulation");
> +	if (ipv4) {
> +		const struct rte_flow_item_ipv4 *spec = ipv4->spec;
> +		const struct rte_flow_item_ipv4 *mask = ipv4->mask;
> +
> +		if (!spec)
> +			/*
> +			 * Specification for L3 addresses cannot be empty
> +			 * because it is required as decap parameter.
> +			 */
> +			return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
> +				 "NULL outer L3 address specification "
> +				 " for VXLAN decapsulation");
> +		if (!mask)
> +			mask = &rte_flow_item_ipv4_mask;
> +		if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
> +			if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
> +				return rte_flow_error_set(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +					 "no support for partial mask on"
> +					 " \"ipv4.hdr.dst_addr\" field");
> +			/* More L3 address validations can be put here. */
> +		} else {
> +			/*
> +			 * Kernel uses the destination L3 address
> +			 * to determine the ingress network interface
> +			 * for traffic being decapsulated.
> +			 */
> +			return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
> +				 "outer L3 destination address must be "
> +				 "specified for VXLAN decapsulation");
> +		}
> +		/* Source L3 address is optional for decap. */
> +		if (mask->hdr.src_addr != RTE_BE32(0x00000000))
> +			if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
> +				return rte_flow_error_set(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +					 "no support for partial mask on"
> +					 " \"ipv4.hdr.src_addr\" field");
> +	} else {
> +		const struct rte_flow_item_ipv6 *spec = ipv6->spec;
> +		const struct rte_flow_item_ipv6 *mask = ipv6->mask;
> +
> +		if (!spec)
> +			/*
> +			 * Specification for L3 addresses cannot be empty
> +			 * because it is required as decap parameter.
> +			 */
> +			return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
> +				 "NULL outer L3 address specification "
> +				 " for VXLAN decapsulation");
> +		if (!mask)
> +			mask = &rte_flow_item_ipv6_mask;
> +		if (memcmp(&mask->hdr.dst_addr,
> +			   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
> +			   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
> +			if (memcmp(&mask->hdr.dst_addr,
> +				&rte_flow_item_ipv6_mask.hdr.dst_addr,
> +				sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
> +				return rte_flow_error_set(error, ENOTSUP,
> +				       RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +				       "no support for partial mask on"
> +				       " \"ipv6.hdr.dst_addr\" field");
> +		/* More L3 address validations can be put here. */
> +		} else {
> +			/*
> +			 * Kernel uses the destination L3 address
> +			 * to determine the ingress network interface
> +			 * for traffic being decapsulated.
> +			 */
> +			return rte_flow_error_set(error, EINVAL,
> +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
> +				 "outer L3 destination address must be "
> +				 "specified for VXLAN decapsulation");
> +		}
> +		/* Source L3 address is optional for decap. */
> +		if (memcmp(&mask->hdr.src_addr,
> +			   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
> +			   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
> +			if (memcmp(&mask->hdr.src_addr,
> +				   &rte_flow_item_ipv6_mask.hdr.src_addr,
> +				   sizeof(mask->hdr.src_addr)))
> +				return rte_flow_error_set(error, ENOTSUP,
> +					RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +					"no support for partial mask on"
> +					" \"ipv6.hdr.src_addr\" field");
> +		}
> +	}
> +	if (!udp) {
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "no outer L4 layer found"
> +					  " for VXLAN decapsulation");
> +	} else {
> +		const struct rte_flow_item_udp *spec = udp->spec;
> +		const struct rte_flow_item_udp *mask = udp->mask;
> +
> +		if (!spec)
> +			/*
> +			 * Specification for UDP ports cannot be empty
> +			 * because it is required as decap parameter.
> +			 */
> +			return rte_flow_error_set(error, EINVAL,
> +					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
> +					 "NULL UDP port specification "
> +					 " for VXLAN decapsulation");
> +		if (!mask)
> +			mask = &rte_flow_item_udp_mask;
> +		if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
> +			if (mask->hdr.dst_port != RTE_BE16(0xffff))
> +				return rte_flow_error_set(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +					 "no support for partial mask on"
> +					 " \"udp.hdr.dst_port\" field");
> +			if (!spec->hdr.dst_port)
> +				return rte_flow_error_set(error, EINVAL,
> +					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
> +					 "zero decap local UDP port");
> +		} else {
> +			return rte_flow_error_set(error, EINVAL,
> +					 RTE_FLOW_ERROR_TYPE_ITEM, udp,
> +					 "outer UDP destination port must be "
> +					 "specified for VXLAN decapsulation");
> +		}
> +		if (mask->hdr.src_port != RTE_BE16(0x0000)) {
> +			if (mask->hdr.src_port != RTE_BE16(0xffff))
> +				return rte_flow_error_set(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> +					 "no support for partial mask on"
> +					 " \"udp.hdr.src_port\" field");
> +			DRV_LOG(WARNING,
> +			"outer UDP local port cannot be "
> +			"forced for VXLAN encapsulation, "
> +			"parameter ignored");
> +		}
> +	}
> +	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
> +		return rte_flow_error_set(error, EINVAL,
> +					  RTE_FLOW_ERROR_TYPE_ACTION, action,
> +					  "no VXLAN VNI found"
> +					  " for VXLAN decapsulation");
> +	/* VNI is already validated, extra check can be put here. */
> +	return 0;
> +}
> +
> +/**
>   * Validate flow for E-Switch.
>   *
>   * @param[in] priv
> @@ -974,7 +1588,8 @@ struct pedit_parser {
>  		const struct rte_flow_item_ipv6 *ipv6;
>  		const struct rte_flow_item_tcp *tcp;
>  		const struct rte_flow_item_udp *udp;
> -	} spec, mask;
> +		const struct rte_flow_item_vxlan *vxlan;
> +	 } spec, mask;
>  	union {
>  		const struct rte_flow_action_port_id *port_id;
>  		const struct rte_flow_action_jump *jump;
> @@ -983,9 +1598,13 @@ struct pedit_parser {
>  			of_set_vlan_vid;
>  		const struct rte_flow_action_of_set_vlan_pcp *
>  			of_set_vlan_pcp;
> +		const struct rte_flow_action_vxlan_encap *vxlan_encap;
>  		const struct rte_flow_action_set_ipv4 *set_ipv4;
>  		const struct rte_flow_action_set_ipv6 *set_ipv6;
>  	} conf;
> +	const struct rte_flow_item *ipv4 = NULL; /* storage to check */
> +	const struct rte_flow_item *ipv6 = NULL; /* outer tunnel. */
> +	const struct rte_flow_item *udp = NULL;  /* parameters. */
>  	uint32_t item_flags = 0;
>  	uint32_t action_flags = 0;
>  	uint8_t next_protocol = -1;
> @@ -1114,7 +1733,6 @@ struct pedit_parser {
>  							   error);
>  			if (ret < 0)
>  				return ret;
> -			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
>  			mask.ipv4 = flow_tcf_item_mask
>  				(items, &rte_flow_item_ipv4_mask,
>  				 &flow_tcf_mask_supported.ipv4,
> @@ -1135,13 +1753,22 @@ struct pedit_parser {
>  				next_protocol =
>  					((const struct rte_flow_item_ipv4 *)
>  					 (items->spec))->hdr.next_proto_id;
> +			if (item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
> +				/*
> +				 * Multiple outer items are not allowed as
> +				 * tunnel parameters, will raise an error later.
> +				 */
> +				ipv4 = NULL;

Can't it be inner then?

  flow create 1 ingress transfer
    pattern eth src is 66:77:88:99:aa:bb
      dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
      udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
      eth / ipv6 / tcp dst is 42 / end
    actions vxlan_decap / port_id id 2 / end

Is this flow supported by Linux tcf? I took this example from Adrien's patch,
"[8/8] net/mlx5: add VXLAN decap support to switch flow rules". If so, shouldn't
an inner L3 layer (MLX5_FLOW_LAYER_INNER_*) be possible? If not, you should
return an error in this case. I don't see any code that checks for redundant
outer items. Did I miss something?

BTW, for the tunneled items, why don't you follow the code of Verbs
(mlx5_flow_verbs.c) and DV (mlx5_flow_dv.c)? This is the first time a tunneled
item is added to tcf, but Verbs/DV already have validation code for tunnels, so
you can reuse the existing code. In flow_tcf_validate_vxlan_decap(), not every
check is VXLAN-specific; some of them can be common code.

And if you need to know whether there is a VXLAN decap action before validating
the outer header items, you can reorder the code: action validation first and
item validation next, as there's no dependency between them yet in the current
code. Defining ipv4, ipv6 and udp seems to make the code path more complex.

For example, you can just call the VXLAN decap item validation (by splitting
flow_tcf_validate_vxlan_decap()) at this point, like:

			if (action_flags & MLX5_FLOW_ACTION_VXLAN_DECAP)
				ret = flow_tcf_validate_vxlan_decap_ipv4(...);
			...

Same for other items.
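
The reordering suggested above can be sketched as a standalone toy. Every
name below (the sk_ prefix, the enums, the flags, the single dst_addr_mask
field) is a hypothetical stand-in for the rte_flow/mlx5 definitions and not
the driver's code. It only shows the control flow: collect the action flags
first, then run the per-item decap check while walking the items, and stop
treating items as outer tunnel parameters once the VXLAN item has been seen.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for rte_flow item/action types and mlx5 flags. */
enum sk_item_type {
	SK_ITEM_END, SK_ITEM_ETH, SK_ITEM_IPV4, SK_ITEM_UDP, SK_ITEM_VXLAN,
};
enum sk_action_type {
	SK_ACTION_END, SK_ACTION_VXLAN_DECAP, SK_ACTION_PORT_ID,
};

#define SK_ACTION_FLAG_VXLAN_DECAP (1u << 0)
#define SK_ACTION_FLAG_PORT_ID     (1u << 1)

struct sk_item {
	enum sk_item_type type;
	uint32_t dst_addr_mask; /* only meaningful for SK_ITEM_IPV4 */
};

/* Decap needs a fully masked outer destination address. */
static int
sk_validate_decap_ipv4(const struct sk_item *item)
{
	return item->dst_addr_mask == UINT32_MAX ? 0 : -1;
}

static int
sk_validate(const struct sk_item *items, const enum sk_action_type *actions)
{
	uint32_t action_flags = 0;
	int tunnel_seen = 0;

	/* Pass 1: actions first, so the item pass knows about decap. */
	for (; *actions != SK_ACTION_END; actions++) {
		if (*actions == SK_ACTION_VXLAN_DECAP)
			action_flags |= SK_ACTION_FLAG_VXLAN_DECAP;
		else if (*actions == SK_ACTION_PORT_ID)
			action_flags |= SK_ACTION_FLAG_PORT_ID;
	}
	/* Pass 2: items; only items before the VXLAN item are outer ones. */
	for (; items->type != SK_ITEM_END; items++) {
		switch (items->type) {
		case SK_ITEM_IPV4:
			if (!tunnel_seen &&
			    (action_flags & SK_ACTION_FLAG_VXLAN_DECAP) &&
			    sk_validate_decap_ipv4(items) < 0)
				return -1;
			break;
		case SK_ITEM_VXLAN:
			tunnel_seen = 1;
			break;
		default:
			break;
		}
	}
	return 0;
}
```

In this arrangement no item pointers have to be stored for a later check,
which is the complexity the ipv4/ipv6/udp variables were introduced to handle.
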

> +			} else {
> +				ipv4 = items;
> +				item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> +			}
>  			break;
>  		case RTE_FLOW_ITEM_TYPE_IPV6:
>  			ret = mlx5_flow_validate_item_ipv6(items, item_flags,
>  							   error);
>  			if (ret < 0)
>  				return ret;
> -			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
>  			mask.ipv6 = flow_tcf_item_mask
>  				(items, &rte_flow_item_ipv6_mask,
>  				 &flow_tcf_mask_supported.ipv6,
> @@ -1162,13 +1789,22 @@ struct pedit_parser {
>  				next_protocol =
>  					((const struct rte_flow_item_ipv6 *)
>  					 (items->spec))->hdr.proto;
> +			if (item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV6) {
> +				/*
> +				 *Multiple outer items are not allowed as
> +				 * tunnel parameters
> +				 */
> +				ipv6 = NULL;
> +			} else {
> +				ipv6 = items;
> +				item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> +			}
>  			break;
>  		case RTE_FLOW_ITEM_TYPE_UDP:
>  			ret = mlx5_flow_validate_item_udp(items, item_flags,
>  							  next_protocol, error);
>  			if (ret < 0)
>  				return ret;
> -			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
>  			mask.udp = flow_tcf_item_mask
>  				(items, &rte_flow_item_udp_mask,
>  				 &flow_tcf_mask_supported.udp,
> @@ -1177,6 +1813,12 @@ struct pedit_parser {
>  				 error);
>  			if (!mask.udp)
>  				return -rte_errno;
> +			if (item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP) {
> +				udp = NULL;
> +			} else {
> +				udp = items;
> +				item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
> +			}
>  			break;
>  		case RTE_FLOW_ITEM_TYPE_TCP:
>  			ret = mlx5_flow_validate_item_tcp
> @@ -1186,7 +1828,6 @@ struct pedit_parser {
>  					      error);
>  			if (ret < 0)
>  				return ret;
> -			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
>  			mask.tcp = flow_tcf_item_mask
>  				(items, &rte_flow_item_tcp_mask,
>  				 &flow_tcf_mask_supported.tcp,
> @@ -1195,11 +1836,36 @@ struct pedit_parser {
>  				 error);
>  			if (!mask.tcp)
>  				return -rte_errno;
> +			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> +			ret = mlx5_flow_validate_item_vxlan(items,
> +							    item_flags, error);
> +			if (ret < 0)
> +				return ret;
> +			mask.vxlan = flow_tcf_item_mask
> +				(items, &rte_flow_item_vxlan_mask,
> +				 &flow_tcf_mask_supported.vxlan,
> +				 &flow_tcf_mask_empty.vxlan,
> +				 sizeof(flow_tcf_mask_supported.vxlan),
> +				 error);
> +			if (!mask.vxlan)
> +				return -rte_errno;
> +			if (mask.vxlan->vni[0] != 0xff ||
> +			    mask.vxlan->vni[1] != 0xff ||
> +			    mask.vxlan->vni[2] != 0xff)
> +				return rte_flow_error_set
> +					(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> +					 mask.vxlan,
> +					 "no support for partial or "
> +					 "empty mask on \"vxlan.vni\" field");
> +			item_flags |= MLX5_FLOW_LAYER_VXLAN;
>  			break;
>  		default:
>  			return rte_flow_error_set(error, ENOTSUP,
>  						  RTE_FLOW_ERROR_TYPE_ITEM,
> -						  NULL, "item not supported");
> +						  items, "item not supported");
>  		}
>  	}
>  	for (; actions->type != RTE_FLOW_ACTION_TYPE_END; actions++) {
> @@ -1271,6 +1937,33 @@ struct pedit_parser {
>  					 " set action must follow push action");
>  			current_action_flag = MLX5_FLOW_ACTION_OF_SET_VLAN_PCP;
>  			break;
> +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> +			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> +					   | MLX5_ACTION_VXLAN_DECAP))
> +				return rte_flow_error_set
> +					(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ACTION, actions,
> +					 "can't have multiple vxlan actions");
> +			ret = flow_tcf_validate_vxlan_encap(actions, error);
> +			if (ret < 0)
> +				return ret;
> +			action_flags |= MLX5_ACTION_VXLAN_ENCAP;

Recently, current_action_flag has been added for PEDIT actions. Please refer to
the code above and make it compliant.
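
A rough sketch of what "compliant" could look like, under assumptions: the
quoted code confirms only that each case assigns current_action_flag; the
common check after the switch and all names below (tf_ prefix, flag values)
are hypothetical and are not copied from mlx5_flow_tcf.c.

```c
#include <stdint.h>

/* Hypothetical action types and flags, for illustration only. */
enum tf_action_type {
	TF_ACTION_END, TF_ACTION_PORT_ID,
	TF_ACTION_VXLAN_ENCAP, TF_ACTION_VXLAN_DECAP,
};

#define TF_FLAG_PORT_ID     (1u << 0)
#define TF_FLAG_VXLAN_ENCAP (1u << 1)
#define TF_FLAG_VXLAN_DECAP (1u << 2)
#define TF_FLAG_VXLAN_ANY   (TF_FLAG_VXLAN_ENCAP | TF_FLAG_VXLAN_DECAP)

static int
tf_collect_action_flags(const enum tf_action_type *actions, uint32_t *out)
{
	uint32_t action_flags = 0;

	for (; *actions != TF_ACTION_END; actions++) {
		uint32_t current_action_flag = 0;

		/* Each case only records which action it is. */
		switch (*actions) {
		case TF_ACTION_PORT_ID:
			current_action_flag = TF_FLAG_PORT_ID;
			break;
		case TF_ACTION_VXLAN_ENCAP:
			current_action_flag = TF_FLAG_VXLAN_ENCAP;
			break;
		case TF_ACTION_VXLAN_DECAP:
			current_action_flag = TF_FLAG_VXLAN_DECAP;
			break;
		default:
			return -1; /* action not supported */
		}
		/* Common check: at most one VXLAN action per rule. */
		if ((current_action_flag & TF_FLAG_VXLAN_ANY) &&
		    (action_flags & TF_FLAG_VXLAN_ANY))
			return -1;
		/* Flags are accumulated in one place, after the switch. */
		action_flags |= current_action_flag;
	}
	*out = action_flags;
	return 0;
}
```

The point of the pattern is that the "multiple vxlan actions" test is written
once after the switch instead of being repeated in both the ENCAP and DECAP
cases.
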

> +			break;
> +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> +			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> +					   | MLX5_ACTION_VXLAN_DECAP))
> +				return rte_flow_error_set
> +					(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ACTION, actions,
> +					 "can't have multiple vxlan actions");
> +			ret = flow_tcf_validate_vxlan_decap(item_flags,
> +							    actions,
> +							    ipv4, ipv6, udp,
> +							    error);
> +			if (ret < 0)
> +				return ret;
> +			action_flags |= MLX5_ACTION_VXLAN_DECAP;
> +			break;
>  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
>  			current_action_flag = MLX5_FLOW_ACTION_SET_IPV4_SRC;
>  			break;
> @@ -1391,6 +2084,12 @@ struct pedit_parser {
>  		return rte_flow_error_set(error, EINVAL,
>  					  RTE_FLOW_ERROR_TYPE_ACTION, actions,
>  					  "no fate action is found");
> +	if ((item_flags & MLX5_FLOW_LAYER_VXLAN) &&
> +	    !(action_flags & MLX5_ACTION_VXLAN_DECAP))
> +		return rte_flow_error_set(error, ENOTSUP,
> +					 RTE_FLOW_ERROR_TYPE_ACTION, NULL,
> +					 "VNI pattern should be followed "
> +					 " by VXLAN_DECAP action");
>  	return 0;
>  }
>  


* Re: [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine Viacheslav Ovsiienko
@ 2018-10-23 10:06     ` Yongseok Koh
  2018-10-25 14:37       ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-23 10:06 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 15, 2018 at 02:13:31PM +0000, Viacheslav Ovsiienko wrote:
> This part of patchset adds support of VXLAN-related items and
> actions to the flow translation routine. If some of them are
> specified in the rule, the extra space for tunnel description
> structure is allocated. Later some tunnel types, other than
> VXLAN can be added (GRE). No VTEP devices are created at this
> point, the flow rule is just translated, not applied yet.
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_flow_tcf.c | 641 +++++++++++++++++++++++++++++++++++----
>  1 file changed, 578 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index 0055417..660d45e 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -2094,6 +2094,265 @@ struct pedit_parser {
>  }
>  
>  /**
> + * Helper function to process RTE_FLOW_ITEM_TYPE_ETH entry in configuration
> + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the MAC address fields
> + * in the encapsulation parameters structure. The item must be prevalidated,
> + * no validation checks are performed by the function.
> + *
> + * @param[in] spec
> + *   RTE_FLOW_ITEM_TYPE_ETH entry specification.
> + * @param[in] mask
> + *   RTE_FLOW_ITEM_TYPE_ETH entry mask.
> + * @param[out] encap
> + *   Structure to fill the gathered MAC address data.
> + *
> + * @return
> + *   The size needed the Netlink message tunnel_key
> + *   parameter buffer to store the item attributes.
> + */
> +static int
> +flow_tcf_parse_vxlan_encap_eth(const struct rte_flow_item_eth *spec,
> +			       const struct rte_flow_item_eth *mask,
> +			       struct mlx5_flow_tcf_vxlan_encap *encap)
> +{
> +	/* Item must be validated before. No redundant checks. */
> +	assert(spec);
> +	if (!mask || !memcmp(&mask->dst,
> +			     &rte_flow_item_eth_mask.dst,
> +			     sizeof(rte_flow_item_eth_mask.dst))) {
> +		/*
> +		 * Ethernet addresses are not supported by
> +		 * tc as tunnel_key parameters. Destination
> +		 * address is needed to form encap packet
> +		 * header and retrieved by kernel from
> +		 * implicit sources (ARP table, etc),
> +		 * address masks are not supported at all.
> +		 */
> +		encap->eth.dst = spec->dst;
> +		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_DST;
> +	}
> +	if (!mask || !memcmp(&mask->src,
> +			     &rte_flow_item_eth_mask.src,
> +			     sizeof(rte_flow_item_eth_mask.src))) {
> +		/*
> +		 * Ethernet addresses are not supported by
> +		 * tc as tunnel_key parameters. Source ethernet
> +		 * address is ignored anyway.
> +		 */
> +		encap->eth.src = spec->src;
> +		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_SRC;
> +	}
> +	/*
> +	 * No space allocated for ethernet addresses within Netlink
> +	 * message tunnel_key record - these ones are not
> +	 * supported by tc.
> +	 */
> +	return 0;
> +}
> +
> +/**
> + * Helper function to process RTE_FLOW_ITEM_TYPE_IPV4 entry in configuration
> + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV4 address fields
> + * in the encapsulation parameters structure. The item must be prevalidated,
> + * no any validation checks performed by function.
> + *
> + * @param[in] spec
> + *   RTE_FLOW_ITEM_TYPE_IPV4 entry specification.
> + * @param[out] encap
> + *   Structure to fill the gathered IPV4 address data.
> + *
> + * @return
> + *   The size needed the Netlink message tunnel_key
> + *   parameter buffer to store the item attributes.
> + */
> +static int
> +flow_tcf_parse_vxlan_encap_ipv4(const struct rte_flow_item_ipv4 *spec,
> +				struct mlx5_flow_tcf_vxlan_encap *encap)
> +{
> +	/* Item must be validated before. No redundant checks. */
> +	assert(spec);
> +	encap->ipv4.dst = spec->hdr.dst_addr;
> +	encap->ipv4.src = spec->hdr.src_addr;
> +	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV4_SRC |
> +		       MLX5_FLOW_TCF_ENCAP_IPV4_DST;
> +	return 2 * SZ_NLATTR_TYPE_OF(uint32_t);
> +}
> +
> +/**
> + * Helper function to process RTE_FLOW_ITEM_TYPE_IPV6 entry in configuration
> + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV6 address fields
> + * in the encapsulation parameters structure. The item must be prevalidated,
> + * no any validation checks performed by function.
> + *
> + * @param[in] spec
> + *   RTE_FLOW_ITEM_TYPE_IPV6 entry specification.
> + * @param[out] encap
> + *   Structure to fill the gathered IPV6 address data.
> + *
> + * @return
> + *   The size needed the Netlink message tunnel_key
> + *   parameter buffer to store the item attributes.
> + */
> +static int
> +flow_tcf_parse_vxlan_encap_ipv6(const struct rte_flow_item_ipv6 *spec,
> +				struct mlx5_flow_tcf_vxlan_encap *encap)
> +{
> +	/* Item must be validated before. No redundant checks. */
> +	assert(spec);
> +	memcpy(encap->ipv6.dst, spec->hdr.dst_addr, sizeof(encap->ipv6.dst));
> +	memcpy(encap->ipv6.src, spec->hdr.src_addr, sizeof(encap->ipv6.src));
> +	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV6_SRC |
> +		       MLX5_FLOW_TCF_ENCAP_IPV6_DST;
> +	return SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN) * 2;
> +}
> +
> +/**
> + * Helper function to process RTE_FLOW_ITEM_TYPE_UDP entry in configuration
> + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the UDP port fields
> + * in the encapsulation parameters structure. The item must be prevalidated,
> + * no any validation checks performed by function.
> + *
> + * @param[in] spec
> + *   RTE_FLOW_ITEM_TYPE_UDP entry specification.
> + * @param[in] mask
> + *   RTE_FLOW_ITEM_TYPE_UDP entry mask.
> + * @param[out] encap
> + *   Structure to fill the gathered UDP port data.
> + *
> + * @return
> + *   The size needed the Netlink message tunnel_key
> + *   parameter buffer to store the item attributes.
> + */
> +static int
> +flow_tcf_parse_vxlan_encap_udp(const struct rte_flow_item_udp *spec,
> +			       const struct rte_flow_item_udp *mask,
> +			       struct mlx5_flow_tcf_vxlan_encap *encap)
> +{
> +	int size = SZ_NLATTR_TYPE_OF(uint16_t);
> +
> +	assert(spec);
> +	encap->udp.dst = spec->hdr.dst_port;
> +	encap->mask |= MLX5_FLOW_TCF_ENCAP_UDP_DST;
> +	if (!mask || mask->hdr.src_port != RTE_BE16(0x0000)) {
> +		encap->udp.src = spec->hdr.src_port;
> +		size += SZ_NLATTR_TYPE_OF(uint16_t);
> +		encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV4_SRC;
> +	}
> +	return size;
> +}
> +
> +/**
> + * Helper function to process RTE_FLOW_ITEM_TYPE_VXLAN entry in configuration
> + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the VNI fields
> + * in the encapsulation parameters structure. The item must be prevalidated,
> + * no any validation checks performed by function.
> + *
> + * @param[in] spec
> + *   RTE_FLOW_ITEM_TYPE_VXLAN entry specification.
> + * @param[out] encap
> + *   Structure to fill the gathered VNI address data.
> + *
> + * @return
> + *   The size needed the Netlink message tunnel_key
> + *   parameter buffer to store the item attributes.
> + */
> +static int
> +flow_tcf_parse_vxlan_encap_vni(const struct rte_flow_item_vxlan *spec,
> +			       struct mlx5_flow_tcf_vxlan_encap *encap)
> +{
> +	/* Item must be validated before. Do not redundant checks. */
> +	assert(spec);
> +	memcpy(encap->vxlan.vni, spec->vni, sizeof(encap->vxlan.vni));
> +	encap->mask |= MLX5_FLOW_TCF_ENCAP_VXLAN_VNI;
> +	return SZ_NLATTR_TYPE_OF(uint32_t);
> +}
> +
> +/**
> + * Populate consolidated encapsulation object from list of pattern items.
> + *
> + * Helper function to process configuration of action such as
> + * RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. The item list should be
> + * validated, there is no way to return an meaningful error.
> + *
> + * @param[in] action
> + *   RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP action object.
> + *   List of pattern items to gather data from.
> + * @param[out] src
> + *   Structure to fill gathered data.
> + *
> + * @return
> + *   The size the part of Netlink message buffer to store the item
> + *   attributes on success, zero otherwise. The mask field in
> + *   result structure reflects correctly parsed items.
> + */
> +static int
> +flow_tcf_vxlan_encap_parse(const struct rte_flow_action *action,
> +			   struct mlx5_flow_tcf_vxlan_encap *encap)
> +{
> +	union {
> +		const struct rte_flow_item_eth *eth;
> +		const struct rte_flow_item_ipv4 *ipv4;
> +		const struct rte_flow_item_ipv6 *ipv6;
> +		const struct rte_flow_item_udp *udp;
> +		const struct rte_flow_item_vxlan *vxlan;
> +	} spec, mask;
> +	const struct rte_flow_item *items;
> +	int size = 0;
> +
> +	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
> +	assert(action->conf);
> +
> +	items = ((const struct rte_flow_action_vxlan_encap *)
> +					action->conf)->definition;
> +	assert(items);
> +	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
> +		switch (items->type) {
> +		case RTE_FLOW_ITEM_TYPE_VOID:
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_ETH:
> +			mask.eth = items->mask;
> +			spec.eth = items->spec;
> +			size += flow_tcf_parse_vxlan_encap_eth(spec.eth,
> +							       mask.eth,
> +							       encap);
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_IPV4:
> +			spec.ipv4 = items->spec;
> +			size += flow_tcf_parse_vxlan_encap_ipv4(spec.ipv4,
> +								encap);
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_IPV6:
> +			spec.ipv6 = items->spec;
> +			size += flow_tcf_parse_vxlan_encap_ipv6(spec.ipv6,
> +								encap);
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_UDP:
> +			mask.udp = items->mask;
> +			spec.udp = items->spec;
> +			size += flow_tcf_parse_vxlan_encap_udp(spec.udp,
> +							       mask.udp,
> +							       encap);
> +			break;
> +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> +			spec.vxlan = items->spec;
> +			size += flow_tcf_parse_vxlan_encap_vni(spec.vxlan,
> +							       encap);
> +			break;
> +		default:
> +			assert(false);
> +			DRV_LOG(WARNING,
> +				"unsupported item %p type %d,"
> +				" items must be validated"
> +				" before flow creation",
> +				(const void *)items, items->type);
> +			encap->mask = 0;
> +			return 0;
> +		}
> +	}
> +	return size;
> +}
> +
> +/**
>   * Calculate maximum size of memory for flow items of Linux TC flower and
>   * extract specified items.
>   *
> @@ -2148,7 +2407,7 @@ struct pedit_parser {
>  		case RTE_FLOW_ITEM_TYPE_IPV6:
>  			size += SZ_NLATTR_TYPE_OF(uint16_t) + /* Ether type. */
>  				SZ_NLATTR_TYPE_OF(uint8_t) + /* IP proto. */
> -				SZ_NLATTR_TYPE_OF(IPV6_ADDR_LEN) * 4;
> +				SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN) * 4;
>  				/* dst/src IP addr and mask. */
>  			flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
>  			break;
> @@ -2164,6 +2423,10 @@ struct pedit_parser {
>  				/* dst/src port and mask. */
>  			flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
>  			break;
> +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> +			size += SZ_NLATTR_TYPE_OF(uint32_t);
> +			flags |= MLX5_FLOW_LAYER_VXLAN;
> +			break;
>  		default:
>  			DRV_LOG(WARNING,
>  				"unsupported item %p type %d,"
> @@ -2184,13 +2447,16 @@ struct pedit_parser {
>   *   Pointer to the list of actions.
>   * @param[out] action_flags
>   *   Pointer to the detected actions.
> + * @param[out] tunnel
> + *   Pointer to tunnel encapsulation parameters structure to fill.
>   *
>   * @return
>   *   Maximum size of memory for actions.
>   */
>  static int
>  flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
> -			      uint64_t *action_flags)
> +			      uint64_t *action_flags,
> +			      void *tunnel)

This function is meant to get actions and size, but you are also parsing and
filling tunnel info here. It would be better to move the parsing to translate(),
because it already has multiple if conditions (same as switch/case) to set
TCA_TUNNEL_KEY_ENC_* there.

>  {
>  	int size = 0;
>  	uint64_t flags = 0;
> @@ -2246,6 +2512,29 @@ struct pedit_parser {
>  				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID. */
>  				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio. */
>  			break;
> +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> +			size += SZ_NLATTR_NEST + /* na_act_index. */
> +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
> +				SZ_NLATTR_TYPE_OF(uint8_t);
> +			size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> +			size +=	flow_tcf_vxlan_encap_parse(actions, tunnel) +
> +				RTE_ALIGN_CEIL /* preceding encap params. */
> +				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> +				MNL_ALIGNTO);

Is it different from SZ_NLATTR_TYPE_OF(struct mlx5_flow_tcf_vxlan_encap)? Or,
use __rte_aligned(MNL_ALIGNTO) instead.
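
To illustrate the second suggestion, here is a minimal sketch in plain C. It
assumes a GCC/Clang-style aligned attribute, which is the kind of attribute
__rte_aligned() is expected to wrap. All demo_* and DEMO_* names are
hypothetical stand-ins, not identifiers from the patch. Once the struct itself
carries the alignment, its sizeof() is already a multiple of that alignment, so
the manual round-up at each use site gives the same number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for MNL_ALIGNTO and RTE_ALIGN_CEIL. */
#define DEMO_ALIGNTO 4
#define DEMO_ALIGN_CEIL(v, a) ((((v) + (a) - 1) / (a)) * (a))

/* Byte-only layout: the natural size need not be a multiple of 4. */
struct demo_params_plain {
	uint8_t type;
	uint8_t vni[3];
	uint8_t flags;
};

/* Same fields, but the type itself is declared 4-byte aligned. */
struct demo_params_aligned {
	uint8_t type;
	uint8_t vni[3];
	uint8_t flags;
} __attribute__((aligned(DEMO_ALIGNTO)));

/* Size obtained by rounding up the plain struct by hand. */
static size_t
demo_manual_size(void)
{
	return DEMO_ALIGN_CEIL(sizeof(struct demo_params_plain),
			       (size_t)DEMO_ALIGNTO);
}

/* Size obtained for free from the aligned type. */
static size_t
demo_attr_size(void)
{
	return sizeof(struct demo_params_aligned);
}
```

With the attribute on the definition, every sizeof() in prepare() and in the
size calculation stays consistent without repeating the rounding macro.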

> +			flags |= MLX5_ACTION_VXLAN_ENCAP;
> +			break;
> +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> +			size += SZ_NLATTR_NEST + /* na_act_index. */
> +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
> +				SZ_NLATTR_TYPE_OF(uint8_t);
> +			size +=	SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> +			size +=	RTE_ALIGN_CEIL /* preceding decap params. */
> +				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> +				MNL_ALIGNTO);

Same here.

> +			flags |= MLX5_ACTION_VXLAN_DECAP;
> +			break;
>  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
>  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
>  		case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
> @@ -2289,6 +2578,26 @@ struct pedit_parser {
>  }
>  
>  /**
> + * Convert VXLAN VNI to 32-bit integer.
> + *
> + * @param[in] vni
> + *   VXLAN VNI in 24-bit wire format.
> + *
> + * @return
> + *   VXLAN VNI as a 32-bit integer value in network endian.
> + */
> +static rte_be32_t

make it inline.

> +vxlan_vni_as_be32(const uint8_t vni[3])
> +{
> +	rte_be32_t ret;

Why is ret defined as rte_be32_t? The return value of this function, which is
bswap(ret), is also rte_be32_t??

> +
> +	ret = vni[0];
> +	ret = (ret << 8) | vni[1];
> +	ret = (ret << 8) | vni[2];
> +	return RTE_BE32(ret);

Use rte_cpu_to_be_*() instead. But I still don't understand why you shuffle
bytes twice: once with shift and OR, and again with bswap().

{
	union {
		uint8_t vni[4];
		rte_be32_t dword;
	} ret = {
		.vni = { 0, vni[0], vni[1], vni[2] },
	};
	return ret.dword;
}

This will have the same result without extra cost.
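
A standalone sketch of the two variants, for comparison. It uses htonl() from
<arpa/inet.h> in place of the DPDK byte-order helpers so it builds without DPDK,
and the demo_* names are hypothetical. Both functions should return a value
whose in-memory bytes are 00, vni[0], vni[1], vni[2] regardless of host
endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Variant A: build the host-order value with shifts, then convert once. */
static inline uint32_t
demo_vni_shift(const uint8_t vni[3])
{
	uint32_t host = ((uint32_t)vni[0] << 16) |
			((uint32_t)vni[1] << 8) |
			(uint32_t)vni[2];

	return htonl(host);
}

/* Variant B: lay the bytes out directly in wire order, no swap needed. */
static inline uint32_t
demo_vni_union(const uint8_t vni[3])
{
	union {
		uint8_t bytes[4];
		uint32_t dword;
	} ret = {
		.bytes = { 0, vni[0], vni[1], vni[2] },
	};

	return ret.dword;
}

/* Check that a 32-bit value holds the VNI in wire (big-endian) byte order. */
static inline int
demo_vni_is_wire_order(uint32_t be, const uint8_t vni[3])
{
	uint8_t raw[4];

	memcpy(raw, &be, sizeof(raw));
	return raw[0] == 0 && raw[1] == vni[0] &&
	       raw[2] == vni[1] && raw[3] == vni[2];
}
```

Variant B never interprets the bytes as a host integer, which is why it needs
no conversion step at all.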

> +}
> +
> +/**
>   * Prepare a flow object for Linux TC flower. It calculates the maximum size of
>   * memory required, allocates the memory, initializes Netlink message headers
>   * and set unique TC message handle.
> @@ -2323,22 +2632,54 @@ struct pedit_parser {
>  	struct mlx5_flow *dev_flow;
>  	struct nlmsghdr *nlh;
>  	struct tcmsg *tcm;
> +	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
> +	uint8_t *sp, *tun = NULL;
>  
>  	size += flow_tcf_get_items_and_size(attr, items, item_flags);
> -	size += flow_tcf_get_actions_and_size(actions, action_flags);
> -	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
> +	size += flow_tcf_get_actions_and_size(actions, action_flags, &encap);
> +	dev_flow = rte_zmalloc(__func__, size,
> +			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
> +				(size_t)MNL_ALIGNTO));

Why RTE_MAX between the two? Note that this is the alignment of the start
address of the memory, and the minimum alignment is the cacheline size. On x86,
any non-zero value less than 64 gives the same result as 64.

>  	if (!dev_flow) {
>  		rte_flow_error_set(error, ENOMEM,
>  				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
>  				   "not enough memory to create E-Switch flow");
>  		return NULL;
>  	}
> -	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
> +	sp = (uint8_t *)(dev_flow + 1);
> +	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
> +		tun = sp;
> +		sp += RTE_ALIGN_CEIL
> +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> +			MNL_ALIGNTO);

And why should it be aligned? As the size of dev_flow might not be aligned, it
is meaningless, isn't it? If you think it must be aligned for better
performance (not much anyway), you can use __rte_aligned(MNL_ALIGNTO) on the
struct definition but not for mlx5_flow (it's not only for tcf, have to do it
manually).

> +		size -= RTE_ALIGN_CEIL
> +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> +			MNL_ALIGNTO);

Don't you have to subtract sizeof(struct mlx5_flow) as well? But like I
mentioned, if '.nlsize' below isn't needed, you don't need to have this
calculation either.
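
To make the accounting concrete, here is a rough sketch of the layout as I read
it: one allocation holding the flow object, then the optional tunnel block, then
the Netlink message. The struct definitions and demo_* names are hypothetical
stand-ins, not the real mlx5 types. The room left for the Netlink message is the
total minus both preceding parts, not only the tunnel block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for MNL_ALIGNTO / MNL_ALIGN(). */
#define DEMO_ALIGNTO 4U
#define DEMO_ALIGN(len) \
	(((size_t)(len) + DEMO_ALIGNTO - 1) & ~(size_t)(DEMO_ALIGNTO - 1))

/* Stand-in for struct mlx5_flow, placed at the start of the allocation. */
struct demo_flow {
	void *nlh;
	void *tcm;
	size_t nlsize;
};

/* Stand-in for the tunnel parameter block that follows the flow object. */
struct demo_tunnel {
	uint32_t type;
	uint8_t vni[3];
};

/*
 * Bytes left for the Netlink message when a single allocation of 'total'
 * bytes holds the flow object, the optional tunnel block and the message.
 */
static size_t
demo_nl_room(size_t total, int has_tunnel)
{
	size_t used = sizeof(struct demo_flow);

	if (has_tunnel)
		used += DEMO_ALIGN(sizeof(struct demo_tunnel));
	return total > used ? total - used : 0;
}
```

If only the tunnel block is subtracted, the stored size overstates the message
room by sizeof(struct demo_flow), and a bound check against it becomes weaker
than intended.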

> +		encap.hdr.type = MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
> +		memcpy(tun, &encap,
> +		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
> +	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
> +		tun = sp;
> +		sp += RTE_ALIGN_CEIL
> +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> +			MNL_ALIGNTO);
> +		size -= RTE_ALIGN_CEIL
> +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> +			MNL_ALIGNTO);
> +		encap.hdr.type = MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
> +		memcpy(tun, &encap,
> +		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
> +	}
> +	nlh = mnl_nlmsg_put_header(sp);
>  	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
>  	*dev_flow = (struct mlx5_flow){
>  		.tcf = (struct mlx5_flow_tcf){
> +			.nlsize = size,
>  			.nlh = nlh,
>  			.tcm = tcm,
> +			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
> +			.item_flags = *item_flags,
> +			.action_flags = *action_flags,
>  		},
>  	};
>  	/*
> @@ -2392,6 +2733,7 @@ struct pedit_parser {
>  		const struct rte_flow_item_ipv6 *ipv6;
>  		const struct rte_flow_item_tcp *tcp;
>  		const struct rte_flow_item_udp *udp;
> +		const struct rte_flow_item_vxlan *vxlan;
>  	} spec, mask;
>  	union {
>  		const struct rte_flow_action_port_id *port_id;
> @@ -2402,6 +2744,14 @@ struct pedit_parser {
>  		const struct rte_flow_action_of_set_vlan_pcp *
>  			of_set_vlan_pcp;
>  	} conf;
> +	union {
> +		struct mlx5_flow_tcf_tunnel_hdr *hdr;
> +		struct mlx5_flow_tcf_vxlan_decap *vxlan;
> +	} decap;
> +	union {
> +		struct mlx5_flow_tcf_tunnel_hdr *hdr;
> +		struct mlx5_flow_tcf_vxlan_encap *vxlan;
> +	} encap;
>  	struct flow_tcf_ptoi ptoi[PTOI_TABLE_SZ_MAX(dev)];
>  	struct nlmsghdr *nlh = dev_flow->tcf.nlh;
>  	struct tcmsg *tcm = dev_flow->tcf.tcm;
> @@ -2418,6 +2768,12 @@ struct pedit_parser {
>  
>  	claim_nonzero(flow_tcf_build_ptoi_table(dev, ptoi,
>  						PTOI_TABLE_SZ_MAX(dev)));
> +	encap.hdr = NULL;
> +	decap.hdr = NULL;
> +	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_ENCAP)

dev_flow->flow->actions already has it.

> +		encap.vxlan = dev_flow->tcf.vxlan_encap;
> +	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_DECAP)
> +		decap.vxlan = dev_flow->tcf.vxlan_decap;
>  	nlh = dev_flow->tcf.nlh;
>  	tcm = dev_flow->tcf.tcm;
>  	/* Prepare API must have been called beforehand. */
> @@ -2435,7 +2791,6 @@ struct pedit_parser {
>  		mnl_attr_put_u32(nlh, TCA_CHAIN, attr->group);
>  	mnl_attr_put_strz(nlh, TCA_KIND, "flower");
>  	na_flower = mnl_attr_nest_start(nlh, TCA_OPTIONS);
> -	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS, TCA_CLS_FLAGS_SKIP_SW);

Why do you move it? You already know whether it is vxlan decap at this point.

>  	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
>  		unsigned int i;
>  
> @@ -2479,6 +2834,12 @@ struct pedit_parser {
>  						 spec.eth->type);
>  				eth_type_set = 1;
>  			}
> +			/*
> +			 * L2 addresses/masks should  be sent anyway,
> +			 * including VXLAN encap/decap cases, sometimes

"sometimes" sounds like a bug. Did you figure out why it is inconsistent?

> +			 * kernel returns an error if no L2 address
> +			 * provided and skip_sw flag is set
> +			 */
>  			if (!is_zero_ether_addr(&mask.eth->dst)) {
>  				mnl_attr_put(nlh, TCA_FLOWER_KEY_ETH_DST,
>  					     ETHER_ADDR_LEN,
> @@ -2495,8 +2856,19 @@ struct pedit_parser {
>  					     ETHER_ADDR_LEN,
>  					     mask.eth->src.addr_bytes);
>  			}
> -			break;
> +			if (decap.hdr) {
> +				DRV_LOG(INFO,
> +				"ethernet addresses are treated "
> +				"as inner ones for tunnel decapsulation");
> +			}

I know there's no enc_[src|dst]_mac in tc flower command but can you clarify
more about this? Let me take an example.

  flow create 1 ingress transfer
    pattern eth src is 66:77:88:99:aa:bb
      dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
      udp src is 4789 dst is 4242 / vxlan vni is 0x112233 / end
    actions vxlan_decap / port_id id 2 / end

In this case, will the mac addrs specified above be regarded as inner mac addrs?
That sounds very weird. If inner mac addrs have to be specified it should be:

  flow create 1 ingress transfer
    pattern eth / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
      udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
      eth src is 66:77:88:99:aa:bb dst is 00:11:22:33:44:55 / end
    actions vxlan_decap / port_id id 2 / end

Isn't it?

Also, I hope this will be in validate() as well.

> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> +		break;
>  		case RTE_FLOW_ITEM_TYPE_VLAN:
> +			if (encap.hdr || decap.hdr)
> +				return rte_flow_error_set(error, ENOTSUP,
> +					  RTE_FLOW_ERROR_TYPE_ITEM, NULL,
> +					  "outer VLAN is not "
> +					  "supported for tunnels");

This should be moved to validate(). And the error message sounds like inner vlan
is allowed, doesn't it? Even if it is moved to validate(), there's no way to
distinguish between outer vlan and inner vlan in your code. A bit confusing.
Please clarify.

>  			item_flags |= MLX5_FLOW_LAYER_OUTER_VLAN;
>  			mask.vlan = flow_tcf_item_mask
>  				(items, &rte_flow_item_vlan_mask,
> @@ -2528,6 +2900,7 @@ struct pedit_parser {
>  						 rte_be_to_cpu_16
>  						 (spec.vlan->tci &
>  						  RTE_BE16(0x0fff)));
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
>  		case RTE_FLOW_ITEM_TYPE_IPV4:
>  			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> @@ -2538,36 +2911,53 @@ struct pedit_parser {
>  				 sizeof(flow_tcf_mask_supported.ipv4),
>  				 error);
>  			assert(mask.ipv4);
> -						 vlan_present ?
> -						 TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> -						 TCA_FLOWER_KEY_ETH_TYPE,
> -						 RTE_BE16(ETH_P_IP));
> -			eth_type_set = 1;
> -			vlan_eth_type_set = 1;
> -			if (mask.ipv4 == &flow_tcf_mask_empty.ipv4)
> -				break;
>  			spec.ipv4 = items->spec;
> -			if (mask.ipv4->hdr.next_proto_id) {
> -				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
> +			if (!decap.vxlan) {
> +				if (!eth_type_set || !vlan_eth_type_set) {
> +					mnl_attr_put_u16(nlh,
> +						vlan_present ?
> +						TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> +						TCA_FLOWER_KEY_ETH_TYPE,
> +						RTE_BE16(ETH_P_IP));
> +				}
> +				eth_type_set = 1;
> +				vlan_eth_type_set = 1;
> +				if (mask.ipv4 == &flow_tcf_mask_empty.ipv4)
> +					break;
> +				if (mask.ipv4->hdr.next_proto_id) {
> +					mnl_attr_put_u8
> +						(nlh, TCA_FLOWER_KEY_IP_PROTO,
>  						spec.ipv4->hdr.next_proto_id);
> -				ip_proto_set = 1;
> +					ip_proto_set = 1;
> +				}
> +			} else {
> +				assert(mask.ipv4 != &flow_tcf_mask_empty.ipv4);
>  			}
>  			if (mask.ipv4->hdr.src_addr) {
> -				mnl_attr_put_u32(nlh, TCA_FLOWER_KEY_IPV4_SRC,
> -						 spec.ipv4->hdr.src_addr);
> -				mnl_attr_put_u32(nlh,
> -						 TCA_FLOWER_KEY_IPV4_SRC_MASK,
> -						 mask.ipv4->hdr.src_addr);
> +				mnl_attr_put_u32
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_IPV4_SRC :
> +					 TCA_FLOWER_KEY_IPV4_SRC,
> +					 spec.ipv4->hdr.src_addr);
> +				mnl_attr_put_u32
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK :
> +					 TCA_FLOWER_KEY_IPV4_SRC_MASK,
> +					 mask.ipv4->hdr.src_addr);
>  			}
>  			if (mask.ipv4->hdr.dst_addr) {
> -				mnl_attr_put_u32(nlh, TCA_FLOWER_KEY_IPV4_DST,
> -						 spec.ipv4->hdr.dst_addr);
> -				mnl_attr_put_u32(nlh,
> -						 TCA_FLOWER_KEY_IPV4_DST_MASK,
> -						 mask.ipv4->hdr.dst_addr);
> +				mnl_attr_put_u32
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_IPV4_DST :
> +					 TCA_FLOWER_KEY_IPV4_DST,
> +					 spec.ipv4->hdr.dst_addr);
> +				mnl_attr_put_u32
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_IPV4_DST_MASK :
> +					 TCA_FLOWER_KEY_IPV4_DST_MASK,
> +					 mask.ipv4->hdr.dst_addr);
>  			}
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
>  		case RTE_FLOW_ITEM_TYPE_IPV6:
>  			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> @@ -2578,38 +2968,53 @@ struct pedit_parser {
>  				 sizeof(flow_tcf_mask_supported.ipv6),
>  				 error);
>  			assert(mask.ipv6);
> -			if (!eth_type_set || !vlan_eth_type_set)
> -				mnl_attr_put_u16(nlh,
> -						 vlan_present ?
> -						 TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> -						 TCA_FLOWER_KEY_ETH_TYPE,
> -						 RTE_BE16(ETH_P_IPV6));
> -			eth_type_set = 1;
> -			vlan_eth_type_set = 1;
> -			if (mask.ipv6 == &flow_tcf_mask_empty.ipv6)
> -				break;
>  			spec.ipv6 = items->spec;
> -			if (mask.ipv6->hdr.proto) {
> -				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
> -						spec.ipv6->hdr.proto);
> -				ip_proto_set = 1;
> +			if (!decap.vxlan) {
> +				if (!eth_type_set || !vlan_eth_type_set) {
> +					mnl_attr_put_u16(nlh,
> +						vlan_present ?
> +						TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> +						TCA_FLOWER_KEY_ETH_TYPE,
> +						RTE_BE16(ETH_P_IPV6));
> +				}
> +				eth_type_set = 1;
> +				vlan_eth_type_set = 1;
> +				if (mask.ipv6 == &flow_tcf_mask_empty.ipv6)
> +					break;
> +				if (mask.ipv6->hdr.proto) {
> +					mnl_attr_put_u8
> +						(nlh, TCA_FLOWER_KEY_IP_PROTO,
> +						 spec.ipv6->hdr.proto);
> +					ip_proto_set = 1;
> +				}
> +			} else {
> +				assert(mask.ipv6 != &flow_tcf_mask_empty.ipv6);
>  			}
>  			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6->hdr.src_addr)) {
> -				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_SRC,
> +				mnl_attr_put(nlh, decap.vxlan ?
> +					     TCA_FLOWER_KEY_ENC_IPV6_SRC :
> +					     TCA_FLOWER_KEY_IPV6_SRC,
>  					     sizeof(spec.ipv6->hdr.src_addr),
>  					     spec.ipv6->hdr.src_addr);
> -				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_SRC_MASK,
> +				mnl_attr_put(nlh, decap.vxlan ?
> +					     TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK :
> +					     TCA_FLOWER_KEY_IPV6_SRC_MASK,
>  					     sizeof(mask.ipv6->hdr.src_addr),
>  					     mask.ipv6->hdr.src_addr);
>  			}
>  			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6->hdr.dst_addr)) {
> -				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_DST,
> +				mnl_attr_put(nlh, decap.vxlan ?
> +					     TCA_FLOWER_KEY_ENC_IPV6_DST :
> +					     TCA_FLOWER_KEY_IPV6_DST,
>  					     sizeof(spec.ipv6->hdr.dst_addr),
>  					     spec.ipv6->hdr.dst_addr);
> -				mnl_attr_put(nlh, TCA_FLOWER_KEY_IPV6_DST_MASK,
> +				mnl_attr_put(nlh, decap.vxlan ?
> +					     TCA_FLOWER_KEY_ENC_IPV6_DST_MASK :
> +					     TCA_FLOWER_KEY_IPV6_DST_MASK,
>  					     sizeof(mask.ipv6->hdr.dst_addr),
>  					     mask.ipv6->hdr.dst_addr);
>  			}
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
>  		case RTE_FLOW_ITEM_TYPE_UDP:
>  			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
> @@ -2620,26 +3025,44 @@ struct pedit_parser {
>  				 sizeof(flow_tcf_mask_supported.udp),
>  				 error);
>  			assert(mask.udp);
> -			if (!ip_proto_set)
> -				mnl_attr_put_u8(nlh, TCA_FLOWER_KEY_IP_PROTO,
> -						IPPROTO_UDP);
> -			if (mask.udp == &flow_tcf_mask_empty.udp)
> -				break;
>  			spec.udp = items->spec;
> +			if (!decap.vxlan) {
> +				if (!ip_proto_set)
> +					mnl_attr_put_u8
> +						(nlh, TCA_FLOWER_KEY_IP_PROTO,
> +						IPPROTO_UDP);
> +				if (mask.udp == &flow_tcf_mask_empty.udp)
> +					break;
> +			} else {
> +				assert(mask.udp != &flow_tcf_mask_empty.udp);
> +				decap.vxlan->udp_port
> +					= RTE_BE16(spec.udp->hdr.dst_port);

Use rte_cpu_to_be_*() instead. And the (=) sign has to be moved up to the
previous line.

> +			}
>  			if (mask.udp->hdr.src_port) {
> -				mnl_attr_put_u16(nlh, TCA_FLOWER_KEY_UDP_SRC,
> -						 spec.udp->hdr.src_port);
> -				mnl_attr_put_u16(nlh,
> -						 TCA_FLOWER_KEY_UDP_SRC_MASK,
> -						 mask.udp->hdr.src_port);
> +				mnl_attr_put_u16
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_UDP_SRC_PORT :
> +					 TCA_FLOWER_KEY_UDP_SRC,
> +					 spec.udp->hdr.src_port);
> +				mnl_attr_put_u16
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK :
> +					 TCA_FLOWER_KEY_UDP_SRC_MASK,
> +					 mask.udp->hdr.src_port);
>  			}
>  			if (mask.udp->hdr.dst_port) {
> -				mnl_attr_put_u16(nlh, TCA_FLOWER_KEY_UDP_DST,
> -						 spec.udp->hdr.dst_port);
> -				mnl_attr_put_u16(nlh,
> -						 TCA_FLOWER_KEY_UDP_DST_MASK,
> -						 mask.udp->hdr.dst_port);
> +				mnl_attr_put_u16
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_UDP_DST_PORT :
> +					 TCA_FLOWER_KEY_UDP_DST,
> +					 spec.udp->hdr.dst_port);
> +				mnl_attr_put_u16
> +					(nlh, decap.vxlan ?
> +					 TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK :
> +					 TCA_FLOWER_KEY_UDP_DST_MASK,
> +					 mask.udp->hdr.dst_port);
>  			}
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
>  		case RTE_FLOW_ITEM_TYPE_TCP:
>  			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
> @@ -2682,7 +3105,15 @@ struct pedit_parser {
>  					 rte_cpu_to_be_16
>  						(mask.tcp->hdr.tcp_flags));
>  			}
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
> +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> +			assert(decap.vxlan);
> +			spec.vxlan = items->spec;
> +			mnl_attr_put_u32(nlh,
> +					 TCA_FLOWER_KEY_ENC_KEY_ID,
> +					 vxlan_vni_as_be32(spec.vxlan->vni));
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  		default:
>  			return rte_flow_error_set(error, ENOTSUP,
>  						  RTE_FLOW_ERROR_TYPE_ITEM,
> @@ -2715,6 +3146,14 @@ struct pedit_parser {
>  			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "mirred");
>  			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
>  			assert(na_act);
> +			if (encap.hdr) {
> +				assert(dev_flow->tcf.tunnel);
> +				dev_flow->tcf.tunnel->ifindex_ptr =
> +					&((struct tc_mirred *)
> +					mnl_attr_get_payload
> +					(mnl_nlmsg_get_payload_tail
> +						(nlh)))->ifindex;
> +			}
>  			mnl_attr_put(nlh, TCA_MIRRED_PARMS,
>  				     sizeof(struct tc_mirred),
>  				     &(struct tc_mirred){
> @@ -2724,6 +3163,7 @@ struct pedit_parser {
>  				     });
>  			mnl_attr_nest_end(nlh, na_act);
>  			mnl_attr_nest_end(nlh, na_act_index);
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
>  		case RTE_FLOW_ACTION_TYPE_JUMP:
>  			conf.jump = actions->conf;
> @@ -2741,6 +3181,7 @@ struct pedit_parser {
>  				     });
>  			mnl_attr_nest_end(nlh, na_act);
>  			mnl_attr_nest_end(nlh, na_act_index);
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
>  		case RTE_FLOW_ACTION_TYPE_DROP:
>  			na_act_index =
> @@ -2827,6 +3268,76 @@ struct pedit_parser {
>  					(na_vlan_priority) =
>  					conf.of_set_vlan_pcp->vlan_pcp;
>  			}
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> +			break;
> +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> +			assert(decap.vxlan);
> +			assert(dev_flow->tcf.tunnel);
> +			dev_flow->tcf.tunnel->ifindex_ptr
> +				= (unsigned int *)&tcm->tcm_ifindex;
> +			na_act_index =
> +				mnl_attr_nest_start(nlh, na_act_index_cur++);
> +			assert(na_act_index);
> +			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "tunnel_key");
> +			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
> +			assert(na_act);
> +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> +				sizeof(struct tc_tunnel_key),
> +				&(struct tc_tunnel_key){
> +					.action = TC_ACT_PIPE,
> +					.t_action = TCA_TUNNEL_KEY_ACT_RELEASE,
> +					});
> +			mnl_attr_nest_end(nlh, na_act);
> +			mnl_attr_nest_end(nlh, na_act_index);
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> +			break;
> +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> +			assert(encap.vxlan);
> +			na_act_index =
> +				mnl_attr_nest_start(nlh, na_act_index_cur++);
> +			assert(na_act_index);
> +			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "tunnel_key");
> +			na_act = mnl_attr_nest_start(nlh, TCA_ACT_OPTIONS);
> +			assert(na_act);
> +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> +				sizeof(struct tc_tunnel_key),
> +				&(struct tc_tunnel_key){
> +					.action = TC_ACT_PIPE,
> +					.t_action = TCA_TUNNEL_KEY_ACT_SET,
> +					});
> +			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_UDP_DST)
> +				mnl_attr_put_u16(nlh,
> +					 TCA_TUNNEL_KEY_ENC_DST_PORT,
> +					 encap.vxlan->udp.dst);
> +			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
> +				mnl_attr_put_u32(nlh,
> +					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
> +					 encap.vxlan->ipv4.src);
> +			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> +				mnl_attr_put_u32(nlh,
> +					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
> +					 encap.vxlan->ipv4.dst);
> +			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
> +				mnl_attr_put(nlh,
> +					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
> +					 sizeof(encap.vxlan->ipv6.src),
> +					 &encap.vxlan->ipv6.src);
> +			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> +				mnl_attr_put(nlh,
> +					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
> +					 sizeof(encap.vxlan->ipv6.dst),
> +					 &encap.vxlan->ipv6.dst);
> +			if (encap.vxlan->mask & MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
> +				mnl_attr_put_u32(nlh,
> +					 TCA_TUNNEL_KEY_ENC_KEY_ID,
> +					 vxlan_vni_as_be32
> +						(encap.vxlan->vxlan.vni));
> +#ifdef TCA_TUNNEL_KEY_NO_CSUM
> +			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM, 0);
> +#endif

TCA_TUNNEL_KEY_NO_CSUM is defined like the others anyway, so why do you treat
it differently with #ifdef/#endif?
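
One more reason to be careful with this kind of guard, as a general C note: an
#ifdef only sees preprocessor macros. Whether this attribute reaches the PMD as
a macro or as an enumerator depends on the headers in use, which is an
assumption I have not checked here. The sketch below uses hypothetical DEMO_*
names to show that a guard on an enumerator is silently compiled out.

```c
#include <assert.h>

enum demo_attr {
	DEMO_ATTR_UNSPEC,
	DEMO_ATTR_NO_CSUM, /* enumerator: invisible to the preprocessor */
};

#define DEMO_ATTR_FALLBACK 10 /* macro: visible to #ifdef */

/* Returns 1 if the preprocessor considers the enumerator "defined". */
static int
demo_enum_seen_by_ifdef(void)
{
#ifdef DEMO_ATTR_NO_CSUM
	return 1;
#else
	return 0;
#endif
}

/* Returns 1 if the preprocessor considers the macro "defined". */
static int
demo_macro_seen_by_ifdef(void)
{
#ifdef DEMO_ATTR_FALLBACK
	return 1;
#else
	return 0;
#endif
}
```

So if the name comes from an enum, the guarded mnl_attr_put_u8() call would
never be built, and dropping the guard avoids that ambiguity.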

> +			mnl_attr_nest_end(nlh, na_act);
> +			mnl_attr_nest_end(nlh, na_act_index);
> +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  			break;
>  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
>  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> @@ -2850,7 +3361,11 @@ struct pedit_parser {
>  	assert(na_flower);
>  	assert(na_flower_act);
>  	mnl_attr_nest_end(nlh, na_flower_act);
> +	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS,
> +			 dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_DECAP
> +			 ? 0 : TCA_CLS_FLAGS_SKIP_SW);
>  	mnl_attr_nest_end(nlh, na_flower);
> +	assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/7] net/mlx5: e-switch VXLAN netlink routines update
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 4/7] net/mlx5: e-switch VXLAN netlink routines update Viacheslav Ovsiienko
@ 2018-10-23 10:07     ` Yongseok Koh
  0 siblings, 0 replies; 110+ messages in thread
From: Yongseok Koh @ 2018-10-23 10:07 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 15, 2018 at 02:13:32PM +0000, Viacheslav Ovsiienko wrote:
> This part of patchset updates Netlink exchange routine. Message
> sequence numbers became not random ones, the multipart reply messages
> are supported, not propagating errors to the following socket calls,
> Netlink replies buffer size is increased to MNL_SOCKET_BUFFER_SIZE
> and now is preallocated at context creation time instead of stack
> usage. This update is needed to support Netlink query operations.
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
Acked-by: Yongseok Koh <yskoh@mellanox.com>

Thanks

>  drivers/net/mlx5/mlx5_flow_tcf.c | 82 +++++++++++++++++++++++++++++-----------
>  1 file changed, 60 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index 660d45e..d6840d5 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -3372,37 +3372,75 @@ struct pedit_parser {
>  /**
>   * Send Netlink message with acknowledgment.
>   *
> - * @param ctx
> + * @param tcf
>   *   Flow context to use.
>   * @param nlh
>   *   Message to send. This function always raises the NLM_F_ACK flag before
>   *   sending.
> + * @param[in] msglen
> + *   Message length. Message buffer may contain multiple commands and
> + *   nlmsg_len field not always corresponds to actual message length.
> + *   If 0 specified the nlmsg_len field in header is used as message length.
> + * @param[in] cb
> + *   Callback handler for received message.
> + * @param[in] arg
> + *   Context pointer for callback handler.
>   *
>   * @return
>   *   0 on success, a negative errno value otherwise and rte_errno is set.
>   */
>  static int
> -flow_tcf_nl_ack(struct mlx5_flow_tcf_context *ctx, struct nlmsghdr *nlh)
> +flow_tcf_nl_ack(struct mlx5_flow_tcf_context *tcf,
> +		struct nlmsghdr *nlh,
> +		uint32_t msglen,
> +		mnl_cb_t cb, void *arg)
>  {
> -	alignas(struct nlmsghdr)
> -	uint8_t ans[mnl_nlmsg_size(sizeof(struct nlmsgerr)) +
> -		    nlh->nlmsg_len - sizeof(*nlh)];
> -	uint32_t seq = ctx->seq++;
> -	struct mnl_socket *nl = ctx->nl;
> -	int ret;
> -
> -	nlh->nlmsg_flags |= NLM_F_ACK;
> +	unsigned int portid = mnl_socket_get_portid(tcf->nl);
> +	uint32_t seq = tcf->seq++;
> +	int err, ret;
> +
> +	assert(tcf->nl);
> +	assert(tcf->buf);
> +	if (!seq)
> +		seq = tcf->seq++;
>  	nlh->nlmsg_seq = seq;
> -	ret = mnl_socket_sendto(nl, nlh, nlh->nlmsg_len);
> -	if (ret != -1)
> -		ret = mnl_socket_recvfrom(nl, ans, sizeof(ans));
> -	if (ret != -1)
> -		ret = mnl_cb_run
> -			(ans, ret, seq, mnl_socket_get_portid(nl), NULL, NULL);
> +	if (!msglen) {
> +		msglen = nlh->nlmsg_len;
> +		nlh->nlmsg_flags |= NLM_F_ACK;
> +	}
> +	ret = mnl_socket_sendto(tcf->nl, nlh, msglen);
> +	err = (ret <= 0) ? errno : 0;
> +	nlh = (struct nlmsghdr *)(tcf->buf);
> +	/*
> +	 * The following loop postpones non-fatal errors until multipart
> +	 * messages are complete.
> +	 */
>  	if (ret > 0)
> +		while (true) {
> +			ret = mnl_socket_recvfrom(tcf->nl, tcf->buf,
> +						  tcf->buf_size);
> +			if (ret < 0) {
> +				err = errno;
> +				if (err != ENOSPC)
> +					break;
> +			}
> +			if (!err) {
> +				ret = mnl_cb_run(nlh, ret, seq, portid,
> +						 cb, arg);
> +				if (ret < 0) {
> +					err = errno;
> +					break;
> +				}
> +			}
> +			/* Will receive till end of multipart message */
> +			if (!(nlh->nlmsg_flags & NLM_F_MULTI) ||
> +			      nlh->nlmsg_type == NLMSG_DONE)
> +				break;
> +		}
> +	if (!err)
>  		return 0;
> -	rte_errno = errno;
> -	return -rte_errno;
> +	rte_errno = err;
> +	return -err;
>  }
>  
>  /**
> @@ -3433,7 +3471,7 @@ struct pedit_parser {
>  	nlh = dev_flow->tcf.nlh;
>  	nlh->nlmsg_type = RTM_NEWTFILTER;
>  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
> -	if (!flow_tcf_nl_ack(nl, nlh))
> +	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
>  		return 0;
>  	return rte_flow_error_set(error, rte_errno,
>  				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> @@ -3466,7 +3504,7 @@ struct pedit_parser {
>  	nlh = dev_flow->tcf.nlh;
>  	nlh->nlmsg_type = RTM_DELTFILTER;
>  	nlh->nlmsg_flags = NLM_F_REQUEST;
> -	flow_tcf_nl_ack(nl, nlh);
> +	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
>  }
>  
>  /**
> @@ -3842,7 +3880,7 @@ struct pedit_parser {
>  	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
>  	tcm->tcm_parent = TC_H_INGRESS;
>  	/* Ignore errors when qdisc is already absent. */
> -	if (flow_tcf_nl_ack(nl, nlh) &&
> +	if (flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL) &&
>  	    rte_errno != EINVAL && rte_errno != ENOENT)
>  		return rte_flow_error_set(error, rte_errno,
>  					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> @@ -3858,7 +3896,7 @@ struct pedit_parser {
>  	tcm->tcm_handle = TC_H_MAKE(TC_H_INGRESS, 0);
>  	tcm->tcm_parent = TC_H_INGRESS;
>  	mnl_attr_put_strz_check(nlh, sizeof(buf), TCA_KIND, "ingress");
> -	if (flow_tcf_nl_ack(nl, nlh))
> +	if (flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
>  		return rte_flow_error_set(error, rte_errno,
>  					  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
>  					  "netlink: failed to create ingress"
> -- 
> 1.8.3.1
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management Viacheslav Ovsiienko
@ 2018-10-25  0:28     ` Yongseok Koh
  2018-10-25 20:21       ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-25  0:28 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> VXLAN interfaces are dynamically created for each local UDP port
> of outer networks and then used as targets for TC "flower" filters
> in order to perform encapsulation. These VXLAN interfaces are
> system-wide, the only one device with given UDP port can exist
> in the system (the attempt of creating another device with the
> same UDP local port returns EEXIST), so PMD should support the
> shared device instances database for PMD instances. These VXLAN
> implicitly created devices are called VTEPs (Virtual Tunnel
> End Points).
> 
> Creation of the VTEP occurs at the moment of rule applying. The
> link is set up, root ingress qdisc is also initialized.
> 
> Encapsulation VTEPs are created on per port basis, the single
> VTEP is attached to the outer interface and is shared for all
> encapsulation rules on this interface. The source UDP port is
> automatically selected in range 30000-60000.
> 
> For decapsulation one VTEP is created per every unique UDP
> local port to accept tunnel traffic. The name of created
> VTEP consists of prefix "vmlx_" and the number of UDP port in
> decimal digits without leading zeros (vmlx_4789). The VTEP
> can be preliminary created in the system before the launching
> application, it allows to share UDP ports between primary
> and secondary processes.
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_flow_tcf.c | 503 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 499 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index d6840d5..efa9c3b 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -3443,6 +3443,432 @@ struct pedit_parser {
>  	return -err;
>  }
>  
> +/* VTEP device list is shared between PMD port instances. */
> +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> +static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;

What's the reason for choosing pthread_mutex instead of rte_*_lock?

> +
> +/**
> + * Deletes VTEP network device.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] vtep
> + *   Object representing the network device to delete. Memory
> + *   allocated for this object is freed by routine.
> + */
> +static void
> +flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
> +		      struct mlx5_flow_tcf_vtep *vtep)
> +{
> +	struct nlmsghdr *nlh;
> +	struct ifinfomsg *ifm;
> +	alignas(struct nlmsghdr)
> +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> +	int ret;
> +
> +	assert(!vtep->refcnt);
> +	if (vtep->created && vtep->ifindex) {

First of all vtep->created seems of no use. It is introduced to select the error
message in flow_tcf_create_iface(). I don't see any necessity to distinguish
between 'vtep is allocated by rte_malloc()' and 'vtep is created in kernel'.

And why do you need to check vtep->ifindex as well? If vtep is created in kernel
and its ifindex isn't set, that should be an error which had to be handled in
flow_tcf_create_iface(). Such a vtep shouldn't exist.

Also, the refcnt management is a bit strange. Please add an abstraction:
create_iface(), get_iface() and release_iface(). get_iface() should increment
vtep->refcnt. release_iface() should decrement it and, if it reaches zero,
remove the iface. create_iface() will set the refcnt to 1. If you refer to
mlx5_hrxq_get(), it even does the list lookup itself, so the same lookup code
isn't repeated here and there. That will make your code much simpler.
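
A minimal sketch of that create/get/release abstraction, assuming a simplified
stand-in structure. The names struct vtep, vtep_create(), vtep_get() and
vtep_release() are hypothetical and not part of the patch; the Netlink calls
are left out and only the reference counting is shown:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical, simplified stand-in for struct mlx5_flow_tcf_vtep. */
struct vtep {
	LIST_ENTRY(vtep) next;
	uint32_t refcnt;
	uint16_t port;
};

static LIST_HEAD(, vtep) vtep_list = LIST_HEAD_INITIALIZER(vtep_list);

/* Allocate a VTEP, link it into the list, return it holding one reference. */
static struct vtep *
vtep_create(uint16_t port)
{
	struct vtep *vtep = calloc(1, sizeof(*vtep));

	if (!vtep)
		return NULL;
	vtep->port = port;
	vtep->refcnt = 1;
	LIST_INSERT_HEAD(&vtep_list, vtep, next);
	return vtep;
}

/* Look up by UDP port; take a reference on a hit, return NULL on a miss. */
static struct vtep *
vtep_get(uint16_t port)
{
	struct vtep *vtep;

	LIST_FOREACH(vtep, &vtep_list, next) {
		if (vtep->port == port) {
			++vtep->refcnt;
			return vtep;
		}
	}
	return NULL;
}

/* Drop one reference; return 1 if the VTEP was destroyed, 0 otherwise. */
static int
vtep_release(struct vtep *vtep)
{
	assert(vtep->refcnt);
	if (--vtep->refcnt)
		return 0;
	LIST_REMOVE(vtep, next);
	free(vtep);
	return 1;
}
```

With this shape the callers never touch refcnt directly, and the only place
that walks the list is vtep_get().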

> +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> +		nlh = mnl_nlmsg_put_header(buf);
> +		nlh->nlmsg_type = RTM_DELLINK;
> +		nlh->nlmsg_flags = NLM_F_REQUEST;
> +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> +		ifm->ifi_family = AF_UNSPEC;
> +		ifm->ifi_index = vtep->ifindex;
> +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +		if (ret)
> +			DRV_LOG(WARNING, "netlink: error deleting VXLAN "
> +					 "encap/decap ifindex %u",
> +					 ifm->ifi_index);
> +	}
> +	rte_free(vtep);
> +}
> +
> +/**
> + * Creates VTEP network device.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifouter
> + *   Outer interface to attach new-created VXLAN device
> + *   If zero the VXLAN device will not be attached to any device.
> + * @param[in] port
> + *   UDP port of created VTEP device.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + * Pointer to created device structure on success, NULL otherwise
> + * and rte_errno is set.
> + */
> +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA

Why negative(ifndef) first instead of positive(ifdef)?

> +static struct mlx5_flow_tcf_vtep*
> +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> +		      unsigned int ifouter __rte_unused,
> +		      uint16_t port __rte_unused,
> +		      struct rte_flow_error *error)
> +{
> +	rte_flow_error_set(error, ENOTSUP,
> +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +			 "netlink: failed to create VTEP, "
> +			 "VXLAN metadat is not supported by kernel");

Typo.

> +	return NULL;
> +}
> +#else
> +static struct mlx5_flow_tcf_vtep*
> +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,

How about adding 'vtep'? It sounds vague - creating a general interface.
E.g., flow_tcf_create_vtep_iface()?

> +		      unsigned int ifouter,
> +		      uint16_t port, struct rte_flow_error *error)
> +{
> +	struct mlx5_flow_tcf_vtep *vtep;
> +	struct nlmsghdr *nlh;
> +	struct ifinfomsg *ifm;
> +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> +	alignas(struct nlmsghdr)
> +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +

Use a macro for '128'. Can't know the meaning.

> +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> +		       SZ_NLATTR_NEST * 2 +
> +		       SZ_NLATTR_STRZ_OF("vxlan") +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> +	struct nlattr *na_info;
> +	struct nlattr *na_vxlan;
> +	rte_be16_t vxlan_port = RTE_BE16(port);

Use rte_cpu_to_be_*() instead.
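
As far as I understand, RTE_BE16() is intended for compile-time constants
(static initializers), while rte_cpu_to_be_16() is the run-time conversion.
A sketch of the run-time form, using htons() as a portable stand-in for
rte_cpu_to_be_16() so it builds without DPDK headers; port_to_be16() and
port_wire_bytes() are hypothetical helper names:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Convert a host-order UDP port to network order at run time. htons() is
 * used here as a portable stand-in for rte_cpu_to_be_16().
 */
static uint16_t
port_to_be16(uint16_t port)
{
	return htons(port);
}

/* Copy out the two bytes in the order they would appear on the wire. */
static void
port_wire_bytes(uint16_t port, uint8_t out[2])
{
	uint16_t be = port_to_be16(port);

	memcpy(out, &be, sizeof(be));
}
```

For port 4789 (0x12b5) the bytes come out as 0x12, 0xb5 regardless of the
host endianness.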

> +	int ret;
> +
> +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> +			alignof(struct mlx5_flow_tcf_vtep));
> +	if (!vtep) {
> +		rte_flow_error_set
> +			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			 NULL, "unable to allocate memory for VTEP desc");
> +		return NULL;
> +	}
> +	*vtep = (struct mlx5_flow_tcf_vtep){
> +			.refcnt = 0,
> +			.port = port,
> +			.created = 0,
> +			.ifouter = 0,
> +			.ifindex = 0,
> +			.local = LIST_HEAD_INITIALIZER(),
> +			.neigh = LIST_HEAD_INITIALIZER(),
> +	};
> +	memset(buf, 0, sizeof(buf));
> +	nlh = mnl_nlmsg_put_header(buf);
> +	nlh->nlmsg_type = RTM_NEWLINK;
> +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  | NLM_F_EXCL;
> +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> +	ifm->ifi_family = AF_UNSPEC;
> +	ifm->ifi_type = 0;
> +	ifm->ifi_index = 0;
> +	ifm->ifi_flags = IFF_UP;
> +	ifm->ifi_change = 0xffffffff;
> +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, port);
> +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> +	assert(na_info);
> +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> +	if (ifouter)
> +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> +	assert(na_vxlan);
> +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> +	mnl_attr_nest_end(nlh, na_vxlan);
> +	mnl_attr_nest_end(nlh, na_info);
> +	assert(sizeof(buf) >= nlh->nlmsg_len);
> +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +	if (ret)
> +		DRV_LOG(WARNING,
> +			"netlink: VTEP %s create failure (%d)",
> +			name, rte_errno);
> +	else
> +		vtep->created = 1;

Flow of code here isn't smooth, thus could be error-prone. Most of all, I don't
like ret has multiple meanings. ret should be return value but you are using it
to store ifindex.

> +	if (ret && ifouter)
> +		ret = 0;
> +	else
> +		ret = if_nametoindex(name);

If vtep isn't created and ifouter is set, then skip init below, which
means, if vtep is created or ifouter is not set, it tries to get ifindex of vtep.
But why do you want to try to call this API even if it failed to create vtep?
Let's not make code flow convoluted even though it logically works. Let's make
it straightforward.

> +	if (ret) {
> +		vtep->ifindex = ret;
> +		vtep->ifouter = ifouter;
> +		memset(buf, 0, sizeof(buf));
> +		nlh = mnl_nlmsg_put_header(buf);
> +		nlh->nlmsg_type = RTM_NEWLINK;
> +		nlh->nlmsg_flags = NLM_F_REQUEST;
> +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> +		ifm->ifi_family = AF_UNSPEC;
> +		ifm->ifi_type = 0;
> +		ifm->ifi_index = vtep->ifindex;
> +		ifm->ifi_flags = IFF_UP;
> +		ifm->ifi_change = IFF_UP;
> +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +		if (ret) {
> +			DRV_LOG(WARNING,
> +				"netlink: VTEP %s set link up failure (%d)",
> +				name, rte_errno);
> +			rte_free(vtep);
> +			rte_flow_error_set
> +				(error, -errno,
> +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +				 "netlink: failed to set VTEP link up");
> +			vtep = NULL;
> +		} else {
> +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> +			if (ret)
> +				DRV_LOG(WARNING,
> +				"VTEP %s init failure (%d)", name, rte_errno);
> +		}
> +	} else {
> +		DRV_LOG(WARNING,
> +			"VTEP %s failed to get index (%d)", name, errno);
> +		rte_flow_error_set
> +			(error, -errno,
> +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +			 !vtep->created ? "netlink: failed to create VTEP" :
> +			 "netlink: failed to retrieve VTEP ifindex");
> +			 ret = 1;

If it fails to create a vtep above, it will print out two warning messages and
one rte_flow_error message. And it even selects message to print between two?
And there's another info msg at the end even in case of failure. Do you really
want to do this even with manipulating ret to change code path?  Not a good
practice.

Usually, code path should be straightforward for successful path and for
errors/failures, return immediately or use 'goto' if there's need for cleanup.

Please refactor entire function.
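
A rough sketch of the shape meant here, assuming hypothetical stand-ins for
the Netlink steps (step_create(), step_link_up(), step_delete()) and a
simplified struct dev; none of these names exist in the patch. The fail
argument only exists to force a failure at a given step:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device record; the real one is struct mlx5_flow_tcf_vtep. */
struct dev {
	unsigned int ifindex;
	int link_up;
};

/* Stand-ins for the Netlink steps; each returns 0 or a negative errno. */
static int step_create(int fail) { return fail == 1 ? -EEXIST : 0; }
static int step_link_up(int fail) { return fail == 2 ? -ENODEV : 0; }
static void step_delete(int *deleted) { ++*deleted; }

/*
 * The success path reads top to bottom; every failure jumps to a label
 * that undoes only what has been done so far.
 */
static struct dev *
dev_create(int fail, int *deleted, int *err)
{
	struct dev *dev = calloc(1, sizeof(*dev));
	int ret;

	if (!dev) {
		*err = ENOMEM;
		return NULL;
	}
	ret = step_create(fail);
	if (ret)
		goto error_free;
	dev->ifindex = 42; /* Placeholder for if_nametoindex(name). */
	ret = step_link_up(fail);
	if (ret)
		goto error_delete;
	dev->link_up = 1;
	*err = 0;
	return dev;
error_delete:
	step_delete(deleted);
error_free:
	free(dev);
	*err = -ret;
	return NULL;
}
```

Each failure reports exactly one error and ret keeps a single meaning, the
status of the last step.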

> +	}
> +	if (ret) {
> +		flow_tcf_delete_iface(tcf, vtep);
> +		vtep = NULL;
> +	}
> +	DRV_LOG(INFO, "VTEP create (%d, %s)", port, vtep ? "OK" : "error");
> +	return vtep;
> +}
> +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> +
> +/**
> + * Create target interface index for VXLAN tunneling decapsulation.
> + * In order to share the UDP port within the other interfaces the
> + * VXLAN device created as not attached to any interface (if created).
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + * @return
> + *   Interface index on success, zero otherwise and rte_errno is set.

Return negative errno in case of failure like others.

 *   Interface index on success, a negative errno value otherwise and rte_errno is set.

> + */
> +static unsigned int
> +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> +			   struct mlx5_flow *dev_flow,
> +			   struct rte_flow_error *error)
> +{
> +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> +
> +	vtep = NULL;
> +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +		if (vlst->port == port) {
> +			vtep = vlst;
> +			break;
> +		}
> +	}

You just need one variable.

	struct mlx5_flow_tcf_vtep *vtep;

	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
		if (vtep->port == port)
			break;
	}

> +	if (!vtep) {
> +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> +		if (vtep)
> +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> +	} else {
> +		if (vtep->ifouter) {
> +			rte_flow_error_set(error, -errno,
> +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +				"Failed to create decap VTEP, attached "
> +				"device with the same UDP port exists");
> +				vtep = NULL;

Making vtep null to skip the following code? Please merge the two same if/else
and make the code path straightforward. And which errno do you expect here?
Should it be set EEXIST instead?

> +		}
> +	}
> +	if (vtep) {
> +		vtep->refcnt++;
> +		assert(vtep->ifindex);
> +		return vtep->ifindex;
> +	} else {
> +		return 0;
> +	}

Why repeating same if/else?


This is my suggestion but if you take my suggestion to have
flow_tcf_[create|get|release]_iface(), this will get much simpler.

{
	struct mlx5_flow_tcf_vtep *vtep;
	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;

	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
		if (vtep->port == port)
			break;
	}
	if (vtep && vtep->ifouter)
		return rte_flow_error_set(... EEXIST ...);
	else if (vtep) {
		++vtep->refcnt;
	} else {
		vtep = flow_tcf_create_iface(tcf, 0, port, error);
		if (!vtep)
			return rte_flow_error_set(...);
		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
	}
	assert(vtep->ifindex);
	return vtep->ifindex;
}


> +}
> +
> +/**
> + * Creates target interface index for VXLAN tunneling encapsulation.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifouter
> + *   Network interface index to attach VXLAN encap device to.
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + * @return
> + *   Interface index on success, zero otherwise and rte_errno is set.
> + */
> +static unsigned int
> +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifouter,
> +			    struct mlx5_flow *dev_flow __rte_unused,
> +			    struct rte_flow_error *error)
> +{
> +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> +
> +	assert(ifouter);
> +	/* Look whether the attached VTEP for encap is created. */
> +	vtep = NULL;
> +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +		if (vlst->ifouter == ifouter) {
> +			vtep = vlst;
> +			break;
> +		}
> +	}

Same here.

> +	if (!vtep) {
> +		uint16_t pcnt;
> +
> +		/* Not found, we should create the new attached VTEP. */
> +/*
> + * TODO: not implemented yet
> + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
> + */

Personal note is not appropriate even though it is removed in the following
patch.

> +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> +				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
> +			encap_port++;
> +			/* Wraparound the UDP port index. */
> +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN ||
> +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> +				encap_port = MLX5_VXLAN_PORT_RANGE_MIN;
> +			/* Check whether UDP port is already in use. */
> +			vtep = NULL;
> +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +				if (vlst->port == encap_port) {
> +					vtep = vlst;
> +					break;
> +				}
> +			}

If you want to find out an empty port number, you can use rte_bitmap instead of
repeating searching the entire list for all possible port numbers.
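
To illustrate the idea without pulling in rte_bitmap, here is a sketch with
a plain uint64_t array as the bitmap (a swapped-in simplification, not the
rte_bitmap API). PORT_MIN/PORT_MAX mirror the 30000-60000 range from the
commit message; port_alloc() and port_free() are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Range taken from the commit message; macro names are hypothetical. */
#define PORT_MIN 30000u
#define PORT_MAX 60000u
#define PORT_CNT (PORT_MAX - PORT_MIN + 1)

/* One bit per port in the range, set when the port is in use. */
static uint64_t port_map[(PORT_CNT + 63) / 64];

/* Claim the lowest free port; return 0 when the range is exhausted. */
static uint16_t
port_alloc(void)
{
	unsigned int w, b;

	for (w = 0; w < sizeof(port_map) / sizeof(port_map[0]); w++) {
		if (port_map[w] == UINT64_MAX)
			continue;
		for (b = 0; b < 64; b++) {
			unsigned int idx = w * 64 + b;

			if (idx >= PORT_CNT)
				return 0;
			if (!(port_map[w] & (UINT64_C(1) << b))) {
				port_map[w] |= UINT64_C(1) << b;
				return (uint16_t)(PORT_MIN + idx);
			}
		}
	}
	return 0;
}

/* Mark a previously allocated port as free again. */
static void
port_free(uint16_t port)
{
	unsigned int idx = port - PORT_MIN;

	assert(port >= PORT_MIN && port <= PORT_MAX);
	port_map[idx / 64] &= ~(UINT64_C(1) << (idx % 64));
}
```

Finding a free port then costs at most one pass over the bitmap words instead
of one list walk per candidate port. A port that the kernel still rejects
with EEXIST would simply stay marked as used and the next one be tried.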

> +			if (vtep) {
> +				vtep = NULL;
> +				continue;
> +			}
> +			vtep = flow_tcf_create_iface(tcf, ifouter,
> +						     encap_port, error);
> +			if (vtep) {
> +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> +				break;
> +			}
> +			if (rte_errno != EEXIST)
> +				break;
> +		}
> +	}
> +	if (!vtep)
> +		return 0;
> +	vtep->refcnt++;
> +	assert(vtep->ifindex);
> +	return vtep->ifindex;

Please refactor this func according to what I suggested for
flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().

> +}
> +
> +/**
> + * Creates target interface index for tunneling of any type.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifouter
> + *   Network interface index to attach VXLAN encap device to.
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + * @return
> + *   Interface index on success, zero otherwise and rte_errno is set.

 *   Interface index on success, a negative errno value otherwise and
 *   rte_errno is set.

> + */
> +static unsigned int
> +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifouter,
> +			    struct mlx5_flow *dev_flow,
> +			    struct rte_flow_error *error)
> +{
> +	unsigned int ret;
> +
> +	assert(dev_flow->tcf.tunnel);
> +	pthread_mutex_lock(&vtep_list_mutex);
> +	switch (dev_flow->tcf.tunnel->type) {
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> +						 dev_flow, error);
> +		break;
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> +		break;
> +	default:
> +		rte_flow_error_set(error, ENOTSUP,
> +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +				"unsupported tunnel type");
> +		ret = 0;
> +		break;
> +	}
> +	pthread_mutex_unlock(&vtep_list_mutex);
> +	return ret;
> +}
> +
> +/**
> + * Deletes tunneling interface by UDP port.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifindex
> + *   Network interface index of VXLAN device.
> + * @param[in] dev_flow
> + *   Flow tcf object with tunnel structure pointer set.
> + */
> +static void
> +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifindex,
> +			    struct mlx5_flow *dev_flow)
> +{
> +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> +
> +	assert(dev_flow->tcf.tunnel);
> +	pthread_mutex_lock(&vtep_list_mutex);
> +	vtep = NULL;
> +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> +		if (vlst->ifindex == ifindex) {
> +			vtep = vlst;
> +			break;
> +		}
> +	}

It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel instead
of ifindex_tun which is same as vtep->ifindex like the assertion below. Then,
this lookup can be skipped.

> +	if (!vtep) {
> +		DRV_LOG(WARNING, "No VTEP device found in the list");
> +		goto exit;
> +	}
> +	switch (dev_flow->tcf.tunnel->type) {
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> +		break;
> +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> +/*
> + * TODO: Remove the encap ancillary rules first.
> + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> + */

Is it a personal note? Please remove.

> +		break;
> +	default:
> +		assert(false);
> +		DRV_LOG(WARNING, "Unsupported tunnel type");
> +		break;
> +	}
> +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> +	assert(vtep->refcnt);
> +	if (!vtep->refcnt || !--vtep->refcnt) {
> +		LIST_REMOVE(vtep, next);
> +		flow_tcf_delete_iface(tcf, vtep);
> +	}
> +exit:
> +	pthread_mutex_unlock(&vtep_list_mutex);
> +}
> +
>  /**
>   * Apply flow to E-Switch by sending Netlink message.
>   *
> @@ -3461,18 +3887,61 @@ struct pedit_parser {
>  	       struct rte_flow_error *error)
>  {
>  	struct priv *priv = dev->data->dev_private;
> -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
>  	struct mlx5_flow *dev_flow;
>  	struct nlmsghdr *nlh;
> +	int ret;
>  
>  	dev_flow = LIST_FIRST(&flow->dev_flows);
>  	/* E-Switch flow can't be expanded. */
>  	assert(!LIST_NEXT(dev_flow, next));
> +	if (dev_flow->tcf.applied)
> +		return 0;
>  	nlh = dev_flow->tcf.nlh;
>  	nlh->nlmsg_type = RTM_NEWTFILTER;
>  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
> -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> +	if (dev_flow->tcf.tunnel) {
> +		/*
> +		 * Replace the interface index, target for
> +		 * encapsulation, source for decapsulation.
> +		 */
> +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> +		/* Create actual VTEP device when rule is being applied. */
> +		dev_flow->tcf.tunnel->ifindex_tun
> +			= flow_tcf_tunnel_vtep_create(tcf,
> +					*dev_flow->tcf.tunnel->ifindex_ptr,
> +					dev_flow, error);
> +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> +				dev_flow->tcf.tunnel->ifindex_tun,
> +				*dev_flow->tcf.tunnel->ifindex_ptr);
> +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> +			return -rte_errno;
> +		dev_flow->tcf.tunnel->ifindex_org
> +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_tun;
> +	}
> +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +	if (dev_flow->tcf.tunnel) {
> +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> +				dev_flow->tcf.tunnel->ifindex_org,
> +				*dev_flow->tcf.tunnel->ifindex_ptr);
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_org;
> +		dev_flow->tcf.tunnel->ifindex_org = 0;

ifindex_org looks a temporary storage in this code. And this kind of hassle
(replace/restore) is there because you took the ifindex from the netlink
message. Why don't you have just

struct mlx5_flow_tcf_tunnel_hdr {
	uint32_t type; /**< Tunnel action type. */
	unsigned int ifindex; /**< Original dst/src interface */
	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message. */
};

and don't change ifindex?

Thanks,
Yongseok

> +	}
> +	if (!ret) {
> +		dev_flow->tcf.applied = 1;
>  		return 0;
> +	}
> +	DRV_LOG(WARNING, "netlink: failed to create TC rule (%d)", rte_errno);
> +	if (dev_flow->tcf.tunnel->ifindex_tun) {
> +		flow_tcf_tunnel_vtep_delete(tcf,
> +					    dev_flow->tcf.tunnel->ifindex_tun,
> +					    dev_flow);
> +		dev_flow->tcf.tunnel->ifindex_tun = 0;
> +	}
>  	return rte_flow_error_set(error, rte_errno,
>  				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
>  				  "netlink: failed to create TC flow rule");
> @@ -3490,7 +3959,7 @@ struct pedit_parser {
>  flow_tcf_remove(struct rte_eth_dev *dev, struct rte_flow *flow)
>  {
>  	struct priv *priv = dev->data->dev_private;
> -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
>  	struct mlx5_flow *dev_flow;
>  	struct nlmsghdr *nlh;
>  
> @@ -3501,10 +3970,36 @@ struct pedit_parser {
>  		return;
>  	/* E-Switch flow can't be expanded. */
>  	assert(!LIST_NEXT(dev_flow, next));
> +	if (!dev_flow->tcf.applied)
> +		return;
> +	if (dev_flow->tcf.tunnel) {
> +		/*
> +		 * Replace the interface index, target for
> +		 * encapsulation, source for decapsulation.
> +		 */
> +		assert(dev_flow->tcf.tunnel->ifindex_tun);
> +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> +		dev_flow->tcf.tunnel->ifindex_org
> +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_tun;
> +	}
>  	nlh = dev_flow->tcf.nlh;
>  	nlh->nlmsg_type = RTM_DELTFILTER;
>  	nlh->nlmsg_flags = NLM_F_REQUEST;
> -	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
> +	flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> +	if (dev_flow->tcf.tunnel) {
> +		*dev_flow->tcf.tunnel->ifindex_ptr
> +			= dev_flow->tcf.tunnel->ifindex_org;
> +		dev_flow->tcf.tunnel->ifindex_org = 0;
> +		if (dev_flow->tcf.tunnel->ifindex_tun) {
> +			flow_tcf_tunnel_vtep_delete(tcf,
> +					dev_flow->tcf.tunnel->ifindex_tun,
> +					dev_flow);
> +			dev_flow->tcf.tunnel->ifindex_tun = 0;
> +		}
> +	}
> +	dev_flow->tcf.applied = 0;
>  }
>  
>  /**
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 6/7] net/mlx5: e-switch VXLAN encapsulation rules management
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 6/7] net/mlx5: e-switch VXLAN encapsulation rules management Viacheslav Ovsiienko
@ 2018-10-25  0:33     ` Yongseok Koh
  0 siblings, 0 replies; 110+ messages in thread
From: Yongseok Koh @ 2018-10-25  0:33 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 15, 2018 at 02:13:34PM +0000, Viacheslav Ovsiienko wrote:
> VXLAN encap rules are applied to the VF ingress traffic and have the
> VTEP as actual redirection destinations instead of outer PF.
> The encapsulation rule should provide:
> - redirection action VF->PF
> - VF port ID
> - some inner network parameters (MACs/IP)
> - the tunnel outer source IP (v4/v6)
> - the tunnel outer destination IP (v4/v6). Current
> - VNI - Virtual Network Identifier
> 
> There is no direct way found to provide kernel with all required
> encapsulation header parameters. The encapsulation VTEP is created
> attached to the outer interface and assumed as default path for
> egress encapsulated traffic. The outer tunnel IP addresses are
> assigned to interface using Netlink, the implicit route is
> created like this:
> 
>   ip addr add <src_ip> peer <dst_ip> dev <outer> scope link
> 
> Peer address provides implicit route, and scope link reduces
> the risk of conflicts. At initialization time all local scope
> link addresses are flushed from device (see next part of patchset).
> 
> The destination MAC address is provided via permanent neigh rule:
> 
>   ip neigh add dev <outer> lladdr <dst_mac> to <dst_ip> nud permanent
> 
> At initialization time all neigh rules of this type are flushed
> from device (see the next part of patchset).
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_flow_tcf.c | 394 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 389 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index efa9c3b..a1d7733 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -3443,6 +3443,376 @@ struct pedit_parser {
>  	return -err;
>  }
>  
> +/**
> + * Emit Netlink message to add/remove local address to the outer device.
> + * The address being added is visible within the link only (scope link).
> + *
> + * Note that an implicit route is maintained by the kernel due to the
> + * presence of a peer address (IFA_ADDRESS).
> + *
> + * These rules are used for encapsulation only and allow to assign
> + * the outer tunnel source IP address.
> + *
> + * @param[in] tcf
> + *   Libmnl socket context object.
> + * @param[in] encap
> + *   Encapsulation properties (source address and its peer).
> + * @param[in] ifindex
> + *   Network interface to apply rule.
> + * @param[in] enable
> + *   Toggle between add and remove.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +static int
> +flow_tcf_rule_local(struct mlx5_flow_tcf_context *tcf,
> +		    const struct mlx5_flow_tcf_vxlan_encap *encap,
> +		    unsigned int ifindex,
> +		    bool enable,
> +		    struct rte_flow_error *error)
> +{
> +	struct nlmsghdr *nlh;
> +	struct ifaddrmsg *ifa;
> +	alignas(struct nlmsghdr)
> +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifa) + 128)];
> +
> +	nlh = mnl_nlmsg_put_header(buf);
> +	nlh->nlmsg_type = enable ? RTM_NEWADDR : RTM_DELADDR;
> +	nlh->nlmsg_flags =
> +		NLM_F_REQUEST | (enable ? NLM_F_CREATE | NLM_F_REPLACE : 0);
> +	nlh->nlmsg_seq = 0;
> +	ifa = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifa));
> +	ifa->ifa_flags = IFA_F_PERMANENT;
> +	ifa->ifa_scope = RT_SCOPE_LINK;
> +	ifa->ifa_index = ifindex;
> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC) {
> +		ifa->ifa_family = AF_INET;
> +		ifa->ifa_prefixlen = 32;
> +		mnl_attr_put_u32(nlh, IFA_LOCAL, encap->ipv4.src);
> +		if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> +			mnl_attr_put_u32(nlh, IFA_ADDRESS,
> +					      encap->ipv4.dst);
> +	} else {
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC);
> +		ifa->ifa_family = AF_INET6;
> +		ifa->ifa_prefixlen = 128;
> +		mnl_attr_put(nlh, IFA_LOCAL,
> +				  sizeof(encap->ipv6.src),
> +				  &encap->ipv6.src);
> +		if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> +			mnl_attr_put(nlh, IFA_ADDRESS,
> +					  sizeof(encap->ipv6.dst),
> +					  &encap->ipv6.dst);
> +	}
> +	if (!flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL))
> +		return 0;
> +	return rte_flow_error_set
> +		(error, rte_errno, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +		 "netlink: cannot complete IFA request (ip addr add)");
> +}
> +
> +/**
> + * Emit Netlink message to add/remove neighbor.
> + *
> + * @param[in] tcf
> + *   Libmnl socket context object.
> + * @param[in] encap
> + *   Encapsulation properties (destination address).
> + * @param[in] ifindex
> + *   Network interface.
> + * @param[in] enable
> + *   Toggle between add and remove.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +static int
> +flow_tcf_rule_neigh(struct mlx5_flow_tcf_context *tcf,
> +		     const struct mlx5_flow_tcf_vxlan_encap *encap,
> +		     unsigned int ifindex,
> +		     bool enable,
> +		     struct rte_flow_error *error)
> +{
> +	struct nlmsghdr *nlh;
> +	struct ndmsg *ndm;
> +	alignas(struct nlmsghdr)
> +	uint8_t buf[mnl_nlmsg_size(sizeof(*ndm) + 128)];
> +
> +	nlh = mnl_nlmsg_put_header(buf);
> +	nlh->nlmsg_type = enable ? RTM_NEWNEIGH : RTM_DELNEIGH;
> +	nlh->nlmsg_flags =
> +		NLM_F_REQUEST | (enable ? NLM_F_CREATE | NLM_F_REPLACE : 0);
> +	nlh->nlmsg_seq = 0;
> +	ndm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ndm));
> +	ndm->ndm_ifindex = ifindex;
> +	ndm->ndm_state = NUD_PERMANENT;
> +	ndm->ndm_flags = 0;
> +	ndm->ndm_type = 0;
> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST) {
> +		ndm->ndm_family = AF_INET;
> +		mnl_attr_put_u32(nlh, NDA_DST, encap->ipv4.dst);
> +	} else {
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST);
> +		ndm->ndm_family = AF_INET6;
> +		mnl_attr_put(nlh, NDA_DST, sizeof(encap->ipv6.dst),
> +						 &encap->ipv6.dst);
> +	}
> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_ETH_SRC && enable)
> +		DRV_LOG(WARNING,
> +			"Outer ethernet source address cannot be "
> +			"forced for VXLAN encapsulation");
> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_ETH_DST)
> +		mnl_attr_put(nlh, NDA_LLADDR, sizeof(encap->eth.dst),
> +						    &encap->eth.dst);
> +	if (!flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL))
> +		return 0;
> +	return rte_flow_error_set
> +		(error, rte_errno, RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> +		 "netlink: cannot complete ND request (ip neigh)");
> +}
> +
> +/**
> + * Manage the local IP addresses and their peers IP addresses on the
> + * outer interface for encapsulation purposes. The kernel searches the
> + * appropriate device for tunnel egress traffic using the outer source
> + * IP, this IP should be assigned to the outer network device, otherwise
> + * kernel rejects the rule.
> + *
> + * Adds or removes the addresses using the Netlink command like this:
> + *   ip addr add <src_ip> peer <dst_ip> scope link dev <ifouter>
> + *
> + * The addresses are local to the netdev ("scope link"), this reduces
> + * the risk of conflicts. Note that an implicit route is maintained by
> + * the kernel due to the presence of a peer address (IFA_ADDRESS).
> + *
> + * @param[in] tcf
> + *   Libmnl socket context object.
> + * @param[in] vtep
> + *   VTEP object, contains rule database and ifouter index.
> + * @param[in] dev_flow
> + *   Flow object, contains the tunnel parameters (for encap only).
> + * @param[in] enable
> + *   Toggle between add and remove.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +static int
> +flow_tcf_encap_local(struct mlx5_flow_tcf_context *tcf,
> +		     struct mlx5_flow_tcf_vtep *vtep,
> +		     struct mlx5_flow *dev_flow,
> +		     bool enable,
> +		     struct rte_flow_error *error)
> +{
> +	const struct mlx5_flow_tcf_vxlan_encap *encap =
> +						dev_flow->tcf.vxlan_encap;
> +	struct tcf_local_rule *rule;
> +	bool found = false;
> +	int ret;
> +
> +	assert(encap);
> +	assert(encap->hdr.type == MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP);
> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC) {
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST);
> +		LIST_FOREACH(rule, &vtep->local, next) {
> +			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC &&
> +			    encap->ipv4.src == rule->ipv4.src &&
> +			    encap->ipv4.dst == rule->ipv4.dst) {
> +				found = true;
> +				break;
> +			}
> +		}
> +	} else {
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC);
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST);
> +		LIST_FOREACH(rule, &vtep->local, next) {
> +			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC &&
> +			    !memcmp(&encap->ipv6.src, &rule->ipv6.src,
> +					    sizeof(encap->ipv6.src)) &&
> +			    !memcmp(&encap->ipv6.dst, &rule->ipv6.dst,
> +					    sizeof(encap->ipv6.dst))) {
> +				found = true;
> +				break;
> +			}
> +		}
> +	}
> +	if (found) {
> +		if (enable) {
> +			rule->refcnt++;
> +			return 0;
> +		}
> +		if (!rule->refcnt || !--rule->refcnt) {

Same suggestion for this as that of vtep - refcnt handling and adding get()
func.
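
To make the suggestion concrete, here is a rough, IPv4-only sketch of what I
have in mind. The names local_rule, local_rule_get(), local_rule_create() and
local_rule_release() are made up for illustration, and calloc()/free() stand
in for rte_zmalloc()/rte_free() so that it builds without DPDK. The Netlink
calls are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct tcf_local_rule (IPv4 only). */
struct local_rule {
	LIST_ENTRY(local_rule) next;
	uint32_t refcnt;
	uint32_t src;
	uint32_t dst;
};

LIST_HEAD(local_rule_list, local_rule);

/* Look up a rule and take a reference; NULL if there is none. */
static struct local_rule *
local_rule_get(struct local_rule_list *list, uint32_t src, uint32_t dst)
{
	struct local_rule *rule;

	LIST_FOREACH(rule, list, next) {
		if (rule->src == src && rule->dst == dst) {
			rule->refcnt++;
			return rule;
		}
	}
	return NULL;
}

/* Allocate a rule holding one reference and link it into the list. */
static struct local_rule *
local_rule_create(struct local_rule_list *list, uint32_t src, uint32_t dst)
{
	struct local_rule *rule = calloc(1, sizeof(*rule));

	if (!rule)
		return NULL;
	rule->refcnt = 1;
	rule->src = src;
	rule->dst = dst;
	LIST_INSERT_HEAD(list, rule, next);
	return rule;
}

/* Drop a reference; returns true when the rule was unlinked and freed. */
static bool
local_rule_release(struct local_rule *rule)
{
	assert(rule->refcnt);
	if (--rule->refcnt)
		return false;
	LIST_REMOVE(rule, next);
	free(rule);
	return true;
}
```

With helpers like these the caller would only issue the "ip addr del"
equivalent when release() returns true, and the "!rule->refcnt ||" special
case would no longer be needed.
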

> +			LIST_REMOVE(rule, next);
> +			return flow_tcf_rule_local(tcf, encap,
> +					vtep->ifouter, false, error);
> +		}
> +		return 0;
> +	}
> +	if (!enable) {
> +		DRV_LOG(WARNING, "Disabling not existing local rule");
> +		rte_flow_error_set
> +			(error, ENOENT, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			 NULL, "Disabling not existing local rule");
> +		return -ENOENT;
> +	}
> +	rule = rte_zmalloc(__func__, sizeof(struct tcf_local_rule),
> +				alignof(struct tcf_local_rule));
> +	if (!rule) {
> +		rte_flow_error_set
> +			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			 NULL, "unable to allocate memory for local rule");
> +		return -rte_errno;
> +	}
> +	*rule = (struct tcf_local_rule){.refcnt = 0,
> +					.mask = 0,
> +					};

Is this needed? rte_zmalloc() already zeroes the allocated memory.

> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC) {
> +		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV4_SRC
> +			   | MLX5_FLOW_TCF_ENCAP_IPV4_DST;
> +		rule->ipv4.src = encap->ipv4.src;
> +		rule->ipv4.dst = encap->ipv4.dst;
> +	} else {
> +		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV6_SRC
> +			   | MLX5_FLOW_TCF_ENCAP_IPV6_DST;
> +		memcpy(&rule->ipv6.src, &encap->ipv6.src,
> +				sizeof(rule->ipv6.src));
> +		memcpy(&rule->ipv6.dst, &encap->ipv6.dst,
> +				sizeof(rule->ipv6.dst));
> +	}
> +	ret = flow_tcf_rule_local(tcf, encap, vtep->ifouter, true, error);
> +	if (ret) {
> +		rte_free(rule);
> +		return ret;
> +	}
> +	rule->refcnt++;
> +	LIST_INSERT_HEAD(&vtep->local, rule, next);
> +	return 0;
> +}
> +
> +/**
> + * Manage the destination MAC/IP addresses neigh database, kernel uses
> + * this one to determine the destination MAC address within encapsulation
> + * header. Adds or removes the entries using the Netlink command like this:
> + *   ip neigh add dev <ifouter> lladdr <dst_mac> to <dst_ip> nud permanent
> + *
> + * @param[in] tcf
> + *   Libmnl socket context object.
> + * @param[in] vtep
> + *   VTEP object, contains rule database and ifouter index.
> + * @param[in] dev_flow
> + *   Flow object, contains the tunnel parameters (for encap only).
> + * @param[in] enable
> + *   Toggle between add and remove.
> + * @param[out] error
> + *   Perform verbose error reporting if not NULL.
> + *
> + * @return
> + *   0 on success, a negative errno value otherwise and rte_errno is set.
> + */
> +static int
> +flow_tcf_encap_neigh(struct mlx5_flow_tcf_context *tcf,
> +		     struct mlx5_flow_tcf_vtep *vtep,
> +		     struct mlx5_flow *dev_flow,
> +		     bool enable,
> +		     struct rte_flow_error *error)
> +{
> +	const struct mlx5_flow_tcf_vxlan_encap *encap =
> +						dev_flow->tcf.vxlan_encap;
> +	struct tcf_neigh_rule *rule;
> +	bool found = false;
> +	int ret;
> +
> +	assert(encap);
> +	assert(encap->hdr.type == MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP);
> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST) {
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_SRC);
> +		LIST_FOREACH(rule, &vtep->neigh, next) {
> +			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST &&
> +			    encap->ipv4.dst == rule->ipv4.dst) {
> +				found = true;
> +				break;
> +			}
> +		}
> +	} else {
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_SRC);
> +		assert(encap->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST);
> +		LIST_FOREACH(rule, &vtep->neigh, next) {
> +			if (rule->mask & MLX5_FLOW_TCF_ENCAP_IPV6_DST &&
> +			    !memcmp(&encap->ipv6.dst, &rule->ipv6.dst,
> +						sizeof(encap->ipv6.dst))) {
> +				found = true;
> +				break;
> +			}
> +		}
> +	}
> +	if (found) {
> +		if (memcmp(&encap->eth.dst, &rule->eth,
> +			   sizeof(encap->eth.dst))) {
> +			DRV_LOG(WARNING, "Destination MAC differs"
> +					 " in neigh rule");
> +			rte_flow_error_set(error, EEXIST,
> +					   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +					   NULL, "Different MAC address"
> +					   " neigh rule for the same"
> +					   " destination IP");
> +			return -EEXIST;
> +		}
> +		if (enable) {
> +			rule->refcnt++;
> +			return 0;
> +		}
> +		if (!rule->refcnt || !--rule->refcnt) {

Same suggestion for this as that of vtep - refcnt handling by adding
create()/get()/release() func.

> +			LIST_REMOVE(rule, next);
> +			return flow_tcf_rule_neigh(tcf, encap,
> +						   vtep->ifouter,
> +						   false, error);
> +		}
> +		return 0;
> +	}
> +	if (!enable) {
> +		DRV_LOG(WARNING, "Disabling not existing neigh rule");
> +		rte_flow_error_set
> +			(error, ENOENT, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			 NULL, "Disabling not existing neigh rule");
> +		return -ENOENT;
> +	}
> +	rule = rte_zmalloc(__func__, sizeof(struct tcf_neigh_rule),
> +				alignof(struct tcf_neigh_rule));
> +	if (!rule) {
> +		rte_flow_error_set
> +			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> +			 NULL, "unable to allocate memory for neigh rule");
> +		return -rte_errno;
> +	}
> +	*rule = (struct tcf_neigh_rule){.refcnt = 0,
> +					.mask = 0,
> +					};

Is this needed? rte_zmalloc() already zeroes the allocated memory.

> +	if (encap->mask & MLX5_FLOW_TCF_ENCAP_IPV4_DST) {
> +		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV4_DST;
> +		rule->ipv4.dst = encap->ipv4.dst;
> +	} else {
> +		rule->mask = MLX5_FLOW_TCF_ENCAP_IPV6_DST;
> +		memcpy(&rule->ipv6.dst, &encap->ipv6.dst,
> +					sizeof(rule->ipv6.dst));
> +	}
> +	memcpy(&rule->eth, &encap->eth.dst, sizeof(rule->eth));
> +	ret = flow_tcf_rule_neigh(tcf, encap, vtep->ifouter, true, error);
> +	if (ret) {
> +		rte_free(rule);
> +		return ret;
> +	}
> +	rule->refcnt++;
> +	LIST_INSERT_HEAD(&vtep->neigh, rule, next);
> +	return 0;
> +}
> +
>  /* VTEP device list is shared between PMD port instances. */
>  static LIST_HEAD(, mlx5_flow_tcf_vtep)
>  			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> @@ -3715,6 +4085,7 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
>  {
>  	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
>  	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> +	int ret;
>  
>  	assert(ifouter);
>  	/* Look whether the attached VTEP for encap is created. */
> @@ -3766,6 +4137,21 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
>  	}
>  	if (!vtep)
>  		return 0;
> +	/* Create local ipaddr with peer to specify the outer IPs. */
> +	ret = flow_tcf_encap_local(tcf, vtep, dev_flow, true, error);
> +	if (ret) {
> +		if (!vtep->refcnt)
> +			flow_tcf_delete_iface(tcf, vtep);

There's no possibility of decreasing vtep->refcnt in flow_tcf_encap_local(),
then why do you expect it to be zero here? If it is already zero at this point,
it should've been deleted when it became zero.
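
One possible way to make the error path unambiguous, as a sketch only: take
the reference before setting up the ancillary rules and unwind through a
single release function. The names vtep_attach() and vtep_release() are
hypothetical, struct vtep is a reduced stand-in for struct mlx5_flow_tcf_vtep,
and the rules_ok flag stands in for flow_tcf_encap_local()/_neigh()
succeeding. This is not what the patch does today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct mlx5_flow_tcf_vtep. */
struct vtep {
	unsigned int refcnt;
	unsigned int ifindex;
};

static unsigned int vtep_deleted; /* Counts deletions, for illustration. */

/* Drop one reference and delete the device when the last one goes. */
static void
vtep_release(struct vtep *vtep)
{
	assert(vtep->refcnt);
	if (--vtep->refcnt)
		return;
	vtep_deleted++;
	free(vtep);
}

/*
 * Attach a flow to a VTEP. The reference is taken before the ancillary
 * rules are set up, so every error path is just vtep_release().
 * Returns the VTEP ifindex on success, 0 on failure.
 */
static unsigned int
vtep_attach(struct vtep *vtep, bool rules_ok)
{
	vtep->refcnt++;
	if (!rules_ok) {
		vtep_release(vtep);
		return 0;
	}
	return vtep->ifindex;
}
```

A freshly created VTEP is then deleted on failure because its count drops back
to zero, and a shared one simply keeps its previous count, without any
explicit "!vtep->refcnt" test in the caller.
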

> +		return 0;
> +	}
> +	/* Create neigh rule to specify outer destination MAC. */
> +	ret = flow_tcf_encap_neigh(tcf, vtep, dev_flow, true, error);
> +	if (ret) {
> +		flow_tcf_encap_local(tcf, vtep, dev_flow, false, error);
> +		if (!vtep->refcnt)
> +			flow_tcf_delete_iface(tcf, vtep);

Same here.

Thanks,
Yongseok

> +		return 0;
> +	}
>  	vtep->refcnt++;
>  	assert(vtep->ifindex);
>  	return vtep->ifindex;
> @@ -3848,11 +4234,9 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
>  	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
>  		break;
>  	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> -/*
> - * TODO: Remove the encap ancillary rules first.
> - * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> - * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> - */
> +		/* Remove the encap ancillary rules first. */
> +		flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> +		flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
>  		break;
>  	default:
>  		assert(false);
> 

* Re: [dpdk-dev] [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines Viacheslav Ovsiienko
@ 2018-10-25  0:36     ` Yongseok Koh
  2018-10-25 20:32       ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-25  0:36 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 15, 2018 at 02:13:35PM +0000, Viacheslav Ovsiienko wrote:
> The last part of patchset contains the rule cleanup routines.
> These are part of the outer interface initialization at
> the moment of VXLAN VTEP attaching. These routines query
> the list of attached VXLAN devices, the list of local IP
> addresses with peer and link scope attribute and the list
> of permanent neigh rules, then all found abovementioned
> items on the specified outer device are flushed.
> 
> Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> ---
>  drivers/net/mlx5/mlx5_flow_tcf.c | 505 ++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 499 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
> index a1d7733..a3348ea 100644
> --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> @@ -4012,6 +4012,502 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
>  }
>  #endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
>  
> +#define MNL_REQUEST_SIZE_MIN 256
> +#define MNL_REQUEST_SIZE_MAX 2048
> +#define MNL_REQUEST_SIZE RTE_MIN(RTE_MAX(sysconf(_SC_PAGESIZE), \
> +				 MNL_REQUEST_SIZE_MIN), MNL_REQUEST_SIZE_MAX)
> +
> +/* Data structures used by flow_tcf_xxx_cb() routines. */
> +struct tcf_nlcb_buf {
> +	LIST_ENTRY(tcf_nlcb_buf) next;
> +	uint32_t size;
> +	alignas(struct nlmsghdr)
> +	uint8_t msg[]; /**< Netlink message data. */
> +};
> +
> +struct tcf_nlcb_context {
> +	unsigned int ifindex; /**< Base interface index. */
> +	uint32_t bufsize;
> +	LIST_HEAD(, tcf_nlcb_buf) nlbuf;
> +};
> +
> +/**
> + * Allocate space for netlink command in buffer list
> + *
> + * @param[in, out] ctx
> + *   Pointer to callback context with command buffers list.
> + * @param[in] size
> + *   Required size of data buffer to be allocated.
> + *
> + * @return
> + *   Pointer to allocated memory, aligned as message header.
> + *   NULL if some error occurred.
> + */
> +static struct nlmsghdr *
> +flow_tcf_alloc_nlcmd(struct tcf_nlcb_context *ctx, uint32_t size)
> +{
> +	struct tcf_nlcb_buf *buf;
> +	struct nlmsghdr *nlh;
> +
> +	size = NLMSG_ALIGN(size);
> +	buf = LIST_FIRST(&ctx->nlbuf);
> +	if (buf && (buf->size + size) <= ctx->bufsize) {
> +		nlh = (struct nlmsghdr *)&buf->msg[buf->size];
> +		buf->size += size;
> +		return nlh;
> +	}
> +	if (size > ctx->bufsize) {
> +		DRV_LOG(WARNING, "netlink: too long command buffer requested");
> +		return NULL;
> +	}
> +	buf = rte_malloc(__func__,
> +			ctx->bufsize + sizeof(struct tcf_nlcb_buf),
> +			alignof(struct tcf_nlcb_buf));
> +	if (!buf) {
> +		DRV_LOG(WARNING, "netlink: no memory for command buffer");
> +		return NULL;
> +	}
> +	LIST_INSERT_HEAD(&ctx->nlbuf, buf, next);
> +	buf->size = size;
> +	nlh = (struct nlmsghdr *)&buf->msg[0];
> +	return nlh;
> +}
> +
> +/**
> + * Set NLM_F_ACK flags in the last netlink command in buffer.
> + * Only last command in the buffer will be acked by system.
> + *
> + * @param[in, out] buf
> + *   Pointer to buffer with netlink commands.
> + */
> +static void
> +flow_tcf_setack_nlcmd(struct tcf_nlcb_buf *buf)
> +{
> +	struct nlmsghdr *nlh;
> +	uint32_t size = 0;
> +
> +	assert(buf->size);
> +	do {
> +		nlh = (struct nlmsghdr *)&buf->msg[size];
> +		size += NLMSG_ALIGN(nlh->nlmsg_len);
> +		if (size >= buf->size) {
> +			nlh->nlmsg_flags |= NLM_F_ACK;
> +			break;
> +		}
> +	} while (true);
> +}
> +
> +/**
> + * Send the buffers with prepared netlink commands. Scans the list and
> + * sends all found buffers. Buffers are sent and freed anyway in order
> + * to prevent memory leaks even if sending some of them fails.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in, out] ctx
> + *   Pointer to callback context with command buffers list.
> + *
> + * @return
> + *   Zero value on success, negative errno value otherwise
> + *   and rte_errno is set.
> + */
> +static int
> +flow_tcf_send_nlcmd(struct mlx5_flow_tcf_context *tcf,
> +		    struct tcf_nlcb_context *ctx)
> +{
> +	struct tcf_nlcb_buf *bc, *bn;
> +	struct nlmsghdr *nlh;
> +	int ret = 0;
> +
> +	bc = LIST_FIRST(&ctx->nlbuf);
> +	while (bc) {
> +		int rc;
> +
> +		bn = LIST_NEXT(bc, next);
> +		if (bc->size) {
> +			flow_tcf_setack_nlcmd(bc);
> +			nlh = (struct nlmsghdr *)&bc->msg;
> +			rc = flow_tcf_nl_ack(tcf, nlh, bc->size, NULL, NULL);
> +			if (rc && !ret)
> +				ret = rc;
> +		}
> +		rte_free(bc);
> +		bc = bn;
> +	}
> +	LIST_INIT(&ctx->nlbuf);
> +	return ret;
> +}
> +
> +/**
> + * Collect local IP address rules with scope link attribute on specified
> + * network device. This is callback routine called by libmnl mnl_cb_run()
> + * in loop for every message in received packet.
> + *
> + * @param[in] nlh
> + *   Pointer to reply header.
> + * @param[in, out] arg
> + *   Opaque data pointer for this callback.
> + *
> + * @return
> + *   A positive, nonzero value on success, negative errno value otherwise
> + *   and rte_errno is set.
> + */
> +static int
> +flow_tcf_collect_local_cb(const struct nlmsghdr *nlh, void *arg)
> +{
> +	struct tcf_nlcb_context *ctx = arg;
> +	struct nlmsghdr *cmd;
> +	struct ifaddrmsg *ifa;
> +	struct nlattr *na;
> +	struct nlattr *na_local = NULL;
> +	struct nlattr *na_peer = NULL;
> +	unsigned char family;
> +
> +	if (nlh->nlmsg_type != RTM_NEWADDR) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	ifa = mnl_nlmsg_get_payload(nlh);
> +	family = ifa->ifa_family;
> +	if (ifa->ifa_index != ctx->ifindex ||
> +	    ifa->ifa_scope != RT_SCOPE_LINK ||
> +	    !(ifa->ifa_flags & IFA_F_PERMANENT) ||
> +	    (family != AF_INET && family != AF_INET6))
> +		return 1;
> +	mnl_attr_for_each(na, nlh, sizeof(*ifa)) {
> +		switch (mnl_attr_get_type(na)) {
> +		case IFA_LOCAL:
> +			na_local = na;
> +			break;
> +		case IFA_ADDRESS:
> +			na_peer = na;
> +			break;
> +		}
> +		if (na_local && na_peer)
> +			break;
> +	}
> +	if (!na_local || !na_peer)
> +		return 1;
> +	/* Local rule found with scope link, permanent and assigned peer. */
> +	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
> +					MNL_ALIGN(sizeof(struct ifaddrmsg)) +
> +					(family == AF_INET6
> +					? 2 * SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN)
> +					: 2 * SZ_NLATTR_TYPE_OF(uint32_t)));

Better to use IPV4_ADDR_LEN instead?
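
Something along these lines is what I mean: size both branches from an address
length constant so the IPv4 and IPv6 cases read the same. The EX_* names below
are made up for the example (the patch has its own SZ_NLATTR_* macros and
IPV6_ADDR_LEN), and I use NLA_ALIGN()/NLA_HDRLEN from <linux/netlink.h> only to
keep the sketch standalone.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/netlink.h>

/* Hypothetical names; the patch uses SZ_NLATTR_* and IPV6_ADDR_LEN. */
#define EX_IPV4_ADDR_LEN 4
#define EX_IPV6_ADDR_LEN 16
/* Space taken by one attribute carrying len bytes of payload. */
#define EX_ATTR_SPACE(len) ((size_t)NLA_ALIGN(NLA_HDRLEN + (len)))

/* Room needed for the IFA_LOCAL + IFA_ADDRESS pair of one family. */
static size_t
addr_attrs_space(int is_ipv6)
{
	size_t len = is_ipv6 ? EX_IPV6_ADDR_LEN : EX_IPV4_ADDR_LEN;

	return 2 * EX_ATTR_SPACE(len);
}
```

The resulting sizes are the same as with sizeof(uint32_t); the point is only
readability and symmetry with the IPv6 branch.
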

> +	if (!cmd) {
> +		rte_errno = ENOMEM;
> +		return -rte_errno;
> +	}
> +	cmd = mnl_nlmsg_put_header(cmd);
> +	cmd->nlmsg_type = RTM_DELADDR;
> +	cmd->nlmsg_flags = NLM_F_REQUEST;
> +	ifa = mnl_nlmsg_put_extra_header(cmd, sizeof(*ifa));
> +	ifa->ifa_flags = IFA_F_PERMANENT;
> +	ifa->ifa_scope = RT_SCOPE_LINK;
> +	ifa->ifa_index = ctx->ifindex;
> +	if (family == AF_INET) {
> +		ifa->ifa_family = AF_INET;
> +		ifa->ifa_prefixlen = 32;
> +		mnl_attr_put_u32(cmd, IFA_LOCAL, mnl_attr_get_u32(na_local));
> +		mnl_attr_put_u32(cmd, IFA_ADDRESS, mnl_attr_get_u32(na_peer));
> +	} else {
> +		ifa->ifa_family = AF_INET6;
> +		ifa->ifa_prefixlen = 128;
> +		mnl_attr_put(cmd, IFA_LOCAL, IPV6_ADDR_LEN,
> +			mnl_attr_get_payload(na_local));
> +		mnl_attr_put(cmd, IFA_ADDRESS, IPV6_ADDR_LEN,
> +			mnl_attr_get_payload(na_peer));
> +	}
> +	return 1;
> +}
> +
> +/**
> + * Cleanup the local IP addresses on outer interface.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifindex
> + *   Network interface index to perform cleanup.
> + */
> +static void
> +flow_tcf_encap_local_cleanup(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifindex)
> +{
> +	struct nlmsghdr *nlh;
> +	struct ifaddrmsg *ifa;
> +	struct tcf_nlcb_context ctx = {
> +		.ifindex = ifindex,
> +		.bufsize = MNL_REQUEST_SIZE,
> +		.nlbuf = LIST_HEAD_INITIALIZER(),
> +	};
> +	int ret;
> +
> +	assert(ifindex);
> +	/*
> +	 * Seek and destroy leftovers of local IP addresses with
> +	 * matching properties "scope link".
> +	 */
> +	nlh = mnl_nlmsg_put_header(tcf->buf);
> +	nlh->nlmsg_type = RTM_GETADDR;
> +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
> +	ifa = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifa));
> +	ifa->ifa_family = AF_UNSPEC;
> +	ifa->ifa_index = ifindex;
> +	ifa->ifa_scope = RT_SCOPE_LINK;
> +	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_local_cb, &ctx);
> +	if (ret)
> +		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
> +	ret = flow_tcf_send_nlcmd(tcf, &ctx);
> +	if (ret)
> +		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
> +}
> +
> +/**
> + * Collect permanent neigh rules on specified network device.
> + * This is callback routine called by libmnl mnl_cb_run() in loop for
> + * every message in received packet.
> + *
> + * @param[in] nlh
> + *   Pointer to reply header.
> + * @param[in, out] arg
> + *   Opaque data pointer for this callback.
> + *
> + * @return
> + *   A positive, nonzero value on success, negative errno value otherwise
> + *   and rte_errno is set.
> + */
> +static int
> +flow_tcf_collect_neigh_cb(const struct nlmsghdr *nlh, void *arg)
> +{
> +	struct tcf_nlcb_context *ctx = arg;
> +	struct nlmsghdr *cmd;
> +	struct ndmsg *ndm;
> +	struct nlattr *na;
> +	struct nlattr *na_ip = NULL;
> +	struct nlattr *na_mac = NULL;
> +	unsigned char family;
> +
> +	if (nlh->nlmsg_type != RTM_NEWNEIGH) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	ndm = mnl_nlmsg_get_payload(nlh);
> +	family = ndm->ndm_family;
> +	if (ndm->ndm_ifindex != (int)ctx->ifindex ||
> +	   !(ndm->ndm_state & NUD_PERMANENT) ||
> +	   (family != AF_INET && family != AF_INET6))
> +		return 1;
> +	mnl_attr_for_each(na, nlh, sizeof(*ndm)) {
> +		switch (mnl_attr_get_type(na)) {
> +		case NDA_DST:
> +			na_ip = na;
> +			break;
> +		case NDA_LLADDR:
> +			na_mac = na;
> +			break;
> +		}
> +		if (na_mac && na_ip)
> +			break;
> +	}
> +	if (!na_mac || !na_ip)
> +		return 1;
> +	/* Neigh rule with permanent attribute found. */
> +	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
> +					MNL_ALIGN(sizeof(struct ndmsg)) +
> +					SZ_NLATTR_DATA_OF(ETHER_ADDR_LEN) +
> +					(family == AF_INET6
> +					? SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN)
> +					: SZ_NLATTR_TYPE_OF(uint32_t)));

Better to use IPV4_ADDR_LEN instead?

> +	if (!cmd) {
> +		rte_errno = ENOMEM;
> +		return -rte_errno;
> +	}
> +	cmd = mnl_nlmsg_put_header(cmd);
> +	cmd->nlmsg_type = RTM_DELNEIGH;
> +	cmd->nlmsg_flags = NLM_F_REQUEST;
> +	ndm = mnl_nlmsg_put_extra_header(cmd, sizeof(*ndm));
> +	ndm->ndm_ifindex = ctx->ifindex;
> +	ndm->ndm_state = NUD_PERMANENT;
> +	ndm->ndm_flags = 0;
> +	ndm->ndm_type = 0;
> +	if (family == AF_INET) {
> +		ndm->ndm_family = AF_INET;
> +		mnl_attr_put_u32(cmd, NDA_DST, mnl_attr_get_u32(na_ip));
> +	} else {
> +		ndm->ndm_family = AF_INET6;
> +		mnl_attr_put(cmd, NDA_DST, IPV6_ADDR_LEN,
> +			     mnl_attr_get_payload(na_ip));
> +	}
> +	mnl_attr_put(cmd, NDA_LLADDR, ETHER_ADDR_LEN,
> +		     mnl_attr_get_payload(na_mac));
> +	return 1;
> +}
> +
> +/**
> + * Cleanup the neigh rules on outer interface.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifindex
> + *   Network interface index to perform cleanup.
> + */
> +static void
> +flow_tcf_encap_neigh_cleanup(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifindex)
> +{
> +	struct nlmsghdr *nlh;
> +	struct ndmsg *ndm;
> +	struct tcf_nlcb_context ctx = {
> +		.ifindex = ifindex,
> +		.bufsize = MNL_REQUEST_SIZE,
> +		.nlbuf = LIST_HEAD_INITIALIZER(),
> +	};
> +	int ret;
> +
> +	assert(ifindex);
> +	/* Seek and destroy leftovers of neigh rules. */
> +	nlh = mnl_nlmsg_put_header(tcf->buf);
> +	nlh->nlmsg_type = RTM_GETNEIGH;
> +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
> +	ndm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ndm));
> +	ndm->ndm_family = AF_UNSPEC;
> +	ndm->ndm_ifindex = ifindex;
> +	ndm->ndm_state = NUD_PERMANENT;
> +	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_neigh_cb, &ctx);
> +	if (ret)
> +		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
> +	ret = flow_tcf_send_nlcmd(tcf, &ctx);
> +	if (ret)
> +		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
> +}
> +
> +/**
> + * Collect indices of VXLAN encap/decap interfaces associated with device.
> + * This is callback routine called by libmnl mnl_cb_run() in loop for
> + * every message in received packet.
> + *
> + * @param[in] nlh
> + *   Pointer to reply header.
> + * @param[in, out] arg
> + *   Opaque data pointer for this callback.
> + *
> + * @return
> + *   A positive, nonzero value on success, negative errno value otherwise
> + *   and rte_errno is set.
> + */
> +static int
> +flow_tcf_collect_vxlan_cb(const struct nlmsghdr *nlh, void *arg)
> +{
> +	struct tcf_nlcb_context *ctx = arg;
> +	struct nlmsghdr *cmd;
> +	struct ifinfomsg *ifm;
> +	struct nlattr *na;
> +	struct nlattr *na_info = NULL;
> +	struct nlattr *na_vxlan = NULL;
> +	bool found = false;
> +	unsigned int vxindex;
> +
> +	if (nlh->nlmsg_type != RTM_NEWLINK) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	ifm = mnl_nlmsg_get_payload(nlh);
> +	if (!ifm->ifi_index) {
> +		rte_errno = EINVAL;
> +		return -rte_errno;
> +	}
> +	mnl_attr_for_each(na, nlh, sizeof(*ifm))
> +		if (mnl_attr_get_type(na) == IFLA_LINKINFO) {
> +			na_info = na;
> +			break;
> +		}
> +	if (!na_info)
> +		return 1;
> +	mnl_attr_for_each_nested(na, na_info) {
> +		switch (mnl_attr_get_type(na)) {
> +		case IFLA_INFO_KIND:
> +			if (!strncmp("vxlan", mnl_attr_get_str(na),
> +				     mnl_attr_get_len(na)))
> +				found = true;
> +			break;
> +		case IFLA_INFO_DATA:
> +			na_vxlan = na;
> +			break;
> +		}
> +		if (found && na_vxlan)
> +			break;
> +	}
> +	if (!found || !na_vxlan)
> +		return 1;
> +	found = false;
> +	mnl_attr_for_each_nested(na, na_vxlan) {
> +		if (mnl_attr_get_type(na) == IFLA_VXLAN_LINK &&
> +		    mnl_attr_get_u32(na) == ctx->ifindex) {
> +			found = true;
> +			break;
> +		}
> +	}
> +	if (!found)
> +		return 1;
> +	/* Attached VXLAN device found, store the command to delete. */
> +	vxindex = ifm->ifi_index;
> +	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
> +					MNL_ALIGN(sizeof(struct ifinfomsg)));
> +	if (!cmd) {
> +		rte_errno = ENOMEM;
> +		return -rte_errno;
> +	}
> +	cmd = mnl_nlmsg_put_header(cmd);
> +	cmd->nlmsg_type = RTM_DELLINK;
> +	cmd->nlmsg_flags = NLM_F_REQUEST;
> +	ifm = mnl_nlmsg_put_extra_header(cmd, sizeof(*ifm));
> +	ifm->ifi_family = AF_UNSPEC;
> +	ifm->ifi_index = vxindex;
> +	return 1;
> +}
> +
> +/**
> + * Cleanup the outer interface. Removes all found vxlan devices
> + * attached to specified index, flushes the neigh and local IP
> + * database.
> + *
> + * @param[in] tcf
> + *   Context object initialized by mlx5_flow_tcf_context_create().
> + * @param[in] ifindex
> + *   Network interface index to perform cleanup.
> + */
> +static void
> +flow_tcf_encap_iface_cleanup(struct mlx5_flow_tcf_context *tcf,
> +			    unsigned int ifindex)
> +{
> +	struct nlmsghdr *nlh;
> +	struct ifinfomsg *ifm;
> +	struct tcf_nlcb_context ctx = {
> +		.ifindex = ifindex,
> +		.bufsize = MNL_REQUEST_SIZE,
> +		.nlbuf = LIST_HEAD_INITIALIZER(),
> +	};
> +	int ret;
> +
> +	assert(ifindex);
> +	/*
> +	 * Seek and destroy leftover VXLAN encap/decap interfaces with
> +	 * matching properties.
> +	 */
> +	nlh = mnl_nlmsg_put_header(tcf->buf);
> +	nlh->nlmsg_type = RTM_GETLINK;
> +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
> +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> +	ifm->ifi_family = AF_UNSPEC;
> +	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_vxlan_cb, &ctx);
> +	if (ret)
> +		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
> +	ret = flow_tcf_send_nlcmd(tcf, &ctx);
> +	if (ret)
> +		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
> +}
> +
> +
>  /**
>   * Create target interface index for VXLAN tunneling decapsulation.
>   * In order to share the UDP port within the other interfaces the
> @@ -4100,12 +4596,9 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
>  		uint16_t pcnt;
>  
>  		/* Not found, we should create the new attached VTEP. */
> -/*
> - * TODO: not implemented yet
> - * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> - * flow_tcf_encap_local_cleanup(tcf, ifouter);
> - * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
> - */
> +		flow_tcf_encap_iface_cleanup(tcf, ifouter);
> +		flow_tcf_encap_local_cleanup(tcf, ifouter);
> +		flow_tcf_encap_neigh_cleanup(tcf, ifouter);

I have a fundamental question. Why are these cleanups needed? If I read the
code correctly, it looks like it cleans up VTEPs, IP assignments and neigh entries
which are not created/set by the PMD. The reason why we have to clean things up is that
the PMD exclusively owns the interface (ifouter). Is my understanding correct?

Thanks,
Yongseok

>  		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
>  				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
>  			encap_port++;
>


* Re: [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions
  2018-10-23 10:01     ` Yongseok Koh
@ 2018-10-25 12:50       ` Slava Ovsiienko
  2018-10-25 23:33         ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-25 12:50 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Tuesday, October 23, 2018 13:02
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and
> definitions
> 
> On Mon, Oct 15, 2018 at 02:13:29PM +0000, Viacheslav Ovsiienko wrote:
> > This part of patchset adds configuration changes in makefile and
> > meson.build for Mellanox MLX5 PMD. Also necessary defenitions
> > for VXLAN support are made and appropriate data structures
> > are presented.
> >
> > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  drivers/net/mlx5/Makefile        |  80 ++++++++++++++++++
> >  drivers/net/mlx5/meson.build     |  32 +++++++
> >  drivers/net/mlx5/mlx5_flow.h     |  11 +++
> >  drivers/net/mlx5/mlx5_flow_tcf.c | 175
> +++++++++++++++++++++++++++++++++++++++
> >  4 files changed, 298 insertions(+)
> >
> > diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
> > index 1e9c0b4..fec7779 100644
> > --- a/drivers/net/mlx5/Makefile
> > +++ b/drivers/net/mlx5/Makefile
> > @@ -207,6 +207,11 @@ mlx5_autoconf.h.new:
> $(RTE_SDK)/buildtools/auto-config-h.sh
> >  		enum IFLA_PHYS_PORT_NAME \
> >  		$(AUTOCONF_OUTPUT)
> >  	$Q sh -- '$<' '$@' \
> > +		HAVE_IFLA_VXLAN_COLLECT_METADATA \
> > +		linux/if_link.h \
> > +		enum IFLA_VXLAN_COLLECT_METADATA \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> >  		HAVE_TCA_CHAIN \
> >  		linux/rtnetlink.h \
> >  		enum TCA_CHAIN \
> > @@ -367,6 +372,81 @@ mlx5_autoconf.h.new:
> $(RTE_SDK)/buildtools/auto-config-h.sh
> >  		enum TCA_VLAN_PUSH_VLAN_PRIORITY \
> >  		$(AUTOCONF_OUTPUT)
> >  	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_KEY_ID \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_KEY_ID \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV4_SRC \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV4_DST \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV6_SRC \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV6_DST \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
> > +		linux/pkt_cls.h \
> > +		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TC_ACT_TUNNEL_KEY \
> > +		linux/tc_act/tc_tunnel_key.h \
> > +		define TCA_ACT_TUNNEL_KEY \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> > +		HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT \
> > +		linux/tc_act/tc_tunnel_key.h \
> > +		enum TCA_TUNNEL_KEY_ENC_DST_PORT \
> > +		$(AUTOCONF_OUTPUT)
> > +	$Q sh -- '$<' '$@' \
> >  		HAVE_TC_ACT_PEDIT \
> >  		linux/tc_act/tc_pedit.h \
> >  		enum TCA_PEDIT_KEY_EX_HDR_TYPE_UDP \
> > diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
> > index c192d44..43aabf2 100644
> > --- a/drivers/net/mlx5/meson.build
> > +++ b/drivers/net/mlx5/meson.build
> > @@ -126,6 +126,8 @@ if build
> >  		'IFLA_PHYS_SWITCH_ID' ],
> >  		[ 'HAVE_IFLA_PHYS_PORT_NAME', 'linux/if_link.h',
> >  		'IFLA_PHYS_PORT_NAME' ],
> > +		[ 'HAVE_IFLA_VXLAN_COLLECT_METADATA', 'linux/if_link.h',
> > +		'IFLA_VXLAN_COLLECT_METADATA' ],
> >  		[ 'HAVE_TCA_CHAIN', 'linux/rtnetlink.h',
> >  		'TCA_CHAIN' ],
> >  		[ 'HAVE_TCA_FLOWER_ACT', 'linux/pkt_cls.h',
> > @@ -190,6 +192,36 @@ if build
> >  		'TC_ACT_GOTO_CHAIN' ],
> >  		[ 'HAVE_TC_ACT_VLAN', 'linux/tc_act/tc_vlan.h',
> >  		'TCA_VLAN_PUSH_VLAN_PRIORITY' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_KEY_ID', 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_KEY_ID' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV4_SRC' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV4_DST' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV4_DST_MASK' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV6_SRC' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV6_DST' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_IPV6_DST_MASK' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT' ],
> > +		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK',
> 'linux/pkt_cls.h',
> > +		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK' ],
> > +		[ 'HAVE_TC_ACT_TUNNEL_KEY',
> 'linux/tc_act/tc_tunnel_key.h',
> > +		'TCA_ACT_TUNNEL_KEY' ],
> > +		[ 'HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT',
> 'linux/tc_act/tc_tunnel_key.h',
> > +		'TCA_TUNNEL_KEY_ENC_DST_PORT' ],
> >  		[ 'HAVE_TC_ACT_PEDIT', 'linux/tc_act/tc_pedit.h',
> >  		'TCA_PEDIT_KEY_EX_HDR_TYPE_UDP' ],
> >  		[ 'HAVE_RDMA_NL_NLDEV', 'rdma/rdma_netlink.h',
> > diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
> > index 840d645..b838ab0 100644
> > --- a/drivers/net/mlx5/mlx5_flow.h
> > +++ b/drivers/net/mlx5/mlx5_flow.h
> > @@ -85,6 +85,8 @@
> >  #define MLX5_FLOW_ACTION_SET_TP_SRC (1u << 15)
> >  #define MLX5_FLOW_ACTION_SET_TP_DST (1u << 16)
> >  #define MLX5_FLOW_ACTION_JUMP (1u << 17)
> > +#define MLX5_ACTION_VXLAN_ENCAP (1u << 11)
> > +#define MLX5_ACTION_VXLAN_DECAP (1u << 12)
> 
> MLX5_ACTION_* has been changed to MLX5_FLOW_ACTION_* as you can
> see above.
OK. Miscopied from the previous version of the patch.

> And make it alphabetical order; decap first and encap later? Or, at least make
> it consistent. The order (case clause) is different among validate, prepare and
> translate.
OK. Will reorder.

> 
> >  #define MLX5_FLOW_FATE_ACTIONS \
> >  	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE |
> MLX5_FLOW_ACTION_RSS)
> > @@ -182,8 +184,17 @@ struct mlx5_flow_dv {
> >  struct mlx5_flow_tcf {
> >  	struct nlmsghdr *nlh;
> >  	struct tcmsg *tcm;
> > +	uint32_t nlsize; /**< Size of NL message buffer. */
> 
> It is used only for assert(), but if prepare() is trusted, why do we need to
> keep it? I don't think it is needed.
> 
Question: could we keep nlsize under the NDEBUG flag? It is extremely useful to
have an assert() on the allocated size for debugging purposes.
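
A minimal sketch of what keeping the size field in debug builds only might
look like. The structure and helper below (ex_flow_tcf, ex_nl_reserve) are
hypothetical stand-ins invented for illustration, not code from the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for struct mlx5_flow_tcf. */
struct ex_flow_tcf {
	uint8_t *buf;    /* Netlink message buffer. */
	uint32_t used;   /* Bytes already consumed in buf. */
#ifndef NDEBUG
	uint32_t nlsize; /* Allocated size, present in debug builds only. */
#endif
};

/* Reserve len bytes in the message buffer and return a pointer to them. */
static inline void *
ex_nl_reserve(struct ex_flow_tcf *flow, uint32_t len)
{
	void *ptr;

	/*
	 * assert() expands to nothing when NDEBUG is defined, so the
	 * reference to nlsize disappears together with the field.
	 */
	assert(flow->used + len <= flow->nlsize);
	ptr = flow->buf + flow->used;
	flow->used += len;
	return ptr;
}
```

With this layout a release build pays neither for the field nor for the
check, while a debug build still catches a size miscalculated at prepare time.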

> > +	uint32_t applied:1; /**< Whether rule is currently applied. */
> > +	uint64_t item_flags; /**< Item flags. */
> 
> This isn't used at all.
OK, nothing depends on it now, so it should be removed.

> 
> > +	uint64_t action_flags; /**< Action flags. */
> 
> I checked following patches and it doesn't seem necessary. Please refer to
> the
> comment on the translation func. But if you think it is really needed, you
> could've used actions field of struct rte_flow and layers field of struct
> mlx5_flow in mlx5_flow.h

When translating the item list into a Netlink message we have to know whether
there is a tunneling action in the actions list, because the presence of a
tunneling action can change the meaning of an item. For example, the ipv4 item
usually provides IPv4 addresses for matching and is translated to the
TCA_FLOWER_KEY_IPV4_SRC (+ xxx_DST) Netlink attribute(s), but if a
VXLAN decap action is specified, this item becomes the outer tunnel source IP
and should be translated to TCA_FLOWER_KEY_ENC_IPV4_SRC. The action
list is scanned in the prepare routine, so we can save the action flags and use
these gathered results in the translation routine. As we can see from the
mlx5_flow_list_create() source, it does not save the item/action flags gathered
by flow_drv_prepare(). That's why there are item_flags/action_flags in
struct mlx5_flow_tcf. item_flags is not needed and should be removed.
action_flags is in use.
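
The dependency described above can be sketched as follows. This is only an
illustration under stated assumptions: the EX_ names and helper functions are
hypothetical, and the attribute numbers are local copies of the values believed
to be used by linux/pkt_cls.h (the ENC ones match the fallback defines quoted
in this patch), so the sketch builds without kernel headers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical action flag; the real bit lives in mlx5_flow.h. */
#define EX_ACTION_VXLAN_DECAP (UINT64_C(1) << 0)

/* Local copies of flower attribute numbers (assumed values). */
enum {
	EX_TCA_FLOWER_KEY_IPV4_SRC = 10,
	EX_TCA_FLOWER_KEY_IPV4_DST = 12,
	EX_TCA_FLOWER_KEY_ENC_IPV4_SRC = 27,
	EX_TCA_FLOWER_KEY_ENC_IPV4_DST = 29,
};

/* Pick the attribute for the ipv4 source address of a pattern item. */
static inline uint16_t
ex_ipv4_src_attr(uint64_t action_flags)
{
	/* With decap, pattern items describe the outer tunnel header. */
	return (action_flags & EX_ACTION_VXLAN_DECAP) ?
	       EX_TCA_FLOWER_KEY_ENC_IPV4_SRC : EX_TCA_FLOWER_KEY_IPV4_SRC;
}

/* Pick the attribute for the ipv4 destination address of a pattern item. */
static inline uint16_t
ex_ipv4_dst_attr(uint64_t action_flags)
{
	return (action_flags & EX_ACTION_VXLAN_DECAP) ?
	       EX_TCA_FLOWER_KEY_ENC_IPV4_DST : EX_TCA_FLOWER_KEY_IPV4_DST;
}
```

The point is that the translation of an item cannot be decided from the item
alone; the flags gathered while scanning the action list have to be available
at that time.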

BTW, do we need the item_flags and action_flags params in flow_drv_prepare()?
We would avoid the item_flags field if the flags were transferred from
flow_drv_prepare() to flow_drv_translate() (as local variables of
mlx5_flow_list_create()).

> >  	uint64_t hits;
> >  	uint64_t bytes;
> > +	union { /**< Tunnel encap/decap descriptor. */
> > +		struct mlx5_flow_tcf_tunnel_hdr *tunnel;
> > +		struct mlx5_flow_tcf_vxlan_decap *vxlan_decap;
> > +		struct mlx5_flow_tcf_vxlan_encap *vxlan_encap;
> > +	};
> 
> What is the reason for keeping pointer even though the actual structure
> follows
> after mlx5_flow_tcf? Maybe you don't want to waste memory, as the size of
> encap/decap struct differs a lot?

Sizes differ, but not a lot (especially compared with the DV rule size).
Would you prefer to simplify and just include the union?
On the other hand, we could declare something like this:
	...
	uint8_t tunnel_type;
	alignas(struct mlx5_flow_tcf_tunnel_hdr) uint8_t buf[];

and eliminate the pointer altogether. The beginning of buf contains either the
tunnel structure or the Netlink message (if there is no tunnel), depending on
the tunnel_type field.
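
A rough, self-contained sketch of that layout, assuming C11 and <stdalign.h>.
All ex_ names are hypothetical, the header structure is a reduced copy for
illustration, and placing the Netlink message right after the tunnel
descriptor is an assumption of this sketch rather than something settled in
the thread:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Reduced, hypothetical copy of the common tunnel descriptor header. */
struct ex_tunnel_hdr {
	uint32_t type;
	unsigned int ifindex_tun;
	unsigned int ifindex_org;
	unsigned int *ifindex_ptr;
};

enum ex_tunnel_type { EX_TUNNEL_NONE, EX_TUNNEL_VXLAN };

struct ex_flow_tcf {
	uint64_t hits;
	uint64_t bytes;
	uint8_t tunnel_type;
	/* Trailing storage, aligned so a tunnel header can live at its start. */
	alignas(struct ex_tunnel_hdr) uint8_t buf[];
};

/* Allocate a flow with room for an optional tunnel header plus the message. */
static struct ex_flow_tcf *
ex_flow_alloc(uint8_t tunnel_type, size_t nlmsg_size)
{
	size_t tun = tunnel_type == EX_TUNNEL_NONE ?
		     0 : sizeof(struct ex_tunnel_hdr);
	struct ex_flow_tcf *flow = calloc(1, sizeof(*flow) + tun + nlmsg_size);

	if (flow)
		flow->tunnel_type = tunnel_type;
	return flow;
}

/* Tunnel descriptor, or NULL if the flow has no tunnel action. */
static struct ex_tunnel_hdr *
ex_flow_tunnel(struct ex_flow_tcf *flow)
{
	return flow->tunnel_type == EX_TUNNEL_NONE ?
	       NULL : (struct ex_tunnel_hdr *)flow->buf;
}

/* Start of the Netlink message inside the trailing buffer. */
static uint8_t *
ex_flow_nlmsg(struct ex_flow_tcf *flow)
{
	return flow->tunnel_type == EX_TUNNEL_NONE ?
	       flow->buf : flow->buf + sizeof(struct ex_tunnel_hdr);
}
```

The trade-off is one byte for tunnel_type instead of a pointer, at the price
of computing the message offset through an accessor.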

With best regards,
Slava

> 
> >  };
> >
> >  /* Verbs specification header. */
> > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> b/drivers/net/mlx5/mlx5_flow_tcf.c
> > index 5c46f35..8f9c78a 100644
> > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > @@ -54,6 +54,37 @@ struct tc_vlan {
> >
> >  #endif /* HAVE_TC_ACT_VLAN */
> >
> > +#ifdef HAVE_TC_ACT_TUNNEL_KEY
> > +
> > +#include <linux/tc_act/tc_tunnel_key.h>
> > +
> > +#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
> > +#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
> > +#endif
> > +
> > +#else /* HAVE_TC_ACT_TUNNEL_KEY */
> > +
> > +#define TCA_ACT_TUNNEL_KEY 17
> > +#define TCA_TUNNEL_KEY_ACT_SET 1
> > +#define TCA_TUNNEL_KEY_ACT_RELEASE 2
> > +#define TCA_TUNNEL_KEY_PARMS 2
> > +#define TCA_TUNNEL_KEY_ENC_IPV4_SRC 3
> > +#define TCA_TUNNEL_KEY_ENC_IPV4_DST 4
> > +#define TCA_TUNNEL_KEY_ENC_IPV6_SRC 5
> > +#define TCA_TUNNEL_KEY_ENC_IPV6_DST 6
> > +#define TCA_TUNNEL_KEY_ENC_KEY_ID 7
> > +#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
> > +#define TCA_TUNNEL_KEY_NO_CSUM 10
> > +
> > +struct tc_tunnel_key {
> > +	tc_gen;
> > +	int t_action;
> > +};
> > +
> > +#endif /* HAVE_TC_ACT_TUNNEL_KEY */
> > +
> > +
> > +
> >  #ifdef HAVE_TC_ACT_PEDIT
> >
> >  #include <linux/tc_act/tc_pedit.h>
> > @@ -210,6 +241,45 @@ struct tc_pedit_sel {
> >  #ifndef HAVE_TCA_FLOWER_KEY_VLAN_ETH_TYPE
> >  #define TCA_FLOWER_KEY_VLAN_ETH_TYPE 25
> >  #endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_KEY_ID
> > +#define TCA_FLOWER_KEY_ENC_KEY_ID 26
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC
> > +#define TCA_FLOWER_KEY_ENC_IPV4_SRC 27
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK
> > +#define TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK 28
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST
> > +#define TCA_FLOWER_KEY_ENC_IPV4_DST 29
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK
> > +#define TCA_FLOWER_KEY_ENC_IPV4_DST_MASK 30
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC
> > +#define TCA_FLOWER_KEY_ENC_IPV6_SRC 31
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK
> > +#define TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK 32
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST
> > +#define TCA_FLOWER_KEY_ENC_IPV6_DST 33
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK
> > +#define TCA_FLOWER_KEY_ENC_IPV6_DST_MASK 34
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT
> > +#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT 43
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK
> > +#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK 44
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT
> > +#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT 45
> > +#endif
> > +#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK
> > +#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK 46
> > +#endif
> >  #ifndef HAVE_TCA_FLOWER_KEY_TCP_FLAGS
> >  #define TCA_FLOWER_KEY_TCP_FLAGS 71
> >  #endif
> > @@ -232,6 +302,111 @@ struct tc_pedit_sel {
> >  #define TP_PORT_LEN 2 /* Transport Port (UDP/TCP) Length */
> >  #endif
> >
> > +#define MLX5_VXLAN_PORT_RANGE_MIN 30000
> > +#define MLX5_VXLAN_PORT_RANGE_MAX 60000
> > +#define MLX5_VXLAN_DEVICE_PFX "vmlx_"
> > +
> > +/** Tunnel action type, used for @p type in header structure. */
> > +enum mlx5_flow_tcf_tunact_type {
> > +	MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP,
> > +	MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP,
> > +};
> > +
> > +/** Flags used for @p mask in tunnel action encap descriptors. */
> > +#define	MLX5_FLOW_TCF_ENCAP_ETH_SRC (1u << 0)
> > +#define	MLX5_FLOW_TCF_ENCAP_ETH_DST (1u << 1)
> > +#define	MLX5_FLOW_TCF_ENCAP_IPV4_SRC (1u << 2)
> > +#define	MLX5_FLOW_TCF_ENCAP_IPV4_DST (1u << 3)
> > +#define	MLX5_FLOW_TCF_ENCAP_IPV6_SRC (1u << 4)
> > +#define	MLX5_FLOW_TCF_ENCAP_IPV6_DST (1u << 5)
> > +#define	MLX5_FLOW_TCF_ENCAP_UDP_SRC (1u << 6)
> > +#define	MLX5_FLOW_TCF_ENCAP_UDP_DST (1u << 7)
> > +#define	MLX5_FLOW_TCF_ENCAP_VXLAN_VNI (1u << 8)
> > +
> > +/** Neigh rule structure */
> > +struct tcf_neigh_rule {
> > +	LIST_ENTRY(tcf_neigh_rule) next;
> > +	uint32_t refcnt;
> > +	struct ether_addr eth;
> > +	uint16_t mask;
> > +	union {
> > +		struct {
> > +			rte_be32_t dst;
> > +		} ipv4;
> > +		struct {
> > +			uint8_t dst[16];
> > +		} ipv6;
> > +	};
> > +};
> > +
> > +/** Local rule structure */
> > +struct tcf_local_rule {
> > +	LIST_ENTRY(tcf_neigh_rule) next;
> > +	uint32_t refcnt;
> > +	uint16_t mask;
> > +	union {
> > +		struct {
> > +			rte_be32_t dst;
> > +			rte_be32_t src;
> > +		} ipv4;
> > +		struct {
> > +			uint8_t dst[16];
> > +			uint8_t src[16];
> > +		} ipv6;
> > +	};
> > +};
> > +
> > +/** VXLAN virtual netdev. */
> > +struct mlx5_flow_tcf_vtep {
> > +	LIST_ENTRY(mlx5_flow_tcf_vtep) next;
> > +	LIST_HEAD(, tcf_neigh_rule) neigh;
> > +	LIST_HEAD(, tcf_local_rule) local;
> > +	uint32_t refcnt;
> > +	unsigned int ifindex; /**< Own interface index. */
> > +	unsigned int ifouter; /**< Index of device attached to. */
> > +	uint16_t port;
> > +	uint8_t created;
> > +};
> > +
> > +/** Tunnel descriptor header, common for all tunnel types. */
> > +struct mlx5_flow_tcf_tunnel_hdr {
> > +	uint32_t type; /**< Tunnel action type. */
> > +	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
> > +	unsigned int ifindex_org; /**< Original dst/src interface */
> > +	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
> > +};
> > +
> > +struct mlx5_flow_tcf_vxlan_decap {
> > +	struct mlx5_flow_tcf_tunnel_hdr hdr;
> > +	uint16_t udp_port;
> > +};
> > +
> > +struct mlx5_flow_tcf_vxlan_encap {
> > +	struct mlx5_flow_tcf_tunnel_hdr hdr;
> > +	uint32_t mask;
> > +	struct {
> > +		struct ether_addr dst;
> > +		struct ether_addr src;
> > +	} eth;
> > +	union {
> > +		struct {
> > +			rte_be32_t dst;
> > +			rte_be32_t src;
> > +		} ipv4;
> > +		struct {
> > +			uint8_t dst[16];
> > +			uint8_t src[16];
> > +		} ipv6;
> > +	};
> > +struct {
> > +		rte_be16_t src;
> > +		rte_be16_t dst;
> > +	} udp;
> > +	struct {
> > +		uint8_t vni[3];
> > +	} vxlan;
> > +};
> > +
> >  /**
> >   * Structure for holding netlink context.
> >   * Note the size of the message buffer which is
> MNL_SOCKET_BUFFER_SIZE.
> >


* Re: [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-23 10:04     ` Yongseok Koh
@ 2018-10-25 13:53       ` Slava Ovsiienko
  2018-10-26  3:07         ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-25 13:53 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Tuesday, October 23, 2018 13:05
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> routine
> 
> On Mon, Oct 15, 2018 at 02:13:30PM +0000, Viacheslav Ovsiienko wrote:
> > This part of patchset adds support for flow item/action lists
> > validation. The following entities are now supported:
> >
> > - RTE_FLOW_ITEM_TYPE_VXLAN, contains the tunnel VNI
> >
> > - RTE_FLOW_ACTION_TYPE_VXLAN_DECAP, if this action is specified
> >   the items in the flow items list treated as outer network
> >   parameters for tunnel outer header match. The ethernet layer
> >   addresses always are treated as inner ones.
> >
> > - RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP, contains the item list to
> >   build the encapsulation header. In current implementation the
> >   values is the subject for some constraints:
> >     - outer source MAC address will be always unconditionally
> >       set to the one of MAC addresses of outer egress interface
> >     - no way to specify source UDP port
> >     - all abovementioned parameters are ignored if specified
> >       in the rule, warning messages are sent to the log
> >
> > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  drivers/net/mlx5/mlx5_flow_tcf.c | 711
> > ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 705 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > index 8f9c78a..0055417 100644
> > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > @@ -430,6 +430,7 @@ struct mlx5_flow_tcf_context {
> >  	struct rte_flow_item_ipv6 ipv6;
> >  	struct rte_flow_item_tcp tcp;
> >  	struct rte_flow_item_udp udp;
> > +	struct rte_flow_item_vxlan vxlan;
> >  } flow_tcf_mask_empty;
> >
> >  /** Supported masks for known item types. */ @@ -441,6 +442,7 @@
> > struct mlx5_flow_tcf_context {
> >  	struct rte_flow_item_ipv6 ipv6;
> >  	struct rte_flow_item_tcp tcp;
> >  	struct rte_flow_item_udp udp;
> > +	struct rte_flow_item_vxlan vxlan;
> >  } flow_tcf_mask_supported = {
> >  	.port_id = {
> >  		.id = 0xffffffff,
> > @@ -478,6 +480,9 @@ struct mlx5_flow_tcf_context {
> >  		.src_port = RTE_BE16(0xffff),
> >  		.dst_port = RTE_BE16(0xffff),
> >  	},
> > +	.vxlan = {
> > +	       .vni = "\xff\xff\xff",
> > +	},
> >  };
> >
> >  #define SZ_NLATTR_HDR MNL_ALIGN(sizeof(struct nlattr)) @@ -943,6
> > +948,615 @@ struct pedit_parser {  }
> >
> >  /**
> > + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_ETH item for E-
> Switch.
> 
> How about mentioning it is to validate items of the "encap header"? Same
> for the rest.


OK. Will clarify description(s).

> 
> > + *
> > + * @param[in] item
> > + *   Pointer to the itemn structure.
> 
> Typo. Same for the rest.

OK. Thanks, will correct.

> 
> > + * @param[out] error
> > + *   Pointer to the error structure.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_errno is set.
> > + **/
> > +static int
> > +flow_tcf_validate_vxlan_encap_eth(const struct rte_flow_item *item,
> > +				  struct rte_flow_error *error)
> > +{
> > +	const struct rte_flow_item_eth *spec = item->spec;
> > +	const struct rte_flow_item_eth *mask = item->mask;
> > +
> > +	if (!spec)
> > +		/*
> > +		 * Specification for L2 addresses can be empty
> > +		 * because these ones are optional and not
> > +		 * required directly by tc rule.
> > +		 */
> > +		return 0;
> > +	if (!mask)
> > +		/* If mask is not specified use the default one. */
> > +		mask = &rte_flow_item_eth_mask;
> > +	if (memcmp(&mask->dst,
> > +		   &flow_tcf_mask_empty.eth.dst,
> > +		   sizeof(flow_tcf_mask_empty.eth.dst))) {
> > +		if (memcmp(&mask->dst,
> > +			   &rte_flow_item_eth_mask.dst,
> > +			   sizeof(rte_flow_item_eth_mask.dst)))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"eth.dst\" field");
> > +	}
> > +	if (memcmp(&mask->src,
> > +		   &flow_tcf_mask_empty.eth.src,
> > +		   sizeof(flow_tcf_mask_empty.eth.src))) {
> > +		if (memcmp(&mask->src,
> > +			   &rte_flow_item_eth_mask.src,
> > +			   sizeof(rte_flow_item_eth_mask.src)))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"eth.src\" field");
> > +	}
> > +	if (mask->type != RTE_BE16(0x0000)) {
> > +		if (mask->type != RTE_BE16(0xffff))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"eth.type\" field");
> > +		DRV_LOG(WARNING,
> > +			"outer ethernet type field "
> > +			"cannot be forced for VXLAN "
> > +			"encapsulation, parameter ignored");
> > +	}
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV4 item for E-
> Switch.
> > + *
> > + * @param[in] item
> > + *   Pointer to the itemn structure.
> > + * @param[out] error
> > + *   Pointer to the error structure.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_errno is set.
> > + **/
> > +static int
> > +flow_tcf_validate_vxlan_encap_ipv4(const struct rte_flow_item *item,
> > +				   struct rte_flow_error *error)
> > +{
> > +	const struct rte_flow_item_ipv4 *spec = item->spec;
> > +	const struct rte_flow_item_ipv4 *mask = item->mask;
> > +
> > +	if (!spec)
> > +		/*
> > +		 * Specification for L3 addresses cannot be empty
> > +		 * because it is required by tunnel_key parameter.
> > +		 */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "NULL outer L3 address specification "
> > +				 " for VXLAN encapsulation");
> > +	if (!mask)
> > +		mask = &rte_flow_item_ipv4_mask;
> > +	if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
> > +		if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"ipv4.hdr.dst_addr\" field");
> > +		/* More L3 address validations can be put here. */
> > +	} else {
> > +		/*
> > +		 * Kernel uses the destination L3 address to determine
> > +		 * the routing path and obtain the L2 destination
> > +		 * address, so L3 destination address must be
> > +		 * specified in the tc rule.
> > +		 */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "outer L3 destination address must be "
> > +				 "specified for VXLAN encapsulation");
> > +	}
> > +	if (mask->hdr.src_addr != RTE_BE32(0x00000000)) {
> > +		if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"ipv4.hdr.src_addr\" field");
> > +		/* More L3 address validations can be put here. */
> > +	} else {
> > +		/*
> > +		 * Kernel uses the source L3 address to select the
> > +		 * interface for egress encapsulated traffic, so
> > +		 * it must be specified in the tc rule.
> > +		 */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "outer L3 source address must be "
> > +				 "specified for VXLAN encapsulation");
> > +	}
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_IPV6 item for E-
> Switch.
> > + *
> > + * @param[in] item
> > + *   Pointer to the itemn structure.
> > + * @param[out] error
> > + *   Pointer to the error structure.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_ernno is set.
> > + **/
> > +static int
> > +flow_tcf_validate_vxlan_encap_ipv6(const struct rte_flow_item *item,
> > +				   struct rte_flow_error *error)
> > +{
> > +	const struct rte_flow_item_ipv6 *spec = item->spec;
> > +	const struct rte_flow_item_ipv6 *mask = item->mask;
> > +
> > +	if (!spec)
> > +		/*
> > +		 * Specification for L3 addresses cannot be empty
> > +		 * because it is required by tunnel_key parameter.
> > +		 */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "NULL outer L3 address specification "
> > +				 " for VXLAN encapsulation");
> > +	if (!mask)
> > +		mask = &rte_flow_item_ipv6_mask;
> > +	if (memcmp(&mask->hdr.dst_addr,
> > +		   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
> > +		   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
> > +		if (memcmp(&mask->hdr.dst_addr,
> > +		   &rte_flow_item_ipv6_mask.hdr.dst_addr,
> > +		   sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"ipv6.hdr.dst_addr\" field");
> > +		/* More L3 address validations can be put here. */
> > +	} else {
> > +		/*
> > +		 * Kernel uses the destination L3 address to determine
> > +		 * the routing path and obtain the L2 destination
> > +		 * address (heigh or gate), so L3 destination address
> > +		 * must be specified within the tc rule.
> > +		 */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "outer L3 destination address must be "
> > +				 "specified for VXLAN encapsulation");
> > +	}
> > +	if (memcmp(&mask->hdr.src_addr,
> > +		   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
> > +		   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
> > +		if (memcmp(&mask->hdr.src_addr,
> > +		   &rte_flow_item_ipv6_mask.hdr.src_addr,
> > +		   sizeof(rte_flow_item_ipv6_mask.hdr.src_addr)))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"ipv6.hdr.src_addr\" field");
> > +		/* More L3 address validation can be put here. */
> > +	} else {
> > +		/*
> > +		 * Kernel uses the source L3 address to select the
> > +		 * interface for egress encapsulated traffic, so
> > +		 * it must be specified in the tc rule.
> > +		 */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "outer L3 source address must be "
> > +				 "specified for VXLAN encapsulation");
> > +	}
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_UDP item for E-
> Switch.
> > + *
> > + * @param[in] item
> > + *   Pointer to the itemn structure.
> > + * @param[out] error
> > + *   Pointer to the error structure.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_ernno is set.
> > + **/
> > +static int
> > +flow_tcf_validate_vxlan_encap_udp(const struct rte_flow_item *item,
> > +				  struct rte_flow_error *error)
> > +{
> > +	const struct rte_flow_item_udp *spec = item->spec;
> > +	const struct rte_flow_item_udp *mask = item->mask;
> > +
> > +	if (!spec)
> > +		/*
> > +		 * Specification for UDP ports cannot be empty
> > +		 * because it is required by tunnel_key parameter.
> > +		 */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "NULL UDP port specification "
> > +				 " for VXLAN encapsulation");
> > +	if (!mask)
> > +		mask = &rte_flow_item_udp_mask;
> > +	if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
> > +		if (mask->hdr.dst_port != RTE_BE16(0xffff))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"udp.hdr.dst_port\" field");
> > +		if (!spec->hdr.dst_port)
> > +			return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "zero encap remote UDP port");
> > +	} else {
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "outer UDP remote port must be "
> > +				 "specified for VXLAN encapsulation");
> > +	}
> > +	if (mask->hdr.src_port != RTE_BE16(0x0000)) {
> > +		if (mask->hdr.src_port != RTE_BE16(0xffff))
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"udp.hdr.src_port\" field");
> > +		DRV_LOG(WARNING,
> > +			"outer UDP source port cannot be "
> > +			"forced for VXLAN encapsulation, "
> > +			"parameter ignored");
> > +	}
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Validate VXLAN_ENCAP action RTE_FLOW_ITEM_TYPE_VXLAN item for
> E-Switch.
> > + *
> > + * @param[in] item
> > + *   Pointer to the itemn structure.
> > + * @param[out] error
> > + *   Pointer to the error structure.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_ernno is set.
> > + **/
> > +static int
> > +flow_tcf_validate_vxlan_encap_vni(const struct rte_flow_item *item,
> > +				  struct rte_flow_error *error)
> > +{
> > +	const struct rte_flow_item_vxlan *spec = item->spec;
> > +	const struct rte_flow_item_vxlan *mask = item->mask;
> > +
> > +	if (!spec)
> > +		/* Outer VNI is required by tunnel_key parameter. */
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, item,
> > +				 "NULL VNI specification "
> > +				 " for VXLAN encapsulation");
> > +	if (!mask)
> > +		mask = &rte_flow_item_vxlan_mask;
> > +	if (mask->vni[0] != 0 ||
> > +	    mask->vni[1] != 0 ||
> > +	    mask->vni[2] != 0) {
> 
> can be one line.
> 

OK. Will be aligned.

> > +		if (mask->vni[0] != 0xff ||
> > +		    mask->vni[1] != 0xff ||
> > +		    mask->vni[2] != 0xff)
> 
> same here.
> 
OK. Ditto.

> > +			return rte_flow_error_set(error, ENOTSUP,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				 "no support for partial mask on"
> > +				 " \"vxlan.vni\" field");
> > +		if (spec->vni[0] == 0 &&
> > +		    spec->vni[1] == 0 &&
> > +		    spec->vni[2] == 0)
> > +			return rte_flow_error_set(error, EINVAL,
> > +					  RTE_FLOW_ERROR_TYPE_ITEM,
> item,
> > +					  "VXLAN vni cannot be 0");
> 
> It is already checked by mlx5_flow_validate_item_vxlan().
> 
> > +	} else {
> > +		return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM,
> > +				 item,
> > +				 "outer VNI must be specified "
> > +				 "for VXLAN encapsulation");
> > +	}
> 
> Already checked in mlx5_flow_validate_item_vxlan().

OK, this check will be removed. We trust the rules that are already validated.

> 
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Validate VXLAN_ENCAP action item list for E-Switch.
> > + *
> > + * @param[in] action
> > + *   Pointer to the VXLAN_ENCAP action structure.
> > + * @param[out] error
> > + *   Pointer to the error structure.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_errno is set.
> > + **/
> > +static int
> > +flow_tcf_validate_vxlan_encap(const struct rte_flow_action *action,
> > +			      struct rte_flow_error *error) {
> > +	const struct rte_flow_item *items;
> > +	int ret;
> > +	uint32_t item_flags = 0;
> > +
> > +	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
> > +	if (!action->conf)
> > +		return rte_flow_error_set
> > +			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
> > +			 action, "Missing VXLAN tunnel "
> > +				 "action configuration");
> > +	items = ((const struct rte_flow_action_vxlan_encap *)
> > +					action->conf)->definition;
> > +	if (!items)
> > +		return rte_flow_error_set
> > +			(error, EINVAL, RTE_FLOW_ERROR_TYPE_ACTION,
> > +			 action, "Missing VXLAN tunnel "
> > +				 "encapsulation parameters");
> > +	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
> > +		switch (items->type) {
> > +		case RTE_FLOW_ITEM_TYPE_VOID:
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_ETH:
> > +			ret = mlx5_flow_validate_item_eth(items,
> item_flags,
> > +							  error);
> > +			if (ret < 0)
> > +				return ret;
> > +			ret = flow_tcf_validate_vxlan_encap_eth(items,
> error);
> > +			if (ret < 0)
> > +				return ret;
> > +			item_flags |= MLX5_FLOW_LAYER_OUTER_L2;
> > +			break;
> > +		break;
> > +		case RTE_FLOW_ITEM_TYPE_IPV4:
> > +			ret = mlx5_flow_validate_item_ipv4(items,
> item_flags,
> > +							   error);
> > +			if (ret < 0)
> > +				return ret;
> > +			ret = flow_tcf_validate_vxlan_encap_ipv4(items,
> error);
> > +			if (ret < 0)
> > +				return ret;
> > +			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_IPV6:
> > +			ret = mlx5_flow_validate_item_ipv6(items,
> item_flags,
> > +							   error);
> > +			if (ret < 0)
> > +				return ret;
> > +			ret = flow_tcf_validate_vxlan_encap_ipv6(items,
> error);
> > +			if (ret < 0)
> > +				return ret;
> > +			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_UDP:
> > +			ret = mlx5_flow_validate_item_udp(items,
> item_flags,
> > +							   0xFF, error);
> > +			if (ret < 0)
> > +				return ret;
> > +			ret = flow_tcf_validate_vxlan_encap_udp(items,
> error);
> > +			if (ret < 0)
> > +				return ret;
> > +			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> > +			ret = mlx5_flow_validate_item_vxlan(items,
> > +							    item_flags, error);
> > +			if (ret < 0)
> > +				return ret;
> > +			ret = flow_tcf_validate_vxlan_encap_vni(items,
> error);
> > +			if (ret < 0)
> > +				return ret;
> > +			item_flags |= MLX5_FLOW_LAYER_VXLAN;
> > +			break;
> > +		default:
> > +			return rte_flow_error_set(error, ENOTSUP,
> > +					  RTE_FLOW_ERROR_TYPE_ITEM,
> items,
> > +					  "VXLAN encap item not supported");
> > +		}
> > +	}
> > +	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L3))
> > +		return rte_flow_error_set(error, EINVAL,
> > +					  RTE_FLOW_ERROR_TYPE_ACTION,
> action,
> > +					  "no outer L3 layer found"
> > +					  " for VXLAN encapsulation");
> > +	if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP))
> > +		return rte_flow_error_set(error, EINVAL,
> > +					  RTE_FLOW_ERROR_TYPE_ACTION,
> action,
> > +					  "no outer L4 layer found"
> 
> L4 -> UDP?

OK.  Good clarification.

> 
> > +					  " for VXLAN encapsulation");
> > +	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
> > +		return rte_flow_error_set(error, EINVAL,
> > +					  RTE_FLOW_ERROR_TYPE_ACTION,
> action,
> > +					  "no VXLAN VNI found"
> > +					  " for VXLAN encapsulation");
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Validate VXLAN_DECAP action outer tunnel items for E-Switch.
> > + *
> > + * @param[in] item_flags
> > + *   Mask of provided outer tunnel parameters
> > + * @param[in] ipv4
> > + *   Outer IPv4 address item (if any, NULL otherwise).
> > + * @param[in] ipv6
> > + *   Outer IPv6 address item (if any, NULL otherwise).
> > + * @param[in] udp
> > + *   Outer UDP layer item (if any, NULL otherwise).
> > + * @param[out] error
> > + *   Pointer to the error structure.
> > + *
> > + * @return
> > + *   0 on success, a negative errno value otherwise and rte_errno is set.
> > + **/
> > +static int
> > +flow_tcf_validate_vxlan_decap(uint32_t item_flags,
> > +			      const struct rte_flow_action *action,
> > +			      const struct rte_flow_item *ipv4,
> > +			      const struct rte_flow_item *ipv6,
> > +			      const struct rte_flow_item *udp,
> > +			      struct rte_flow_error *error) {
> > +	if (!ipv4 && !ipv6)
> > +		return rte_flow_error_set(error, EINVAL,
> > +					  RTE_FLOW_ERROR_TYPE_ACTION,
> action,
> > +					  "no outer L3 layer found"
> > +					  " for VXLAN decapsulation");
> > +	if (ipv4) {
> > +		const struct rte_flow_item_ipv4 *spec = ipv4->spec;
> > +		const struct rte_flow_item_ipv4 *mask = ipv4->mask;
> > +
> > +		if (!spec)
> > +			/*
> > +			 * Specification for L3 addresses cannot be empty
> > +			 * because it is required as decap parameter.
> > +			 */
> > +			return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
> > +				 "NULL outer L3 address specification "
> > +				 " for VXLAN decapsulation");
> > +		if (!mask)
> > +			mask = &rte_flow_item_ipv4_mask;
> > +		if (mask->hdr.dst_addr != RTE_BE32(0x00000000)) {
> > +			if (mask->hdr.dst_addr != RTE_BE32(0xffffffff))
> > +				return rte_flow_error_set(error, ENOTSUP,
> > +
> RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> > +					 "no support for partial mask on"
> > +					 " \"ipv4.hdr.dst_addr\" field");
> > +			/* More L3 address validations can be put here. */
> > +		} else {
> > +			/*
> > +			 * Kernel uses the destination L3 address
> > +			 * to determine the ingress network interface
> > +			 * for traffic being decapsulated.
> > +			 */
> > +			return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv4,
> > +				 "outer L3 destination address must be "
> > +				 "specified for VXLAN decapsulation");
> > +		}
> > +		/* Source L3 address is optional for decap. */
> > +		if (mask->hdr.src_addr != RTE_BE32(0x00000000))
> > +			if (mask->hdr.src_addr != RTE_BE32(0xffffffff))
> > +				return rte_flow_error_set(error, ENOTSUP,
> > +
> RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> > +					 "no support for partial mask on"
> > +					 " \"ipv4.hdr.src_addr\" field");
> > +	} else {
> > +		const struct rte_flow_item_ipv6 *spec = ipv6->spec;
> > +		const struct rte_flow_item_ipv6 *mask = ipv6->mask;
> > +
> > +		if (!spec)
> > +			/*
> > +			 * Specification for L3 addresses cannot be empty
> > +			 * because it is required as decap parameter.
> > +			 */
> > +			return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
> > +				 "NULL outer L3 address specification "
> > +				 " for VXLAN decapsulation");
> > +		if (!mask)
> > +			mask = &rte_flow_item_ipv6_mask;
> > +		if (memcmp(&mask->hdr.dst_addr,
> > +			   &flow_tcf_mask_empty.ipv6.hdr.dst_addr,
> > +			   sizeof(flow_tcf_mask_empty.ipv6.hdr.dst_addr))) {
> > +			if (memcmp(&mask->hdr.dst_addr,
> > +				&rte_flow_item_ipv6_mask.hdr.dst_addr,
> > +
> 	sizeof(rte_flow_item_ipv6_mask.hdr.dst_addr)))
> > +				return rte_flow_error_set(error, ENOTSUP,
> > +				       RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> mask,
> > +				       "no support for partial mask on"
> > +				       " \"ipv6.hdr.dst_addr\" field");
> > +		/* More L3 address validations can be put here. */
> > +		} else {
> > +			/*
> > +			 * Kernel uses the destination L3 address
> > +			 * to determine the ingress network interface
> > +			 * for traffic being decapsulated.
> > +			 */
> > +			return rte_flow_error_set(error, EINVAL,
> > +				 RTE_FLOW_ERROR_TYPE_ITEM, ipv6,
> > +				 "outer L3 destination address must be "
> > +				 "specified for VXLAN decapsulation");
> > +		}
> > +		/* Source L3 address is optional for decap. */
> > +		if (memcmp(&mask->hdr.src_addr,
> > +			   &flow_tcf_mask_empty.ipv6.hdr.src_addr,
> > +			   sizeof(flow_tcf_mask_empty.ipv6.hdr.src_addr))) {
> > +			if (memcmp(&mask->hdr.src_addr,
> > +				   &rte_flow_item_ipv6_mask.hdr.src_addr,
> > +				   sizeof(mask->hdr.src_addr)))
> > +				return rte_flow_error_set(error, ENOTSUP,
> > +
> 	RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> > +					"no support for partial mask on"
> > +					" \"ipv6.hdr.src_addr\" field");
> > +		}
> > +	}
> > +	if (!udp) {
> > +		return rte_flow_error_set(error, EINVAL,
> > +					  RTE_FLOW_ERROR_TYPE_ACTION,
> action,
> > +					  "no outer L4 layer found"
> > +					  " for VXLAN decapsulation");
> > +	} else {
> > +		const struct rte_flow_item_udp *spec = udp->spec;
> > +		const struct rte_flow_item_udp *mask = udp->mask;
> > +
> > +		if (!spec)
> > +			/*
> > +			 * Specification for UDP ports cannot be empty
> > +			 * because it is required as decap parameter.
> > +			 */
> > +			return rte_flow_error_set(error, EINVAL,
> > +					 RTE_FLOW_ERROR_TYPE_ITEM,
> udp,
> > +					 "NULL UDP port specification "
> > +					 " for VXLAN decapsulation");
> > +		if (!mask)
> > +			mask = &rte_flow_item_udp_mask;
> > +		if (mask->hdr.dst_port != RTE_BE16(0x0000)) {
> > +			if (mask->hdr.dst_port != RTE_BE16(0xffff))
> > +				return rte_flow_error_set(error, ENOTSUP,
> > +
> RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> > +					 "no support for partial mask on"
> > +					 " \"udp.hdr.dst_port\" field");
> > +			if (!spec->hdr.dst_port)
> > +				return rte_flow_error_set(error, EINVAL,
> > +					 RTE_FLOW_ERROR_TYPE_ITEM,
> udp,
> > +					 "zero decap local UDP port");
> > +		} else {
> > +			return rte_flow_error_set(error, EINVAL,
> > +					 RTE_FLOW_ERROR_TYPE_ITEM,
> udp,
> > +					 "outer UDP destination port must be "
> > +					 "specified for VXLAN decapsulation");
> > +		}
> > +		if (mask->hdr.src_port != RTE_BE16(0x0000)) {
> > +			if (mask->hdr.src_port != RTE_BE16(0xffff))
> > +				return rte_flow_error_set(error, ENOTSUP,
> > +
> RTE_FLOW_ERROR_TYPE_ITEM_MASK, mask,
> > +					 "no support for partial mask on"
> > +					 " \"udp.hdr.src_port\" field");
> > +			DRV_LOG(WARNING,
> > +			"outer UDP local port cannot be "
> > +			"forced for VXLAN encapsulation, "
> > +			"parameter ignored");
> > +		}
> > +	}
> > +	if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
> > +		return rte_flow_error_set(error, EINVAL,
> > +					  RTE_FLOW_ERROR_TYPE_ACTION,
> action,
> > +					  "no VXLAN VNI found"
> > +					  " for VXLAN decapsulation");
> > +	/* VNI is already validated, extra check can be put here. */
> > +	return 0;
> > +}
> > +
> > +/**
> >   * Validate flow for E-Switch.
> >   *
> >   * @param[in] priv
> > @@ -974,7 +1588,8 @@ struct pedit_parser {
> >  		const struct rte_flow_item_ipv6 *ipv6;
> >  		const struct rte_flow_item_tcp *tcp;
> >  		const struct rte_flow_item_udp *udp;
> > -	} spec, mask;
> > +		const struct rte_flow_item_vxlan *vxlan;
> > +	 } spec, mask;
> >  	union {
> >  		const struct rte_flow_action_port_id *port_id;
> >  		const struct rte_flow_action_jump *jump;
> > @@ -983,9 +1598,13 @@ struct pedit_parser {
> >  			of_set_vlan_vid;
> >  		const struct rte_flow_action_of_set_vlan_pcp *
> >  			of_set_vlan_pcp;
> > +		const struct rte_flow_action_vxlan_encap *vxlan_encap;
> >  		const struct rte_flow_action_set_ipv4 *set_ipv4;
> >  		const struct rte_flow_action_set_ipv6 *set_ipv6;
> >  	} conf;
> > +	const struct rte_flow_item *ipv4 = NULL; /* storage to check */
> > +	const struct rte_flow_item *ipv6 = NULL; /* outer tunnel. */
> > +	const struct rte_flow_item *udp = NULL;  /* parameters. */
> >  	uint32_t item_flags = 0;
> >  	uint32_t action_flags = 0;
> >  	uint8_t next_protocol = -1;
> > @@ -1114,7 +1733,6 @@ struct pedit_parser {
> >  							   error);
> >  			if (ret < 0)
> >  				return ret;
> > -			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> >  			mask.ipv4 = flow_tcf_item_mask
> >  				(items, &rte_flow_item_ipv4_mask,
> >  				 &flow_tcf_mask_supported.ipv4,
> > @@ -1135,13 +1753,22 @@ struct pedit_parser {
> >  				next_protocol =
> >  					((const struct rte_flow_item_ipv4 *)
> >  					 (items->spec))->hdr.next_proto_id;
> > +			if (item_flags &
> MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
> > +				/*
> > +				 * Multiple outer items are not allowed as
> > +				 * tunnel parameters, will raise an error later.
> > +				 */
> > +				ipv4 = NULL;
> 
> Can't it be inner then?
AFAIK, no. For tc rules we cannot specify multiple levels (inner + outer).
There are simply no TCA_FLOWER_KEY_xxx attributes for specifying inner items
to be matched by flower.

It is quite an unclear comment, not the best one, sorry. I did not like it
either, I just forgot to rewrite it.

The ipv4, ipv6 and udp variables gather the matching items while the item list
is scanned; later these variables are used for VXLAN decap action validation
only. So "outer" means that the ipv4 variable contains the VXLAN decap outer
addresses, and it should be NULL-ed if multiple items are found in the item list.

But we can generate an error here if we have valid action_flags
(gathered by the prepare function) and VXLAN decap is set. Raising
an error looks more relevant and clear.
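
The gathering scheme described above can be sketched as follows. This is a minimal illustration, not the PMD code: the item types, the LAYER_* flags, struct outer_items and both helper functions are hypothetical, simplified stand-ins for the rte_flow and mlx5 definitions. The first item of each outer layer is remembered, and the pointer is NULL-ed when the same layer shows up again.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the rte_flow item types. */
enum item_type {
	ITEM_END, ITEM_ETH, ITEM_IPV4, ITEM_IPV6, ITEM_UDP, ITEM_VXLAN
};

struct item {
	enum item_type type;
};

/* Hypothetical layer flags, modelled on MLX5_FLOW_LAYER_OUTER_*. */
#define LAYER_OUTER_L3_IPV4 (1u << 0)
#define LAYER_OUTER_L4_UDP  (1u << 1)

struct outer_items {
	const struct item *ipv4; /* NULL if absent or seen more than once. */
	const struct item *udp;  /* NULL if absent or seen more than once. */
	uint32_t flags;          /* Layers seen at least once. */
};

/* Remember the first item of a layer, drop the pointer on a repeat. */
static void
remember_once(const struct item **slot, uint32_t *flags, uint32_t layer,
	      const struct item *it)
{
	if (*flags & layer) {
		*slot = NULL;
	} else {
		*slot = it;
		*flags |= layer;
	}
}

/* Single pass over an ITEM_END-terminated item list. */
static struct outer_items
gather_outer_items(const struct item *items)
{
	struct outer_items out = { NULL, NULL, 0 };

	assert(items);
	for (; items->type != ITEM_END; items++) {
		switch (items->type) {
		case ITEM_IPV4:
			remember_once(&out.ipv4, &out.flags,
				      LAYER_OUTER_L3_IPV4, items);
			break;
		case ITEM_UDP:
			remember_once(&out.udp, &out.flags,
				      LAYER_OUTER_L4_UDP, items);
			break;
		default:
			break;
		}
	}
	return out;
}
```

A decap validator that receives a NULL pointer for a layer whose flag is set can then report the duplicate, which is the "raise an error" variant mentioned above.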

 
>   flow create 1 ingress transfer
>     pattern eth src is 66:77:88:99:aa:bb
>       dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
>       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
>       eth / ipv6 / tcp dst is 42 / end
>     actions vxlan_decap / port_id id 2 / end
> 
> Is this flow supported by linux tcf? I took this example from Adrien's patch -
> "[8/8] net/mlx5: add VXLAN decap support to switch flow rules". If so, isn't it
> possible to have inner L3 layer (MLX5_FLOW_LAYER_INNER_*)? If not, you
> should return error in this case. I don't see any code to check redundant
> outer items.
> Did I miss something?

Interesting. Although the rule has correct syntax, I'm not sure whether it can be
applied without errors. At least our current flow_tcf_translate() implementation
does not support any INNER items. But it seems that flow_tcf_validate() does; this
needs rechecking, since we should not allow unsupported items to pass validation.
I'll check and provide a separate bugfix patch (if needed).

> 
> BTW, for the tunneled items, why don't you follow the code of
> Verbs(mlx5_flow_verbs.c) and DV(mlx5_flow_dv.c)? For tcf, it is the first time
For VXLAN it has some specifics (warnings about ignored params, etc.).
I've checked which parts of the Verbs/DV code could be reused and did not
discover much. I'll recheck the latest commits; possibly the code has become
more appropriate for VXLAN.

> to add tunneled item, but Verbs/DV already have validation code for tunnel,
> so you can reuse the existing code. In flow_tcf_validate_vxlan_decap(), not
> every validation is VXLAN-specific but some of them can be common code.
> 
> And if you need to know whether there's the VXLAN decap action prior to
> outer header item validation, you can relocate the code - action validation
> first and item validation next, as there's no dependency yet in the current

We cannot validate the action first: the items need to be gathered beforehand,
so that we can check them in the action-specific fashion and check the action itself.
I mean, if we see a VXLAN decap action, we should check the presence of
L2, L3, L4 and VNI items. I minimized the number of passes along the item
and action lists. BTW, Adrien's approach performed two passes, mine does only one.

> code. Defining ipv4, ipv6, udp seems to make the code path more complex.
Yes, but it allows us to avoid the extra item list scan and minimizes the changes
to the existing code.
With your approach we would have to:
- scan actions without full checking, just gathering and checking action_flags
- scan items, performing varying checks (depending on the gathered action flags)
- scan actions again, performing the full check with params (at least, for now,
checking whether all params were gathered)
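
The pass ordering argued for here (items first, then actions, each list scanned once) can be sketched as below. This is only an illustration under stated assumptions: the enums, the LAYER_* flags and validate_single_pass() are hypothetical and much simpler than flow_tcf_validate(); only the ordering is the point.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical, simplified item and action types. */
enum item_type { ITEM_END, ITEM_IPV4, ITEM_UDP, ITEM_VXLAN };
enum action_type { ACTION_END, ACTION_VXLAN_DECAP, ACTION_PORT_ID };

/* Hypothetical layer flags gathered while scanning items. */
#define LAYER_L3    (1u << 0)
#define LAYER_UDP   (1u << 1)
#define LAYER_VXLAN (1u << 2)

/* Returns 0 on success, -EINVAL if decap lacks a required outer layer. */
static int
validate_single_pass(const enum item_type *items,
		     const enum action_type *actions)
{
	const uint32_t decap_needs = LAYER_L3 | LAYER_UDP | LAYER_VXLAN;
	uint32_t item_flags = 0;

	assert(items && actions);
	/* One scan of the items: only record which layers are present. */
	for (; *items != ITEM_END; items++) {
		switch (*items) {
		case ITEM_IPV4:
			item_flags |= LAYER_L3;
			break;
		case ITEM_UDP:
			item_flags |= LAYER_UDP;
			break;
		case ITEM_VXLAN:
			item_flags |= LAYER_VXLAN;
			break;
		default:
			break;
		}
	}
	/* One scan of the actions: decap is checked against gathered flags. */
	for (; *actions != ACTION_END; actions++) {
		if (*actions == ACTION_VXLAN_DECAP &&
		    (item_flags & decap_needs) != decap_needs)
			return -EINVAL;
	}
	return 0;
}
```

The trade-off is the one discussed in this thread: the item scan cannot tailor its checks to the action, so the action-specific checks are deferred until the action is seen.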
> 
> For example, you just can call vxlan decap item validation (by splitting
> flow_tcf_validate_vxlan_decap()) at this point like:
> 
> 			if (action_flags &
> MLX5_FLOW_ACTION_VXLAN_DECAP)
> 				ret =
> flow_tcf_validate_vxlan_decap_ipv4(...);
> 			...
> 
> Same for other items.
> 
> > +			} else {
> > +				ipv4 = items;
> > +				item_flags |=
> MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> > +			}
> >  			break;
> >  		case RTE_FLOW_ITEM_TYPE_IPV6:
> >  			ret = mlx5_flow_validate_item_ipv6(items,
> item_flags,
> >  							   error);
> >  			if (ret < 0)
> >  				return ret;
> > -			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> >  			mask.ipv6 = flow_tcf_item_mask
> >  				(items, &rte_flow_item_ipv6_mask,
> >  				 &flow_tcf_mask_supported.ipv6,
> > @@ -1162,13 +1789,22 @@ struct pedit_parser {
> >  				next_protocol =
> >  					((const struct rte_flow_item_ipv6 *)
> >  					 (items->spec))->hdr.proto;
> > +			if (item_flags &
> MLX5_FLOW_LAYER_OUTER_L3_IPV6) {
> > +				/*
> > +				 *Multiple outer items are not allowed as
> > +				 * tunnel parameters
> > +				 */
> > +				ipv6 = NULL;
> > +			} else {
> > +				ipv6 = items;
> > +				item_flags |=
> MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> > +			}
> >  			break;
> >  		case RTE_FLOW_ITEM_TYPE_UDP:
> >  			ret = mlx5_flow_validate_item_udp(items,
> item_flags,
> >  							  next_protocol,
> error);
> >  			if (ret < 0)
> >  				return ret;
> > -			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
> >  			mask.udp = flow_tcf_item_mask
> >  				(items, &rte_flow_item_udp_mask,
> >  				 &flow_tcf_mask_supported.udp,
> > @@ -1177,6 +1813,12 @@ struct pedit_parser {
> >  				 error);
> >  			if (!mask.udp)
> >  				return -rte_errno;
> > +			if (item_flags &
> MLX5_FLOW_LAYER_OUTER_L4_UDP) {
> > +				udp = NULL;
> > +			} else {
> > +				udp = items;
> > +				item_flags |=
> MLX5_FLOW_LAYER_OUTER_L4_UDP;
> > +			}
> >  			break;
> >  		case RTE_FLOW_ITEM_TYPE_TCP:
> >  			ret = mlx5_flow_validate_item_tcp
> > @@ -1186,7 +1828,6 @@ struct pedit_parser {
> >  					      error);
> >  			if (ret < 0)
> >  				return ret;
> > -			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
> >  			mask.tcp = flow_tcf_item_mask
> >  				(items, &rte_flow_item_tcp_mask,
> >  				 &flow_tcf_mask_supported.tcp,
> > @@ -1195,11 +1836,36 @@ struct pedit_parser {
> >  				 error);
> >  			if (!mask.tcp)
> >  				return -rte_errno;
> > +			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> > +			ret = mlx5_flow_validate_item_vxlan(items,
> > +							    item_flags, error);
> > +			if (ret < 0)
> > +				return ret;
> > +			mask.vxlan = flow_tcf_item_mask
> > +				(items, &rte_flow_item_vxlan_mask,
> > +				 &flow_tcf_mask_supported.vxlan,
> > +				 &flow_tcf_mask_empty.vxlan,
> > +				 sizeof(flow_tcf_mask_supported.vxlan),
> > +				 error);
> > +			if (!mask.vxlan)
> > +				return -rte_errno;
> > +			if (mask.vxlan->vni[0] != 0xff ||
> > +			    mask.vxlan->vni[1] != 0xff ||
> > +			    mask.vxlan->vni[2] != 0xff)
> > +				return rte_flow_error_set
> > +					(error, ENOTSUP,
> > +
> RTE_FLOW_ERROR_TYPE_ITEM_MASK,
> > +					 mask.vxlan,
> > +					 "no support for partial or "
> > +					 "empty mask on \"vxlan.vni\" field");
> > +			item_flags |= MLX5_FLOW_LAYER_VXLAN;
> >  			break;
> >  		default:
> >  			return rte_flow_error_set(error, ENOTSUP,
> >
> RTE_FLOW_ERROR_TYPE_ITEM,
> > -						  NULL, "item not
> supported");
> > +						  items, "item not
> supported");
> >  		}
> >  	}
> >  	for (; actions->type != RTE_FLOW_ACTION_TYPE_END; actions++) {
> > @@ -1271,6 +1937,33 @@ struct pedit_parser {
> >  					 " set action must follow push
> action");
> >  			current_action_flag =
> MLX5_FLOW_ACTION_OF_SET_VLAN_PCP;
> >  			break;
> > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > +			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> > +					   | MLX5_ACTION_VXLAN_DECAP))
> > +				return rte_flow_error_set
> > +					(error, ENOTSUP,
> > +					 RTE_FLOW_ERROR_TYPE_ACTION,
> actions,
> > +					 "can't have multiple vxlan actions");
> > +			ret = flow_tcf_validate_vxlan_encap(actions, error);
> > +			if (ret < 0)
> > +				return ret;
> > +			action_flags |= MLX5_ACTION_VXLAN_ENCAP;
> 
> Recently, current_action_flag has been added for PEDIT actions. Please refer
> to the code above and make it compliant.

OK. I missed this while rebasing, will make the code compliant.

With best regards,
Slava
> 
> > +			break;
> > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > +			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> > +					   | MLX5_ACTION_VXLAN_DECAP))
> > +				return rte_flow_error_set
> > +					(error, ENOTSUP,
> > +					 RTE_FLOW_ERROR_TYPE_ACTION,
> actions,
> > +					 "can't have multiple vxlan actions");
> > +			ret = flow_tcf_validate_vxlan_decap(item_flags,
> > +							    actions,
> > +							    ipv4, ipv6, udp,
> > +							    error);
> > +			if (ret < 0)
> > +				return ret;
> > +			action_flags |= MLX5_ACTION_VXLAN_DECAP;
> > +			break;
> >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> >  			current_action_flag =
> MLX5_FLOW_ACTION_SET_IPV4_SRC;
> >  			break;
> > @@ -1391,6 +2084,12 @@ struct pedit_parser {
> >  		return rte_flow_error_set(error, EINVAL,
> >  					  RTE_FLOW_ERROR_TYPE_ACTION,
> actions,
> >  					  "no fate action is found");
> > +	if ((item_flags & MLX5_FLOW_LAYER_VXLAN) &&
> > +	    !(action_flags & MLX5_ACTION_VXLAN_DECAP))
> > +		return rte_flow_error_set(error, ENOTSUP,
> > +					 RTE_FLOW_ERROR_TYPE_ACTION,
> NULL,
> > +					 "VNI pattern should be followed "
> > +					 " by VXLAN_DECAP action");
> >  	return 0;
> >  }
> >

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine
  2018-10-23 10:06     ` Yongseok Koh
@ 2018-10-25 14:37       ` Slava Ovsiienko
  2018-10-26  4:22         ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-25 14:37 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Tuesday, October 23, 2018 13:06
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation
> routine
> 
> On Mon, Oct 15, 2018 at 02:13:31PM +0000, Viacheslav Ovsiienko wrote:
> > This part of patchset adds support of VXLAN-related items and actions
> > to the flow translation routine. If some of them are specified in the
> > rule, the extra space for tunnel description structure is allocated.
> > Later some tunnel types other than VXLAN can be added (GRE). No VTEP
> > devices are created at this point, the flow rule is just translated,
> > not applied yet.
> >
> > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  drivers/net/mlx5/mlx5_flow_tcf.c | 641
> > +++++++++++++++++++++++++++++++++++----
> >  1 file changed, 578 insertions(+), 63 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > index 0055417..660d45e 100644
> > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > @@ -2094,6 +2094,265 @@ struct pedit_parser {
> >  }
> >
> >  /**
> > + * Helper function to process RTE_FLOW_ITEM_TYPE_ETH entry in configuration
> > + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the MAC address fields
> > + * in the encapsulation parameters structure. The item must be prevalidated,
> > + * no validation checks are performed by the function.
> > + *
> > + * @param[in] spec
> > + *   RTE_FLOW_ITEM_TYPE_ETH entry specification.
> > + * @param[in] mask
> > + *   RTE_FLOW_ITEM_TYPE_ETH entry mask.
> > + * @param[out] encap
> > + *   Structure to fill the gathered MAC address data.
> > + *
> > + * @return
> > + *   The size needed the Netlink message tunnel_key
> > + *   parameter buffer to store the item attributes.
> > + */
> > +static int
> > +flow_tcf_parse_vxlan_encap_eth(const struct rte_flow_item_eth *spec,
> > +			       const struct rte_flow_item_eth *mask,
> > +			       struct mlx5_flow_tcf_vxlan_encap *encap) {
> > +	/* Item must be validated before. No redundant checks. */
> > +	assert(spec);
> > +	if (!mask || !memcmp(&mask->dst,
> > +			     &rte_flow_item_eth_mask.dst,
> > +			     sizeof(rte_flow_item_eth_mask.dst))) {
> > +		/*
> > +		 * Ethernet addresses are not supported by
> > +		 * tc as tunnel_key parameters. Destination
> > +		 * address is needed to form encap packet
> > +		 * header and retrieved by kernel from
> > +		 * implicit sources (ARP table, etc),
> > +		 * address masks are not supported at all.
> > +		 */
> > +		encap->eth.dst = spec->dst;
> > +		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_DST;
> > +	}
> > +	if (!mask || !memcmp(&mask->src,
> > +			     &rte_flow_item_eth_mask.src,
> > +			     sizeof(rte_flow_item_eth_mask.src))) {
> > +		/*
> > +		 * Ethernet addresses are not supported by
> > +		 * tc as tunnel_key parameters. Source ethernet
> > +		 * address is ignored anyway.
> > +		 */
> > +		encap->eth.src = spec->src;
> > +		encap->mask |= MLX5_FLOW_TCF_ENCAP_ETH_SRC;
> > +	}
> > +	/*
> > +	 * No space allocated for ethernet addresses within Netlink
> > +	 * message tunnel_key record - these ones are not
> > +	 * supported by tc.
> > +	 */
> > +	return 0;
> > +}
> > +
> > +/**
> > + * Helper function to process RTE_FLOW_ITEM_TYPE_IPV4 entry in configuration
> > + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV4 address fields
> > + * in the encapsulation parameters structure. The item must be prevalidated,
> > + * no validation checks are performed by the function.
> > + *
> > + * @param[in] spec
> > + *   RTE_FLOW_ITEM_TYPE_IPV4 entry specification.
> > + * @param[out] encap
> > + *   Structure to fill the gathered IPV4 address data.
> > + *
> > + * @return
> > + *   The size needed the Netlink message tunnel_key
> > + *   parameter buffer to store the item attributes.
> > + */
> > +static int
> > +flow_tcf_parse_vxlan_encap_ipv4(const struct rte_flow_item_ipv4 *spec,
> > +				struct mlx5_flow_tcf_vxlan_encap *encap) {
> > +	/* Item must be validated before. No redundant checks. */
> > +	assert(spec);
> > +	encap->ipv4.dst = spec->hdr.dst_addr;
> > +	encap->ipv4.src = spec->hdr.src_addr;
> > +	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV4_SRC |
> > +		       MLX5_FLOW_TCF_ENCAP_IPV4_DST;
> > +	return 2 * SZ_NLATTR_TYPE_OF(uint32_t);
> > +}
> > +
> > +/**
> > + * Helper function to process RTE_FLOW_ITEM_TYPE_IPV6 entry in configuration
> > + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the IPV6 address fields
> > + * in the encapsulation parameters structure. The item must be prevalidated,
> > + * no validation checks are performed by the function.
> > + *
> > + * @param[in] spec
> > + *   RTE_FLOW_ITEM_TYPE_IPV6 entry specification.
> > + * @param[out] encap
> > + *   Structure to fill the gathered IPV6 address data.
> > + *
> > + * @return
> > + *   The size needed the Netlink message tunnel_key
> > + *   parameter buffer to store the item attributes.
> > + */
> > +static int
> > +flow_tcf_parse_vxlan_encap_ipv6(const struct rte_flow_item_ipv6 *spec,
> > +				struct mlx5_flow_tcf_vxlan_encap *encap) {
> > +	/* Item must be validated before. No redundant checks. */
> > +	assert(spec);
> > +	memcpy(encap->ipv6.dst, spec->hdr.dst_addr, sizeof(encap->ipv6.dst));
> > +	memcpy(encap->ipv6.src, spec->hdr.src_addr, sizeof(encap->ipv6.src));
> > +	encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV6_SRC |
> > +		       MLX5_FLOW_TCF_ENCAP_IPV6_DST;
> > +	return SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN) * 2;
> > +}
> > +
> > +/**
> > + * Helper function to process RTE_FLOW_ITEM_TYPE_UDP entry in configuration
> > + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the UDP port fields
> > + * in the encapsulation parameters structure. The item must be prevalidated,
> > + * no validation checks are performed by the function.
> > + *
> > + * @param[in] spec
> > + *   RTE_FLOW_ITEM_TYPE_UDP entry specification.
> > + * @param[in] mask
> > + *   RTE_FLOW_ITEM_TYPE_UDP entry mask.
> > + * @param[out] encap
> > + *   Structure to fill the gathered UDP port data.
> > + *
> > + * @return
> > + *   The size needed the Netlink message tunnel_key
> > + *   parameter buffer to store the item attributes.
> > + */
> > +static int
> > +flow_tcf_parse_vxlan_encap_udp(const struct rte_flow_item_udp *spec,
> > +			       const struct rte_flow_item_udp *mask,
> > +			       struct mlx5_flow_tcf_vxlan_encap *encap) {
> > +	int size = SZ_NLATTR_TYPE_OF(uint16_t);
> > +
> > +	assert(spec);
> > +	encap->udp.dst = spec->hdr.dst_port;
> > +	encap->mask |= MLX5_FLOW_TCF_ENCAP_UDP_DST;
> > +	if (!mask || mask->hdr.src_port != RTE_BE16(0x0000)) {
> > +		encap->udp.src = spec->hdr.src_port;
> > +		size += SZ_NLATTR_TYPE_OF(uint16_t);
> > +		encap->mask |= MLX5_FLOW_TCF_ENCAP_IPV4_SRC;
> > +	}
> > +	return size;
> > +}
> > +
> > +/**
> > + * Helper function to process RTE_FLOW_ITEM_TYPE_VXLAN entry in configuration
> > + * of action RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. Fills the VNI fields
> > + * in the encapsulation parameters structure. The item must be prevalidated,
> > + * no validation checks are performed by the function.
> > + *
> > + * @param[in] spec
> > + *   RTE_FLOW_ITEM_TYPE_VXLAN entry specification.
> > + * @param[out] encap
> > + *   Structure to fill the gathered VNI address data.
> > + *
> > + * @return
> > + *   The size needed the Netlink message tunnel_key
> > + *   parameter buffer to store the item attributes.
> > + */
> > +static int
> > +flow_tcf_parse_vxlan_encap_vni(const struct rte_flow_item_vxlan *spec,
> > +			       struct mlx5_flow_tcf_vxlan_encap *encap) {
> > +	/* Item must be validated before. No redundant checks. */
> > +	assert(spec);
> > +	memcpy(encap->vxlan.vni, spec->vni, sizeof(encap->vxlan.vni));
> > +	encap->mask |= MLX5_FLOW_TCF_ENCAP_VXLAN_VNI;
> > +	return SZ_NLATTR_TYPE_OF(uint32_t);
> > +}
> > +
> > +/**
> > + * Populate consolidated encapsulation object from list of pattern items.
> > + *
> > + * Helper function to process configuration of action such as
> > + * RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP. The item list should be
> > + * validated, there is no way to return a meaningful error.
> > + *
> > + * @param[in] action
> > + *   RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP action object.
> > + *   List of pattern items to gather data from.
> > + * @param[out] src
> > + *   Structure to fill gathered data.
> > + *
> > + * @return
> > + *   The size the part of Netlink message buffer to store the item
> > + *   attributes on success, zero otherwise. The mask field in
> > + *   result structure reflects correctly parsed items.
> > + */
> > +static int
> > +flow_tcf_vxlan_encap_parse(const struct rte_flow_action *action,
> > +			   struct mlx5_flow_tcf_vxlan_encap *encap) {
> > +	union {
> > +		const struct rte_flow_item_eth *eth;
> > +		const struct rte_flow_item_ipv4 *ipv4;
> > +		const struct rte_flow_item_ipv6 *ipv6;
> > +		const struct rte_flow_item_udp *udp;
> > +		const struct rte_flow_item_vxlan *vxlan;
> > +	} spec, mask;
> > +	const struct rte_flow_item *items;
> > +	int size = 0;
> > +
> > +	assert(action->type == RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP);
> > +	assert(action->conf);
> > +
> > +	items = ((const struct rte_flow_action_vxlan_encap *)
> > +					action->conf)->definition;
> > +	assert(items);
> > +	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
> > +		switch (items->type) {
> > +		case RTE_FLOW_ITEM_TYPE_VOID:
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_ETH:
> > +			mask.eth = items->mask;
> > +			spec.eth = items->spec;
> > +			size += flow_tcf_parse_vxlan_encap_eth(spec.eth,
> > +							       mask.eth,
> > +							       encap);
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_IPV4:
> > +			spec.ipv4 = items->spec;
> > +			size += flow_tcf_parse_vxlan_encap_ipv4(spec.ipv4,
> > +								encap);
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_IPV6:
> > +			spec.ipv6 = items->spec;
> > +			size += flow_tcf_parse_vxlan_encap_ipv6(spec.ipv6,
> > +								encap);
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_UDP:
> > +			mask.udp = items->mask;
> > +			spec.udp = items->spec;
> > +			size += flow_tcf_parse_vxlan_encap_udp(spec.udp,
> > +							       mask.udp,
> > +							       encap);
> > +			break;
> > +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> > +			spec.vxlan = items->spec;
> > +			size += flow_tcf_parse_vxlan_encap_vni(spec.vxlan,
> > +							       encap);
> > +			break;
> > +		default:
> > +			assert(false);
> > +			DRV_LOG(WARNING,
> > +				"unsupported item %p type %d,"
> > +				" items must be validated"
> > +				" before flow creation",
> > +				(const void *)items, items->type);
> > +			encap->mask = 0;
> > +			return 0;
> > +		}
> > +	}
> > +	return size;
> > +}
> > +
> > +/**
> >   * Calculate maximum size of memory for flow items of Linux TC flower
> and
> >   * extract specified items.
> >   *
> > @@ -2148,7 +2407,7 @@ struct pedit_parser {
> >  		case RTE_FLOW_ITEM_TYPE_IPV6:
> >  			size += SZ_NLATTR_TYPE_OF(uint16_t) + /* Ether
> type. */
> >  				SZ_NLATTR_TYPE_OF(uint8_t) + /* IP proto.
> */
> > -				SZ_NLATTR_TYPE_OF(IPV6_ADDR_LEN) * 4;
> > +				SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN) * 4;
> >  				/* dst/src IP addr and mask. */
> >  			flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> >  			break;
> > @@ -2164,6 +2423,10 @@ struct pedit_parser {
> >  				/* dst/src port and mask. */
> >  			flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
> >  			break;
> > +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> > +			size += SZ_NLATTR_TYPE_OF(uint32_t);
> > +			flags |= MLX5_FLOW_LAYER_VXLAN;
> > +			break;
> >  		default:
> >  			DRV_LOG(WARNING,
> >  				"unsupported item %p type %d,"
> > @@ -2184,13 +2447,16 @@ struct pedit_parser {
> >   *   Pointer to the list of actions.
> >   * @param[out] action_flags
> >   *   Pointer to the detected actions.
> > + * @param[out] tunnel
> > + *   Pointer to tunnel encapsulation parameters structure to fill.
> >   *
> >   * @return
> >   *   Maximum size of memory for actions.
> >   */
> >  static int
> >  flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
> > -			      uint64_t *action_flags)
> > +			      uint64_t *action_flags,
> > +			      void *tunnel)
> 
> This func is to get actions and size but you are parsing and filling tunnel info
> here. It would be better to move parsing to translate() because it anyway has
> multiple if conditions (same as switch/case) to set TCA_TUNNEL_KEY_ENC_*
> there.
Do you mean the call of flow_tcf_vxlan_encap_parse(actions, tunnel)?
OK, let's move it to the translate stage. We still need to keep the encap
structure for the local/neigh rules anyway.

> 
> >  {
> >  	int size = 0;
> >  	uint64_t flags = 0;
> > @@ -2246,6 +2512,29 @@ struct pedit_parser {
> >  				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID.
> */
> >  				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio.
> */
> >  			break;
> > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> */
> > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > +			size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > +			size +=	flow_tcf_vxlan_encap_parse(actions, tunnel)
> +
> > +				RTE_ALIGN_CEIL /* preceding encap params.
> */
> > +				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > +				MNL_ALIGNTO);
> 
> Is it different from SZ_NLATTR_TYPE_OF(struct
> mlx5_flow_tcf_vxlan_encap)? Or, use __rte_aligned(MNL_ALIGNTO) instead.

It is written in this form intentionally. It means that struct mlx5_flow_tcf_vxlan_encap
is placed at the beginning of the buffer. It is not an NL attribute, so SZ_NLATTR_TYPE_OF
is not relevant here. The alignment is needed for the Netlink message that follows.
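
To illustrate the layout, here is a minimal sketch, not the patch code. It
assumes the Netlink alignment is 4 bytes (the value of MNL_ALIGNTO in libmnl);
struct sketch_encap and the SKETCH_* names are hypothetical stand-ins for the
real structure and for RTE_ALIGN_CEIL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: Netlink alignment is 4 bytes, as MNL_ALIGNTO in libmnl. */
#define SKETCH_NL_ALIGNTO 4u
/* Round v up to the next multiple of a (a must be a power of two). */
#define SKETCH_ALIGN_CEIL(v, a) (((v) + (a) - 1) & ~((size_t)(a) - 1))

/* Hypothetical stand-in for struct mlx5_flow_tcf_vxlan_encap. */
struct sketch_encap {
	uint32_t mask;
	uint8_t vni[3];
	uint16_t udp_dst;
};

/*
 * Offset of the Netlink message relative to the start of the tunnel
 * parameters that precede it in the same buffer.
 */
static inline size_t
sketch_nlmsg_offset(size_t tunnel_size)
{
	return SKETCH_ALIGN_CEIL(tunnel_size, SKETCH_NL_ALIGNTO);
}
```

The structure itself is not wrapped into an attribute header, only the gap
after it is padded so that the Netlink header starts at an aligned offset.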
> 
> > +			flags |= MLX5_ACTION_VXLAN_ENCAP;
> > +			break;
> > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> */
> > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > +			size +=	SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > +			size +=	RTE_ALIGN_CEIL /* preceding decap params.
> */
> > +				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > +				MNL_ALIGNTO);
> 
> Same here.
> 
> > +			flags |= MLX5_ACTION_VXLAN_DECAP;
> > +			break;
> >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> >  		case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
> > @@ -2289,6 +2578,26 @@ struct pedit_parser {  }
> >
> >  /**
> > + * Convert VXLAN VNI to 32-bit integer.
> > + *
> > + * @param[in] vni
> > + *   VXLAN VNI in 24-bit wire format.
> > + *
> > + * @return
> > + *   VXLAN VNI as a 32-bit integer value in network endian.
> > + */
> > +static rte_be32_t
> 
> make it inline.
OK, I missed that.

> 
> > +vxlan_vni_as_be32(const uint8_t vni[3]) {
> > +	rte_be32_t ret;
> 
> Defining ret as rte_be32_t? The return value of this func which is bswap(ret)
> is also rte_be32_t??
Yes. And it is stored directly in the network-endian NL attribute.
I've compiled the function you proposed and checked the listing. It seems to be the best option, I'll take it.

> 
> > +
> > +	ret = vni[0];
> > +	ret = (ret << 8) | vni[1];
> > +	ret = (ret << 8) | vni[2];
> > +	return RTE_BE32(ret);
> 
> Use rte_cpu_to_be_*() instead. But I still don't understand why you shuffle
> bytes twice. One with shift and or and other by bswap().
And it works. The NL attribute holds four bytes in a rather bizarre order: 0, vni[0], vni[1], vni[2].

> 
> {
> 	union {
> 		uint8_t vni[4];
> 		rte_be32_t dword;
> 	} ret = {
> 		.vni = { 0, vni[0], vni[1], vni[2] },
> 	};
> 	return ret.dword;
> }
> 
> This will have the same result without extra cost.

OK. I investigated, it is the best option for x86-64. I'm also going to test it on 32-bit ARM
with various compilers, just out of curiosity.
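
For reference, the two variants discussed above can be compared with a small
standalone sketch. It is an illustration only: htonl() stands in for
rte_cpu_to_be_32(), plain uint32_t stands in for rte_be32_t, and the function
names are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>

/* Shift variant: build a host-order value, then convert it once. */
static inline uint32_t
vni_as_be32_shift(const uint8_t vni[3])
{
	uint32_t host;

	assert(vni);
	host = ((uint32_t)vni[0] << 16) |
	       ((uint32_t)vni[1] << 8) |
	       (uint32_t)vni[2];
	return htonl(host); /* stands in for rte_cpu_to_be_32() */
}

/* Union variant: place the bytes directly in network order. */
static inline uint32_t
vni_as_be32_union(const uint8_t vni[3])
{
	union {
		uint8_t bytes[4];
		uint32_t dword;
	} ret = {
		.bytes = { 0, vni[0], vni[1], vni[2] },
	};

	return ret.dword;
}
```

Both return the same 32-bit value on little-endian and big-endian hosts,
because the in-memory byte order is 0, vni[0], vni[1], vni[2] in either case.
The union variant simply avoids the intermediate host-order value.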

> 
> > +}
> > +
> > +/**
> >   * Prepare a flow object for Linux TC flower. It calculates the maximum
> size of
> >   * memory required, allocates the memory, initializes Netlink message
> headers
> >   * and set unique TC message handle.
> > @@ -2323,22 +2632,54 @@ struct pedit_parser {
> >  	struct mlx5_flow *dev_flow;
> >  	struct nlmsghdr *nlh;
> >  	struct tcmsg *tcm;
> > +	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
> > +	uint8_t *sp, *tun = NULL;
> >
> >  	size += flow_tcf_get_items_and_size(attr, items, item_flags);
> > -	size += flow_tcf_get_actions_and_size(actions, action_flags);
> > -	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
> > +	size += flow_tcf_get_actions_and_size(actions, action_flags,
> &encap);
> > +	dev_flow = rte_zmalloc(__func__, size,
> > +			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
> > +				(size_t)MNL_ALIGNTO));
> 
> Why RTE_MAX between the two? Note that it is alignment for start address
> of the memory and the minimum alignment is cacheline size. On x86, non-
> zero value less than 64 will have same result as 64.

OK. Thanks for the note.
The structure alignments are not expected to exceed the cache line size.
So, should we just specify zero?
> 
> >  	if (!dev_flow) {
> >  		rte_flow_error_set(error, ENOMEM,
> >  				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> >  				   "not enough memory to create E-Switch
> flow");
> >  		return NULL;
> >  	}
> > -	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
> > +	sp = (uint8_t *)(dev_flow + 1);
> > +	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
> > +		tun = sp;
> > +		sp += RTE_ALIGN_CEIL
> > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > +			MNL_ALIGNTO);
> 
> And why should it be aligned? 

The Netlink message should be aligned. It follows the mlx5_flow_tcf_vxlan_encap
structure, that's why the pointer is aligned.

> As the size of dev_flow might not be aligned, it
> is meaningless, isn't it? If you think it must be aligned for better performance
> (not much anyway), you can use __rte_aligned(MNL_ALIGNTO) on the struct
Hm. Where can we use __rte_aligned? Could you clarify, please?

> definition but not for mlx5_flow (it's not only for tcf, have to do it manually).
> 
> > +		size -= RTE_ALIGN_CEIL
> > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > +			MNL_ALIGNTO);
> 
> Don't you have to subtract sizeof(struct mlx5_flow) as well? But like I
> mentioned, if '.nlsize' below isn't needed, you don't need to have this
> calculation either.
Yes, it is a bug and it should be fixed. Thank you.
Let's discuss whether we can keep nlsize under the NDEBUG switch.
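
A minimal sketch of the accounting in question, assuming the total allocation
covers the flow header, the aligned tunnel parameters and the Netlink message,
in that order. The names are hypothetical: flow_size and tunnel_size stand in
for sizeof(struct mlx5_flow) and sizeof(struct mlx5_flow_tcf_vxlan_encap).

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NL_ALIGNTO 4u /* Same value as MNL_ALIGNTO in libmnl. */
/* Round v up to the next multiple of a (a must be a power of two). */
#define SKETCH_ALIGN_CEIL(v, a) (((v) + (a) - 1) & ~((size_t)(a) - 1))

/*
 * Bytes left for the Netlink message out of the total allocation.
 * Both the flow header and the aligned tunnel parameters are subtracted.
 */
static inline size_t
sketch_nl_capacity(size_t total, size_t flow_size, size_t tunnel_size)
{
	size_t used = flow_size +
		      SKETCH_ALIGN_CEIL(tunnel_size, SKETCH_NL_ALIGNTO);

	assert(total >= used);
	return total - used;
}
```

If only the tunnel part is subtracted, the stored value is too large by the
size of the flow header, and a check like
assert(nlsize >= nlh->nlmsg_len) would not catch a real overflow.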

> 
> > +		encap.hdr.type =
> MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
> > +		memcpy(tun, &encap,
> > +		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
> > +	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > +		tun = sp;
> > +		sp += RTE_ALIGN_CEIL
> > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > +			MNL_ALIGNTO);
> > +		size -= RTE_ALIGN_CEIL
> > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > +			MNL_ALIGNTO);
> > +		encap.hdr.type =
> MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
> > +		memcpy(tun, &encap,
> > +		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
> > +	}
> > +	nlh = mnl_nlmsg_put_header(sp);
> >  	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> >  	*dev_flow = (struct mlx5_flow){
> >  		.tcf = (struct mlx5_flow_tcf){
> > +			.nlsize = size,
> >  			.nlh = nlh,
> >  			.tcm = tcm,
> > +			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
> > +			.item_flags = *item_flags,
> > +			.action_flags = *action_flags,
> >  		},
> >  	};
> >  	/*
> > @@ -2392,6 +2733,7 @@ struct pedit_parser {
> >  		const struct rte_flow_item_ipv6 *ipv6;
> >  		const struct rte_flow_item_tcp *tcp;
> >  		const struct rte_flow_item_udp *udp;
> > +		const struct rte_flow_item_vxlan *vxlan;
> >  	} spec, mask;
> >  	union {
> >  		const struct rte_flow_action_port_id *port_id;
> > @@ -2402,6 +2744,14 @@ struct pedit_parser {
> >  		const struct rte_flow_action_of_set_vlan_pcp *
> >  			of_set_vlan_pcp;
> >  	} conf;
> > +	union {
> > +		struct mlx5_flow_tcf_tunnel_hdr *hdr;
> > +		struct mlx5_flow_tcf_vxlan_decap *vxlan;
> > +	} decap;
> > +	union {
> > +		struct mlx5_flow_tcf_tunnel_hdr *hdr;
> > +		struct mlx5_flow_tcf_vxlan_encap *vxlan;
> > +	} encap;
> >  	struct flow_tcf_ptoi ptoi[PTOI_TABLE_SZ_MAX(dev)];
> >  	struct nlmsghdr *nlh = dev_flow->tcf.nlh;
> >  	struct tcmsg *tcm = dev_flow->tcf.tcm;
> > @@ -2418,6 +2768,12 @@ struct pedit_parser {
> >
> >  	claim_nonzero(flow_tcf_build_ptoi_table(dev, ptoi,
> >  						PTOI_TABLE_SZ_MAX(dev)));
> > +	encap.hdr = NULL;
> > +	decap.hdr = NULL;
> > +	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_ENCAP)
> 
> dev_flow->flow->actions already has it.
> 
> > +		encap.vxlan = dev_flow->tcf.vxlan_encap;
> > +	if (dev_flow->tcf.action_flags & MLX5_ACTION_VXLAN_DECAP)
> > +		decap.vxlan = dev_flow->tcf.vxlan_decap;
> >  	nlh = dev_flow->tcf.nlh;
> >  	tcm = dev_flow->tcf.tcm;
> >  	/* Prepare API must have been called beforehand. */
> > @@ -2435,7 +2791,6 @@ struct pedit_parser {
> >  		mnl_attr_put_u32(nlh, TCA_CHAIN, attr->group);
> >  	mnl_attr_put_strz(nlh, TCA_KIND, "flower");
> >  	na_flower = mnl_attr_nest_start(nlh, TCA_OPTIONS);
> > -	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS,
> TCA_CLS_FLAGS_SKIP_SW);
> 
> Why do you move it? You anyway can know if it is vxlan decap at this point.
> 
> >  	for (; items->type != RTE_FLOW_ITEM_TYPE_END; items++) {
> >  		unsigned int i;
> >
> > @@ -2479,6 +2834,12 @@ struct pedit_parser {
> >  						 spec.eth->type);
> >  				eth_type_set = 1;
> >  			}
> > +			/*
> > +			 * L2 addresses/masks should be sent anyway,
> > +			 * including VXLAN encap/decap cases, sometimes
> 
> "sometimes" sounds like a bug. Did you figure out why it is inconsistent?
> 
> > +			 * kernel returns an error if no L2 address
> > +			 * provided and skip_sw flag is set
> > +			 */
> >  			if (!is_zero_ether_addr(&mask.eth->dst)) {
> >  				mnl_attr_put(nlh,
> TCA_FLOWER_KEY_ETH_DST,
> >  					     ETHER_ADDR_LEN,
> > @@ -2495,8 +2856,19 @@ struct pedit_parser {
> >  					     ETHER_ADDR_LEN,
> >  					     mask.eth->src.addr_bytes);
> >  			}
> > -			break;
> > +			if (decap.hdr) {
> > +				DRV_LOG(INFO,
> > +				"ethernet addresses are treated "
> > +				"as inner ones for tunnel decapsulation");
> > +			}
> 
> I know there's no enc_[src|dst]_mac in tc flower command but can you
> clarify more about this? Let me take an example.
> 
>   flow create 1 ingress transfer
>     pattern eth src is 66:77:88:99:aa:bb
>       dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
>       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 / end
>     actions vxlan_decap / port_id id 2 / end
> 
> In this case, will the mac addrs specified above be regarded as inner mac
> addrs?
> That sounds very weird. If inner mac addrs have to be specified it should be:

> 
>   flow create 1 ingress transfer
>     pattern eth / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
>       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
>       eth src is 66:77:88:99:aa:bb dst is 00:11:22:33:44:55 / end
>     actions vxlan_decap / port_id id 2 / end
> 
> Isn't it?

Hm. I inherited Adrien's approach, but now it seems to be incorrect.
I think we should change vxlan_decap to inner MACs.
 
> Also, I hope it to be in validate() as well.
> 
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > +		break;
> >  		case RTE_FLOW_ITEM_TYPE_VLAN:
> > +			if (encap.hdr || decap.hdr)
> > +				return rte_flow_error_set(error, ENOTSUP,
> > +					  RTE_FLOW_ERROR_TYPE_ITEM,
> NULL,
> > +					  "outer VLAN is not "
> > +					  "supported for tunnels");
> 
> This should be moved to validate(). And the error message sounds like inner
> vlan is allowed, doesn't it? Even if it is moved to validate(), there's no way to
> distinguish between outer vlan and inner vlan in your code. A bit confusing.
> Please clarify.
Do you mean checking the tunnel flags? OK.
> 
> >  			item_flags |= MLX5_FLOW_LAYER_OUTER_VLAN;
> >  			mask.vlan = flow_tcf_item_mask
> >  				(items, &rte_flow_item_vlan_mask, @@ -
> 2528,6 +2900,7 @@ struct
> > pedit_parser {
> >  						 rte_be_to_cpu_16
> >  						 (spec.vlan->tci &
> >  						  RTE_BE16(0x0fff)));
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> >  		case RTE_FLOW_ITEM_TYPE_IPV4:
> >  			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> @@ -2538,36 +2911,53
> > @@ struct pedit_parser {
> >  				 sizeof(flow_tcf_mask_supported.ipv4),
> >  				 error);
> >  			assert(mask.ipv4);
> > -			if (!eth_type_set || !vlan_eth_type_set)
> > -				mnl_attr_put_u16(nlh,
> > -						 vlan_present ?
> > -
> TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> > -
> TCA_FLOWER_KEY_ETH_TYPE,
> > -						 RTE_BE16(ETH_P_IP));
> > -			eth_type_set = 1;
> > -			vlan_eth_type_set = 1;
> > -			if (mask.ipv4 == &flow_tcf_mask_empty.ipv4)
> > -				break;
> >  			spec.ipv4 = items->spec;
> > -			if (mask.ipv4->hdr.next_proto_id) {
> > -				mnl_attr_put_u8(nlh,
> TCA_FLOWER_KEY_IP_PROTO,
> > +			if (!decap.vxlan) {
> > +				if (!eth_type_set || !vlan_eth_type_set) {
> > +					mnl_attr_put_u16(nlh,
> > +						vlan_present ?
> > +
> 	TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> > +
> 	TCA_FLOWER_KEY_ETH_TYPE,
> > +						RTE_BE16(ETH_P_IP));
> > +				}
> > +				eth_type_set = 1;
> > +				vlan_eth_type_set = 1;
> > +				if (mask.ipv4 ==
> &flow_tcf_mask_empty.ipv4)
> > +					break;
> > +				if (mask.ipv4->hdr.next_proto_id) {
> > +					mnl_attr_put_u8
> > +						(nlh,
> TCA_FLOWER_KEY_IP_PROTO,
> >  						spec.ipv4-
> >hdr.next_proto_id);
> > -				ip_proto_set = 1;
> > +					ip_proto_set = 1;
> > +				}
> > +			} else {
> > +				assert(mask.ipv4 !=
> &flow_tcf_mask_empty.ipv4);
> >  			}
> >  			if (mask.ipv4->hdr.src_addr) {
> > -				mnl_attr_put_u32(nlh,
> TCA_FLOWER_KEY_IPV4_SRC,
> > -						 spec.ipv4->hdr.src_addr);
> > -				mnl_attr_put_u32(nlh,
> > -
> TCA_FLOWER_KEY_IPV4_SRC_MASK,
> > -						 mask.ipv4->hdr.src_addr);
> > +				mnl_attr_put_u32
> > +					(nlh, decap.vxlan ?
> > +					 TCA_FLOWER_KEY_ENC_IPV4_SRC :
> > +					 TCA_FLOWER_KEY_IPV4_SRC,
> > +					 spec.ipv4->hdr.src_addr);
> > +				mnl_attr_put_u32
> > +					(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK :
> > +
> TCA_FLOWER_KEY_IPV4_SRC_MASK,
> > +					 mask.ipv4->hdr.src_addr);
> >  			}
> >  			if (mask.ipv4->hdr.dst_addr) {
> > -				mnl_attr_put_u32(nlh,
> TCA_FLOWER_KEY_IPV4_DST,
> > -						 spec.ipv4->hdr.dst_addr);
> > -				mnl_attr_put_u32(nlh,
> > -
> TCA_FLOWER_KEY_IPV4_DST_MASK,
> > -						 mask.ipv4->hdr.dst_addr);
> > +				mnl_attr_put_u32
> > +					(nlh, decap.vxlan ?
> > +					 TCA_FLOWER_KEY_ENC_IPV4_DST :
> > +					 TCA_FLOWER_KEY_IPV4_DST,
> > +					 spec.ipv4->hdr.dst_addr);
> > +				mnl_attr_put_u32
> > +					(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_IPV4_DST_MASK :
> > +
> TCA_FLOWER_KEY_IPV4_DST_MASK,
> > +					 mask.ipv4->hdr.dst_addr);
> >  			}
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> >  		case RTE_FLOW_ITEM_TYPE_IPV6:
> >  			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV6;
> @@ -2578,38 +2968,53
> > @@ struct pedit_parser {
> >  				 sizeof(flow_tcf_mask_supported.ipv6),
> >  				 error);
> >  			assert(mask.ipv6);
> > -			if (!eth_type_set || !vlan_eth_type_set)
> > -				mnl_attr_put_u16(nlh,
> > -						 vlan_present ?
> > -
> TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> > -
> TCA_FLOWER_KEY_ETH_TYPE,
> > -						 RTE_BE16(ETH_P_IPV6));
> > -			eth_type_set = 1;
> > -			vlan_eth_type_set = 1;
> > -			if (mask.ipv6 == &flow_tcf_mask_empty.ipv6)
> > -				break;
> >  			spec.ipv6 = items->spec;
> > -			if (mask.ipv6->hdr.proto) {
> > -				mnl_attr_put_u8(nlh,
> TCA_FLOWER_KEY_IP_PROTO,
> > -						spec.ipv6->hdr.proto);
> > -				ip_proto_set = 1;
> > +			if (!decap.vxlan) {
> > +				if (!eth_type_set || !vlan_eth_type_set) {
> > +					mnl_attr_put_u16(nlh,
> > +						vlan_present ?
> > +
> 	TCA_FLOWER_KEY_VLAN_ETH_TYPE :
> > +
> 	TCA_FLOWER_KEY_ETH_TYPE,
> > +						RTE_BE16(ETH_P_IPV6));
> > +				}
> > +				eth_type_set = 1;
> > +				vlan_eth_type_set = 1;
> > +				if (mask.ipv6 ==
> &flow_tcf_mask_empty.ipv6)
> > +					break;
> > +				if (mask.ipv6->hdr.proto) {
> > +					mnl_attr_put_u8
> > +						(nlh,
> TCA_FLOWER_KEY_IP_PROTO,
> > +						 spec.ipv6->hdr.proto);
> > +					ip_proto_set = 1;
> > +				}
> > +			} else {
> > +				assert(mask.ipv6 !=
> &flow_tcf_mask_empty.ipv6);
> >  			}
> >  			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6-
> >hdr.src_addr)) {
> > -				mnl_attr_put(nlh,
> TCA_FLOWER_KEY_IPV6_SRC,
> > +				mnl_attr_put(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_IPV6_SRC :
> > +					     TCA_FLOWER_KEY_IPV6_SRC,
> >  					     sizeof(spec.ipv6->hdr.src_addr),
> >  					     spec.ipv6->hdr.src_addr);
> > -				mnl_attr_put(nlh,
> TCA_FLOWER_KEY_IPV6_SRC_MASK,
> > +				mnl_attr_put(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK :
> > +
> TCA_FLOWER_KEY_IPV6_SRC_MASK,
> >  					     sizeof(mask.ipv6->hdr.src_addr),
> >  					     mask.ipv6->hdr.src_addr);
> >  			}
> >  			if (!IN6_IS_ADDR_UNSPECIFIED(mask.ipv6-
> >hdr.dst_addr)) {
> > -				mnl_attr_put(nlh,
> TCA_FLOWER_KEY_IPV6_DST,
> > +				mnl_attr_put(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_IPV6_DST :
> > +					     TCA_FLOWER_KEY_IPV6_DST,
> >  					     sizeof(spec.ipv6->hdr.dst_addr),
> >  					     spec.ipv6->hdr.dst_addr);
> > -				mnl_attr_put(nlh,
> TCA_FLOWER_KEY_IPV6_DST_MASK,
> > +				mnl_attr_put(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_IPV6_DST_MASK :
> > +
> TCA_FLOWER_KEY_IPV6_DST_MASK,
> >  					     sizeof(mask.ipv6->hdr.dst_addr),
> >  					     mask.ipv6->hdr.dst_addr);
> >  			}
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> >  		case RTE_FLOW_ITEM_TYPE_UDP:
> >  			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_UDP;
> @@ -2620,26 +3025,44
> > @@ struct pedit_parser {
> >  				 sizeof(flow_tcf_mask_supported.udp),
> >  				 error);
> >  			assert(mask.udp);
> > -			if (!ip_proto_set)
> > -				mnl_attr_put_u8(nlh,
> TCA_FLOWER_KEY_IP_PROTO,
> > -						IPPROTO_UDP);
> > -			if (mask.udp == &flow_tcf_mask_empty.udp)
> > -				break;
> >  			spec.udp = items->spec;
> > +			if (!decap.vxlan) {
> > +				if (!ip_proto_set)
> > +					mnl_attr_put_u8
> > +						(nlh,
> TCA_FLOWER_KEY_IP_PROTO,
> > +						IPPROTO_UDP);
> > +				if (mask.udp == &flow_tcf_mask_empty.udp)
> > +					break;
> > +			} else {
> > +				assert(mask.udp !=
> &flow_tcf_mask_empty.udp);
> > +				decap.vxlan->udp_port
> > +					= RTE_BE16(spec.udp->hdr.dst_port);
> 
> Use rte_cpu_to_be_*() instead. And (=) sign has to be moved up.
> 
> > +			}
> >  			if (mask.udp->hdr.src_port) {
> > -				mnl_attr_put_u16(nlh,
> TCA_FLOWER_KEY_UDP_SRC,
> > -						 spec.udp->hdr.src_port);
> > -				mnl_attr_put_u16(nlh,
> > -
> TCA_FLOWER_KEY_UDP_SRC_MASK,
> > -						 mask.udp->hdr.src_port);
> > +				mnl_attr_put_u16
> > +					(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_UDP_SRC_PORT :
> > +					 TCA_FLOWER_KEY_UDP_SRC,
> > +					 spec.udp->hdr.src_port);
> > +				mnl_attr_put_u16
> > +					(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK :
> > +
> TCA_FLOWER_KEY_UDP_SRC_MASK,
> > +					 mask.udp->hdr.src_port);
> >  			}
> >  			if (mask.udp->hdr.dst_port) {
> > -				mnl_attr_put_u16(nlh,
> TCA_FLOWER_KEY_UDP_DST,
> > -						 spec.udp->hdr.dst_port);
> > -				mnl_attr_put_u16(nlh,
> > -
> TCA_FLOWER_KEY_UDP_DST_MASK,
> > -						 mask.udp->hdr.dst_port);
> > +				mnl_attr_put_u16
> > +					(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_UDP_DST_PORT :
> > +					 TCA_FLOWER_KEY_UDP_DST,
> > +					 spec.udp->hdr.dst_port);
> > +				mnl_attr_put_u16
> > +					(nlh, decap.vxlan ?
> > +
> TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK :
> > +
> TCA_FLOWER_KEY_UDP_DST_MASK,
> > +					 mask.udp->hdr.dst_port);
> >  			}
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> >  		case RTE_FLOW_ITEM_TYPE_TCP:
> >  			item_flags |= MLX5_FLOW_LAYER_OUTER_L4_TCP;
> @@ -2682,7 +3105,15 @@
> > struct pedit_parser {
> >  					 rte_cpu_to_be_16
> >  						(mask.tcp->hdr.tcp_flags));
> >  			}
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> > +		case RTE_FLOW_ITEM_TYPE_VXLAN:
> > +			assert(decap.vxlan);
> > +			spec.vxlan = items->spec;
> > +			mnl_attr_put_u32(nlh,
> > +					 TCA_FLOWER_KEY_ENC_KEY_ID,
> > +					 vxlan_vni_as_be32(spec.vxlan-
> >vni));
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > +			break;
> >  		default:
> >  			return rte_flow_error_set(error, ENOTSUP,
> >
> RTE_FLOW_ERROR_TYPE_ITEM,
> > @@ -2715,6 +3146,14 @@ struct pedit_parser {
> >  			mnl_attr_put_strz(nlh, TCA_ACT_KIND, "mirred");
> >  			na_act = mnl_attr_nest_start(nlh,
> TCA_ACT_OPTIONS);
> >  			assert(na_act);
> > +			if (encap.hdr) {
> > +				assert(dev_flow->tcf.tunnel);
> > +				dev_flow->tcf.tunnel->ifindex_ptr =
> > +					&((struct tc_mirred *)
> > +					mnl_attr_get_payload
> > +					(mnl_nlmsg_get_payload_tail
> > +						(nlh)))->ifindex;
> > +			}
> >  			mnl_attr_put(nlh, TCA_MIRRED_PARMS,
> >  				     sizeof(struct tc_mirred),
> >  				     &(struct tc_mirred){
> > @@ -2724,6 +3163,7 @@ struct pedit_parser {
> >  				     });
> >  			mnl_attr_nest_end(nlh, na_act);
> >  			mnl_attr_nest_end(nlh, na_act_index);
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> >  		case RTE_FLOW_ACTION_TYPE_JUMP:
> >  			conf.jump = actions->conf;
> > @@ -2741,6 +3181,7 @@ struct pedit_parser {
> >  				     });
> >  			mnl_attr_nest_end(nlh, na_act);
> >  			mnl_attr_nest_end(nlh, na_act_index);
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> >  		case RTE_FLOW_ACTION_TYPE_DROP:
> >  			na_act_index =
> > @@ -2827,6 +3268,76 @@ struct pedit_parser {
> >  					(na_vlan_priority) =
> >  					conf.of_set_vlan_pcp->vlan_pcp;
> >  			}
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > +			break;
> > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > +			assert(decap.vxlan);
> > +			assert(dev_flow->tcf.tunnel);
> > +			dev_flow->tcf.tunnel->ifindex_ptr
> > +				= (unsigned int *)&tcm->tcm_ifindex;
> > +			na_act_index =
> > +				mnl_attr_nest_start(nlh,
> na_act_index_cur++);
> > +			assert(na_act_index);
> > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> "tunnel_key");
> > +			na_act = mnl_attr_nest_start(nlh,
> TCA_ACT_OPTIONS);
> > +			assert(na_act);
> > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > +				sizeof(struct tc_tunnel_key),
> > +				&(struct tc_tunnel_key){
> > +					.action = TC_ACT_PIPE,
> > +					.t_action =
> TCA_TUNNEL_KEY_ACT_RELEASE,
> > +					});
> > +			mnl_attr_nest_end(nlh, na_act);
> > +			mnl_attr_nest_end(nlh, na_act_index);
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > +			break;
> > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > +			assert(encap.vxlan);
> > +			na_act_index =
> > +				mnl_attr_nest_start(nlh,
> na_act_index_cur++);
> > +			assert(na_act_index);
> > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> "tunnel_key");
> > +			na_act = mnl_attr_nest_start(nlh,
> TCA_ACT_OPTIONS);
> > +			assert(na_act);
> > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > +				sizeof(struct tc_tunnel_key),
> > +				&(struct tc_tunnel_key){
> > +					.action = TC_ACT_PIPE,
> > +					.t_action =
> TCA_TUNNEL_KEY_ACT_SET,
> > +					});
> > +			if (encap.vxlan->mask &
> MLX5_FLOW_TCF_ENCAP_UDP_DST)
> > +				mnl_attr_put_u16(nlh,
> > +					 TCA_TUNNEL_KEY_ENC_DST_PORT,
> > +					 encap.vxlan->udp.dst);
> > +			if (encap.vxlan->mask &
> MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
> > +				mnl_attr_put_u32(nlh,
> > +					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
> > +					 encap.vxlan->ipv4.src);
> > +			if (encap.vxlan->mask &
> MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> > +				mnl_attr_put_u32(nlh,
> > +					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
> > +					 encap.vxlan->ipv4.dst);
> > +			if (encap.vxlan->mask &
> MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
> > +				mnl_attr_put(nlh,
> > +					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
> > +					 sizeof(encap.vxlan->ipv6.src),
> > +					 &encap.vxlan->ipv6.src);
> > +			if (encap.vxlan->mask &
> MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> > +				mnl_attr_put(nlh,
> > +					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
> > +					 sizeof(encap.vxlan->ipv6.dst),
> > +					 &encap.vxlan->ipv6.dst);
> > +			if (encap.vxlan->mask &
> MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
> > +				mnl_attr_put_u32(nlh,
> > +					 TCA_TUNNEL_KEY_ENC_KEY_ID,
> > +					 vxlan_vni_as_be32
> > +						(encap.vxlan->vxlan.vni));
> > +#ifdef TCA_TUNNEL_KEY_NO_CSUM
> > +			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM, 0);
> > +#endif
> 
> TCA_TUNNEL_KEY_NO_CSUM is anyway defined like others, then why do
> you treat it differently with #ifdef/#endif?

We found that it is not defined on old kernels: compilation errors occurred
on some of our CI machines.
 
WBR, Slava

> > +			mnl_attr_nest_end(nlh, na_act);
> > +			mnl_attr_nest_end(nlh, na_act_index);
> > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  			break;
> >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> > @@ -2850,7 +3361,11 @@ struct pedit_parser {
> >  	assert(na_flower);
> >  	assert(na_flower_act);
> >  	mnl_attr_nest_end(nlh, na_flower_act);
> > +	mnl_attr_put_u32(nlh, TCA_FLOWER_FLAGS,
> > +			 dev_flow->tcf.action_flags &
> MLX5_ACTION_VXLAN_DECAP
> > +			 ? 0 : TCA_CLS_FLAGS_SKIP_SW);
> >  	mnl_attr_nest_end(nlh, na_flower);
> > +	assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> >  	return 0;
> >  }


* Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-25  0:28     ` Yongseok Koh
@ 2018-10-25 20:21       ` Slava Ovsiienko
  2018-10-26  6:25         ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-25 20:21 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Thursday, October 25, 2018 3:28
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> management
> 
> On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > VXLAN interfaces are dynamically created for each local UDP port of
> > outer networks and then used as targets for TC "flower" filters in
> > order to perform encapsulation. These VXLAN interfaces are
> > system-wide: only one device with a given UDP port can exist in the
> > system (an attempt to create another device with the same UDP local
> > port returns EEXIST), so the PMD should support a shared device
> > instances database for PMD instances. These implicitly created VXLAN
> > devices are called VTEPs (Virtual Tunnel End Points).
> >
> > The VTEP is created at the moment the rule is applied. The link
> > is set up and the root ingress qdisc is initialized as well.
> >
> > Encapsulation VTEPs are created on a per-port basis: a single VTEP is
> > attached to the outer interface and is shared by all encapsulation
> > rules on this interface. The source UDP port is automatically selected
> > in the range 30000-60000.
> >
> > For decapsulation one VTEP is created per every unique UDP local port
> > to accept tunnel traffic. The name of created VTEP consists of prefix
> > "vmlx_" and the number of UDP port in decimal digits without leading
> > zeros (vmlx_4789). The VTEP can be created in the system beforehand,
> > prior to launching the application, which allows sharing UDP ports
> > between primary and secondary processes.
> >
> > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 499 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > index d6840d5..efa9c3b 100644
> > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> >  	return -err;
> >  }
> >
> > +/* VTEP device list is shared between PMD port instances. */
> > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> > +static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> 
> What's the reason for choosing pthread_mutex instead of rte_*_lock?

For sharing this database with secondary processes, perhaps?

> 
> > +
> > +/**
> > + * Deletes VTEP network device.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] vtep
> > + *   Object representing the network device to delete. Memory
> > + *   allocated for this object is freed by routine.
> > + */
> > +static void
> > +flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
> > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > +	struct nlmsghdr *nlh;
> > +	struct ifinfomsg *ifm;
> > +	alignas(struct nlmsghdr)
> > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > +	int ret;
> > +
> > +	assert(!vtep->refcnt);
> > +	if (vtep->created && vtep->ifindex) {
> 
> First of all vtep->created seems of no use. It is introduced to select the error
> message in flow_tcf_create_iface(). I don't see any necessity to distinguish
> between 'vtep is allocated by rte_malloc()' and 'vtep is created in kernel'.

The created flag indicates that the iface was created by our code.
The VXLAN decap devices must have the specified UDP port, and we cannot create
multiple VXLAN devices with the same UDP port - EEXIST is returned. So, we have
to share the device. One option is to create the device before the DPDK
application is launched and use these pre-created devices. In this case the
created flag is not set and the VXLAN device is neither reinitialized nor deleted.
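
To illustrate the rule, here is a minimal sketch, not the PMD code. The struct vtep_sketch type and the helper name are hypothetical, simplified stand-ins for struct mlx5_flow_tcf_vtep; the only point is that the kernel device is removed on teardown solely when this process created it and knows its ifindex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct mlx5_flow_tcf_vtep. */
struct vtep_sketch {
	unsigned int ifindex; /* Kernel interface index, 0 if unknown. */
	uint16_t port;        /* UDP port of the VXLAN device. */
	bool created;         /* Set only if this process created the device. */
};

/* Tell whether the kernel device has to be removed on teardown. */
static bool
vtep_sketch_must_delete(const struct vtep_sketch *vtep)
{
	/* Pre-created (shared) devices are left untouched. */
	return vtep->created && vtep->ifindex != 0;
}
```

A pre-created vmlx_4789 device would have created == false here, so it survives the application exit.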

> And why do you need to check vtep->ifindex as well? If vtep is created in
> kernel and its ifindex isn't set, that should be an error which had to be handled
> in flow_tcf_create_iface(). Such a vtep shouldn't exist.
Yes, if we did not get the ifindex of the device, the vtep is not created and
an error is returned. We just cannot operate w/o ifindex.

> 
> Also, the refcnt management is a bit strange. Please put an abstraction by
> adding create_iface(), get_iface() and release_iface(). In get_iface(),
> vtep->refcnt should be incremented. And release_iface() decreases the
> refcnt and if it reaches zero, the iface can be removed. create_iface() will
> set the refcnt to 1. And if you refer to mlx5_hrxq_get(), it even does
> the list search, so the same lookup code is not repeated here and there.
> That will make your code much simpler.

OK. Good proposal. I'll refactor the code.
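
A rough sketch of the proposed create/get/release abstraction could look as follows. This is an illustration only: the vtep_sketch names are hypothetical, plain calloc()/free() stand in for rte_zmalloc()/rte_free(), and no Netlink calls or locking are shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical, simplified VTEP descriptor kept in a shared list. */
struct vtep_sketch {
	LIST_ENTRY(vtep_sketch) next;
	uint16_t port;
	unsigned int refcnt;
};

LIST_HEAD(vtep_sketch_list, vtep_sketch);

/* Look the device up by UDP port and take a reference, NULL if absent. */
static struct vtep_sketch *
vtep_sketch_get(struct vtep_sketch_list *list, uint16_t port)
{
	struct vtep_sketch *vtep;

	LIST_FOREACH(vtep, list, next) {
		if (vtep->port == port) {
			vtep->refcnt++;
			return vtep;
		}
	}
	return NULL;
}

/* Allocate a new entry; the caller holds the first reference. */
static struct vtep_sketch *
vtep_sketch_create(struct vtep_sketch_list *list, uint16_t port)
{
	struct vtep_sketch *vtep = calloc(1, sizeof(*vtep));

	if (!vtep)
		return NULL;
	vtep->port = port;
	vtep->refcnt = 1;
	LIST_INSERT_HEAD(list, vtep, next);
	return vtep;
}

/* Drop a reference; return 1 if the entry was freed, 0 otherwise. */
static int
vtep_sketch_release(struct vtep_sketch *vtep)
{
	assert(vtep->refcnt);
	if (--vtep->refcnt)
		return 0;
	LIST_REMOVE(vtep, next);
	free(vtep);
	return 1;
}
```

With this shape the decap and encap paths only differ in the lookup key, and the delete path reduces to one release call.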
> 
> > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > +		nlh = mnl_nlmsg_put_header(buf);
> > +		nlh->nlmsg_type = RTM_DELLINK;
> > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > +		ifm->ifi_family = AF_UNSPEC;
> > +		ifm->ifi_index = vtep->ifindex;
> > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +		if (ret)
> > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN "
> > +					 "encap/decap ifindex %u",
> > +					 ifm->ifi_index);
> > +	}
> > +	rte_free(vtep);
> > +}
> > +
> > +/**
> > + * Creates VTEP network device.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifouter
> > + *   Outer interface to attach new-created VXLAN device
> > + *   If zero the VXLAN device will not be attached to any device.
> > + * @param[in] port
> > + *   UDP port of created VTEP device.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + *
> > + * @return
> > + * Pointer to created device structure on success, NULL otherwise
> > + * and rte_errno is set.
> > + */
> > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> 
> Why negative(ifndef) first instead of positive(ifdef)?
Hm. Did I miss the rule? Positive #ifdef first? OK.

> 
> > +static struct mlx5_flow_tcf_vtep*
> > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> > +		      unsigned int ifouter __rte_unused,
> > +		      uint16_t port __rte_unused,
> > +		      struct rte_flow_error *error) {
> > +	rte_flow_error_set(error, ENOTSUP,
> > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +			 "netlink: failed to create VTEP, "
> > +			 "VXLAN metadat is not supported by kernel");
> 
> Typo.

OK.  "metadata are not supported".
> 
> > +	return NULL;
> > +}
> > +#else
> > +static struct mlx5_flow_tcf_vtep*
> > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> 
> How about adding 'vtep'? It sounds vague - creating a general interface.
> E.g., flow_tcf_create_vtep_iface()?

OK.

> 
> > +		      unsigned int ifouter,
> > +		      uint16_t port, struct rte_flow_error *error) {
> > +	struct mlx5_flow_tcf_vtep *vtep;
> > +	struct nlmsghdr *nlh;
> > +	struct ifinfomsg *ifm;
> > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > +	alignas(struct nlmsghdr)
> > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> 
> Use a macro for '128'. Can't know the meaning.
OK. I think we should calculate the buffer size explicitly.
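
As a sketch of what an explicit calculation might look like: the SK_* macros below are local stand-ins defined only for this example (they assume the usual 4-byte Netlink alignment, a 16-byte struct nlmsghdr, a 16-byte struct ifinfomsg and a 4-byte struct nlattr), and the attribute list mirrors the mnl_attr_put_*() calls in the function above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for this sketch; Netlink aligns everything to 4 bytes. */
#define SK_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)
#define SK_NLMSG_HDRLEN ((size_t)16)  /* sizeof(struct nlmsghdr) */
#define SK_IFINFOMSG_LEN ((size_t)16) /* sizeof(struct ifinfomsg) */
#define SK_NLA_HDRLEN ((size_t)4)     /* sizeof(struct nlattr) */
/* Space taken by one attribute carrying len bytes of payload. */
#define SK_ATTR(len) (SK_NLA_HDRLEN + SK_ALIGN(len))
/* A nest attribute is only a header wrapping its children. */
#define SK_NEST SK_NLA_HDRLEN

/* Upper bound for the RTM_NEWLINK request creating a VXLAN device. */
static size_t
vtep_request_size(size_t name_size)
{
	return SK_NLMSG_HDRLEN + SK_ALIGN(SK_IFINFOMSG_LEN) +
	       SK_ATTR(name_size) +         /* IFLA_IFNAME */
	       SK_NEST +                    /* IFLA_LINKINFO */
	       SK_ATTR(sizeof("vxlan")) +   /* IFLA_INFO_KIND */
	       SK_NEST +                    /* IFLA_INFO_DATA */
	       SK_ATTR(sizeof(uint32_t)) +  /* IFLA_VXLAN_LINK */
	       3 * SK_ATTR(sizeof(uint8_t)) + /* METADATA, ZERO_CSUM6_RX, LEARNING */
	       SK_ATTR(sizeof(uint16_t));   /* IFLA_VXLAN_PORT */
}
```

With every attribute listed, the unexplained slack of 128 bytes would no longer be needed, and the existing assert on nlh->nlmsg_len would catch a forgotten term.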

> 
> > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > +		       SZ_NLATTR_NEST * 2 +
> > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > +	struct nlattr *na_info;
> > +	struct nlattr *na_vxlan;
> > +	rte_be16_t vxlan_port = RTE_BE16(port);
> 
> Use rte_cpu_to_be_*() instead.

Yes, I'll recheck the whole code for this issue.
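
For reference, a small sketch of the runtime conversion. As far as I understand, RTE_BE16() is meant for compile-time constants while rte_cpu_to_be_16() is the call for values known only at run time; to keep the example standalone it uses POSIX htons(), which performs the same CPU-to-big-endian conversion, and port_to_be16() is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert a runtime UDP port to network (big-endian) byte order. */
static uint16_t
port_to_be16(uint16_t port)
{
	/* Stand-in for rte_cpu_to_be_16(port) in this sketch. */
	return htons(port);
}
```

For port 4789 (0x12b5) the first byte in memory is 0x12 regardless of the host endianness.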

> 
> > +	int ret;
> > +
> > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > +			alignof(struct mlx5_flow_tcf_vtep));
> > +	if (!vtep) {
> > +		rte_flow_error_set
> > +			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > +			 NULL, "unable to allocate memory for VTEP desc");
> > +		return NULL;
> > +	}
> > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > +			.refcnt = 0,
> > +			.port = port,
> > +			.created = 0,
> > +			.ifouter = 0,
> > +			.ifindex = 0,
> > +			.local = LIST_HEAD_INITIALIZER(),
> > +			.neigh = LIST_HEAD_INITIALIZER(),
> > +	};
> > +	memset(buf, 0, sizeof(buf));
> > +	nlh = mnl_nlmsg_put_header(buf);
> > +	nlh->nlmsg_type = RTM_NEWLINK;
> > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
> > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > +	ifm->ifi_family = AF_UNSPEC;
> > +	ifm->ifi_type = 0;
> > +	ifm->ifi_index = 0;
> > +	ifm->ifi_flags = IFF_UP;
> > +	ifm->ifi_change = 0xffffffff;
> > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, port);
> > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > +	assert(na_info);
> > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > +	if (ifouter)
> > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > +	assert(na_vxlan);
> > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > +	mnl_attr_nest_end(nlh, na_vxlan);
> > +	mnl_attr_nest_end(nlh, na_info);
> > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +	if (ret)
> > +		DRV_LOG(WARNING,
> > +			"netlink: VTEP %s create failure (%d)",
> > +			name, rte_errno);
> > +	else
> > +		vtep->created = 1;
> 
> Flow of code here isn't smooth, thus could be error-prone. Most of all, I
> don't like ret has multiple meanings. ret should be return value but you are
> using it to store ifindex.
> 
> > +	if (ret && ifouter)
> > +		ret = 0;
> > +	else
> > +		ret = if_nametoindex(name);
> 
> If vtep isn't created and ifouter is set, then skip init below, which means, if

ifouter is set for VXLAN encap devices. They should be attached to ifouter
and cannot be shared. So, if ifouter is set, we do not use the precreated/existing
VXLAN devices. We have to create our own, non-shared device.

> vtep is created or ifouter is set, it tries to get ifindex of vtep.
> But why do you want to try to call this API even if it failed to create vtep?
> Let's not make code flow convoluted even though it logically works. Let's
> make it straightforward.
> 
> > +	if (ret) {
> > +		vtep->ifindex = ret;
> > +		vtep->ifouter = ifouter;
> > +		memset(buf, 0, sizeof(buf));
> > +		nlh = mnl_nlmsg_put_header(buf);
> > +		nlh->nlmsg_type = RTM_NEWLINK;
> > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > +		ifm->ifi_family = AF_UNSPEC;
> > +		ifm->ifi_type = 0;
> > +		ifm->ifi_index = vtep->ifindex;
> > +		ifm->ifi_flags = IFF_UP;
> > +		ifm->ifi_change = IFF_UP;
> > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +		if (ret) {
> > +			DRV_LOG(WARNING,
> > +				"netlink: VTEP %s set link up failure (%d)",
> > +				name, rte_errno);
> > +			rte_free(vtep);
> > +			rte_flow_error_set
> > +				(error, -errno,
> > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +				 "netlink: failed to set VTEP link up");
> > +			vtep = NULL;
> > +		} else {
> > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > +			if (ret)
> > +				DRV_LOG(WARNING,
> > +				"VTEP %s init failure (%d)", name, rte_errno);
> > +		}
> > +	} else {
> > +		DRV_LOG(WARNING,
> > +			"VTEP %s failed to get index (%d)", name, errno);
> > +		rte_flow_error_set
> > +			(error, -errno,
> > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > +			 "netlink: failed to retrieve VTEP ifindex");
> > +			 ret = 1;
> 
> If it fails to create a vtep above, it will print out two warning messages and
> one rte_flow_error message. And it even selects message to print between
> two?
> And there's another info msg at the end even in case of failure. Do you really
> want to do this even with manipulating ret to change code path?  Not a good
> practice.
> 
> Usually, code path should be straightforward for successful path and for
> errors/failures, return immediately or use 'goto' if there's need for cleanup.
> 
> Please refactor entire function.

I think I'll split it into two functions: one for attached and one for potentially shared ifaces.
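
The requested shape (straight success path, one cleanup label) could be sketched like this. It is only an illustration: the step callbacks are hypothetical stand-ins for the Netlink requests, and calloc()/free() replace rte_zmalloc()/rte_free().

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, simplified VTEP descriptor. */
struct vtep_sketch {
	unsigned int ifindex;
};

/* Each step returns 0 on success or a negative errno value. */
typedef int (*step_fn)(struct vtep_sketch *vtep);

/* Example steps used to exercise both paths. */
static int
step_ok(struct vtep_sketch *vtep)
{
	vtep->ifindex = 42;
	return 0;
}

static int
step_fail(struct vtep_sketch *vtep)
{
	(void)vtep;
	return -EEXIST;
}

/* Create a descriptor; on any failure jump to the single cleanup label. */
static struct vtep_sketch *
vtep_sketch_create(step_fn create_link, step_fn link_up, int *err)
{
	struct vtep_sketch *vtep = calloc(1, sizeof(*vtep));
	int ret;

	if (!vtep) {
		*err = -ENOMEM;
		return NULL;
	}
	ret = create_link(vtep);
	if (ret)
		goto error;
	ret = link_up(vtep);
	if (ret)
		goto error;
	*err = 0;
	return vtep;
error:
	free(vtep);
	*err = ret;
	return NULL;
}
```

Each failure is then reported exactly once, by the caller, instead of by several warnings along the way.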
> 
> > +	}
> > +	if (ret) {
> > +		flow_tcf_delete_iface(tcf, vtep);
> > +		vtep = NULL;
> > +	}
> > +	DRV_LOG(INFO, "VTEP create (%d, %s)", port, vtep ? "OK" : "error");
> > +	return vtep;
> > +}
> > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > +
> > +/**
> > + * Create target interface index for VXLAN tunneling decapsulation.
> > + * In order to share the UDP port within the other interfaces the
> > + * VXLAN device created as not attached to any interface (if created).
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + * @return
> > + *   Interface index on success, zero otherwise and rte_errno is set.
> 
> Return negative errno in case of failure like others.

Anyway, we have to return an index. If we do not return it as the function result,
we will need to provide an extra pointer parameter, which complicates the code.

> 
>  *   Interface index on success, a negative errno value otherwise and
> rte_errno is set.
> 
> > + */
> > +static unsigned int
> > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > +			   struct mlx5_flow *dev_flow,
> > +			   struct rte_flow_error *error)
> > +{
> > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > +
> > +	vtep = NULL;
> > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +		if (vlst->port == port) {
> > +			vtep = vlst;
> > +			break;
> > +		}
> > +	}
> 
> You just need one variable.

Yes. It's a long story; I forgot to revert the code to one variable after debugging.
> 
> 	struct mlx5_flow_tcf_vtep *vtep;
> 
> 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> 		if (vtep->port == port)
> 			break;
> 	}
> 
> > +	if (!vtep) {
> > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > +		if (vtep)
> > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > +	} else {
> > +		if (vtep->ifouter) {
> > +			rte_flow_error_set(error, -errno,
> > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +				"Failed to create decap VTEP, attached "
> > +				"device with the same UDP port exists");
> > +				vtep = NULL;
> 
> Making vtep null to skip the following code?

Yes. To avoid multiple return operators in code.
> Please merge the two same
> if/else and make the code path straightforward. And which errno do you
> expect here?
> Should it be set EEXIST instead?
Not always. Netlink returns the code.

> 
> > +		}
> > +	}
> > +	if (vtep) {
> > +		vtep->refcnt++;
> > +		assert(vtep->ifindex);
> > +		return vtep->ifindex;
> > +	} else {
> > +		return 0;
> > +	}
> 
> Why repeating same if/else?
> 
> 
> This is my suggestion but if you take my suggestion to have
> flow_tcf_[create|get|release]_iface(), this will get much simpler.
Agree.

> 
> {
> 	struct mlx5_flow_tcf_vtep *vtep;
> 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> 
> 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> 		if (vtep->port == port)
> 			break;
> 	}
> 	if (vtep && vtep->ifouter)
> 		return rte_flow_error_set(... EEXIST ...);
> 	else if (vtep) {
> 		++vtep->refcnt;
> 	} else {
> 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> 		if (!vtep)
> 			return rte_flow_error_set(...);
> 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> 	}
> 	assert(vtep->ifindex);
> 	return vtep->ifindex;
> }
> 
> 
> > +}
> > +
> > +/**
> > + * Creates target interface index for VXLAN tunneling encapsulation.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifouter
> > + *   Network interface index to attach VXLAN encap device to.
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + * @return
> > + *   Interface index on success, zero otherwise and rte_errno is set.
> > + */
> > +static unsigned int
> > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifouter,
> > +			    struct mlx5_flow *dev_flow __rte_unused,
> > +			    struct rte_flow_error *error)
> > +{
> > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > +
> > +	assert(ifouter);
> > +	/* Look whether the attached VTEP for encap is created. */
> > +	vtep = NULL;
> > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +		if (vlst->ifouter == ifouter) {
> > +			vtep = vlst;
> > +			break;
> > +		}
> > +	}
> 
> Same here.
> 
> > +	if (!vtep) {
> > +		uint16_t pcnt;
> > +
> > +		/* Not found, we should create the new attached VTEP. */
> > +/*
> > + * TODO: not implemented yet
> > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> 
> Personal note is not appropriate even though it is removed in the following
> patch.
> 
> > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > +				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
> > +			encap_port++;
> > +			/* Wraparound the UDP port index. */
> > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN ||
> > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > +				encap_port = MLX5_VXLAN_PORT_RANGE_MIN;
> > +			/* Check whether UDP port is already in use. */
> > +			vtep = NULL;
> > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +				if (vlst->port == encap_port) {
> > +					vtep = vlst;
> > +					break;
> > +				}
> > +			}
> 
> If you want to find out an empty port number, you can use rte_bitmap
> instead of repeating searching the entire list for all possible port numbers.

We do not expect too many VXLAN devices to be created, so I am not sure a bitmap is needed.
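
For comparison, a free-port search over a bitmap could be sketched as below. This is not rte_bitmap; it is a plain bit array written only for illustration. The PORT_MIN/PORT_MAX values assume the 30000-60000 range mentioned in the commit message, and the port_map names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PORT_MIN 30000u /* assumed MLX5_VXLAN_PORT_RANGE_MIN */
#define PORT_MAX 60000u /* assumed MLX5_VXLAN_PORT_RANGE_MAX */
#define PORT_CNT (PORT_MAX - PORT_MIN + 1u)

/* One bit per UDP port in the range, plus a rotating search start. */
struct port_map {
	uint64_t bits[(PORT_CNT + 63u) / 64u];
	uint32_t next;
};

/* Reserve the next free port, return it or -1 if the range is exhausted. */
static int
port_map_alloc(struct port_map *map)
{
	uint32_t i;

	for (i = 0; i < PORT_CNT; i++) {
		uint32_t idx = (map->next + i) % PORT_CNT;
		uint64_t mask = UINT64_C(1) << (idx % 64);

		if (!(map->bits[idx / 64] & mask)) {
			map->bits[idx / 64] |= mask;
			map->next = (idx + 1) % PORT_CNT;
			return (int)(PORT_MIN + idx);
		}
	}
	return -1;
}

/* Mark a previously reserved port as free again. */
static void
port_map_free(struct port_map *map, uint16_t port)
{
	uint32_t idx = port - PORT_MIN;

	map->bits[idx / 64] &= ~(UINT64_C(1) << (idx % 64));
}
```

The map costs about 4 KB, so with only a handful of devices the list walk in the patch is a reasonable trade-off as well.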

> 
> > +			if (vtep) {
> > +				vtep = NULL;
> > +				continue;
> > +			}
> > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > +						     encap_port, error);
> > +			if (vtep) {
> > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > +				break;
> > +			}
> > +			if (rte_errno != EEXIST)
> > +				break;
> > +		}
> > +	}
> > +	if (!vtep)
> > +		return 0;
> > +	vtep->refcnt++;
> > +	assert(vtep->ifindex);
> > +	return vtep->ifindex;
> 
> Please refactor this func according to what I suggested for
> flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> 
> > +}
> > +
> > +/**
> > + * Creates target interface index for tunneling of any type.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifouter
> > + *   Network interface index to attach VXLAN encap device to.
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + * @param[out] error
> > + *   Perform verbose error reporting if not NULL.
> > + * @return
> > + *   Interface index on success, zero otherwise and rte_errno is set.
> 
>  *   Interface index on success, a negative errno value otherwise and
>  *   rte_errno is set.
> 
> > + */
> > +static unsigned int
> > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifouter,
> > +			    struct mlx5_flow *dev_flow,
> > +			    struct rte_flow_error *error)
> > +{
> > +	unsigned int ret;
> > +
> > +	assert(dev_flow->tcf.tunnel);
> > +	pthread_mutex_lock(&vtep_list_mutex);
> > +	switch (dev_flow->tcf.tunnel->type) {
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > +						 dev_flow, error);
> > +		break;
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > +		break;
> > +	default:
> > +		rte_flow_error_set(error, ENOTSUP,
> > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > +				"unsupported tunnel type");
> > +		ret = 0;
> > +		break;
> > +	}
> > +	pthread_mutex_unlock(&vtep_list_mutex);
> > +	return ret;
> > +}
> > +
> > +/**
> > + * Deletes tunneling interface by UDP port.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifindex
> > + *   Network interface index of VXLAN device.
> > + * @param[in] dev_flow
> > + *   Flow tcf object with tunnel structure pointer set.
> > + */
> > +static void
> > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifindex,
> > +			    struct mlx5_flow *dev_flow)
> > +{
> > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > +
> > +	assert(dev_flow->tcf.tunnel);
> > +	pthread_mutex_lock(&vtep_list_mutex);
> > +	vtep = NULL;
> > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > +		if (vlst->ifindex == ifindex) {
> > +			vtep = vlst;
> > +			break;
> > +		}
> > +	}
> 
> It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel instead
> of ifindex_tun which is same as vtep->ifindex like the assertion below. Then,
> this lookup can be skipped.

OK. Good optimization.
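
A minimal sketch of the idea, with hypothetical simplified types: once the tunnel header keeps the vtep pointer itself, releasing the device needs no walk over vtep_list_vxlan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified VTEP descriptor. */
struct vtep_sketch {
	unsigned int ifindex;
	unsigned int refcnt;
};

/* Hypothetical tunnel header keeping the device pointer, not its ifindex. */
struct tunnel_hdr_sketch {
	struct vtep_sketch *vtep;
};

/* Detach the device from the tunnel and return the remaining references. */
static unsigned int
tunnel_sketch_release(struct tunnel_hdr_sketch *tunnel)
{
	struct vtep_sketch *vtep = tunnel->vtep;

	assert(vtep && vtep->refcnt);
	tunnel->vtep = NULL;
	return --vtep->refcnt;
}
```

The caller would remove and delete the device when the returned count is zero; the ifindex stays available as tunnel->vtep->ifindex while the rule is applied.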

> 
> > +	if (!vtep) {
> > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > +		goto exit;
> > +	}
> > +	switch (dev_flow->tcf.tunnel->type) {
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > +		break;
> > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > +/*
> > + * TODO: Remove the encap ancillary rules first.
> > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> 
> Is it a personal note? Please remove.
OK.

> 
> > +		break;
> > +	default:
> > +		assert(false);
> > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > +		break;
> > +	}
> > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > +	assert(vtep->refcnt);
> > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > +		LIST_REMOVE(vtep, next);
> > +		flow_tcf_delete_iface(tcf, vtep);
> > +	}
> > +exit:
> > +	pthread_mutex_unlock(&vtep_list_mutex);
> > +}
> > +
> >  /**
> >   * Apply flow to E-Switch by sending Netlink message.
> >   *
> > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> >  	       struct rte_flow_error *error)  {
> >  	struct priv *priv = dev->data->dev_private;
> > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> >  	struct mlx5_flow *dev_flow;
> >  	struct nlmsghdr *nlh;
> > +	int ret;
> >
> >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> >  	/* E-Switch flow can't be expanded. */
> >  	assert(!LIST_NEXT(dev_flow, next));
> > +	if (dev_flow->tcf.applied)
> > +		return 0;
> >  	nlh = dev_flow->tcf.nlh;
> >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
> > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > +	if (dev_flow->tcf.tunnel) {
> > +		/*
> > +		 * Replace the interface index, target for
> > +		 * encapsulation, source for decapsulation.
> > +		 */
> > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > +		/* Create actual VTEP device when rule is being applied. */
> > +		dev_flow->tcf.tunnel->ifindex_tun
> > +			= flow_tcf_tunnel_vtep_create(tcf,
> > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > +					dev_flow, error);
> > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > +				dev_flow->tcf.tunnel->ifindex_tun,
> > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > +			return -rte_errno;
> > +		dev_flow->tcf.tunnel->ifindex_org
> > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > +	}
> > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +	if (dev_flow->tcf.tunnel) {
> > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > +				dev_flow->tcf.tunnel->ifindex_org,
> > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_org;
> > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> 
> ifindex_org looks a temporary storage in this code. And this kind of hassle
> (replace/restore) is there because you took the ifindex from the netlink
> message. Why don't you have just
> 
> struct mlx5_flow_tcf_tunnel_hdr {
> 	uint32_t type; /**< Tunnel action type. */
> 	unsigned int ifindex; /**< Original dst/src interface */
> 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message. */
> };
> 
> and don't change ifindex?

I propose to use a local variable for ifindex_org and not to keep it
in the structure. *ifindex_ptr will be kept.
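
Roughly what I have in mind, as a standalone sketch: the send callback is a hypothetical stand-in for flow_tcf_nl_ack(), and the helper names are made up for the example. The original ifindex lives only on the stack for the duration of the request.

```c
#include <assert.h>

/* Hypothetical send routine: returns 0 on success, nonzero on failure. */
typedef int (*send_fn)(const unsigned int *ifindex_in_msg, void *arg);

/* Example callback: records the ifindex present in the message when sent. */
static int
record_ifindex(const unsigned int *ifindex_in_msg, void *arg)
{
	*(unsigned int *)arg = *ifindex_in_msg;
	return 0;
}

/* Send the message with the tunnel ifindex patched in, then restore it. */
static int
send_with_tunnel_ifindex(unsigned int *ifindex_ptr, unsigned int ifindex_tun,
			 send_fn send, void *arg)
{
	unsigned int ifindex_org = *ifindex_ptr; /* Local, not in the struct. */
	int ret;

	*ifindex_ptr = ifindex_tun;
	ret = send(ifindex_ptr, arg);
	*ifindex_ptr = ifindex_org; /* Restored regardless of the result. */
	return ret;
}
```

The same helper shape would serve both the apply and the remove paths, since both patch the message in the same way.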

With best regards,
Slava

> 
> Thanks,
> Yongseok
> 
> > +	}
> > +	if (!ret) {
> > +		dev_flow->tcf.applied = 1;
> >  		return 0;
> > +	}
> > +	DRV_LOG(WARNING, "netlink: failed to create TC rule (%d)", rte_errno);
> > +	if (dev_flow->tcf.tunnel->ifindex_tun) {
> > +		flow_tcf_tunnel_vtep_delete(tcf,
> > +					    dev_flow->tcf.tunnel->ifindex_tun,
> > +					    dev_flow);
> > +		dev_flow->tcf.tunnel->ifindex_tun = 0;
> > +	}
> >  	return rte_flow_error_set(error, rte_errno,
> >  				  RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> >  				  "netlink: failed to create TC flow rule");
> > @@ -3490,7 +3959,7 @@ struct pedit_parser {
> >  flow_tcf_remove(struct rte_eth_dev *dev, struct rte_flow *flow)
> >  {
> >  	struct priv *priv = dev->data->dev_private;
> > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> >  	struct mlx5_flow *dev_flow;
> >  	struct nlmsghdr *nlh;
> >
> > @@ -3501,10 +3970,36 @@ struct pedit_parser {
> >  		return;
> >  	/* E-Switch flow can't be expanded. */
> >  	assert(!LIST_NEXT(dev_flow, next));
> > +	if (!dev_flow->tcf.applied)
> > +		return;
> > +	if (dev_flow->tcf.tunnel) {
> > +		/*
> > +		 * Replace the interface index, target for
> > +		 * encapsulation, source for decapsulation.
> > +		 */
> > +		assert(dev_flow->tcf.tunnel->ifindex_tun);
> > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > +		dev_flow->tcf.tunnel->ifindex_org
> > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > +	}
> >  	nlh = dev_flow->tcf.nlh;
> >  	nlh->nlmsg_type = RTM_DELTFILTER;
> >  	nlh->nlmsg_flags = NLM_F_REQUEST;
> > -	flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL);
> > +	flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > +	if (dev_flow->tcf.tunnel) {
> > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > +			= dev_flow->tcf.tunnel->ifindex_org;
> > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > +		if (dev_flow->tcf.tunnel->ifindex_tun) {
> > +			flow_tcf_tunnel_vtep_delete(tcf,
> > +					dev_flow->tcf.tunnel->ifindex_tun,
> > +					dev_flow);
> > +			dev_flow->tcf.tunnel->ifindex_tun = 0;
> > +		}
> > +	}
> > +	dev_flow->tcf.applied = 0;
> >  }
> >
> >  /**
> >


* Re: [dpdk-dev] [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines
  2018-10-25  0:36     ` Yongseok Koh
@ 2018-10-25 20:32       ` Slava Ovsiienko
  2018-10-26  6:30         ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-25 20:32 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Thursday, October 25, 2018 3:37
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines
> 
> On Mon, Oct 15, 2018 at 02:13:35PM +0000, Viacheslav Ovsiienko wrote:
> > The last part of the patchset contains the rule cleanup routines.
> > They are part of the outer interface initialization at the moment
> > of VXLAN VTEP attachment. These routines query the list of attached
> > VXLAN devices, the list of local IP addresses with peer and link scope
> > attributes, and the list of permanent neigh rules; then all such
> > items found on the specified outer device are flushed.
> >
> > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > ---
> >  drivers/net/mlx5/mlx5_flow_tcf.c | 505
> > ++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 499 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > index a1d7733..a3348ea 100644
> > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > @@ -4012,6 +4012,502 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
> >  }
> >  #endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> >
> > +#define MNL_REQUEST_SIZE_MIN 256
> > +#define MNL_REQUEST_SIZE_MAX 2048
> > +#define MNL_REQUEST_SIZE RTE_MIN(RTE_MAX(sysconf(_SC_PAGESIZE), \
> > +				 MNL_REQUEST_SIZE_MIN), MNL_REQUEST_SIZE_MAX)
> > +
> > +/* Data structures used by flow_tcf_xxx_cb() routines. */
> > +struct tcf_nlcb_buf {
> > +	LIST_ENTRY(tcf_nlcb_buf) next;
> > +	uint32_t size;
> > +	alignas(struct nlmsghdr)
> > +	uint8_t msg[]; /**< Netlink message data. */
> > +};
> > +
> > +struct tcf_nlcb_context {
> > +	unsigned int ifindex; /**< Base interface index. */
> > +	uint32_t bufsize;
> > +	LIST_HEAD(, tcf_nlcb_buf) nlbuf;
> > +};
> > +
> > +/**
> > + * Allocate space for netlink command in buffer list
> > + *
> > + * @param[in, out] ctx
> > + *   Pointer to callback context with command buffers list.
> > + * @param[in] size
> > + *   Required size of data buffer to be allocated.
> > + *
> > + * @return
> > + *   Pointer to allocated memory, aligned as message header.
> > + *   NULL if some error occurred.
> > + */
> > +static struct nlmsghdr *
> > +flow_tcf_alloc_nlcmd(struct tcf_nlcb_context *ctx, uint32_t size) {
> > +	struct tcf_nlcb_buf *buf;
> > +	struct nlmsghdr *nlh;
> > +
> > +	size = NLMSG_ALIGN(size);
> > +	buf = LIST_FIRST(&ctx->nlbuf);
> > +	if (buf && (buf->size + size) <= ctx->bufsize) {
> > +		nlh = (struct nlmsghdr *)&buf->msg[buf->size];
> > +		buf->size += size;
> > +		return nlh;
> > +	}
> > +	if (size > ctx->bufsize) {
> > +		DRV_LOG(WARNING, "netlink: too long command buffer requested");
> > +		return NULL;
> > +	}
> > +	buf = rte_malloc(__func__,
> > +			ctx->bufsize + sizeof(struct tcf_nlcb_buf),
> > +			alignof(struct tcf_nlcb_buf));
> > +	if (!buf) {
> > +		DRV_LOG(WARNING, "netlink: no memory for command buffer");
> > +		return NULL;
> > +	}
> > +	LIST_INSERT_HEAD(&ctx->nlbuf, buf, next);
> > +	buf->size = size;
> > +	nlh = (struct nlmsghdr *)&buf->msg[0];
> > +	return nlh;
> > +}
> > +
> > +/**
> > + * Set NLM_F_ACK flags in the last netlink command in buffer.
> > + * Only last command in the buffer will be acked by system.
> > + *
> > + * @param[in, out] buf
> > + *   Pointer to buffer with netlink commands.
> > + */
> > +static void
> > +flow_tcf_setack_nlcmd(struct tcf_nlcb_buf *buf) {
> > +	struct nlmsghdr *nlh;
> > +	uint32_t size = 0;
> > +
> > +	assert(buf->size);
> > +	do {
> > +		nlh = (struct nlmsghdr *)&buf->msg[size];
> > +		size += NLMSG_ALIGN(nlh->nlmsg_len);
> > +		if (size >= buf->size) {
> > +			nlh->nlmsg_flags |= NLM_F_ACK;
> > +			break;
> > +		}
> > +	} while (true);
> > +}
> > +
> > +/**
> > + * Send the buffers with prepared netlink commands. Scans the list and
> > + * sends all found buffers. Buffers are sent and freed anyway in order
> > + * to prevent memory leakage if some error occurs.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in, out] ctx
> > + *   Pointer to callback context with command buffers list.
> > + *
> > + * @return
> > + *   Zero value on success, negative errno value otherwise
> > + *   and rte_errno is set.
> > + */
> > +static int
> > +flow_tcf_send_nlcmd(struct mlx5_flow_tcf_context *tcf,
> > +		    struct tcf_nlcb_context *ctx)
> > +{
> > +	struct tcf_nlcb_buf *bc, *bn;
> > +	struct nlmsghdr *nlh;
> > +	int ret = 0;
> > +
> > +	bc = LIST_FIRST(&ctx->nlbuf);
> > +	while (bc) {
> > +		int rc;
> > +
> > +		bn = LIST_NEXT(bc, next);
> > +		if (bc->size) {
> > +			flow_tcf_setack_nlcmd(bc);
> > +			nlh = (struct nlmsghdr *)&bc->msg;
> > +			rc = flow_tcf_nl_ack(tcf, nlh, bc->size, NULL, NULL);
> > +			if (rc && !ret)
> > +				ret = rc;
> > +		}
> > +		rte_free(bc);
> > +		bc = bn;
> > +	}
> > +	LIST_INIT(&ctx->nlbuf);
> > +	return ret;
> > +}
> > +
> > +/**
> > + * Collect local IP address rules with scope link attribute on specified
> > + * network device. This is callback routine called by libmnl mnl_cb_run()
> > + * in loop for every message in received packet.
> > + *
> > + * @param[in] nlh
> > + *   Pointer to reply header.
> > + * @param[in, out] arg
> > + *   Opaque data pointer for this callback.
> > + *
> > + * @return
> > + *   A positive, nonzero value on success, negative errno value otherwise
> > + *   and rte_errno is set.
> > + */
> > +static int
> > +flow_tcf_collect_local_cb(const struct nlmsghdr *nlh, void *arg) {
> > +	struct tcf_nlcb_context *ctx = arg;
> > +	struct nlmsghdr *cmd;
> > +	struct ifaddrmsg *ifa;
> > +	struct nlattr *na;
> > +	struct nlattr *na_local = NULL;
> > +	struct nlattr *na_peer = NULL;
> > +	unsigned char family;
> > +
> > +	if (nlh->nlmsg_type != RTM_NEWADDR) {
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	ifa = mnl_nlmsg_get_payload(nlh);
> > +	family = ifa->ifa_family;
> > +	if (ifa->ifa_index != ctx->ifindex ||
> > +	    ifa->ifa_scope != RT_SCOPE_LINK ||
> > +	    !(ifa->ifa_flags & IFA_F_PERMANENT) ||
> > +	    (family != AF_INET && family != AF_INET6))
> > +		return 1;
> > +	mnl_attr_for_each(na, nlh, sizeof(*ifa)) {
> > +		switch (mnl_attr_get_type(na)) {
> > +		case IFA_LOCAL:
> > +			na_local = na;
> > +			break;
> > +		case IFA_ADDRESS:
> > +			na_peer = na;
> > +			break;
> > +		}
> > +		if (na_local && na_peer)
> > +			break;
> > +	}
> > +	if (!na_local || !na_peer)
> > +		return 1;
> > +	/* Local rule found with scope link, permanent and assigned peer. */
> > +	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
> > +					MNL_ALIGN(sizeof(struct ifaddrmsg)) +
> > +					(family == AF_INET6
> > +					? 2 * SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN)
> > +					: 2 * SZ_NLATTR_TYPE_OF(uint32_t)));
> 
> Better to use IPV4_ADDR_LEN instead?
> 
OK.
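
For illustration, the command size with symmetric length constants might be computed as in this sketch. The SK_* macros and the function name are local to the example; IPV4_ADDR_LEN and IPV6_ADDR_LEN are defined here with the values 4 and 16, and the sketch assumes a 16-byte struct nlmsghdr, an 8-byte struct ifaddrmsg and a 4-byte struct nlattr.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Local stand-ins for this sketch; Netlink aligns everything to 4 bytes. */
#define SK_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)
#define SK_NLMSG_HDRLEN ((size_t)16)  /* sizeof(struct nlmsghdr) */
#define SK_IFADDRMSG_LEN ((size_t)8)  /* sizeof(struct ifaddrmsg) */
#define SK_NLA_HDRLEN ((size_t)4)     /* sizeof(struct nlattr) */
#define SK_ATTR(len) (SK_NLA_HDRLEN + SK_ALIGN(len))
#define IPV4_ADDR_LEN 4
#define IPV6_ADDR_LEN 16

/* Size of one RTM_DELADDR command carrying IFA_LOCAL and IFA_ADDRESS. */
static size_t
deladdr_cmd_size(int family)
{
	size_t addr_len = family == AF_INET6 ? IPV6_ADDR_LEN : IPV4_ADDR_LEN;

	return SK_NLMSG_HDRLEN + SK_ALIGN(SK_IFADDRMSG_LEN) +
	       2 * SK_ATTR(addr_len);
}
```

This gives 40 bytes for an IPv4 command and 64 bytes for an IPv6 one, the same totals as the expression in the patch.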

> > +	if (!cmd) {
> > +		rte_errno = ENOMEM;
> > +		return -rte_errno;
> > +	}
> > +	cmd = mnl_nlmsg_put_header(cmd);
> > +	cmd->nlmsg_type = RTM_DELADDR;
> > +	cmd->nlmsg_flags = NLM_F_REQUEST;
> > +	ifa = mnl_nlmsg_put_extra_header(cmd, sizeof(*ifa));
> > +	ifa->ifa_flags = IFA_F_PERMANENT;
> > +	ifa->ifa_scope = RT_SCOPE_LINK;
> > +	ifa->ifa_index = ctx->ifindex;
> > +	if (family == AF_INET) {
> > +		ifa->ifa_family = AF_INET;
> > +		ifa->ifa_prefixlen = 32;
> > +		mnl_attr_put_u32(cmd, IFA_LOCAL, mnl_attr_get_u32(na_local));
> > +		mnl_attr_put_u32(cmd, IFA_ADDRESS, mnl_attr_get_u32(na_peer));
> > +	} else {
> > +		ifa->ifa_family = AF_INET6;
> > +		ifa->ifa_prefixlen = 128;
> > +		mnl_attr_put(cmd, IFA_LOCAL, IPV6_ADDR_LEN,
> > +			mnl_attr_get_payload(na_local));
> > +		mnl_attr_put(cmd, IFA_ADDRESS, IPV6_ADDR_LEN,
> > +			mnl_attr_get_payload(na_peer));
> > +	}
> > +	return 1;
> > +}
> > +
> > +/**
> > + * Cleanup the local IP addresses on outer interface.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifindex
> > + *   Network interface index to perform cleanup.
> > + */
> > +static void
> > +flow_tcf_encap_local_cleanup(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifindex)
> > +{
> > +	struct nlmsghdr *nlh;
> > +	struct ifaddrmsg *ifa;
> > +	struct tcf_nlcb_context ctx = {
> > +		.ifindex = ifindex,
> > +		.bufsize = MNL_REQUEST_SIZE,
> > +		.nlbuf = LIST_HEAD_INITIALIZER(),
> > +	};
> > +	int ret;
> > +
> > +	assert(ifindex);
> > +	/*
> > +	 * Seek and destroy leftovers of local IP addresses with
> > +	 * matching properties "scope link".
> > +	 */
> > +	nlh = mnl_nlmsg_put_header(tcf->buf);
> > +	nlh->nlmsg_type = RTM_GETADDR;
> > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
> > +	ifa = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifa));
> > +	ifa->ifa_family = AF_UNSPEC;
> > +	ifa->ifa_index = ifindex;
> > +	ifa->ifa_scope = RT_SCOPE_LINK;
> > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_local_cb, &ctx);
> > +	if (ret)
> > +		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
> > +	ret = flow_tcf_send_nlcmd(tcf, &ctx);
> > +	if (ret)
> > +		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
> > +}
> > +
> > +/**
> > + * Collect neigh permanent rules on specified network device.
> > + * This is callback routine called by libmnl mnl_cb_run() in loop for
> > + * every message in received packet.
> > + *
> > + * @param[in] nlh
> > + *   Pointer to reply header.
> > + * @param[in, out] arg
> > + *   Opaque data pointer for this callback.
> > + *
> > + * @return
> > + *   A positive, nonzero value on success, negative errno value otherwise
> > + *   and rte_errno is set.
> > + */
> > +static int
> > +flow_tcf_collect_neigh_cb(const struct nlmsghdr *nlh, void *arg)
> > +{
> > +	struct tcf_nlcb_context *ctx = arg;
> > +	struct nlmsghdr *cmd;
> > +	struct ndmsg *ndm;
> > +	struct nlattr *na;
> > +	struct nlattr *na_ip = NULL;
> > +	struct nlattr *na_mac = NULL;
> > +	unsigned char family;
> > +
> > +	if (nlh->nlmsg_type != RTM_NEWNEIGH) {
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	ndm = mnl_nlmsg_get_payload(nlh);
> > +	family = ndm->ndm_family;
> > +	if (ndm->ndm_ifindex != (int)ctx->ifindex ||
> > +	   !(ndm->ndm_state & NUD_PERMANENT) ||
> > +	   (family != AF_INET && family != AF_INET6))
> > +		return 1;
> > +	mnl_attr_for_each(na, nlh, sizeof(*ndm)) {
> > +		switch (mnl_attr_get_type(na)) {
> > +		case NDA_DST:
> > +			na_ip = na;
> > +			break;
> > +		case NDA_LLADDR:
> > +			na_mac = na;
> > +			break;
> > +		}
> > +		if (na_mac && na_ip)
> > +			break;
> > +	}
> > +	if (!na_mac || !na_ip)
> > +		return 1;
> > +	/* Neigh rule with permanent attribute found. */
> > +	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
> > +					MNL_ALIGN(sizeof(struct ndmsg)) +
> > +					SZ_NLATTR_DATA_OF(ETHER_ADDR_LEN) +
> > +					(family == AF_INET6
> > +					? SZ_NLATTR_DATA_OF(IPV6_ADDR_LEN)
> > +					: SZ_NLATTR_TYPE_OF(uint32_t)));
> 
> Better to use IPV4_ADDR_LEN instead?
> 
> > +	if (!cmd) {
> > +		rte_errno = ENOMEM;
> > +		return -rte_errno;
> > +	}
> > +	cmd = mnl_nlmsg_put_header(cmd);
> > +	cmd->nlmsg_type = RTM_DELNEIGH;
> > +	cmd->nlmsg_flags = NLM_F_REQUEST;
> > +	ndm = mnl_nlmsg_put_extra_header(cmd, sizeof(*ndm));
> > +	ndm->ndm_ifindex = ctx->ifindex;
> > +	ndm->ndm_state = NUD_PERMANENT;
> > +	ndm->ndm_flags = 0;
> > +	ndm->ndm_type = 0;
> > +	if (family == AF_INET) {
> > +		ndm->ndm_family = AF_INET;
> > +		mnl_attr_put_u32(cmd, NDA_DST, mnl_attr_get_u32(na_ip));
> > +	} else {
> > +		ndm->ndm_family = AF_INET6;
> > +		mnl_attr_put(cmd, NDA_DST, IPV6_ADDR_LEN,
> > +			     mnl_attr_get_payload(na_ip));
> > +	}
> > +	mnl_attr_put(cmd, NDA_LLADDR, ETHER_ADDR_LEN,
> > +		     mnl_attr_get_payload(na_mac));
> > +	return 1;
> > +}
> > +
> > +/**
> > + * Cleanup the neigh rules on outer interface.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifindex
> > + *   Network interface index to perform cleanup.
> > + */
> > +static void
> > +flow_tcf_encap_neigh_cleanup(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifindex)
> > +{
> > +	struct nlmsghdr *nlh;
> > +	struct ndmsg *ndm;
> > +	struct tcf_nlcb_context ctx = {
> > +		.ifindex = ifindex,
> > +		.bufsize = MNL_REQUEST_SIZE,
> > +		.nlbuf = LIST_HEAD_INITIALIZER(),
> > +	};
> > +	int ret;
> > +
> > +	assert(ifindex);
> > +	/* Seek and destroy leftovers of neigh rules. */
> > +	nlh = mnl_nlmsg_put_header(tcf->buf);
> > +	nlh->nlmsg_type = RTM_GETNEIGH;
> > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
> > +	ndm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ndm));
> > +	ndm->ndm_family = AF_UNSPEC;
> > +	ndm->ndm_ifindex = ifindex;
> > +	ndm->ndm_state = NUD_PERMANENT;
> > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_neigh_cb, &ctx);
> > +	if (ret)
> > +		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
> > +	ret = flow_tcf_send_nlcmd(tcf, &ctx);
> > +	if (ret)
> > +		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
> > +}
> > +
> > +/**
> > + * Collect indices of VXLAN encap/decap interfaces associated with device.
> > + * This is callback routine called by libmnl mnl_cb_run() in loop for
> > + * every message in received packet.
> > + *
> > + * @param[in] nlh
> > + *   Pointer to reply header.
> > + * @param[in, out] arg
> > + *   Opaque data pointer for this callback.
> > + *
> > + * @return
> > + *   A positive, nonzero value on success, negative errno value otherwise
> > + *   and rte_errno is set.
> > + */
> > +static int
> > +flow_tcf_collect_vxlan_cb(const struct nlmsghdr *nlh, void *arg)
> > +{
> > +	struct tcf_nlcb_context *ctx = arg;
> > +	struct nlmsghdr *cmd;
> > +	struct ifinfomsg *ifm;
> > +	struct nlattr *na;
> > +	struct nlattr *na_info = NULL;
> > +	struct nlattr *na_vxlan = NULL;
> > +	bool found = false;
> > +	unsigned int vxindex;
> > +
> > +	if (nlh->nlmsg_type != RTM_NEWLINK) {
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	ifm = mnl_nlmsg_get_payload(nlh);
> > +	if (!ifm->ifi_index) {
> > +		rte_errno = EINVAL;
> > +		return -rte_errno;
> > +	}
> > +	mnl_attr_for_each(na, nlh, sizeof(*ifm))
> > +		if (mnl_attr_get_type(na) == IFLA_LINKINFO) {
> > +			na_info = na;
> > +			break;
> > +		}
> > +	if (!na_info)
> > +		return 1;
> > +	mnl_attr_for_each_nested(na, na_info) {
> > +		switch (mnl_attr_get_type(na)) {
> > +		case IFLA_INFO_KIND:
> > +			if (!strncmp("vxlan", mnl_attr_get_str(na),
> > +				     mnl_attr_get_len(na)))
> > +				found = true;
> > +			break;
> > +		case IFLA_INFO_DATA:
> > +			na_vxlan = na;
> > +			break;
> > +		}
> > +		if (found && na_vxlan)
> > +			break;
> > +	}
> > +	if (!found || !na_vxlan)
> > +		return 1;
> > +	found = false;
> > +	mnl_attr_for_each_nested(na, na_vxlan) {
> > +		if (mnl_attr_get_type(na) == IFLA_VXLAN_LINK &&
> > +		    mnl_attr_get_u32(na) == ctx->ifindex) {
> > +			found = true;
> > +			break;
> > +		}
> > +	}
> > +	if (!found)
> > +		return 1;
> > +	/* Attached VXLAN device found, store the command to delete. */
> > +	vxindex = ifm->ifi_index;
> > +	cmd = flow_tcf_alloc_nlcmd(ctx, MNL_ALIGN(sizeof(struct nlmsghdr)) +
> > +					MNL_ALIGN(sizeof(struct ifinfomsg)));
> > +	if (!cmd) {
> > +		rte_errno = ENOMEM;
> > +		return -rte_errno;
> > +	}
> > +	cmd = mnl_nlmsg_put_header(cmd);
> > +	cmd->nlmsg_type = RTM_DELLINK;
> > +	cmd->nlmsg_flags = NLM_F_REQUEST;
> > +	ifm = mnl_nlmsg_put_extra_header(cmd, sizeof(*ifm));
> > +	ifm->ifi_family = AF_UNSPEC;
> > +	ifm->ifi_index = vxindex;
> > +	return 1;
> > +}
> > +
> > +/**
> > + * Cleanup the outer interface. Removes all found vxlan devices
> > + * attached to specified index, flushes the neigh and local IP
> > + * database.
> > + *
> > + * @param[in] tcf
> > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > + * @param[in] ifindex
> > + *   Network interface index to perform cleanup.
> > + */
> > +static void
> > +flow_tcf_encap_iface_cleanup(struct mlx5_flow_tcf_context *tcf,
> > +			    unsigned int ifindex)
> > +{
> > +	struct nlmsghdr *nlh;
> > +	struct ifinfomsg *ifm;
> > +	struct tcf_nlcb_context ctx = {
> > +		.ifindex = ifindex,
> > +		.bufsize = MNL_REQUEST_SIZE,
> > +		.nlbuf = LIST_HEAD_INITIALIZER(),
> > +	};
> > +	int ret;
> > +
> > +	assert(ifindex);
> > +	/*
> > +	 * Seek and destroy leftover VXLAN encap/decap interfaces with
> > +	 * matching properties.
> > +	 */
> > +	nlh = mnl_nlmsg_put_header(tcf->buf);
> > +	nlh->nlmsg_type = RTM_GETLINK;
> > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
> > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > +	ifm->ifi_family = AF_UNSPEC;
> > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, flow_tcf_collect_vxlan_cb, &ctx);
> > +	if (ret)
> > +		DRV_LOG(WARNING, "netlink: query device list error %d", ret);
> > +	ret = flow_tcf_send_nlcmd(tcf, &ctx);
> > +	if (ret)
> > +		DRV_LOG(WARNING, "netlink: device delete error %d", ret);
> > +}
> > +
> > +
> >  /**
> >   * Create target interface index for VXLAN tunneling decapsulation.
> >   * In order to share the UDP port within the other interfaces the
> > @@ -4100,12 +4596,9 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
> >  		uint16_t pcnt;
> >
> >  		/* Not found, we should create the new attached VTEP. */
> > -/*
> > - * TODO: not implemented yet
> > - * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > - * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > - * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
> > - */
> > +		flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > +		flow_tcf_encap_local_cleanup(tcf, ifouter);
> > +		flow_tcf_encap_neigh_cleanup(tcf, ifouter);
> 
> I have a fundamental question. Why are these cleanups needed? If I read the
> code correctly, it cleans up VTEPs, IP assignments and neigh entries
> which were not created/set by the PMD. Is the reason we have to clean
> things up that the PMD exclusively owns the interface (ifouter)? Is my
> understanding correct?

Because this is the simplest approach. I do not know how to co-exist
with unknown pre-created rules or how to take all their properties
and side effects into account.

While debugging I saw situations where the application crashed and
left "leftovers" behind: VXLAN devices, neigh and local rules. When the application
was run again, these leftovers caused errors (EEXIST on rule creation and so on).

With best regards,
Slava
> 
> Thanks,
> Yongseok
> 
> >  		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> >  				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
> >  			encap_port++;
> >

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and definitions
  2018-10-25 12:50       ` Slava Ovsiienko
@ 2018-10-25 23:33         ` Yongseok Koh
  0 siblings, 0 replies; 110+ messages in thread
From: Yongseok Koh @ 2018-10-25 23:33 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Thu, Oct 25, 2018 at 05:50:26AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Tuesday, October 23, 2018 13:02
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 1/7] net/mlx5: e-switch VXLAN configuration and
> > definitions
> > 
> > On Mon, Oct 15, 2018 at 02:13:29PM +0000, Viacheslav Ovsiienko wrote:
[...]
> > > diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
> > > index 840d645..b838ab0 100644
> > > --- a/drivers/net/mlx5/mlx5_flow.h
> > > +++ b/drivers/net/mlx5/mlx5_flow.h
> > > @@ -85,6 +85,8 @@
> > >  #define MLX5_FLOW_ACTION_SET_TP_SRC (1u << 15)
> > >  #define MLX5_FLOW_ACTION_SET_TP_DST (1u << 16)
> > >  #define MLX5_FLOW_ACTION_JUMP (1u << 17)
> > > +#define MLX5_ACTION_VXLAN_ENCAP (1u << 11)
> > > +#define MLX5_ACTION_VXLAN_DECAP (1u << 12)
> > 
> > MLX5_ACTION_* has been changed to MLX5_FLOW_ACTION_* as you can
> > see above.
> OK. Miscopied from previous version of patch.
> 
> > And make it alphabetical order; decap first and encap later? Or, at least make
> > it consistent. The order (case clause) is different among validate, prepare and
> > translate.
> OK. Will reorder.
> 
> > 
> > >  #define MLX5_FLOW_FATE_ACTIONS \
> > >  	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | MLX5_FLOW_ACTION_RSS)
> > > @@ -182,8 +184,17 @@ struct mlx5_flow_dv {
> > >  struct mlx5_flow_tcf {
> > >  	struct nlmsghdr *nlh;
> > >  	struct tcmsg *tcm;
> > > +	uint32_t nlsize; /**< Size of NL message buffer. */
> > 
> > It is used only for assert(), but if prepare() is trusted, why do we need to
> > keep it? I don't think it is needed.
> > 
> Question: could we keep nlsize under the NDEBUG flag?
> It's extremely useful to have an assert()
> on the allocated size for debugging purposes.

Totally agree. Please do so.

> > > +	uint32_t applied:1; /**< Whether rule is currently applied. */
> > > +	uint64_t item_flags; /**< Item flags. */
> > 
> > This isn't used at all.
> OK, now no dependencies on it, should be removed, good.
> 
> > 
> > > +	uint64_t action_flags; /**< Action flags. */
> > 
> > I checked the following patches and it doesn't seem necessary. Please refer to
> > the comment on the translation func. But if you think it is really needed, you
> > could've used the actions field of struct rte_flow and the layers field of struct
> > mlx5_flow in mlx5_flow.h
> 
> When translating the item list into an NL message we have to know whether
> there is a tunneling action in the action list, because a tunneling action
> can change the meaning of an item. For example, the ipv4 item usually
> provides IPv4 addresses for matching and is translated to the
> TCA_FLOWER_KEY_IPV4_SRC (+ xxx_DST) Netlink attribute(s), but if a VXLAN
> decap action is specified, this item provides the outer tunnel source IPs
> and should be translated to TCA_FLOWER_KEY_ENC_IPV4_SRC. The action list
> is scanned in the prepare stage, so we can save the action flags and use
> these gathered results in the translation routine. As we can see from the
> mlx5_flow_list_create() source, it does not save the item/action flags
> gathered by flow_drv_prepare(). That's why there are
> item_flags/action_flags in struct mlx5_flow_tcf. item_flags is not needed
> and should be removed; action_flags is in use.
> 
> BTW, do we need the item_flags and action_flags params in flow_drv_prepare()?
> We could avoid the item_flags field if the flags were passed from
> flow_drv_prepare() to flow_drv_translate() (as local variables of
> mlx5_flow_list_create()).

That was a bug. I found it while I was reviewing your patches. Thanks :)
Refer to my patch which is already merged. Prepare should return flags so that
it can be used by translate and others.

http://git.dpdk.org/next/dpdk-next-net-mlx/commit/?id=4fa7d5e88165745523b9b6682c4092fb943a7d49

So, you don't need to keep this field here.

> 
> > >  	uint64_t hits;
> > >  	uint64_t bytes;
> > > +	union { /**< Tunnel encap/decap descriptor. */
> > > +		struct mlx5_flow_tcf_tunnel_hdr *tunnel;
> > > +		struct mlx5_flow_tcf_vxlan_decap *vxlan_decap;
> > > +		struct mlx5_flow_tcf_vxlan_encap *vxlan_encap;
> > > +	};
> > 
> > What is the reason for keeping a pointer even though the actual structure
> > follows after mlx5_flow_tcf? Maybe you don't want to waste memory, as the
> > size of the encap/decap structs differs a lot?
> 
> Sizes differ, but not a lot (especially comparing with DV rule size). 
> Would you prefer to simplify and just include the union?
> On other hand we could declare something like that:
> 	...
>  	uint8_t tunnel_type;
> 	alignas(struct mlx5_flow_tcf_tunnel_hdr)   uint8_t buf[];
> 
> and eliminate the pointer altogether. The beginning of buf contains either the tunnel
> structure or the Netlink message (if no tunnel), depending on the tunnel_type field.

I was just curious. Either way looks good to me. Will defer it to you.

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-25 13:53       ` Slava Ovsiienko
@ 2018-10-26  3:07         ` Yongseok Koh
  2018-10-26  8:39           ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-26  3:07 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Thu, Oct 25, 2018 at 06:53:11AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Tuesday, October 23, 2018 13:05
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> > routine
> > 
> > On Mon, Oct 15, 2018 at 02:13:30PM +0000, Viacheslav Ovsiienko wrote:
[...]
> > > @@ -1114,7 +1733,6 @@ struct pedit_parser {
> > >  							   error);
> > >  			if (ret < 0)
> > >  				return ret;
> > > -			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> > >  			mask.ipv4 = flow_tcf_item_mask
> > >  				(items, &rte_flow_item_ipv4_mask,
> > >  				 &flow_tcf_mask_supported.ipv4,
> > > @@ -1135,13 +1753,22 @@ struct pedit_parser {
> > >  				next_protocol =
> > >  					((const struct rte_flow_item_ipv4 *)
> > >  					 (items->spec))->hdr.next_proto_id;
> > > +			if (item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
> > > +				/*
> > > +				 * Multiple outer items are not allowed as
> > > +				 * tunnel parameters, will raise an error later.
> > > +				 */
> > > +				ipv4 = NULL;
> > 
> > Can't it be inner then?
> AFAIK, no for tc rules: we can not specify multiple levels (inner + outer) for them.
> There are just no TCA_FLOWER_KEY_xxx attributes for specifying inner items
> to be matched by flower.

When I briefly read the kernel code, I thought TCA_FLOWER_KEY_* are for inner
header before decap. I mean TCA_FLOWER_KEY_IPV4_SRC is for inner L3 and
TCA_FLOWER_KEY_ENC_IPV4_SRC is for outer tunnel header. Please do some
experiments with tc-flower command.

> It is quite an unclear comment, not the best one, sorry. I did not like it either,
> I just forgot to rewrite it.
> 
> The ipv4, ipv6 and udp variables gather the matching items during the item list scan;
> later these variables are used for VXLAN decap action validation only. So "outer"
> means that the ipv4 variable contains the VXLAN decap outer addresses, and it
> should be set to NULL if multiple such items are found in the item list.
> 
> But we can generate an error here if we have valid action_flags
> (gathered by prepare function) and VXLAN decap is set. Raising
> an error looks more relevant and clear.

You can't use flags at this point. It is validate() so prepare() might not be
preceded.

> >   flow create 1 ingress transfer
> >     pattern eth src is 66:77:88:99:aa:bb
> >       dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
> >       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
> >       eth / ipv6 / tcp dst is 42 / end
> >     actions vxlan_decap / port_id id 2 / end
> > 
> > Is this flow supported by linux tcf? I took this example from Adrien's patch -
> > "[8/8] net/mlx5: add VXLAN decap support to switch flow rules". If so, isn't it
> > possible to have inner L3 layer (MLX5_FLOW_LAYER_INNER_*)? If not, you
> > should return error in this case. I don't see any code to check redundant
> > outer items.
> > Did I miss something?
> 
> Interesting. Although the rule has correct syntax, I'm not sure whether it can be applied w/o errors.

Please try. You own this patchset. However, you can simply prohibit such flows
(tunneled items) and come up with follow-up patches to enable them later if they
are supported by tcf, as this whole patchset is already pretty huge and we don't
have much time.

> At least our current flow_tcf_translate() implementation does not support any INNERs.
> But it seems the flow_tcf_validate() does, it's subject to recheck - we should not allow
> unsupported items to pass the validation. I'll check and provide the separate bugfix patch
> (if any).

Neither has tunnel support. It is the first time to add tunnel support to TCF.
If it was needed, you should've added it, not skipping it.

You can check how MLX5_FLOW_LAYER_TUNNEL is used in Verbs/DV as a reference.

> > BTW, for the tunneled items, why don't you follow the code of
> > Verbs(mlx5_flow_verbs.c) and DV(mlx5_flow_dv.c)? For tcf, it is the first time
> For VXLAN it has some specifics (warning about ignored params, etc.)
> I've checked which parts of the verbs/dv code could be reused and did not find
> much. I'll recheck the latest code commits; possibly it has become more appropriate
> for VXLAN.

Agreed. I'm not forcing you to do it because we are running out of time, but I
mentioned it because any redundancy in our code usually causes bugs later.
Let's not waste too much time on that. Just grab the low-hanging fruit, if any.

> > to add tunneled item, but Verbs/DV already have validation code for tunnel,
> > so you can reuse the existing code. In flow_tcf_validate_vxlan_decap(), not
> > every validation is VXLAN-specific but some of them can be common code.
> > 
> > And if you need to know whether there's the VXLAN decap action prior to
> > outer header item validation, you can relocate the code - action validation
> > first and item validation next, as there's no dependency yet in the current
> 
> We can not validate actions first - we need the items to be gathered beforehand,
> to check them in an action-specific fashion and to check the action itself.
> I mean, if we see a VXLAN decap action, we should check the presence of
> L2, L3, L4 and VNI items. I minimized the number of passes along the item
> and action lists. BTW, Adrien's approach performed two passes, mine does only one.
> 
> > code. Defining ipv4, ipv6, udp seems to make the code path more complex.
> Yes, but it allows us to avoid the extra item list scanning and minimizes the changes
> of existing code.
> In your approach we should:
> - scan actions, w/o full checking, just gathering and checking action_flags
> - scan items, performing varying checks (depending on the gathered action flags)
> - scan actions again, performing the full check with params (at least for now,
> checking whether all params were gathered)

Disagree. flow_tcf_validate_vxlan_encap() doesn't need any info about items at all.
flow_tcf_validate_vxlan_decap() needs item_flags only to check whether the VXLAN
item is present, and ipv4/ipv6/udp are all for item checks. Let me give you
a very detailed example:

{
	for (actions[]...) {
		...
		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
			...
			flow_tcf_validate_vxlan_encap();
			...
			break;
		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
					   | MLX5_ACTION_VXLAN_DECAP))
				return rte_flow_error_set
					(error, ENOTSUP,
					 RTE_FLOW_ERROR_TYPE_ACTION,
					 actions,
					 "can't have multiple vxlan actions");
			/* Don't call flow_tcf_validate_vxlan_decap(). */
			action_flags |= MLX5_ACTION_VXLAN_DECAP;
			break;
	}
	for (items[]...) {
		...
		case RTE_FLOW_ITEM_TYPE_IPV4:
			/* Existing common validation. */
			...
			if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
				/* Do the ipv4 validation that is currently in
				 * flow_tcf_validate_vxlan_decap().
				 */
			}
			break;
	}
}

Currently you are doing,

	- validate items
	- validate actions
	- validate items again if decap.

But this can simply be

	- validate actions
	- validate items

Thanks,
Yongseok

> > 
> > For example, you just can call vxlan decap item validation (by splitting
> > flow_tcf_validate_vxlan_decap()) at this point like:
> > 
> > 			if (action_flags & MLX5_FLOW_ACTION_VXLAN_DECAP)
> > 				ret = flow_tcf_validate_vxlan_decap_ipv4(...);
> > 			...
> > 
> > Same for other items.
> > 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine
  2018-10-25 14:37       ` Slava Ovsiienko
@ 2018-10-26  4:22         ` Yongseok Koh
  2018-10-26  9:06           ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-26  4:22 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Thu, Oct 25, 2018 at 07:37:56AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Tuesday, October 23, 2018 13:06
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation
> > routine
> > 
> > On Mon, Oct 15, 2018 at 02:13:31PM +0000, Viacheslav Ovsiienko wrote:
[...]
> > > @@ -2184,13 +2447,16 @@ struct pedit_parser {
> > >   *   Pointer to the list of actions.
> > >   * @param[out] action_flags
> > >   *   Pointer to the detected actions.
> > > + * @param[out] tunnel
> > > + *   Pointer to tunnel encapsulation parameters structure to fill.
> > >   *
> > >   * @return
> > >   *   Maximum size of memory for actions.
> > >   */
> > >  static int
> > >  flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
> > > -			      uint64_t *action_flags)
> > > +			      uint64_t *action_flags,
> > > +			      void *tunnel)
> > 
> > This func is to get actions and size but you are parsing and filling tunnel info
> > here. It would be better to move parsing to translate() because it anyway has
> > multiple if conditions (same as switch/case) to set TCA_TUNNEL_KEY_ENC_*
> > there.
> Do you mean call of flow_tcf_vxlan_encap_parse(actions, tunnel)?

Yes.

> OK, let's move it to translate stage. Anyway, we need to keep encap structure
> for local/neigh rules.
> 
> > 
> > >  {
> > >  	int size = 0;
> > >  	uint64_t flags = 0;
> > > @@ -2246,6 +2512,29 @@ struct pedit_parser {
> > >  				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID. */
> > >  				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio. */
> > >  			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
> > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > +			size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > +			size +=	flow_tcf_vxlan_encap_parse(actions, tunnel) +
> > > +				RTE_ALIGN_CEIL /* preceding encap params. */
> > > +				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +				MNL_ALIGNTO);
> > 
> > Is it different from SZ_NLATTR_TYPE_OF(struct
> > mlx5_flow_tcf_vxlan_encap)? Or, use __rte_aligned(MNL_ALIGNTO) instead.
> 
> It is written intentionally in this form. It means that there is struct mlx5_flow_tcf_vxlan_encap 
> at the beginning of buffer. This is not the NL attribute, usage of SZ_NLATTR_TYPE_OF is
> not relevant here. Alignment is needed for the following Netlink message.

Good point. Understood.

> > 
> > > +			flags |= MLX5_ACTION_VXLAN_ENCAP;
> > > +			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS. */
> > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > +			size +=	SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > +			size +=	RTE_ALIGN_CEIL /* preceding decap params. */
> > > +				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +				MNL_ALIGNTO);
> > 
> > Same here.
> > 
> > > +			flags |= MLX5_ACTION_VXLAN_DECAP;
> > > +			break;
> > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
> > > @@ -2289,6 +2578,26 @@ struct pedit_parser {
> > >  }
> > >
> > >  /**
> > > + * Convert VXLAN VNI to 32-bit integer.
> > > + *
> > > + * @param[in] vni
> > > + *   VXLAN VNI in 24-bit wire format.
> > > + *
> > > + * @return
> > > + *   VXLAN VNI as a 32-bit integer value in network endian.
> > > + */
> > > +static rte_be32_t
> > 
> > make it inline.
> OK. Missed point.
> 
> > 
> > > +vxlan_vni_as_be32(const uint8_t vni[3])
> > > +{
> > > +	rte_be32_t ret;
> > 
> > Defining ret as rte_be32_t? The return value of this func which is bswap(ret)
> > is also rte_be32_t??
> Yes. And it is directly stored in the net-endian NL attribute. 
> I've compiled and checked the listing of the function you proposed. It seems to be best, I'll take it.
> 
> > 
> > > +
> > > +	ret = vni[0];
> > > +	ret = (ret << 8) | vni[1];
> > > +	ret = (ret << 8) | vni[2];
> > > +	return RTE_BE32(ret);
> > 
> > Use rte_cpu_to_be_*() instead. But I still don't understand why you shuffle
> > bytes twice. One with shift and or and other by bswap().
> And it works. The bytes are in a rather bizarre order (in the NL attribute): 0, vni[0], vni[1], vni[2].
> 
> > 
> > {
> > 	union {
> > 		uint8_t vni[4];
> > 		rte_be32_t dword;
> > 	} ret = {
> > 		.vni = { 0, vni[0], vni[1], vni[2] },
> > 	};
> > 	return ret.dword;
> > }
> > 
> > This will have the same result without extra cost.
> 
> OK. I investigated; it is the best for x86-64. I'm also going to test it on 32-bit ARM
> with various compilers, just out of curiosity.
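
For comparison, a minimal standalone sketch of the two variants discussed
above. It uses POSIX htonl() as a stand-in for rte_cpu_to_be_32() and plain
uint32_t instead of rte_be32_t so it builds without DPDK; the function names
are made up for this illustration. Both place the bytes 0, vni[0], vni[1],
vni[2] in memory in that order on either endianness; the second one simply
never needs a byte swap:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shift-based variant: build the host-order value, then convert once. */
static uint32_t
vni_shift(const uint8_t vni[3])
{
	uint32_t host = ((uint32_t)vni[0] << 16) |
			((uint32_t)vni[1] << 8) | vni[2];

	return htonl(host);
}

/* Byte-placement variant: write the wire bytes directly, no swap needed. */
static uint32_t
vni_bytes(const uint8_t vni[3])
{
	const uint8_t wire[4] = { 0, vni[0], vni[1], vni[2] };
	uint32_t be;

	memcpy(&be, wire, sizeof(be));
	return be;
}
```

memcpy() is used here instead of the union only to keep the sketch neutral
about type punning; compilers typically reduce it to the same load.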
> 
> > 
> > > +}
> > > +
> > > +/**
> > >   * Prepare a flow object for Linux TC flower. It calculates the maximum size of
> > >   * memory required, allocates the memory, initializes Netlink message headers
> > >   * and set unique TC message handle.
> > > @@ -2323,22 +2632,54 @@ struct pedit_parser {
> > >  	struct mlx5_flow *dev_flow;
> > >  	struct nlmsghdr *nlh;
> > >  	struct tcmsg *tcm;
> > > +	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
> > > +	uint8_t *sp, *tun = NULL;
> > >
> > >  	size += flow_tcf_get_items_and_size(attr, items, item_flags);
> > > -	size += flow_tcf_get_actions_and_size(actions, action_flags);
> > > -	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
> > > +	size += flow_tcf_get_actions_and_size(actions, action_flags, &encap);
> > > +	dev_flow = rte_zmalloc(__func__, size,
> > > +			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
> > > +				(size_t)MNL_ALIGNTO));
> > 
> > Why RTE_MAX between the two? Note that it is alignment for start address
> > of the memory and the minimum alignment is cacheline size. On x86, non-
> > zero value less than 64 will have same result as 64.
> 
> OK. Thanks for note.
> The structure alignments are not expected to exceed the cache line size.
> So, just specify zero?
> > 
> > >  	if (!dev_flow) {
> > >  		rte_flow_error_set(error, ENOMEM,
> > >  				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > >  				   "not enough memory to create E-Switch flow");
> > >  		return NULL;
> > >  	}
> > > -	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
> > > +	sp = (uint8_t *)(dev_flow + 1);
> > > +	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
> > > +		tun = sp;
> > > +		sp += RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +			MNL_ALIGNTO);
> > 
> > And why should it be aligned? 
> 
> Netlink message should be aligned. It follows the mlx5_flow_tcf_vxlan_encap,
> that's why the pointer is aligned.

Not true. There's no requirement for nl msg buffer alignment. MNL_ALIGNTO is
mainly for size alignment. For example, check out the source code of
mnl_nlmsg_put_header(void *buf). There's no requirement to align the start
address of buf. But the size of every entry (hdr, attr ...) should be aligned to
MNL_ALIGNTO(4).
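
To illustrate the size rule, here is a minimal sketch that does not depend on
libmnl. The sketch_* names are made up for this example; the rounding mirrors
what MNL_ALIGN() does with MNL_ALIGNTO equal to 4, and the 4-byte attribute
header is assumed to match sizeof(struct nlattr).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for MNL_ALIGNTO (4 bytes). */
#define SKETCH_ALIGNTO ((size_t)4)
/* Assumed attribute header size: 2-byte length + 2-byte type. */
#define SKETCH_ATTR_HDRLEN ((size_t)4)

/* Round a length up to the next multiple of SKETCH_ALIGNTO. */
size_t
sketch_align(size_t len)
{
	return (len + SKETCH_ALIGNTO - 1) & ~(SKETCH_ALIGNTO - 1);
}

/* Bytes one attribute with `payload` data bytes occupies in a message. */
size_t
sketch_attr_space(size_t payload)
{
	size_t space = sketch_align(SKETCH_ATTR_HDRLEN + payload);

	/* Only the size is padded; nothing is said about the buffer address. */
	assert(space % SKETCH_ALIGNTO == 0);
	return space;
}
```

With these helpers a u8 or u16 attribute takes 8 bytes and a 16-byte IPv6
address takes 20, regardless of where the buffer itself starts.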

> 
> As the size of dev_flow might not be aligned, it
> > is meaningless, isn't it? If you think it must be aligned for better performance
> > (not much anyway), you can use __rte_aligned(MNL_ALIGNTO) on the struct
> Hm. Where we can use __rte_aligned? Could you clarify, please.

For example,

struct mlx5_flow_tcf_tunnel_hdr {
	uint32_t type; /**< Tunnel action type. */
	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
	unsigned int ifindex_org; /**< Original dst/src interface */
	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
} __rte_aligned(MNL_ALIGNTO);

A good example is the struct rte_mbuf. If this attribute is used, the size of
the struct will be aligned to the value.

If you still want to make the nl msg aligned,

	dev_flow = rte_zmalloc(..., MNL_ALIGNTO); /* anyway cacheline aligned. */
	tun = RTE_PTR_ALIGN(dev_flow + 1, MNL_ALIGNTO);
	nlh = mnl_nlmsg_put_header(tun);

with adding '__rte_aligned(MNL_ALIGNTO)' to struct mlx5_flow_tcf_vxlan_encap/decap.

Then, nlh will be aligned. You should make sure size is correctly calculated.
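
As a rough sketch of the size calculation, assuming the allocation itself
starts on an address that is at least `align`-aligned (rte_zmalloc() gives
cacheline alignment): the helper and struct names below are hypothetical and
only show how the offsets and the total size relate.

```c
#include <assert.h>
#include <stddef.h>

/* Offsets of each area inside a single allocation (illustrative only). */
struct sketch_layout {
	size_t tun_off; /* Start of the tunnel (encap/decap) structure. */
	size_t nlh_off; /* Start of the Netlink message. */
	size_t total;   /* Bytes to request from the allocator. */
};

/* Round `v` up to a multiple of `align`, which must be a power of two. */
size_t
sketch_align_ceil(size_t v, size_t align)
{
	assert(align && !(align & (align - 1)));
	return (v + align - 1) & ~(align - 1);
}

/* Compute where the tunnel struct and the message go after the flow struct. */
struct sketch_layout
sketch_flow_layout(size_t flow_sz, size_t tun_sz, size_t nlmsg_sz,
		   size_t align)
{
	struct sketch_layout l;

	l.tun_off = sketch_align_ceil(flow_sz, align);
	l.nlh_off = sketch_align_ceil(l.tun_off + tun_sz, align);
	l.total = l.nlh_off + nlmsg_sz;
	return l;
}
```

The point is that any padding added when aligning the pointers has to be
included in the allocated size, and has to be excluded again from whatever is
recorded as the space available for the message.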

> 
> > definition but not for mlx5_flow (it's not only for tcf, have to do it manually).
> > 
> > > +		size -= RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > +			MNL_ALIGNTO);
> > 
> > Don't you have to subtract sizeof(struct mlx5_flow) as well? But like I
> > mentioned, if '.nlsize' below isn't needed, you don't need to have this
> > calculation either.
> Yes, it is a bug. Should be fixed. Thank you.
> Let's discuss whether we can keep the nlsize under NDEBUG switch.

I agreed on using NDEBUG for it.

> 
> > 
> > > +		encap.hdr.type =
> > MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
> > > +		memcpy(tun, &encap,
> > > +		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
> > > +	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > > +		tun = sp;
> > > +		sp += RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +			MNL_ALIGNTO);
> > > +		size -= RTE_ALIGN_CEIL
> > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > +			MNL_ALIGNTO);
> > > +		encap.hdr.type =
> > MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
> > > +		memcpy(tun, &encap,
> > > +		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
> > > +	}
> > > +	nlh = mnl_nlmsg_put_header(sp);
> > >  	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > >  	*dev_flow = (struct mlx5_flow){
> > >  		.tcf = (struct mlx5_flow_tcf){
> > > +			.nlsize = size,
> > >  			.nlh = nlh,
> > >  			.tcm = tcm,
> > > +			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
> > > +			.item_flags = *item_flags,
> > > +			.action_flags = *action_flags,
> > >  		},
> > >  	};
> > >  	/*
[...]
> > > @@ -2827,6 +3268,76 @@ struct pedit_parser {
> > >  					(na_vlan_priority) =
> > >  					conf.of_set_vlan_pcp->vlan_pcp;
> > >  			}
> > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > +			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > +			assert(decap.vxlan);
> > > +			assert(dev_flow->tcf.tunnel);
> > > +			dev_flow->tcf.tunnel->ifindex_ptr
> > > +				= (unsigned int *)&tcm->tcm_ifindex;
> > > +			na_act_index =
> > > +				mnl_attr_nest_start(nlh,
> > na_act_index_cur++);
> > > +			assert(na_act_index);
> > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > "tunnel_key");
> > > +			na_act = mnl_attr_nest_start(nlh,
> > TCA_ACT_OPTIONS);
> > > +			assert(na_act);
> > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > +				sizeof(struct tc_tunnel_key),
> > > +				&(struct tc_tunnel_key){
> > > +					.action = TC_ACT_PIPE,
> > > +					.t_action =
> > TCA_TUNNEL_KEY_ACT_RELEASE,
> > > +					});
> > > +			mnl_attr_nest_end(nlh, na_act);
> > > +			mnl_attr_nest_end(nlh, na_act_index);
> > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > +			break;
> > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > +			assert(encap.vxlan);
> > > +			na_act_index =
> > > +				mnl_attr_nest_start(nlh,
> > na_act_index_cur++);
> > > +			assert(na_act_index);
> > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > "tunnel_key");
> > > +			na_act = mnl_attr_nest_start(nlh,
> > TCA_ACT_OPTIONS);
> > > +			assert(na_act);
> > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > +				sizeof(struct tc_tunnel_key),
> > > +				&(struct tc_tunnel_key){
> > > +					.action = TC_ACT_PIPE,
> > > +					.t_action =
> > TCA_TUNNEL_KEY_ACT_SET,
> > > +					});
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_UDP_DST)
> > > +				mnl_attr_put_u16(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_DST_PORT,
> > > +					 encap.vxlan->udp.dst);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
> > > +				mnl_attr_put_u32(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
> > > +					 encap.vxlan->ipv4.src);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> > > +				mnl_attr_put_u32(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
> > > +					 encap.vxlan->ipv4.dst);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
> > > +				mnl_attr_put(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
> > > +					 sizeof(encap.vxlan->ipv6.src),
> > > +					 &encap.vxlan->ipv6.src);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> > > +				mnl_attr_put(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
> > > +					 sizeof(encap.vxlan->ipv6.dst),
> > > +					 &encap.vxlan->ipv6.dst);
> > > +			if (encap.vxlan->mask &
> > MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
> > > +				mnl_attr_put_u32(nlh,
> > > +					 TCA_TUNNEL_KEY_ENC_KEY_ID,
> > > +					 vxlan_vni_as_be32
> > > +						(encap.vxlan->vxlan.vni));
> > > +#ifdef TCA_TUNNEL_KEY_NO_CSUM
> > > +			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM, 0);
> > > +#endif
> > 
> > TCA_TUNNEL_KEY_NO_CSUM is anyway defined like others, then why do
> > you treat it differently with #ifdef/#endif?
> 
> As it was found it is not defined on old kernels, on some our CI machines
> compilation errors occurred.

In your first patch, TCA_TUNNEL_KEY_NO_CSUM is defined if there isn't
HAVE_TC_ACT_TUNNEL_KEY. Actually I'm wondering why it is different from
HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT. It looks like the following is needed -
HAVE_TCA_TUNNEL_KEY_NO_CSUM ??


	#ifdef HAVE_TC_ACT_TUNNEL_KEY

	#include <linux/tc_act/tc_tunnel_key.h>

	#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
	#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
	#endif

	#ifndef HAVE_TCA_TUNNEL_KEY_NO_CSUM
	#define TCA_TUNNEL_KEY_NO_CSUM 10
	#endif

	#else /* HAVE_TC_ACT_TUNNEL_KEY */


Thanks,
Yongseok


* Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-25 20:21       ` Slava Ovsiienko
@ 2018-10-26  6:25         ` Yongseok Koh
  2018-10-26  9:35           ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-26  6:25 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Thursday, October 25, 2018 3:28
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> > 
> > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > > VXLAN interfaces are dynamically created for each local UDP port of
> > > outer networks and then used as targets for TC "flower" filters in
> > > order to perform encapsulation. These VXLAN interfaces are
> > > system-wide, the only one device with given UDP port can exist in the
> > > system (the attempt of creating another device with the same UDP local
> > > port returns EEXIST), so PMD should support the shared device
> > > instances database for PMD instances. These VXLAN implicitly created
> > > devices are called VTEPs (Virtual Tunnel End Points).
> > >
> > > Creation of the VTEP occurs at the moment of rule applying. The link
> > > is set up, root ingress qdisc is also initialized.
> > >
> > > Encapsulation VTEPs are created on per port basis, the single VTEP is
> > > attached to the outer interface and is shared for all encapsulation
> > > rules on this interface. The source UDP port is automatically selected
> > > in range 30000-60000.
> > >
> > > For decapsulation one VTEP is created per every unique UDP local port
> > > to accept tunnel traffic. The name of created VTEP consists of prefix
> > > "vmlx_" and the number of UDP port in decimal digits without leading
> > > zeros (vmlx_4789). The VTEP can be created in the system in advance,
> > > before launching the application, which allows sharing UDP ports
> > > between primary and secondary processes.
> > >
> > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > ---
> > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > ++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > index d6840d5..efa9c3b 100644
> > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > >  	return -err;
> > >  }
> > >
> > > +/* VTEP device list is shared between PMD port instances. */
> > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> > > +static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > 
> > What's the reason for choosing pthread_mutex instead of rte_*_lock?
> 
> The sharing this database for secondary processes?

The static variable isn't shared with secondary processes. But you can leave it
as is.

> > > +
> > > +/**
> > > + * Deletes VTEP network device.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] vtep
> > > + *   Object represinting the network device to delete. Memory
> > > + *   allocated for this object is freed by routine.
> > > + */
> > > +static void
> > > +flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
> > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > +	struct nlmsghdr *nlh;
> > > +	struct ifinfomsg *ifm;
> > > +	alignas(struct nlmsghdr)
> > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > +	int ret;
> > > +
> > > +	assert(!vtep->refcnt);
> > > +	if (vtep->created && vtep->ifindex) {
> > 
> > First of all vtep->created seems of no use. It is introduced to select the error
> > message in flow_tcf_create_iface(). I don't see any necessity to distinguish
> > between 'vtep is allocated by rte_malloc()' and 'vtep is created in kernel'.
> 
> created flag indicates the iface is created by our code.
> The VXLAN decap devices must have the specified UDP port, we can not create
> multiple VXLAN devices with the same UDP port - EEXIST is returned. So, we have
> to share device. One option is create device before DPDK application launch and use
> these pre-created devices. In this case created flag is not set and VXLAN device
> is not reinitialized, and not deleted.

I can't see any code to use pre-created device (created even before dpdk app
launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment in
[7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on
initialization. Your comment sounds conflicting and confusing.

> > And why do you need to check vtep->ifindex as well? If vtep is created in
> > kernel and its ifindex isn't set, that should be an error which had to be handled
> > in flow_tcf_create_iface(). Such a vtep shouldn't exist.
> Yes, if we did not get ifindex of device - vtep is not created, error returned.
> We just can not operate w/o ifindex.

I know ifindex is needed but my question was checking vtep->ifindex here looked
redundant/unnecessary. But as you agreed on having create/get/release_iface(),
it doesn't matter much.

> > Also, the refcnt management is a bit strange. Please put an abstraction by
> > adding create_iface(), get_iface() and release_iface(). In get_iface(),
> > vtep->refcnt should be incremented. And in release_iface(), it
> > decreases the
> OK. Good proposal. I'll refactor the code.
> 
> > refcnt and if it reaches to zero, the iface can be removed. create_iface() will
> > set the refcnt to 1. And if you refer to mlx5_hrxq_get(), it even does
> > searching the list not by repeating the same lookup code here and there.
> > That will make your code much simpler.
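
A minimal sketch of the create/get/release abstraction described above. The
descriptor, the list and the sketch_* function names are hypothetical and much
simpler than the PMD's real structures; plain calloc()/free() stand in for
rte_zmalloc()/rte_free(), and no Netlink calls or locking are shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical, simplified VTEP descriptor. */
struct sketch_vtep {
	LIST_ENTRY(sketch_vtep) next;
	uint16_t port;       /* UDP port identifying the device. */
	unsigned int refcnt; /* Number of flows using the device. */
};

static LIST_HEAD(, sketch_vtep) sketch_vtep_list =
	LIST_HEAD_INITIALIZER(sketch_vtep_list);

/* Allocate a descriptor, set refcnt to 1 and link it into the list. */
struct sketch_vtep *
sketch_vtep_create(uint16_t port)
{
	struct sketch_vtep *vtep = calloc(1, sizeof(*vtep));

	if (!vtep)
		return NULL;
	vtep->port = port;
	vtep->refcnt = 1;
	LIST_INSERT_HEAD(&sketch_vtep_list, vtep, next);
	return vtep;
}

/* Single place for the lookup; takes a reference when a match is found. */
struct sketch_vtep *
sketch_vtep_get(uint16_t port)
{
	struct sketch_vtep *vtep;

	LIST_FOREACH(vtep, &sketch_vtep_list, next) {
		if (vtep->port == port) {
			++vtep->refcnt;
			return vtep;
		}
	}
	return NULL;
}

/* Drop a reference; returns 1 if the descriptor was destroyed, else 0. */
int
sketch_vtep_release(struct sketch_vtep *vtep)
{
	assert(vtep->refcnt);
	if (--vtep->refcnt)
		return 0;
	LIST_REMOVE(vtep, next);
	free(vtep);
	return 1;
}
```

Callers would then try get() first and fall back to create(), so the list walk
and the refcnt arithmetic each live in exactly one function.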
> > 
> > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > +		nlh = mnl_nlmsg_put_header(buf);
> > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +		ifm->ifi_family = AF_UNSPEC;
> > > +		ifm->ifi_index = vtep->ifindex;
> > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +		if (ret)
> > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> > "
> > > +					 "encap/decap ifindex %u",
> > > +					 ifm->ifi_index);
> > > +	}
> > > +	rte_free(vtep);
> > > +}
> > > +
> > > +/**
> > > + * Creates VTEP network device.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Outer interface to attach new-created VXLAN device
> > > + *   If zero the VXLAN device will not be attached to any device.
> > > + * @param[in] port
> > > + *   UDP port of created VTEP device.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + *
> > > + * @return
> > > + * Pointer to created device structure on success, NULL otherwise
> > > + * and rte_errno is set.
> > > + */
> > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > 
> > Why negative(ifndef) first instead of positive(ifdef)?
> Hm. Did I miss the rule. Positive #ifdef first? OK.

No concrete rule but if there's no specific reason, it would be better to start
from ifdef.

> > > +static struct mlx5_flow_tcf_vtep*
> > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> > > +		      unsigned int ifouter __rte_unused,
> > > +		      uint16_t port __rte_unused,
> > > +		      struct rte_flow_error *error) {
> > > +	rte_flow_error_set(error, ENOTSUP,
> > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > +			 "netlink: failed to create VTEP, "
> > > +			 "VXLAN metadat is not supported by kernel");
> > 
> > Typo.
> 
> OK.  "metadata are not supported".
> > 
> > > +	return NULL;
> > > +}
> > > +#else
> > > +static struct mlx5_flow_tcf_vtep*
> > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > 
> > How about adding 'vtep'? It sounds vague - creating a general interface.
> > E.g., flow_tcf_create_vtep_iface()?
> 
> OK.
> 
> > 
> > > +		      unsigned int ifouter,
> > > +		      uint16_t port, struct rte_flow_error *error) {
> > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > +	struct nlmsghdr *nlh;
> > > +	struct ifinfomsg *ifm;
> > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > +	alignas(struct nlmsghdr)
> > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > 
> > Use a macro for '128'. Can't know the meaning.
> OK. I think we should calculate the buffer size explicitly.
> 
> > 
> > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > +		       SZ_NLATTR_NEST * 2 +
> > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > +	struct nlattr *na_info;
> > > +	struct nlattr *na_vxlan;
> > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > 
> > Use rte_cpu_to_be_*() instead.
> 
> Yes, I'll recheck the whole code for this issue.
> 
> > 
> > > +	int ret;
> > > +
> > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > +	if (!vtep) {
> > > +		rte_flow_error_set
> > > +			(error, ENOMEM,
> > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > +			 NULL, "unadble to allocate memory for VTEP desc");
> > > +		return NULL;
> > > +	}
> > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > +			.refcnt = 0,
> > > +			.port = port,
> > > +			.created = 0,
> > > +			.ifouter = 0,
> > > +			.ifindex = 0,
> > > +			.local = LIST_HEAD_INITIALIZER(),
> > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > +	};
> > > +	memset(buf, 0, sizeof(buf));
> > > +	nlh = mnl_nlmsg_put_header(buf);
> > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > NLM_F_EXCL;
> > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +	ifm->ifi_family = AF_UNSPEC;
> > > +	ifm->ifi_type = 0;
> > > +	ifm->ifi_index = 0;
> > > +	ifm->ifi_flags = IFF_UP;
> > > +	ifm->ifi_change = 0xffffffff;
> > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> > port);
> > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > +	assert(na_info);
> > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > +	if (ifouter)
> > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > +	assert(na_vxlan);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > +	mnl_attr_nest_end(nlh, na_info);
> > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +	if (ret)
> > > +		DRV_LOG(WARNING,
> > > +			"netlink: VTEP %s create failure (%d)",
> > > +			name, rte_errno);
> > > +	else
> > > +		vtep->created = 1;
> > 
> > Flow of code here isn't smooth, thus could be error-prone. Most of all, I
> > don't like ret has multiple meanings. ret should be return value but you are
> > using it to store ifindex.
> > 
> > > +	if (ret && ifouter)
> > > +		ret = 0;
> > > +	else
> > > +		ret = if_nametoindex(name);
> > 
> > If vtep isn't created and ifouter is set, then skip init below, which means, if
> 
> ifouter is set for VXLAN encap devices. They should be attached to ifouter
> and can not be shared. So, if ifouter I set - we do not use the precreated/existing
> VXLAN devices. We have to create our own not shared device.

In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows. Do
you mean it isn't shared between different outer ifaces? If so, that's for sure.

> > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > But why do you want to try to call this API even if it failed to create vtep?
> > Let's not make code flow convoluted even though it logically works. Let's
> > make it straightforward.
> > 
> > > +	if (ret) {
> > > +		vtep->ifindex = ret;
> > > +		vtep->ifouter = ifouter;
> > > +		memset(buf, 0, sizeof(buf));
> > > +		nlh = mnl_nlmsg_put_header(buf);
> > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +		ifm->ifi_family = AF_UNSPEC;
> > > +		ifm->ifi_type = 0;
> > > +		ifm->ifi_index = vtep->ifindex;
> > > +		ifm->ifi_flags = IFF_UP;
> > > +		ifm->ifi_change = IFF_UP;
> > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +		if (ret) {
> > > +			DRV_LOG(WARNING,
> > > +				"netlink: VTEP %s set link up failure (%d)",
> > > +				name, rte_errno);
> > > +			rte_free(vtep);
> > > +			rte_flow_error_set
> > > +				(error, -errno,
> > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				 "netlink: failed to set VTEP link up");
> > > +			vtep = NULL;
> > > +		} else {
> > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > +			if (ret)
> > > +				DRV_LOG(WARNING,
> > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > +		}
> > > +	} else {
> > > +		DRV_LOG(WARNING,
> > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > +		rte_flow_error_set
> > > +			(error, -errno,
> > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > +			 ret = 1;
> > 
> > If it fails to create a vtep above, it will print out two warning messages and
> > one rte_flow_error message. And it even selects message to print between
> > two?
> > And there's another info msg at the end even in case of failure. Do you really
> > want to do this even with manipulating ret to change code path?  Not a good
> > practice.
> > 
> > Usually, code path should be straightforward for successful path and for
> > errors/failures, return immediately or use 'goto' if there's need for cleanup.
> > 
> > Please refactor entire function.
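
One possible shape for such a refactoring, as a sketch only: the success path
reads top to bottom and every failure jumps to a single cleanup label. The
step callbacks and all sketch_* names are hypothetical stand-ins for the
Netlink create and link-up requests; they return 0 or a negative errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical interface descriptor. */
struct sketch_iface {
	int ifindex;
	int up;
};

/* Step callback: 0 on success, negative errno value on failure. */
typedef int (*sketch_step_fn)(struct sketch_iface *ifc);

/* Stand-ins for the real operations, used only to exercise the sketch. */
int sketch_step_create(struct sketch_iface *ifc) { ifc->ifindex = 7; return 0; }
int sketch_step_link_up(struct sketch_iface *ifc) { ifc->up = 1; return 0; }
int sketch_step_fail(struct sketch_iface *ifc) { (void)ifc; return -EIO; }

/* Returns the new descriptor, or NULL with a negative errno in *err. */
struct sketch_iface *
sketch_iface_create(sketch_step_fn create, sketch_step_fn link_up, int *err)
{
	struct sketch_iface *ifc = calloc(1, sizeof(*ifc));
	int ret;

	assert(err);
	if (!ifc) {
		*err = -ENOMEM;
		return NULL;
	}
	ret = create(ifc);
	if (ret)
		goto error;
	ret = link_up(ifc);
	if (ret)
		goto error;
	*err = 0;
	return ifc;
error:
	/* Single cleanup point; ret keeps its one meaning throughout. */
	free(ifc);
	*err = ret;
	return NULL;
}
```

Here ret only ever holds a status code, and exactly one error is reported per
failure.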
> 
> I think I'll split it in two ones - for attached and potentially shared ifaces.
> > 
> > > +	}
> > > +	if (ret) {
> > > +		flow_tcf_delete_iface(tcf, vtep);
> > > +		vtep = NULL;
> > > +	}
> > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> > "error");
> > > +	return vtep;
> > > +}
> > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > +
> > > +/**
> > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > + * In order to share the UDP port within the other interfaces the
> > > + * VXLAN device created as not attached to any interface (if created).
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > 
> > Return negative errno in case of failure like others.
> 
> Anyway, we have to return an index. If we do not return it as function result
> we will need to provide some extra pointing parameter, it complicates the code.

You misunderstood it. See what I wrote below. The function still returns the
index but in case of error, make it return negative errno instead of zero.
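
A small sketch of that convention, with made-up names: the function returns a
signed int, positive values are interface indexes, and negative values are
errno codes. Note that this implies a signed return type rather than
unsigned int.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical lookup standing in for VTEP creation: returns the interface
 * index (> 0) stored in `table[slot]`, or a negative errno value.
 */
int
sketch_get_ifindex(const unsigned int *table, unsigned int n,
		   unsigned int slot)
{
	if (slot >= n)
		return -EINVAL;
	if (!table[slot])
		return -ENODEV;
	return (int)table[slot];
}

/* Caller side: propagate the error as is, otherwise use the index. */
int
sketch_apply(const unsigned int *table, unsigned int n, unsigned int slot,
	     unsigned int *ifindex)
{
	int ret = sketch_get_ifindex(table, n, slot);

	assert(ifindex);
	if (ret < 0)
		return ret;
	*ifindex = (unsigned int)ret;
	return 0;
}
```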

> > 
> >  *   Interface index on success, a negative errno value otherwise and
> > rte_errno is set.
> > 
> > > + */
> > > +static unsigned int
> > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			   struct mlx5_flow *dev_flow,
> > > +			   struct rte_flow_error *error)
> > > +{
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > +
> > > +	vtep = NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->port == port) {
> > > +			vtep = vlst;
> > > +			break;
> > > +		}
> > > +	}
> > 
> > You just need one variable.
> 
> Yes. There is a long story, I forgot to revert code to one variable after debugging.
> > 
> > 	struct mlx5_flow_tcf_vtep *vtep;
> > 
> > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > 		if (vtep->port == port)
> > 			break;
> > 	}
> > 
> > > +	if (!vtep) {
> > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > +		if (vtep)
> > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > +	} else {
> > > +		if (vtep->ifouter) {
> > > +			rte_flow_error_set(error, -errno,
> > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				"Failed to create decap VTEP, attached "
> > > +				"device with the same UDP port exists");
> > > +				vtep = NULL;
> > 
> > Making vtep null to skip the following code?
> 
> Yes. To avoid multiple return operators in code.

It's okay to have multiple returns. Why not?

> > Please merge the two same
> > if/else and make the code path straightforward. And which errno do you
> > expect here?
> > Should it be set EEXIST instead?
> Not always. Netlink returns the code. 

No, that's not my point. Your code above sets errno instead of rte_errno or
EEXIST.

	} else {
		if (vtep->ifouter) {
			rte_flow_error_set(error, -errno,

Which one sets this errno? Here, rte_errno is set because the matched vtep can't
be used as it already has an outer iface attached (the error message isn't clear,
please reword it too). I thought this should be EEXIST, but you set rte_errno
from errno, and errno isn't valid at this point.

> 
> > 
> > > +		}
> > > +	}
> > > +	if (vtep) {
> > > +		vtep->refcnt++;
> > > +		assert(vtep->ifindex);
> > > +		return vtep->ifindex;
> > > +	} else {
> > > +		return 0;
> > > +	}
> > 
> > Why repeating same if/else?
> > 
> > 
> > This is my suggestion but if you take my suggestion to have
> > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> Agree.
> 
> > 
> > {
> > 	struct mlx5_flow_tcf_vtep *vtep;
> > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > 
> > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > 		if (vtep->port == port)
> > 			break;
> > 	}
> > 	if (vtep && vtep->ifouter)
> > 		return rte_flow_error_set(... EEXIST ...);
> > 	else if (vtep) {
> > 		++vtep->refcnt;
> > 	} else {
> > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > 		if (!vtep)
> > 			return rte_flow_error_set(...);
> > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > 	}
> > 	assert(vtep->ifindex);
> > 	return vtep->ifindex;
> > }
> > 
> > 
> > > +}
> > > +
> > > +/**
> > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Network interface index to attach VXLAN encap device to.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > + */
> > > +static unsigned int
> > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifouter,
> > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > +			    struct rte_flow_error *error)
> > > +{
> > > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +
> > > +	assert(ifouter);
> > > +	/* Look whether the attached VTEP for encap is created. */
> > > +	vtep = NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->ifouter == ifouter) {
> > > +			vtep = vlst;
> > > +			break;
> > > +		}
> > > +	}
> > 
> > Same here.
> > 
> > > +	if (!vtep) {
> > > +		uint16_t pcnt;
> > > +
> > > +		/* Not found, we should create the new attached VTEP. */
> > > +/*
> > > + * TODO: not implemented yet
> > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > 
> > Personal note is not appropriate even though it is removed in the following
> > patch.
> > 
> > > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> > pcnt++) {
> > > +			encap_port++;
> > > +			/* Wraparound the UDP port index. */
> > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> > ||
> > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > +				encap_port =
> > MLX5_VXLAN_PORT_RANGE_MIN;
> > > +			/* Check whether UDP port is in already in use. */
> > > +			vtep = NULL;
> > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +				if (vlst->port == encap_port) {
> > > +					vtep = vlst;
> > > +					break;
> > > +				}
> > > +			}
> > 
> > If you want to find out an empty port number, you can use rte_bitmap
> > instead of repeating searching the entire list for all possible port numbers.
> 
> We do not expect too many VXLAN devices have been created. bitmap.

+1, valid point.
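
For reference only, this is roughly what the bitmap alternative could look
like in plain C (not rte_bitmap). The 30000-60000 range is taken from the
commit message; the sketch_* names and the round-robin starting point are
assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_PORT_MIN 30000u
#define SKETCH_PORT_MAX 60000u
#define SKETCH_PORT_CNT (SKETCH_PORT_MAX - SKETCH_PORT_MIN + 1)
#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* One bit per port in the range; a set bit means "in use". */
static unsigned long sketch_used[(SKETCH_PORT_CNT + SKETCH_WORD_BITS - 1) /
				 SKETCH_WORD_BITS];
/* Index of the last port handed out, so the search continues after it. */
static uint32_t sketch_last = SKETCH_PORT_CNT - 1;

/* Return a free port and mark it used, or 0 if the range is exhausted. */
uint16_t
sketch_port_alloc(void)
{
	uint32_t n;

	for (n = 0; n < SKETCH_PORT_CNT; n++) {
		uint32_t idx = (sketch_last + 1 + n) % SKETCH_PORT_CNT;
		unsigned long bit = 1UL << (idx % SKETCH_WORD_BITS);
		unsigned long *word = &sketch_used[idx / SKETCH_WORD_BITS];

		if (!(*word & bit)) {
			*word |= bit;
			sketch_last = idx;
			return (uint16_t)(SKETCH_PORT_MIN + idx);
		}
	}
	return 0;
}

/* Mark a previously allocated port as free again. */
void
sketch_port_free(uint16_t port)
{
	uint32_t idx;

	assert(port >= SKETCH_PORT_MIN && port <= SKETCH_PORT_MAX);
	idx = port - SKETCH_PORT_MIN;
	sketch_used[idx / SKETCH_WORD_BITS] &=
		~(1UL << (idx % SKETCH_WORD_BITS));
}
```

With only a handful of VTEPs expected, the list scan in the patch costs about
the same, which is why keeping it is acceptable.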

> > > +			if (vtep) {
> > > +				vtep = NULL;
> > > +				continue;
> > > +			}
> > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > +						     encap_port, error);
> > > +			if (vtep) {
> > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > next);
> > > +				break;
> > > +			}
> > > +			if (rte_errno != EEXIST)
> > > +				break;
> > > +		}
> > > +	}
> > > +	if (!vtep)
> > > +		return 0;
> > > +	vtep->refcnt++;
> > > +	assert(vtep->ifindex);
> > > +	return vtep->ifindex;
> > 
> > Please refactor this func according to what I suggested for
> > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > 
> > > +}
> > > +
> > > +/**
> > > + * Creates target interface index for tunneling of any type.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Network interface index to attach VXLAN encap device to.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > 
> >  *   Interface index on success, a negative errno value otherwise and
> >  *   rte_errno is set.
> > 
> > > + */
> > > +static unsigned int
> > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifouter,
> > > +			    struct mlx5_flow *dev_flow,
> > > +			    struct rte_flow_error *error)
> > > +{
> > > +	unsigned int ret;
> > > +
> > > +	assert(dev_flow->tcf.tunnel);
> > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > +	switch (dev_flow->tcf.tunnel->type) {
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > +						 dev_flow, error);
> > > +		break;
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > +		break;
> > > +	default:
> > > +		rte_flow_error_set(error, ENOTSUP,
> > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				"unsupported tunnel type");
> > > +		ret = 0;
> > > +		break;
> > > +	}
> > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * Deletes tunneling interface by UDP port.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifindex
> > > + *   Network interface index of VXLAN device.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + */
> > > +static void
> > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifindex,
> > > +			    struct mlx5_flow *dev_flow)
> > > +{
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +
> > > +	assert(dev_flow->tcf.tunnel);
> > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > +	vtep = NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->ifindex == ifindex) {
> > > +			vtep = vlst;
> > > +			break;
> > > +		}
> > > +	}
> > 
> > It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel instead
> > of ifindex_tun which is same as vtep->ifindex like the assertion below. Then,
> > this lookup can be skipped.
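
A sketch of the idea, using hypothetical simplified structures: the tunnel
header keeps a pointer to its VTEP, so releasing the flow's reference needs no
list walk. Locking and the actual device removal are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified VTEP descriptor. */
struct sketch_vtep_dev {
	unsigned int ifindex;
	unsigned int refcnt;
};

/* Hypothetical tunnel header kept with each flow. */
struct sketch_tunnel {
	unsigned int type;
	struct sketch_vtep_dev *vtep; /* Replaces a bare tunnel ifindex. */
};

/* Drop the flow's reference; returns 1 when the device became unused. */
int
sketch_tunnel_detach(struct sketch_tunnel *tun)
{
	struct sketch_vtep_dev *vtep = tun->vtep;

	if (!vtep)
		return 0;
	assert(vtep->refcnt);
	tun->vtep = NULL;
	return --vtep->refcnt == 0;
}
```

The interface index is still reachable as tun->vtep->ifindex whenever the
Netlink message needs it.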
> 
> OK. Good optimization.
> 
> > 
> > > +	if (!vtep) {
> > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > +		goto exit;
> > > +	}
> > > +	switch (dev_flow->tcf.tunnel->type) {
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > +		break;
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > +/*
> > > + * TODO: Remove the encap ancillary rules first.
> > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> > 
> > Is it a personal note? Please remove.
> OK.
> 
> > 
> > > +		break;
> > > +	default:
> > > +		assert(false);
> > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > +		break;
> > > +	}
> > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > +	assert(vtep->refcnt);
> > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > +		LIST_REMOVE(vtep, next);
> > > +		flow_tcf_delete_iface(tcf, vtep);
> > > +	}
> > > +exit:
> > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > +}
> > > +
> > >  /**
> > >   * Apply flow to E-Switch by sending Netlink message.
> > >   *
> > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > >  	       struct rte_flow_error *error)  {
> > >  	struct priv *priv = dev->data->dev_private;
> > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > >  	struct mlx5_flow *dev_flow;
> > >  	struct nlmsghdr *nlh;
> > > +	int ret;
> > >
> > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > >  	/* E-Switch flow can't be expanded. */
> > >  	assert(!LIST_NEXT(dev_flow, next));
> > > +	if (dev_flow->tcf.applied)
> > > +		return 0;
> > >  	nlh = dev_flow->tcf.nlh;
> > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > NLM_F_EXCL;
> > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > +	if (dev_flow->tcf.tunnel) {
> > > +		/*
> > > +		 * Replace the interface index, target for
> > > +		 * encapsulation, source for decapsulation.
> > > +		 */
> > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		/* Create actual VTEP device when rule is being applied. */
> > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > +					dev_flow, error);
> > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > +			return -rte_errno;
> > > +		dev_flow->tcf.tunnel->ifindex_org
> > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > +	}
> > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +	if (dev_flow->tcf.tunnel) {
> > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > 
> > ifindex_org looks like temporary storage in this code. And this kind of hassle
> > (replace/restore) is there because you took the ifindex from the netlink
> > message. Why don't you just have
> > 
> > struct mlx5_flow_tcf_tunnel_hdr {
> > 	uint32_t type; /**< Tunnel action type. */
> > 	unsigned int ifindex; /**< Original dst/src interface */
> > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message. */
> > };
> > 
> > and don't change ifindex?
> 
> I propose to use a local variable for ifindex_org and not keep it
> in the structure. *ifindex_ptr will be kept.

Well, you still have to restore the ifindex whenever sending the nl msg. Above
all, the ifindex slot in the nl msg isn't the right place to store the device
ifindex. It should hold the vtep ifindex, but it only temporarily keeps the
device ifindex until the vtep is created/found.
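
A minimal sketch of that layout, assuming hypothetical stand-in names
(struct vtep, tunnel_bind_vtep, tunnel_unbind_vtep) rather than the actual
PMD code: the original ifindex lives in the tunnel header, the message slot
only ever receives the vtep ifindex, so nothing has to be restored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared VTEP device record. */
struct vtep {
	unsigned int ifindex; /* Interface index of the VXLAN device. */
	unsigned int refcnt;  /* Number of rules using the device. */
};

/* Tunnel header along the lines of the layout proposed in the review. */
struct tunnel_hdr {
	uint32_t type;                   /* Tunnel action type. */
	unsigned int ifindex;            /* Original dst/src interface. */
	struct vtep *vtep;               /* Tunnel endpoint device. */
	unsigned int *nlmsg_ifindex_ptr; /* ifindex slot in the message. */
};

/* Point the message at the VTEP; the original ifindex stays in the header. */
static void
tunnel_bind_vtep(struct tunnel_hdr *tun, struct vtep *vtep)
{
	assert(tun != NULL && vtep != NULL);
	assert(tun->nlmsg_ifindex_ptr != NULL);
	tun->vtep = vtep;
	vtep->refcnt++;
	*tun->nlmsg_ifindex_ptr = vtep->ifindex;
}

/* Drop the reference; returns nonzero when the device can be deleted. */
static int
tunnel_unbind_vtep(struct tunnel_hdr *tun)
{
	struct vtep *vtep = tun->vtep;

	assert(vtep != NULL && vtep->refcnt > 0);
	tun->vtep = NULL;
	return --vtep->refcnt == 0;
}
```

With the vtep pointer kept in the header, the delete path would also not
need to look the device up by ifindex in the list.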

Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines
  2018-10-25 20:32       ` Slava Ovsiienko
@ 2018-10-26  6:30         ` Yongseok Koh
  0 siblings, 0 replies; 110+ messages in thread
From: Yongseok Koh @ 2018-10-26  6:30 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Thu, Oct 25, 2018 at 01:32:23PM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Thursday, October 25, 2018 3:37
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines
> > 
> > On Mon, Oct 15, 2018 at 02:13:35PM +0000, Viacheslav Ovsiienko wrote:
> > > The last part of the patchset contains the rule cleanup routines.
> > > These are part of the outer interface initialization at the moment
> > > of VXLAN VTEP attaching. These routines query the list of attached
> > > VXLAN devices, the list of local IP addresses with peer and link scope
> > > attribute and the list of permanent neigh rules, then all found
> > > abovementioned items on the specified outer device are flushed.
> > >
> > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > ---
[...]
> > > -4100,12 +4596,9 @@ static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > >  		uint16_t pcnt;
> > >
> > >  		/* Not found, we should create the new attached VTEP. */
> > > -/*
> > > - * TODO: not implemented yet
> > > - * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > - * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > - * flow_tcf_encap_neigh_cleanup(tcf, ifouter);
> > > - */
> > > +		flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > +		flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > +		flow_tcf_encap_neigh_cleanup(tcf, ifouter);
> > 
> > I have a fundamental question. Why are these cleanups needed? If I read the
> > code correctly, it looks like it cleans up vtep, IP assignment and neigh entries
> > which are not created/set by the PMD. The reason why we have to clean things up
> > is that the PMD exclusively owns the interface (ifouter). Is my
> > understanding correct?
> 
> Because this is the simplest approach. I have no idea how
> to co-exist with unknown pre-created rules and how to take into account
> all their properties and side effects.
> 
> While debugging I saw situations where the application crashed and
> left "leftovers" such as VXLAN devices, neigh and local rules. When we ran the
> application again, these leftovers were sources of errors (EEXIST on rule creation and so on).

Okay, makes sense.
Thanks for clarification.

Yongseok

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-26  3:07         ` Yongseok Koh
@ 2018-10-26  8:39           ` Slava Ovsiienko
  2018-10-26 21:56             ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-26  8:39 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Friday, October 26, 2018 6:07
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> routine
> 
> On Thu, Oct 25, 2018 at 06:53:11AM -0700, Slava Ovsiienko wrote:
> > > -----Original Message-----
> > > From: Yongseok Koh
> > > Sent: Tuesday, October 23, 2018 13:05
> > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> > > routine
> > >
> > > On Mon, Oct 15, 2018 at 02:13:30PM +0000, Viacheslav Ovsiienko wrote:
> [...]
> > > > @@ -1114,7 +1733,6 @@ struct pedit_parser {
> > > >  							   error);
> > > >  			if (ret < 0)
> > > >  				return ret;
> > > > -			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> > > >  			mask.ipv4 = flow_tcf_item_mask
> > > >  				(items, &rte_flow_item_ipv4_mask,
> > > >  				 &flow_tcf_mask_supported.ipv4,
> > > > @@ -1135,13 +1753,22 @@ struct pedit_parser {
> > > >  				next_protocol =
> > > >  					((const struct rte_flow_item_ipv4 *)
> > > >  					 (items->spec))->hdr.next_proto_id;
> > > > +			if (item_flags &
> > > MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
> > > > +				/*
> > > > +				 * Multiple outer items are not allowed as
> > > > +				 * tunnel parameters, will raise an error later.
> > > > +				 */
> > > > +				ipv4 = NULL;
> > >
> > > Can't it be inner then?
> > AFAIK, no. For tc rules we cannot specify multiple levels (inner + outer).
> > There are just no TCA_FLOWER_KEY_xxx attributes for specifying inner items
> > to match by flower.
> 
> When I briefly read the kernel code, I thought TCA_FLOWER_KEY_* are for the inner
> header before decap. I mean TCA_FLOWER_KEY_IPV4_SRC is for inner L3 and
> TCA_FLOWER_KEY_ENC_IPV4_SRC is for the outer tunnel header. Please do some
> experiments with the tc-flower command.

Hm. Interesting. I will check.

> > It is quite an unclear comment, not the best one, sorry. I did not like it either,
> > just forgot to rewrite it.
> >
> > The ipv4, ipv6, udp variables gather the matching items during the item list scanning;
> > later these variables are used for VXLAN decap action validation only. So "outer"
> > means that the ipv4 variable contains the VXLAN decap outer addresses, and
> > should be NULL-ed if multiple items are found in the items list.
> >
> > But we can generate an error here if we have valid action_flags
> > (gathered by prepare function) and VXLAN decap is set. Raising
> > an error looks more relevant and clear.
> 
> You can't use flags at this point. This is validate(), so prepare() might not
> have been called before it.
> 
> > >   flow create 1 ingress transfer
> > >     pattern eth src is 66:77:88:99:aa:bb
> > >       dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
> > >       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
> > >       eth / ipv6 / tcp dst is 42 / end
> > >     actions vxlan_decap / port_id id 2 / end
> > >
> > > Is this flow supported by linux tcf? I took this example from Adrien's
> patch -
> > > "[8/8] net/mlx5: add VXLAN decap support to switch flow rules". If so,
> isn't it
> > > possible to have inner L3 layer (MLX5_FLOW_LAYER_INNER_*)? If not,
> you
> > > should return error in this case. I don't see any code to check redundant
> > > outer items.
> > > Did I miss something?
> >
> > Interesting. Although the rule has correct syntax, I'm not sure whether it can be
> > applied w/o errors.
> 
> Please try. You own this patchset. However, you can just prohibit such flows
> (tunneled items) and come up with follow-up patches to enable them later if they
> are supported by tcf, as this whole patchset itself is pretty huge and we don't
> have much time.
> 
> > At least our current flow_tcf_translate() implementation does not support any INNERs.
> > But it seems flow_tcf_validate() does; it's subject to recheck - we should not allow
> > unsupported items to pass the validation. I'll check and provide a separate bugfix patch
> > (if any).
> 
> Neither has tunnel support. It is the first time tunnel support is added to TCF.
> If it was needed, you should've added it, not skipped it.
> 
> You can check how MLX5_FLOW_LAYER_TUNNEL is used in Verbs/DV as a reference.

Yes. I understood your point. Will check and add tunnel support for TCF rules.
Anyway, inner MAC addresses are supported for VXLAN decap; I think we should
specify these in the rule as inners (after the VNI item), so some
tunnel support in validate/parse/translate definitely should be added.

> 
> > > BTW, for the tunneled items, why don't you follow the code of
> > > Verbs(mlx5_flow_verbs.c) and DV(mlx5_flow_dv.c)? For tcf, it is the first
> time
> > For VXLAN it has some specifics (warning about ignored params, etc.)
> > I've checked which of verbs/dv code could be reused and did not
> discovered
> > a lot. I'll recheck the latest code commits, possible it became more
> appropriate
> > for VXLAN.
> 
> Agreed. I'm not forcing you to do it because we run out of time but
> mentioned it
> because if there's any redundancy in our code, that usually causes bug later.
> Let's not waste too much time for that. Just grab low hanging fruits if any.
> 
> > > to add tunneled item, but Verbs/DV already have validation code for
> tunnel,
> > > so you can reuse the existing code. In flow_tcf_validate_vxlan_decap(),
> not
> > > every validation is VXLAN-specific but some of them can be common
> code.
> > >
> > > And if you need to know whether there's the VXLAN decap action prior to
> > > outer header item validation, you can relocate the code - action
> validation
> > > first and item validation next, as there's no dependency yet in the current
> >
> > We can not validate action first - we need items to be preliminary
> gathered,
> > to check them in action's specific fashion and to check action itself.
> > I mean, if we see VXLAN decap action, we should check the presence of
> > L2, L3, L4 and VNI items. I minimized the number of passes along the item
> > and action lists. BTW, Adrien's approach performed two passes, mine does only one.
> >
> > > code. Defining ipv4, ipv6, udp seems to make the code path more
> complex.
> > Yes, but it allows us to avoid the extra item list scanning and minimizes the
> changes
> > of existing code.
> > In your approach we should:
> > - scan actions, w/o full checking, just action_flags gathering and checking
> > - scan items, performing variating check (depending on gathered action
> flags)
> > - scan actions again, performing full check with params (at least for now
> > check whether all params gathered)
> 
> Disagree. flow_tcf_validate_vxlan_encap() doesn't even need any info on items,
> and flow_tcf_validate_vxlan_decap() needs item_flags to check whether the VXLAN
> item is there or not, and ipv4/ipv6/udp are all for item checks. Let me give you
> a very detailed example:
> 
> {
> 	for (actions[]...) {
> 		...
> 		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> 			...
> 			flow_tcf_validate_vxlan_encap();
> 			...
> 			break;
> 		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> 			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> 					   | MLX5_ACTION_VXLAN_DECAP))
> 				return rte_flow_error_set
> 					(error, ENOTSUP,
> 					 RTE_FLOW_ERROR_TYPE_ACTION,
> 					 actions,
> 					 "can't have multiple vxlan actions");
> 			/* Don't call flow_tcf_validate_vxlan_decap(). */
> 			action_flags |= MLX5_ACTION_VXLAN_DECAP;
> 			break;
> 	}
> 	for (items[]...) {
> 		...
> 		case RTE_FLOW_ITEM_TYPE_IPV4:
> 			/* Existing common validation. */
> 			...
> 			if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
> 				/* Do ipv4 validation in
> 				 * flow_tcf_validate_vxlan_decap(). */
> 			}
> 			break;
> 	}
> }
> 
> Currently you are doing,
> 
> 	- validate items
> 	- validate actions
> 	- validate items again if decap.
> 
> But this can simply be
> 
> 	- validate actions
How could we validate VXLAN decap at this stage,
when item_flags are not set yet?
Am I missing something?

> 	- validate items
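
The two-step order discussed above can be sketched as follows. This is only
an illustration under stated assumptions: the enums, flag names and
validate_rule() are hypothetical simplifications, not the rte_flow or mlx5
definitions. The action loop only records which VXLAN action is requested;
the presence of L3/L4/VNI items, which needs the item flags, is checked once
after the item loop, so no list is scanned twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for rte_flow action and item types. */
enum action_type { ACT_END, ACT_VXLAN_ENCAP, ACT_VXLAN_DECAP, ACT_PORT_ID };
enum item_type { ITEM_END, ITEM_ETH, ITEM_IPV4, ITEM_UDP, ITEM_VXLAN };

#define FLAG_VXLAN_ENCAP (UINT64_C(1) << 0)
#define FLAG_VXLAN_DECAP (UINT64_C(1) << 1)

#define LAYER_L3  (UINT64_C(1) << 0)
#define LAYER_L4  (UINT64_C(1) << 1)
#define LAYER_VNI (UINT64_C(1) << 2)
#define DECAP_REQUIRED (LAYER_L3 | LAYER_L4 | LAYER_VNI)

/* Returns 0 when the rule is acceptable, -1 otherwise. */
static int
validate_rule(const enum action_type *actions, const enum item_type *items)
{
	uint64_t action_flags = 0;
	uint64_t item_flags = 0;

	assert(actions != NULL && items != NULL);
	/* Actions first: only record which tunnel action is requested. */
	for (; *actions != ACT_END; ++actions) {
		switch (*actions) {
		case ACT_VXLAN_ENCAP:
		case ACT_VXLAN_DECAP:
			if (action_flags &
			    (FLAG_VXLAN_ENCAP | FLAG_VXLAN_DECAP))
				return -1; /* Multiple VXLAN actions. */
			action_flags |= (*actions == ACT_VXLAN_ENCAP) ?
					FLAG_VXLAN_ENCAP : FLAG_VXLAN_DECAP;
			break;
		default:
			break;
		}
	}
	/* Items next: per-item decap checks could key off action_flags. */
	for (; *items != ITEM_END; ++items) {
		switch (*items) {
		case ITEM_IPV4:
			item_flags |= LAYER_L3;
			break;
		case ITEM_UDP:
			item_flags |= LAYER_L4;
			break;
		case ITEM_VXLAN:
			item_flags |= LAYER_VNI;
			break;
		default:
			break;
		}
	}
	/* Presence check needs item_flags, so it runs after the item loop. */
	if ((action_flags & FLAG_VXLAN_DECAP) &&
	    (item_flags & DECAP_REQUIRED) != DECAP_REQUIRED)
		return -1;
	return 0;
}
```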
> 
> Thanks,
> Yongseok
> 

With best regards,
Slava

> > >
> > > For example, you just can call vxlan decap item validation (by splitting
> > > flow_tcf_validate_vxlan_decap()) at this point like:
> > >
> > > 			if (action_flags &
> > > MLX5_FLOW_ACTION_VXLAN_DECAP)
> > > 				ret =
> > > flow_tcf_validate_vxlan_decap_ipv4(...);
> > > 			...
> > >
> > > Same for other items.
> > >

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine
  2018-10-26  4:22         ` Yongseok Koh
@ 2018-10-26  9:06           ` Slava Ovsiienko
  2018-10-26 22:10             ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-26  9:06 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Friday, October 26, 2018 7:22
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation
> routine
> 
> On Thu, Oct 25, 2018 at 07:37:56AM -0700, Slava Ovsiienko wrote:
> > > -----Original Message-----
> > > From: Yongseok Koh
> > > Sent: Tuesday, October 23, 2018 13:06
> > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow
> > > translation routine
> > >
> > > On Mon, Oct 15, 2018 at 02:13:31PM +0000, Viacheslav Ovsiienko wrote:
> [...]
> > > > @@ -2184,13 +2447,16 @@ struct pedit_parser {
> > > >   *   Pointer to the list of actions.
> > > >   * @param[out] action_flags
> > > >   *   Pointer to the detected actions.
> > > > + * @param[out] tunnel
> > > > + *   Pointer to tunnel encapsulation parameters structure to fill.
> > > >   *
> > > >   * @return
> > > >   *   Maximum size of memory for actions.
> > > >   */
> > > >  static int
> > > >  flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
> > > > -			      uint64_t *action_flags)
> > > > +			      uint64_t *action_flags,
> > > > +			      void *tunnel)
> > >
> > > This func is to get actions and size but you are parsing and filling
> > > tunnel info here. It would be better to move parsing to translate()
> > > because it anyway has multiple if conditions (same as switch/case)
> > > to set TCA_TUNNEL_KEY_ENC_* there.
> > Do you mean call of flow_tcf_vxlan_encap_parse(actions, tunnel)?
> 
> Yes.
> 
> > OK, let's move it to translate stage. Anyway, we need to keep encap
> > structure for local/neigh rules.
> >
> > >
> > > >  {
> > > >  	int size = 0;
> > > >  	uint64_t flags = 0;
> > > > @@ -2246,6 +2512,29 @@ struct pedit_parser {
> > > >  				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID.
> > > */
> > > >  				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio.
> > > */
> > > >  			break;
> > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > > */
> > > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > > +			size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > > +			size +=	flow_tcf_vxlan_encap_parse(actions, tunnel)
> > > +
> > > > +				RTE_ALIGN_CEIL /* preceding encap params.
> > > */
> > > > +				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > > +				MNL_ALIGNTO);
> > >
> > > Is it different from SZ_NLATTR_TYPE_OF(struct
> > > mlx5_flow_tcf_vxlan_encap)? Or, use __rte_aligned(MNL_ALIGNTO)
> instead.
> >
> > It is written intentionally in this form. It means that there is
> > struct mlx5_flow_tcf_vxlan_encap at the beginning of buffer. This is
> > not the NL attribute, usage of SZ_NLATTR_TYPE_OF is not relevant here.
> Alignment is needed for the following Netlink message.
> 
> Good point. Understood.
> 
> > >
> > > > +			flags |= MLX5_ACTION_VXLAN_ENCAP;
> > > > +			break;
> > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > > */
> > > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > > +			size +=	SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > > +			size +=	RTE_ALIGN_CEIL /* preceding decap params.
> > > */
> > > > +				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > > +				MNL_ALIGNTO);
> > >
> > > Same here.
> > >
> > > > +			flags |= MLX5_ACTION_VXLAN_DECAP;
> > > > +			break;
> > > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> > > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> > > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
> > > > @@ -2289,6 +2578,26 @@ struct pedit_parser {  }
> > > >
> > > >  /**
> > > > + * Convert VXLAN VNI to 32-bit integer.
> > > > + *
> > > > + * @param[in] vni
> > > > + *   VXLAN VNI in 24-bit wire format.
> > > > + *
> > > > + * @return
> > > > + *   VXLAN VNI as a 32-bit integer value in network endian.
> > > > + */
> > > > +static rte_be32_t
> > >
> > > make it inline.
> > OK. Missed point.
> >
> > >
> > > > +vxlan_vni_as_be32(const uint8_t vni[3]) {
> > > > +	rte_be32_t ret;
> > >
> > > Defining ret as rte_be32_t? The return value of this func which is
> > > bswap(ret) is also rte_be32_t??
> > Yes. And it is directly stored in the net-endian NL attribute.
> > I've compiled and checked the listing of the function you proposed. It
> seems to be best, I'll take it.
> >
> > >
> > > > +
> > > > +	ret = vni[0];
> > > > +	ret = (ret << 8) | vni[1];
> > > > +	ret = (ret << 8) | vni[2];
> > > > +	return RTE_BE32(ret);
> > >
> > > Use rte_cpu_to_be_*() instead. But I still don't understand why you
> > > shuffle bytes twice. One with shift and or and other by bswap().
> > And it works. There are three bytes in very bizarre order (in NL attribute) -
> 0, vni[0], vni[1], vni[2].
> >
> > >
> > > {
> > > 	union {
> > > 		uint8_t vni[4];
> > > 		rte_be32_t dword;
> > > 	} ret = {
> > > 		.vni = { 0, vni[0], vni[1], vni[2] },
> > > 	};
> > > 	return ret.dword;
> > > }
> > >
> > > This will have the same result without extra cost.
> >
> > OK. Investigated, it is the best for x86-64. Also I'm going to test it
> > on the ARM 32, with various compilers, just curious.
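
For reference, a minimal standalone sketch comparing the two ways of building
the value. The function names are hypothetical and plain uint32_t stands in
for rte_be32_t. Both variants leave the bytes 0, vni[0], vni[1], vni[2] in
memory order, which is the network-endian layout regardless of host
endianness.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Place the bytes directly in memory order: 0, vni[0], vni[1], vni[2]. */
static uint32_t
vni_as_be32_union(const uint8_t vni[3])
{
	union {
		uint8_t bytes[4];
		uint32_t dword;
	} ret = {
		.bytes = { 0, vni[0], vni[1], vni[2] },
	};

	return ret.dword;
}

/* Reference: build the host-order value, then store it big-endian. */
static uint32_t
vni_as_be32_shift(const uint8_t vni[3])
{
	uint32_t host = ((uint32_t)vni[0] << 16) |
			((uint32_t)vni[1] << 8) |
			(uint32_t)vni[2];
	uint8_t be[4] = {
		(uint8_t)(host >> 24), (uint8_t)(host >> 16),
		(uint8_t)(host >> 8), (uint8_t)host,
	};
	uint32_t ret;

	assert(sizeof(ret) == sizeof(be));
	memcpy(&ret, be, sizeof(ret));
	return ret;
}
```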
> >
> > >
> > > > +}
> > > > +
> > > > +/**
> > > >   * Prepare a flow object for Linux TC flower. It calculates the
> > > > maximum
> > > size of
> > > >   * memory required, allocates the memory, initializes Netlink
> > > > message
> > > headers
> > > >   * and set unique TC message handle.
> > > > @@ -2323,22 +2632,54 @@ struct pedit_parser {
> > > >  	struct mlx5_flow *dev_flow;
> > > >  	struct nlmsghdr *nlh;
> > > >  	struct tcmsg *tcm;
> > > > +	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
> > > > +	uint8_t *sp, *tun = NULL;
> > > >
> > > >  	size += flow_tcf_get_items_and_size(attr, items, item_flags);
> > > > -	size += flow_tcf_get_actions_and_size(actions, action_flags);
> > > > -	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
> > > > +	size += flow_tcf_get_actions_and_size(actions, action_flags,
> > > &encap);
> > > > +	dev_flow = rte_zmalloc(__func__, size,
> > > > +			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
> > > > +				(size_t)MNL_ALIGNTO));
> > >
> > > Why RTE_MAX between the two? Note that it is alignment for start
> > > address of the memory and the minimum alignment is cacheline size.
> > > On x86, non- zero value less than 64 will have same result as 64.
> >
> > OK. Thanks for note.
> > It is not expected the structure alignments exceed the cache line size.
> > So? Just specify zero?
> > >
> > > >  	if (!dev_flow) {
> > > >  		rte_flow_error_set(error, ENOMEM,
> > > >  				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > NULL,
> > > >  				   "not enough memory to create E-Switch
> > > flow");
> > > >  		return NULL;
> > > >  	}
> > > > -	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
> > > > +	sp = (uint8_t *)(dev_flow + 1);
> > > > +	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
> > > > +		tun = sp;
> > > > +		sp += RTE_ALIGN_CEIL
> > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > > +			MNL_ALIGNTO);
> > >
> > > And why should it be aligned?
> >
> > Netlink message should be aligned. It follows the
> > mlx5_flow_tcf_vxlan_encap, that's why the pointer is aligned.
> 
> Not true. There's no requirement for nl msg buffer alignment.
> MNL_ALIGNTO is mainly for size alignment. For example, check out the
> source code of mnl_nlmsg_put_header(void *buf). There's no requirement of
> aligning the start address of buf. But the size of any entry (hdr, attr ...) should
> be aligned to MNL_ALIGNTO(4).

Formally speaking, yes. There is no explicit requirement for the header
alignment. And the entire message goes to send(), which does not care about alignment.
But leaving the entire structure unaligned does not look like good practice.
(I lived for a long time on embedded systems with the Alignment Check
feature activated and unaligned-access compiler flags turned off.
It never took long to get a punishing exception.)
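
A minimal sketch of the layout arithmetic in question, assuming hypothetical
helper names (align_ceil, nlmsg_offset) and a local NL_ALIGNTO constant that
mirrors the MNL_ALIGNTO value of 4 mentioned above. The code that places
objects into the buffer rounds the prefix size up, so the message that
follows starts on an aligned offset.

```c
#include <assert.h>
#include <stddef.h>

#define NL_ALIGNTO 4u /* Assumed to mirror MNL_ALIGNTO. */

/* Round size up to the next multiple of align (a power of two). */
static size_t
align_ceil(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/*
 * Offset of the Netlink message when it follows a prefix object of
 * prefix_size bytes placed at the start of an aligned buffer.
 */
static size_t
nlmsg_offset(size_t prefix_size)
{
	return align_ceil(prefix_size, NL_ALIGNTO);
}
```

The same rounded value is what has to be added to the allocation size and
subtracted again when computing the room left for the message.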

> 
> >
> > As the size of dev_flow might not be aligned, it
> > > is meaningless, isn't it? If you think it must be aligned for better
> > > performance (not much anyway), you can use
> > > __rte_aligned(MNL_ALIGNTO) on the struct
> > Hm. Where we can use __rte_aligned? Could you clarify, please.
> 
> For example,
> 
> struct mlx5_flow_tcf_tunnel_hdr {
> 	uint32_t type; /**< Tunnel action type. */
> 	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
> 	unsigned int ifindex_org; /**< Original dst/src interface */
> 	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
> } __rte_aligned(MNL_ALIGNTO);

No. tunnel_hdr should not know anything about the NL message.
It just happens that the NL message follows the tunnel_hdr
in our memory buffer. What if we would like to add some other
object after tunnel_hdr in the buffer, not an NL message?
The alignment of objects is the duty of the code which places objects into the buffer.
Objects can be very different, with different alignment requirements,
and, generally speaking, placed in arbitrary order. Why, while
declaring the tunnel_hdr structure, should we make an assumption that
it is always followed by an NL message? __rte_aligned(MNL_ALIGNTO) at the end
of tunnel_hdr is exactly an example of that inappropriate assumption.

> 
> A good example is the struct rte_mbuf. If this attribute is used, the size of the
> struct will be aligned to the value.
> 
> If you still want to make the nl msg aligned,
> 
> 	dev_flow = rte_zmalloc(..., MNL_ALIGNTO); /* anyway cacheline
> aligned. */
> 	tun = RTE_PTR_ALIGN(dev_flow + 1, MNL_ALIGNTO);
> 	nlh = mnl_nlmsg_put_header(tun);
> 
> with adding '__rte_aligned(MNL_ALIGNTO)' to struct
> mlx5_flow_tcf_vxlan_encap/decap.
> 
> Then, nlh will be aligned. You should make sure size is correctly calculated.
> 
> >
> > > definition but not for mlx5_flow (it's not only for tcf, have to do it
> manually).
> > >
> > > > +		size -= RTE_ALIGN_CEIL
> > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > > +			MNL_ALIGNTO);
> > >
> > > Don't you have to subtract sizeof(struct mlx5_flow) as well? But
> > > like I mentioned, if '.nlsize' below isn't needed, you don't need to
> > > have this calculation either.
> > Yes, it is a bug. Should be fixed. Thank you.
> > Let's discuss whether we can keep the nlsize under NDEBUG switch.
> 
> I agreed on using NDEBUG for it.
> 
> >
> > >
> > > > +		encap.hdr.type =
> > > MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
> > > > +		memcpy(tun, &encap,
> > > > +		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
> > > > +	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > > > +		tun = sp;
> > > > +		sp += RTE_ALIGN_CEIL
> > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > > +			MNL_ALIGNTO);
> > > > +		size -= RTE_ALIGN_CEIL
> > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > > +			MNL_ALIGNTO);
> > > > +		encap.hdr.type =
> > > MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
> > > > +		memcpy(tun, &encap,
> > > > +		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
> > > > +	}
> > > > +	nlh = mnl_nlmsg_put_header(sp);
> > > >  	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > > >  	*dev_flow = (struct mlx5_flow){
> > > >  		.tcf = (struct mlx5_flow_tcf){
> > > > +			.nlsize = size,
> > > >  			.nlh = nlh,
> > > >  			.tcm = tcm,
> > > > +			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
> > > > +			.item_flags = *item_flags,
> > > > +			.action_flags = *action_flags,
> > > >  		},
> > > >  	};
> > > >  	/*
> [...]
> > > > @@ -2827,6 +3268,76 @@ struct pedit_parser {
> > > >  					(na_vlan_priority) =
> > > >  					conf.of_set_vlan_pcp->vlan_pcp;
> > > >  			}
> > > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > > +			break;
> > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > > +			assert(decap.vxlan);
> > > > +			assert(dev_flow->tcf.tunnel);
> > > > +			dev_flow->tcf.tunnel->ifindex_ptr
> > > > +				= (unsigned int *)&tcm->tcm_ifindex;
> > > > +			na_act_index =
> > > > +				mnl_attr_nest_start(nlh,
> > > na_act_index_cur++);
> > > > +			assert(na_act_index);
> > > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > > "tunnel_key");
> > > > +			na_act = mnl_attr_nest_start(nlh,
> > > TCA_ACT_OPTIONS);
> > > > +			assert(na_act);
> > > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > > +				sizeof(struct tc_tunnel_key),
> > > > +				&(struct tc_tunnel_key){
> > > > +					.action = TC_ACT_PIPE,
> > > > +					.t_action =
> > > TCA_TUNNEL_KEY_ACT_RELEASE,
> > > > +					});
> > > > +			mnl_attr_nest_end(nlh, na_act);
> > > > +			mnl_attr_nest_end(nlh, na_act_index);
> > > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > > +			break;
> > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > > +			assert(encap.vxlan);
> > > > +			na_act_index =
> > > > +				mnl_attr_nest_start(nlh,
> > > na_act_index_cur++);
> > > > +			assert(na_act_index);
> > > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > > "tunnel_key");
> > > > +			na_act = mnl_attr_nest_start(nlh,
> > > TCA_ACT_OPTIONS);
> > > > +			assert(na_act);
> > > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > > +				sizeof(struct tc_tunnel_key),
> > > > +				&(struct tc_tunnel_key){
> > > > +					.action = TC_ACT_PIPE,
> > > > +					.t_action =
> > > TCA_TUNNEL_KEY_ACT_SET,
> > > > +					});
> > > > +			if (encap.vxlan->mask &
> > > MLX5_FLOW_TCF_ENCAP_UDP_DST)
> > > > +				mnl_attr_put_u16(nlh,
> > > > +					 TCA_TUNNEL_KEY_ENC_DST_PORT,
> > > > +					 encap.vxlan->udp.dst);
> > > > +			if (encap.vxlan->mask &
> > > MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
> > > > +				mnl_attr_put_u32(nlh,
> > > > +					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
> > > > +					 encap.vxlan->ipv4.src);
> > > > +			if (encap.vxlan->mask &
> > > MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> > > > +				mnl_attr_put_u32(nlh,
> > > > +					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
> > > > +					 encap.vxlan->ipv4.dst);
> > > > +			if (encap.vxlan->mask &
> > > MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
> > > > +				mnl_attr_put(nlh,
> > > > +					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
> > > > +					 sizeof(encap.vxlan->ipv6.src),
> > > > +					 &encap.vxlan->ipv6.src);
> > > > +			if (encap.vxlan->mask &
> > > MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> > > > +				mnl_attr_put(nlh,
> > > > +					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
> > > > +					 sizeof(encap.vxlan->ipv6.dst),
> > > > +					 &encap.vxlan->ipv6.dst);
> > > > +			if (encap.vxlan->mask &
> > > MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
> > > > +				mnl_attr_put_u32(nlh,
> > > > +					 TCA_TUNNEL_KEY_ENC_KEY_ID,
> > > > +					 vxlan_vni_as_be32
> > > > +						(encap.vxlan->vxlan.vni));
> > > > +#ifdef TCA_TUNNEL_KEY_NO_CSUM
> > > > +			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM, 0);
> > > > +#endif
> > >
> > > TCA_TUNNEL_KEY_NO_CSUM is anyway defined like others, then why do
> > > you treat it differently with #ifdef/#endif?
> >
> > As it was found it is not defined on old kernels, on some our CI
> > machines compilation errors occurred.
> 
> In your first patch, TCA_TUNNEL_KEY_NO_CSUM is defined if there isn't
> HAVE_TC_ACT_TUNNEL_KEY. Actually I'm wondering why it is different from
> HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT. It looks like the following is
> needed - HAVE_TCA_TUNNEL_KEY_NO_CSUM ??
> 
> 
> 	#ifdef HAVE_TC_ACT_TUNNEL_KEY
> 
> 	#include <linux/tc_act/tc_tunnel_key.h>
> 
> 	#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
> 	#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
> 	#endif
> 
> 	#ifndef HAVE_TCA_TUNNEL_KEY_NO_CSUM
> 	#define TCA_TUNNEL_KEY_NO_CSUM 10
> 	#endif

I think this is subject to checking. Yes, we can define the "missing"
macros, but it seems the old kernel just does not know these
keys. Would a rule with these keys be accepted by the kernel?
I did not check (I have no host with an old setup to check on), so
I'd prefer to exclude a not very significant key to lower the
rule rejection risk.

> 
> 	#else /* HAVE_TC_ACT_TUNNEL_KEY */
> 
> 
> Thanks,
> Yongseok

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-26  6:25         ` Yongseok Koh
@ 2018-10-26  9:35           ` Slava Ovsiienko
  2018-10-26 22:42             ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-26  9:35 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Friday, October 26, 2018 9:26
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> management
> 
> On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > -----Original Message-----
> > > From: Yongseok Koh
> > > Sent: Thursday, October 25, 2018 3:28
> > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > management
> > >
> > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > > > VXLAN interfaces are dynamically created for each local UDP port
> > > > of outer networks and then used as targets for TC "flower" filters
> > > > in order to perform encapsulation. These VXLAN interfaces are
> > > > system-wide, the only one device with given UDP port can exist in
> > > > the system (the attempt of creating another device with the same
> > > > UDP local port returns EEXIST), so PMD should support the shared
> > > > device instances database for PMD instances. These VXLAN
> > > > implicitly created devices are called VTEPs (Virtual Tunnel End Points).
> > > >
> > > > Creation of the VTEP occurs at the moment of rule applying. The
> > > > link is set up, root ingress qdisc is also initialized.
> > > >
> > > > Encapsulation VTEPs are created on per port basis, the single VTEP
> > > > is attached to the outer interface and is shared for all
> > > > encapsulation rules on this interface. The source UDP port is
> > > > automatically selected in range 30000-60000.
> > > >
> > > > > For decapsulation one VTEP is created per every unique UDP local
> > > > port to accept tunnel traffic. The name of created VTEP consists
> > > > of prefix "vmlx_" and the number of UDP port in decimal digits
> > > > without leading zeros (vmlx_4789). The VTEP can be preliminary
> > > > created in the system before the launching
> > > > application, it allows to share	UDP ports between primary
> > > > and secondary processes.
> > > >
> > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > ---
> > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > ++++++++++++++++++++++++++++++++++++++-
> > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > index d6840d5..efa9c3b 100644
> > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > >  	return -err;
> > > >  }
> > > >
> > > > +/* VTEP device list is shared between PMD port instances. */
> > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER(); static
> > > pthread_mutex_t
> > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > >
> > > What's the reason for choosing pthread_mutex instead of rte_*_lock?
> >
> > The sharing this database for secondary processes?
> 
> The static variable isn't shared with sec proc. But you can leave it as is.

Yes. The sharing was only assumed; it is not implemented yet.

> 
> > > > +
> > > > +/**
> > > > + * Deletes VTEP network device.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] vtep
> > > > > + *   Object representing the network device to delete. Memory
> > > > + *   allocated for this object is freed by routine.
> > > > + */
> > > > +static void
> > > > > +flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
> > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > +	struct nlmsghdr *nlh;
> > > > +	struct ifinfomsg *ifm;
> > > > +	alignas(struct nlmsghdr)
> > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > +	int ret;
> > > > +
> > > > +	assert(!vtep->refcnt);
> > > > +	if (vtep->created && vtep->ifindex) {
> > >
> > > First of all vtep->created seems of no use. It is introduced to
> > > select the error message in flow_tcf_create_iface(). I don't see any
> > > necessity to distinguish between 'vtep is allocated by rte_malloc()' and
> 'vtep is created in kernel'.
> >
> > created flag indicates the iface is created by our code.
> > The VXLAN decap devices must have the specified UDP port, we can not
> > create multiple VXLAN devices with the same UDP port - EEXIST is
> > returned. So, we have to share device. One option is create device
> > before DPDK application launch and use these pre-created devices.
> > In this case created flag is not set and VXLAN device is not reinitialized, and
> not deleted.
> 
> I can't see any code to use pre-created device (created even before dpdk app
> launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment
> in [7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on
> initialization. Your comment sounds conflicting and confusing.

There are two types of VXLAN devices:

- VXLAN decap, not attached to any ifouter. It provides the ingress UDP port.
  We try to share devices of this type, because we may be asked for
  a specific UDP port. No device/rule cleanup or reinit is needed.

- VXLAN encap, which should be attached to ifouter to provide a strict egress
  path. There is no need to share it, since the egress UDP port does not matter.
  We also need to clean up ifouter, removing other attached VXLAN devices and
  rules, because it is too hard to co-exist with some pre-created setup.
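
As a rough illustration of the two lookup keys (a sketch only; the
struct demo_vtep and both helpers are hypothetical and use a plain
singly linked list instead of the LIST_* macros from the patch):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified VTEP descriptor. */
struct demo_vtep {
	struct demo_vtep *next;
	unsigned int ifouter; /* 0 means "not attached" (decap type). */
	uint16_t port;        /* Local UDP port. */
};

/* Decap devices are shared and looked up by UDP port. */
static struct demo_vtep *
demo_find_decap(struct demo_vtep *head, uint16_t port)
{
	for (; head; head = head->next)
		if (!head->ifouter && head->port == port)
			return head;
	return NULL;
}

/* Encap devices are looked up by outer interface (ifouter != 0). */
static struct demo_vtep *
demo_find_encap(struct demo_vtep *head, unsigned int ifouter)
{
	for (; head; head = head->next)
		if (head->ifouter == ifouter)
			return head;
	return NULL;
}
```

The decap lookup ignores attached devices on purpose, so an encap
device that happens to use the same UDP port is never handed out as a
shared decap endpoint.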

> 
> > > And why do you need to check vtep->ifindex as well? If vtep is
> > > created in kernel and its ifindex isn't set, that should be an error
> > > which had to be handled in flow_tcf_create_iface(). Such a vtep shouldn't
> exist.
> > Yes, if we did not get ifindex of device - vtep is not created, error returned.
> > We just can not operate w/o ifindex.
> 
> I know ifindex is needed but my question was checking vtep->ifindex here
> looked redundant/unnecessary. But as you agreed on having
> create/get/release_iface(), it doesn't matter much.

Yes. I agree, will refactor the code.
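
For reference, the create/get/release split could look roughly like
the sketch below. It is an assumption about the refactored shape, not
the final code: the names rc_vtep, rc_create(), rc_get() and
rc_release() are hypothetical, and the Netlink calls are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference-counted VTEP descriptor. */
struct rc_vtep {
	struct rc_vtep *next;
	uint16_t port;
	unsigned int refcnt;
};

/* Look up by port; take a reference on a hit. */
static struct rc_vtep *
rc_get(struct rc_vtep **head, uint16_t port)
{
	struct rc_vtep *v;

	for (v = *head; v; v = v->next) {
		if (v->port == port) {
			v->refcnt++;
			return v;
		}
	}
	return NULL;
}

/* Allocate, link into the list and start with one reference. */
static struct rc_vtep *
rc_create(struct rc_vtep **head, uint16_t port)
{
	struct rc_vtep *v = calloc(1, sizeof(*v));

	if (!v)
		return NULL;
	v->port = port;
	v->refcnt = 1;
	v->next = *head;
	*head = v;
	return v;
}

/* Drop a reference; unlink and free when it reaches zero. */
static void
rc_release(struct rc_vtep **head, struct rc_vtep *v)
{
	struct rc_vtep **pp;

	assert(v->refcnt);
	if (--v->refcnt)
		return;
	for (pp = head; *pp; pp = &(*pp)->next) {
		if (*pp == v) {
			*pp = v->next;
			break;
		}
	}
	free(v);
}
```

A caller would try rc_get() first and fall back to rc_create(), so the
lookup loop lives in one place and the refcnt is never touched
directly.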

> 
> > > Also, the refcnt management is a bit strange. Please put an
> > > abstraction by adding create_iface(), get_iface() and
> > > > release_iface(). In the get_iface(),
> > > > vtep->refcnt should be incremented. And in the release_iface(), it
> > > > decrease the
> > OK. Good proposal. I'll refactor the code.
> >
> > > refcnt and if it reaches to zero, the iface can be removed.
> > > create_iface() will set the refcnt to 1. And if you refer to
> > > mlx5_hrxq_get(), it even does searching the list not by repeating the
> same lookup code here and there.
> > > That will make your code much simpler.
> > >
> > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > +		ifm->ifi_index = vtep->ifindex;
> > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +		if (ret)
> > > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> > > "
> > > > +					 "encap/decap ifindex %u",
> > > > +					 ifm->ifi_index);
> > > > +	}
> > > > +	rte_free(vtep);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Creates VTEP network device.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifouter
> > > > + *   Outer interface to attach new-created VXLAN device
> > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > + * @param[in] port
> > > > + *   UDP port of created VTEP device.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + *
> > > > + * @return
> > > > + * Pointer to created device structure on success, NULL otherwise
> > > > + * and rte_errno is set.
> > > > + */
> > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > >
> > > Why negative(ifndef) first intead of positive(ifdef)?
> > Hm. Did I miss the rule. Positive #ifdef first? OK.
> 
> No concrete rule but if there's no specific reason, it would be better to start
> from ifdef.
> 
> > > > +static struct mlx5_flow_tcf_vtep* flow_tcf_create_iface(struct
> > > > +mlx5_flow_tcf_context *tcf __rte_unused,
> > > > +		      unsigned int ifouter __rte_unused,
> > > > +		      uint16_t port __rte_unused,
> > > > +		      struct rte_flow_error *error) {
> > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > +			 "netlink: failed to create VTEP, "
> > > > +			 "VXLAN metadat is not supported by kernel");
> > >
> > > Typo.
> >
> > OK.  "metadata are not supported".
> > >
> > > > +	return NULL;
> > > > +}
> > > > +#else
> > > > +static struct mlx5_flow_tcf_vtep* flow_tcf_create_iface(struct
> > > > +mlx5_flow_tcf_context *tcf,
> > >
> > > How about adding 'vtep'? It sounds vague - creating a general interface.
> > > E.g., flow_tcf_create_vtep_iface()?
> >
> > OK.
> >
> > >
> > > > +		      unsigned int ifouter,
> > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > +	struct nlmsghdr *nlh;
> > > > +	struct ifinfomsg *ifm;
> > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > +	alignas(struct nlmsghdr)
> > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > >
> > > Use a macro for '128'. Can't know the meaning.
> > OK. I think we should calculate the buffer size explicitly.
> >
> > >
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > +		       SZ_NLATTR_NEST * 2 +
> > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > +	struct nlattr *na_info;
> > > > +	struct nlattr *na_vxlan;
> > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > >
> > > Use rte_cpu_to_be_*() instead.
> >
> > Yes, I'll recheck the whole code for this issue.
> >
> > >
> > > > +	int ret;
> > > > +
> > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > +	if (!vtep) {
> > > > +		rte_flow_error_set
> > > > +			(error, ENOMEM,
> > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > +			 NULL, "unadble to allocate memory for VTEP desc");
> > > > +		return NULL;
> > > > +	}
> > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > +			.refcnt = 0,
> > > > +			.port = port,
> > > > +			.created = 0,
> > > > +			.ifouter = 0,
> > > > +			.ifindex = 0,
> > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > +	};
> > > > +	memset(buf, 0, sizeof(buf));
> > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > > NLM_F_EXCL;
> > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > +	ifm->ifi_type = 0;
> > > > +	ifm->ifi_index = 0;
> > > > +	ifm->ifi_flags = IFF_UP;
> > > > +	ifm->ifi_change = 0xffffffff;
> > > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> > > port);
> > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > +	assert(na_info);
> > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > +	if (ifouter)
> > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > +	assert(na_vxlan);
> > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +	if (ret)
> > > > +		DRV_LOG(WARNING,
> > > > +			"netlink: VTEP %s create failure (%d)",
> > > > +			name, rte_errno);
> > > > +	else
> > > > +		vtep->created = 1;
> > >
> > > Flow of code here isn't smooth, thus could be error-prone. Most of
> > > all, I don't like ret has multiple meanings. ret should be return
> > > value but you are using it to store ifindex.
> > >
> > > > +	if (ret && ifouter)
> > > > +		ret = 0;
> > > > +	else
> > > > +		ret = if_nametoindex(name);
> > >
> > > If vtep isn't created and ifouter is set, then skip init below,
> > > which means, if
> >
> > ifouter is set for VXLAN encap devices. They should be attached to
> > ifouter and can not be shared. So, if ifouter I set - we do not use
> > the precreated/existing VXLAN devices. We have to create our own not
> shared device.
> 
> In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows.
> Do you mean it isn't shared between different outer ifaces? If so, that's for
> sure.
Sorry, I do not understand the question.
A VXLAN encap device is attached to ifouter and shared by all flows with this
ifouter. Only one VXLAN device is attached to a given ifouter, never several.
A VXLAN decap device has no attached ifouter, so it can not share it.

> 
> > > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > > But why do you want to try to call this API even if it failed to create vtep?
> > > Let's not make code flow convoluted even though it logically works.
> > > Let's make it straightforward.
> > >
> > > > +	if (ret) {
> > > > +		vtep->ifindex = ret;
> > > > +		vtep->ifouter = ifouter;
> > > > +		memset(buf, 0, sizeof(buf));
> > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > +		ifm->ifi_type = 0;
> > > > +		ifm->ifi_index = vtep->ifindex;
> > > > +		ifm->ifi_flags = IFF_UP;
> > > > +		ifm->ifi_change = IFF_UP;
> > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +		if (ret) {
> > > > +			DRV_LOG(WARNING,
> > > > +				"netlink: VTEP %s set link up failure (%d)",
> > > > +				name, rte_errno);
> > > > +			rte_free(vtep);
> > > > +			rte_flow_error_set
> > > > +				(error, -errno,
> > > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > NULL,
> > > > +				 "netlink: failed to set VTEP link up");
> > > > +			vtep = NULL;
> > > > +		} else {
> > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > > +			if (ret)
> > > > +				DRV_LOG(WARNING,
> > > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > > +		}
> > > > +	} else {
> > > > +		DRV_LOG(WARNING,
> > > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > > +		rte_flow_error_set
> > > > +			(error, -errno,
> > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > +			 ret = 1;
> > >
> > > If it fails to create a vtep above, it will print out two warning
> > > messages and one rte_flow_error message. And it even selects message
> > > to print between two?
> > > And there's another info msg at the end even in case of failure. Do
> > > you really want to do this even with manipulating ret to change code
> > > path?  Not a good practice.
> > >
> > > Usually, code path should be straightforward for sucessful path and
> > > for errors/failures, return immediately or use 'goto' if there's need for
> cleanup.
> > >
> > > Please refactor entire function.
> >
> > I think I'll split it in two ones - for attached and potentially shared ifaces.
> > >
> > > > +	}
> > > > +	if (ret) {
> > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > +		vtep = NULL;
> > > > +	}
> > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> > > "error");
> > > > +	return vtep;
> > > > +}
> > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > +
> > > > +/**
> > > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > > + * In order to share the UDP port within the other interfaces the
> > > > + * VXLAN device created as not attached to any interface (if created).
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + * @return
> > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > >
> > > Return negative errno in case of failure like others.
> >
> > Anyway, we have to return an index. If we do not return it as function
> > result we will need to provide some extra pointing parameter, it
> complicates the code.
> 
> You misunderstood it. See what I wrote below. The function still returns the
> index but in case of error, make it return negative errno instead of zero.
> 
> > >
> > >  *   Interface index on success, a negative errno value otherwise and
> > > rte_errno is set.
> > >
> > > > + */
> > > > +static unsigned int
> > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > +			   struct mlx5_flow *dev_flow,
> > > > +			   struct rte_flow_error *error) {
> > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > +
> > > > +	vtep = NULL;
> > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +		if (vlst->port == port) {
> > > > +			vtep = vlst;
> > > > +			break;
> > > > +		}
> > > > +	}
> > >
> > > You just need one variable.
> >
> > Yes. There is a long story, I forgot to revert code to one variable after
> debugging.
> > >
> > > 	struct mlx5_flow_tcf_vtep *vtep;
> > >
> > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > 		if (vtep->port == port)
> > > 			break;
> > > 	}
> > >
> > > > +	if (!vtep) {
> > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > +		if (vtep)
> > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > +	} else {
> > > > +		if (vtep->ifouter) {
> > > > +			rte_flow_error_set(error, -errno,
> > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > NULL,
> > > > +				"Failed to create decap VTEP, attached "
> > > > +				"device with the same UDP port exists");
> > > > +				vtep = NULL;
> > >
> > > Making vtep null to skip the following code?
> >
> > Yes. To avoid multiple return operators in code.
> 
> It's okay to have multiple returns. Why not?

It is easy to miss a return in the middle of a function while refactoring or modifying the code.
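
One common middle ground is a single exit label, so the unlock cannot
be skipped while each error still leaves the main path at once. A
minimal sketch follows; struct demo_lock is a hypothetical stand-in
for the mutex and demo_apply() is not a function from the patch.

```c
#include <errno.h>

/* Hypothetical stand-in for pthread_mutex_t; depth counts held locks. */
struct demo_lock {
	int depth;
};

/*
 * Returns 0 on success or a negative errno value.
 * Every path leaves through "exit", so the lock is always dropped.
 */
static int
demo_apply(struct demo_lock *lock, int port_in_use, int create_fails)
{
	int ret = 0;

	lock->depth++; /* Stands in for pthread_mutex_lock(). */
	if (port_in_use) {
		ret = -EEXIST;
		goto exit;
	}
	if (create_fails) {
		ret = -ENOMEM;
		goto exit;
	}
	/* The successful path continues here without extra nesting. */
exit:
	lock->depth--; /* Stands in for pthread_mutex_unlock(). */
	return ret;
}
```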

> 
> > > Please merge the two same
> > > if/else and make the code path strightforward. And which errno do
> > > you expect here?
> > > Should it be set EEXIST instead?
> > Not always. Netlink returns the code.
> 
> No, that's not my point. Your code above sets errno instead of rte_errno or
> EEXIST.
> 
> 	} else {
> 		if (vtep->ifouter) {
> 			rte_flow_error_set(error, -errno,
> 
> Which one sets this errno? Here, it sets rte_errno because matched vtep
libmnl sets it while processing the Netlink reply message (callback.c in the libmnl sources).

> can't be used as it already has outer iface attached (error message isn't clear,
> please reword it too). I thought this should be EEXIST but you set errno to
> rte_errno but errno isn't valid at this point.
> 
> >
> > >
> > > > +		}
> > > > +	}
> > > > +	if (vtep) {
> > > > +		vtep->refcnt++;
> > > > +		assert(vtep->ifindex);
> > > > +		return vtep->ifindex;
> > > > +	} else {
> > > > +		return 0;
> > > > +	}
> > >
> > > Why repeating same if/else?
> > >
> > >
> > > This is my suggestion but if you take my suggestion to have
> > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > Agree.
> >
> > >
> > > {
> > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > >
> > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > 		if (vtep->port == port)
> > > 			break;
> > > 	}
> > > 	if (vtep && vtep->ifouter)
> > > 		return rte_flow_error_set(... EEXIST ...);
> > > 	else if (vtep) {
> > > 		++vtep->refcnt;
> > > 	} else {
> > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > 		if (!vtep)
> > > 			return rte_flow_error_set(...);
> > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > 	}
> > > 	assert(vtep->ifindex);
> > > 	return vtep->ifindex;
> > > }
> > >
> > >
> > > > +}
> > > > +
> > > > +/**
> > > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifouter
> > > > + *   Network interface index to attach VXLAN encap device to.
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + * @return
> > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > + */
> > > > +static unsigned int
> > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > +			    unsigned int ifouter,
> > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > +			    struct rte_flow_error *error) {
> > > > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > +
> > > > +	assert(ifouter);
> > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > +	vtep = NULL;
> > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +		if (vlst->ifouter == ifouter) {
> > > > +			vtep = vlst;
> > > > +			break;
> > > > +		}
> > > > +	}
> > >
> > > Same here.
> > >
> > > > +	if (!vtep) {
> > > > +		uint16_t pcnt;
> > > > +
> > > > +		/* Not found, we should create the new attached VTEP. */
> > > > +/*
> > > > + * TODO: not implemented yet
> > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > >
> > > Personal note is not appropriate even though it is removed in the
> > > following patch.
> > >
> > > > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > > > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> > > pcnt++) {
> > > > +			encap_port++;
> > > > +			/* Wraparound the UDP port index. */
> > > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> > > ||
> > > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > > +				encap_port =
> > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > +			/* Check whether UDP port is in already in use. */
> > > > +			vtep = NULL;
> > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +				if (vlst->port == encap_port) {
> > > > +					vtep = vlst;
> > > > +					break;
> > > > +				}
> > > > +			}
> > >
> > > If you want to find out an empty port number, you can use rte_bitmap
> > > instead of repeating searching the entire list for all possible port
> numbers.
> >
> > We do not expect too many VXLAN devices have been created. bitmap.
> 
> +1, valid point.
> 
> > > > +			if (vtep) {
> > > > +				vtep = NULL;
> > > > +				continue;
> > > > +			}
> > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > +						     encap_port, error);
> > > > +			if (vtep) {
> > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > > next);
> > > > +				break;
> > > > +			}
> > > > +			if (rte_errno != EEXIST)
> > > > +				break;
> > > > +		}
> > > > +	}
> > > > +	if (!vtep)
> > > > +		return 0;
> > > > +	vtep->refcnt++;
> > > > +	assert(vtep->ifindex);
> > > > +	return vtep->ifindex;
> > >
> > > Please refactor this func according to what I suggested for
> > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > >
> > > > +}
> > > > +
> > > > +/**
> > > > + * Creates target interface index for tunneling of any type.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifouter
> > > > + *   Network interface index to attach VXLAN encap device to.
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + * @param[out] error
> > > > + *   Perform verbose error reporting if not NULL.
> > > > + * @return
> > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > >
> > >  *   Interface index on success, a negative errno value otherwise and
> > >  *   rte_errno is set.
> > >
> > > > + */
> > > > +static unsigned int
> > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > +			    unsigned int ifouter,
> > > > +			    struct mlx5_flow *dev_flow,
> > > > +			    struct rte_flow_error *error) {
> > > > +	unsigned int ret;
> > > > +
> > > > +	assert(dev_flow->tcf.tunnel);
> > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > +						 dev_flow, error);
> > > > +		break;
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > > +		break;
> > > > +	default:
> > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > NULL,
> > > > +				"unsupported tunnel type");
> > > > +		ret = 0;
> > > > +		break;
> > > > +	}
> > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Deletes tunneling interface by UDP port.
> > > > + *
> > > > + * @param[in] tcf
> > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > + * @param[in] ifindex
> > > > + *   Network interface index of VXLAN device.
> > > > + * @param[in] dev_flow
> > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > + */
> > > > +static void
> > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > +			    unsigned int ifindex,
> > > > +			    struct mlx5_flow *dev_flow) {
> > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > +
> > > > +	assert(dev_flow->tcf.tunnel);
> > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > +	vtep = NULL;
> > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > +		if (vlst->ifindex == ifindex) {
> > > > +			vtep = vlst;
> > > > +			break;
> > > > +		}
> > > > +	}
> > >
> > > It is weird. You just can have vtep pointer in the
> > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > vtep->ifindex like the assertion below. Then, this lookup can be skipped.
> >
> > OK. Good optimization.
> >
> > >
> > > > +	if (!vtep) {
> > > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > > +		goto exit;
> > > > +	}
> > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > +		break;
> > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > +/*
> > > > + * TODO: Remove the encap ancillary rules first.
> > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> > >
> > > Is it a personal note? Please remove.
> > OK.
> >
> > >
> > > > +		break;
> > > > +	default:
> > > > +		assert(false);
> > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > +		break;
> > > > +	}
> > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > +	assert(vtep->refcnt);
> > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > +		LIST_REMOVE(vtep, next);
> > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > +	}
> > > > +exit:
> > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > +}
> > > > +
> > > >  /**
> > > >   * Apply flow to E-Switch by sending Netlink message.
> > > >   *
> > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > >  	       struct rte_flow_error *error)  {
> > > >  	struct priv *priv = dev->data->dev_private;
> > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > >  	struct mlx5_flow *dev_flow;
> > > >  	struct nlmsghdr *nlh;
> > > > +	int ret;
> > > >
> > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > >  	/* E-Switch flow can't be expanded. */
> > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > +	if (dev_flow->tcf.applied)
> > > > +		return 0;
> > > >  	nlh = dev_flow->tcf.nlh;
> > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > > NLM_F_EXCL;
> > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > +	if (dev_flow->tcf.tunnel) {
> > > > +		/*
> > > > +		 * Replace the interface index, target for
> > > > +		 * encapsulation, source for decapsulation.
> > > > +		 */
> > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > +		/* Create actual VTEP device when rule is being applied. */
> > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > > +					dev_flow, error);
> > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > +			return -rte_errno;
> > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > +	}
> > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > +	if (dev_flow->tcf.tunnel) {
> > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > >
> > > ifindex_org looks a temporary storage in this code. And this kind of
> > > hassle
> > > (replace/restore) is there because you took the ifindex from the
> > > netlink message. Why don't you have just
> > >
> > > struct mlx5_flow_tcf_tunnel_hdr {
> > > 	uint32_t type; /**< Tunnel action type. */
> > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > > */ };
> > >
> > > and don't change ifindex?
> >
> > I propose to use the local variable for ifindex_org and do not keep it
> > in structure. *ifindex_ptr will keep.
> 
> Well, you still have to restore the ifindex whenever sending the nl msg. Most
> of all, ifindex_ptr in nl msg isn't a right place to store the ifindex. 
It is stored there for rules w/o tunnels. It is its "native" place; I'd prefer
not to create a new location to store the original index, which also saves some
space. We have to swap indices only if the rule has requested tunneling. We can
not set the tunnel index permanently, because a rule can be applied/removed/reapplied
and a new VXLAN device with a new index can be created in between.
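
Roughly, with the original index kept in a local variable, the swap
could be sketched as below. This is an illustration only: struct
demo_msg, demo_send() and send_with_tunnel_ifindex() are hypothetical
names, and the callback stands in for the Netlink send.

```c
#include <stddef.h>

/* Hypothetical message: ifindex is the field inside the Netlink buffer. */
struct demo_msg {
	unsigned int ifindex;
	unsigned int seen; /* Value observed at send time. */
};

typedef int (*demo_send_fn)(void *ctx);

/* Stand-in for the Netlink send: records the ifindex being sent. */
static int
demo_send(void *ctx)
{
	struct demo_msg *m = ctx;

	m->seen = m->ifindex;
	return 0;
}

/*
 * Put the tunnel ifindex into the message only for the duration of
 * the send, then restore the original value from a local variable.
 */
static int
send_with_tunnel_ifindex(unsigned int *ifindex_ptr, unsigned int ifindex_tun,
			 demo_send_fn send, void *ctx)
{
	unsigned int ifindex_org = *ifindex_ptr;
	int ret;

	*ifindex_ptr = ifindex_tun;
	ret = send(ctx);
	*ifindex_ptr = ifindex_org;
	return ret;
}
```

The message keeps the original index between applies, so a reapplied
rule can pick up a newly created VXLAN device without any stale state.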

> have vtep ifindex but it just temporarily keeps the device ifindex until vtep is
> created/found.
> 
> Thanks,
> Yongseok

With best regards,
Slava


* Re: [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-26  8:39           ` Slava Ovsiienko
@ 2018-10-26 21:56             ` Yongseok Koh
  2018-10-29  9:33               ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-26 21:56 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Fri, Oct 26, 2018 at 01:39:38AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Friday, October 26, 2018 6:07
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> > routine
> > 
> > On Thu, Oct 25, 2018 at 06:53:11AM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Tuesday, October 23, 2018 13:05
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> > > > routine
> > > >
> > > > On Mon, Oct 15, 2018 at 02:13:30PM +0000, Viacheslav Ovsiienko wrote:
> > [...]
> > > > > @@ -1114,7 +1733,6 @@ struct pedit_parser {
> > > > >  							   error);
> > > > >  			if (ret < 0)
> > > > >  				return ret;
> > > > > -			item_flags |= MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> > > > >  			mask.ipv4 = flow_tcf_item_mask
> > > > >  				(items, &rte_flow_item_ipv4_mask,
> > > > >  				 &flow_tcf_mask_supported.ipv4,
> > > > > @@ -1135,13 +1753,22 @@ struct pedit_parser {
> > > > >  				next_protocol =
> > > > >  					((const struct rte_flow_item_ipv4 *)
> > > > >  					 (items->spec))->hdr.next_proto_id;
> > > > > +			if (item_flags &
> > > > MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
> > > > > +				/*
> > > > > +				 * Multiple outer items are not allowed as
> > > > > +				 * tunnel parameters, will raise an error later.
> > > > > +				 */
> > > > > +				ipv4 = NULL;
> > > >
> > > > Can't it be inner then?
> > > AFAIK,  no for tc rules, we can not specify multiple levels (inner + outer) for
> > them.
> > > There is just no TCA_FLOWER_KEY_xxx attributes  for specifying inner
> > items
> > > to match by flower.
> > 
> > When I briefly read the kernel code, I thought TCA_FLOWER_KEY_* are for
> > inner
> > header before decap. I mean TCA_FLOWER_KEY_IPV4_SRC is for inner L3
> > and
> > TCA_FLOWER_KEY_ENC_IPV4_SRC is for outer tunnel header. Please do
> > some
> > experiments with tc-flower command.
> 
> Hm. Interesting. I will check.
> 
> > > It is quite unclear comment, not the best one, sorry. I did not like it too,
> > > just forgot to rewrite.
> > >
> > > ipv4, ipv6 , udp variables gather the matching items during the item list
> > scanning,
> > > later variables are used for VXLAN decap action validation only. So, the
> > "outer"
> > > means that ipv4 variable contains the VXLAN decap outer addresses, and
> > > should be NULL-ed if multiple items are found in the items list.
> > >
> > > But we can generate an error here if we have valid action_flags
> > > (gathered by prepare function) and VXLAN decap is set. Raising
> > > an error looks more relevant and clear.
> > 
> > You can't use flags at this point. It is validate() so prepare() might not be
> > preceded.
> > 
> > > >   flow create 1 ingress transfer
> > > >     pattern eth src is 66:77:88:99:aa:bb
> > > >       dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
> > > >       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
> > > >       eth / ipv6 / tcp dst is 42 / end
> > > >     actions vxlan_decap / port_id id 2 / end
> > > >
> > > > Is this flow supported by linux tcf? I took this example from Adrien's
> > patch -
> > > > "[8/8] net/mlx5: add VXLAN decap support to switch flow rules". If so,
> > isn't it
> > > > possible to have inner L3 layer (MLX5_FLOW_LAYER_INNER_*)? If not,
> > you
> > > > should return error in this case. I don't see any code to check redundant
> > > > outer items.
> > > > Did I miss something?
> > >
> > > Interesting, besides rule has correct syntax, I'm not sure whether it can be
> > applied w/o errors.
> > 
> > Please try. You own this patchset. However, you can just prohibit such flows
> > (tunneled item) and come up with follow-up patches to enable it later if it is
> > supported by tcf, as this whole patchset itself is already pretty huge and we
> > don't
> > have much time.
> > 
> > > At least our current flow_tcf_translate() implementation does not support
> > any INNERs.
> > > But it seems the flow_tcf_validate() does, it's subject to recheck - we
> > should not allow
> > > unsupported items to pass the validation. I'll check and provide the
> > separate bugfix patch
> > > (if any).
> > 
> > Neither has tunnel support. It is the first time to add tunnel support to TCF.
> > If it was needed, you should've added it, not skipping it.
> > 
> > You can check how MLX5_FLOW_LAYER_TUNNEL is used in Verbs/DV as a
> > reference.
> 
> Yes. I understood your point. Will check and add tunnel support for TCF rules.
> Anyway, inner MAC addresses are supported for VXLAN decap, I think we should
> specify these ones in the rule as inners (after VNI item),  definitely
> some tunnel support in validate/parse/translate should be added.
> 
> > 
> > > > BTW, for the tunneled items, why don't you follow the code of
> > > > Verbs(mlx5_flow_verbs.c) and DV(mlx5_flow_dv.c)? For tcf, it is the first
> > time
> > > For VXLAN it has some specifics (warning about ignored params, etc.)
> > > I've checked which of verbs/dv code could be reused and did not
> > discovered
> > > a lot. I'll recheck the latest code commits, possible it became more
> > appropriate
> > > for VXLAN.
> > 
> > Agreed. I'm not forcing you to do it because we run out of time but
> > mentioned it
> > because if there's any redundancy in our code, that usually causes bug later.
> > Let's not waste too much time for that. Just grab low hanging fruits if any.
> > 
> > > > to add tunneled item, but Verbs/DV already have validation code for
> > tunnel,
> > > > so you can reuse the existing code. In flow_tcf_validate_vxlan_decap(),
> > not
> > > > every validation is VXLAN-specific but some of them can be common
> > code.
> > > >
> > > > And if you need to know whether there's the VXLAN decap action prior to
> > > > outer header item validation, you can relocate the code - action
> > validation
> > > > first and item validation next, as there's no dependency yet in the current
> > >
> > > We can not validate action first - we need items to be preliminary
> > gathered,
> > > to check them in action's specific fashion and to check action itself.
> > > I mean, if we see VXLAN decap action, we should check the presence of
> > > L2, L3, L4 and VNI items. I minimized the number of passes along the item
> > > and action lists. BTW, Adrien's approach performed two passes, mine does
> > only one.
> > >
> > > > code. Defining ipv4, ipv6, udp seems to make the code path more
> > complex.
> > > Yes, but it allows us to avoid the extra item list scanning and minimizes the
> > changes
> > > of existing code.
> > > In your approach we should:
> > > - scan actions, w/o full checking, just action_flags gathering and checking
> > > - scan items, performing variating check (depending on gathered action
> > flags)
> > > - scan actions again, performing full check with params (at least for now
> > > check whether all params gathered)
> > 
> > Disagree. flow_tcf_validate_vxlan_encap() doesn't even need any info of
> > items
> > and flow_tcf_validate_vxlan_decap() needs item_flags to check whether
> > VXLAN
> > item is there or not and ipv4/ipv6/udp are all for item checks. Let me give
> > you
> > very detailed example:
> > 
> > {
> > 	for (actions[]...) {
> > 		...
> > 		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > 			...
> > 			flow_tcf_validate_vxlan_encap();
> > 			...
> > 			break;
> > 		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > 			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> > 					   | MLX5_ACTION_VXLAN_DECAP))
> > 				return rte_flow_error_set
> > 					(error, ENOTSUP,
> > 					 RTE_FLOW_ERROR_TYPE_ACTION,
> > 					 actions,
> > 					 "can't have multiple vxlan actions");
> > 			/* Don't call flow_tcf_validate_vxlan_decap(). */
> > 			action_flags |= MLX5_ACTION_VXLAN_DECAP;
> > 			break;
> > 	}
> > 	for (items[]...) {
> > 		...
> > 		case RTE_FLOW_ITEM_TYPE_IPV4:
> > 			/* Existing common validation. */
> > 			...
> > 			if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > 				/* Do ipv4 validation in
> > 				 * flow_tcf_validate_vxlan_decap(). */
> > 			}
> > 			break;
> > 	}
> > }
> > 
> > Currently you are doing,
> > 
> > 	- validate items
> > 	- validate actions
> > 	- validate items again if decap.
> > 
> > But this can simply be
> > 
> > 	- validate actions
> How could we validate VXLAN decap at this stage,
> as we do not have item_flags set yet?
> Am I missing something?

Look at my pseudo code above.
Nothing much to be done in validating decap action. And item validation for
decap can be done together in item validation code.
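
The ordering in the pseudo code can be sketched as a compilable toy
validator. Everything here (enums, flag names, the validate() helper) is
a hypothetical reduction, not the PMD code; the final presence check
after the item loop is an assumption about where that check would live:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, reduced stand-ins for RTE_FLOW_ACTION_TYPE_* and item types. */
enum action_type { ACT_END, ACT_VXLAN_ENCAP, ACT_VXLAN_DECAP, ACT_PORT_ID };
enum item_type { ITEM_END, ITEM_ETH, ITEM_IPV4, ITEM_UDP, ITEM_VXLAN };

#define F_ACT_VXLAN_ENCAP (UINT64_C(1) << 0)
#define F_ACT_VXLAN_DECAP (UINT64_C(1) << 1)
#define F_ITEM_OUTER_IPV4 (UINT64_C(1) << 0)
#define F_ITEM_OUTER_UDP  (UINT64_C(1) << 1)
#define F_ITEM_VXLAN      (UINT64_C(1) << 2)

/* Returns 0 on success, -1 on validation failure. */
static int
validate(const enum item_type items[], const enum action_type actions[])
{
	uint64_t action_flags = 0;
	uint64_t item_flags = 0;
	size_t i;

	/* Pass 1: actions only gather flags and reject duplicates. */
	for (i = 0; actions[i] != ACT_END; i++) {
		switch (actions[i]) {
		case ACT_VXLAN_ENCAP:
		case ACT_VXLAN_DECAP:
			if (action_flags &
			    (F_ACT_VXLAN_ENCAP | F_ACT_VXLAN_DECAP))
				return -1; /* Multiple vxlan actions. */
			action_flags |= actions[i] == ACT_VXLAN_ENCAP ?
					F_ACT_VXLAN_ENCAP : F_ACT_VXLAN_DECAP;
			break;
		default:
			break;
		}
	}
	/* Pass 2: items, with decap-specific checks done in place. */
	for (i = 0; items[i] != ITEM_END; i++) {
		switch (items[i]) {
		case ITEM_IPV4:
			if ((action_flags & F_ACT_VXLAN_DECAP) &&
			    (item_flags & F_ITEM_OUTER_IPV4))
				return -1; /* Redundant outer L3 item. */
			item_flags |= F_ITEM_OUTER_IPV4;
			break;
		case ITEM_UDP:
			item_flags |= F_ITEM_OUTER_UDP;
			break;
		case ITEM_VXLAN:
			item_flags |= F_ITEM_VXLAN;
			break;
		default:
			break;
		}
	}
	/* Decap needs outer L3, L4 and the VXLAN item to be present. */
	if (action_flags & F_ACT_VXLAN_DECAP) {
		const uint64_t need = F_ITEM_OUTER_IPV4 | F_ITEM_OUTER_UDP |
				      F_ITEM_VXLAN;

		if ((item_flags & need) != need)
			return -1;
	}
	return 0;
}
```

Each list is walked exactly once, and no item pointers (ipv4/ipv6/udp)
have to be carried from the item loop into the action loop.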

Thanks,
Yongseok

> 
> > 	- validate items
> > 


* Re: [dpdk-dev] [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation routine
  2018-10-26  9:06           ` Slava Ovsiienko
@ 2018-10-26 22:10             ` Yongseok Koh
  0 siblings, 0 replies; 110+ messages in thread
From: Yongseok Koh @ 2018-10-26 22:10 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Fri, Oct 26, 2018 at 02:06:53AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Friday, October 26, 2018 7:22
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow translation
> > routine
> > 
> > On Thu, Oct 25, 2018 at 07:37:56AM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Tuesday, October 23, 2018 13:06
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 3/7] net/mlx5: e-switch VXLAN flow
> > > > translation routine
> > > >
> > > > On Mon, Oct 15, 2018 at 02:13:31PM +0000, Viacheslav Ovsiienko wrote:
> > [...]
> > > > > @@ -2184,13 +2447,16 @@ struct pedit_parser {
> > > > >   *   Pointer to the list of actions.
> > > > >   * @param[out] action_flags
> > > > >   *   Pointer to the detected actions.
> > > > > + * @param[out] tunnel
> > > > > + *   Pointer to tunnel encapsulation parameters structure to fill.
> > > > >   *
> > > > >   * @return
> > > > >   *   Maximum size of memory for actions.
> > > > >   */
> > > > >  static int
> > > > >  flow_tcf_get_actions_and_size(const struct rte_flow_action actions[],
> > > > > -			      uint64_t *action_flags)
> > > > > +			      uint64_t *action_flags,
> > > > > +			      void *tunnel)
> > > >
> > > > This func is to get actions and size but you are parsing and filling
> > > > tunnel info here. It would be better to move parsing to translate()
> > > > because it anyway has multiple if conditions (same as switch/case)
> > > > to set TCA_TUNNEL_KEY_ENC_* there.
> > > Do you mean call of flow_tcf_vxlan_encap_parse(actions, tunnel)?
> > 
> > Yes.
> > 
> > > OK, let's move it to translate stage. Anyway, we need to keep encap
> > > structure for local/neigh rules.
> > >
> > > >
> > > > >  {
> > > > >  	int size = 0;
> > > > >  	uint64_t flags = 0;
> > > > > @@ -2246,6 +2512,29 @@ struct pedit_parser {
> > > > >  				SZ_NLATTR_TYPE_OF(uint16_t) + /* VLAN ID.
> > > > */
> > > > >  				SZ_NLATTR_TYPE_OF(uint8_t); /* VLAN prio.
> > > > */
> > > > >  			break;
> > > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > > > */
> > > > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > > > +			size += SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > > > +			size +=	flow_tcf_vxlan_encap_parse(actions, tunnel)
> > > > +
> > > > > +				RTE_ALIGN_CEIL /* preceding encap params.
> > > > */
> > > > > +				(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > > > +				MNL_ALIGNTO);
> > > >
> > > > Is it different from SZ_NLATTR_TYPE_OF(struct
> > > > mlx5_flow_tcf_vxlan_encap)? Or, use __rte_aligned(MNL_ALIGNTO)
> > instead.
> > >
> > > It is written intentionally in this form. It means that there is
> > > struct mlx5_flow_tcf_vxlan_encap at the beginning of buffer. This is
> > > not the NL attribute, usage of SZ_NLATTR_TYPE_OF is not relevant here.
> > Alignment is needed for the following Netlink message.
> > 
> > Good point. Understood.
> > 
> > > >
> > > > > +			flags |= MLX5_ACTION_VXLAN_ENCAP;
> > > > > +			break;
> > > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > > > +			size += SZ_NLATTR_NEST + /* na_act_index. */
> > > > > +				SZ_NLATTR_STRZ_OF("tunnel_key") +
> > > > > +				SZ_NLATTR_NEST + /* TCA_ACT_OPTIONS.
> > > > */
> > > > > +				SZ_NLATTR_TYPE_OF(uint8_t);
> > > > > +			size +=	SZ_NLATTR_TYPE_OF(struct tc_tunnel_key);
> > > > > +			size +=	RTE_ALIGN_CEIL /* preceding decap params.
> > > > */
> > > > > +				(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > > > +				MNL_ALIGNTO);
> > > >
> > > > Same here.
> > > >
> > > > > +			flags |= MLX5_ACTION_VXLAN_DECAP;
> > > > > +			break;
> > > > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_SRC:
> > > > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV4_DST:
> > > > >  		case RTE_FLOW_ACTION_TYPE_SET_IPV6_SRC:
> > > > > @@ -2289,6 +2578,26 @@ struct pedit_parser {  }
> > > > >
> > > > >  /**
> > > > > + * Convert VXLAN VNI to 32-bit integer.
> > > > > + *
> > > > > + * @param[in] vni
> > > > > + *   VXLAN VNI in 24-bit wire format.
> > > > > + *
> > > > > + * @return
> > > > > + *   VXLAN VNI as a 32-bit integer value in network endian.
> > > > > + */
> > > > > +static rte_be32_t
> > > >
> > > > make it inline.
> > > OK. Missed point.
> > >
> > > >
> > > > > +vxlan_vni_as_be32(const uint8_t vni[3]) {
> > > > > +	rte_be32_t ret;
> > > >
> > > > Defining ret as rte_be32_t? The return value of this func which is
> > > > bswap(ret) is also rte_be32_t??
> > > Yes. And it is directly stored in the net-endian NL attribute.
> > > I've compiled and checked the listing of the function you proposed. It
> > seems to be best, I'll take it.
> > >
> > > >
> > > > > +
> > > > > +	ret = vni[0];
> > > > > +	ret = (ret << 8) | vni[1];
> > > > > +	ret = (ret << 8) | vni[2];
> > > > > +	return RTE_BE32(ret);
> > > >
> > > > Use rte_cpu_to_be_*() instead. But I still don't understand why you
> > > > shuffle bytes twice. One with shift and or and other by bswap().
> > > And it works. There are three bytes in very bizarre order (in NL attribute) -
> > 0, vni[0], vni[1], vni[2].
> > >
> > > >
> > > > {
> > > > 	union {
> > > > 		uint8_t vni[4];
> > > > 		rte_be32_t dword;
> > > > 	} ret = {
> > > > 		.vni = { 0, vni[0], vni[1], vni[2] },
> > > > 	};
> > > > 	return ret.dword;
> > > > }
> > > >
> > > > This will have the same result without extra cost.
> > >
> > > OK. Investigated, it is the best for x86-64. Also I'm going to test it
> > > on the ARM 32, with various compilers, just curious.
> > >
> > > >
> > > > > +}
> > > > > +
> > > > > +/**
> > > > >   * Prepare a flow object for Linux TC flower. It calculates the
> > > > > maximum
> > > > size of
> > > > >   * memory required, allocates the memory, initializes Netlink
> > > > > message
> > > > headers
> > > > >   * and set unique TC message handle.
> > > > > @@ -2323,22 +2632,54 @@ struct pedit_parser {
> > > > >  	struct mlx5_flow *dev_flow;
> > > > >  	struct nlmsghdr *nlh;
> > > > >  	struct tcmsg *tcm;
> > > > > +	struct mlx5_flow_tcf_vxlan_encap encap = {.mask = 0};
> > > > > +	uint8_t *sp, *tun = NULL;
> > > > >
> > > > >  	size += flow_tcf_get_items_and_size(attr, items, item_flags);
> > > > > -	size += flow_tcf_get_actions_and_size(actions, action_flags);
> > > > > -	dev_flow = rte_zmalloc(__func__, size, MNL_ALIGNTO);
> > > > > +	size += flow_tcf_get_actions_and_size(actions, action_flags,
> > > > &encap);
> > > > > +	dev_flow = rte_zmalloc(__func__, size,
> > > > > +			RTE_MAX(alignof(struct mlx5_flow_tcf_tunnel_hdr),
> > > > > +				(size_t)MNL_ALIGNTO));
> > > >
> > > > Why RTE_MAX between the two? Note that it is alignment for start
> > > > address of the memory and the minimum alignment is cacheline size.
> > > > On x86, a non-zero value less than 64 will have the same result as 64.
> > >
> > > OK. Thanks for note.
> > > It is not expected the structure alignments exceed the cache line size.
> > > So? Just specify zero?
> > > >
> > > > >  	if (!dev_flow) {
> > > > >  		rte_flow_error_set(error, ENOMEM,
> > > > >  				   RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > NULL,
> > > > >  				   "not enough memory to create E-Switch
> > > > flow");
> > > > >  		return NULL;
> > > > >  	}
> > > > > -	nlh = mnl_nlmsg_put_header((void *)(dev_flow + 1));
> > > > > +	sp = (uint8_t *)(dev_flow + 1);
> > > > > +	if (*action_flags & MLX5_ACTION_VXLAN_ENCAP) {
> > > > > +		tun = sp;
> > > > > +		sp += RTE_ALIGN_CEIL
> > > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > > > +			MNL_ALIGNTO);
> > > >
> > > > And why should it be aligned?
> > >
> > > Netlink message should be aligned. It follows the
> > > mlx5_flow_tcf_vxlan_encap, that's why the pointer is aligned.
> > 
> > Not true. There's no requirement for nl msg buffer alignment.
> > MNL_ALIGNTO is for mainly size alignment. For example, checkout the
> > source code of mnl_nlmsg_put_header(void *buf). There's no requirement of
> > aligning the start address of buf. But, size of any entries (hdr, attr ...) should
> > be aligned to MNL_ALIGNTO(4).
> 
> Formally speaking, yes. There is no explicit requirement for the header
> alignment. And the entire message goes to the send(), it does not care about alignment.
> But not aligning the entire structure does not look as a good practice.
> ( I had been living for a long time on embedded systems with activated
> Alignment Check feature and off unaligned access compiler flags. 
> There was not very long waiting time to get punishing exception. )

As mentioned before, I don't have any objection to the alignment.

> > > As the size of dev_flow might not be aligned, it
> > > > is meaningless, isn't it? If you think it must be aligned for better
> > > > performance (not much anyway), you can use
> > > > __rte_aligned(MNL_ALIGNTO) on the struct
> > > Hm. Where we can use __rte_aligned? Could you clarify, please.
> > 
> > For example,
> > 
> > struct mlx5_flow_tcf_tunnel_hdr {
> > 	uint32_t type; /**< Tunnel action type. */
> > 	unsigned int ifindex_tun; /**< Tunnel endpoint interface. */
> > 	unsigned int ifindex_org; /**< Original dst/src interface */
> > 	unsigned int *ifindex_ptr; /**< Interface ptr in message. */
> > } __rte_aligned(MNL_ALIGNTO);
> 
> No. tunnel_hdr should not know anything about NL message.
> It happens, we have the NL message follows the tunnel_hdr 
> in some our memory buf. What if we would like to add some other
> object after tunnel_hdr in buffer? Not NL message? 
> The alignment of objects is the duty of the code which places objects into the buffer.
> Objects can be very different, with different alignment requirements,
> and, generally speaking, placed in arbitrary order. Why, while
> declaring the tunnel_hdr structure, should we make an assumption that
> it is always followed by an NL message? __rte_aligned(MNL_ALIGNTO) at the end
> of tunnel_hdr is exactly an example of that inappropriate assumption.

Yeah, I agree. That was just an example.

And my point was the original code isn't enough to achieve the alignment as the
size of dev_flow isn't aligned. You should've done something like:

	tun = RTE_PTR_ALIGN(dev_flow + 1, MNL_ALIGNTO);

In summary, if you want to make it aligned, please fix the issue I raised and
improve readability of the code.
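
To make the point concrete, here is a small sketch of why rounding up
only the sizes does not align the following object when the preceding
header size is not a multiple of the alignment. The helpers are
hypothetical plain-C equivalents of RTE_PTR_ALIGN() and RTE_ALIGN_CEIL()
for power-of-two alignments, and ALIGNTO stands in for MNL_ALIGNTO:

```c
#include <stdint.h>
#include <stddef.h>

#define ALIGNTO 4u /* Same value as libmnl's MNL_ALIGNTO. */

/* Hypothetical equivalent of RTE_PTR_ALIGN(): round a pointer up. */
static inline void *
ptr_align_ceil(const void *ptr, uintptr_t align)
{
	return (void *)(((uintptr_t)ptr + align - 1) & ~(align - 1));
}

/* Hypothetical equivalent of RTE_ALIGN_CEIL(): round a size up. */
static inline size_t
size_align_ceil(size_t size, size_t align)
{
	return (size + align - 1) & ~(align - 1);
}
```

With an aligned base and a 10-byte header, base + 10 is not 4-byte
aligned no matter how the sizes of later objects are rounded, while
ptr_align_ceil(base + 10, ALIGNTO) yields base + 12. The buffer size
calculation then has to account for the up to ALIGNTO - 1 bytes of
padding this may add.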

> > A good example is the struct rte_mbuf. If this attribute is used, the size of the
> > struct will be aligned to the value.
> > 
> > If you still want to make the nl msg aligned,
> > 
> > 	dev_flow = rte_zmalloc(..., MNL_ALIGNTO); /* anyway cacheline
> > aligned. */
> > 	tun = RTE_PTR_ALIGN(dev_flow + 1, MNL_ALIGNTO);
> > 	nlh = mnl_nlmsg_put_header(tun);
> > 
> > with adding '__rte_aligned(MNL_ALIGNTO)' to struct
> > mlx5_flow_tcf_vxlan_encap/decap.
> > 
> > Then, nlh will be aligned. You should make sure size is correctly calculated.
> > 
> > >
> > > > definition but not for mlx5_flow (it's not only for tcf, have to do it
> > manually).
> > > >
> > > > > +		size -= RTE_ALIGN_CEIL
> > > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_encap),
> > > > > +			MNL_ALIGNTO);
> > > >
> > > > Don't you have to subtract sizeof(struct mlx5_flow) as well? But
> > > > like I mentioned, if '.nlsize' below isn't needed, you don't need to
> > > > have this calculation either.
> > > Yes, it is a bug. Should be fixed. Thank you.
> > > Let's discuss whether we can keep the nlsize under NDEBUG switch.
> > 
> > I agreed on using NDEBUG for it.
> > 
> > >
> > > >
> > > > > +		encap.hdr.type =
> > > > MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP;
> > > > > +		memcpy(tun, &encap,
> > > > > +		       sizeof(struct mlx5_flow_tcf_vxlan_encap));
> > > > > +	} else if (*action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > > > > +		tun = sp;
> > > > > +		sp += RTE_ALIGN_CEIL
> > > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > > > +			MNL_ALIGNTO);
> > > > > +		size -= RTE_ALIGN_CEIL
> > > > > +			(sizeof(struct mlx5_flow_tcf_vxlan_decap),
> > > > > +			MNL_ALIGNTO);
> > > > > +		encap.hdr.type =
> > > > MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP;
> > > > > +		memcpy(tun, &encap,
> > > > > +		       sizeof(struct mlx5_flow_tcf_vxlan_decap));
> > > > > +	}
> > > > > +	nlh = mnl_nlmsg_put_header(sp);
> > > > >  	tcm = mnl_nlmsg_put_extra_header(nlh, sizeof(*tcm));
> > > > >  	*dev_flow = (struct mlx5_flow){
> > > > >  		.tcf = (struct mlx5_flow_tcf){
> > > > > +			.nlsize = size,
> > > > >  			.nlh = nlh,
> > > > >  			.tcm = tcm,
> > > > > +			.tunnel = (struct mlx5_flow_tcf_tunnel_hdr *)tun,
> > > > > +			.item_flags = *item_flags,
> > > > > +			.action_flags = *action_flags,
> > > > >  		},
> > > > >  	};
> > > > >  	/*
> > [...]
> > > > > @@ -2827,6 +3268,76 @@ struct pedit_parser {
> > > > >  					(na_vlan_priority) =
> > > > >  					conf.of_set_vlan_pcp->vlan_pcp;
> > > > >  			}
> > > > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > > > +			break;
> > > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > > > +			assert(decap.vxlan);
> > > > > +			assert(dev_flow->tcf.tunnel);
> > > > > +			dev_flow->tcf.tunnel->ifindex_ptr
> > > > > +				= (unsigned int *)&tcm->tcm_ifindex;
> > > > > +			na_act_index =
> > > > > +				mnl_attr_nest_start(nlh,
> > > > na_act_index_cur++);
> > > > > +			assert(na_act_index);
> > > > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > > > "tunnel_key");
> > > > > +			na_act = mnl_attr_nest_start(nlh,
> > > > TCA_ACT_OPTIONS);
> > > > > +			assert(na_act);
> > > > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > > > +				sizeof(struct tc_tunnel_key),
> > > > > +				&(struct tc_tunnel_key){
> > > > > +					.action = TC_ACT_PIPE,
> > > > > +					.t_action =
> > > > TCA_TUNNEL_KEY_ACT_RELEASE,
> > > > > +					});
> > > > > +			mnl_attr_nest_end(nlh, na_act);
> > > > > +			mnl_attr_nest_end(nlh, na_act_index);
> > > > > +			assert(dev_flow->tcf.nlsize >= nlh->nlmsg_len);
> > > > > +			break;
> > > > > +		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > > > +			assert(encap.vxlan);
> > > > > +			na_act_index =
> > > > > +				mnl_attr_nest_start(nlh,
> > > > na_act_index_cur++);
> > > > > +			assert(na_act_index);
> > > > > +			mnl_attr_put_strz(nlh, TCA_ACT_KIND,
> > > > "tunnel_key");
> > > > > +			na_act = mnl_attr_nest_start(nlh,
> > > > TCA_ACT_OPTIONS);
> > > > > +			assert(na_act);
> > > > > +			mnl_attr_put(nlh, TCA_TUNNEL_KEY_PARMS,
> > > > > +				sizeof(struct tc_tunnel_key),
> > > > > +				&(struct tc_tunnel_key){
> > > > > +					.action = TC_ACT_PIPE,
> > > > > +					.t_action =
> > > > TCA_TUNNEL_KEY_ACT_SET,
> > > > > +					});
> > > > > +			if (encap.vxlan->mask &
> > > > MLX5_FLOW_TCF_ENCAP_UDP_DST)
> > > > > +				mnl_attr_put_u16(nlh,
> > > > > +					 TCA_TUNNEL_KEY_ENC_DST_PORT,
> > > > > +					 encap.vxlan->udp.dst);
> > > > > +			if (encap.vxlan->mask &
> > > > MLX5_FLOW_TCF_ENCAP_IPV4_SRC)
> > > > > +				mnl_attr_put_u32(nlh,
> > > > > +					 TCA_TUNNEL_KEY_ENC_IPV4_SRC,
> > > > > +					 encap.vxlan->ipv4.src);
> > > > > +			if (encap.vxlan->mask &
> > > > MLX5_FLOW_TCF_ENCAP_IPV4_DST)
> > > > > +				mnl_attr_put_u32(nlh,
> > > > > +					 TCA_TUNNEL_KEY_ENC_IPV4_DST,
> > > > > +					 encap.vxlan->ipv4.dst);
> > > > > +			if (encap.vxlan->mask &
> > > > MLX5_FLOW_TCF_ENCAP_IPV6_SRC)
> > > > > +				mnl_attr_put(nlh,
> > > > > +					 TCA_TUNNEL_KEY_ENC_IPV6_SRC,
> > > > > +					 sizeof(encap.vxlan->ipv6.src),
> > > > > +					 &encap.vxlan->ipv6.src);
> > > > > +			if (encap.vxlan->mask &
> > > > MLX5_FLOW_TCF_ENCAP_IPV6_DST)
> > > > > +				mnl_attr_put(nlh,
> > > > > +					 TCA_TUNNEL_KEY_ENC_IPV6_DST,
> > > > > +					 sizeof(encap.vxlan->ipv6.dst),
> > > > > +					 &encap.vxlan->ipv6.dst);
> > > > > +			if (encap.vxlan->mask &
> > > > MLX5_FLOW_TCF_ENCAP_VXLAN_VNI)
> > > > > +				mnl_attr_put_u32(nlh,
> > > > > +					 TCA_TUNNEL_KEY_ENC_KEY_ID,
> > > > > +					 vxlan_vni_as_be32
> > > > > +						(encap.vxlan->vxlan.vni));
> > > > > +#ifdef TCA_TUNNEL_KEY_NO_CSUM
> > > > > +			mnl_attr_put_u8(nlh, TCA_TUNNEL_KEY_NO_CSUM, 0);
> > > > > +#endif
> > > >
> > > > TCA_TUNNEL_KEY_NO_CSUM is anyway defined like others, then why do
> > > > you treat it differently with #ifdef/#endif?
> > >
> > > As it was found it is not defined on old kernels, on some our CI
> > > machines compilation errors occurred.
> > 
> > In your first patch, TCA_TUNNEL_KEY_NO_CSUM is defined if there isn't
> > HAVE_TC_ACT_TUNNEL_KEY. Actually I'm wondering why it is different from
> > HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT. It looks like the following is
> > needed - HAVE_TCA_TUNNEL_KEY_NO_CSUM ??
> > 
> > 
> > 	#ifdef HAVE_TC_ACT_TUNNEL_KEY
> > 
> > 	#include <linux/tc_act/tc_tunnel_key.h>
> > 
> > 	#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
> > 	#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
> > 	#endif
> > 
> > 	#ifndef HAVE_TCA_TUNNEL_KEY_NO_CSUM
> > 	#define TCA_TUNNEL_KEY_NO_CSUM 10
> > 	#endif
> 
> I think it is subject to check. Yes, we can define the "missing"
> macros, but it seems the old kernel just does not know these
> keys. Whether the rule with these keys is accepted by kernel?
> I did not check (have no host with old setup to check),
> I'd prefer to exclude not very significant key to lower the
> rule rejection risk. 

My question is why the two missing macros (TCA_TUNNEL_KEY_ENC_DST_PORT and
TCA_TUNNEL_KEY_NO_CSUM) are treated differently. AFAIK, the reason for defining
them manually for an old kernel is that the binary can run on a different host from the
compile host. Even though the compile machine doesn't have the feature, it
can be supported on the machine it runs on. If the kernel doesn't understand it on
an old machine, an error will be returned, which is fine.
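
A sketch of treating both attributes the same way, using the values from
the snippet quoted above. It assumes the HAVE_TCA_TUNNEL_KEY_* macros
are emitted by the build system when the installed kernel header already
declares the attribute; HAVE_TCA_TUNNEL_KEY_NO_CSUM is the hypothetical
new one:

```c
/*
 * Assumption: HAVE_TCA_TUNNEL_KEY_* is defined by the build system only
 * when the compile host's kernel header already provides the attribute.
 * Otherwise fall back to the numeric value so the binary can still use
 * the attribute on a newer runtime kernel.
 */
#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
#endif

#ifndef HAVE_TCA_TUNNEL_KEY_NO_CSUM
#define TCA_TUNNEL_KEY_NO_CSUM 10
#endif
```

With this in place the translate code would not need its own
#ifdef/#endif around the NO_CSUM attribute; a kernel that does not know
the attribute is expected to reject the rule with an error.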

Thanks,
Yongseok

> > 
> > 	#else /* HAVE_TC_ACT_TUNNEL_KEY */
> > 
> > 


* Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-26  9:35           ` Slava Ovsiienko
@ 2018-10-26 22:42             ` Yongseok Koh
  2018-10-29 11:53               ` Slava Ovsiienko
  0 siblings, 1 reply; 110+ messages in thread
From: Yongseok Koh @ 2018-10-26 22:42 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Fri, Oct 26, 2018 at 02:35:24AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Friday, October 26, 2018 9:26
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> > 
> > On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Thursday, October 25, 2018 3:28
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > > management
> > > >
> > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > > > > VXLAN interfaces are dynamically created for each local UDP port
> > > > > of outer networks and then used as targets for TC "flower" filters
> > > > > in order to perform encapsulation. These VXLAN interfaces are
> > > > > system-wide, the only one device with given UDP port can exist in
> > > > > the system (the attempt of creating another device with the same
> > > > > UDP local port returns EEXIST), so PMD should support the shared
> > > > > device instances database for PMD instances. These VXLAN
> > > > > implicitly created devices are called VTEPs (Virtual Tunnel End Points).
> > > > >
> > > > > Creation of the VTEP occurs at the moment of rule applying. The
> > > > > link is set up, root ingress qdisc is also initialized.
> > > > >
> > > > > Encapsulation VTEPs are created on per port basis, the single VTEP
> > > > > is attached to the outer interface and is shared for all
> > > > > encapsulation rules on this interface. The source UDP port is
> > > > > automatically selected in range 30000-60000.
> > > > >
> > > > > For decapsulation one VTEP is created per every unique UDP local
> > > > > port to accept tunnel traffic. The name of created VTEP consists
> > > > > of prefix "vmlx_" and the number of UDP port in decimal digits
> > > > > without leading zeros (vmlx_4789). The VTEP can be created
> > > > > in the system before launching the
> > > > > application, which allows sharing UDP ports between primary
> > > > > and secondary processes.
> > > > >
> > > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > > ---
> > > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > > ++++++++++++++++++++++++++++++++++++++-
> > > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > index d6840d5..efa9c3b 100644
> > > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > > >  	return -err;
> > > > >  }
> > > > >
> > > > > +/* VTEP device list is shared between PMD port instances. */
> > > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> > > > > +static pthread_mutex_t vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > > >
> > > > What's the reason for choosing pthread_mutex instead of rte_*_lock?
> > >
> > > To share this database with secondary processes?
> > 
> > The static variable isn't shared with sec proc. But you can leave it as is.
> 
> Yes. The sharing just was assumed, not implemented yet.
> 
> > 
> > > > > +
> > > > > +/**
> > > > > + * Deletes VTEP network device.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] vtep
> > > > > + *   Object representing the network device to delete. Memory
> > > > > + *   allocated for this object is freed by routine.
> > > > > + */
> > > > > +static void
> > > > > +flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
> > > > > +		      struct mlx5_flow_tcf_vtep *vtep)
> > > > > +{
> > > > > +	struct nlmsghdr *nlh;
> > > > > +	struct ifinfomsg *ifm;
> > > > > +	alignas(struct nlmsghdr)
> > > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > > +	int ret;
> > > > > +
> > > > > +	assert(!vtep->refcnt);
> > > > > +	if (vtep->created && vtep->ifindex) {
> > > >
> > > > First of all vtep->created seems of no use. It is introduced to
> > > > select the error message in flow_tcf_create_iface(). I don't see any
> > > > > necessity to distinguish between 'vtep is allocated by rte_malloc()' and
> > > > > 'vtep is created in kernel'.
> > >
> > > created flag indicates the iface is created by our code.
> > > The VXLAN decap devices must have the specified UDP port, we can not
> > > create multiple VXLAN devices with the same UDP port - EEXIST is
> > > returned. So, we have to share device. One option is create device
> > > before DPDK application launch and use these pre-created devices.
> > > > In this case created flag is not set and VXLAN device is not reinitialized, and
> > > > not deleted.
> > 
> > I can't see any code to use pre-created device (created even before dpdk app
> > launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment
> > in [7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on
> > initialization. Your comment sounds conflicting and confusing.
> 
> There are two types of VXLAN devices:
> 
> - VXLAN decap, not attached to any ifouter. Provides the ingress UDP port,
>  we try to share the devices of this type, because we may be asked for
>  the specified UDP port. No device/rule cleanup and reinit needed.
> 
> - VXLAN encap, should be attached to ifouter to provide strict egress path,
> no need to share - egress UDP port does not matter. And we need to cleanup ifouter,
> remove other attached VXLAN devices and rules, because it is too hard to
> co-exist with some pre-created setup.

I knew that. But how can it justify the need of 'created' field in vtep struct?
In this code, it is of no use. But will see how it is used in your v3.

> > > > And why do you need to check vtep->ifindex as well? If vtep is
> > > > created in kernel and its ifindex isn't set, that should be an error
> > > > > which had to be handled in flow_tcf_create_iface(). Such a vtep shouldn't
> > > > > exist.
> > > Yes, if we did not get ifindex of device - vtep is not created, error returned.
> > > We just can not operate w/o ifindex.
> > 
> > I know ifindex is needed but my question was checking vtep->ifindex here
> > looked redundant/unnecessary. But as you agreed on having
> > create/get/release_iface(), it doesn't matter much.
> 
> Yes. I agree, will refactor the code.
> 
> > 
> > > > Also, the refcnt management is a bit strange. Please put an
> > > > abstraction by adding create_iface(), get_iface() and
> > > > > release_iface(). In the get_iface(),
> > > > > vtep->refcnt should be incremented. And in the release_iface(), it
> > > > > decreases the
> > > OK. Good proposal. I'll refactor the code.
> > >
> > > > refcnt and if it reaches to zero, the iface can be removed.
> > > > create_iface() will set the refcnt to 1. And if you refer to
> > > > > mlx5_hrxq_get(), it even does searching the list not by repeating the
> > > > > same lookup code here and there.
> > > > That will make your code much simpler.
> > > >
> > > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +		if (ret)
> > > > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> > > > "
> > > > > +					 "encap/decap ifindex %u",
> > > > > +					 ifm->ifi_index);
> > > > > +	}
> > > > > +	rte_free(vtep);
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Creates VTEP network device.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifouter
> > > > > + *   Outer interface to attach new-created VXLAN device
> > > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > > + * @param[in] port
> > > > > + *   UDP port of created VTEP device.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + *
> > > > > + * @return
> > > > > + * Pointer to created device structure on success, NULL otherwise
> > > > > + * and rte_errno is set.
> > > > > + */
> > > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > > >
> > > > > Why negative(ifndef) first instead of positive(ifdef)?
> > > Hm. Did I miss the rule. Positive #ifdef first? OK.
> > 
> > No concrete rule but if there's no specific reason, it would be better to start
> > from ifdef.
> > 
> > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> > > > > +		      unsigned int ifouter __rte_unused,
> > > > > +		      uint16_t port __rte_unused,
> > > > > +		      struct rte_flow_error *error) {
> > > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > +			 "netlink: failed to create VTEP, "
> > > > > +			 "VXLAN metadat is not supported by kernel");
> > > >
> > > > Typo.
> > >
> > > OK.  "metadata are not supported".
> > > >
> > > > > +	return NULL;
> > > > > +}
> > > > > +#else
> > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > > >
> > > > How about adding 'vtep'? It sounds vague - creating a general interface.
> > > > E.g., flow_tcf_create_vtep_iface()?
> > >
> > > OK.
> > >
> > > >
> > > > > +		      unsigned int ifouter,
> > > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > > +	struct nlmsghdr *nlh;
> > > > > +	struct ifinfomsg *ifm;
> > > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > > +	alignas(struct nlmsghdr)
> > > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > > >
> > > > Use a macro for '128'. Can't know the meaning.
> > > OK. I think we should calculate the buffer size explicitly.
> > >
> > > >
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > > +		       SZ_NLATTR_NEST * 2 +
> > > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > > +	struct nlattr *na_info;
> > > > > +	struct nlattr *na_vxlan;
> > > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > > >
> > > > Use rte_cpu_to_be_*() instead.
> > >
> > > Yes, I'll recheck the whole code for this issue.
> > >
> > > >
> > > > > +	int ret;
> > > > > +
> > > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > > +	if (!vtep) {
> > > > > +		rte_flow_error_set
> > > > > +			(error, ENOMEM,
> > > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > +			 NULL, "unadble to allocate memory for VTEP desc");
> > > > > +		return NULL;
> > > > > +	}
> > > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > > +			.refcnt = 0,
> > > > > +			.port = port,
> > > > > +			.created = 0,
> > > > > +			.ifouter = 0,
> > > > > +			.ifindex = 0,
> > > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > > +	};
> > > > > +	memset(buf, 0, sizeof(buf));
> > > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > > > NLM_F_EXCL;
> > > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > > +	ifm->ifi_type = 0;
> > > > > +	ifm->ifi_index = 0;
> > > > > +	ifm->ifi_flags = IFF_UP;
> > > > > +	ifm->ifi_change = 0xffffffff;
> > > > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> > > > port);
> > > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > > +	assert(na_info);
> > > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > > +	if (ifouter)
> > > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > > +	assert(na_vxlan);
> > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +	if (ret)
> > > > > +		DRV_LOG(WARNING,
> > > > > +			"netlink: VTEP %s create failure (%d)",
> > > > > +			name, rte_errno);
> > > > > +	else
> > > > > +		vtep->created = 1;
> > > >
> > > > Flow of code here isn't smooth, thus could be error-prone. Most of
> > > > all, I don't like ret has multiple meanings. ret should be return
> > > > value but you are using it to store ifindex.
> > > >
> > > > > +	if (ret && ifouter)
> > > > > +		ret = 0;
> > > > > +	else
> > > > > +		ret = if_nametoindex(name);
> > > >
> > > > If vtep isn't created and ifouter is set, then skip init below,
> > > > which means, if
> > >
> > > ifouter is set for VXLAN encap devices. They should be attached to
> > > > ifouter and can not be shared. So, if ifouter is set - we do not use
> > > > the precreated/existing VXLAN devices. We have to create our own not
> > > > shared device.
> > 
> > In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows.
> > Do you mean it isn't shared between different outer ifaces? If so, that's for
> > sure.
> Sorry, I do not understand the question.
> VXLAN encap device is attached to ifouter and shared by all flows with this
> ifouter. No multiple VXLAN devices are attached to the same ifouter, only one.
> VXLAN decap device has no attached ifouter, so it can not share it.

Yep, that's what I meant.

> > > > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > > > But why do you want to try to call this API even if it failed to create vtep?
> > > > Let's not make code flow convoluted even though it logically works.
> > > > Let's make it straightforward.
> > > >
> > > > > +	if (ret) {
> > > > > +		vtep->ifindex = ret;
> > > > > +		vtep->ifouter = ifouter;
> > > > > +		memset(buf, 0, sizeof(buf));
> > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > +		ifm->ifi_type = 0;
> > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > +		ifm->ifi_flags = IFF_UP;
> > > > > +		ifm->ifi_change = IFF_UP;
> > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +		if (ret) {
> > > > > +			DRV_LOG(WARNING,
> > > > > +				"netlink: VTEP %s set link up failure (%d)",
> > > > > +				name, rte_errno);
> > > > > +			rte_free(vtep);
> > > > > +			rte_flow_error_set
> > > > > +				(error, -errno,
> > > > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > NULL,
> > > > > +				 "netlink: failed to set VTEP link up");
> > > > > +			vtep = NULL;
> > > > > +		} else {
> > > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > > > +			if (ret)
> > > > > +				DRV_LOG(WARNING,
> > > > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > > > +		}
> > > > > +	} else {
> > > > > +		DRV_LOG(WARNING,
> > > > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > > > +		rte_flow_error_set
> > > > > +			(error, -errno,
> > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > > +			 ret = 1;
> > > >
> > > > If it fails to create a vtep above, it will print out two warning
> > > > messages and one rte_flow_error message. And it even selects message
> > > > to print between two?
> > > > And there's another info msg at the end even in case of failure. Do
> > > > you really want to do this even with manipulating ret to change code
> > > > path?  Not a good practice.
> > > >
> > > > > Usually, code path should be straightforward for successful path and
> > > > > for errors/failures, return immediately or use 'goto' if there's need for
> > > > > cleanup.
> > > >
> > > > Please refactor entire function.
> > >
> > > I think I'll split it in two ones - for attached and potentially shared ifaces.
> > > >
> > > > > +	}
> > > > > +	if (ret) {
> > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > +		vtep = NULL;
> > > > > +	}
> > > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> > > > "error");
> > > > > +	return vtep;
> > > > > +}
> > > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > > +
> > > > > +/**
> > > > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > > > + * In order to share the UDP port within the other interfaces the
> > > > > + * VXLAN device created as not attached to any interface (if created).
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + * @return
> > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > >
> > > > Return negative errno in case of failure like others.
> > >
> > > Anyway, we have to return an index. If we do not return it as function
> > > > result we will need to provide some extra pointing parameter, it
> > > > complicates the code.
> > 
> > You misunderstood it. See what I wrote below. The function still returns the
> > index but in case of error, make it return negative errno instead of zero.
> > 
> > > >
> > > >  *   Interface index on success, a negative errno value otherwise and
> > > > rte_errno is set.
> > > >
> > > > > + */
> > > > > +static unsigned int
> > > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > +			   struct mlx5_flow *dev_flow,
> > > > > +			   struct rte_flow_error *error) {
> > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > +
> > > > > +	vtep = NULL;
> > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +		if (vlst->port == port) {
> > > > > +			vtep = vlst;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > >
> > > > You just need one variable.
> > >
> > > > Yes. There is a long story, I forgot to revert code to one variable after
> > > > debugging.
> > > >
> > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > >
> > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > 		if (vtep->port == port)
> > > > 			break;
> > > > 	}
> > > >
> > > > > +	if (!vtep) {
> > > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > +		if (vtep)
> > > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > +	} else {
> > > > > +		if (vtep->ifouter) {
> > > > > +			rte_flow_error_set(error, -errno,
> > > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > NULL,
> > > > > +				"Failed to create decap VTEP, attached "
> > > > > +				"device with the same UDP port exists");
> > > > > +				vtep = NULL;
> > > >
> > > > Making vtep null to skip the following code?
> > >
> > > Yes. To avoid multiple return operators in code.
> > 
> > It's okay to have multiple returns. Why not?
> 
> It is easy to miss the return in the midst of a function while refactoring/modifying the code.

Your code path doesn't look easy and free from error. Please refer to other
control path functions in this PMD.

> > > > Please merge the two same
> > > > > if/else and make the code path straightforward. And which errno do
> > > > you expect here?
> > > > Should it be set EEXIST instead?
> > > Not always. Netlink returns the code.
> > 
> > No, that's not my point. Your code above sets errno instead of rte_errno or
> > EEXIST.
> > 
> > 	} else {
> > 		if (vtep->ifouter) {
> > 			rte_flow_error_set(error, -errno,
> > 
> > Which one sets this errno? Here, it sets rte_errno because matched vtep
> libmnl sets, while processing the Netlink reply message (callback.c of libmnl sources).

You still don't understand my point.

In this flow_tcf_decap_vtep_create(), if vtep is found (vtep != NULL), how can
errno be set? Before the if/else, there's no libmnl call.
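To make the point concrete: when the entry is found in the local list, nothing has written errno, so the error code has to be chosen explicitly. The sketch below also follows the return convention requested earlier (index on success, negative errno on failure). It is an illustration only; `struct vtep_ent` and `decap_vtep_get()` are hypothetical names, and -ENOENT for "not found" is an assumption made so the example is complete without the creation step.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/queue.h>

/* Hypothetical stand-in for the shared VTEP list element. */
struct vtep_ent {
	LIST_ENTRY(vtep_ent) next;
	uint16_t port;
	unsigned int ifouter; /* Non-zero when attached to an outer interface. */
	unsigned int ifindex;
	unsigned int refcnt;
};

LIST_HEAD(vtep_ent_list, vtep_ent);

/*
 * Returns the ifindex (> 0) of a shareable decap entry for the port,
 * -EEXIST if the matching entry is attached to an outer interface,
 * -ENOENT if no entry matches.
 */
int
decap_vtep_get(struct vtep_ent_list *head, uint16_t port)
{
	struct vtep_ent *v;

	assert(head);
	LIST_FOREACH(v, head, next) {
		if (v->port == port)
			break;
	}
	if (!v)
		return -ENOENT;
	if (v->ifouter)
		return -EEXIST; /* Chosen explicitly; errno was never set here. */
	++v->refcnt;
	return (int)v->ifindex;
}
```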

> > can't be used as it already has outer iface attached (error message isn't clear,
> > please reword it too). I thought this should be EEXIST but you set errno to
> > rte_errno but errno isn't valid at this point.
> > 
> > >
> > > >
> > > > > +		}
> > > > > +	}
> > > > > +	if (vtep) {
> > > > > +		vtep->refcnt++;
> > > > > +		assert(vtep->ifindex);
> > > > > +		return vtep->ifindex;
> > > > > +	} else {
> > > > > +		return 0;
> > > > > +	}
> > > >
> > > > Why repeating same if/else?
> > > >
> > > >
> > > > This is my suggestion but if you take my suggestion to have
> > > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > > Agree.
> > >
> > > >
> > > > {
> > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > >
> > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > 		if (vtep->port == port)
> > > > 			break;
> > > > 	}
> > > > 	if (vtep && vtep->ifouter)
> > > > 		return rte_flow_error_set(... EEXIST ...);
> > > > 	else if (vtep) {
> > > > 		++vtep->refcnt;
> > > > 	} else {
> > > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > 		if (!vtep)
> > > > 			return rte_flow_error_set(...);
> > > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > 	}
> > > > 	assert(vtep->ifindex);
> > > > 	return vtep->ifindex;
> > > > }
> > > >
> > > >
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifouter
> > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + * @return
> > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > + */
> > > > > +static unsigned int
> > > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > +			    unsigned int ifouter,
> > > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > > +			    struct rte_flow_error *error) {
> > > > > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > +
> > > > > +	assert(ifouter);
> > > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > > +	vtep = NULL;
> > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +		if (vlst->ifouter == ifouter) {
> > > > > +			vtep = vlst;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > >
> > > > Same here.
> > > >
> > > > > +	if (!vtep) {
> > > > > +		uint16_t pcnt;
> > > > > +
> > > > > +		/* Not found, we should create the new attached VTEP. */
> > > > > +/*
> > > > > + * TODO: not implemented yet
> > > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > > >
> > > > Personal note is not appropriate even though it is removed in the
> > > > following patch.
> > > >
> > > > > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > > > > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> > > > pcnt++) {
> > > > > +			encap_port++;
> > > > > +			/* Wraparound the UDP port index. */
> > > > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> > > > ||
> > > > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > > > +				encap_port =
> > > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > > +			/* Check whether UDP port is in already in use. */
> > > > > +			vtep = NULL;
> > > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +				if (vlst->port == encap_port) {
> > > > > +					vtep = vlst;
> > > > > +					break;
> > > > > +				}
> > > > > +			}
> > > >
> > > > If you want to find out an empty port number, you can use rte_bitmap
> > > > > instead of repeating searching the entire list for all possible port
> > > > > numbers.
> > >
> > > We do not expect too many VXLAN devices have been created. bitmap.
> > 
> > +1, valid point.
> > 
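For reference, the bitmap alternative raised in this exchange could be sketched as below. This is not the rte_bitmap API; it is a plain C illustration with a hypothetical port range (`PORT_MIN`/`PORT_MAX` are made-up values, the patch uses MLX5_VXLAN_PORT_RANGE_MIN/MAX) and hypothetical helpers `port_alloc()` and `port_free()`.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical range, chosen only for the example. */
#define PORT_MIN 30000u
#define PORT_MAX 30063u
#define PORT_CNT (PORT_MAX - PORT_MIN + 1)
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* One bit per port in the range; a set bit means the port is in use. */
struct port_map {
	unsigned long bits[(PORT_CNT + WORD_BITS - 1) / WORD_BITS];
};

/* Claim the lowest free port; returns 0 when the range is exhausted. */
uint16_t
port_alloc(struct port_map *m)
{
	unsigned int i;

	assert(m);
	for (i = 0; i < PORT_CNT; i++) {
		unsigned long bit = 1ul << (i % WORD_BITS);

		if (!(m->bits[i / WORD_BITS] & bit)) {
			m->bits[i / WORD_BITS] |= bit;
			return (uint16_t)(PORT_MIN + i);
		}
	}
	return 0;
}

/* Mark a previously allocated port as free again. */
void
port_free(struct port_map *m, uint16_t port)
{
	unsigned int i = port - PORT_MIN;

	assert(m && port >= PORT_MIN && port <= PORT_MAX);
	m->bits[i / WORD_BITS] &= ~(1ul << (i % WORD_BITS));
}
```

Whether this is worth it depends on how many VXLAN devices are expected; with only a handful, scanning the list as the patch does is likely cheap enough.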
> > > > > +			if (vtep) {
> > > > > +				vtep = NULL;
> > > > > +				continue;
> > > > > +			}
> > > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > > +						     encap_port, error);
> > > > > +			if (vtep) {
> > > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > > > next);
> > > > > +				break;
> > > > > +			}
> > > > > +			if (rte_errno != EEXIST)
> > > > > +				break;
> > > > > +		}
> > > > > +	}
> > > > > +	if (!vtep)
> > > > > +		return 0;
> > > > > +	vtep->refcnt++;
> > > > > +	assert(vtep->ifindex);
> > > > > +	return vtep->ifindex;
> > > >
> > > > Please refactor this func according to what I suggested for
> > > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > > >
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Creates target interface index for tunneling of any type.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifouter
> > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + * @param[out] error
> > > > > + *   Perform verbose error reporting if not NULL.
> > > > > + * @return
> > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > >
> > > >  *   Interface index on success, a negative errno value otherwise and
> > > >  *   rte_errno is set.
> > > >
> > > > > + */
> > > > > +static unsigned int
> > > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > +			    unsigned int ifouter,
> > > > > +			    struct mlx5_flow *dev_flow,
> > > > > +			    struct rte_flow_error *error) {
> > > > > +	unsigned int ret;
> > > > > +
> > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > > +						 dev_flow, error);
> > > > > +		break;
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > > > +		break;
> > > > > +	default:
> > > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > NULL,
> > > > > +				"unsupported tunnel type");
> > > > > +		ret = 0;
> > > > > +		break;
> > > > > +	}
> > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * Deletes tunneling interface by UDP port.
> > > > > + *
> > > > > + * @param[in] tcf
> > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > + * @param[in] ifindex
> > > > > + *   Network interface index of VXLAN device.
> > > > > + * @param[in] dev_flow
> > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > + */
> > > > > +static void
> > > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > > +			    unsigned int ifindex,
> > > > > +			    struct mlx5_flow *dev_flow) {
> > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > +
> > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > +	vtep = NULL;
> > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > +		if (vlst->ifindex == ifindex) {
> > > > > +			vtep = vlst;
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > >
> > > > It is weird. You just can have vtep pointer in the
> > > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > > vtep->ifindex like the assertion below. Then, this lookup can be skipped.
> > >
> > > OK. Good optimization.
> > >
> > > >
> > > > > +	if (!vtep) {
> > > > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > > > +		goto exit;
> > > > > +	}
> > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > +		break;
> > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > +/*
> > > > > + * TODO: Remove the encap ancillary rules first.
> > > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> > > >
> > > > Is it a personal note? Please remove.
> > > OK.
> > >
> > > >
> > > > > +		break;
> > > > > +	default:
> > > > > +		assert(false);
> > > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > > +		break;
> > > > > +	}
> > > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > > +	assert(vtep->refcnt);
> > > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > > +		LIST_REMOVE(vtep, next);
> > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > +	}
> > > > > +exit:
> > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > +}
> > > > > +
> > > > >  /**
> > > > >   * Apply flow to E-Switch by sending Netlink message.
> > > > >   *
> > > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > > >  	       struct rte_flow_error *error)  {
> > > > >  	struct priv *priv = dev->data->dev_private;
> > > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > > >  	struct mlx5_flow *dev_flow;
> > > > >  	struct nlmsghdr *nlh;
> > > > > +	int ret;
> > > > >
> > > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > > >  	/* E-Switch flow can't be expanded. */
> > > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > > +	if (dev_flow->tcf.applied)
> > > > > +		return 0;
> > > > >  	nlh = dev_flow->tcf.nlh;
> > > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > > > NLM_F_EXCL;
> > > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > +		/*
> > > > > +		 * Replace the interface index, target for
> > > > > +		 * encapsulation, source for decapsulation.
> > > > > +		 */
> > > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > +		/* Create actual VTEP device when rule is being applied. */
> > > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > > > +					dev_flow, error);
> > > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > > +			return -rte_errno;
> > > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > > +	}
> > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > > >
> > > > ifindex_org looks a temporary storage in this code. And this kind of
> > > > hassle
> > > > (replace/restore) is there because you took the ifindex from the
> > > > netlink message. Why don't you have just
> > > >
> > > > struct mlx5_flow_tcf_tunnel_hdr {
> > > > 	uint32_t type; /**< Tunnel action type. */
> > > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > > > */ };
> > > >
> > > > and don't change ifindex?
> > >
> > > I propose to use the local variable for ifindex_org and do not keep it
> > > in structure. *ifindex_ptr will keep.
> > 
> > Well, you still have to restore the ifindex whenever sending the nl msg. Most
> > of all, ifindex_ptr in nl msg isn't a right place to store the ifindex. 
> It is stored there for rules w/o tunnels. It is its "native" place, I'd prefer
> not to create some new location to store the original index and save some space.
> We have to swap indices only if rule has requested the tunneling.  We can not

No no. At this point, flow is already created to be tunneled one. What do you
mean by 'rules w/o tunnels' or 'only if rule has requested the tunneling'?? It
has already been created as a vxlan tunnel rule. It won't be changed. The nlmsg
is supposed to have vtep ifindex but translation didn't know it and stored the
outer iface temporarily to get it replaced by vtep ifindex. It can never be a
'native'/'original' place to store it. In which case the nl msg can be sent with
the 'original' ifindex? Any specific example? No.

> set tunnel index permanently, because rule can be applied/removed/reapplied
> and other new VXLAN device with new index can be recreated.

Every time it is applied, it will get the vtep and overwrite vtep ifindex in the
nl msg.
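A sketch of the layout proposed above, where the original interface index lives in the tunnel header and the Netlink message slot is simply overwritten on every apply, so no save/restore step is needed. The struct mirrors the one suggested in the review; `struct vtep_dev` and `tunnel_apply()` are hypothetical, and a plain `unsigned int` stands in for the attribute inside the Netlink message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct mlx5_flow_tcf_vtep. */
struct vtep_dev {
	unsigned int ifindex;
};

/* Modeled on the struct suggested in the review comments. */
struct tunnel_hdr {
	uint32_t type;                   /* Tunnel action type. */
	unsigned int ifindex;            /* Original dst/src interface, never modified. */
	struct vtep_dev *vtep;           /* Tunnel endpoint device, set on apply. */
	unsigned int *nlmsg_ifindex_ptr; /* ifindex location in the Netlink message. */
};

/* Bind a VTEP and write its ifindex into the message slot. */
void
tunnel_apply(struct tunnel_hdr *t, struct vtep_dev *vtep)
{
	assert(t && t->nlmsg_ifindex_ptr && vtep && vtep->ifindex);
	t->vtep = vtep;
	*t->nlmsg_ifindex_ptr = vtep->ifindex;
}
```

Re-applying the rule with a different VTEP just overwrites the slot again, while `ifindex` keeps the outer interface for the next lookup.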

> > have vtep ifindex but it just temporarily keeps the device ifindex until vtep is
> > created/found.

Thanks,
Yongseok


* Re: [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-26 21:56             ` Yongseok Koh
@ 2018-10-29  9:33               ` Slava Ovsiienko
  2018-10-29 18:26                 ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-29  9:33 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Saturday, October 27, 2018 0:57
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> routine
> 
> On Fri, Oct 26, 2018 at 01:39:38AM -0700, Slava Ovsiienko wrote:
> > > -----Original Message-----
> > > From: Yongseok Koh
> > > Sent: Friday, October 26, 2018 6:07
> > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> > > routine
> > >
> > > On Thu, Oct 25, 2018 at 06:53:11AM -0700, Slava Ovsiienko wrote:
> > > > > -----Original Message-----
> > > > > From: Yongseok Koh
> > > > > Sent: Tuesday, October 23, 2018 13:05
> > > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow
> > > > > validation routine
> > > > >
> > > > > On Mon, Oct 15, 2018 at 02:13:30PM +0000, Viacheslav Ovsiienko
> wrote:
> > > [...]
> > > > > > @@ -1114,7 +1733,6 @@ struct pedit_parser {
> > > > > >  							   error);
> > > > > >  			if (ret < 0)
> > > > > >  				return ret;
> > > > > > -			item_flags |=
> MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> > > > > >  			mask.ipv4 = flow_tcf_item_mask
> > > > > >  				(items, &rte_flow_item_ipv4_mask,
> > > > > >  				 &flow_tcf_mask_supported.ipv4,
> @@ -1135,13 +1753,22 @@
> > > > > > struct pedit_parser {
> > > > > >  				next_protocol =
> > > > > >  					((const struct
> rte_flow_item_ipv4 *)
> > > > > >  					 (items->spec))-
> >hdr.next_proto_id;
> > > > > > +			if (item_flags &
> > > > > MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
> > > > > > +				/*
> > > > > > +				 * Multiple outer items are not
> allowed as
> > > > > > +				 * tunnel parameters, will raise an
> error later.
> > > > > > +				 */
> > > > > > +				ipv4 = NULL;
> > > > >
> > > > > Can't it be inner then?
> > > > AFAIK, no. For tc rules we cannot specify multiple levels (inner
> > > > + outer). There are just no TCA_FLOWER_KEY_xxx attributes for
> > > > specifying inner items to match by flower.
> > >
> > > When I briefly read the kernel code, I thought TCA_FLOWER_KEY_* are
> > > for inner header before decap. I mean TCA_FLOWER_KEY_IPV4_SRC is
> for
> > > inner L3 and TCA_FLOWER_KEY_ENC_IPV4_SRC is for outer tunnel
> header.
> > > Please do some experiments with tc-flower command.
> >
> > Hm. Interesting. I will check.
> >
> > > > That comment is quite unclear, sorry. I did not like it either,
> > > > I just forgot to rewrite it.
> > > >
> > > > The ipv4, ipv6 and udp variables gather the matching items while
> > > > the item list is scanned; later these variables are used for
> > > > VXLAN decap action validation only. So "outer" means that the
> > > > ipv4 variable contains the VXLAN decap outer addresses, and it
> > > > should be set to NULL if multiple items are found in the item list.
> > > >
> > > > But we can generate an error here if we have valid action_flags
> > > > (gathered by prepare function) and VXLAN decap is set. Raising an
> > > > error looks more relevant and clear.
> > >
> > > You can't use flags at this point. It is validate() so prepare()
> > > might not be preceded.
> > >
> > > > >   flow create 1 ingress transfer
> > > > >     pattern eth src is 66:77:88:99:aa:bb
> > > > >       dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
> > > > >       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
> > > > >       eth / ipv6 / tcp dst is 42 / end
> > > > >     actions vxlan_decap / port_id id 2 / end
> > > > >
> > > > > Is this flow supported by linux tcf? I took this example from
> > > > > Adrien's
> > > patch -
> > > > > "[8/8] net/mlx5: add VXLAN decap support to switch flow rules".
> > > > > If so,
> > > isn't it
> > > > > possible to have inner L3 layer (MLX5_FLOW_LAYER_INNER_*)? If
> > > > > not,
> > > you
> > > > > should return error in this case. I don't see any code to check
> > > > > redundant outer items.
> > > > > Did I miss something?
> > > >
> > > > Interesting. Although the rule has correct syntax, I'm not sure
> > > > whether it can be applied without errors.
> > >
> > > Please try. You own this patchset. However, you can just prohibit
> > > such flows (tunneled items) and come up with follow-up patches to
> > > enable them later if tcf supports them, as this patchset is already
> > > pretty huge and we don't have much time.
> > >
> > > > At least our current flow_tcf_translate() implementation does not
> > > > support
> > > any INNERs.
> > > > But it seems the flow_tcf_validate() does, it's subject to recheck
> > > > - we
> > > should not allow
> > > > unsupported items to pass the validation. I'll check and provide
> > > > the
> > > separate bugfix patch
> > > > (if any).
> > >
> > > Neither has tunnel support. It is the first time to add tunnel support to
> TCF.
> > > If it was needed, you should've added it, not skipping it.
> > >
> > > You can check how MLX5_FLOW_LAYER_TUNNEL is used in Verbs/DV as
> a
> > > reference.
> >
> > Yes. I understood your point. Will check and add tunnel support for TCF
> rules.
> > Anyway, inner MAC addresses are supported for VXLAN decap, I think we
> > should specify these ones in the rule as inners (after VNI item),
> > definitely some tunnel support in validate/parse/translate should be added.
> >
> > >
> > > > > BTW, for the tunneled items, why don't you follow the code of
> > > > > Verbs(mlx5_flow_verbs.c) and DV(mlx5_flow_dv.c)? For tcf, it is
> > > > > the first
> > > time
> > > > VXLAN has some specifics (a warning about ignored params, etc.).
> > > > I checked which parts of the Verbs/DV code could be reused and
> > > > did not find much. I'll recheck the latest commits; possibly the
> > > > code has become more appropriate for VXLAN.
> > >
> > > Agreed. I'm not forcing you to do it because we run out of time but
> > > mentioned it because if there's any redundancy in our code, that
> > > usually causes bug later.
> > > Let's not waste too much time for that. Just grab low hanging fruits if
> any.
> > >
> > > > > to add tunneled item, but Verbs/DV already have validation code
> > > > > for
> > > tunnel,
> > > > > so you can reuse the existing code. In
> > > > > flow_tcf_validate_vxlan_decap(),
> > > not
> > > > > every validation is VXLAN-specific but some of them can be
> > > > > common
> > > code.
> > > > >
> > > > > And if you need to know whether there's the VXLAN decap action
> > > > > prior to outer header item validation, you can relocate the code
> > > > > - action
> > > validation
> > > > > first and item validation next, as there's no dependency yet in
> > > > > the current
> > > >
> > > > We cannot validate actions first: the items must be gathered
> > > > beforehand, so that we can check them in an action-specific
> > > > fashion and check the action itself.
> > > > I mean, if we see a VXLAN decap action, we should check the presence
> > > > of L2, L3, L4 and VNI items. I minimized the number of passes
> > > > along the item and action lists. BTW, Adrien's approach performed
> > > > two passes, mine does only one.
> > > >
> > > > > code. Defining ipv4, ipv6, udp seems to make the code path more
> > > complex.
> > > > Yes, but it allows us to avoid the extra item list scan and
> > > > minimizes the changes to existing code.
> > > > With your approach we should:
> > > > - scan actions without full checking, just gathering and checking
> > > >   action_flags
> > > > - scan items, performing varying checks (depending on the gathered
> > > >   action flags)
> > > > - scan actions again, performing a full check with params (at least
> > > >   for now, checking whether all params are gathered)
> > >
> > > Disagree. flow_tcf_validate_vxlan_encap() doesn't even need any info
> > > of items and flow_tcf_validate_vxlan_decap() needs item_flags to
> > > check whether VXLAN item is there or not and ipv4/ipv6/udp are all
> > > for item checks. Let me give you a very detailed example:
> > >
> > > {
> > > 	for (actions[]...) {
> > > 		...
> > > 		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > 			...
> > > 			flow_tcf_validate_vxlan_encap();
> > > 			...
> > > 			break;
> > > 		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > 			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> > > 					   | MLX5_ACTION_VXLAN_DECAP))
> > > 				return rte_flow_error_set
> > > 					(error, ENOTSUP,
> > > 					 RTE_FLOW_ERROR_TYPE_ACTION,
> > > 					 actions,
> > > 					 "can't have multiple vxlan actions");
> > > 			/* Don't call flow_tcf_validate_vxlan_decap(). */
> > > 			action_flags |= MLX5_ACTION_VXLAN_DECAP;
> > > 			break;
> > > 	}
> > > 	for (items[]...) {
> > > 		...
> > > 		case RTE_FLOW_ITEM_TYPE_IPV4:
> > > 			/* Existing common validation. */
> > > 			...
> > > 			if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > > 				/* Do ipv4 validation in
> > > 				 * flow_tcf_validate_vxlan_decap(). */
> > > 			}
> > > 			break;
> > > 	}
> > > }
> > >
> > > Currently you are doing,
> > >
> > > 	- validate items
> > > 	- validate actions
> > > 	- validate items again if decap.
> > >
> > > But this can simply be
> > >
> > > 	- validate actions
> > > How could we validate VXLAN decap at this stage,
> > > when item_flags is not set yet?
> > > Am I missing something?
> 
> Look at my pseudo code above.
> Nothing much to be done in validating decap action. And item validation for
> decap can be done together in item validation code.
> 
The VXLAN decap action should check:
- whether outer destination UDP port is present (otherwise we cannot assign VTEP VXLAN)
- whether outer destination IP is present (otherwise we cannot assign IP to ifouter/build route)
- whether VNI is present (to identify VXLAN traffic)
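
To make these three checks concrete, here is a minimal sketch. The flag names are hypothetical stand-ins, not the PMD's MLX5_FLOW_LAYER_* values; it only assumes that the item scan records what it has seen in a bit mask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, not the PMD's MLX5_FLOW_LAYER_* values. */
enum {
	ITEM_OUTER_IP_DST  = 1u << 0, /* outer IPv4/IPv6 destination given */
	ITEM_OUTER_UDP_DST = 1u << 1, /* outer UDP destination port given */
	ITEM_VXLAN_VNI     = 1u << 2, /* VXLAN item with VNI given */
};

#define DECAP_REQUIRED \
	(ITEM_OUTER_IP_DST | ITEM_OUTER_UDP_DST | ITEM_VXLAN_VNI)

/* Return 0 if all required items were seen, else the missing bits. */
static uint32_t
decap_missing_items(uint32_t item_flags)
{
	return DECAP_REQUIRED & ~item_flags;
}
```

A non-zero result tells the caller which of the three items is absent, so the error message can name it.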

How do you propose to check these in your approach?

With best regards,
Slava


> Thanks,
> Yongseok
> 
> >
> > > 	- validate items
> > >

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-26 22:42             ` Yongseok Koh
@ 2018-10-29 11:53               ` Slava Ovsiienko
  2018-10-29 18:42                 ` Yongseok Koh
  0 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-10-29 11:53 UTC (permalink / raw)
  To: Yongseok Koh; +Cc: Shahaf Shuler, dev

> -----Original Message-----
> From: Yongseok Koh
> Sent: Saturday, October 27, 2018 1:43
> To: Slava Ovsiienko <viacheslavo@mellanox.com>
> Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> management
> 
> On Fri, Oct 26, 2018 at 02:35:24AM -0700, Slava Ovsiienko wrote:
> > > -----Original Message-----
> > > From: Yongseok Koh
> > > Sent: Friday, October 26, 2018 9:26
> > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > management
> > >
> > > On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > > > -----Original Message-----
> > > > > From: Yongseok Koh
> > > > > Sent: Thursday, October 25, 2018 3:28
> > > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel
> > > > > devices management
> > > > >
> > > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko
> wrote:
> > > > > > > VXLAN interfaces are dynamically created for each local UDP
> > > > > > > port of outer networks and then used as targets for TC
> > > > > > > "flower" filters in order to perform encapsulation. These
> > > > > > > VXLAN interfaces are system-wide: only one device with a
> > > > > > > given UDP port can exist in the system (an attempt to
> > > > > > > create another device with the same local UDP port returns
> > > > > > > EEXIST), so the PMD should support a shared database of device
> > > > > > > instances for PMD instances. These implicitly created VXLAN
> > > > > > > devices are called VTEPs (Virtual Tunnel End Points).
> > > > > >
> > > > > > Creation of the VTEP occurs at the moment of rule applying.
> > > > > > The link is set up, root ingress qdisc is also initialized.
> > > > > >
> > > > > > Encapsulation VTEPs are created on per port basis, the single
> > > > > > VTEP is attached to the outer interface and is shared for all
> > > > > > encapsulation rules on this interface. The source UDP port is
> > > > > > automatically selected in range 30000-60000.
> > > > > >
> > > > > > > For decapsulation, one VTEP is created for every unique local
> > > > > > > UDP port to accept tunnel traffic. The name of the created VTEP
> > > > > > > consists of the prefix "vmlx_" and the UDP port number in
> > > > > > > decimal digits without leading zeros (vmlx_4789). The VTEP can
> > > > > > > be created in the system before the application is launched,
> > > > > > > which allows UDP ports to be shared between primary and
> > > > > > > secondary processes.
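
The naming rule described above can be sketched as follows. This is an illustration only: VTEP_PFX and vtep_name() are hypothetical names, not the patch's MLX5_VXLAN_DEVICE_PFX handling. Note that the longest possible name, "vmlx_65535", is 10 characters and fits in a 16-byte interface name buffer:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define VTEP_PFX "vmlx_" /* prefix named in the commit message */

/*
 * Build the VTEP device name: prefix plus the UDP port in decimal,
 * without leading zeros. Returns 0 on success, -1 if the buffer is
 * too small for the whole name.
 */
static int
vtep_name(char *buf, size_t len, uint16_t port)
{
	int n = snprintf(buf, len, "%s%u", VTEP_PFX, (unsigned int)port);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Checking the snprintf() result guards against a silently truncated name, which would otherwise address a different device.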
> > > > > >
> > > > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > > > ---
> > > > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > > > ++++++++++++++++++++++++++++++++++++++-
> > > > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > index d6840d5..efa9c3b 100644
> > > > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > > > >  	return -err;
> > > > > >  }
> > > > > >
> > > > > > +/* VTEP device list is shared between PMD port instances. */
> > > > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> static
> > > > > pthread_mutex_t
> > > > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > > > >
> > > > > What's the reason for choosing pthread_mutex instead of
> rte_*_lock?
> > > >
> > > > To share this database with secondary processes?
> > >
> > > The static variable isn't shared with sec proc. But you can leave it as is.
> >
> > Yes. The sharing just was assumed, not implemented yet.
> >
> > >
> > > > > > +
> > > > > > +/**
> > > > > > + * Deletes VTEP network device.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] vtep
> > > > > > > + *   Object representing the network device to delete. Memory
> > > > > > + *   allocated for this object is freed by routine.
> > > > > > + */
> > > > > > +static void
> > > > > > > +flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
> > > > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > > > +	struct nlmsghdr *nlh;
> > > > > > +	struct ifinfomsg *ifm;
> > > > > > +	alignas(struct nlmsghdr)
> > > > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	assert(!vtep->refcnt);
> > > > > > +	if (vtep->created && vtep->ifindex) {
> > > > >
> > > > > First of all vtep->created seems of no use. It is introduced to
> > > > > select the error message in flow_tcf_create_iface(). I don't see
> > > > > any necessity to distinguish between 'vtep is allocated by
> > > > > rte_malloc()' and
> > > 'vtep is created in kernel'.
> > > >
> > > > The created flag indicates that the iface was created by our code.
> > > > VXLAN decap devices must have the specified UDP port, and we cannot
> > > > create multiple VXLAN devices with the same UDP port (EEXIST
> > > > is returned). So we have to share the device. One option is to
> > > > create the devices before the DPDK application launches and use
> > > > these pre-created devices. In this case the created flag is not
> > > > set, and the VXLAN device is neither reinitialized nor deleted.
> > >
> > > I can't see any code to use pre-created device (created even before
> > > dpdk app launch). Your code just tries to create 'vmlx_xxxx'. Even
> > > from your comment in [7/7] patch, PMD will cleanup any leftovers
> > > (existing vtep devices) on initialization. Your comment sounds conflicting
> and confusing.
> >
> > There are two types of VXLAN devices:
> >
> > - VXLAN decap, not attached to any ifouter. Provides the ingress UDP
> > port. We try to share devices of this type, because we may be
> > asked for a specific UDP port. No device/rule cleanup and reinit
> > needed.
> >
> > - VXLAN encap, should be attached to ifouter to provide a strict egress
> > path. No need to share, the egress UDP port does not matter. And we need
> > to clean up ifouter, removing other attached VXLAN devices and rules,
> > because it is too hard to co-exist with some pre-created setup.
> 
> I knew that. But how can it justify the need of 'created' field in vtep struct?
> In this code, it is of no use. But will see how it is used in your v3.
> 
> > > > > > And why do you need to check vtep->ifindex as well? If vtep is
> > > > > > created in kernel and its ifindex isn't set, that should be an
> > > > > > error which had to be handled in flow_tcf_create_iface(). Such a
> > > > > > vtep shouldn't exist.
> > > > Yes, if we did not get ifindex of device - vtep is not created, error
> returned.
> > > > We just can not operate w/o ifindex.
> > >
> > > I know ifindex is needed but my question was checking vtep->ifindex
> > > here looked redundant/unnecessary. But as you agreed on having
> > > create/get/release_iface(), it doesn't matter much.
> >
> > Yes. I agree, will refactor the code.
> >
> > >
> > > > > > Also, the refcnt management is a bit strange. Please put an
> > > > > > abstraction by adding create_iface(), get_iface() and
> > > > > > release_iface(). In get_iface(), vtep->refcnt should be
> > > > > > incremented. And release_iface() decreases the
> > > > OK. Good proposal. I'll refactor the code.
> > > >
> > > > > refcnt and if it reaches to zero, the iface can be removed.
> > > > > create_iface() will set the refcnt to 1. And if you refer to
> > > > > mlx5_hrxq_get(), it even does searching the list not by
> > > > > repeating the
> > > same lookup code here and there.
> > > > > That will make your code much simpler.
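
The create/get/release abstraction suggested above might look like the following sketch. The struct and function names are hypothetical and the kernel (Netlink) side is left out entirely; only the reference counting and the single lookup helper are shown:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical, simplified stand-in for struct mlx5_flow_tcf_vtep. */
struct vtep {
	LIST_ENTRY(vtep) next;
	uint16_t port;
	unsigned int refcnt;
};

static LIST_HEAD(, vtep) vtep_list = LIST_HEAD_INITIALIZER(vtep_list);

/* Create a new entry with refcnt 1 and link it into the list. */
static struct vtep *
vtep_create(uint16_t port)
{
	struct vtep *v = calloc(1, sizeof(*v));

	if (!v)
		return NULL;
	v->port = port;
	v->refcnt = 1;
	LIST_INSERT_HEAD(&vtep_list, v, next);
	return v;
}

/* Look up by UDP port; take a reference if found. */
static struct vtep *
vtep_get(uint16_t port)
{
	struct vtep *v;

	LIST_FOREACH(v, &vtep_list, next) {
		if (v->port == port) {
			++v->refcnt;
			return v;
		}
	}
	return NULL;
}

/* Drop a reference; free the entry on the last one. Returns 1 if freed. */
static int
vtep_release(struct vtep *v)
{
	assert(v->refcnt);
	if (--v->refcnt)
		return 0;
	LIST_REMOVE(v, next);
	free(v);
	return 1;
}
```

With this shape a caller tries vtep_get() first and falls back to vtep_create(), so the list walk lives in one place.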
> > > > >
> > > > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh,
> sizeof(*ifm));
> > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +		if (ret)
> > > > > > +			DRV_LOG(WARNING, "netlink: error deleting
> VXLAN
> > > > > "
> > > > > > +					 "encap/decap ifindex %u",
> > > > > > +					 ifm->ifi_index);
> > > > > > +	}
> > > > > > +	rte_free(vtep);
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Creates VTEP network device.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifouter
> > > > > > + *   Outer interface to attach new-created VXLAN device
> > > > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > > > + * @param[in] port
> > > > > > + *   UDP port of created VTEP device.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + *
> > > > > > + * @return
> > > > > > + * Pointer to created device structure on success, NULL
> > > > > > +otherwise
> > > > > > + * and rte_errno is set.
> > > > > > + */
> > > > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > > > >
> > > > > > Why negative(ifndef) first instead of positive(ifdef)?
> > > > Hm. Did I miss the rule. Positive #ifdef first? OK.
> > >
> > > No concrete rule but if there's no specific reason, it would be
> > > better to start from ifdef.
> > >
> > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf
> __rte_unused,
> > > > > > +		      unsigned int ifouter __rte_unused,
> > > > > > +		      uint16_t port __rte_unused,
> > > > > > +		      struct rte_flow_error *error) {
> > > > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> > > > > > +			 "netlink: failed to create VTEP, "
> > > > > > +			 "VXLAN metadat is not supported by
> kernel");
> > > > >
> > > > > Typo.
> > > >
> > > > OK.  "metadata are not supported".
> > > > >
> > > > > > +	return NULL;
> > > > > > +}
> > > > > > +#else
> > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > > > >
> > > > > How about adding 'vtep'? It sounds vague - creating a general
> interface.
> > > > > E.g., flow_tcf_create_vtep_iface()?
> > > >
> > > > OK.
> > > >
> > > > >
> > > > > > +		      unsigned int ifouter,
> > > > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > +	struct nlmsghdr *nlh;
> > > > > > +	struct ifinfomsg *ifm;
> > > > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > > > +	alignas(struct nlmsghdr)
> > > > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > > > >
> > > > > Use a macro for '128'. Can't know the meaning.
> > > > OK. I think we should calculate the buffer size explicitly.
> > > >
> > > > >
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > > > +		       SZ_NLATTR_NEST * 2 +
> > > > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > > > +	struct nlattr *na_info;
> > > > > > +	struct nlattr *na_vxlan;
> > > > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > > > >
> > > > > Use rte_cpu_to_be_*() instead.
> > > >
> > > > Yes, I'll recheck the whole code for this issue.
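
For illustration, a run-time conversion of the port to network byte order can be sketched with plain htons(), used here as a stand-in for rte_cpu_to_be_16() so the example builds without DPDK headers. The helper name is hypothetical:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Convert a host-order UDP port to its on-wire (big-endian) byte layout.
 * htons() stands in for rte_cpu_to_be_16() in this sketch.
 */
static void
port_to_wire(uint16_t port, uint8_t out[2])
{
	uint16_t be = htons(port);

	memcpy(out, &be, sizeof(be));
}
```

For port 4789 (0x12B5) the resulting bytes are 0x12 followed by 0xB5 on any host endianness.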
> > > >
> > > > >
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > > > +	if (!vtep) {
> > > > > > +		rte_flow_error_set
> > > > > > +			(error, ENOMEM,
> > > > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > +			 NULL, "unadble to allocate memory for
> VTEP desc");
> > > > > > +		return NULL;
> > > > > > +	}
> > > > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > > > +			.refcnt = 0,
> > > > > > +			.port = port,
> > > > > > +			.created = 0,
> > > > > > +			.ifouter = 0,
> > > > > > +			.ifindex = 0,
> > > > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > > > +	};
> > > > > > +	memset(buf, 0, sizeof(buf));
> > > > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE  |
> > > > > NLM_F_EXCL;
> > > > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > > > +	ifm->ifi_type = 0;
> > > > > > +	ifm->ifi_index = 0;
> > > > > > +	ifm->ifi_flags = IFF_UP;
> > > > > > +	ifm->ifi_change = 0xffffffff;
> > > > > > +	snprintf(name, sizeof(name), "%s%u",
> MLX5_VXLAN_DEVICE_PFX,
> > > > > port);
> > > > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > > > +	assert(na_info);
> > > > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > > > +	if (ifouter)
> > > > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > > > +	assert(na_vxlan);
> > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> 1);
> > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +	if (ret)
> > > > > > +		DRV_LOG(WARNING,
> > > > > > +			"netlink: VTEP %s create failure (%d)",
> > > > > > +			name, rte_errno);
> > > > > > +	else
> > > > > > +		vtep->created = 1;
> > > > >
> > > > > Flow of code here isn't smooth, thus could be error-prone. Most
> > > > > of all, I don't like ret has multiple meanings. ret should be
> > > > > return value but you are using it to store ifindex.
> > > > >
> > > > > > +	if (ret && ifouter)
> > > > > > +		ret = 0;
> > > > > > +	else
> > > > > > +		ret = if_nametoindex(name);
> > > > >
> > > > > If vtep isn't created and ifouter is set, then skip init below,
> > > > > which means, if
> > > >
> > > > ifouter is set for VXLAN encap devices. They should be attached to
> > > > ifouter and cannot be shared. So, if ifouter is set, we do not
> > > > use the pre-created/existing VXLAN devices. We have to create our
> > > > own, not shared, device.
> > >
> > > In your code (flow_tcf_encap_vtep_create()), it is shared by multiple
> flows.
> > > Do you mean it isn't shared between different outer ifaces? If so,
> > > that's for sure.
> > Sorry, I do not understand the question.
> > VXLAN encap device is attached to ifouter and shared by all flows with
> > this ifouter. No multiple VXLAN devices are attached to the same ifouter,
> only one.
> > VXLAN decap device has no attached ifouter, so it can not share it.
> 
> Yep, that's what I meant.
> 
> > > > > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > > > > But why do you want to try to call this API even if it failed to create
> vtep?
> > > > > Let's not make code flow convoluted even though it logically works.
> > > > > Let's make it straightforward.
> > > > >
> > > > > > +	if (ret) {
> > > > > > +		vtep->ifindex = ret;
> > > > > > +		vtep->ifouter = ifouter;
> > > > > > +		memset(buf, 0, sizeof(buf));
> > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh,
> sizeof(*ifm));
> > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > +		ifm->ifi_type = 0;
> > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > +		ifm->ifi_flags = IFF_UP;
> > > > > > +		ifm->ifi_change = IFF_UP;
> > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +		if (ret) {
> > > > > > +			DRV_LOG(WARNING,
> > > > > > +				"netlink: VTEP %s set link up failure
> (%d)",
> > > > > > +				name, rte_errno);
> > > > > > +			rte_free(vtep);
> > > > > > +			rte_flow_error_set
> > > > > > +				(error, -errno,
> > > > > > +
> RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > NULL,
> > > > > > +				 "netlink: failed to set VTEP link up");
> > > > > > +			vtep = NULL;
> > > > > > +		} else {
> > > > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex,
> error);
> > > > > > +			if (ret)
> > > > > > +				DRV_LOG(WARNING,
> > > > > > +				"VTEP %s init failure (%d)", name,
> rte_errno);
> > > > > > +		}
> > > > > > +	} else {
> > > > > > +		DRV_LOG(WARNING,
> > > > > > +			"VTEP %s failed to get index (%d)", name,
> errno);
> > > > > > +		rte_flow_error_set
> > > > > > +			(error, -errno,
> > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> NULL,
> > > > > > +			 !vtep->created ? "netlink: failed to create
> VTEP" :
> > > > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > > > +			 ret = 1;
> > > > >
> > > > > If it fails to create a vtep above, it will print out two
> > > > > warning messages and one rte_flow_error message. And it even
> > > > > selects message to print between two?
> > > > > And there's another info msg at the end even in case of failure.
> > > > > Do you really want to do this even with manipulating ret to
> > > > > change code path?  Not a good practice.
> > > > >
> > > > > > Usually, the code path should be straightforward for the successful
> > > > > > path, and for errors/failures, return immediately or use 'goto' if
> > > > > > there's need for cleanup.
> > > > >
> > > > > Please refactor entire function.
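
The control-flow style asked for above (straight success path, 'goto' for cleanup) can be sketched generically. Everything here is hypothetical: step() stands in for the Netlink create and link-up calls, and struct dev for the VTEP descriptor:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct dev { int ifindex; };

/* Hypothetical step: returns 0 on success or a negative errno value. */
static int
step(int fail)
{
	return fail ? -EIO : 0;
}

/*
 * Allocate a descriptor and run two steps. The success path reads top
 * to bottom; every failure jumps to one cleanup label.
 */
static struct dev *
dev_create(int fail_create, int fail_up, int *err)
{
	struct dev *d = calloc(1, sizeof(*d));
	int ret;

	if (!d) {
		*err = ENOMEM;
		return NULL;
	}
	ret = step(fail_create);	/* e.g. create the device */
	if (ret)
		goto error;
	d->ifindex = 1;
	ret = step(fail_up);		/* e.g. set the link up */
	if (ret)
		goto error;
	*err = 0;
	return d;
error:
	free(d);
	*err = -ret;
	return NULL;
}
```

Each failure produces exactly one error report, and the return value is never reused to carry an interface index.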
> > > >
> > > > I think I'll split it in two ones - for attached and potentially shared
> ifaces.
> > > > >
> > > > > > +	}
> > > > > > +	if (ret) {
> > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > +		vtep = NULL;
> > > > > > +	}
> > > > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ?
> "OK" :
> > > > > "error");
> > > > > > +	return vtep;
> > > > > > +}
> > > > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > > > +
> > > > > > +/**
> > > > > > + * Create target interface index for VXLAN tunneling
> decapsulation.
> > > > > > + * In order to share the UDP port within the other interfaces
> > > > > > +the
> > > > > > + * VXLAN device created as not attached to any interface (if
> created).
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + * @return
> > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > >
> > > > > Return negative errno in case of failure like others.
> > > >
> > > > Anyway, we have to return an index. If we do not return it as the
> > > > function result, we will need an extra pointer parameter, which
> > > > complicates the code.
> > >
> > > You misunderstood it. See what I wrote below. The function still
> > > returns the index but in case of error, make it return negative errno
> instead of zero.
> > >
> > > > >
> > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > > rte_errno is set.
> > > > >
> > > > > > + */
> > > > > > +static unsigned int
> > > > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			   struct mlx5_flow *dev_flow,
> > > > > > +			   struct rte_flow_error *error) {
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > > +
> > > > > > +	vtep = NULL;
> > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +		if (vlst->port == port) {
> > > > > > +			vtep = vlst;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > >
> > > > > You just need one variable.
> > > >
> > > > Yes. There is a long story, I forgot to revert code to one
> > > > variable after
> > > debugging.
> > > > >
> > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > >
> > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > 		if (vtep->port == port)
> > > > > 			break;
> > > > > 	}
> > > > >
> > > > > > +	if (!vtep) {
> > > > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > +		if (vtep)
> > > > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> next);
> > > > > > +	} else {
> > > > > > +		if (vtep->ifouter) {
> > > > > > +			rte_flow_error_set(error, -errno,
> > > > > > +
> 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > NULL,
> > > > > > +				"Failed to create decap VTEP,
> attached "
> > > > > > +				"device with the same UDP port
> exists");
> > > > > > +				vtep = NULL;
> > > > >
> > > > > Making vtep null to skip the following code?
> > > >
> > > > Yes. To avoid multiple return operators in code.
> > >
> > > It's okay to have multiple returns. Why not?
> >
> > It is easy to miss the return in the midst of function  while
> refactoring/modifying the code.
> 
> Your code path doesn't look easy and free from error. Please refer to other
> control path functions in this PMD.
> 
> > > > > Please merge the two same
> > > > > > if/else and make the code path straightforward. And which errno
> > > > > do you expect here?
> > > > > Should it be set EEXIST instead?
> > > > Not always. Netlink returns the code.
> > >
> > > No, that's not my point. Your code above sets errno instead of
> > > rte_errno or EEXIST.
> > >
> > > 	} else {
> > > 		if (vtep->ifouter) {
> > > 			rte_flow_error_set(error, -errno,
> > >
> > > Which one sets this errno? Here, it sets rte_errno because matched
> > > vtep
> > libmnl sets it while processing the Netlink reply message (callback.c in
> > the libmnl sources).
> 
> You still don't understand my point.
> 
> In this flow_tcf_decap_vtep_create(), if vtep is found (vtep != NULL), how
> can errno be set? Before the if/else, there's no libmnl call.
> 
> > > can't be used as it already has outer iface attached (error message
> > > isn't clear, please reword it too). I thought this should be EEXIST
> > > but you set errno to rte_errno but errno isn't valid at this point.
> > >
> > > >
> > > > >
> > > > > > +		}
> > > > > > +	}
> > > > > > +	if (vtep) {
> > > > > > +		vtep->refcnt++;
> > > > > > +		assert(vtep->ifindex);
> > > > > > +		return vtep->ifindex;
> > > > > > +	} else {
> > > > > > +		return 0;
> > > > > > +	}
> > > > >
> > > > > Why repeating same if/else?
> > > > >
> > > > >
> > > > > This is my suggestion but if you take my suggestion to have
> > > > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > > > Agree.
> > > >
> > > > >
> > > > > {
> > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > >
> > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > 		if (vtep->port == port)
> > > > > 			break;
> > > > > 	}
> > > > > 	if (vtep && vtep->ifouter)
> > > > > 		return rte_flow_error_set(... EEXIST ...);
> > > > > 	else if (vtep) {
> > > > > 		++vtep->refcnt;
> > > > > 	} else {
> > > > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > 		if (!vtep)
> > > > > 			return rte_flow_error_set(...);
> > > > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > 	}
> > > > > 	assert(vtep->ifindex);
> > > > > 	return vtep->ifindex;
> > > > > }
> > > > >
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Creates target interface index for VXLAN tunneling
> encapsulation.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifouter
> > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + * @return
> > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > > + */
> > > > > > +static unsigned int
> > > > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			    unsigned int ifouter,
> > > > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > > > +			    struct rte_flow_error *error) {
> > > > > > +	static uint16_t encap_port =
> MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > +
> > > > > > +	assert(ifouter);
> > > > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > > > +	vtep = NULL;
> > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +		if (vlst->ifouter == ifouter) {
> > > > > > +			vtep = vlst;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > >
> > > > > Same here.
> > > > >
> > > > > > +	if (!vtep) {
> > > > > > +		uint16_t pcnt;
> > > > > > +
> > > > > > +		/* Not found, we should create the new attached
> VTEP. */
> > > > > > +/*
> > > > > > + * TODO: not implemented yet
> > > > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > > > >
> > > > > Personal note is not appropriate even though it is removed in
> > > > > the following patch.
> > > > >
> > > > > > +		for (pcnt = 0; pcnt <=
> (MLX5_VXLAN_PORT_RANGE_MAX
> > > > > > +				     -
> MLX5_VXLAN_PORT_RANGE_MIN);
> > > > > pcnt++) {
> > > > > > +			encap_port++;
> > > > > > +			/* Wraparound the UDP port index. */
> > > > > > +			if (encap_port <
> MLX5_VXLAN_PORT_RANGE_MIN
> > > > > ||
> > > > > > +			    encap_port >
> MLX5_VXLAN_PORT_RANGE_MAX)
> > > > > > +				encap_port =
> > > > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > > > +			/* Check whether UDP port is in already in
> use. */
> > > > > > +			vtep = NULL;
> > > > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +				if (vlst->port == encap_port) {
> > > > > > +					vtep = vlst;
> > > > > > +					break;
> > > > > > +				}
> > > > > > +			}
> > > > >
> > > > > If you want to find out an empty port number, you can use
> > > > > rte_bitmap instead of repeating searching the entire list for
> > > > > all possible port
> > > numbers.
> > > >
> > > > We do not expect many VXLAN devices to be created, so a bitmap seems like overkill.
> > >
> > > +1, valid point.
> > >
> > > > > > +			if (vtep) {
> > > > > > +				vtep = NULL;
> > > > > > +				continue;
> > > > > > +			}
> > > > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > > > +						     encap_port,
> error);
> > > > > > +			if (vtep) {
> > > > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan,
> vtep,
> > > > > next);
> > > > > > +				break;
> > > > > > +			}
> > > > > > +			if (rte_errno != EEXIST)
> > > > > > +				break;
> > > > > > +		}
> > > > > > +	}
> > > > > > +	if (!vtep)
> > > > > > +		return 0;
> > > > > > +	vtep->refcnt++;
> > > > > > +	assert(vtep->ifindex);
> > > > > > +	return vtep->ifindex;
> > > > >
> > > > > Please refactor this func according to what I suggested for
> > > > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > > > >
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Creates target interface index for tunneling of any type.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifouter
> > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + * @param[out] error
> > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > + * @return
> > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > >
> > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > >  *   rte_errno is set.
> > > > >
> > > > > > + */
> > > > > > +static unsigned int
> > > > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			    unsigned int ifouter,
> > > > > > +			    struct mlx5_flow *dev_flow,
> > > > > > +			    struct rte_flow_error *error) {
> > > > > > +	unsigned int ret;
> > > > > > +
> > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > > > +						 dev_flow, error);
> > > > > > +		break;
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow,
> error);
> > > > > > +		break;
> > > > > > +	default:
> > > > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > > > +
> 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > NULL,
> > > > > > +				"unsupported tunnel type");
> > > > > > +		ret = 0;
> > > > > > +		break;
> > > > > > +	}
> > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > +/**
> > > > > > + * Deletes tunneling interface by UDP port.
> > > > > > + *
> > > > > > + * @param[in] tcf
> > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > + * @param[in] ifindex
> > > > > > + *   Network interface index of VXLAN device.
> > > > > > + * @param[in] dev_flow
> > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > + */
> > > > > > +static void
> > > > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > > > +			    unsigned int ifindex,
> > > > > > +			    struct mlx5_flow *dev_flow) {
> > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > +
> > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > +	vtep = NULL;
> > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > +		if (vlst->ifindex == ifindex) {
> > > > > > +			vtep = vlst;
> > > > > > +			break;
> > > > > > +		}
> > > > > > +	}
> > > > >
> > > > > It is weird. You just can have vtep pointer in the
> > > > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > > > vtep->ifindex like the assertion below. Then, this lookup can be
> skipped.
> > > >
> > > > OK. Good optimization.
> > > >
> > > > >
> > > > > > +	if (!vtep) {
> > > > > > +		DRV_LOG(WARNING, "No VTEP device found in the
> list");
> > > > > > +		goto exit;
> > > > > > +	}
> > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > +		break;
> > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > +/*
> > > > > > + * TODO: Remove the encap ancillary rules first.
> > > > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> > > > > > +*/
> > > > >
> > > > > Is it a personal note? Please remove.
> > > > OK.
> > > >
> > > > >
> > > > > > +		break;
> > > > > > +	default:
> > > > > > +		assert(false);
> > > > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > > > +		break;
> > > > > > +	}
> > > > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > > > +	assert(vtep->refcnt);
> > > > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > > > +		LIST_REMOVE(vtep, next);
> > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > +	}
> > > > > > +exit:
> > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > +}
> > > > > > +
> > > > > >  /**
> > > > > >   * Apply flow to E-Switch by sending Netlink message.
> > > > > >   *
> > > > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > > > >  	       struct rte_flow_error *error)  {
> > > > > >  	struct priv *priv = dev->data->dev_private;
> > > > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > > > >  	struct mlx5_flow *dev_flow;
> > > > > >  	struct nlmsghdr *nlh;
> > > > > > +	int ret;
> > > > > >
> > > > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > > > >  	/* E-Switch flow can't be expanded. */
> > > > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > > > +	if (dev_flow->tcf.applied)
> > > > > > +		return 0;
> > > > > >  	nlh = dev_flow->tcf.nlh;
> > > > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE |
> > > > > NLM_F_EXCL;
> > > > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > +		/*
> > > > > > +		 * Replace the interface index, target for
> > > > > > +		 * encapsulation, source for decapsulation.
> > > > > > +		 */
> > > > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > +		/* Create actual VTEP device when rule is being
> applied. */
> > > > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > > > +					*dev_flow->tcf.tunnel-
> >ifindex_ptr,
> > > > > > +					dev_flow, error);
> > > > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > > > +			return -rte_errno;
> > > > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > > > +	}
> > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > > > >
> > > > > ifindex_org looks a temporary storage in this code. And this
> > > > > kind of hassle
> > > > > (replace/restore) is there because you took the ifindex from the
> > > > > netlink message. Why don't you have just
> > > > >
> > > > > struct mlx5_flow_tcf_tunnel_hdr {
> > > > > 	uint32_t type; /**< Tunnel action type. */
> > > > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > > > > */ };
> > > > >
> > > > > and don't change ifindex?
> > > >
> > > > I propose to use the local variable for ifindex_org and do not
> > > > keep it in structure. *ifindex_ptr will keep.
> > >
> > > Well, you still have to restore the ifindex whenever sending the nl
> > > msg. Most of all, ifindex_ptr in nl msg isn't a right place to store the
> ifindex.
> > It is stored there for rules w/o tunnels. It is its "native" place, Id
> > prefer not to create some new location to store the original index and save
> some space.
> > We have to swap indices only if rule has requested the tunneling.  We
> > can not
> 
> No no. At this point, flow is already created to be tunneled one. What do you
> mean by 'rules w/o tunnels' or 'only if rule has requested the tunneling'??

I mean the code handles all kinds of rules, with and without tunnels.
The same code prepares the NL message for both rule types.

> It has already been created as a vxlan tunnel rule. It won't be changed. The
> nlmsg is supposed to have vtep ifindex but translation didn't know it and
> stored the outer iface temporarily to get it replaced by vtep ifindex. It never
> be a 'native'/'original' place to store it.

I mean, if the rule does not request a tunneling action, it just keeps the
ifindex unchanged within the Netlink message. If there is tunneling, we replace
this index with a value that depends on this ifindex. We cannot replace the
ifindex permanently, once, at rule translation, because VTEPs are created
dynamically and the VTEP ifindex can be different at rule apply time.
So we need to keep the original ifindex and create the VTEP depending on it
every time the rule is applied.
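
To make the swap-and-restore concrete, here is a minimal sketch under stated
assumptions. It is not the PMD code: struct tunnel, vtep_get(), send_msg() and
apply_rule() are hypothetical stand-ins, and the "VTEP lookup" is faked with
simple arithmetic. It only shows the original ifindex living in a local
variable while the message field temporarily carries the VTEP ifindex.

```c
#include <assert.h>

/* Hypothetical stand-ins; none of these names come from the patch. */
struct tunnel {
	unsigned int *ifindex_ptr; /* Points at the ifindex field in the message. */
	unsigned int ifindex_tun;  /* VTEP ifindex used while the rule is applied. */
};

/* Fake VTEP lookup/creation: derives a device index from the outer ifindex. */
static unsigned int
vtep_get(unsigned int ifouter)
{
	return ifouter ? ifouter + 100 : 0;
}

/* Fake send: records which ifindex the receiver would have seen. */
static unsigned int last_sent_ifindex;

static int
send_msg(const unsigned int *msg_ifindex)
{
	last_sent_ifindex = *msg_ifindex;
	return 0;
}

/* Swap in the VTEP ifindex, send, then restore the original value. */
static int
apply_rule(struct tunnel *tun)
{
	/* The original ifindex is kept in a local, not in the structure. */
	unsigned int ifindex_org = *tun->ifindex_ptr;
	int ret;

	tun->ifindex_tun = vtep_get(ifindex_org);
	if (!tun->ifindex_tun)
		return -1;
	*tun->ifindex_ptr = tun->ifindex_tun;
	ret = send_msg(tun->ifindex_ptr);
	/* Restore, so that the next apply starts from the outer ifindex again. */
	*tun->ifindex_ptr = ifindex_org;
	return ret;
}
```

With a message field holding 7, the sketch sends 107 and leaves 7 in the
message afterwards, so a later re-apply can resolve a different VTEP.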

> In which case the nl msg can be sent
> with the 'original' ifindex? Any specific example? No.
> 
> > set tunnel index permanently, because rule can be
> > applied/removed/reapplied and other new VXLAN device with new index
> >can be recreated.
> 
> Every time it is applied, it will get the vtep and overwrite vtep ifindex in the nl
> msg.

Yes. We should overwrite this field anyway, every time the rule is applied,
because the vtep ifindex can be different. And we need to keep the original
ifindex (for example, to dynamically create the VTEP attached to it). Do you
propose to keep the ifindex_org field? As far as I can see, we only have to keep
the ifindex_ptr field.

> 
> > > have vtep ifindex but it just temporarily keeps the device ifindex
> > > until vtep is created/found.
> 
> Thanks,
> Yongseok


* Re: [dpdk-dev] [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation routine
  2018-10-29  9:33               ` Slava Ovsiienko
@ 2018-10-29 18:26                 ` Yongseok Koh
  0 siblings, 0 replies; 110+ messages in thread
From: Yongseok Koh @ 2018-10-29 18:26 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 29, 2018 at 02:33:03AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Saturday, October 27, 2018 0:57
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> > routine
> > 
> > On Fri, Oct 26, 2018 at 01:39:38AM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Friday, October 26, 2018 6:07
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow validation
> > > > routine
> > > >
> > > > On Thu, Oct 25, 2018 at 06:53:11AM -0700, Slava Ovsiienko wrote:
> > > > > > -----Original Message-----
> > > > > > From: Yongseok Koh
> > > > > > Sent: Tuesday, October 23, 2018 13:05
> > > > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > > > Subject: Re: [PATCH v2 2/7] net/mlx5: e-switch VXLAN flow
> > > > > > validation routine
> > > > > >
> > > > > > On Mon, Oct 15, 2018 at 02:13:30PM +0000, Viacheslav Ovsiienko
> > wrote:
> > > > [...]
> > > > > > > @@ -1114,7 +1733,6 @@ struct pedit_parser {
> > > > > > >  							   error);
> > > > > > >  			if (ret < 0)
> > > > > > >  				return ret;
> > > > > > > -			item_flags |=
> > MLX5_FLOW_LAYER_OUTER_L3_IPV4;
> > > > > > >  			mask.ipv4 = flow_tcf_item_mask
> > > > > > >  				(items, &rte_flow_item_ipv4_mask,
> > > > > > >  				 &flow_tcf_mask_supported.ipv4,
> > @@ -1135,13 +1753,22 @@
> > > > > > > struct pedit_parser {
> > > > > > >  				next_protocol =
> > > > > > >  					((const struct
> > rte_flow_item_ipv4 *)
> > > > > > >  					 (items->spec))-
> > >hdr.next_proto_id;
> > > > > > > +			if (item_flags &
> > > > > > MLX5_FLOW_LAYER_OUTER_L3_IPV4) {
> > > > > > > +				/*
> > > > > > > +				 * Multiple outer items are not
> > allowed as
> > > > > > > +				 * tunnel parameters, will raise an
> > error later.
> > > > > > > +				 */
> > > > > > > +				ipv4 = NULL;
> > > > > >
> > > > > > Can't it be inner then?
> > > > > AFAIK,  no for tc rules, we can not specify multiple levels (inner
> > > > > + outer) for
> > > > them.
> > > > > There is just no TCA_FLOWER_KEY_xxx attributes  for specifying
> > > > > inner
> > > > items
> > > > > to match by flower.
> > > >
> > > > When I briefly read the kernel code, I thought TCA_FLOWER_KEY_* are
> > > > for inner header before decap. I mean TCA_FLOWER_KEY_IPV4_SRC is
> > for
> > > > inner L3 and TCA_FLOWER_KEY_ENC_IPV4_SRC is for outer tunnel
> > header.
> > > > Please do some experiments with tc-flower command.
> > >
> > > Hm. Interesting. I will check.
> > >
> > > > > It is quite unclear comment, not the best one, sorry. I did not
> > > > > like it too, just forgot to rewrite.
> > > > >
> > > > > ipv4, ipv6 , udp variables gather the matching items during the
> > > > > item list
> > > > scanning,
> > > > > later variables are used for VXLAN decap action validation only.
> > > > > So, the
> > > > "outer"
> > > > > means that ipv4 variable contains the VXLAN decap outer addresses,
> > > > > and should be NULL-ed if multiple items are found in the items list.
> > > > >
> > > > > But we can generate an error here if we have valid action_flags
> > > > > (gathered by prepare function) and VXLAN decap is set. Raising an
> > > > > error looks more relevant and clear.
> > > >
> > > > You can't use flags at this point. It is validate() so prepare()
> > > > might not be preceded.
> > > >
> > > > > >   flow create 1 ingress transfer
> > > > > >     pattern eth src is 66:77:88:99:aa:bb
> > > > > >       dst is 00:11:22:33:44:55 / ipv4 src is 2.2.2.2 dst is 1.1.1.1 /
> > > > > >       udp src is 4789 dst is 4242 / vxlan vni is 0x112233 /
> > > > > >       eth / ipv6 / tcp dst is 42 / end
> > > > > >     actions vxlan_decap / port_id id 2 / end
> > > > > >
> > > > > > Is this flow supported by linux tcf? I took this example from
> > > > > > Adrien's
> > > > patch -
> > > > > > "[8/8] net/mlx5: add VXLAN decap support to switch flow rules".
> > > > > > If so,
> > > > isn't it
> > > > > > possible to have inner L3 layer (MLX5_FLOW_LAYER_INNER_*)? If
> > > > > > not,
> > > > you
> > > > > > should return error in this case. I don't see any code to check
> > > > > > redundant outer items.
> > > > > > Did I miss something?
> > > > >
> > > > > Interesting, besides rule has correct syntax, I'm not sure whether
> > > > > it can be
> > > > applied w/o errors.
> > > >
> > > > Please try. You owns this patchset. However, you just can prohibit
> > > > such flows (tunneled item) and come up with follow-up patches to
> > > > enable it later if it is support by tcf as this whole patchset
> > > > itself is pretty huge enough and we don't have much time.
> > > >
> > > > > At least our current flow_tcf_translate() implementation does not
> > > > > support
> > > > any INNERs.
> > > > > But it seems the flow_tcf_validate() does, it's subject to recheck
> > > > > - we
> > > > should not allow
> > > > > unsupported items to pass the validation. I'll check and provide
> > > > > the
> > > > separate bugfix patch
> > > > > (if any).
> > > >
> > > > Neither has tunnel support. It is the first time to add tunnel support to
> > TCF.
> > > > If it was needed, you should've added it, not skipping it.
> > > >
> > > > You can check how MLX5_FLOW_LAYER_TUNNEL is used in Verbs/DV as
> > a
> > > > reference.
> > >
> > > Yes. I understood your point. Will check and add tunnel support for TCF
> > rules.
> > > Anyway, inner MAC addresses are supported for VXLAN decap, I think we
> > > should specify these ones in the rule as inners (after VNI item),
> > > definitely some tunnel support in validate/parse/translate should be added.
> > >
> > > >
> > > > > > BTW, for the tunneled items, why don't you follow the code of
> > > > > > Verbs(mlx5_flow_verbs.c) and DV(mlx5_flow_dv.c)? For tcf, it is
> > > > > > the first
> > > > time
> > > > > For VXLAN it has some specifics (warning about ignored params,
> > > > > etc.) I've checked which of verbs/dv code could be reused and did
> > > > > not
> > > > discovered
> > > > > a lot. I'll recheck the latest code commits, possible it became
> > > > > more
> > > > appropriate
> > > > > for VXLAN.
> > > >
> > > > Agreed. I'm not forcing you to do it because we run out of time but
> > > > mentioned it because if there's any redundancy in our code, that
> > > > usually causes bug later.
> > > > Let's not waste too much time for that. Just grab low hanging fruits if
> > any.
> > > >
> > > > > > to add tunneled item, but Verbs/DV already have validation code
> > > > > > for
> > > > tunnel,
> > > > > > so you can reuse the existing code. In
> > > > > > flow_tcf_validate_vxlan_decap(),
> > > > not
> > > > > > every validation is VXLAN-specific but some of them can be
> > > > > > common
> > > > code.
> > > > > >
> > > > > > And if you need to know whether there's the VXLAN decap action
> > > > > > prior to outer header item validation, you can relocate the code
> > > > > > - action
> > > > validation
> > > > > > first and item validation next, as there's no dependency yet in
> > > > > > the current
> > > > >
> > > > > We can not validate action first - we need items to be preliminary
> > > > gathered,
> > > > > to check them in action's specific fashion and to check action itself.
> > > > > I mean, if we see VXLAN decap action, we should check the presence
> > > > > of L2, L3, L4 and VNI items. I minimized the number of passes
> > > > > along the item and action lists. BTW, Adrien's approach performed
> > > > > two passes, mine does
> > > > only.
> > > > >
> > > > > > code. Defining ipv4, ipv6, udp seems to make the code path more
> > > > complex.
> > > > > Yes, but it allows us to avoid the extra item list scanning and
> > > > > minimizes the
> > > > changes
> > > > > of existing code.
> > > > > In your approach we should:
> > > > > - scan actions, w/o full checking, just action_flags gathering and
> > > > > checking
> > > > > - scan items, performing variating check (depending on gathered
> > > > > action
> > > > flags)
> > > > > - scan actions again, performing full check with params (at least
> > > > > for now check whether all params gathered)
> > > >
> > > > Disagree. flow_tcf_validate_vxlan_encap() doesn't even need any info
> > > > of items and flow_tcf_validate_vxlan_decap() needs item_flags to
> > > > check whether VXLAN item is there or not and ipv4/ipv6/udp are all
> > > > for item checks. Let me give you a very detailed example:
> > > >
> > > > {
> > > > 	for (actions[]...) {
> > > > 		...
> > > > 		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
> > > > 			...
> > > > 			flow_tcf_validate_vxlan_encap();
> > > > 			...
> > > > 			break;
> > > > 		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
> > > > 			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
> > > > 					   | MLX5_ACTION_VXLAN_DECAP))
> > > > 				return rte_flow_error_set
> > > > 					(error, ENOTSUP,
> > > > 					 RTE_FLOW_ERROR_TYPE_ACTION,
> > > > 					 actions,
> > > > 					 "can't have multiple vxlan actions");
> > > > 			/* Don't call flow_tcf_validate_vxlan_decap(). */
> > > > 			action_flags |= MLX5_ACTION_VXLAN_DECAP;
> > > > 			break;
> > > > 	}
> > > > 	for (items[]...) {
> > > > 		...
> > > > 		case RTE_FLOW_ITEM_TYPE_IPV4:
> > > > 			/* Existing common validation. */
> > > > 			...
> > > > 			if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
> > > > 				/* Do ipv4 validation in
> > > > 				 * flow_tcf_validate_vxlan_decap(). */
> > > > 			}
> > > > 			break;
> > > > 	}
> > > > }
> > > >
> > > > Currently you are doing,
> > > >
> > > > 	- validate items
> > > > 	- validate actions
> > > > 	- validate items again if decap.
> > > >
> > > > But this can simply be
> > > >
> > > > 	- validate actions
> > > How  we could validate VXLAN decap at this stage?
> > > As we do not have item_flags set yet?
> > > Do I miss something?
> > 
> > Look at my pseudo code above.
> > Nothing much to be done in validating decap action. And item validation for
> > decap can be done together in item validation code.
> > 
> VXLAN decap action should check:
> - whether outer destination UDP port is present (otherwise we cannot assign VTEP VXLAN)
> - whether outer destination IP is present (otherwise we cannot assign IP to ifouter/build route)
> - whether VNI is present (to identify VXLAN traffic)
> 
> How do you  propose check these issues in your approach?

Did you look at my pseudo code? We are not validating vxlan decap action itself
but items when vxlan decap is present.

{
	for (actions[]...) {
		...
		case RTE_FLOW_ACTION_TYPE_VXLAN_ENCAP:
			...
			flow_tcf_validate_vxlan_encap();
			...
			break;
		case RTE_FLOW_ACTION_TYPE_VXLAN_DECAP:
			if (action_flags & (MLX5_ACTION_VXLAN_ENCAP
					   | MLX5_ACTION_VXLAN_DECAP))
				return rte_flow_error_set
					(error, ENOTSUP,
					 RTE_FLOW_ERROR_TYPE_ACTION,
					 actions,
					 "can't have multiple vxlan actions");
			/* Don't call flow_tcf_validate_vxlan_decap(). */
			action_flags |= MLX5_ACTION_VXLAN_DECAP;
			break;
	}
	for (items[]...) {
		...
		case RTE_FLOW_ITEM_TYPE_IPV4:
			/* Existing common validation. */
			...
			if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
				/*
				 * check whether outer destination IP is present
				 */
			}
			break;
		...
		case RTE_FLOW_ITEM_TYPE_UDP:
			/* Existing common validation. */
			...
			if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
				/*
				 * check whether outer destination UDP port is
				 * present
				 */
			}
			break;
		...
		case RTE_FLOW_ITEM_TYPE_VXLAN:
			/* Do the same for vni. */
	}
	...
	if (action_flags & MLX5_ACTION_VXLAN_DECAP) {
		if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L3_IPV4 || ... IPV6))
			return rte_flow_error_set
				(error, EINVAL,
				 RTE_FLOW_ERROR_TYPE_ITEM, items,
				 "vxlan decap requires item IP");
		if (!(item_flags & MLX5_FLOW_LAYER_OUTER_L4_UDP))
			return rte_flow_error_set
				(error, EINVAL,
				 RTE_FLOW_ERROR_TYPE_ITEM, items,
				 "vxlan decap requires item UDP");
		if (!(item_flags & MLX5_FLOW_LAYER_VXLAN))
			/* Do the same . */
	}
}
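
The ordering in the pseudo code above (actions first, items second, one final
check) can be sketched as compilable C. This is only an illustration under
stated assumptions: the enums, the F_* flag names and validate() are
hypothetical and heavily simplified, not the rte_flow or mlx5 definitions, and
the per-item decap checks are reduced to recording which items are present.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, simplified types; the real rte_flow structures are richer. */
enum action_type { ACT_END, ACT_VXLAN_ENCAP, ACT_VXLAN_DECAP, ACT_PORT_ID };
enum item_type { ITEM_END, ITEM_ETH, ITEM_IPV4, ITEM_IPV6, ITEM_UDP,
		 ITEM_VXLAN };

#define F_ACT_ENCAP  (1u << 0)
#define F_ACT_DECAP  (1u << 1)
#define F_ITEM_L3    (1u << 0)
#define F_ITEM_UDP   (1u << 1)
#define F_ITEM_VXLAN (1u << 2)

/* Returns 0 on success or a negative errno value. */
static int
validate(const enum item_type *items, const enum action_type *actions)
{
	unsigned int action_flags = 0;
	unsigned int item_flags = 0;

	/* Pass 1: actions. Only record which tunnel action is requested. */
	for (; *actions != ACT_END; actions++) {
		switch (*actions) {
		case ACT_VXLAN_ENCAP:
		case ACT_VXLAN_DECAP:
			if (action_flags & (F_ACT_ENCAP | F_ACT_DECAP))
				return -ENOTSUP; /* Multiple vxlan actions. */
			action_flags |= (*actions == ACT_VXLAN_ENCAP) ?
					F_ACT_ENCAP : F_ACT_DECAP;
			break;
		default:
			break;
		}
	}
	/* Pass 2: items. Decap-specific checks would go into each case. */
	for (; *items != ITEM_END; items++) {
		switch (*items) {
		case ITEM_IPV4:
		case ITEM_IPV6:
			item_flags |= F_ITEM_L3;
			break;
		case ITEM_UDP:
			item_flags |= F_ITEM_UDP;
			break;
		case ITEM_VXLAN:
			item_flags |= F_ITEM_VXLAN;
			break;
		default:
			break;
		}
	}
	/* Final check: decap requires outer IP, UDP and VXLAN items. */
	if (action_flags & F_ACT_DECAP) {
		const unsigned int need = F_ITEM_L3 | F_ITEM_UDP | F_ITEM_VXLAN;

		if ((item_flags & need) != need)
			return -EINVAL;
	}
	return 0;
}
```

In this sketch each list is walked once, and the decap requirements are
enforced after both walks, so the action pass does not need any item
information.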

Still problem?

Thanks,
Yongseok


* Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
  2018-10-29 11:53               ` Slava Ovsiienko
@ 2018-10-29 18:42                 ` Yongseok Koh
  0 siblings, 0 replies; 110+ messages in thread
From: Yongseok Koh @ 2018-10-29 18:42 UTC (permalink / raw)
  To: Slava Ovsiienko; +Cc: Shahaf Shuler, dev

On Mon, Oct 29, 2018 at 04:53:34AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Saturday, October 27, 2018 1:43
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> > 
> > On Fri, Oct 26, 2018 at 02:35:24AM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Friday, October 26, 2018 9:26
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > > management
> > > >
> > > > On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > > > > -----Original Message-----
> > > > > > From: Yongseok Koh
> > > > > > Sent: Thursday, October 25, 2018 3:28
> > > > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel
> > > > > > devices management
> > > > > >
> > > > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko
> > wrote:
> > > > > > > VXLAN interfaces are dynamically created for each local UDP
> > > > > > > port of outer networks and then used as targets for TC
> > > > > > > "flower" filters in order to perform encapsulation. These
> > > > > > > VXLAN interfaces are system-wide, the only one device with
> > > > > > > given UDP port can exist in the system (the attempt of
> > > > > > > creating another device with the same UDP local port returns
> > > > > > > EEXIST), so PMD should support the shared device instances
> > > > > > > database for PMD instances. These VXLAN implicitly created devices
> > are called VTEPs (Virtual Tunnel End Points).
> > > > > > >
> > > > > > > Creation of the VTEP occurs at the moment of rule applying.
> > > > > > > The link is set up, root ingress qdisc is also initialized.
> > > > > > >
> > > > > > > Encapsulation VTEPs are created on per port basis, the single
> > > > > > > VTEP is attached to the outer interface and is shared for all
> > > > > > > encapsulation rules on this interface. The source UDP port is
> > > > > > > automatically selected in range 30000-60000.
> > > > > > >
> > > > > > > > For decapsulation one VTEP is created per every unique UDP
> > > > > > > local port to accept tunnel traffic. The name of created VTEP
> > > > > > > consists of prefix "vmlx_" and the number of UDP port in
> > > > > > > decimal digits without leading zeros (vmlx_4789). The VTEP can
> > > > > > > be preliminary created in the system before the launching
> > > > > > > > application, it allows to share UDP ports between primary
> > > > > > > and secondary processes.
> > > > > > >
> > > > > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > > > > > ---
> > > > > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > > > > ++++++++++++++++++++++++++++++++++++++-
> > > > > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > index d6840d5..efa9c3b 100644
> > > > > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > > > > >  	return -err;
> > > > > > >  }
> > > > > > >
> > > > > > > +/* VTEP device list is shared between PMD port instances. */
> > > > > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > > > > +			vtep_list_vxlan = LIST_HEAD_INITIALIZER();
> > static
> > > > > > pthread_mutex_t
> > > > > > > +vtep_list_mutex = PTHREAD_MUTEX_INITIALIZER;
> > > > > >
> > > > > > What's the reason for choosing pthread_mutex instead of
> > rte_*_lock?
> > > > >
> > > > > The sharing this database for secondary processes?
> > > >
> > > > The static variable isn't shared with sec proc. But you can leave it as is.
> > >
> > > Yes. The sharing just was assumed, not implemented yet.
> > >
> > > >
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Deletes VTEP network device.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] vtep
> > > > > > > + *   Object representing the network device to delete. Memory
> > > > > > > + *   allocated for this object is freed by routine.
> > > > > > > + */
> > > > > > > +static void
> > > > > > > +flow_tcf_delete_iface(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > > > > +	struct nlmsghdr *nlh;
> > > > > > > +	struct ifinfomsg *ifm;
> > > > > > > +	alignas(struct nlmsghdr)
> > > > > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	assert(!vtep->refcnt);
> > > > > > > +	if (vtep->created && vtep->ifindex) {
> > > > > >
> > > > > > First of all vtep->created seems of no use. It is introduced to
> > > > > > select the error message in flow_tcf_create_iface(). I don't see
> > > > > > any necessity to distinguish between 'vtep is allocated by
> > > > > > rte_malloc()' and 'vtep is created in kernel'.
> > > > >
> > > > > created flag indicates the iface is created by our code.
> > > > > The VXLAN decap devices must have the specified UDP port, we can
> > > > > not create multiple VXLAN devices with the same UDP port - EEXIST
> > > > > is returned. So, we have to share device. One option is create
> > > > > device before DPDK application launch and use these pre-created
> > > > > devices.
> > > > > In this case created flag is not set and VXLAN device is not
> > > > > reinitialized, and not deleted.
> > > >
> > > > I can't see any code to use pre-created device (created even before
> > > > dpdk app launch). Your code just tries to create 'vmlx_xxxx'. Even
> > > > from your comment in [7/7] patch, PMD will cleanup any leftovers
> > > > (existing vtep devices) on initialization. Your comment sounds
> > > > conflicting and confusing.
> > >
> > > There are two types of VXLAN devices:
> > >
> > > - VXLAN decap, not attached to any ifouter. Provides the ingress UDP
> > > port, we try to share the devices of this type, because we may be
> > > asked for the specified UDP port. No device/rule cleanup and reinit
> > > needed.
> > >
> > > - VXLAN encap, should be attached to ifouter to provide strict egress
> > > path, no need to share - egress UDP port does not matter. And we need
> > > to cleanup ifouter, remove other attached VXLAN devices and rules,
> > > because it is too hard to co-exist with some pre-created setup.
> > 
> > I knew that. But how can it justify the need of 'created' field in vtep struct?
> > In this code, it is of no use. But will see how it is used in your v3.
> > 
> > > > > > And why do you need to check vtep->ifindex as well? If vtep is
> > > > > > created in kernel and its ifindex isn't set, that should be an
> > > > > > error which had to be handled in flow_tcf_create_iface(). Such a
> > > > > > vtep shouldn't exist.
> > > > > Yes, if we did not get ifindex of device - vtep is not created, error
> > > > > returned.
> > > > > We just can not operate w/o ifindex.
> > > >
> > > > I know ifindex is needed but my question was checking vtep->ifindex
> > > > here looked redundant/unnecessary. But as you agreed on having
> > > > create/get/release_iface(), it doesn't matter much.
> > >
> > > Yes. I agree, will refactor the code.
> > >
> > > >
> > > > > > Also, the refcnt management is a bit strange. Please put an
> > > > > > abstraction by adding create_iface(), get_iface() and
> > > > > > release_iface(). In the get_iface(),
> > > > > > vtep->refcnt should be incremented. And in the release_iface(),
> > > > > > it decreases the
> > > > > OK. Good proposal. I'll refactor the code.
> > > > >
> > > > > > refcnt and if it reaches to zero, the iface can be removed.
> > > > > > create_iface() will set the refcnt to 1. And if you refer to
> > > > > > mlx5_hrxq_get(), it even does searching the list not by
> > > > > > repeating the same lookup code here and there.
> > > > > > That will make your code much simpler.
> > > > > >
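The create/get/release split suggested above could look roughly like the sketch below. It is a simplified, self-contained illustration and not the driver code: the struct, the list type and the function names are hypothetical, plain calloc()/free() stand in for rte_zmalloc()/rte_free(), and no Netlink call is made.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical, trimmed-down VTEP descriptor. */
struct vtep {
	LIST_ENTRY(vtep) next;
	uint16_t port;
	unsigned int refcnt;
};

LIST_HEAD(vtep_list, vtep);

/* Allocate a descriptor, link it and start with one reference. */
static struct vtep *
vtep_create(struct vtep_list *list, uint16_t port)
{
	struct vtep *vtep = calloc(1, sizeof(*vtep));

	if (!vtep)
		return NULL;
	vtep->port = port;
	vtep->refcnt = 1;
	LIST_INSERT_HEAD(list, vtep, next);
	return vtep;
}

/* Look the port up once and take a reference if it is found. */
static struct vtep *
vtep_get(struct vtep_list *list, uint16_t port)
{
	struct vtep *vtep;

	LIST_FOREACH(vtep, list, next) {
		if (vtep->port == port) {
			vtep->refcnt++;
			return vtep;
		}
	}
	return NULL;
}

/* Drop a reference; returns 1 if the descriptor was destroyed. */
static int
vtep_release(struct vtep *vtep)
{
	assert(vtep->refcnt);
	if (--vtep->refcnt)
		return 0;
	LIST_REMOVE(vtep, next);
	free(vtep);
	return 1;
}
```

With this shape the callers never touch refcnt directly, and the list lookup lives in one place.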
> > > > > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > > +		nlh->nlmsg_type = RTM_DELLINK;
> > > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +		if (ret)
> > > > > > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN "
> > > > > > > +					 "encap/decap ifindex %u",
> > > > > > > +					 ifm->ifi_index);
> > > > > > > +	}
> > > > > > > +	rte_free(vtep);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates VTEP network device.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Outer interface to attach new-created VXLAN device
> > > > > > > + *   If zero the VXLAN device will not be attached to any device.
> > > > > > > + * @param[in] port
> > > > > > > + *   UDP port of created VTEP device.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + *
> > > > > > > + * @return
> > > > > > > + * Pointer to created device structure on success, NULL otherwise
> > > > > > > + * and rte_errno is set.
> > > > > > > + */
> > > > > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > > > > >
> > > > > > Why negative(ifndef) first instead of positive(ifdef)?
> > > > > Hm. Did I miss the rule. Positive #ifdef first? OK.
> > > >
> > > > No concrete rule but if there's no specific reason, it would be
> > > > better to start from ifdef.
> > > >
> > > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused,
> > > > > > > +		      unsigned int ifouter __rte_unused,
> > > > > > > +		      uint16_t port __rte_unused,
> > > > > > > +		      struct rte_flow_error *error) {
> > > > > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > > > +			 "netlink: failed to create VTEP, "
> > > > > > > +			 "VXLAN metadat is not supported by kernel");
> > > > > >
> > > > > > Typo.
> > > > >
> > > > > OK.  "metadata are not supported".
> > > > > >
> > > > > > > +	return NULL;
> > > > > > > +}
> > > > > > > +#else
> > > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > > > > >
> > > > > > How about adding 'vtep'? It sounds vague - creating a general
> > > > > > interface.
> > > > > > E.g., flow_tcf_create_vtep_iface()?
> > > > >
> > > > > OK.
> > > > >
> > > > > >
> > > > > > > +		      unsigned int ifouter,
> > > > > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > > +	struct nlmsghdr *nlh;
> > > > > > > +	struct ifinfomsg *ifm;
> > > > > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > > > > +	alignas(struct nlmsghdr)
> > > > > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > > > > >
> > > > > > Use a macro for '128'. Can't know the meaning.
> > > > > OK. I think we should calculate the buffer size explicitly.
> > > > >
> > > > > >
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > > > > +		       SZ_NLATTR_NEST * 2 +
> > > > > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > > > > +	struct nlattr *na_info;
> > > > > > > +	struct nlattr *na_vxlan;
> > > > > > > +	rte_be16_t vxlan_port = RTE_BE16(port);
> > > > > >
> > > > > > Use rte_cpu_to_be_*() instead.
> > > > >
> > > > > Yes, I'll recheck the whole code for this issue.
> > > > >
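As I understand the remark above, RTE_BE16() is intended for compile-time constants, while a value held in a variable should go through a run-time conversion such as rte_cpu_to_be_16(). The sketch below shows the same idea with the POSIX htons() as a stand-in, so that it builds without DPDK headers; port_to_be16() is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <stdint.h>

/*
 * Convert a UDP port held in a variable to network byte order at run time.
 * htons() is used here only as a portable stand-in for the DPDK helper
 * requested in the review.
 */
static uint16_t
port_to_be16(uint16_t port)
{
	return htons(port);
}
```

For port 4789 (0x12b5) the result is stored in memory as the bytes 0x12, 0xb5 regardless of host endianness.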
> > > > > >
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	vtep = rte_zmalloc(__func__, sizeof(*vtep),
> > > > > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > > > > +	if (!vtep) {
> > > > > > > +		rte_flow_error_set
> > > > > > > +			(error, ENOMEM, RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > > +			 NULL, "unable to allocate memory for VTEP desc");
> > > > > > > +		return NULL;
> > > > > > > +	}
> > > > > > > +	*vtep = (struct mlx5_flow_tcf_vtep){
> > > > > > > +			.refcnt = 0,
> > > > > > > +			.port = port,
> > > > > > > +			.created = 0,
> > > > > > > +			.ifouter = 0,
> > > > > > > +			.ifindex = 0,
> > > > > > > +			.local = LIST_HEAD_INITIALIZER(),
> > > > > > > +			.neigh = LIST_HEAD_INITIALIZER(),
> > > > > > > +	};
> > > > > > > +	memset(buf, 0, sizeof(buf));
> > > > > > > +	nlh = mnl_nlmsg_put_header(buf);
> > > > > > > +	nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > > +	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
> > > > > > > +	ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > > > +	ifm->ifi_family = AF_UNSPEC;
> > > > > > > +	ifm->ifi_type = 0;
> > > > > > > +	ifm->ifi_index = 0;
> > > > > > > +	ifm->ifi_flags = IFF_UP;
> > > > > > > +	ifm->ifi_change = 0xffffffff;
> > > > > > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX, port);
> > > > > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > > > > +	na_info = mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > > > > +	assert(na_info);
> > > > > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > > > > +	na_vxlan = mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > > > > +	if (ifouter)
> > > > > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > > > > +	assert(na_vxlan);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > > > > +	assert(sizeof(buf) >= nlh->nlmsg_len);
> > > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +	if (ret)
> > > > > > > +		DRV_LOG(WARNING,
> > > > > > > +			"netlink: VTEP %s create failure (%d)",
> > > > > > > +			name, rte_errno);
> > > > > > > +	else
> > > > > > > +		vtep->created = 1;
> > > > > >
> > > > > > Flow of code here isn't smooth, thus could be error-prone. Most
> > > > > > of all, I don't like ret has multiple meanings. ret should be
> > > > > > return value but you are using it to store ifindex.
> > > > > >
> > > > > > > +	if (ret && ifouter)
> > > > > > > +		ret = 0;
> > > > > > > +	else
> > > > > > > +		ret = if_nametoindex(name);
> > > > > >
> > > > > > If vtep isn't created and ifouter is set, then skip init below,
> > > > > > which means, if
> > > > >
> > > > > ifouter is set for VXLAN encap devices. They should be attached to
> > > > > ifouter and can not be shared. So, if ifouter is set - we do not
> > > > > use the precreated/existing VXLAN devices. We have to create our
> > > > > own not shared device.
> > > >
> > > > In your code (flow_tcf_encap_vtep_create()), it is shared by multiple
> > > > flows.
> > > > Do you mean it isn't shared between different outer ifaces? If so,
> > > > that's for sure.
> > > Sorry, I do not understand the question.
> > > VXLAN encap device is attached to ifouter and shared by all flows with
> > > this ifouter. No multiple VXLAN devices are attached to the same ifouter,
> > > only one.
> > > VXLAN decap device has no attached ifouter, so it can not share it.
> > 
> > Yep, that's what I meant.
> > 
> > > > > > vtep is created or ifouter isn't set, it tries to get ifindex of vtep.
> > > > > > But why do you want to try to call this API even if it failed to create
> > > > > > vtep?
> > > > > > Let's not make code flow convoluted even though it logically works.
> > > > > > Let's make it straightforward.
> > > > > >
> > > > > > > +	if (ret) {
> > > > > > > +		vtep->ifindex = ret;
> > > > > > > +		vtep->ifouter = ifouter;
> > > > > > > +		memset(buf, 0, sizeof(buf));
> > > > > > > +		nlh = mnl_nlmsg_put_header(buf);
> > > > > > > +		nlh->nlmsg_type = RTM_NEWLINK;
> > > > > > > +		nlh->nlmsg_flags = NLM_F_REQUEST;
> > > > > > > +		ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > > > +		ifm->ifi_family = AF_UNSPEC;
> > > > > > > +		ifm->ifi_type = 0;
> > > > > > > +		ifm->ifi_index = vtep->ifindex;
> > > > > > > +		ifm->ifi_flags = IFF_UP;
> > > > > > > +		ifm->ifi_change = IFF_UP;
> > > > > > > +		ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +		if (ret) {
> > > > > > > +			DRV_LOG(WARNING,
> > > > > > > +				"netlink: VTEP %s set link up failure (%d)",
> > > > > > > +				name, rte_errno);
> > > > > > > +			rte_free(vtep);
> > > > > > > +			rte_flow_error_set
> > > > > > > +				(error, -errno,
> > > > > > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > > > +				 "netlink: failed to set VTEP link up");
> > > > > > > +			vtep = NULL;
> > > > > > > +		} else {
> > > > > > > +			ret = mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > > > > > +			if (ret)
> > > > > > > +				DRV_LOG(WARNING,
> > > > > > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > > > > > +		}
> > > > > > > +	} else {
> > > > > > > +		DRV_LOG(WARNING,
> > > > > > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > > > > > +		rte_flow_error_set
> > > > > > > +			(error, -errno,
> > > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > > > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > > > > +			 ret = 1;
> > > > > >
> > > > > > If it fails to create a vtep above, it will print out two
> > > > > > warning messages and one rte_flow_error message. And it even
> > > > > > selects message to print between two?
> > > > > > And there's another info msg at the end even in case of failure.
> > > > > > Do you really want to do this even with manipulating ret to
> > > > > > change code path?  Not a good practice.
> > > > > >
> > > > > > Usually, code path should be straightforward for successful path
> > > > > > and for errors/failures, return immediately or use 'goto' if
> > > > > > there's need for cleanup.
> > > > > >
> > > > > > Please refactor entire function.
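The control-flow style asked for above (a straight success path, with failures jumping to a single cleanup label) might be sketched as follows. This is not the refactored driver function: step_create() and step_index() are hypothetical stand-ins for the Netlink create request and the ifindex lookup, and the fail_* flags exist only to exercise the error paths.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down VTEP descriptor. */
struct vtep {
	unsigned int ifindex;
};

/* Stand-in for the Netlink "create link" request. */
static int
step_create(int fail)
{
	return fail ? -EEXIST : 0;
}

/* Stand-in for the ifindex lookup; zero means "not found". */
static unsigned int
step_index(int fail)
{
	return fail ? 0 : 42;
}

/*
 * Create a descriptor. On success returns the descriptor and sets *err
 * to 0; on failure frees everything once, at the "error" label, and
 * reports a negative errno value through *err.
 */
static struct vtep *
vtep_create(int fail_create, int fail_index, int *err)
{
	struct vtep *vtep = calloc(1, sizeof(*vtep));
	int ret;

	if (!vtep) {
		*err = -ENOMEM;
		return NULL;
	}
	ret = step_create(fail_create);
	if (ret)
		goto error;
	vtep->ifindex = step_index(fail_index);
	if (!vtep->ifindex) {
		ret = -ENODEV;
		goto error;
	}
	*err = 0;
	return vtep;
error:
	free(vtep);
	*err = ret;
	return NULL;
}
```

Each failure produces exactly one error code and one cleanup, and ret never carries anything but a status.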
> > > > >
> > > > > I think I'll split it in two ones - for attached and potentially shared
> > > > > ifaces.
> > > > > >
> > > > > > > +	}
> > > > > > > +	if (ret) {
> > > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > > +		vtep = NULL;
> > > > > > > +	}
> > > > > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", port, vtep ? "OK" : "error");
> > > > > > > +	return vtep;
> > > > > > > +}
> > > > > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > > > > > + * In order to share the UDP port within the other interfaces the
> > > > > > > + * VXLAN device created as not attached to any interface (if created).
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > >
> > > > > > Return negative errno in case of failure like others.
> > > > >
> > > > > Anyway, we have to return an index. If we do not return it as
> > > > > function result we will need to provide some extra pointing
> > > > > parameter, it complicates the code.
> > > >
> > > > You misunderstood it. See what I wrote below. The function still
> > > > returns the index but in case of error, make it return negative errno
> > > > instead of zero.
> > > >
> > > > > >
> > > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > > > rte_errno is set.
> > > > > >
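The return convention discussed above (a positive interface index on success, a negative errno value on failure) can be illustrated with a small pure function. The struct and the helper decap_vtep_index() are hypothetical; the -ENOENT case is an assumption added for the sketch, while -EEXIST for an already attached device follows the reviewer's remark further down in the thread.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, trimmed-down VTEP descriptor. */
struct vtep {
	unsigned int ifindex; /* index of the VXLAN device itself */
	unsigned int ifouter; /* outer interface, 0 if not attached */
};

/*
 * Return the interface index usable for decapsulation (> 0),
 * or a negative errno value on failure.
 */
static int
decap_vtep_index(const struct vtep *vtep)
{
	if (vtep == NULL)
		return -ENOENT; /* assumption: nothing to share yet */
	if (vtep->ifouter)
		return -EEXIST; /* attached device owns this UDP port */
	return (int)vtep->ifindex;
}
```

A caller then needs only one check, ret < 0, and zero never has to double as an error marker.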
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			   struct mlx5_flow *dev_flow,
> > > > > > > +			   struct rte_flow_error *error) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > > > +
> > > > > > > +	vtep = NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->port == port) {
> > > > > > > +			vtep = vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > You just need one variable.
> > > > >
> > > > > Yes. There is a long story, I forgot to revert code to one
> > > > > variable after debugging.
> > > > > >
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > >
> > > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > > 		if (vtep->port == port)
> > > > > > 			break;
> > > > > > 	}
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > > +		if (vtep)
> > > > > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > > > +	} else {
> > > > > > > +		if (vtep->ifouter) {
> > > > > > > +			rte_flow_error_set(error, -errno,
> > > > > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > > > +				"Failed to create decap VTEP, attached "
> > > > > > > +				"device with the same UDP port exists");
> > > > > > > +				vtep = NULL;
> > > > > >
> > > > > > Making vtep null to skip the following code?
> > > > >
> > > > > Yes. To avoid multiple return operators in code.
> > > >
> > > > It's okay to have multiple returns. Why not?
> > >
> > > It is easy to miss the return in the midst of function while
> > > refactoring/modifying the code.
> > 
> > Your code path doesn't look easy and free from error. Please refer to other
> > control path functions in this PMD.
> > 
> > > > > > Please merge the two same
> > > > > > if/else and make the code path straightforward. And which errno
> > > > > > do you expect here?
> > > > > > Should it be set EEXIST instead?
> > > > > Not always. Netlink returns the code.
> > > >
> > > > No, that's not my point. Your code above sets errno instead of
> > > > rte_errno or EEXIST.
> > > >
> > > > 	} else {
> > > > 		if (vtep->ifouter) {
> > > > 			rte_flow_error_set(error, -errno,
> > > >
> > > > Which one sets this errno? Here, it sets rte_errno because matched
> > > > vtep
> > > libmnl sets, while processing the Netlink reply message (callback.c of libmnl
> > sources).
> > 
> > You still don't understand my point.
> > 
> > In this flow_tcf_decap_vtep_create(), if vtep is found (vtep != NULL), how
> > can errno be set? Before the if/else, there's no libmnl call.
> > 
> > > > can't be used as it already has outer iface attached (error message
> > > > isn't clear, please reword it too). I thought this should be EEXIST
> > > > but you set errno to rte_errno but errno isn't valid at this point.
> > > >
> > > > >
> > > > > >
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +	if (vtep) {
> > > > > > > +		vtep->refcnt++;
> > > > > > > +		assert(vtep->ifindex);
> > > > > > > +		return vtep->ifindex;
> > > > > > > +	} else {
> > > > > > > +		return 0;
> > > > > > > +	}
> > > > > >
> > > > > > Why repeating same if/else?
> > > > > >
> > > > > >
> > > > > > This is my suggestion but if you take my suggestion to have
> > > > > > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> > > > > Agree.
> > > > >
> > > > > >
> > > > > > {
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > 	uint16_t port = dev_flow->tcf.vxlan_decap->udp_port;
> > > > > >
> > > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > > 		if (vtep->port == port)
> > > > > > 			break;
> > > > > > 	}
> > > > > > 	if (vtep && vtep->ifouter)
> > > > > > 		return rte_flow_error_set(... EEXIST ...);
> > > > > > 	else if (vtep) {
> > > > > > 		++vtep->refcnt;
> > > > > > 	} else {
> > > > > > 		vtep = flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > 		if (!vtep)
> > > > > > 			return rte_flow_error_set(...);
> > > > > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > > 	}
> > > > > > 	assert(vtep->ifindex);
> > > > > > 	return vtep->ifindex;
> > > > > > }
> > > > > >
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			    unsigned int ifouter,
> > > > > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > > > > +			    struct rte_flow_error *error) {
> > > > > > > +	static uint16_t encap_port = MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +
> > > > > > > +	assert(ifouter);
> > > > > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > > > > +	vtep = NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->ifouter == ifouter) {
> > > > > > > +			vtep = vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > Same here.
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		uint16_t pcnt;
> > > > > > > +
> > > > > > > +		/* Not found, we should create the new attached VTEP. */
> > > > > > > +/*
> > > > > > > + * TODO: not implemented yet
> > > > > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > > > > >
> > > > > > Personal note is not appropriate even though it is removed in
> > > > > > the following patch.
> > > > > >
> > > > > > > +		for (pcnt = 0; pcnt <= (MLX5_VXLAN_PORT_RANGE_MAX
> > > > > > > +				     - MLX5_VXLAN_PORT_RANGE_MIN); pcnt++) {
> > > > > > > +			encap_port++;
> > > > > > > +			/* Wraparound the UDP port index. */
> > > > > > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN ||
> > > > > > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > > > > > +				encap_port = MLX5_VXLAN_PORT_RANGE_MIN;
> > > > > > > +			/* Check whether UDP port is already in use. */
> > > > > > > +			vtep = NULL;
> > > > > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +				if (vlst->port == encap_port) {
> > > > > > > +					vtep = vlst;
> > > > > > > +					break;
> > > > > > > +				}
> > > > > > > +			}
> > > > > >
> > > > > > If you want to find out an empty port number, you can use
> > > > > > rte_bitmap instead of repeating searching the entire list for
> > > > > > all possible port numbers.
> > > > >
> > > > > We do not expect too many VXLAN devices have been created. bitmap.
> > > >
> > > > +1, valid point.
> > > >
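The port selection loop under discussion can be reduced to the following stand-alone sketch. It keeps the linear scan (the thread agrees that few VXLAN devices are expected), but takes the list of used ports as a plain array. PORT_MIN and PORT_MAX mirror the 30000-60000 range from the commit message; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_MIN 30000
#define PORT_MAX 60000

/* Linear membership test over the ports already taken. */
static bool
port_in_use(const uint16_t *used, size_t n, uint16_t port)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (used[i] == port)
			return true;
	return false;
}

/*
 * Advance *last to the next free port in [PORT_MIN, PORT_MAX], wrapping
 * around at the end of the range. Returns the port, or 0 if every port
 * in the range is taken.
 */
static uint16_t
next_free_port(uint16_t *last, const uint16_t *used, size_t n)
{
	uint32_t tries;

	for (tries = 0; tries <= PORT_MAX - PORT_MIN; tries++) {
		(*last)++;
		if (*last < PORT_MIN || *last > PORT_MAX)
			*last = PORT_MIN;
		if (!port_in_use(used, n, *last))
			return *last;
	}
	return 0;
}
```

Starting with *last set to PORT_MIN - 1, as the patch does for encap_port, makes the first candidate PORT_MIN itself.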
> > > > > > > +			if (vtep) {
> > > > > > > +				vtep = NULL;
> > > > > > > +				continue;
> > > > > > > +			}
> > > > > > > +			vtep = flow_tcf_create_iface(tcf, ifouter,
> > > > > > > +						     encap_port, error);
> > > > > > > +			if (vtep) {
> > > > > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > > > +				break;
> > > > > > > +			}
> > > > > > > +			if (rte_errno != EEXIST)
> > > > > > > +				break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +	if (!vtep)
> > > > > > > +		return 0;
> > > > > > > +	vtep->refcnt++;
> > > > > > > +	assert(vtep->ifindex);
> > > > > > > +	return vtep->ifindex;
> > > > > >
> > > > > > Please refactor this func according to what I suggested for
> > > > > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates target interface index for tunneling of any type.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Network interface index to attach VXLAN encap device to.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errno is set.
> > > > > >
> > > > > >  *   Interface index on success, a negative errno value otherwise and
> > > > > >  *   rte_errno is set.
> > > > > >
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			    unsigned int ifouter,
> > > > > > > +			    struct mlx5_flow *dev_flow,
> > > > > > > +			    struct rte_flow_error *error) {
> > > > > > > +	unsigned int ret;
> > > > > > > +
> > > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > > +		ret = flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > > > > +						 dev_flow, error);
> > > > > > > +		break;
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > > +		ret = flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > > > > > +		break;
> > > > > > > +	default:
> > > > > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > > > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > > > > > +				"unsupported tunnel type");
> > > > > > > +		ret = 0;
> > > > > > > +		break;
> > > > > > > +	}
> > > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Deletes tunneling interface by UDP port.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > > > > > + * @param[in] ifindex
> > > > > > > + *   Network interface index of VXLAN device.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + */
> > > > > > > +static void
> > > > > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > > > > > +			    unsigned int ifindex,
> > > > > > > +			    struct mlx5_flow *dev_flow) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +
> > > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > > +	vtep = NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->ifindex == ifindex) {
> > > > > > > +			vtep = vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > It is weird. You just can have vtep pointer in the
> > > > > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > > > > vtep->ifindex like the assertion below. Then, this lookup can be
> > > > > > skipped.
> > > > >
> > > > > OK. Good optimization.
> > > > >
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > > > > > +		goto exit;
> > > > > > > +	}
> > > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > > +		break;
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > > +/*
> > > > > > > + * TODO: Remove the encap ancillary rules first.
> > > > > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> > > > > > > +*/
> > > > > >
> > > > > > Is it a personal note? Please remove.
> > > > > OK.
> > > > >
> > > > > >
> > > > > > > +		break;
> > > > > > > +	default:
> > > > > > > +		assert(false);
> > > > > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > > > > +		break;
> > > > > > > +	}
> > > > > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun == vtep->ifindex);
> > > > > > > +	assert(vtep->refcnt);
> > > > > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > > > > +		LIST_REMOVE(vtep, next);
> > > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > > +	}
> > > > > > > +exit:
> > > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   * Apply flow to E-Switch by sending Netlink message.
> > > > > > >   *
> > > > > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > > > > >  	       struct rte_flow_error *error)  {
> > > > > > >  	struct priv *priv = dev->data->dev_private;
> > > > > > > -	struct mlx5_flow_tcf_context *nl = priv->tcf_context;
> > > > > > > +	struct mlx5_flow_tcf_context *tcf = priv->tcf_context;
> > > > > > >  	struct mlx5_flow *dev_flow;
> > > > > > >  	struct nlmsghdr *nlh;
> > > > > > > +	int ret;
> > > > > > >
> > > > > > >  	dev_flow = LIST_FIRST(&flow->dev_flows);
> > > > > > >  	/* E-Switch flow can't be expanded. */
> > > > > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > > > > +	if (dev_flow->tcf.applied)
> > > > > > > +		return 0;
> > > > > > >  	nlh = dev_flow->tcf.nlh;
> > > > > > >  	nlh->nlmsg_type = RTM_NEWTFILTER;
> > > > > > >  	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL;
> > > > > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > > +		/*
> > > > > > > +		 * Replace the interface index, target for
> > > > > > > +		 * encapsulation, source for decapsulation.
> > > > > > > +		 */
> > > > > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		/* Create actual VTEP device when rule is being applied. */
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > > > > +			= flow_tcf_tunnel_vtep_create(tcf,
> > > > > > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > > > > > +					dev_flow, error);
> > > > > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > > > > +			return -rte_errno;
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > > > > +			= *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > > +			= dev_flow->tcf.tunnel->ifindex_tun;
> > > > > > > +	}
> > > > > > > +	ret = flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > > +			= dev_flow->tcf.tunnel->ifindex_org;
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_org = 0;
> > > > > >
> > > > > > > ifindex_org looks like temporary storage in this code. And this kind
> > > > > > > of hassle (replace/restore) is there because you took the ifindex
> > > > > > > from the netlink message. Why don't you have just
> > > > > >
> > > > > > struct mlx5_flow_tcf_tunnel_hdr {
> > > > > > 	uint32_t type; /**< Tunnel action type. */
> > > > > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > > > > > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message. */
> > > > > > > };
> > > > > >
> > > > > > and don't change ifindex?
> > > > >
> > > > > > I propose to use a local variable for ifindex_org and not
> > > > > > keep it in the structure. *ifindex_ptr will be kept.
> > > >
> > > > Well, you still have to restore the ifindex whenever sending the nl
> > > > msg. Most of all, ifindex_ptr in nl msg isn't the right place to store
> > > > the ifindex.
> > > It is stored there for rules w/o tunnels. It is its "native" place, I'd
> > > prefer not to create some new location to store the original index and
> > > save some space.
> > > We have to swap indices only if the rule has requested tunneling.  We
> > > cannot
> > 
> > No no. At this point, flow is already created to be tunneled one. What do you
> > mean by 'rules w/o tunnels' or 'only if rule has requested the tunneling'??
> 
> I mean the code handles all kinds of rules - with tunnels and w/o tunnels.
> The same code prepares the NL message for both rule types.
> 
> > It has already been created as a vxlan tunnel rule. It won't be changed. The
> > nlmsg is supposed to have vtep ifindex but translation didn't know it and
> > stored the outer iface temporarily to get it replaced by vtep ifindex. It can
> > never be a 'native'/'original' place to store it.
> 
> I mean, if the rule does not request the tunneling action, it just keeps the
> ifindex unchanged within the Netlink message. If there is tunneling, we replace
> this index with some value depending on this ifindex.  We cannot replace the
> ifindex permanently once at rule translation, because VTEPs are created
> dynamically and the VTEP ifindex can be different at rule apply time.
> So, we need to keep the original ifindex and create the VTEP depending on it
> every time the rule is being applied.

Rules w/o tunnels don't use the struct (mlx5_flow_tcf_tunnel_hdr) anyway. I
don't understand why you care.

> > In which case the nl msg can be sent
> > with the 'original' ifindex? Any specific example? No.
> > 
> > > set the tunnel index permanently, because the rule can be
> > > applied/removed/reapplied and another new VXLAN device with a new index
> > > can be created.
> > 
> > Every time it is applied, it will get the vtep and overwrite vtep ifindex in the nl
> > msg.
> 
> Yes. We should overwrite this field anyway, every time the rule is applied,
> because the vtep ifindex can be different. And we need to keep the original
> ifindex (for example to dynamically create the VTEP attached to it). Do you
> propose to keep the ifindex_org field?  Now, as I can see, we have to keep the
> ifindex_ptr field only.

Like I suggested above, yes. What I don't like is replacing/restoring the
indexes every time. I don't understand why you want to shuffle a variable
around each time. nlmsg is temporary anyway, there's no 'native' place. But
I'll leave it up to you. This isn't a super critical issue.
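
For illustration only, the alternative discussed above (keep the original
ifindex in the tunnel header and only ever write the VTEP ifindex into the
Netlink message, with no restore step) might look roughly like the sketch
below. The names tunnel_hdr, vtep_create_fn, tunnel_apply_prepare and
fake_vtep_create are hypothetical and are not taken from the patch; the
callback merely stands in for whatever routine creates the VTEP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified tunnel header (not the structure from the patch). */
struct tunnel_hdr {
	unsigned int ifindex;            /* Original dst/src interface. */
	unsigned int *nlmsg_ifindex_ptr; /* ifindex location in the Netlink message. */
};

/* Hypothetical callback: returns the VTEP ifindex, or 0 on failure. */
typedef unsigned int (*vtep_create_fn)(unsigned int ifouter);

/*
 * Prepare the Netlink message before sending: the VTEP is (re)created from
 * the original ifindex and its index is written into the message. The
 * original ifindex is never overwritten, so nothing has to be restored.
 */
static int
tunnel_apply_prepare(struct tunnel_hdr *tun, vtep_create_fn vtep_create)
{
	unsigned int ifindex_tun;

	assert(tun->ifindex);
	assert(tun->nlmsg_ifindex_ptr);
	ifindex_tun = vtep_create(tun->ifindex);
	if (!ifindex_tun)
		return -1;
	*tun->nlmsg_ifindex_ptr = ifindex_tun;
	return 0;
}

/* Stand-in for VTEP creation, used only to exercise the sketch. */
static unsigned int
fake_vtep_create(unsigned int ifouter)
{
	return ifouter + 100;
}
```

The trade-off is one extra unsigned int per tunnel rule in exchange for
dropping the replace/restore pair around every send.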


Thanks,
Yongseok

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [dpdk-dev] [PATCH v3 00/13] net/mlx5: e-switch VXLAN encap/decap hardware offload
  2018-10-15 14:13 ` [dpdk-dev] [PATCH v2 0/7] net/mlx5: e-switch VXLAN encap/decap hardware offload Viacheslav Ovsiienko
                     ` (6 preceding siblings ...)
  2018-10-15 14:13   ` [dpdk-dev] [PATCH v2 7/7] net/mlx5: e-switch VXLAN rule cleanup routines Viacheslav Ovsiienko
@ 2018-11-01 12:19   ` Slava Ovsiienko
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 01/13] net/mlx5: prepare makefile for adding e-switch VXLAN Slava Ovsiienko
                       ` (14 more replies)
  7 siblings, 15 replies; 110+ messages in thread
From: Slava Ovsiienko @ 2018-11-01 12:19 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: dev, Yongseok Koh, Slava Ovsiienko

This patchset adds the VXLAN encapsulation/decapsulation hardware
offload feature for E-Switch.
 
A typical use case of tunneling infrastructure is port representors 
in switchdev mode, with VXLAN traffic encapsulation performed on
traffic coming *from* a representor and decapsulation on traffic
going *to* that representor, in order to transparently assign
a given VXLAN to VF traffic.

Since these actions are supported at the E-Switch level, the "transfer" 
attribute must be set on such flow rules. They must also be combined
with a port redirection action to make sense.

Since only ingress is supported, encapsulation flow rules are normally
applied on a physical port and emit traffic to a port representor. 
The opposite order is used for decapsulation.

Like other mlx5 E-Switch flow rule actions, these ones are implemented
through Linux's TC flower API. Since the Linux interface for VXLAN
encap/decap involves virtual network devices (i.e. ip link add type
vxlan [...]), the PMD dynamically spawns them on an as-needed basis
through Netlink calls. These implicitly created VXLAN devices are
called VTEPs (Virtual Tunnel End Points).

VXLAN interfaces are dynamically created for each local port of
outer networks and then used as targets for TC "flower" filters
in order to perform encapsulation. For decapsulation the VXLAN
devices are created for each unique UDP port. These VXLAN interfaces
are system-wide: only one device with a given UDP port can exist
in the system (an attempt to create another device with the
same local UDP port returns EEXIST), so the PMD should support a
device database shared between PMD instances.
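
As a rough illustration of such a device database, the sketch below keeps
one reference-counted entry per UDP port in a process-local list. It is
only a sketch under stated assumptions: vtep_entry, vtep_get and vtep_put
are hypothetical names, the list is keyed by UDP port alone, and no
Netlink device creation or cross-process sharing is shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/queue.h>

/* Hypothetical database entry: one VXLAN device per UDP port. */
struct vtep_entry {
	LIST_ENTRY(vtep_entry) next;
	uint32_t refcnt; /* Number of rules using this device. */
	uint16_t port;   /* UDP port the device is bound to. */
};

LIST_HEAD(vtep_list, vtep_entry);

/* Return the entry for the port, creating it on first use. */
static struct vtep_entry *
vtep_get(struct vtep_list *list, uint16_t port)
{
	struct vtep_entry *vtep;

	LIST_FOREACH(vtep, list, next) {
		if (vtep->port == port) {
			vtep->refcnt++;
			return vtep;
		}
	}
	vtep = calloc(1, sizeof(*vtep));
	if (!vtep)
		return NULL;
	vtep->port = port;
	vtep->refcnt = 1;
	LIST_INSERT_HEAD(list, vtep, next);
	return vtep;
}

/* Drop one reference; the entry is freed when the last user is gone. */
static void
vtep_put(struct vtep_entry *vtep)
{
	assert(vtep->refcnt > 0);
	if (--vtep->refcnt)
		return;
	LIST_REMOVE(vtep, next);
	free(vtep);
}
```

In a real driver the device itself would be created when the entry is
first allocated and deleted when the last reference is dropped.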

Rule sample considerations:

$PF 		- physical device, outer network
$VF 		- representor for VF, outer/inner network
$VXLAN		- VTEP netdev name
$PF_OUTER_IP 	- $PF IP (v4 or v6) within outer network
$REMOTE_IP 	- remote peer IP (v4 or v6) within outer network
$LOCAL_PORT	- local UDP port
$REMOTE_PORT	- remote UDP port

VXLAN VTEP creation with iproute2 (PMD does the same via Netlink):

- for encapsulation:

  ip link add $VXLAN type vxlan dstport $LOCAL_PORT external dev $PF
  ip link set dev $VXLAN up
  tc qdisc del dev $VXLAN ingress
  tc qdisc add dev $VXLAN ingress

$LOCAL_PORT for egress encapsulated traffic (note, this is not the
source UDP port in the VXLAN header, it is just the UDP port assigned
to the VTEP and has no practical use) is selected automatically from
the available UDP ports in the range 30000-60000.

- for decapsulation:

  ip link add $VXLAN type vxlan dstport $LOCAL_PORT external
  ip link set dev $VXLAN up
  tc qdisc del dev $VXLAN ingress
  tc qdisc add dev $VXLAN ingress

$LOCAL_PORT is UDP port receiving the VXLAN traffic from outer networks.

All ingress UDP traffic with the given UDP destination port from ALL existing
netdevs is routed by the kernel to the $VXLAN net device. While applying the
rule the kernel checks the IP parameters within the rule, determines the
appropriate underlying PF and tries to set up the rule hardware offload.

VXLAN encapsulation 

VXLAN encap rules are applied to the VF ingress traffic and have the
VTEP as the actual redirection destination instead of the outer PF.
The encapsulation rule should provide:
- redirection action VF->PF
- VF port ID
- some inner network parameters (MACs)
- the tunnel outer source IP (v4/v6) (IS A MUST)
- the tunnel outer destination IP (v4/v6) (IS A MUST)
- VNI - Virtual Network Identifier (IS A MUST)
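
The mandatory items in the list above can be expressed as a simple check
on a bit mask of collected fields. The sketch below reuses the
FLOW_TCF_ENCAP_* flag values introduced in patch 03 of this series; the
helper encap_has_mandatory_fields and the FLOW_TCF_ENCAP_ALL mask are
hypothetical, and the real validation routine may well check more than
this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in patch 03 of this series. */
#define FLOW_TCF_ENCAP_ETH_SRC (1u << 0)
#define FLOW_TCF_ENCAP_ETH_DST (1u << 1)
#define FLOW_TCF_ENCAP_IPV4_SRC (1u << 2)
#define FLOW_TCF_ENCAP_IPV4_DST (1u << 3)
#define FLOW_TCF_ENCAP_IPV6_SRC (1u << 4)
#define FLOW_TCF_ENCAP_IPV6_DST (1u << 5)
#define FLOW_TCF_ENCAP_UDP_SRC (1u << 6)
#define FLOW_TCF_ENCAP_UDP_DST (1u << 7)
#define FLOW_TCF_ENCAP_VXLAN_VNI (1u << 8)

/* Hypothetical convenience mask covering every known flag. */
#define FLOW_TCF_ENCAP_ALL ((1u << 9) - 1)

/*
 * Hypothetical helper: true when the mask carries outer source and
 * destination IP (both IPv4 or both IPv6) plus the VNI.
 */
static bool
encap_has_mandatory_fields(uint32_t mask)
{
	const uint32_t ipv4 = FLOW_TCF_ENCAP_IPV4_SRC | FLOW_TCF_ENCAP_IPV4_DST;
	const uint32_t ipv6 = FLOW_TCF_ENCAP_IPV6_SRC | FLOW_TCF_ENCAP_IPV6_DST;
	bool has_ipv4 = (mask & ipv4) == ipv4;
	bool has_ipv6 = (mask & ipv6) == ipv6;

	/* Unknown bits would indicate a caller bug. */
	assert((mask & ~FLOW_TCF_ENCAP_ALL) == 0);
	if (!has_ipv4 && !has_ipv6)
		return false;
	return (mask & FLOW_TCF_ENCAP_VXLAN_VNI) != 0;
}
```

Whether a mask carrying both IPv4 and IPv6 addresses should be rejected
is left open here.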

VXLAN encapsulation rule sample for tc utility:

  tc filter add dev $VF protocol all parent ffff: flower skip_sw \
	action tunnel_key set dst_port $REMOTE_PORT \
	src_ip $PF_OUTER_IP dst_ip $REMOTE_IP id $VNI \
	action mirred egress redirect dev $VXLAN

VXLAN encapsulation rule sample for testpmd:

- Setting up outer properties of VXLAN tunnel:

  set vxlan ip-version ipv4 vni $VNI \
	udp-src $IGNORED udp-dst $REMOTE_PORT \
	ip-src $PF_OUTER_IP ip-dst $REMOTE_IP \
 	eth-src $IGNORED eth-dst $REMOTE_MAC

- Creating a flow rule on port ID 4 performing VXLAN encapsulation
  with the abovementioned properties and directing the resulting
  traffic to port ID 0:

  flow create 4 ingress transfer pattern eth src is $INNER_MAC / end
	actions vxlan_encap / port_id id 0 / end

No direct way was found to provide the kernel with all required
encapsulation header parameters. The encapsulation VTEP is created
attached to the outer interface and assumed to be the default path for
egress encapsulated traffic. The outer tunnel IP addresses are
assigned to the interface using Netlink, and the implicit route is
created like this:

  ip addr add <src_ip> peer <dst_ip> dev <outer> scope link

The peer address option provides implicit route, and scope link
attribute reduces the risk of conflicts. At initialization time all
local scope link addresses are flushed from the outer network device.

The destination MAC address is provided via a permanent neigh rule:

 ip neigh add dev <outer> lladdr <dst_mac> to <dst_ip> nud permanent

At initialization time all neigh rules of permanent type are flushed
from the outer network device. 

VXLAN decapsulation 

VXLAN decap rules are applied to the ingress traffic of VTEP ($VXLAN)
device instead of PF. The decapsulation rule should provide:
- redirection action PF->VF
- VF port ID as redirection destination
- $VXLAN device as ingress traffic source
- the tunnel outer source IP (v4/v6), (optional)
- the tunnel outer destination IP (v4/v6), (IS A MUST)
- the tunnel local UDP port (IS A MUST, PMD looks for appropriate VTEP
  with given local UDP port)
- VNI - Virtual Network Identifier (IS A MUST)

VXLAN decap rule sample for tc utility: 

  tc filter add dev $VXLAN protocol all parent ffff: flower skip_sw \
	enc_src_ip $REMOTE_IP enc_dst_ip $PF_OUTER_IP enc_key_id $VNI \
	enc_dst_port $LOCAL_PORT \
	action tunnel_key unset action mirred egress redirect dev $VF
						
VXLAN decap rule sample for testpmd: 

- Creating a flow on port ID 0 performing VXLAN decapsulation and directing
  the result to port ID 4 with checking inner properties:

  flow create 0 ingress transfer pattern / 
  	ipv4 src is $REMOTE_IP dst is $PF_LOCAL_IP /
	udp src is 9999 dst is $LOCAL_PORT / vxlan vni is $VNI / 
	eth src is 00:11:22:33:44:55 dst is $INNER_MAC / end
        actions vxlan_decap / port_id id 4 / end

The VXLAN encap/decap rule constraints (implied by current kernel support):

- VXLAN decapsulation provided for PF->VF direction only
- VXLAN encapsulation provided for VF->PF direction only
- the current implementation supports only a non-shared database of VTEPs
  (several instances of DPDK apps cannot use the same UDP port
   simultaneously)

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>

---
v3:
  * patchset is re-split into more dedicated parts
  * decapsulation rule takes MAC from inner eth item
  * appropriate RTE_BEx are replaced with runtime rte_cpu_xxx
  * E-Switch Flow counter deletion is fixed
  * VTEP management routines are refactored
  * found typos are corrected

v2:
  * removed non-VXLAN related parts
  * multipart Netlink messages support
  * local IP and peer IP rules management
  * neigh IP address to MAC address rules
  * management rules cleanup at outer device initialization
  * attached devices cleanup at outer device initialization

v1:
 * http://patches.dpdk.org/patch/45800/
 * Refactored code of initial experimental proposal

v0:
 * http://patches.dpdk.org/cover/44080/
 * Initial proposal by Adrien Mazarguil <adrien.mazarguil@6wind.com>

Viacheslav Ovsiienko (13):
  net/mlx5: prepare makefile for adding e-switch VXLAN
  net/mlx5: prepare meson.build for adding e-switch VXLAN
  net/mlx5: add necessary definitions for e-switch VXLAN
  net/mlx5: add necessary structures for e-switch VXLAN
  net/mlx5: swap items/actions validations for e-switch rules
  net/mlx5: add e-switch VXLAN support to validation routine
  net/mlx5: add VXLAN support to flow prepare routine
  net/mlx5: add VXLAN support to flow translate routine
  net/mlx5: e-switch VXLAN netlink routines update
  net/mlx5: fix e-switch Flow counter deletion
  net/mlx5: add e-switch VXLAN tunnel devices management
  net/mlx5: add e-switch VXLAN encapsulation rules
  net/mlx5: add e-switch VXLAN rule cleanup routines

 drivers/net/mlx5/Makefile        |   85 +
 drivers/net/mlx5/meson.build     |   34 +
 drivers/net/mlx5/mlx5_flow.h     |   11 +
 drivers/net/mlx5/mlx5_flow_tcf.c | 5118 +++++++++++++++++++++++++++++---------
 4 files changed, 4107 insertions(+), 1141 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [dpdk-dev] [PATCH v3 01/13] net/mlx5: prepare makefile for adding e-switch VXLAN
  2018-11-01 12:19   ` [dpdk-dev] [PATCH v3 00/13] net/mlx5: e-switch VXLAN encap/decap hardware offload Slava Ovsiienko
@ 2018-11-01 12:19     ` Slava Ovsiienko
  2018-11-01 20:33       ` Yongseok Koh
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 02/13] net/mlx5: prepare meson.build " Slava Ovsiienko
                       ` (13 subsequent siblings)
  14 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-11-01 12:19 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: dev, Yongseok Koh, Slava Ovsiienko

This patch updates makefile before adding E-Switch VXLAN
encapsulation/decapsulation hardware offload support.
E-Switch rules are controlled via tc Netlink commands,
so we need to include tc related headers, and check for
some tunnel specific key definitions.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/Makefile | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)

diff --git a/drivers/net/mlx5/Makefile b/drivers/net/mlx5/Makefile
index ba433d3..f4fa075 100644
--- a/drivers/net/mlx5/Makefile
+++ b/drivers/net/mlx5/Makefile
@@ -218,6 +218,11 @@ mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
 		enum IFLA_PHYS_PORT_NAME \
 		$(AUTOCONF_OUTPUT)
 	$Q sh -- '$<' '$@' \
+		HAVE_IFLA_VXLAN_COLLECT_METADATA \
+		linux/if_link.h \
+		enum IFLA_VXLAN_COLLECT_METADATA \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
 		HAVE_TCA_CHAIN \
 		linux/rtnetlink.h \
 		enum TCA_CHAIN \
@@ -378,6 +383,86 @@ mlx5_autoconf.h.new: $(RTE_SDK)/buildtools/auto-config-h.sh
 		enum TCA_VLAN_PUSH_VLAN_PRIORITY \
 		$(AUTOCONF_OUTPUT)
 	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_KEY_ID \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_KEY_ID \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_SRC \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_DST \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV4_DST_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_SRC \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_DST \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_IPV6_DST_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
+		linux/pkt_cls.h \
+		enum TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TC_ACT_TUNNEL_KEY \
+		linux/tc_act/tc_tunnel_key.h \
+		define TCA_ACT_TUNNEL_KEY \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT \
+		linux/tc_act/tc_tunnel_key.h \
+		enum TCA_TUNNEL_KEY_ENC_DST_PORT \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
+		HAVE_TCA_TUNNEL_KEY_NO_CSUM \
+		linux/tc_act/tc_tunnel_key.h \
+		enum TCA_TUNNEL_KEY_NO_CSUM \
+		$(AUTOCONF_OUTPUT)
+	$Q sh -- '$<' '$@' \
 		HAVE_TC_ACT_PEDIT \
 		linux/tc_act/tc_pedit.h \
 		enum TCA_PEDIT_KEY_EX_HDR_TYPE_UDP \
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [dpdk-dev] [PATCH v3 02/13] net/mlx5: prepare meson.build for adding e-switch VXLAN
  2018-11-01 12:19   ` [dpdk-dev] [PATCH v3 00/13] net/mlx5: e-switch VXLAN encap/decap hardware offload Slava Ovsiienko
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 01/13] net/mlx5: prepare makefile for adding e-switch VXLAN Slava Ovsiienko
@ 2018-11-01 12:19     ` Slava Ovsiienko
  2018-11-01 20:33       ` Yongseok Koh
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 03/13] net/mlx5: add necessary definitions for " Slava Ovsiienko
                       ` (12 subsequent siblings)
  14 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-11-01 12:19 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: dev, Yongseok Koh, Slava Ovsiienko

This patch updates meson.build before adding E-Switch VXLAN
encapsulation/decapsulation hardware offload support.
E-Switch rules are controlled via tc Netlink commands,
so we need to include tc related headers, and check for
some tunnel specific key definitions.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/meson.build | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index f8e0c1b..ed54dc2 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -130,6 +130,8 @@ if build
 		'IFLA_PHYS_SWITCH_ID' ],
 		[ 'HAVE_IFLA_PHYS_PORT_NAME', 'linux/if_link.h',
 		'IFLA_PHYS_PORT_NAME' ],
+		[ 'HAVE_IFLA_VXLAN_COLLECT_METADATA', 'linux/if_link.h',
+		'IFLA_VXLAN_COLLECT_METADATA' ],
 		[ 'HAVE_TCA_CHAIN', 'linux/rtnetlink.h',
 		'TCA_CHAIN' ],
 		[ 'HAVE_TCA_FLOWER_ACT', 'linux/pkt_cls.h',
@@ -194,6 +196,38 @@ if build
 		'TC_ACT_GOTO_CHAIN' ],
 		[ 'HAVE_TC_ACT_VLAN', 'linux/tc_act/tc_vlan.h',
 		'TCA_VLAN_PUSH_VLAN_PRIORITY' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_KEY_ID', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_KEY_ID' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_SRC' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_DST' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV4_DST_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_SRC' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_DST' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_IPV6_DST_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT' ],
+		[ 'HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK', 'linux/pkt_cls.h',
+		'TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK' ],
+		[ 'HAVE_TC_ACT_TUNNEL_KEY', 'linux/tc_act/tc_tunnel_key.h',
+		'TCA_ACT_TUNNEL_KEY' ],
+		[ 'HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT', 'linux/tc_act/tc_tunnel_key.h',
+		'TCA_TUNNEL_KEY_ENC_DST_PORT' ],
+		[ 'HAVE_TCA_TUNNEL_KEY_NO_CSUM', 'linux/tc_act/tc_tunnel_key.h',
+		'TCA_TUNNEL_KEY_NO_CSUM' ],
 		[ 'HAVE_TC_ACT_PEDIT', 'linux/tc_act/tc_pedit.h',
 		'TCA_PEDIT_KEY_EX_HDR_TYPE_UDP' ],
 		[ 'HAVE_RDMA_NL_NLDEV', 'rdma/rdma_netlink.h',
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [dpdk-dev] [PATCH v3 03/13] net/mlx5: add necessary definitions for e-switch VXLAN
  2018-11-01 12:19   ` [dpdk-dev] [PATCH v3 00/13] net/mlx5: e-switch VXLAN encap/decap hardware offload Slava Ovsiienko
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 01/13] net/mlx5: prepare makefile for adding e-switch VXLAN Slava Ovsiienko
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 02/13] net/mlx5: prepare meson.build " Slava Ovsiienko
@ 2018-11-01 12:19     ` Slava Ovsiienko
  2018-11-01 20:35       ` Yongseok Koh
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 04/13] net/mlx5: add necessary structures " Slava Ovsiienko
                       ` (11 subsequent siblings)
  14 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-11-01 12:19 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: dev, Yongseok Koh, Slava Ovsiienko

This patch contains tc flower related and some other definitions
needed to implement VXLAN encapsulation/decapsulation hardware
offload support for E-Switch.

mlx5 driver dynamically creates and manages the VXLAN virtual
tunnel endpoint devices, the following definitions control
the parameters of these network devices:

- MLX5_VXLAN_PORT_MIN - minimal allowed UDP port for VXLAN device
- MLX5_VXLAN_PORT_MAX - maximal allowed UDP port for VXLAN device
- MLX5_VXLAN_DEVICE_PFX - name prefix of driver created VXLAN device

The mlx5 driver creates the VXLAN devices with a UDP port within the
specified range; device names consist of the specified prefix
followed by the decimal digits of the UDP port.
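
A minimal sketch of this naming scheme is shown below, using the prefix
and range values from this patch. The helpers vtep_name_from_port and
vtep_port_in_range are hypothetical and not part of the patch; whether
the range check applies to every VXLAN device or only to automatically
selected ports is an assumption left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Values as defined in this patch. */
#define MLX5_VXLAN_PORT_MIN 30000
#define MLX5_VXLAN_PORT_MAX 60000
#define MLX5_VXLAN_DEVICE_PFX "vmlx_"

/* Hypothetical helper: is the port inside the driver-defined range? */
static bool
vtep_port_in_range(uint16_t port)
{
	return port >= MLX5_VXLAN_PORT_MIN && port <= MLX5_VXLAN_PORT_MAX;
}

/*
 * Hypothetical helper: build "<prefix><decimal port>" in the buffer.
 * Returns the name length, or -1 if the buffer is too small.
 */
static int
vtep_name_from_port(uint16_t port, char *name, size_t size)
{
	int ret;

	assert(name != NULL && size > 0);
	ret = snprintf(name, size, MLX5_VXLAN_DEVICE_PFX "%u",
		       (unsigned int)port);
	if (ret < 0 || (size_t)ret >= size)
		return -1;
	return ret;
}
```

The longest possible name with this prefix is 10 characters (for example
"vmlx_60000"), which fits within the 16-byte Linux interface name limit.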

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow.h     |  2 +
 drivers/net/mlx5/mlx5_flow_tcf.c | 97 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index c24d26e..392c525 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -92,6 +92,8 @@
 #define MLX5_FLOW_ACTION_DEC_TTL (1u << 19)
 #define MLX5_FLOW_ACTION_SET_MAC_SRC (1u << 20)
 #define MLX5_FLOW_ACTION_SET_MAC_DST (1u << 21)
+#define MLX5_FLOW_ACTION_VXLAN_ENCAP (1u << 22)
+#define MLX5_FLOW_ACTION_VXLAN_DECAP (1u << 23)
 
 #define MLX5_FLOW_FATE_ACTIONS \
 	(MLX5_FLOW_ACTION_DROP | MLX5_FLOW_ACTION_QUEUE | MLX5_FLOW_ACTION_RSS)
diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 719fb10..4d54112 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -113,6 +113,39 @@ struct tc_pedit_sel {
 
 #endif /* HAVE_TC_ACT_VLAN */
 
+#ifdef HAVE_TC_ACT_TUNNEL_KEY
+
+#include <linux/tc_act/tc_tunnel_key.h>
+
+#ifndef HAVE_TCA_TUNNEL_KEY_ENC_DST_PORT
+#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
+#endif
+
+#ifndef HAVE_TCA_TUNNEL_KEY_NO_CSUM
+#define TCA_TUNNEL_KEY_NO_CSUM 10
+#endif
+
+#else /* HAVE_TC_ACT_TUNNEL_KEY */
+
+#define TCA_ACT_TUNNEL_KEY 17
+#define TCA_TUNNEL_KEY_ACT_SET 1
+#define TCA_TUNNEL_KEY_ACT_RELEASE 2
+#define TCA_TUNNEL_KEY_PARMS 2
+#define TCA_TUNNEL_KEY_ENC_IPV4_SRC 3
+#define TCA_TUNNEL_KEY_ENC_IPV4_DST 4
+#define TCA_TUNNEL_KEY_ENC_IPV6_SRC 5
+#define TCA_TUNNEL_KEY_ENC_IPV6_DST 6
+#define TCA_TUNNEL_KEY_ENC_KEY_ID 7
+#define TCA_TUNNEL_KEY_ENC_DST_PORT 9
+#define TCA_TUNNEL_KEY_NO_CSUM 10
+
+struct tc_tunnel_key {
+	tc_gen;
+	int t_action;
+};
+
+#endif /* HAVE_TC_ACT_TUNNEL_KEY */
+
 /* Normally found in linux/netlink.h. */
 #ifndef NETLINK_CAP_ACK
 #define NETLINK_CAP_ACK 10
@@ -211,6 +244,45 @@ struct tc_pedit_sel {
 #ifndef HAVE_TCA_FLOWER_KEY_VLAN_ETH_TYPE
 #define TCA_FLOWER_KEY_VLAN_ETH_TYPE 25
 #endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_KEY_ID
+#define TCA_FLOWER_KEY_ENC_KEY_ID 26
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC
+#define TCA_FLOWER_KEY_ENC_IPV4_SRC 27
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK
+#define TCA_FLOWER_KEY_ENC_IPV4_SRC_MASK 28
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST
+#define TCA_FLOWER_KEY_ENC_IPV4_DST 29
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV4_DST_MASK
+#define TCA_FLOWER_KEY_ENC_IPV4_DST_MASK 30
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC
+#define TCA_FLOWER_KEY_ENC_IPV6_SRC 31
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK
+#define TCA_FLOWER_KEY_ENC_IPV6_SRC_MASK 32
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST
+#define TCA_FLOWER_KEY_ENC_IPV6_DST 33
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_IPV6_DST_MASK
+#define TCA_FLOWER_KEY_ENC_IPV6_DST_MASK 34
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT
+#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT 43
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK
+#define TCA_FLOWER_KEY_ENC_UDP_SRC_PORT_MASK 44
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT
+#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT 45
+#endif
+#ifndef HAVE_TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK
+#define TCA_FLOWER_KEY_ENC_UDP_DST_PORT_MASK 46
+#endif
 #ifndef HAVE_TCA_FLOWER_KEY_TCP_FLAGS
 #define TCA_FLOWER_KEY_TCP_FLAGS 71
 #endif
@@ -241,6 +313,28 @@ struct tc_pedit_sel {
 #define TCA_ACT_MAX_PRIO 32
 #endif
 
+/** UDP port range of VXLAN devices created by driver. */
+#define MLX5_VXLAN_PORT_MIN 30000
+#define MLX5_VXLAN_PORT_MAX 60000
+#define MLX5_VXLAN_DEVICE_PFX "vmlx_"
+
+/** Tunnel action type, used for @p type in header structure. */
+enum flow_tcf_tunact_type {
+	FLOW_TCF_TUNACT_VXLAN_DECAP,
+	FLOW_TCF_TUNACT_VXLAN_ENCAP,
+};
+
+/** Flags used for @p mask in tunnel action encap descriptors. */
+#define FLOW_TCF_ENCAP_ETH_SRC (1u << 0)
+#define FLOW_TCF_ENCAP_ETH_DST (1u << 1)
+#define FLOW_TCF_ENCAP_IPV4_SRC (1u << 2)
+#define FLOW_TCF_ENCAP_IPV4_DST (1u << 3)
+#define FLOW_TCF_ENCAP_IPV6_SRC (1u << 4)
+#define FLOW_TCF_ENCAP_IPV6_DST (1u << 5)
+#define FLOW_TCF_ENCAP_UDP_SRC (1u << 6)
+#define FLOW_TCF_ENCAP_UDP_DST (1u << 7)
+#define FLOW_TCF_ENCAP_VXLAN_VNI (1u << 8)
+
 /**
  * Structure for holding netlink context.
  * Note the size of the message buffer which is MNL_SOCKET_BUFFER_SIZE.
@@ -347,6 +441,9 @@ struct flow_tcf_ptoi {
 	(MLX5_FLOW_ACTION_OF_POP_VLAN | MLX5_FLOW_ACTION_OF_PUSH_VLAN | \
 	 MLX5_FLOW_ACTION_OF_SET_VLAN_VID | MLX5_FLOW_ACTION_OF_SET_VLAN_PCP)
 
+#define MLX5_TCF_VXLAN_ACTIONS \
+	(MLX5_FLOW_ACTION_VXLAN_ENCAP | MLX5_FLOW_ACTION_VXLAN_DECAP)
+
 #define MLX5_TCF_PEDIT_ACTIONS \
 	(MLX5_FLOW_ACTION_SET_IPV4_SRC | MLX5_FLOW_ACTION_SET_IPV4_DST | \
 	 MLX5_FLOW_ACTION_SET_IPV6_SRC | MLX5_FLOW_ACTION_SET_IPV6_DST | \
-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 110+ messages in thread

* [dpdk-dev] [PATCH v3 04/13] net/mlx5: add necessary structures for e-switch VXLAN
  2018-11-01 12:19   ` [dpdk-dev] [PATCH v3 00/13] net/mlx5: e-switch VXLAN encap/decap hardware offload Slava Ovsiienko
                       ` (2 preceding siblings ...)
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 03/13] net/mlx5: add necessary definitions for " Slava Ovsiienko
@ 2018-11-01 12:19     ` Slava Ovsiienko
  2018-11-01 20:36       ` Yongseok Koh
  2018-11-01 12:19     ` [dpdk-dev] [PATCH v3 05/13] net/mlx5: swap items/actions validations for e-switch rules Slava Ovsiienko
                       ` (10 subsequent siblings)
  14 siblings, 1 reply; 110+ messages in thread
From: Slava Ovsiienko @ 2018-11-01 12:19 UTC (permalink / raw)
  To: Shahaf Shuler; +Cc: dev, Yongseok Koh, Slava Ovsiienko

This patch introduces the data structures needed to implement VXLAN
encapsulation/decapsulation hardware offload support for E-Switch.

Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
---
 drivers/net/mlx5/mlx5_flow.h     |  9 ++++
 drivers/net/mlx5/mlx5_flow_tcf.c | 99 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 108 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_flow.h b/drivers/net/mlx5/mlx5_flow.h
index 392c525..3887ee9 100644
--- a/drivers/net/mlx5/mlx5_flow.h
+++ b/drivers/net/mlx5/mlx5_flow.h
@@ -191,6 +191,15 @@ struct mlx5_flow_dv {
 struct mlx5_flow_tcf {
 	struct nlmsghdr *nlh;
 	struct tcmsg *tcm;
+	union { /**< Tunnel encap/decap descriptor. */
+		struct flow_tcf_tunnel_hdr *tunnel;
+		struct flow_tcf_vxlan_decap *vxlan_decap;
+		struct flow_tcf_vxlan_encap *vxlan_encap;
+	};
+	uint32_t applied:1; /**< Whether rule is currently applied. */
+#ifndef NDEBUG
+	uint32_t nlsize; /**< Size of NL message buffer for debug check. */
+#endif
 };
 
 /* Verbs specification header. */
diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c b/drivers/net/mlx5/mlx5_flow_tcf.c
index 4d54112..55c77e3 100644
--- a/drivers/net/mlx5/mlx5_flow_tcf.c
+++ b/drivers/net/mlx5/mlx5_flow_tcf.c
@@ -348,6 +348,100 @@ struct mlx5_flow_tcf_context {
 	uint8_t *buf; /* Message buffer. */
 };
 
+/**
+ * Neigh rule structure. The neigh rule is applied via Netlink to
+ * outer tunnel iface in order to provide destination MAC address
+ * for the VXLAN encapsulation. The neigh rule is implicitly related
+ * to the Flow itself and can be shared by multiple Flows.
+ */
+struct tcf_neigh_rule {
+	LIST_ENTRY(tcf_neigh_rule) next;
+	uint32_t refcnt;
+	struct ether_addr eth;
+	uint16_t mask;
+	union {
+		struct {
+			rte_be32_t dst;
+		} ipv4;
+		struct {
+			uint8_t dst[IPV6_ADDR_LEN];
+		} ipv6;
+	};
+};
+
+/**
+ * Local rule structure. The local rule is applied via Netlink to
+ * outer tunnel iface in order to provide local and peer IP addresses
+ * of the VXLAN tunnel for encapsulation. The local rule is implicitly
+ * related to the Flow itself and can be shared by multiple Flows.
+ */
+struct tcf_local_rule {
+	LIST_ENTRY(tcf_local_rule) next;
+	uint32_t refcnt;
+	uint16_t mask;
+	union {
+		struct {
+			rte_be32_t dst;
+			rte_be32_t src;
+		} ipv4;
+		struct {
+			uint8_t dst[IPV6_ADDR_LEN];
+			uint8_t src[IPV6_ADDR_LEN];
+		} ipv6;
+	};
+};
+
+/** VXLAN virtual netdev. */
+struct tcf_vtep {
+	LIST_ENTRY(tcf_vtep) next;
+	LIST_HEAD(, tcf_neigh_rule) neigh;
+	LIST_HEAD(, tcf_local_rule) local;
+	uint32_t refcnt;
+	unsigned int ifindex; /**< Own interface index. */
+	unsigned int ifouter; /**< Index of device attached to. */
+	uint16_t port;
+	uint8_t created;
+};
+
+/** Tunnel descriptor header, common for all tunnel types. */
+struct flow_tcf_tunnel_hdr {
+	uint32_t type; /**< Tunnel action type. */
+	struct tcf_vtep *vtep; /**< Virtual tunnel endpoint device. */
+	unsigned int ifindex_org; /**< Original dst/src interface */
+	unsigned int *ifin