From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <yskoh@mellanox.com>
Received: from EUR02-VE1-obe.outbound.protection.outlook.com
 (mail-eopbgr20051.outbound.protection.outlook.com [40.107.2.51])
 by dpdk.org (Postfix) with ESMTP id 2C89958F6
 for <dev@dpdk.org>; Fri, 26 Oct 2018 08:25:53 +0200 (CEST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Mellanox.com;
 s=selector1;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=H/MJF0s+EkoZviYUAW69kTOwtu76HTBRRcyrjgMy7Pk=;
 b=QIO29E63yJN5JkdY/2y7WALQYxV+aJQVZBvSr1TqEn5DOeEzIn09G5aPxgleIqUlNnthM5RTseXWfuxBUhOWGVP3uUHgU+xqaJIteWTuaiIM5VMchjkGk/dwKztgjIMv0jbT7GvMYkWIlfmV5JbnsFOfHp30izj1V1UqE5QlMn4=
Received: from DB3PR0502MB3980.eurprd05.prod.outlook.com (52.134.72.27) by
 DB3PR0502MB4073.eurprd05.prod.outlook.com (52.134.68.156) with Microsoft SMTP
 Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.20.1250.30; Fri, 26 Oct 2018 06:25:49 +0000
Received: from DB3PR0502MB3980.eurprd05.prod.outlook.com
 ([fe80::f8a1:fcab:94f0:97cc]) by DB3PR0502MB3980.eurprd05.prod.outlook.com
 ([fe80::f8a1:fcab:94f0:97cc%4]) with mapi id 15.20.1273.025; Fri, 26 Oct 2018
 06:25:49 +0000
From: Yongseok Koh <yskoh@mellanox.com>
To: Slava Ovsiienko <viacheslavo@mellanox.com>
CC: Shahaf Shuler <shahafs@mellanox.com>, "dev@dpdk.org" <dev@dpdk.org>
Thread-Topic: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
Thread-Index: AQHUZJFa0Xzb/HBSQ0ye9aegR3x2F6UutK6AgAHCugCAAKjkgA==
Date: Fri, 26 Oct 2018 06:25:48 +0000
Message-ID: <20181026062307.GD6434@mtidpdk.mti.labs.mlnx>
References: <1538461807-37507-1-git-send-email-viacheslavo@mellanox.com>
 <1539612815-47199-1-git-send-email-viacheslavo@mellanox.com>
 <1539612815-47199-6-git-send-email-viacheslavo@mellanox.com>
 <20181025002759.GA26874@mtidpdk.mti.labs.mlnx>
 <AM4PR05MB326594413216A252B5194833D2F70@AM4PR05MB3265.eurprd05.prod.outlook.com>
In-Reply-To: <AM4PR05MB326594413216A252B5194833D2F70@AM4PR05MB3265.eurprd05.prod.outlook.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-clientproxiedby: BYAPR08CA0003.namprd08.prod.outlook.com
 (2603:10b6:a03:100::16) To DB3PR0502MB3980.eurprd05.prod.outlook.com
 (2603:10a6:8:10::27)
authentication-results: spf=none (sender IP is )
 smtp.mailfrom=yskoh@mellanox.com; 
x-ms-exchange-messagesentrepresentingtype: 1
x-originating-ip: [209.116.155.178]
x-ms-publictraffictype: Email
x-microsoft-exchange-diagnostics: 1; DB3PR0502MB4073;
 6:e+bHzU+pao/9wp1moliwse5Ofk3MqoAweXxGF+1j6Pbybu88kDJfr3VMKj+7THmM2E/yQT+e/xGYiN1VsoN2O4kfRftdgkFplcfK9DTuLvNE7ZpT3igra2wXCi3VkxlYurD4L8VTYc1dexkvfemZYcYn58infwCb+jizluYD5Ki8TsoGNdsapRyw9AvPkbCUPLosOIu/o04X7fBGxBwe8IhbAvNoKK9Uu9nrBaziQPO6445BSSOJykdEQc568uoQtOC7NgFrcE7CCNQqpd6tVocAZQaU9XJG1ybLPz5CZ3PxgS9eEX1UNmy6menaKoimBhqEA6AS7MfXZVo9O12K/n0QTs9BDBBM+HLOqZYQdMJ73031d53VsdoSTRU5S1FFYDXuFkWaFdfGxZ2Jdl8hJikRy/Pf9JxiWEc4TIWS/U8L9WcGS4m0Ll1oBXJkfqQA7Gy4ITA7hhUJY7nTTjLnqQ==;
 5:p/9u+gCLTTUhEgZKLMMvWTCYtYLR+OVUGT5kRscVdAAvWlcTkZONaSaAgjq2AOxghNmzzWtoS9ZdOqZhK/W7lyyC7O+6WdbcxBcAUTgAcs8HWYgYu71gqOVYUJAKl7+oqrTIXJwJCnODt3HlEciufs1kAXW62844YBbbXLe2Tgk=;
 7:4kISTxB+oen1l0+Hp6p+XtQp9CjXgytapKTZkB1La/eE7QAdn2ecZ34VLIwq2PnkgCIhyknMvw47z9ybDCEmQtUabfjW4ogbrHlfqN/0iepHwdA10DrDaQURr7/rxePMsnrYrv2haFZ1TgyH3l+v11hN21RIYqIVJfZ6Rgb8jhgkTMlR6CFhs0Ave4eLeKo3m1CGdV5hvbjM9bTum33GCaf9CFVP8R5yjFiAJX2cHdMpbDxC3/HlQf+GJQB4k02R
x-ms-office365-filtering-correlation-id: e819dd0b-e664-43de-06d3-08d63b0bde78
x-ms-office365-filtering-ht: Tenant
x-microsoft-antispam: BCL:0; PCL:0;
 RULEID:(7020095)(4652040)(8989299)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(5600074)(711020)(4618075)(2017052603328)(7153060)(7193020);
 SRVR:DB3PR0502MB4073; 
x-ms-traffictypediagnostic: DB3PR0502MB4073:
x-microsoft-antispam-prvs: <DB3PR0502MB407332278FDC6B1E6F638EC6C3F00@DB3PR0502MB4073.eurprd05.prod.outlook.com>
x-exchange-antispam-report-test: UriScan:(211171220733660)(788757137089);
x-ms-exchange-senderadcheck: 1
x-exchange-antispam-report-cfa-test: BCL:0; PCL:0;
 RULEID:(8211001083)(6040522)(2401047)(5005006)(8121501046)(10201501046)(93006095)(93001095)(3002001)(3231355)(944501410)(52105095)(6055026)(148016)(149066)(150057)(6041310)(201703131423095)(201702281528075)(20161123555045)(201703061421075)(201703061406153)(20161123558120)(20161123564045)(20161123560045)(20161123562045)(201708071742011)(7699051)(76991095);
 SRVR:DB3PR0502MB4073; BCL:0; PCL:0; RULEID:; SRVR:DB3PR0502MB4073; 
x-forefront-prvs: 083751FCA6
x-forefront-antispam-report: SFV:NSPM;
 SFS:(10009020)(346002)(366004)(396003)(39860400002)(376002)(136003)(13464003)(52314003)(199004)(189003)(6246003)(81166006)(6436002)(53946003)(99286004)(2906002)(9686003)(71200400001)(86362001)(4744004)(71190400001)(81156014)(6512007)(6636002)(14454004)(5024004)(8936002)(14444005)(8676002)(106356001)(6486002)(316002)(186003)(3846002)(105586002)(54906003)(93886005)(478600001)(5660300001)(53936002)(25786009)(229853002)(26005)(1076002)(6116002)(4326008)(66066001)(561944003)(33656002)(53546011)(6506007)(476003)(386003)(102836004)(2900100001)(97736004)(68736007)(6862004)(486006)(256004)(5250100002)(33896004)(305945005)(7736002)(446003)(11346002)(52116002)(76176011)(559001)(579004);
 DIR:OUT; SFP:1101; SCL:1; SRVR:DB3PR0502MB4073;
 H:DB3PR0502MB3980.eurprd05.prod.outlook.com; FPR:; SPF:None; LANG:en;
 PTR:InfoNoRecords; A:1; MX:1; 
received-spf: None (protection.outlook.com: mellanox.com does not designate
 permitted sender hosts)
x-microsoft-antispam-message-info: hgWqdkK0G012jmp3TQDXuN+AWePQjcr7As0Jsn9rqRr56yDVNpMpCb7LKjrOOgPAycyAc00BRUEKqRJvE+j8tCM/d1cmx6ivp3o8eQETQTz0QavxqvM9jPjtAFrky0PUEDZOtgqiQO4zL9q/xL5uT5shmvAy+gYhrDyXUA1mIcrYmofzeEjTGazCGdFRQngcSNMeRvUyHQ092xTX50UP5meZAPJp3KpawHIcnBkEmHbWLo2AlZpk3vHlldkjEm4qfMXVMsg/mQAXsAxT1HnmObDV/17T6hz2k1IIhzh9LZZjP6AfTNHVgcHH3kiaUi8o9dqCgNA4iGvX7qF56COq+hGkdspktqck0OdFksKdtKg=
spamdiagnosticoutput: 1:99
spamdiagnosticmetadata: NSPM
Content-Type: text/plain; charset="us-ascii"
Content-ID: <00F1ADDB4A7D9442BC01C0841A92CF55@eurprd05.prod.outlook.com>
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-OriginatorOrg: Mellanox.com
X-MS-Exchange-CrossTenant-Network-Message-Id: e819dd0b-e664-43de-06d3-08d63b0bde78
X-MS-Exchange-CrossTenant-originalarrivaltime: 26 Oct 2018 06:25:48.9375 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: a652971c-7d2e-4d9b-a6a4-d149256f461b
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB3PR0502MB4073
Subject: Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel
	devices management
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Fri, 26 Oct 2018 06:25:53 -0000

On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Thursday, October 25, 2018 3:28
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> >=20
> > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko wrote:
> > > VXLAN interfaces are dynamically created for each local UDP port of
> > > outer networks and then used as targets for TC "flower" filters in
> > > order to perform encapsulation. These VXLAN interfaces are
> > > system-wide, the only one device with given UDP port can exist in the
> > > system (the attempt of creating another device with the same UDP loca=
l
> > > port returns EEXIST), so PMD should support the shared device
> > > instances database for PMD instances. These VXLAN implicitly created
> > > devices are called VTEPs (Virtual Tunnel End Points).
> > >
> > > Creation of the VTEP occurs at the moment of rule applying. The link
> > > is set up, root ingress qdisc is also initialized.
> > >
> > > Encapsulation VTEPs are created on per port basis, the single VTEP is
> > > attached to the outer interface and is shared for all encapsulation
> > > rules on this interface. The source UDP port is automatically selecte=
d
> > > in range 30000-60000.
> > >
> > > For decapsulaton one VTEP is created per every unique UDP local port
> > > to accept tunnel traffic. The name of created VTEP consists of prefix
> > > "vmlx_" and the number of UDP port in decimal digits without leading
> > > zeros (vmlx_4789). The VTEP can be preliminary created in the system
> > > before the launching
> > > application, it allows to share	UDP ports between primary
> > > and secondary processes.
> > >
> > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com>
> > > ---
> > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > ++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > index d6840d5..efa9c3b 100644
> > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > >  	return -err;
> > >  }
> > >
> > > +/* VTEP device list is shared between PMD port instances. */ static
> > > +LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > +			vtep_list_vxlan =3D LIST_HEAD_INITIALIZER(); static
> > pthread_mutex_t
> > > +vtep_list_mutex =3D PTHREAD_MUTEX_INITIALIZER;
> >=20
> > What's the reason for choosing pthread_mutex instead of rte_*_lock?
>=20
> The sharing this database for secondary processes?

The static variable isn't shared with sec proc. But you can leave it as is.

> > > +
> > > +/**
> > > + * Deletes VTEP network device.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] vtep
> > > + *   Object represinting the network device to delete. Memory
> > > + *   allocated for this object is freed by routine.
> > > + */
> > > +static void
> > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > +	struct nlmsghdr *nlh;
> > > +	struct ifinfomsg *ifm;
> > > +	alignas(struct nlmsghdr)
> > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > +	int ret;
> > > +
> > > +	assert(!vtep->refcnt);
> > > +	if (vtep->created && vtep->ifindex) {
> >=20
> > First of all vtep->created seems of no use. It is introduced to select =
the error
> > message in flow_tcf_create_iface(). I don't see any necessity to distin=
guish
> > between 'vtep is allocated by rte_malloc()' and 'vtep is created in ker=
nel'.
>=20
> created flag indicates the iface is created by our code.
> The VXLAN decap devices must have the specified UDP port, we can not crea=
te
> multiple VXLAN devices with the same UDP port - EEXIST is returned. So, w=
e have
> to share device. One option is create device before DPDK application laun=
ch and use
> these pre-created devices. Inthis case created flag is not set and VXLAN =
device
> is not reinitialized, and not deleted.

I can't see any code to use pre-created device (created even before dpdk ap=
p
launch). Your code just tries to create 'vmlx_xxxx'. Even from your comment=
 in
[7/7] patch, PMD will cleanup any leftovers (existing vtep devices) on
initialization. Your comment sounds conflicting and confusing.

> > And why do you need to check vtep->ifindex as well? If vtep is created =
in
> > kernel and its ifindex isn't set, that should be an error which had to =
be hanled
> > in flow_tcf_create_iface(). Such a vtep shouldn't exist.
> Yes, if we did not get ifindex of device - vtep is not created, error ret=
urned.
> We just can not operate w/o ifindex.

I know ifindex is needed but my question was checking vtep->ifindex here lo=
oked
redundant/unnecessary. But as you agreed on having create/get/release_iface=
(),
it doesn't matter much.

> > Also, the refcnt management is a bit strange. Please put an abstraction=
 by
> > adding create_iface(), get_iface() and release_iface(). In the get_ifce=
(),
> > vtep->refcnt should be incremented. And in the release_iface(), it
> > vtep->decrease the
> OK. Good proposal. I'll refactor the code.
>=20
> > refcnt and if it reaches to zero, the iface can be removed. create_ifac=
e() will
> > set the refcnt to 1. And if you refer to mlx5_hrxq_get(), it even does
> > searching the list not by repeating the same lookup code here and there=
.
> > That will make your code much simpler.
> >=20
> > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > +		nlh =3D mnl_nlmsg_put_header(buf);
> > > +		nlh->nlmsg_type =3D RTM_DELLINK;
> > > +		nlh->nlmsg_flags =3D NLM_F_REQUEST;
> > > +		ifm =3D mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +		ifm->ifi_family =3D AF_UNSPEC;
> > > +		ifm->ifi_index =3D vtep->ifindex;
> > > +		ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +		if (ret)
> > > +			DRV_LOG(WARNING, "netlink: error deleting VXLAN
> > "
> > > +					 "encap/decap ifindex %u",
> > > +					 ifm->ifi_index);
> > > +	}
> > > +	rte_free(vtep);
> > > +}
> > > +
> > > +/**
> > > + * Creates VTEP network device.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Outer interface to attach new-created VXLAN device
> > > + *   If zero the VXLAN device will not be attached to any device.
> > > + * @param[in] port
> > > + *   UDP port of created VTEP device.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + *
> > > + * @return
> > > + * Pointer to created device structure on success, NULL otherwise
> > > + * and rte_errno is set.
> > > + */
> > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> >=20
> > Why negative(ifndef) first intead of positive(ifdef)?
> Hm. Did I miss the rule. Positive #ifdef first? OK.

No concrete rule but if there's no specific reason, it would be better to s=
tart
from ifdef.

> > > +static struct mlx5_flow_tcf_vtep*
> > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf __rte_unused=
,
> > > +		      unsigned int ifouter __rte_unused,
> > > +		      uint16_t port __rte_unused,
> > > +		      struct rte_flow_error *error) {
> > > +	rte_flow_error_set(error, ENOTSUP,
> > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > +			 "netlink: failed to create VTEP, "
> > > +			 "VXLAN metadat is not supported by kernel");
> >=20
> > Typo.
>=20
> OK.  "metadata are not supported".
> >=20
> > > +	return NULL;
> > > +}
> > > +#else
> > > +static struct mlx5_flow_tcf_vtep*
> > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> >=20
> > How about adding 'vtep'? It sounds vague - creating a general interface=
.
> > E.g., flow_tcf_create_vtep_iface()?
>=20
> OK.
>=20
> >=20
> > > +		      unsigned int ifouter,
> > > +		      uint16_t port, struct rte_flow_error *error) {
> > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > +	struct nlmsghdr *nlh;
> > > +	struct ifinfomsg *ifm;
> > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > +	alignas(struct nlmsghdr)
> > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> >=20
> > Use a macro for '128'. Can't know the meaning.
> OK. I think we should calculate the buffer size explicitly.
>=20
> >=20
> > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > +		       SZ_NLATTR_NEST * 2 +
> > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > +	struct nlattr *na_info;
> > > +	struct nlattr *na_vxlan;
> > > +	rte_be16_t vxlan_port =3D RTE_BE16(port);
> >=20
> > Use rte_cpu_to_be_*() instead.
>=20
> Yes, I'll recheck the whole code for this issue.
>=20
> >=20
> > > +	int ret;
> > > +
> > > +	vtep =3D rte_zmalloc(__func__, sizeof(*vtep),
> > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > +	if (!vtep) {
> > > +		rte_flow_error_set
> > > +			(error, ENOMEM,
> > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > +			 NULL, "unadble to allocate memory for VTEP desc");
> > > +		return NULL;
> > > +	}
> > > +	*vtep =3D (struct mlx5_flow_tcf_vtep){
> > > +			.refcnt =3D 0,
> > > +			.port =3D port,
> > > +			.created =3D 0,
> > > +			.ifouter =3D 0,
> > > +			.ifindex =3D 0,
> > > +			.local =3D LIST_HEAD_INITIALIZER(),
> > > +			.neigh =3D LIST_HEAD_INITIALIZER(),
> > > +	};
> > > +	memset(buf, 0, sizeof(buf));
> > > +	nlh =3D mnl_nlmsg_put_header(buf);
> > > +	nlh->nlmsg_type =3D RTM_NEWLINK;
> > > +	nlh->nlmsg_flags =3D NLM_F_REQUEST | NLM_F_CREATE  |
> > NLM_F_EXCL;
> > > +	ifm =3D mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +	ifm->ifi_family =3D AF_UNSPEC;
> > > +	ifm->ifi_type =3D 0;
> > > +	ifm->ifi_index =3D 0;
> > > +	ifm->ifi_flags =3D IFF_UP;
> > > +	ifm->ifi_change =3D 0xffffffff;
> > > +	snprintf(name, sizeof(name), "%s%u", MLX5_VXLAN_DEVICE_PFX,
> > port);
> > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > +	na_info =3D mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > +	assert(na_info);
> > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > +	na_vxlan =3D mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > +	if (ifouter)
> > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > +	assert(na_vxlan);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX, 1);
> > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > +	mnl_attr_nest_end(nlh, na_info);
> > > +	assert(sizeof(buf) >=3D nlh->nlmsg_len);
> > > +	ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +	if (ret)
> > > +		DRV_LOG(WARNING,
> > > +			"netlink: VTEP %s create failure (%d)",
> > > +			name, rte_errno);
> > > +	else
> > > +		vtep->created =3D 1;
> >=20
> > Flow of code here isn't smooth, thus could be error-prone. Most of all,=
 I
> > don't like ret has multiple meanings. ret should be return value but yo=
u are
> > using it to store ifindex.
> >=20
> > > +	if (ret && ifouter)
> > > +		ret =3D 0;
> > > +	else
> > > +		ret =3D if_nametoindex(name);
> >=20
> > If vtep isn't created and ifouter is set, then skip init below, which m=
eans, if
>=20
> ifouter is set for VXLAN encap devices. They should be attached to ifoute=
r
> and can not be shared. So, if ifouter I set - we do not use the precreate=
d/existing
> VXLAN devices. We have to create our own not shared device.

In your code (flow_tcf_encap_vtep_create()), it is shared by multiple flows=
. Do
you mean it isn't shared between different outer ifaces? If so, that's for =
sure.

> > vtep is created or ifouter is set, it tries to get ifindex of vtep.
> > But why do you want to try to call this API even if it failed to create=
 vtep?
> > Let's not make code flow convoluted even though it logically works. Let=
's
> > make it straightforward.
> >=20
> > > +	if (ret) {
> > > +		vtep->ifindex =3D ret;
> > > +		vtep->ifouter =3D ifouter;
> > > +		memset(buf, 0, sizeof(buf));
> > > +		nlh =3D mnl_nlmsg_put_header(buf);
> > > +		nlh->nlmsg_type =3D RTM_NEWLINK;
> > > +		nlh->nlmsg_flags =3D NLM_F_REQUEST;
> > > +		ifm =3D mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > +		ifm->ifi_family =3D AF_UNSPEC;
> > > +		ifm->ifi_type =3D 0;
> > > +		ifm->ifi_index =3D vtep->ifindex;
> > > +		ifm->ifi_flags =3D IFF_UP;
> > > +		ifm->ifi_change =3D IFF_UP;
> > > +		ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +		if (ret) {
> > > +			DRV_LOG(WARNING,
> > > +				"netlink: VTEP %s set link up failure (%d)",
> > > +				name, rte_errno);
> > > +			rte_free(vtep);
> > > +			rte_flow_error_set
> > > +				(error, -errno,
> > > +				 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				 "netlink: failed to set VTEP link up");
> > > +			vtep =3D NULL;
> > > +		} else {
> > > +			ret =3D mlx5_flow_tcf_init(tcf, vtep->ifindex, error);
> > > +			if (ret)
> > > +				DRV_LOG(WARNING,
> > > +				"VTEP %s init failure (%d)", name, rte_errno);
> > > +		}
> > > +	} else {
> > > +		DRV_LOG(WARNING,
> > > +			"VTEP %s failed to get index (%d)", name, errno);
> > > +		rte_flow_error_set
> > > +			(error, -errno,
> > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED, NULL,
> > > +			 !vtep->created ? "netlink: failed to create VTEP" :
> > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > +			 ret =3D 1;
> >=20
> > If it fails to create a vtep above, it will print out two warning messa=
ges and
> > one rte_flow_error message. And it even selects message to print betwee=
n
> > two?
> > And there's another info msg at the end even in case of failure. Do you=
 really
> > want to do this even with manipulating ret to change code path?  Not a =
good
> > practice.
> >=20
> > Usually, code path should be straightforward for sucessful path and for
> > errors/failures, return immediately or use 'goto' if there's need for c=
leanup.
> >=20
> > Please refactor entire function.
>=20
> I think I'll split it in two ones - for attached and potentially shared i=
faces.
> >=20
> > > +	}
> > > +	if (ret) {
> > > +		flow_tcf_delete_iface(tcf, vtep);
> > > +		vtep =3D NULL;
> > > +	}
> > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ? "OK" :
> > "error");
> > > +	return vtep;
> > > +}
> > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > +
> > > +/**
> > > + * Create target interface index for VXLAN tunneling decapsulation.
> > > + * In order to share the UDP port within the other interfaces the
> > > + * VXLAN device created as not attached to any interface (if created=
).
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set=
.
> >=20
> > Return negative errno in case of failure like others.
>=20
> Anyway, we have to return an index. If we do not return it as function re=
sult
> we will need to provide some extra pointing parameter, it complicates the=
 code.

You misunderstood it. See what I wrote below. The function still returns th=
e
index but in case of error, make it return negative errno instead of zero.

> >=20
> >  *   Interface index on success, a negative errno value otherwise and
> > rte_errno is set.
> >=20
> > > + */
> > > +static unsigned int
> > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			   struct mlx5_flow *dev_flow,
> > > +			   struct rte_flow_error *error)
> > > +{
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +	uint16_t port =3D dev_flow->tcf.vxlan_decap->udp_port;
> > > +
> > > +	vtep =3D NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->port =3D=3D port) {
> > > +			vtep =3D vlst;
> > > +			break;
> > > +		}
> > > +	}
> >=20
> > You just need one variable.
>=20
> Yes. There is a long story, I forgot to revert code to one variable after=
 debugging.
> >=20
> > 	struct mlx5_flow_tcf_vtep *vtep;
> >=20
> > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > 		if (vtep->port =3D=3D port)
> > 			break;
> > 	}
> >=20
> > > +	if (!vtep) {
> > > +		vtep =3D flow_tcf_create_iface(tcf, 0, port, error);
> > > +		if (vtep)
> > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > +	} else {
> > > +		if (vtep->ifouter) {
> > > +			rte_flow_error_set(error, -errno,
> > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				"Failed to create decap VTEP, attached "
> > > +				"device with the same UDP port exists");
> > > +				vtep =3D NULL;
> >=20
> > Making vtep null to skip the following code?
>=20
> Yes. To avoid multiple return operators in code.

It's okay to have multiple returns. Why not?

> > Please merge the two same
> > if/else and make the code path strightforward. And which errno do you
> > expect here?
> > Should it be set EEXIST instead?
> Not always. Netlink returns the code.=20

No, that's not my point. Your code above sets errno instead of rte_errno or
EEXIST.

	} else {
		if (vtep->ifouter) {
			rte_flow_error_set(error, -errno,

Which one sets this errno? Here, it sets rte_errno because matched vtep can=
't be
used as it already has outer iface attached (error message isn't clear, ple=
ase
reword it too). I thought this should be EEXIST but you set errno to rte_er=
rno
but errno isn't valid at this point.

>=20
> >=20
> > > +		}
> > > +	}
> > > +	if (vtep) {
> > > +		vtep->refcnt++;
> > > +		assert(vtep->ifindex);
> > > +		return vtep->ifindex;
> > > +	} else {
> > > +		return 0;
> > > +	}
> >=20
> > Why repeating same if/else?
> >=20
> >=20
> > This is my suggestion but if you take my suggestion to have
> > flow_tcf_[create|get|release]_iface(), this will get much simpler.
> Agree.
>=20
> >=20
> > {
> > 	struct mlx5_flow_tcf_vtep *vtep;
> > 	uint16_t port =3D dev_flow->tcf.vxlan_decap->udp_port;
> >=20
> > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > 		if (vtep->port =3D=3D port)
> > 			break;
> > 	}
> > 	if (vtep && vtep->ifouter)
> > 		return rte_flow_error_set(... EEXIST ...);
> > 	else if (vtep) {
> > 		++vtep->refcnt;
> > 	} else {
> > 		vtep =3D flow_tcf_create_iface(tcf, 0, port, error);
> > 		if (!vtep)
> > 			return rte_flow_error_set(...);
> > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > 	}
> > 	assert(vtep->ifindex);
> > 	return vtep->ifindex;
> > }
> >=20
> >=20
> > > +}
> > > +
> > > +/**
> > > + * Creates target interface index for VXLAN tunneling encapsulation.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Network interface index to attach VXLAN encap device to.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set=
.
> > > + */
> > > +static unsigned int
> > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifouter,
> > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > +			    struct rte_flow_error *error)
> > > +{
> > > +	static uint16_t encap_port =3D MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +
> > > +	assert(ifouter);
> > > +	/* Look whether the attached VTEP for encap is created. */
> > > +	vtep =3D NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->ifouter =3D=3D ifouter) {
> > > +			vtep =3D vlst;
> > > +			break;
> > > +		}
> > > +	}
> >=20
> > Same here.
> >=20
> > > +	if (!vtep) {
> > > +		uint16_t pcnt;
> > > +
> > > +		/* Not found, we should create the new attached VTEP. */
> > > +/*
> > > + * TODO: not implemented yet
> > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> >=20
> > Personal note is not appropriate even though it is removed in the follo=
wing
> > patch.
> >=20
> > > +		for (pcnt =3D 0; pcnt <=3D (MLX5_VXLAN_PORT_RANGE_MAX
> > > +				     - MLX5_VXLAN_PORT_RANGE_MIN);
> > pcnt++) {
> > > +			encap_port++;
> > > +			/* Wraparound the UDP port index. */
> > > +			if (encap_port < MLX5_VXLAN_PORT_RANGE_MIN
> > ||
> > > +			    encap_port > MLX5_VXLAN_PORT_RANGE_MAX)
> > > +				encap_port =3D
> > MLX5_VXLAN_PORT_RANGE_MIN;
> > > +			/* Check whether UDP port is in already in use. */
> > > +			vtep =3D NULL;
> > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +				if (vlst->port =3D=3D encap_port) {
> > > +					vtep =3D vlst;
> > > +					break;
> > > +				}
> > > +			}
> >=20
> > If you want to find out an empty port number, you can use rte_bitmap
> > instead of repeating searching the entire list for all possible port nu=
mbers.
>=20
> We do not expect too many VXLAN devices have been created. bitmap.

+1, valid point.

> > > +			if (vtep) {
> > > +				vtep =3D NULL;
> > > +				continue;
> > > +			}
> > > +			vtep =3D flow_tcf_create_iface(tcf, ifouter,
> > > +						     encap_port, error);
> > > +			if (vtep) {
> > > +				LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > next);
> > > +				break;
> > > +			}
> > > +			if (rte_errno !=3D EEXIST)
> > > +				break;
> > > +		}
> > > +	}
> > > +	if (!vtep)
> > > +		return 0;
> > > +	vtep->refcnt++;
> > > +	assert(vtep->ifindex);
> > > +	return vtep->ifindex;
> >=20
> > Please refactor this func according to what I suggested for
> > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> >=20
> > > +}
> > > +
> > > +/**
> > > + * Creates target interface index for tunneling of any type.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifouter
> > > + *   Network interface index to attach VXLAN encap device to.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + * @param[out] error
> > > + *   Perform verbose error reporting if not NULL.
> > > + * @return
> > > + *   Interface index on success, zero otherwise and rte_errno is set=
.
> >=20
> >  *   Interface index on success, a negative errno value otherwise and
> >  *   rte_errno is set.
> >=20
> > > + */
> > > +static unsigned int
> > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifouter,
> > > +			    struct mlx5_flow *dev_flow,
> > > +			    struct rte_flow_error *error)
> > > +{
> > > +	unsigned int ret;
> > > +
> > > +	assert(dev_flow->tcf.tunnel);
> > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > +	switch (dev_flow->tcf.tunnel->type) {
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > +		ret =3D flow_tcf_encap_vtep_create(tcf, ifouter,
> > > +						 dev_flow, error);
> > > +		break;
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > +		ret =3D flow_tcf_decap_vtep_create(tcf, dev_flow, error);
> > > +		break;
> > > +	default:
> > > +		rte_flow_error_set(error, ENOTSUP,
> > > +				RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > +				"unsupported tunnel type");
> > > +		ret =3D 0;
> > > +		break;
> > > +	}
> > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > +	return ret;
> > > +}
> > > +
> > > +/**
> > > + * Deletes tunneling interface by UDP port.
> > > + *
> > > + * @param[in] tcf
> > > + *   Context object initialized by mlx5_flow_tcf_context_create().
> > > + * @param[in] ifindex
> > > + *   Network interface index of VXLAN device.
> > > + * @param[in] dev_flow
> > > + *   Flow tcf object with tunnel structure pointer set.
> > > + */
> > > +static void
> > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tcf,
> > > +			    unsigned int ifindex,
> > > +			    struct mlx5_flow *dev_flow)
> > > +{
> > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > +
> > > +	assert(dev_flow->tcf.tunnel);
> > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > +	vtep =3D NULL;
> > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > +		if (vlst->ifindex =3D=3D ifindex) {
> > > +			vtep =3D vlst;
> > > +			break;
> > > +		}
> > > +	}
> >=20
> > It is weird. You just can have vtep pointer in the dev_flow->tcf.tunnel=
 instead
> > of ifindex_tun which is same as vtep->ifindex like the assertion below.=
 Then,
> > this lookup can be skipped.
>=20
> OK. Good optimization.
>=20
> >=20
> > > +	if (!vtep) {
> > > +		DRV_LOG(WARNING, "No VTEP device found in the list");
> > > +		goto exit;
> > > +	}
> > > +	switch (dev_flow->tcf.tunnel->type) {
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > +		break;
> > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > +/*
> > > + * TODO: Remove the encap ancillary rules first.
> > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);  */
> >=20
> > Is it a personal note? Please remove.
> OK.
>=20
> >=20
> > > +		break;
> > > +	default:
> > > +		assert(false);
> > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > +		break;
> > > +	}
> > > +	assert(dev_flow->tcf.tunnel->ifindex_tun =3D=3D vtep->ifindex);
> > > +	assert(vtep->refcnt);
> > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > +		LIST_REMOVE(vtep, next);
> > > +		flow_tcf_delete_iface(tcf, vtep);
> > > +	}
> > > +exit:
> > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > +}
> > > +
> > >  /**
> > >   * Apply flow to E-Switch by sending Netlink message.
> > >   *
> > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > >  	       struct rte_flow_error *error)  {
> > >  	struct priv *priv =3D dev->data->dev_private;
> > > -	struct mlx5_flow_tcf_context *nl =3D priv->tcf_context;
> > > +	struct mlx5_flow_tcf_context *tcf =3D priv->tcf_context;
> > >  	struct mlx5_flow *dev_flow;
> > >  	struct nlmsghdr *nlh;
> > > +	int ret;
> > >
> > >  	dev_flow =3D LIST_FIRST(&flow->dev_flows);
> > >  	/* E-Switch flow can't be expanded. */
> > >  	assert(!LIST_NEXT(dev_flow, next));
> > > +	if (dev_flow->tcf.applied)
> > > +		return 0;
> > >  	nlh =3D dev_flow->tcf.nlh;
> > >  	nlh->nlmsg_type =3D RTM_NEWTFILTER;
> > >  	nlh->nlmsg_flags =3D NLM_F_REQUEST | NLM_F_CREATE |
> > NLM_F_EXCL;
> > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > +	if (dev_flow->tcf.tunnel) {
> > > +		/*
> > > +		 * Replace the interface index, target for
> > > +		 * encapsulation, source for decapsulation.
> > > +		 */
> > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		/* Create actual VTEP device when rule is being applied. */
> > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > +			=3D flow_tcf_tunnel_vtep_create(tcf,
> > > +					*dev_flow->tcf.tunnel->ifindex_ptr,
> > > +					dev_flow, error);
> > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > +			return -rte_errno;
> > > +		dev_flow->tcf.tunnel->ifindex_org
> > > +			=3D *dev_flow->tcf.tunnel->ifindex_ptr;
> > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > +			=3D dev_flow->tcf.tunnel->ifindex_tun;
> > > +	}
> > > +	ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > +	if (dev_flow->tcf.tunnel) {
> > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > +			=3D dev_flow->tcf.tunnel->ifindex_org;
> > > +		dev_flow->tcf.tunnel->ifindex_org =3D 0;
> >=20
> > ifindex_org looks a temporary storage in this code. And this kind of ha=
ssle
> > (replace/restore) is there because you took the ifindex from the netlin=
k
> > message. Why don't you have just
> >=20
> > struct mlx5_flow_tcf_tunnel_hdr {
> > 	uint32_t type; /**< Tunnel action type. */
> > 	unsigned int ifindex; /**< Original dst/src interface */
> > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. */
> > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink message.
> > */ };
> >=20
> > and don't change ifindex?
>=20
> I propose to use the local variable for ifindex_org and do not keep it
> in structure. *ifindex_ptr will keep.

Well, you still have to restore the ifindex whenever sending the nl msg. Mo=
st of
all, ifindex_ptr in nl msg isn't a right place to store the ifindex. It sho=
uld
have vtep ifindex but it just temporarily keeps the device ifindex until vt=
ep is
created/found.

Thanks,
Yongseok