From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <yskoh@mellanox.com>
Received: from EUR04-HE1-obe.outbound.protection.outlook.com
 (mail-eopbgr70053.outbound.protection.outlook.com [40.107.7.53])
 by dpdk.org (Postfix) with ESMTP id 7213510A3
 for <dev@dpdk.org>; Mon, 29 Oct 2018 19:42:25 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=Mellanox.com;
 s=selector1;
 h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck;
 bh=JyAN9FwqY76DjLIQ6nrw7QhGMfa4C7yOm57YukI4djk=;
 b=ZYtI8MCJb6SeVZ5EivOwsS0CHW8hf5Wdj5uR4DpI32KXMA9LEYdPuKZdcU0xfVDxG9H2xwwCVSpzor9n3S315zagVFcMm011xYga5xhLp8acGVZy83WxMk3PDhsDvTO5nePRZ2BbdvAsTqyX9qUNrK6qFS8+fXgAdy/63np5zig=
Received: from DB3PR0502MB3980.eurprd05.prod.outlook.com (52.134.72.27) by
 DB3PR0502MB4044.eurprd05.prod.outlook.com (52.134.72.149) with Microsoft SMTP
 Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id
 15.20.1273.21; Mon, 29 Oct 2018 18:42:23 +0000
Received: from DB3PR0502MB3980.eurprd05.prod.outlook.com
 ([fe80::f8a1:fcab:94f0:97cc]) by DB3PR0502MB3980.eurprd05.prod.outlook.com
 ([fe80::f8a1:fcab:94f0:97cc%4]) with mapi id 15.20.1273.027; Mon, 29 Oct 2018
 18:42:23 +0000
From: Yongseok Koh <yskoh@mellanox.com>
To: Slava Ovsiienko <viacheslavo@mellanox.com>
CC: Shahaf Shuler <shahafs@mellanox.com>, "dev@dpdk.org" <dev@dpdk.org>
Thread-Topic: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices management
Thread-Index: AQHUZJFa0Xzb/HBSQ0ye9aegR3x2F6UutK6AgAHCugCAADMzgIAAqrMAgABmpgCABHbzAIAAci+A
Date: Mon, 29 Oct 2018 18:42:22 +0000
Message-ID: <20181029184214.GB9272@mtidpdk.mti.labs.mlnx>
References: <1538461807-37507-1-git-send-email-viacheslavo@mellanox.com>
 <1539612815-47199-1-git-send-email-viacheslavo@mellanox.com>
 <1539612815-47199-6-git-send-email-viacheslavo@mellanox.com>
 <20181025002759.GA26874@mtidpdk.mti.labs.mlnx>
 <AM4PR05MB326594413216A252B5194833D2F70@AM4PR05MB3265.eurprd05.prod.outlook.com>
 <20181026062307.GD6434@mtidpdk.mti.labs.mlnx>
 <AM4PR05MB3265A36AD6DA68FAABE37064D2F00@AM4PR05MB3265.eurprd05.prod.outlook.com>
 <20181026224248.GE13615@mtidpdk.mti.labs.mlnx>
 <AM4PR05MB326538EB415446562A9F9677D2F30@AM4PR05MB3265.eurprd05.prod.outlook.com>
In-Reply-To: <AM4PR05MB326538EB415446562A9F9677D2F30@AM4PR05MB3265.eurprd05.prod.outlook.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-clientproxiedby: BYAPR11CA0051.namprd11.prod.outlook.com
 (2603:10b6:a03:80::28) To DB3PR0502MB3980.eurprd05.prod.outlook.com
 (2603:10a6:8:10::27)
authentication-results: spf=none (sender IP is )
 smtp.mailfrom=yskoh@mellanox.com; 
x-ms-exchange-messagesentrepresentingtype: 1
x-originating-ip: [209.116.155.178]
x-ms-publictraffictype: Email
x-microsoft-exchange-diagnostics: 1; DB3PR0502MB4044;
 6:H2MiYH20L+znctn2jAnBMBi7TkMSV3mcaQ9YP0oE+05zshkE/l5zLM8tm7GaNjrPdSkxj+2HfkykXpwQQLC5I6LNg2cUGQQ/h/jVGfoSh+cwVafzERUcJvAS96Vy1JBGEyG5zOHgviAmYUEOsAqMpgqomSSTtvIbnGvbp7DR/n9A54RkbKLemhvWWOlY32hSjVX1y6WRtQBdTfVsS5jZKt/NJsTgKslI6B2+i40nE9Rbn4LvlEKU+DIN1IgLO5PBpjzPrDV+Qy+wOIxOjDyKtjhmtIS/da2XNNxeq3CNg7xRDg77T1AHmzUmmp8MjP1EFdBWdoX3m8ydv6lUdHm0OdQEMobnB9sJhtQ2Z35j6dYAKZc7nbyaaGwt9+hd1ZMKHAxpLClr6JmvnFi8L7Omi6qnofKFr1zC83VvZwjSZTEEhcClUlmv6wGnSJwaxc+Up1ITk2dERE1AhjaJN7isOg==;
 5:LquETD5dVGW8h66r9lQbGRqQOBfqMoZPvuwxjjOaUbNTqHbER2ha4Vd7RTXpGJlLlNUX3McgakmGyO9/C4DX6zugHFd8RfqFT1FUKge4uyog6n0t2ktEHdvVrbmB1FoV1kIPrtUbqdGjwCe9hmI2B+cNX1yQAV6vuUOgVciX090=;
 7:XPJ+jC45fTzFldNMYarRUycfma7oQXFrykEs5JqZxQ4s9eM0iXzzDE0ncaqSNPY3zNx+5rzuyl8QCKzi3GNKaQNPathq/DinkG9DV9RqPzS5P2q/nwHK1gxcl+9rJMur31v+9bU16Px4+VbwtXazmQ==
x-ms-office365-filtering-correlation-id: 9199eb4c-6a6e-41ea-cb1b-08d63dce4377
x-ms-office365-filtering-ht: Tenant
x-microsoft-antispam: BCL:0; PCL:0;
 RULEID:(7020095)(4652040)(8989299)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(5600074)(711020)(4618075)(2017052603328)(7153060)(7193020);
 SRVR:DB3PR0502MB4044; 
x-ms-traffictypediagnostic: DB3PR0502MB4044:
x-microsoft-antispam-prvs: <DB3PR0502MB40449D0D53208DF083C6EA45C3F30@DB3PR0502MB4044.eurprd05.prod.outlook.com>
x-exchange-antispam-report-test: UriScan:(211171220733660)(788757137089);
x-ms-exchange-senderadcheck: 1
x-exchange-antispam-report-cfa-test: BCL:0; PCL:0;
 RULEID:(8211001083)(6040522)(2401047)(5005006)(8121501046)(3231382)(944501410)(52105095)(93006095)(93001095)(3002001)(10201501046)(6055026)(148016)(149066)(150057)(6041310)(20161123560045)(20161123564045)(20161123562045)(20161123558120)(201703131423095)(201702281528075)(20161123555045)(201703061421075)(201703061406153)(201708071742011)(7699051)(76991095);
 SRVR:DB3PR0502MB4044; BCL:0; PCL:0; RULEID:; SRVR:DB3PR0502MB4044; 
x-forefront-prvs: 084080FC15
x-forefront-antispam-report: SFV:NSPM;
 SFS:(10009020)(366004)(396003)(136003)(39860400002)(346002)(376002)(52314003)(13464003)(199004)(189003)(76176011)(53946003)(6862004)(6486002)(5250100002)(6436002)(14454004)(33656002)(66066001)(53936002)(33896004)(561944003)(52116002)(2906002)(6512007)(9686003)(6246003)(3846002)(1076002)(6116002)(99286004)(5660300001)(97736004)(4744004)(71190400001)(71200400001)(86362001)(8676002)(25786009)(26005)(81156014)(81166006)(186003)(229853002)(93886005)(7736002)(305945005)(14444005)(476003)(53546011)(446003)(11346002)(486006)(102836004)(54906003)(478600001)(6636002)(256004)(4326008)(6506007)(386003)(2900100001)(5024004)(68736007)(106356001)(316002)(105586002)(8936002)(569006);
 DIR:OUT; SFP:1101; SCL:1; SRVR:DB3PR0502MB4044;
 H:DB3PR0502MB3980.eurprd05.prod.outlook.com; FPR:; SPF:None; LANG:en;
 PTR:InfoNoRecords; A:1; MX:1; 
received-spf: None (protection.outlook.com: mellanox.com does not designate
 permitted sender hosts)
x-microsoft-antispam-message-info: o3k5E7I7kc6NG+n65XYpR4vvjXyvDjK8Xx8L8Ceu5jrp1sE9GpmaOhZh1Ot9x+9Zo1LOw/Mfqcpc9MOmqnSXYCG/U9BMX+BvNWxqeWIyZZlm7I+vvPhdnEiPtuvINe869eAL/Nyz9uggVz6v5QjDYHC7pTFFbHujJt1eQEQtT6yhXnJo4VvgYoHNZiwe6AClGdNPpK8liUSRzwTGSYfBlktLeT43G6Cslj42G02zQjUMMlxn7Qs/6g6cuBbIXq0/To3VJCO14MpUJewVUuRn2aptBtYoGoxh6SpKQdwQmWfkwpftTa7ChTO9u1hB7y4K2PNS6XRxqNyCB2zd5zSck2LO+08cNk40dKfJbLeanrI=
spamdiagnosticoutput: 1:99
spamdiagnosticmetadata: NSPM
Content-Type: text/plain; charset="us-ascii"
Content-ID: <F7BDEE91CE159C48B24B3D8A8C754C69@eurprd05.prod.outlook.com>
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-OriginatorOrg: Mellanox.com
X-MS-Exchange-CrossTenant-Network-Message-Id: 9199eb4c-6a6e-41ea-cb1b-08d63dce4377
X-MS-Exchange-CrossTenant-originalarrivaltime: 29 Oct 2018 18:42:23.0019 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: a652971c-7d2e-4d9b-a6a4-d149256f461b
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB3PR0502MB4044
Subject: Re: [dpdk-dev] [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel
	devices management
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Mon, 29 Oct 2018 18:42:25 -0000

On Mon, Oct 29, 2018 at 04:53:34AM -0700, Slava Ovsiienko wrote:
> > -----Original Message-----
> > From: Yongseok Koh
> > Sent: Saturday, October 27, 2018 1:43
> > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > management
> >=20
> > On Fri, Oct 26, 2018 at 02:35:24AM -0700, Slava Ovsiienko wrote:
> > > > -----Original Message-----
> > > > From: Yongseok Koh
> > > > Sent: Friday, October 26, 2018 9:26
> > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel devices
> > > > management
> > > >
> > > > On Thu, Oct 25, 2018 at 01:21:12PM -0700, Slava Ovsiienko wrote:
> > > > > > -----Original Message-----
> > > > > > From: Yongseok Koh
> > > > > > Sent: Thursday, October 25, 2018 3:28
> > > > > > To: Slava Ovsiienko <viacheslavo@mellanox.com>
> > > > > > Cc: Shahaf Shuler <shahafs@mellanox.com>; dev@dpdk.org
> > > > > > Subject: Re: [PATCH v2 5/7] net/mlx5: e-switch VXLAN tunnel
> > > > > > devices management
> > > > > >
> > > > > > On Mon, Oct 15, 2018 at 02:13:33PM +0000, Viacheslav Ovsiienko
> > wrote:
> > > > > > > VXLAN interfaces are dynamically created for each local UDP
> > > > > > > port of outer networks and then used as targets for TC
> > > > > > > "flower" filters in order to perform encapsulation. These
> > > > > > > VXLAN interfaces are system-wide, the only one device with
> > > > > > > given UDP port can exist in the system (the attempt of
> > > > > > > creating another device with the same UDP local port returns
> > > > > > > EEXIST), so PMD should support the shared device instances
> > > > > > > database for PMD instances. These VXLAN implicitly created de=
vices
> > are called VTEPs (Virtual Tunnel End Points).
> > > > > > >
> > > > > > > Creation of the VTEP occurs at the moment of rule applying.
> > > > > > > The link is set up, root ingress qdisc is also initialized.
> > > > > > >
> > > > > > > Encapsulation VTEPs are created on per port basis, the single
> > > > > > > VTEP is attached to the outer interface and is shared for all
> > > > > > > encapsulation rules on this interface. The source UDP port is
> > > > > > > automatically selected in range 30000-60000.
> > > > > > >
> > > > > > > For decapsulaton one VTEP is created per every unique UDP
> > > > > > > local port to accept tunnel traffic. The name of created VTEP
> > > > > > > consists of prefix "vmlx_" and the number of UDP port in
> > > > > > > decimal digits without leading zeros (vmlx_4789). The VTEP ca=
n
> > > > > > > be preliminary created in the system before the launching
> > > > > > > application, it allows to share	UDP ports between primary
> > > > > > > and secondary processes.
> > > > > > >
> > > > > > > Suggested-by: Adrien Mazarguil <adrien.mazarguil@6wind.com>
> > > > > > > Signed-off-by: Viacheslav Ovsiienko <viacheslavo@mellanox.com=
>
> > > > > > > ---
> > > > > > >  drivers/net/mlx5/mlx5_flow_tcf.c | 503
> > > > > > > ++++++++++++++++++++++++++++++++++++++-
> > > > > > >  1 file changed, 499 insertions(+), 4 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > index d6840d5..efa9c3b 100644
> > > > > > > --- a/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > +++ b/drivers/net/mlx5/mlx5_flow_tcf.c
> > > > > > > @@ -3443,6 +3443,432 @@ struct pedit_parser {
> > > > > > >  	return -err;
> > > > > > >  }
> > > > > > >
> > > > > > > +/* VTEP device list is shared between PMD port instances. */
> > > > > > > +static LIST_HEAD(, mlx5_flow_tcf_vtep)
> > > > > > > +			vtep_list_vxlan =3D LIST_HEAD_INITIALIZER();
> > static
> > > > > > pthread_mutex_t
> > > > > > > +vtep_list_mutex =3D PTHREAD_MUTEX_INITIALIZER;
> > > > > >
> > > > > > What's the reason for choosing pthread_mutex instead of
> > rte_*_lock?
> > > > >
> > > > > The sharing this database for secondary processes?
> > > >
> > > > The static variable isn't shared with sec proc. But you can leave i=
t as is.
> > >
> > > Yes. The sharing just was assumed, not implemented yet.
> > >
> > > >
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Deletes VTEP network device.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_cre=
ate().
> > > > > > > + * @param[in] vtep
> > > > > > > + *   Object represinting the network device to delete. Memor=
y
> > > > > > > + *   allocated for this object is freed by routine.
> > > > > > > + */
> > > > > > > +static void
> > > > > > > +flow_tcf_delete_iface(struct mlx6_flow_tcf_context *tcf,
> > > > > > > +		      struct mlx5_flow_tcf_vtep *vtep) {
> > > > > > > +	struct nlmsghdr *nlh;
> > > > > > > +	struct ifinfomsg *ifm;
> > > > > > > +	alignas(struct nlmsghdr)
> > > > > > > +	uint8_t buf[mnl_nlmsg_size(MNL_ALIGN(sizeof(*ifm))) + 8];
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	assert(!vtep->refcnt);
> > > > > > > +	if (vtep->created && vtep->ifindex) {
> > > > > >
> > > > > > First of all vtep->created seems of no use. It is introduced to
> > > > > > select the error message in flow_tcf_create_iface(). I don't se=
e
> > > > > > any necessity to distinguish between 'vtep is allocated by
> > > > > > rte_malloc()' and
> > > > 'vtep is created in kernel'.
> > > > >
> > > > > created flag indicates the iface is created by our code.
> > > > > The VXLAN decap devices must have the specified UDP port, we can
> > > > > not create multiple VXLAN devices with the same UDP port - EEXIST
> > > > > is returned. So, we have to share device. One option is create
> > > > > device before DPDK application launch and use these pre-created
> > devices.
> > > > > Inthis case created flag is not set and VXLAN device is not
> > > > > reinitialized, and
> > > > not deleted.
> > > >
> > > > I can't see any code to use pre-created device (created even before
> > > > dpdk app launch). Your code just tries to create 'vmlx_xxxx'. Even
> > > > from your comment in [7/7] patch, PMD will cleanup any leftovers
> > > > (existing vtep devices) on initialization. Your comment sounds conf=
licting
> > and confusing.
> > >
> > > There are two types of VXLAN devices:
> > >
> > > - VXLAN decap, not attached to any ifouter. Provides the ingress UDP
> > > port,  we try to share the devices of this type, because we may be
> > > asked for  the specified UDP port. No device/rule cleanup and reinit
> > needed.
> > >
> > > - VXLAN encap, should be attached to ifouter to provide strict egress
> > > path, no need to share - egress UDP port does not matter. And we need
> > > to cleanup ifouter, remove other attached VXLAN devices and rules,
> > > because it is too hard to co-exist with some pre-created setup..
> >=20
> > I knew that. But how can it justify the need of 'created' field in vtep=
 struct?
> > In this code, it is of no use. But will see how it is used in your v3.
> >=20
> > > > > > And why do you need to check vtep->ifindex as well? If vtep is
> > > > > > created in kernel and its ifindex isn't set, that should be an
> > > > > > error which had to be hanled in flow_tcf_create_iface(). Such a
> > > > > > vtep shouldn't
> > > > exist.
> > > > > Yes, if we did not get ifindex of device - vtep is not created, e=
rror
> > returned.
> > > > > We just can not operate w/o ifindex.
> > > >
> > > > I know ifindex is needed but my question was checking vtep->ifindex
> > > > here looked redundant/unnecessary. But as you agreed on having
> > > > create/get/release_iface(), it doesn't matter much.
> > >
> > > Yes. I agree, will refactor the code.
> > >
> > > >
> > > > > > Also, the refcnt management is a bit strange. Please put an
> > > > > > abstraction by adding create_iface(), get_iface() and
> > > > > > release_iface(). In the get_ifce(),
> > > > > > vtep->refcnt should be incremented. And in the release_iface(),
> > > > > > vtep->it decrease the
> > > > > OK. Good proposal. I'll refactor the code.
> > > > >
> > > > > > refcnt and if it reaches to zero, the iface can be removed.
> > > > > > create_iface() will set the refcnt to 1. And if you refer to
> > > > > > mlx5_hrxq_get(), it even does searching the list not by
> > > > > > repeating the
> > > > same lookup code here and there.
> > > > > > That will make your code much simpler.
> > > > > >
> > > > > > > +		DRV_LOG(INFO, "VTEP delete (%d)", vtep->ifindex);
> > > > > > > +		nlh =3D mnl_nlmsg_put_header(buf);
> > > > > > > +		nlh->nlmsg_type =3D RTM_DELLINK;
> > > > > > > +		nlh->nlmsg_flags =3D NLM_F_REQUEST;
> > > > > > > +		ifm =3D mnl_nlmsg_put_extra_header(nlh,
> > sizeof(*ifm));
> > > > > > > +		ifm->ifi_family =3D AF_UNSPEC;
> > > > > > > +		ifm->ifi_index =3D vtep->ifindex;
> > > > > > > +		ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +		if (ret)
> > > > > > > +			DRV_LOG(WARNING, "netlink: error deleting
> > VXLAN
> > > > > > "
> > > > > > > +					 "encap/decap ifindex %u",
> > > > > > > +					 ifm->ifi_index);
> > > > > > > +	}
> > > > > > > +	rte_free(vtep);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates VTEP network device.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_cre=
ate().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Outer interface to attach new-created VXLAN device
> > > > > > > + *   If zero the VXLAN device will not be attached to any de=
vice.
> > > > > > > + * @param[in] port
> > > > > > > + *   UDP port of created VTEP device.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + *
> > > > > > > + * @return
> > > > > > > + * Pointer to created device structure on success, NULL
> > > > > > > +otherwise
> > > > > > > + * and rte_errno is set.
> > > > > > > + */
> > > > > > > +#ifndef HAVE_IFLA_VXLAN_COLLECT_METADATA
> > > > > >
> > > > > > Why negative(ifndef) first intead of positive(ifdef)?
> > > > > Hm. Did I miss the rule. Positive #ifdef first? OK.
> > > >
> > > > No concrete rule but if there's no specific reason, it would be
> > > > better to start from ifdef.
> > > >
> > > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf
> > __rte_unused,
> > > > > > > +		      unsigned int ifouter __rte_unused,
> > > > > > > +		      uint16_t port __rte_unused,
> > > > > > > +		      struct rte_flow_error *error) {
> > > > > > > +	rte_flow_error_set(error, ENOTSUP,
> > > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > > > > > +			 "netlink: failed to create VTEP, "
> > > > > > > +			 "VXLAN metadat is not supported by
> > kernel");
> > > > > >
> > > > > > Typo.
> > > > >
> > > > > OK.  "metadata are not supported".
> > > > > >
> > > > > > > +	return NULL;
> > > > > > > +}
> > > > > > > +#else
> > > > > > > +static struct mlx5_flow_tcf_vtep*
> > > > > > > +flow_tcf_create_iface(struct mlx5_flow_tcf_context *tcf,
> > > > > >
> > > > > > How about adding 'vtep'? It sounds vague - creating a general
> > interface.
> > > > > > E.g., flow_tcf_create_vtep_iface()?
> > > > >
> > > > > OK.
> > > > >
> > > > > >
> > > > > > > +		      unsigned int ifouter,
> > > > > > > +		      uint16_t port, struct rte_flow_error *error) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > > +	struct nlmsghdr *nlh;
> > > > > > > +	struct ifinfomsg *ifm;
> > > > > > > +	char name[sizeof(MLX5_VXLAN_DEVICE_PFX) + 24];
> > > > > > > +	alignas(struct nlmsghdr)
> > > > > > > +	uint8_t buf[mnl_nlmsg_size(sizeof(*ifm)) + 128 +
> > > > > >
> > > > > > Use a macro for '128'. Can't know the meaning.
> > > > > OK. I think we should calculate the buffer size explicitly.
> > > > >
> > > > > >
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(name)) +
> > > > > > > +		       SZ_NLATTR_NEST * 2 +
> > > > > > > +		       SZ_NLATTR_STRZ_OF("vxlan") +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint32_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint16_t)) +
> > > > > > > +		       SZ_NLATTR_DATA_OF(sizeof(uint8_t))];
> > > > > > > +	struct nlattr *na_info;
> > > > > > > +	struct nlattr *na_vxlan;
> > > > > > > +	rte_be16_t vxlan_port =3D RTE_BE16(port);
> > > > > >
> > > > > > Use rte_cpu_to_be_*() instead.
> > > > >
> > > > > Yes, I'll recheck the whole code for this issue.
> > > > >
> > > > > >
> > > > > > > +	int ret;
> > > > > > > +
> > > > > > > +	vtep =3D rte_zmalloc(__func__, sizeof(*vtep),
> > > > > > > +			alignof(struct mlx5_flow_tcf_vtep));
> > > > > > > +	if (!vtep) {
> > > > > > > +		rte_flow_error_set
> > > > > > > +			(error, ENOMEM,
> > > > > > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > > +			 NULL, "unadble to allocate memory for
> > VTEP desc");
> > > > > > > +		return NULL;
> > > > > > > +	}
> > > > > > > +	*vtep =3D (struct mlx5_flow_tcf_vtep){
> > > > > > > +			.refcnt =3D 0,
> > > > > > > +			.port =3D port,
> > > > > > > +			.created =3D 0,
> > > > > > > +			.ifouter =3D 0,
> > > > > > > +			.ifindex =3D 0,
> > > > > > > +			.local =3D LIST_HEAD_INITIALIZER(),
> > > > > > > +			.neigh =3D LIST_HEAD_INITIALIZER(),
> > > > > > > +	};
> > > > > > > +	memset(buf, 0, sizeof(buf));
> > > > > > > +	nlh =3D mnl_nlmsg_put_header(buf);
> > > > > > > +	nlh->nlmsg_type =3D RTM_NEWLINK;
> > > > > > > +	nlh->nlmsg_flags =3D NLM_F_REQUEST | NLM_F_CREATE  |
> > > > > > NLM_F_EXCL;
> > > > > > > +	ifm =3D mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm));
> > > > > > > +	ifm->ifi_family =3D AF_UNSPEC;
> > > > > > > +	ifm->ifi_type =3D 0;
> > > > > > > +	ifm->ifi_index =3D 0;
> > > > > > > +	ifm->ifi_flags =3D IFF_UP;
> > > > > > > +	ifm->ifi_change =3D 0xffffffff;
> > > > > > > +	snprintf(name, sizeof(name), "%s%u",
> > MLX5_VXLAN_DEVICE_PFX,
> > > > > > port);
> > > > > > > +	mnl_attr_put_strz(nlh, IFLA_IFNAME, name);
> > > > > > > +	na_info =3D mnl_attr_nest_start(nlh, IFLA_LINKINFO);
> > > > > > > +	assert(na_info);
> > > > > > > +	mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "vxlan");
> > > > > > > +	na_vxlan =3D mnl_attr_nest_start(nlh, IFLA_INFO_DATA);
> > > > > > > +	if (ifouter)
> > > > > > > +		mnl_attr_put_u32(nlh, IFLA_VXLAN_LINK, ifouter);
> > > > > > > +	assert(na_vxlan);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_COLLECT_METADATA, 1);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
> > 1);
> > > > > > > +	mnl_attr_put_u8(nlh, IFLA_VXLAN_LEARNING, 0);
> > > > > > > +	mnl_attr_put_u16(nlh, IFLA_VXLAN_PORT, vxlan_port);
> > > > > > > +	mnl_attr_nest_end(nlh, na_vxlan);
> > > > > > > +	mnl_attr_nest_end(nlh, na_info);
> > > > > > > +	assert(sizeof(buf) >=3D nlh->nlmsg_len);
> > > > > > > +	ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +	if (ret)
> > > > > > > +		DRV_LOG(WARNING,
> > > > > > > +			"netlink: VTEP %s create failure (%d)",
> > > > > > > +			name, rte_errno);
> > > > > > > +	else
> > > > > > > +		vtep->created =3D 1;
> > > > > >
> > > > > > Flow of code here isn't smooth, thus could be error-prone. Most
> > > > > > of all, I don't like ret has multiple meanings. ret should be
> > > > > > return value but you are using it to store ifindex.
> > > > > >
> > > > > > > +	if (ret && ifouter)
> > > > > > > +		ret =3D 0;
> > > > > > > +	else
> > > > > > > +		ret =3D if_nametoindex(name);
> > > > > >
> > > > > > If vtep isn't created and ifouter is set, then skip init below,
> > > > > > which means, if
> > > > >
> > > > > ifouter is set for VXLAN encap devices. They should be attached t=
o
> > > > > ifouter and can not be shared. So, if ifouter I set - we do not
> > > > > use the precreated/existing VXLAN devices. We have to create our
> > > > > own not
> > > > shared device.
> > > >
> > > > In your code (flow_tcf_encap_vtep_create()), it is shared by multip=
le
> > flows.
> > > > Do you mean it isn't shared between different outer ifaces? If so,
> > > > that's for sure.
> > > Sorry, I do not understand the question.
> > > VXLAN encap device is attached to ifouter and shared by all flows wit=
h
> > > this ifouter. No multiple VXLAN devices are attached to the same ifou=
ter,
> > only one.
> > > VXLAN decap device has no attached ifouter, so it can not share it.
> >=20
> > Yep, that's what I meant.
> >=20
> > > > > > vtep is created or ifouter is set, it tries to get ifindex of v=
tep.
> > > > > > But why do you want to try to call this API even if it failed t=
o create
> > vtep?
> > > > > > Let's not make code flow convoluted even though it logically wo=
rks.
> > > > > > Let's make it straightforward.
> > > > > >
> > > > > > > +	if (ret) {
> > > > > > > +		vtep->ifindex =3D ret;
> > > > > > > +		vtep->ifouter =3D ifouter;
> > > > > > > +		memset(buf, 0, sizeof(buf));
> > > > > > > +		nlh =3D mnl_nlmsg_put_header(buf);
> > > > > > > +		nlh->nlmsg_type =3D RTM_NEWLINK;
> > > > > > > +		nlh->nlmsg_flags =3D NLM_F_REQUEST;
> > > > > > > +		ifm =3D mnl_nlmsg_put_extra_header(nlh,
> > sizeof(*ifm));
> > > > > > > +		ifm->ifi_family =3D AF_UNSPEC;
> > > > > > > +		ifm->ifi_type =3D 0;
> > > > > > > +		ifm->ifi_index =3D vtep->ifindex;
> > > > > > > +		ifm->ifi_flags =3D IFF_UP;
> > > > > > > +		ifm->ifi_change =3D IFF_UP;
> > > > > > > +		ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +		if (ret) {
> > > > > > > +			DRV_LOG(WARNING,
> > > > > > > +				"netlink: VTEP %s set link up failure
> > (%d)",
> > > > > > > +				name, rte_errno);
> > > > > > > +			rte_free(vtep);
> > > > > > > +			rte_flow_error_set
> > > > > > > +				(error, -errno,
> > > > > > > +
> > RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > NULL,
> > > > > > > +				 "netlink: failed to set VTEP link up");
> > > > > > > +			vtep =3D NULL;
> > > > > > > +		} else {
> > > > > > > +			ret =3D mlx5_flow_tcf_init(tcf, vtep->ifindex,
> > error);
> > > > > > > +			if (ret)
> > > > > > > +				DRV_LOG(WARNING,
> > > > > > > +				"VTEP %s init failure (%d)", name,
> > rte_errno);
> > > > > > > +		}
> > > > > > > +	} else {
> > > > > > > +		DRV_LOG(WARNING,
> > > > > > > +			"VTEP %s failed to get index (%d)", name,
> > errno);
> > > > > > > +		rte_flow_error_set
> > > > > > > +			(error, -errno,
> > > > > > > +			 RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > NULL,
> > > > > > > +			 !vtep->created ? "netlink: failed to create
> > VTEP" :
> > > > > > > +			 "netlink: failed to retrieve VTEP ifindex");
> > > > > > > +			 ret =3D 1;
> > > > > >
> > > > > > If it fails to create a vtep above, it will print out two
> > > > > > warning messages and one rte_flow_error message. And it even
> > > > > > selects message to print between two?
> > > > > > And there's another info msg at the end even in case of failure=
.
> > > > > > Do you really want to do this even with manipulating ret to
> > > > > > change code path?  Not a good practice.
> > > > > >
> > > > > > Usually, code path should be straightforward for sucessful path
> > > > > > and for errors/failures, return immediately or use 'goto' if
> > > > > > there's need for
> > > > cleanup.
> > > > > >
> > > > > > Please refactor entire function.
> > > > >
> > > > > I think I'll split it in two ones - for attached and potentially =
shared
> > ifaces.
> > > > > >
> > > > > > > +	}
> > > > > > > +	if (ret) {
> > > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > > +		vtep =3D NULL;
> > > > > > > +	}
> > > > > > > +	DRV_LOG(INFO, "VTEP create (%d, %s)", vtep->port, vtep ?
> > "OK" :
> > > > > > "error");
> > > > > > > +	return vtep;
> > > > > > > +}
> > > > > > > +#endif /* HAVE_IFLA_VXLAN_COLLECT_METADATA */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Create target interface index for VXLAN tunneling
> > decapsulation.
> > > > > > > + * In order to share the UDP port within the other interface=
s
> > > > > > > +the
> > > > > > > + * VXLAN device created as not attached to any interface (if
> > created).
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_cre=
ate().
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errn=
o is set.
> > > > > >
> > > > > > Return negative errno in case of failure like others.
> > > > >
> > > > > Anyway, we have to return an index. If we do not return it as
> > > > > function result we will need to provide some extra pointing
> > > > > parameter, it
> > > > complicates the code.
> > > >
> > > > You misunderstood it. See what I wrote below. The function still
> > > > returns the index but in case of error, make it return negative err=
no
> > instead of zero.
> > > >
> > > > > >
> > > > > >  *   Interface index on success, a negative errno value otherwi=
se and
> > > > > > rte_errno is set.
> > > > > >
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_decap_vtep_create(struct mlx5_flow_tcf_context *tcf=
,
> > > > > > > +			   struct mlx5_flow *dev_flow,
> > > > > > > +			   struct rte_flow_error *error) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +	uint16_t port =3D dev_flow->tcf.vxlan_decap->udp_port;
> > > > > > > +
> > > > > > > +	vtep =3D NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->port =3D=3D port) {
> > > > > > > +			vtep =3D vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > You just need one variable.
> > > > >
> > > > > Yes. There is a long story, I forgot to revert code to one
> > > > > variable after
> > > > debugging.
> > > > > >
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > >
> > > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > > 		if (vtep->port =3D=3D port)
> > > > > > 			break;
> > > > > > 	}
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		vtep =3D flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > > +		if (vtep)
> > > > > > > +			LIST_INSERT_HEAD(&vtep_list_vxlan, vtep,
> > next);
> > > > > > > +	} else {
> > > > > > > +		if (vtep->ifouter) {
> > > > > > > +			rte_flow_error_set(error, -errno,
> > > > > > > +
> > 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > NULL,
> > > > > > > +				"Failed to create decap VTEP,
> > attached "
> > > > > > > +				"device with the same UDP port
> > exists");
> > > > > > > +				vtep =3D NULL;
> > > > > >
> > > > > > Making vtep null to skip the following code?
> > > > >
> > > > > Yes. To avoid multiple return operators in code.
> > > >
> > > > It's okay to have multiple returns. Why not?
> > >
> > > It is easy to miss the return in the midst of function  while
> > refactoring/modifying the code.
> >=20
> > Your code path doesn't look easy and free from error. Please refer to o=
ther
> > control path functions in this PMD.
> >=20
> > > > > > Please merge the two same
> > > > > > if/else and make the code path strightforward. And which errno
> > > > > > do you expect here?
> > > > > > Should it be set EEXIST instead?
> > > > > Not always. Netlink returns the code.
> > > >
> > > > No, that's not my point. Your code above sets errno instead of
> > > > rte_errno or EEXIST.
> > > >
> > > > 	} else {
> > > > 		if (vtep->ifouter) {
> > > > 			rte_flow_error_set(error, -errno,
> > > >
> > > > Which one sets this errno? Here, it sets rte_errno because matched
> > > > vtep
> > > libmnl sets, while processing the Netlink reply message (callback.c o=
f libmnl
> > sources).
> >=20
> > You still don't understand my point.
> >=20
> > In this flow_tcf_decap_vtep_create(), if vtep is found (vtep !=3D NULL)=
, how
> > can errno be set? Before the if/else, there's no libmnl call.
> >=20
> > > > can't be used as it already has outer iface attached (error message
> > > > isn't clear, please reword it too). I thought this should be EEXIST
> > > > but you set errno to rte_errno but errno isn't valid at this point.
> > > >
> > > > >
> > > > > >
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +	if (vtep) {
> > > > > > > +		vtep->refcnt++;
> > > > > > > +		assert(vtep->ifindex);
> > > > > > > +		return vtep->ifindex;
> > > > > > > +	} else {
> > > > > > > +		return 0;
> > > > > > > +	}
> > > > > >
> > > > > > Why repeating same if/else?
> > > > > >
> > > > > >
> > > > > > This is my suggestion but if you take my suggestion to have
> > > > > > flow_tcf_[create|get|release]_iface(), this will get much simpl=
er.
> > > > > Agree.
> > > > >
> > > > > >
> > > > > > {
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep;
> > > > > > 	uint16_t port =3D dev_flow->tcf.vxlan_decap->udp_port;
> > > > > >
> > > > > > 	LIST_FOREACH(vtep, &vtep_list_vxlan, next) {
> > > > > > 		if (vtep->port =3D=3D port)
> > > > > > 			break;
> > > > > > 	}
> > > > > > 	if (vtep && vtep->ifouter)
> > > > > > 		return rte_flow_error_set(... EEXIST ...);
> > > > > > 	else if (vtep) {
> > > > > > 		++vtep->refcnt;
> > > > > > 	} else {
> > > > > > 		vtep =3D flow_tcf_create_iface(tcf, 0, port, error);
> > > > > > 		if (!vtep)
> > > > > > 			return rte_flow_error_set(...);
> > > > > > 		LIST_INSERT_HEAD(&vtep_list_vxlan, vtep, next);
> > > > > > 	}
> > > > > > 	assert(vtep->ifindex);
> > > > > > 	return vtep->ifindex;
> > > > > > }
> > > > > >
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates target interface index for VXLAN tunneling
> > encapsulation.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_cre=
ate().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Network interface index to attach VXLAN encap device to=
.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errn=
o is set.
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_encap_vtep_create(struct mlx5_flow_tcf_context *tcf=
,
> > > > > > > +			    unsigned int ifouter,
> > > > > > > +			    struct mlx5_flow *dev_flow __rte_unused,
> > > > > > > +			    struct rte_flow_error *error) {
> > > > > > > +	static uint16_t encap_port =3D
> > MLX5_VXLAN_PORT_RANGE_MIN - 1;
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +
> > > > > > > +	assert(ifouter);
> > > > > > > +	/* Look whether the attached VTEP for encap is created. */
> > > > > > > +	vtep =3D NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->ifouter =3D=3D ifouter) {
> > > > > > > +			vtep =3D vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > Same here.
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		uint16_t pcnt;
> > > > > > > +
> > > > > > > +		/* Not found, we should create the new attached
> > VTEP. */
> > > > > > > +/*
> > > > > > > + * TODO: not implemented yet
> > > > > > > + * flow_tcf_encap_iface_cleanup(tcf, ifouter);
> > > > > > > + * flow_tcf_encap_local_cleanup(tcf, ifouter);
> > > > > > > + * flow_tcf_encap_neigh_cleanup(tcf, ifouter);  */
> > > > > >
> > > > > > Personal note is not appropriate even though it is removed in
> > > > > > the following patch.
> > > > > >
> > > > > > > +		for (pcnt =3D 0; pcnt <=3D
> > (MLX5_VXLAN_PORT_RANGE_MAX
> > > > > > > +				     -
> > MLX5_VXLAN_PORT_RANGE_MIN);
> > > > > > pcnt++) {
> > > > > > > +			encap_port++;
> > > > > > > +			/* Wraparound the UDP port index. */
> > > > > > > +			if (encap_port <
> > MLX5_VXLAN_PORT_RANGE_MIN
> > > > > > ||
> > > > > > > +			    encap_port >
> > MLX5_VXLAN_PORT_RANGE_MAX)
> > > > > > > +				encap_port =3D
> > > > > > MLX5_VXLAN_PORT_RANGE_MIN;
> > > > > > > +			/* Check whether UDP port is in already in
> > use. */
> > > > > > > +			vtep =3D NULL;
> > > > > > > +			LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +				if (vlst->port =3D=3D encap_port) {
> > > > > > > +					vtep =3D vlst;
> > > > > > > +					break;
> > > > > > > +				}
> > > > > > > +			}
> > > > > >
> > > > > > If you want to find out an empty port number, you can use
> > > > > > rte_bitmap instead of repeating searching the entire list for
> > > > > > all possible port
> > > > numbers.
> > > > >
> > > > > We do not expect too many VXLAN devices have been created. bitmap=
.
> > > >
> > > > +1, valid point.
> > > >
> > > > > > > +			if (vtep) {
> > > > > > > +				vtep =3D NULL;
> > > > > > > +				continue;
> > > > > > > +			}
> > > > > > > +			vtep =3D flow_tcf_create_iface(tcf, ifouter,
> > > > > > > +						     encap_port,
> > error);
> > > > > > > +			if (vtep) {
> > > > > > > +				LIST_INSERT_HEAD(&vtep_list_vxlan,
> > vtep,
> > > > > > next);
> > > > > > > +				break;
> > > > > > > +			}
> > > > > > > +			if (rte_errno !=3D EEXIST)
> > > > > > > +				break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > > > +	if (!vtep)
> > > > > > > +		return 0;
> > > > > > > +	vtep->refcnt++;
> > > > > > > +	assert(vtep->ifindex);
> > > > > > > +	return vtep->ifindex;
> > > > > >
> > > > > > Please refactor this func according to what I suggested for
> > > > > > flow_tcf_decap_vtep_create() and flow_tcf_delete_iface().
> > > > > >
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Creates target interface index for tunneling of any type.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_cre=
ate().
> > > > > > > + * @param[in] ifouter
> > > > > > > + *   Network interface index to attach VXLAN encap device to=
.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + * @param[out] error
> > > > > > > + *   Perform verbose error reporting if not NULL.
> > > > > > > + * @return
> > > > > > > + *   Interface index on success, zero otherwise and rte_errn=
o is set.
> > > > > >
> > > > > >  *   Interface index on success, a negative errno value otherwi=
se and
> > > > > >  *   rte_errno is set.
> > > > > >
> > > > > > > + */
> > > > > > > +static unsigned int
> > > > > > > +flow_tcf_tunnel_vtep_create(struct mlx5_flow_tcf_context *tc=
f,
> > > > > > > +			    unsigned int ifouter,
> > > > > > > +			    struct mlx5_flow *dev_flow,
> > > > > > > +			    struct rte_flow_error *error) {
> > > > > > > +	unsigned int ret;
> > > > > > > +
> > > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > > +		ret =3D flow_tcf_encap_vtep_create(tcf, ifouter,
> > > > > > > +						 dev_flow, error);
> > > > > > > +		break;
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > > +		ret =3D flow_tcf_decap_vtep_create(tcf, dev_flow,
> > error);
> > > > > > > +		break;
> > > > > > > +	default:
> > > > > > > +		rte_flow_error_set(error, ENOTSUP,
> > > > > > > +
> > 	RTE_FLOW_ERROR_TYPE_UNSPECIFIED,
> > > > > > NULL,
> > > > > > > +				"unsupported tunnel type");
> > > > > > > +		ret =3D 0;
> > > > > > > +		break;
> > > > > > > +	}
> > > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * Deletes tunneling interface by UDP port.
> > > > > > > + *
> > > > > > > + * @param[in] tcf
> > > > > > > + *   Context object initialized by mlx5_flow_tcf_context_cre=
ate().
> > > > > > > + * @param[in] ifindex
> > > > > > > + *   Network interface index of VXLAN device.
> > > > > > > + * @param[in] dev_flow
> > > > > > > + *   Flow tcf object with tunnel structure pointer set.
> > > > > > > + */
> > > > > > > +static void
> > > > > > > +flow_tcf_tunnel_vtep_delete(struct mlx5_flow_tcf_context *tc=
f,
> > > > > > > +			    unsigned int ifindex,
> > > > > > > +			    struct mlx5_flow *dev_flow) {
> > > > > > > +	struct mlx5_flow_tcf_vtep *vtep, *vlst;
> > > > > > > +
> > > > > > > +	assert(dev_flow->tcf.tunnel);
> > > > > > > +	pthread_mutex_lock(&vtep_list_mutex);
> > > > > > > +	vtep =3D NULL;
> > > > > > > +	LIST_FOREACH(vlst, &vtep_list_vxlan, next) {
> > > > > > > +		if (vlst->ifindex =3D=3D ifindex) {
> > > > > > > +			vtep =3D vlst;
> > > > > > > +			break;
> > > > > > > +		}
> > > > > > > +	}
> > > > > >
> > > > > > It is weird. You just can have vtep pointer in the
> > > > > > dev_flow->tcf.tunnel instead of ifindex_tun which is same as
> > > > > > vtep->ifindex like the assertion below. Then, this lookup can b=
e
> > skipped.
> > > > >
> > > > > OK. Good optimization.
> > > > >
> > > > > >
> > > > > > > +	if (!vtep) {
> > > > > > > +		DRV_LOG(WARNING, "No VTEP device found in the
> > list");
> > > > > > > +		goto exit;
> > > > > > > +	}
> > > > > > > +	switch (dev_flow->tcf.tunnel->type) {
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_DECAP:
> > > > > > > +		break;
> > > > > > > +	case MLX5_FLOW_TCF_TUNACT_VXLAN_ENCAP:
> > > > > > > +/*
> > > > > > > + * TODO: Remove the encap ancillary rules first.
> > > > > > > + * flow_tcf_encap_neigh(tcf, vtep, dev_flow, false, NULL);
> > > > > > > + * flow_tcf_encap_local(tcf, vtep, dev_flow, false, NULL);
> > > > > > > +*/
> > > > > >
> > > > > > Is it a personal note? Please remove.
> > > > > OK.
> > > > >
> > > > > >
> > > > > > > +		break;
> > > > > > > +	default:
> > > > > > > +		assert(false);
> > > > > > > +		DRV_LOG(WARNING, "Unsupported tunnel type");
> > > > > > > +		break;
> > > > > > > +	}
> > > > > > > +	assert(dev_flow->tcf.tunnel->ifindex_tun =3D=3D vtep->ifind=
ex);
> > > > > > > +	assert(vtep->refcnt);
> > > > > > > +	if (!vtep->refcnt || !--vtep->refcnt) {
> > > > > > > +		LIST_REMOVE(vtep, next);
> > > > > > > +		flow_tcf_delete_iface(tcf, vtep);
> > > > > > > +	}
> > > > > > > +exit:
> > > > > > > +	pthread_mutex_unlock(&vtep_list_mutex);
> > > > > > > +}
> > > > > > > +
> > > > > > >  /**
> > > > > > >   * Apply flow to E-Switch by sending Netlink message.
> > > > > > >   *
> > > > > > > @@ -3461,18 +3887,61 @@ struct pedit_parser {
> > > > > > >  	       struct rte_flow_error *error)  {
> > > > > > >  	struct priv *priv =3D dev->data->dev_private;
> > > > > > > -	struct mlx5_flow_tcf_context *nl =3D priv->tcf_context;
> > > > > > > +	struct mlx5_flow_tcf_context *tcf =3D priv->tcf_context;
> > > > > > >  	struct mlx5_flow *dev_flow;
> > > > > > >  	struct nlmsghdr *nlh;
> > > > > > > +	int ret;
> > > > > > >
> > > > > > >  	dev_flow =3D LIST_FIRST(&flow->dev_flows);
> > > > > > >  	/* E-Switch flow can't be expanded. */
> > > > > > >  	assert(!LIST_NEXT(dev_flow, next));
> > > > > > > +	if (dev_flow->tcf.applied)
> > > > > > > +		return 0;
> > > > > > >  	nlh =3D dev_flow->tcf.nlh;
> > > > > > >  	nlh->nlmsg_type =3D RTM_NEWTFILTER;
> > > > > > >  	nlh->nlmsg_flags =3D NLM_F_REQUEST | NLM_F_CREATE |
> > > > > > NLM_F_EXCL;
> > > > > > > -	if (!flow_tcf_nl_ack(nl, nlh, 0, NULL, NULL))
> > > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > > +		/*
> > > > > > > +		 * Replace the interface index, target for
> > > > > > > +		 * encapsulation, source for decapsulation.
> > > > > > > +		 */
> > > > > > > +		assert(!dev_flow->tcf.tunnel->ifindex_tun);
> > > > > > > +		assert(dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		/* Create actual VTEP device when rule is being
> > applied. */
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_tun
> > > > > > > +			=3D flow_tcf_tunnel_vtep_create(tcf,
> > > > > > > +					*dev_flow->tcf.tunnel-
> > >ifindex_ptr,
> > > > > > > +					dev_flow, error);
> > > > > > > +			DRV_LOG(INFO, "Replace ifindex: %d->%d",
> > > > > > > +				dev_flow->tcf.tunnel->ifindex_tun,
> > > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		if (!dev_flow->tcf.tunnel->ifindex_tun)
> > > > > > > +			return -rte_errno;
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_org
> > > > > > > +			=3D *dev_flow->tcf.tunnel->ifindex_ptr;
> > > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > > +			=3D dev_flow->tcf.tunnel->ifindex_tun;
> > > > > > > +	}
> > > > > > > +	ret =3D flow_tcf_nl_ack(tcf, nlh, 0, NULL, NULL);
> > > > > > > +	if (dev_flow->tcf.tunnel) {
> > > > > > > +		DRV_LOG(INFO, "Restore ifindex: %d->%d",
> > > > > > > +				dev_flow->tcf.tunnel->ifindex_org,
> > > > > > > +				*dev_flow->tcf.tunnel->ifindex_ptr);
> > > > > > > +		*dev_flow->tcf.tunnel->ifindex_ptr
> > > > > > > +			=3D dev_flow->tcf.tunnel->ifindex_org;
> > > > > > > +		dev_flow->tcf.tunnel->ifindex_org =3D 0;
> > > > > >
> > > > > > ifindex_org looks a temporary storage in this code. And this
> > > > > > kind of hassle
> > > > > > (replace/restore) is there because you took the ifindex from th=
e
> > > > > > netlink message. Why don't you have just
> > > > > >
> > > > > > struct mlx5_flow_tcf_tunnel_hdr {
> > > > > > 	uint32_t type; /**< Tunnel action type. */
> > > > > > 	unsigned int ifindex; /**< Original dst/src interface */
> > > > > > 	struct mlx5_flow_tcf_vtep *vtep; /**< Tunnel endpoint device. =
*/
> > > > > > 	unsigned int *nlmsg_ifindex_ptr; /**< ifindex ptr in Netlink m=
essage.
> > > > > > */ };
> > > > > >
> > > > > > and don't change ifindex?
> > > > >
> > > > > I propose to use the local variable for ifindex_org and do not
> > > > > keep it in structure. *ifindex_ptr will keep.
> > > >
> > > > Well, you still have to restore the ifindex whenever sending the nl
> > > > msg. Most of all, ifindex_ptr in nl msg isn't a right place to stor=
e the
> > ifindex.
> > > It is stored there for rules w/o tunnels. It is its "native" place, I=
d
> > > prefer not to create some new location to store the original index an=
d save
> > some space.
> > > We have to swap indices only if rule has requested the tunneling.  We
> > > can not
> >=20
> > No no. At this point, flow is already created to be tunneled one. What =
do you
> > mean by 'rules w/o tunnels' or 'only if rule has requested the tunnelin=
g'??
>=20
> I mean the code handles all kind of rules - with tunnel and w/o tunnels.
> The same code prepares the NL message for both rule types.
>=20
> > It has already been created as a vxlan tunnel rule. It won't be changed=
. The
> > nlmsg is supposed to have vtep ifindex but translation didn't know it a=
nd
> > stored the outer iface temporarily to get it replaced by vtep ifindex. =
It never
> > be a 'native'/'original' place to store it.
>=20
> I mean, if rule does not request the tunneling action - it just keeps the
> unchanged ifindex within Netlink message. If there is the tunneling - we =
replace
> this index with some value depending on this ifindex.  We cannot replace
> ifindex permanently at rule translation once, because VTEPs are created=20
> dynamically and VTEP ifindex can be different at the rule applying time.
> So, we need to keep the original ifindex and create VTEP depending on it =
every
> time rule is being applied.

Rules w/o tunnels doesn't use the struct (mlx5_flow_tcf_tunnel_hdr) anyway.=
 I
don't understand why you care.

> > In which case the nl msg can be sent
> > with the 'original' ifindex? Any specific example? No.
> >=20
> > > set tunnel index permanently, because rule can be
> > > applied/removed/reapplied and other new VXLAN device with new index
> > >can be recreated.
> >=20
> > Every time it is applied, it will get the vtep and overwrite vtep ifind=
ex in the nl
> > msg.
>=20
> Yes. We should overwrite this field anyway, and every time at rule applyi=
ng.
> Because vtep ifindex can be different. And we need to keep the original i=
findex
> (for example to dynamically  create VTEP attached to it). Do you propose =
to keep
> ifindex_org field?  Now, as I can see we have to keep ifindex_ptr field o=
nly.

Like I suggested above, yes. What I don't like is replace/restore the index=
es
every time. I don't understand why you want to shuffling around a variable =
each
time. nlmsg is temporary anyway, there's no 'native' place. But I'll leave =
it up
to you. This isn't a super critical issue.


Thanks,
Yongseok