From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <anatoly.burakov@intel.com>
Received: from mga09.intel.com (mga09.intel.com [134.134.136.24])
 by dpdk.org (Postfix) with ESMTP id D52231396
 for <dev@dpdk.org>; Wed, 22 Mar 2017 14:49:21 +0100 (CET)
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple;
 d=intel.com; i=@intel.com; q=dns/txt; s=intel;
 t=1490190563; x=1521726563;
 h=from:to:cc:subject:date:message-id:references:
 in-reply-to:content-transfer-encoding:mime-version;
 bh=lwitZbfV875lSAe7o6XejhxELmiJcBc3lSRWVwfAR5c=;
 b=W/w02iQSA5wLwFrHXAiQCRv13fUtr02hN6d13BSHNrA8yNQoawU8btxW
 H4i8nZrGQHvBlited9Bef0PPejr3bQ==;
Received: from fmsmga005.fm.intel.com ([10.253.24.32])
 by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 22 Mar 2017 06:49:20 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.36,205,1486454400"; d="scan'208";a="79404923"
Received: from irsmsx107.ger.corp.intel.com ([163.33.3.99])
 by fmsmga005.fm.intel.com with ESMTP; 22 Mar 2017 06:49:19 -0700
Received: from irsmsx156.ger.corp.intel.com (10.108.20.68) by
 IRSMSX107.ger.corp.intel.com (163.33.3.99) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Wed, 22 Mar 2017 13:49:16 +0000
Received: from irsmsx109.ger.corp.intel.com ([169.254.13.44]) by
 IRSMSX156.ger.corp.intel.com ([169.254.3.107]) with mapi id 14.03.0248.002;
 Wed, 22 Mar 2017 13:49:16 +0000
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
To: Alejandro Lucero <alejandro.lucero@netronome.com>
CC: "dev@dpdk.org" <dev@dpdk.org>
Thread-Topic: [PATCH] vfio: add hotplug support
Thread-Index: AQHSnkNgzZU4cdK2B0+xUeem3f363aGgrY3g
Date: Wed, 22 Mar 2017 13:49:15 +0000
Message-ID: <C6ECDF3AB251BE4894318F4E45123697821EF8A5@IRSMSX109.ger.corp.intel.com>
References: <1489661496-12818-1-git-send-email-alejandro.lucero@netronome.com>
In-Reply-To: <1489661496-12818-1-git-send-email-alejandro.lucero@netronome.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-ctpclassification: CTP_IC
x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiYzZhMWIwMzUtNDFjMS00NjU0LThiMDUtZWM4ZDA2ZmViMjVhIiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE2LjUuOS4zIiwiVHJ1c3RlZExhYmVsSGFzaCI6Im51ZWkxSml4bjNHcHl3K0VcL0VPZEtEbEFaRHlmb0NzN05RY1puWDZidGdnPSJ9
x-originating-ip: [163.33.239.182]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH] vfio: add hotplug support
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Wed, 22 Mar 2017 13:49:23 -0000

Hi Alejandro,

Just a general comment, I think there's now a rule to not split error messa=
ges onto multiple lines
even if going above the 80 char limit, to make grepping easier.

> Current device hotplug is just when using UIO. This patch adds
> same functionality with VFIO.
>=20
> It has been validated through tests using IOMMU and also with VFIO
> and no-iommu mode.
>=20

<...>

>  /* map VFIO resource prototype */
>  int pci_vfio_map_resource(struct rte_pci_device *dev);
> +int pci_vfio_unmap_resource(struct rte_pci_device *dev);

Probably worth it to fix the comment here?

> @@ -517,7 +521,7 @@
>  		}
>=20
>  		/* set bus mastering for the device */
> -		if (pci_vfio_set_bus_master(vfio_dev_fd)) {
> +		if (pci_vfio_set_bus_master(vfio_dev_fd, 1)) {

"true"? Doesn't make a difference, but if you're accepting bool, you might =
as well
use bool values when calling :)

>  			RTE_LOG(ERR, EAL, "  %s cannot set up bus
> mastering!\n", pci_addr);
>  			close(vfio_dev_fd);
>  			rte_free(vfio_res);
> @@ -535,6 +539,76 @@
>  }
>=20
>  int
> +pci_vfio_unmap_resource(struct rte_pci_device *dev)
> +{
> +	char pci_addr[PATH_MAX] =3D {0};
> +	struct rte_pci_addr *loc =3D &dev->addr;
> +	int i, ret;
> +	struct mapped_pci_resource *vfio_res =3D NULL;
> +	struct mapped_pci_res_list *vfio_res_list;
> +
> +	struct pci_map *maps;
> +
> +	/* store PCI address string */
> +	snprintf(pci_addr, sizeof(pci_addr), PCI_PRI_FMT,
> +			loc->domain, loc->bus, loc->devid, loc->function);
> +
> +
> +	if (close(dev->intr_handle.fd) < 0) {
> +		RTE_LOG(INFO, EAL, "Error when Closing eventfd file descriptor for %s\=
n",

"closing"

> +			pci_addr);
> +		return -1;
> +	}
> +
> +	if (pci_vfio_set_bus_master(dev->intr_handle.vfio_dev_fd, 0)) {

"false"?

> +		RTE_LOG(ERR, EAL, "  %s cannot unset bus mastering for PCI
> device!\n",
> +				pci_addr);
> +		return -1;
> +	}
> +
> +	ret =3D vfio_release_device(pci_get_sysfs_path(), pci_addr,
> +				  dev->intr_handle.vfio_dev_fd);
> +	if (ret < 0) {
> +		RTE_LOG(ERR, EAL,
> +			"%s(): cannot release device\n", __func__);
> +		return ret;
> +	}
> +
> +	vfio_res_list =3D RTE_TAILQ_CAST(rte_vfio_tailq.head,
> mapped_pci_res_list);
> +	/* Get vfio_res */
> +	TAILQ_FOREACH(vfio_res, vfio_res_list, next) {
> +		if (memcmp(&vfio_res->pci_addr, &dev->addr, sizeof(dev-
> >addr)))
> +			continue;
> +		break;
> +	}
> +	/* if we haven't found our tailq entry, something's wrong */
> +	if (vfio_res =3D=3D NULL) {
> +		RTE_LOG(ERR, EAL, "  %s cannot find TAILQ entry for PCI
> device!\n",
> +				pci_addr);
> +		return -1;
> +	}
> +
> +	/* map BARs */
> +	maps =3D vfio_res->maps;

Fix comment

> +
> +	RTE_LOG(INFO, EAL, "Releasing pci mapped resource for %s\n",
> +		pci_addr);
> +	for (i =3D 0; i < (int) vfio_res->nb_maps; i++) {
> +
> +		/* TODO: what about the maps offset field? */

You probably need to perform the same calculations mapping function does if=
 you
are unmapping an MSI-X BAR. As far as I understand, the address space is ma=
pped
in whole (i.e. the maps[i].addr covers the whole BAR), but the things that =
map into
that address space are mapped using two separate mmap calls. So it's probab=
ly a
matter of 1) figuring out if maps[i] points to an MSI-X BAR, 2) figuring ou=
t if there
is other stuff in the register that can be mapped, and 3) unmapping that. W=
e pretty
much map those unconditionally, so if the device is up, they're guaranteed =
to be mapped.

> +		if (maps[i].addr) {
> +			RTE_LOG(INFO, EAL, "Calling pci_unmap_resource
> for %s at %p\n",
> +				pci_addr, maps[i].addr);
> +			pci_unmap_resource(maps[i].addr, maps[i].size);
> +		}
> +	}
> +
> +	TAILQ_REMOVE(vfio_res_list, vfio_res, next);
> +
> +	return 0;
> +}
> +
> +int
>  pci_vfio_ioport_map(struct rte_pci_device *dev, int bar,
>  		    struct rte_pci_ioport *p)
>  {
> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> index 9377a66..42e0f62 100644
> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c
> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c
> @@ -68,13 +68,32 @@
>  {
>  	int i;
>  	int vfio_group_fd;
> +	int group_idx =3D -1;
>  	char filename[PATH_MAX];
>=20
>  	/* check if we already have the group descriptor open */
> -	for (i =3D 0; i < vfio_cfg.vfio_group_idx; i++)
> +	for (i =3D 0; i < VFIO_MAX_GROUPS; i++)
>  		if (vfio_cfg.vfio_groups[i].group_no =3D=3D iommu_group_no)
>  			return vfio_cfg.vfio_groups[i].fd;
>=20
> +	/* Lets see first if there is room for a new group */
> +	if (vfio_cfg.vfio_active_groups =3D=3D VFIO_MAX_GROUPS) {
> +		RTE_LOG(ERR, EAL, "Maximum number of VFIO groups
> reached!\n");
> +		return -1;
> +	}
> +
> +	/* Now lets get an index for the new group */
> +	for (i =3D 0; i < VFIO_MAX_GROUPS; i++)
> +		if (vfio_cfg.vfio_groups[i].group_no =3D=3D -1) {
> +			group_idx =3D i;
> +			break;
> +		}
> +
> +	/* This should not happen */
> +	if (group_idx =3D=3D -1) {
> +		RTE_LOG(ERR, EAL, "No VFIO group free slot found\n");
> +		return -1;
> +	}
>  	/* if primary, try to open the group */
>  	if (internal_config.process_type =3D=3D RTE_PROC_PRIMARY) {
>  		/* try regular group format */
> @@ -104,14 +123,9 @@
>  			/* noiommu group found */
>  		}
>=20
> -		/* if the fd is valid, create a new group for it */
> -		if (vfio_cfg.vfio_group_idx =3D=3D VFIO_MAX_GROUPS) {
> -			RTE_LOG(ERR, EAL, "Maximum number of VFIO
> groups reached!\n");
> -			close(vfio_group_fd);
> -			return -1;
> -		}
> -		vfio_cfg.vfio_groups[vfio_cfg.vfio_group_idx].group_no =3D
> iommu_group_no;
> -		vfio_cfg.vfio_groups[vfio_cfg.vfio_group_idx].fd =3D
> vfio_group_fd;
> +		vfio_cfg.vfio_groups[group_idx].group_no =3D
> iommu_group_no;
> +		vfio_cfg.vfio_groups[group_idx].fd =3D vfio_group_fd;
> +		vfio_cfg.vfio_active_groups++;
>  		return vfio_group_fd;
>  	}
>  	/* if we're in a secondary process, request group fd from the primary
> @@ -158,14 +172,62 @@
>  	return -1;
>  }
>=20
> -static void
> -clear_current_group(void)
> +int
> +clear_group(int vfio_group_fd)
>  {
> -	vfio_cfg.vfio_groups[vfio_cfg.vfio_group_idx].group_no =3D 0;
> -	vfio_cfg.vfio_groups[vfio_cfg.vfio_group_idx].fd =3D -1;
> +	int i;
> +	int socket_fd, ret;
> +
> +	if (internal_config.process_type =3D=3D RTE_PROC_PRIMARY) {
> +
> +		for (i =3D 0; i < VFIO_MAX_GROUPS; i++)
> +			if (vfio_cfg.vfio_groups[i].fd =3D=3D vfio_group_fd) {
> +				vfio_cfg.vfio_groups[i].group_no =3D -1;
> +				vfio_cfg.vfio_groups[i].fd =3D -1;
> +				vfio_cfg.vfio_active_groups--;
> +				return 0;
> +			}
> +		return -1;
> +	}
> +
> +	/* This is just for SECONDARY processes */
> +	socket_fd =3D vfio_mp_sync_connect_to_primary();
> +
> +	if (socket_fd < 0) {
> +		RTE_LOG(ERR, EAL, "  cannot connect to primary
> process!\n");
> +		return -1;
> +	}
> +
> +	if (vfio_mp_sync_send_request(socket_fd, SOCKET_CLR_GROUP) <
> 0) {
> +		RTE_LOG(ERR, EAL, "  cannot request container fd!\n");
> +		close(socket_fd);
> +		return -1;
> +	}
> +
> +	if (vfio_mp_sync_send_request(socket_fd, vfio_group_fd) < 0) {
> +		RTE_LOG(ERR, EAL, "  cannot send group fd!\n");
> +		close(socket_fd);
> +		return -1;
> +	}
> +
> +	ret =3D vfio_mp_sync_receive_request(socket_fd);
> +	switch (ret) {
> +	case SOCKET_NO_FD:
> +		RTE_LOG(ERR, EAL, "  BAD VFIO group fd!\n");
> +		close(socket_fd);
> +		break;
> +	case SOCKET_OK:
> +		close(socket_fd);
> +		return 0;
> +	default:
> +		RTE_LOG(ERR, EAL, "  UNKNOWN reply, %d\n", ret);
> +		close(socket_fd);

Technically, SOCKET_ERR isn't an "UNKNOWN" reply, so maybe handle it explic=
itly?

That said, please forgive me my ignorance on hotplug, but does secondary pr=
ocess
need to notify primary about group clearing at all? I mean, doesn't primary=
 process
also get the hotplug event? I would guess secondary process just needs to u=
nmap
the device and that's it, no?

> +	}
> +	return -1;
>  }
>=20
> -int vfio_setup_device(const char *sysfs_base, const char *dev_addr,
> +int
> +vfio_setup_device(const char *sysfs_base, const char *dev_addr,
>  		int *vfio_dev_fd, struct vfio_device_info *device_info)
>  {
>  	struct vfio_group_status group_status =3D {

<...>

> +		 */
> +
> +		RTE_LOG(WARNING, EAL, "Getting a vfio_dev_fd for %s
> failed\n",
>  				dev_addr);
> -		return 1;
> +		close(vfio_group_fd);
> +		clear_group(vfio_group_fd);
> +		return -1;

I've had suspicions that it would be a bug, but I tested it on both modern =
and old kernels (where
ports on the same NIC are assigned same IOMMU group), and it works as it sh=
ould. I think there wasn't
a "which kernel driver" EAL PCI code back when this was written, so returni=
ng 1 was necessary to not fail
when multiple ports shared the same IOMMU group, but only one of them was b=
ound to VFIO.

>  	}
>=20
>  	/* test and setup the device */
>  	ret =3D ioctl(*vfio_dev_fd, VFIO_DEVICE_GET_INFO, device_info);
>  	if (ret) {
>  		RTE_LOG(ERR, EAL, "  %s cannot get device info, "
> -				"error %i (%s)\n", dev_addr, errno,
> strerror(errno));

<...>

> @@ -155,6 +154,10 @@ struct vfio_iommu_type {
>  int
>  vfio_get_group_fd(int iommu_group_no);
>=20
> +/* remove group fd from internal VFIO group fd array */
> +int
> +clear_group(int vfio_group_fd);
> +
>  /**
>   * Setup vfio_cfg for the device identified by its address. It discovers
>   * the configured I/O MMU groups or sets a new one for the device. If a =
new
> @@ -165,6 +168,9 @@ struct vfio_iommu_type {
>  int vfio_setup_device(const char *sysfs_base, const char *dev_addr,
>  		int *vfio_dev_fd, struct vfio_device_info *device_info);
>=20
> +int vfio_release_device(const char *sysfs_base, const char *dev_addr, in=
t
> fd);
> +
> +int vfio_enable(const char *modname);
>  int vfio_enable(const char *modname);

Duplicate vfio_enable declaration?

>  int vfio_is_enabled(const char *modname);
>=20
> @@ -175,6 +181,7 @@ int vfio_setup_device(const char *sysfs_base, const
> char *dev_addr,
>=20
>  #define SOCKET_REQ_CONTAINER 0x100
>  #define SOCKET_REQ_GROUP 0x200
> +#define SOCKET_CLR_GROUP 0x300
>  #define SOCKET_OK 0x0
>  #define SOCKET_NO_FD 0x1
>  #define SOCKET_ERR 0xFF

Thanks,
Anatoly