From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <anatoly.burakov@intel.com>
Received: from mga03.intel.com (mga03.intel.com [134.134.136.65])
 by dpdk.org (Postfix) with ESMTP id 06B5B1C073
 for <dev@dpdk.org>; Thu, 12 Apr 2018 16:03:42 +0200 (CEST)
X-Amp-Result: SKIPPED(no attachment in message)
X-Amp-File-Uploaded: False
Received: from orsmga005.jf.intel.com ([10.7.209.41])
 by orsmga103.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384;
 12 Apr 2018 07:03:41 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.48,441,1517904000"; d="scan'208";a="216075349"
Received: from aburakov-mobl.ger.corp.intel.com (HELO [10.237.220.128])
 ([10.237.220.128])
 by orsmga005.jf.intel.com with ESMTP; 12 Apr 2018 07:03:39 -0700
To: Xiao Wang <xiao.w.wang@intel.com>, ferruh.yigit@intel.com
Cc: dev@dpdk.org, maxime.coquelin@redhat.com, zhihong.wang@intel.com,
 tiwei.bie@intel.com, jianfeng.tan@intel.com, cunming.liang@intel.com,
 dan.daly@intel.com, thomas@monjalon.net, gaetan.rivet@6wind.com,
 hemant.agrawal@nxp.com, Junjie Chen <junjie.j.chen@intel.com>
References: <20180405180701.16853-4-xiao.w.wang@intel.com>
 <20180412071956.66178-1-xiao.w.wang@intel.com>
 <20180412071956.66178-2-xiao.w.wang@intel.com>
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
Message-ID: <974c9cd0-87c4-6ab1-0787-9278a7379fda@intel.com>
Date: Thu, 12 Apr 2018 15:03:38 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101
 Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To: <20180412071956.66178-2-xiao.w.wang@intel.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Subject: Re: [dpdk-dev] [PATCH v6 1/4] eal/vfio: add multiple container
	support
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Thu, 12 Apr 2018 14:03:43 -0000

On 12-Apr-18 8:19 AM, Xiao Wang wrote:
> Currently eal vfio framework binds vfio group fd to the default
> container fd during rte_vfio_setup_device, while in some cases,
> e.g. vDPA (vhost data path acceleration), we want to put vfio group
> to a separate container and program IOMMU via this container.
> 
> This patch adds some APIs to support container creating and device
> binding with a container.
> 
> A driver could use "rte_vfio_create_container" helper to create a
> new container from eal, use "rte_vfio_bind_group" to bind a device
> to the newly created container.
> 
> During rte_vfio_setup_device, the container bound with the device
> will be used for IOMMU setup.
> 
> Signed-off-by: Junjie Chen <junjie.j.chen@intel.com>
> Signed-off-by: Xiao Wang <xiao.w.wang@intel.com>
> Reviewed-by: Maxime Coquelin <maxime.coquelin@redhat.com>
> Reviewed-by: Ferruh Yigit <ferruh.yigit@intel.com>
> ---

Apologies for the late review. Some comments below.

<...>

>   
> +struct rte_memseg;
> +
>   /**
>    * Setup vfio_cfg for the device identified by its address.
>    * It discovers the configured I/O MMU groups or sets a new one for the device.
> @@ -131,6 +133,117 @@ rte_vfio_clear_group(int vfio_group_fd);
>   }
>   #endif
>   

<...>

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Perform dma mapping for devices in a conainer.
> + *
> + * @param container_fd
> + *   the specified container fd
> + *
> + * @param dma_type
> + *   the dma map type
> + *
> + * @param ms
> + *   the dma address region to map
> + *
> + * @return
> + *    0 if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_dma_map(int container_fd, int dma_type, const struct rte_memseg *ms);
> +

First of all, why memseg instead of va/iova/len? This seems like an 
unnecessary attachment to the internals of DPDK memory representation. 
Not all memory comes in memsegs, so this makes the API unnecessarily 
specific to DPDK memory.

Also, why provide the DMA type? There's already a VFIO type pointer in 
vfio_config - you can set this pointer for every newly created 
container, so the user wouldn't have to care about the IOMMU type. Is it 
not possible to figure out the DMA type from within EAL VFIO? If not, 
maybe provide an API to do so, e.g. rte_vfio_container_set_dma_type()?

This will also need to be rebased on top of latest HEAD because there 
already is a similar DMA map/unmap API added, only without the container 
parameter. Perhaps rename these new functions to 
rte_vfio_container_(create|destroy|dma_map|dma_unmap)?
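
As a rough sketch of the shape suggested above: the map call takes 
va/iova/len rather than a memseg, and the IOMMU type is stored with the 
container instead of being passed in by the caller. Every sketch_* name 
here is hypothetical, the handler is a stand-in that performs no VFIO 
ioctl, and the table of containers is pre-populated purely for 
illustration. It is not the actual EAL code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_CONTAINERS 4

/* Hypothetical IOMMU type descriptor: EAL selects it, the caller never does. */
struct sketch_iommu_type {
	const char *name;
	int (*dma_map)(int container_fd, uint64_t vaddr, uint64_t iova,
			uint64_t len, int do_map);
};

/* Stand-in handler; a real one would issue the VFIO map/unmap ioctl. */
static int
sketch_type1_dma_map(int container_fd, uint64_t vaddr, uint64_t iova,
		uint64_t len, int do_map)
{
	(void)container_fd;
	(void)vaddr;
	(void)iova;
	(void)do_map;
	return len == 0 ? -1 : 0; /* reject empty regions */
}

static const struct sketch_iommu_type sketch_type1 = {
	.name = "Type 1",
	.dma_map = sketch_type1_dma_map,
};

/* Hypothetical per-container state; type == NULL marks an unused slot. */
struct sketch_container {
	int fd;
	const struct sketch_iommu_type *type;
};

/* One pre-populated container (fd 10) so the sketch can be exercised. */
static struct sketch_container sketch_containers[SKETCH_MAX_CONTAINERS] = {
	{ .fd = 10, .type = &sketch_type1 },
};

static const struct sketch_container *
sketch_find_container(int container_fd)
{
	size_t i;

	for (i = 0; i < SKETCH_MAX_CONTAINERS; i++) {
		if (sketch_containers[i].type != NULL &&
				sketch_containers[i].fd == container_fd)
			return &sketch_containers[i];
	}
	return NULL;
}

/* No memseg and no dma_type parameter: just the region and the container. */
int
sketch_vfio_container_dma_map(int container_fd, uint64_t vaddr,
		uint64_t iova, uint64_t len)
{
	const struct sketch_container *c = sketch_find_container(container_fd);

	if (c == NULL)
		return -1;
	assert(c->type->dma_map != NULL);
	return c->type->dma_map(container_fd, vaddr, iova, len, 1);
}

int
sketch_vfio_container_dma_unmap(int container_fd, uint64_t vaddr,
		uint64_t iova, uint64_t len)
{
	const struct sketch_container *c = sketch_find_container(container_fd);

	if (c == NULL)
		return -1;
	assert(c->type->dma_map != NULL);
	return c->type->dma_map(container_fd, vaddr, iova, len, 0);
}
```

With this shape a caller can map any region it owns, whether or not it 
is backed by a memseg, and never needs to know which IOMMU type the 
container uses.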

> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
> + *
> + * Perform dma unmapping for devices in a conainer.
> + *
> + * @param container_fd
> + *   the specified container fd
> + *
> + * @param dma_type
> + *    the dma map type
> + *
> + * @param ms
> + *   the dma address region to unmap
> + *
> + * @return
> + *    0 if successful
> + *   <0 if failed
> + */
> +int __rte_experimental
> +rte_vfio_dma_unmap(int container_fd, int dma_type, const struct rte_memseg *ms);
> +
>   #endif /* VFIO_PRESENT */
>   

<...>

> @@ -75,8 +53,8 @@ vfio_get_group_fd(int iommu_group_no)
>   		if (vfio_group_fd < 0) {
>   			/* if file not found, it's not an error */
>   			if (errno != ENOENT) {
> -				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> -						strerror(errno));
> +				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
> +					filename, strerror(errno));

This looks like an unintended change.

>   				return -1;
>   			}
>   
> @@ -86,8 +64,10 @@ vfio_get_group_fd(int iommu_group_no)
>   			vfio_group_fd = open(filename, O_RDWR);
>   			if (vfio_group_fd < 0) {
>   				if (errno != ENOENT) {
> -					RTE_LOG(ERR, EAL, "Cannot open %s: %s\n", filename,
> -							strerror(errno));
> +					RTE_LOG(ERR, EAL,
> +						"Cannot open %s: %s\n",
> +						filename,
> +						strerror(errno));

This looks like an unintended change.

>   					return -1;
>   				}
>   				return 0;
> @@ -95,21 +75,19 @@ vfio_get_group_fd(int iommu_group_no)
>   			/* noiommu group found */
>   		}
>   
> -		cur_grp->group_no = iommu_group_no;
> -		cur_grp->fd = vfio_group_fd;
> -		vfio_cfg.vfio_active_groups++;
>   		return vfio_group_fd;
>   	}
> -	/* if we're in a secondary process, request group fd from the primary
> +	/*
> +	 * if we're in a secondary process, request group fd from the primary
>   	 * process via our socket
>   	 */

This looks like an unintended change.

>   	else {
> -		int socket_fd, ret;
> -
> -		socket_fd = vfio_mp_sync_connect_to_primary();
> +		int ret;
> +		int socket_fd = vfio_mp_sync_connect_to_primary();
>   
>   		if (socket_fd < 0) {
> -			RTE_LOG(ERR, EAL, "  cannot connect to primary process!\n");
> +			RTE_LOG(ERR, EAL,
> +				"  cannot connect to primary process!\n");

This looks like an unintended change.

>   			return -1;
>   		}
>   		if (vfio_mp_sync_send_request(socket_fd, SOCKET_REQ_GROUP) < 0) {
> @@ -122,6 +100,7 @@ vfio_get_group_fd(int iommu_group_no)
>   			close(socket_fd);
>   			return -1;
>   		}
> +
>   		ret = vfio_mp_sync_receive_request(socket_fd);

This looks like an unintended change.

(hint: "git revert -n HEAD && git add -p" is your friend :) )

>   		switch (ret) {
>   		case SOCKET_NO_FD:
> @@ -132,9 +111,6 @@ vfio_get_group_fd(int iommu_group_no)
>   			/* if we got the fd, store it and return it */
>   			if (vfio_group_fd > 0) {
>   				close(socket_fd);
> -				cur_grp->group_no = iommu_group_no;
> -				cur_grp->fd = vfio_group_fd;
> -				vfio_cfg.vfio_active_groups++;
>   				return vfio_group_fd;
>   			}
>   			/* fall-through on error */
> @@ -147,70 +123,349 @@ vfio_get_group_fd(int iommu_group_no)
>   	return -1;

<...>

> +int __rte_experimental
> +rte_vfio_create_container(void)
> +{
> +	struct vfio_config *vfio_cfg;
> +	int i;
> +
> +	/* Find an empty slot to store new vfio config */
> +	for (i = 1; i < VFIO_MAX_CONTAINERS; i++) {
> +		if (vfio_cfgs[i] == NULL)
> +			break;
> +	}
> +
> +	if (i == VFIO_MAX_CONTAINERS) {
> +		RTE_LOG(ERR, EAL, "exceed max vfio container limit\n");
> +		return -1;
> +	}
> +
> +	vfio_cfgs[i] = rte_zmalloc("vfio_container", sizeof(struct vfio_config),
> +		RTE_CACHE_LINE_SIZE);
> +	if (vfio_cfgs[i] == NULL)
> +		return -ENOMEM;

Is there a specific reason why 1) dynamic allocation is used (as opposed 
to just storing a static array), and 2) DPDK memory allocation is used? 
This seems like an unnecessary complication.

Even if you were to decide to allocate memory instead of having a static 
array, you'll have to register for rte_eal_cleanup() to delete any 
allocated containers on DPDK exit. But, as I said, I think it would be 
better to keep it as a static array.
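
To illustrate the static-array alternative, here is a minimal sketch. 
The sketch_* names, the in_use flag and the slot-index return value are 
all assumptions made for this example, and the container fd is passed in 
rather than obtained from vfio_get_container_fd(). Slot 0 is treated as 
the default container, mirroring the loop in the patch that starts at 1.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_CONTAINERS 4

/* Hypothetical, trimmed-down stand-in for struct vfio_config. */
struct sketch_vfio_config {
	bool in_use;
	int container_fd;
	int active_groups;
};

/* Static storage: nothing to allocate, nothing to free at exit. */
static struct sketch_vfio_config sketch_cfgs[SKETCH_MAX_CONTAINERS] = {
	[0] = { .in_use = true, .container_fd = -1 }, /* default container */
};

/* Claim a free slot for container_fd; returns the slot index or -1. */
int
sketch_container_create(int container_fd)
{
	int i;

	assert(sketch_cfgs[0].in_use); /* slot 0 is never handed out */
	if (container_fd < 0)
		return -1;

	for (i = 1; i < SKETCH_MAX_CONTAINERS; i++) {
		if (!sketch_cfgs[i].in_use) {
			sketch_cfgs[i].in_use = true;
			sketch_cfgs[i].container_fd = container_fd;
			sketch_cfgs[i].active_groups = 0;
			return i;
		}
	}
	return -1; /* all slots taken */
}

/* Release a slot; the default container (slot 0) cannot be destroyed. */
int
sketch_container_destroy(int slot)
{
	if (slot < 1 || slot >= SKETCH_MAX_CONTAINERS ||
			!sketch_cfgs[slot].in_use)
		return -1;

	sketch_cfgs[slot] = (struct sketch_vfio_config){ .container_fd = -1 };
	return 0;
}
```

Because the slots live in static storage, there is no allocation 
failure path to handle and no cleanup hook to register.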

> +
> +	RTE_LOG(INFO, EAL, "alloc container at slot %d\n", i);
> +	vfio_cfg = vfio_cfgs[i];
> +	vfio_cfg->vfio_active_groups = 0;
> +	vfio_cfg->vfio_container_fd = vfio_get_container_fd();
> +
> +	if (vfio_cfg->vfio_container_fd < 0) {
> +		rte_free(vfio_cfgs[i]);
> +		vfio_cfgs[i] = NULL;
> +		return -1;
> +	}
> +
> +	for (i = 0; i < VFIO_MAX_GROUPS; i++) {
> +		vfio_cfg->vfio_groups[i].group_no = -1;
> +		vfio_cfg->vfio_groups[i].fd = -1;
> +		vfio_cfg->vfio_groups[i].devices = 0;
> +	}

<...>

> @@ -665,41 +931,80 @@ vfio_get_group_no(const char *sysfs_base,
>   }
>   
>   static int
> -vfio_type1_dma_map(int vfio_container_fd)
> +do_vfio_type1_dma_map(int vfio_container_fd, const struct rte_memseg *ms)

<...>


> +static int
> +do_vfio_type1_dma_unmap(int vfio_container_fd, const struct rte_memseg *ms)

APIs such as these two were recently added to DPDK.

-- 
Thanks,
Anatoly