From mboxrd@z Thu Jan  1 00:00:00 1970
To: Kiran Kumar Kokkilagadda <kirankumark@marvell.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, Jerin Jacob <jerin.jacob@caviumnetworks.com>
References: <20180927104846.16356-1-kkokkilagadda@caviumnetworks.com>
 <20190401095118.4176-1-kirankumark@marvell.com>
From: Ferruh Yigit <ferruh.yigit@intel.com>
Message-ID: <ff476309-7384-314b-ccf5-92d52a209eeb@intel.com>
Date: Wed, 3 Apr 2019 17:29:25 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101
 Thunderbird/60.6.1
MIME-Version: 1.0
In-Reply-To: <20190401095118.4176-1-kirankumark@marvell.com>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Subject: Re: [dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
List-Id: DPDK patches and discussions <dev.dpdk.org>

On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
> From: Kiran Kumar K <kirankumark@marvell.com>
> 
> With the current KNI implementation, the kernel module will work only in
> IOVA=PA mode. This patch adds support for the kernel module to work
> in IOVA=VA mode.

Thanks Kiran for removing the limitation. I have a few questions; can you
please help me understand?

Also, when this patch is ready, the restriction in 'linux/eal/eal.c', in
'rte_eal_init', should be removed, perhaps within this patch. I assume you are
already doing this to be able to test the patch.
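
For reference, the check I mean looks roughly like the following (quoted from
memory, so the exact log message and its placement inside 'rte_eal_init' may
differ):

	/* Workaround for KNI which requires physical address to work */
	if (iova_mode == RTE_IOVA_VA &&
			rte_eal_check_module("rte_kni") == 1) {
		iova_mode = RTE_IOVA_PA;
		RTE_LOG(WARNING, EAL,
			"Forcing IOVA as 'PA' because KNI module is loaded\n");
	}

Once the KNI kernel module handles IOVA=VA, this forcing to PA mode can go
away.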

> 
> The idea is to maintain a mapping in KNI module between user pages and
> kernel pages and in fast path perform a lookup in this table and get
> the kernel virtual address for corresponding user virtual address.
> 
> In IOVA=VA mode, the memory allocated to the pool is physically
> and virtually contiguous. We will take advantage of this and create a
> mapping in the kernel. In the kernel we need a mapping for the queues
> (tx_q, rx_q, ... slow path) and the mbuf memory (fast path).

Is it?
As far as I know a mempool can have multiple chunks, and they can be both
virtually and physically separated.

And even a single chunk will be virtually contiguous, but will it be
physically contiguous?

> 
> At the KNI init time, in slow path we will create a mapping for the
> queues and mbuf using get_user_pages similar to af_xdp. Using pool
> memory base address, we will create a page map table for the mbuf,
> which we will use in the fast path for kernel page translation.
> 
> At KNI init time, we will pass the base address of the pool and size of
> the pool to kernel. In kernel, using get_user_pages API, we will get
> the pages with size PAGE_SIZE and store the mapping and start address
> of user space in a table.
> 
> In the fast path, for any user address we perform a page shift
> (user_addr >> PAGE_SHIFT) and subtract the start page from this value;
> this gives the index of the kernel page within the page map table.
> Adding the offset to this kernel page address gives the kernel address
> for this user virtual address.
> 
> For example, the user pool base address is X and the size is S, which we
> passed to the kernel. In the kernel we will create a mapping for this using
> get_user_pages.
> Our page map table will look like [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2) ....]
> and user start page will be U (we will get it from X >> PAGE_SHIFT).
> 
> For any user address Z we will get the index of the page map table using
> ((Z >> PAGE_SHIFT) - U). Adding offset (Z & (PAGE_SIZE - 1)) to this
> address will give kernel virtual address.
> 
> Signed-off-by: Kiran Kumar K <kirankumark@marvell.com>
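
To make the arithmetic above concrete, here is a tiny standalone example with
made-up values (PAGE_SHIFT of 12 is assumed; nothing below comes from the
patch, it is just for illustration):

	#include <inttypes.h>
	#include <stdio.h>

	#define PAGE_SHIFT 12
	#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

	int main(void)
	{
		uint64_t x = 0x7f0000002000ULL; /* pool base passed to kernel (X) */
		uint64_t u = x >> PAGE_SHIFT;   /* user start page (U) */
		uint64_t z = 0x7f0000003010ULL; /* user address inside the pool (Z) */

		/* Index into the page map table and offset within that page */
		uint64_t index  = (z >> PAGE_SHIFT) - u; /* 1 */
		uint64_t offset = z & (PAGE_SIZE - 1);   /* 0x10 */

		printf("index=%" PRIu64 " offset=0x%" PRIx64 "\n", index, offset);
		return 0;
	}

So the kernel address in this example would be page_map[1].addr + 0x10.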

<...>

> +int
> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
> +{
> +	unsigned int gup_flags = FOLL_WRITE;
> +	long npgs;
> +	int err;
> +
> +	/* Get at least one page */
> +	if (size < PAGE_SIZE)
> +		size = PAGE_SIZE;
> +
> +	/* Compute number of user pages based on page size */
> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
> +
> +	/* Allocate memory for the pages */
> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
> +		      GFP_KERNEL | __GFP_NOWARN);
> +	if (!mem->pgs) {
> +		pr_err("%s: -ENOMEM\n", __func__);
> +		return -ENOMEM;
> +	}
> +
> +	down_write(&current->mm->mmap_sem);
> +
> +	/* Get the user pages from the user address*/
> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
> +	npgs = get_user_pages((u64)address, mem->npgs,
> +				gup_flags, &mem->pgs[0], NULL);
> +#else
> +	npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
> +				gup_flags, 0, &mem->pgs[0], NULL);
> +#endif
> +	up_write(&current->mm->mmap_sem);

This should work even if the memory is not physically contiguous, right? Where
exactly does the physically contiguous requirement come from?

<...>

> +
> +/* Get the kernel address from the user address using
> + * page map table. Will be used only in IOVA=VA mode
> + */
> +static inline void*
> +get_kva(uint64_t usr_addr, struct kni_dev *kni)
> +{
> +	uint32_t index;
> +	/* User page - start user page will give the index
> +	 * with in the page map table
> +	 */
> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
> +
> +	/* Add the offset to the page address */
> +	return (kni->va_info.page_map[index].addr +
> +		(usr_addr & kni->va_info.page_mask));
> +
> +}
> +
>  /* physical address to kernel virtual address */
>  static void *
>  pa2kva(void *pa)
> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
>  			return;
> 
>  		for (i = 0; i < num_rx; i++) {
> -			kva = pa2kva(kni->pa[i]);
> +			if (likely(kni->iova_mode == 1))
> +				kva = get_kva((u64)(kni->pa[i]), kni);

kni->pa[] now holds IOVA addresses, so for 'get_kva()' to work, shouldn't
'va_info.start_page' be calculated from 'mempool_memhdr->iova' instead of
'mempool_memhdr->addr'?

If this is working, I must be missing something, but I am not able to find
what it is.
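
A minimal sketch of what I mean, with the types reduced to the two relevant
fields (the real ones live in 'struct rte_mempool_memhdr'; PAGE_SHIFT of 12
is assumed):

	#include <stdint.h>

	#define PAGE_SHIFT 12

	struct memhdr { void *addr; uint64_t iova; }; /* reduced stand-in */

	/* When the fifo entries are IOVAs (IOVA=VA mode), I would expect
	 * the start page to be derived from the chunk's iova, not its addr.
	 */
	static uint64_t start_page(const struct memhdr *hdr)
	{
		return hdr->iova >> PAGE_SHIFT;
	}

With 'start_page' built from 'addr' instead, the subtraction in 'get_kva()'
would be mixing an IOVA with a VA-based page number.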

<...>

> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>  	kni->group_id = conf->group_id;
>  	kni->mbuf_size = conf->mbuf_size;
> 
> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
> +	if (dev_info.iova_mode) {
> +		struct rte_mempool_memhdr *hdr;
> +		uint64_t pool_size = 0;
> +
> +		/* In each pool header chunk, we will maintain the
> +		 * base address of the pool. This chunk is physically and
> +		 * virtually contiguous.
> +		 * This approach will work, only if the allocated pool
> +		 * memory is contiguous, else it won't work
> +		 */
> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
> +		dev_info.mbuf_va = (void *)(hdr->addr);
> +
> +		/* Traverse the list and get the total size of the pool */
> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
> +			pool_size += hdr->len;
> +		}

This code is aware that there may be multiple chunks, but it assumes they are
all contiguous; I don't know if that assumption is correct.

Also, I guess there is another assumption that there will be a single
pktmbuf_pool in the application, which is passed into KNI?
What if there are multiple pktmbuf_pools, like one per PMD, will this work?
Then some mbufs in the KNI Rx fifo will come from a different pktmbuf_pool
whose pages we don't know, so we won't be able to get their kernel virtual
addresses.
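
For the first point, this is the kind of check I would expect before trusting
a single (base, size) view of the pool: walk the chunk list and verify that
each chunk ends where the next one starts (a sketch in the context of the
patch; the error handling and return value are my own choice):

	struct rte_mempool_memhdr *hdr, *nxt;
	uint64_t pool_size = 0;

	STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
		nxt = STAILQ_NEXT(hdr, next);
		if (nxt != NULL &&
		    (char *)hdr->addr + hdr->len != (char *)nxt->addr)
			return -1; /* chunks are not virtually contiguous */
		pool_size += hdr->len;
	}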
