From mboxrd@z Thu Jan 1 00:00:00 1970
To: Kiran Kumar Kokkilagadda
Cc: "dev@dpdk.org", Jerin Jacob
References: <20180927104846.16356-1-kkokkilagadda@caviumnetworks.com>
 <20190401095118.4176-1-kirankumark@marvell.com>
From: Ferruh Yigit <ferruh.yigit@intel.com>
Message-ID: <0a9cc712-b72c-2c73-9483-d157d476796e@intel.com>
Date: Thu, 4 Apr 2019 12:20:38 +0100
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v2] kni: add IOVA va support for kni

On 4/4/2019 6:03 AM, Kiran Kumar Kokkilagadda wrote:
>
>
>> -----Original Message-----
>> From: Ferruh Yigit
>> Sent: Wednesday, April 3, 2019 9:59 PM
>> To: Kiran Kumar Kokkilagadda
>> Cc: dev@dpdk.org; Jerin Jacob
>> Subject: [EXT] Re: [dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
>>
>> External Email
>>
>> ----------------------------------------------------------------------
>> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
>>> From: Kiran Kumar K
>>>
>>> With the current KNI implementation the kernel module will work only in
>>> IOVA=PA mode. This patch will add support for the kernel module to work
>>> with IOVA=VA mode.
>>
>> Thanks Kiran for removing the limitation. I have a few questions, can you
>> please help me understand.
>>
>> And when this patch is ready, the restriction in 'linux/eal/eal.c', in
>> 'rte_eal_init', should be removed, perhaps with this patch. I assume you
>> are already doing that to be able to test this patch.
>>
>
> User can choose the mode by passing --iova-mode=. I will remove the rte_kni
> module restriction.
>
>
>>>
>>> The idea is to maintain a mapping in the KNI module between user pages and
>>> kernel pages, and in the fast path perform a lookup in this table to get
>>> the kernel virtual address for the corresponding user virtual address.
>>>
>>> In IOVA=VA mode, the memory allocated to the pool is physically and
>>> virtually contiguous. We will take advantage of this and create a
>>> mapping in the kernel. In the kernel we need a mapping for the queues
>>> (tx_q, rx_q, ... slow path) and the mbuf memory (fast path).
>>
>> Is it?
>> As far as I know a mempool can have multiple chunks and they can be both
>> virtually and physically separated.
>>
>> And even for a single chunk, that will be virtually contiguous, but will it
>> be physically contiguous?
>>
>
> You are right, it does not have to be physically contiguous. Will change the
> description.
>
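To illustrate the multiple-chunks point, below is a small user-space helper
(not from the patch; the function name is mine) that just walks the pool's
'mem_list' and prints each chunk, so the layout can be checked on a given
setup:

  #include <stdio.h>
  #include <inttypes.h>
  #include <rte_mempool.h>

  /* Dump every memory chunk backing a mempool: its user VA, its IOVA and
   * its length. Nothing guarantees that consecutive entries are adjacent,
   * either virtually or physically. */
  static void
  dump_pool_chunks(struct rte_mempool *mp)
  {
  	struct rte_mempool_memhdr *hdr;
  	unsigned int i = 0;

  	STAILQ_FOREACH(hdr, &mp->mem_list, next)
  		printf("chunk %u: addr=%p iova=0x%" PRIx64 " len=%zu\n",
  		       i++, hdr->addr, (uint64_t)hdr->iova, hdr->len);
  }

Running something like this against a pktmbuf pool on a system with
fragmented hugepage memory may well show more than one chunk.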
>>>
>>> At KNI init time, in the slow path, we will create a mapping for the
>>> queues and mbufs using get_user_pages, similar to af_xdp. Using the pool
>>> memory base address, we will create a page map table for the mbufs,
>>> which we will use in the fast path for kernel page translation.
>>>
>>> At KNI init time, we will pass the base address of the pool and the size
>>> of the pool to the kernel. In the kernel, using the get_user_pages API,
>>> we will get the pages with size PAGE_SIZE and store the mapping and the
>>> start address of the user space region in a table.
>>>
>>> In the fast path, for any user address, perform the page shift
>>> (user_addr >> PAGE_SHIFT) and subtract the start address from this value;
>>> this gives the index of the kernel page within the page map table.
>>> Adding the offset to this kernel page address, we will get the kernel
>>> address for this user virtual address.
>>>
>>> For example, the user pool base address is X, and the size is S, that we
>>> passed to the kernel. In the kernel we will create a mapping for this
>>> using get_user_pages. Our page map table will look like
>>> [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2) ....] and the user start page will be U
>>> (we will get it from X >> PAGE_SHIFT).
>>>
>>> For any user address Z we will get the index into the page map table
>>> using ((Z >> PAGE_SHIFT) - U). Adding the offset (Z & (PAGE_SIZE - 1)) to
>>> this address will give the kernel virtual address.
>>>
>>> Signed-off-by: Kiran Kumar K
>>
>> <...>
>>
>>> +int
>>> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
>>> +{
>>> +	unsigned int gup_flags = FOLL_WRITE;
>>> +	long npgs;
>>> +	int err;
>>> +
>>> +	/* Get at least one page */
>>> +	if (size < PAGE_SIZE)
>>> +		size = PAGE_SIZE;
>>> +
>>> +	/* Compute number of user pages based on page size */
>>> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
>>> +
>>> +	/* Allocate memory for the pages */
>>> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
>>> +			   GFP_KERNEL | __GFP_NOWARN);
>>> +	if (!mem->pgs) {
>>> +		pr_err("%s: -ENOMEM\n", __func__);
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	down_write(&current->mm->mmap_sem);
>>> +
>>> +	/* Get the user pages from the user address */
>>> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
>>> +	npgs = get_user_pages((u64)address, mem->npgs,
>>> +			      gup_flags, &mem->pgs[0], NULL);
>>> +#else
>>> +	npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
>>> +			      gup_flags, 0, &mem->pgs[0], NULL);
>>> +#endif
>>> +	up_write(&current->mm->mmap_sem);
>>
>> This should work even if the memory is not physically contiguous, right?
>> Where exactly does the physically contiguous requirement come from?
>>
>
> Yes, it is not necessary to be physically contiguous.
>
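Just to make sure I follow the lookup described in the commit message, here
is the index/offset arithmetic written out as a standalone user-space snippet
with made-up addresses (assuming 4K pages); it only mirrors the math, it is
not the kernel code:

  #include <stdio.h>
  #include <stdint.h>
  #include <inttypes.h>

  #define PG_SHIFT 12                      /* assume 4K pages */
  #define PG_SIZE  (1ULL << PG_SHIFT)
  #define PG_MASK  (PG_SIZE - 1)

  int main(void)
  {
  	uint64_t base = 0x7f0000000000ULL;        /* X: user VA of the pool */
  	uint64_t start_page = base >> PG_SHIFT;   /* U */
  	/* Y...: kernel VAs of the pinned pages; they need not be adjacent,
  	 * which is why a per-page table is kept at all. */
  	uint64_t page_map[4] = { 0xffff888001234000ULL, 0xffff888005678000ULL,
  				 0xffff88800abcd000ULL, 0xffff888003456000ULL };

  	uint64_t z = base + 0x2a80;                      /* some user address Z */
  	uint64_t index = (z >> PG_SHIFT) - start_page;   /* = 2 */
  	uint64_t kva = page_map[index] + (z & PG_MASK);  /* page VA + 0xa80 */

  	printf("index=%" PRIu64 " kva=0x%" PRIx64 "\n", index, kva);
  	return 0;
  }

If I read the patch right, get_kva() below does exactly this with
kni->va_info.start_page and kni->va_info.page_map.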
>> <...>
>>
>>> +
>>> +/* Get the kernel address from the user address using
>>> + * page map table. Will be used only in IOVA=VA mode
>>> + */
>>> +static inline void *get_kva(uint64_t usr_addr, struct kni_dev *kni)
>>> +{
>>> +	uint32_t index;
>>> +	/* User page - start user page will give the index
>>> +	 * within the page map table
>>> +	 */
>>> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
>>> +
>>> +	/* Add the offset to the page address */
>>> +	return (kni->va_info.page_map[index].addr +
>>> +		(usr_addr & kni->va_info.page_mask));
>>> +
>>> +}
>>> +
>>>  /* physical address to kernel virtual address */
>>>  static void *
>>>  pa2kva(void *pa)
>>> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
>>>  		return;
>>>
>>>  	for (i = 0; i < num_rx; i++) {
>>> -		kva = pa2kva(kni->pa[i]);
>>> +		if (likely(kni->iova_mode == 1))
>>> +			kva = get_kva((u64)(kni->pa[i]), kni);
>>
>> kni->pa[] now has iova addresses; for 'get_kva()' to work, shouldn't
>> 'va_info.start_page' be calculated from 'mempool_memhdr->iova' instead of
>> 'mempool_memhdr->addr'?
>>
>> If this is working I must be missing something, but I am not able to find
>> what it is.
>>
>> <...>
>>
>
> In IOVA=VA mode, both the values will be same right?

I don't know for sure, but according to my understanding 'mempool_memhdr->addr'
is the application virtual address and 'mempool_memhdr->iova' is the device
virtual address, which can be anything as long as the IOMMU is configured
correctly. So technically ->addr and ->iova can be the same, but I don't know
if that is always the case in DPDK.

>
>
>>> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>>>  	kni->group_id = conf->group_id;
>>>  	kni->mbuf_size = conf->mbuf_size;
>>>
>>> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
>>> +	if (dev_info.iova_mode) {
>>> +		struct rte_mempool_memhdr *hdr;
>>> +		uint64_t pool_size = 0;
>>> +
>>> +		/* In each pool header chunk, we will maintain the
>>> +		 * base address of the pool. This chunk is physically and
>>> +		 * virtually contiguous.
>>> +		 * This approach will work, only if the allocated pool
>>> +		 * memory is contiguous, else it won't work
>>> +		 */
>>> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
>>> +		dev_info.mbuf_va = (void *)(hdr->addr);
>>> +
>>> +		/* Traverse the list and get the total size of the pool */
>>> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
>>> +			pool_size += hdr->len;
>>> +		}
>>
>> This code is aware that there may be multiple chunks, but it assumes they
>> are all contiguous; I don't know if this assumption is correct.
>>
>> Also I guess there is another assumption that there will be a single
>> pktmbuf_pool in the application, which is the one passed into kni?
>> What if there are multiple pktmbuf_pools, like one for each PMD; will this
>> work?
>> Then some mbufs in the kni Rx fifo will come from a different pktmbuf_pool
>> whose pages we don't know, so we won't be able to get their kernel virtual
>> address.
>
> All these chunks have to be virtually contiguous, otherwise this approach
> will not work. Here one thing we can do is create the mapping for the
> complete huge page, and make sure all the mbuf_mempools will be within this
> range. As the huge page size will be big (x64 it is 1G, ARM 512MB), we may
> not exceed the size, even with multiple pktmbuf_pools. If mbuf_pools were
> created outside this huge page, this approach will not work and we will fall
> back to PA mode.
>

Those were also my two points:
- This requires all chunks to be virtually contiguous
- There should be only one pktmbuf_pool in the system for this to work

But both of the above assumptions are not correct. And as far as I can see
there is no way to detect them and act accordingly.
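If the patch keeps this approach, the virtual-contiguity assumption would at
least need an explicit check when the pool is handed to KNI, so that
rte_kni_alloc() can fall back to PA mode when the chunks do not form one VA
range. A rough, untested sketch (the helper name is made up), which still
does nothing for mbufs coming from a different pktmbuf_pool:

  #include <stdint.h>
  #include <rte_mempool.h>

  /* Return 1 and report (base, size) if all chunks of the pool form one
   * contiguous virtual range, 0 otherwise. */
  static int
  kni_pool_is_one_va_range(struct rte_mempool *mp, void **base, uint64_t *size)
  {
  	struct rte_mempool_memhdr *hdr;
  	uintptr_t lo = UINTPTR_MAX, hi = 0;
  	uint64_t total = 0;

  	STAILQ_FOREACH(hdr, &mp->mem_list, next) {
  		uintptr_t start = (uintptr_t)hdr->addr;

  		if (start < lo)
  			lo = start;
  		if (start + hdr->len > hi)
  			hi = start + hdr->len;
  		total += hdr->len;
  	}

  	if (total == 0 || (uint64_t)(hi - lo) != total)
  		return 0;	/* empty pool, or holes between the chunks */

  	*base = (void *)lo;
  	*size = total;
  	return 1;
  }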