From: "Burakov, Anatoly"
To: Ferruh Yigit, Kiran Kumar Kokkilagadda
Cc: dev@dpdk.org, Jerin Jacob
Subject: Re: [dpdk-dev] [PATCH v2] kni: add IOVA va support for kni
Date: Thu, 4 Apr 2019 10:57:53 +0100
Message-ID: <4c3ef828-652b-1498-b89b-7b7b58208f46@intel.com>
References: <20180927104846.16356-1-kkokkilagadda@caviumnetworks.com>
 <20190401095118.4176-1-kirankumark@marvell.com>

On 03-Apr-19 5:29 PM, Ferruh Yigit wrote:
> On 4/1/2019 10:51 AM, Kiran Kumar Kokkilagadda wrote:
>> From: Kiran Kumar K
>>
>> With the current KNI implementation the kernel module will work only in
>> IOVA=PA mode. This patch adds support for the kernel module to work
>> in IOVA=VA mode.
>
> Thanks Kiran for removing the limitation. I have a few questions, can you
> please help me understand.
>
> And when this patch is ready, the restriction in 'linux/eal/eal.c', in
> 'rte_eal_init', should be removed, perhaps with this patch. I assume you
> are already doing that to be able to test this patch.
>
>>
>> The idea is to maintain a mapping in the KNI module between user pages
>> and kernel pages, and in the fast path perform a lookup in this table to
>> get the kernel virtual address for the corresponding user virtual
>> address.
>>
>> In IOVA=VA mode, the memory allocated to the pool is physically
>> and virtually contiguous. We will take advantage of this and create a
>> mapping in the kernel. In the kernel we need a mapping for the queues
>> (tx_q, rx_q, ... slow path) and the mbuf memory (fast path).
>
> Is it?
> As far as I know a mempool can have multiple chunks, and they can be both
> virtually and physically separated.
>
> And even for a single chunk, that will be virtually contiguous, but will
> it be physically contiguous? Just to clarify.
Within DPDK, we do not make a distinction between physical addresses and
IOVA addresses - we never need the actual physical address, we just need
DMA addresses. These can either match the physical addresses or be
completely arbitrary (in our case they will match VA addresses, but it
doesn't have to be that way - in fact, 17.11 will, under some
circumstances, populate IOVA addresses simply starting from address 0).

However, one has to remember that an IOVA address *is not a physical
address*. The pages backing a VA chunk may be *IOVA*-contiguous, but not
necessarily *physically* contiguous. Under normal circumstances we really
don't care, because VFIO/IOMMU takes care of the mapping between IOVA and
PA transparently for the hardware.

So, in IOVA-as-VA mode, the memory allocated to the mempool will be
(within a given chunk) both VA- and IOVA-contiguous - but not necessarily
*physically* contiguous! In fact, if you try calling rte_mem_virt2phy()
on the mempool pages, you'll likely find that they aren't (I've seen
cases where pages were mapped /backwards/!).

Therefore, unless by "physically contiguous" you mean "IOVA-contiguous",
you *cannot* rely on memory in the mempool being *physically* contiguous
merely based on the fact that it is IOVA-contiguous.

>
>>
>> At KNI init time, in the slow path, we will create a mapping for the
>> queues and mbufs using get_user_pages, similar to af_xdp. Using the
>> pool memory base address, we will create a page map table for the
>> mbufs, which we will use in the fast path for kernel page translation.
>>
>> At KNI init time, we will pass the base address of the pool and the
>> size of the pool to the kernel. In the kernel, using the get_user_pages
>> API, we will get the pages with size PAGE_SIZE and store the mapping
>> and the start address of user space in a table.
>>
>> In the fast path, for any user address, perform PAGE_SHIFT
>> (user_addr >> PAGE_SHIFT) and subtract the start address from this
>> value; this gives the index of the kernel page within the page map
>> table. Adding the offset to this kernel page address gives the kernel
>> address for this user virtual address.
>>
>> For example, the user pool base address is X and the size is S, which
>> we passed to the kernel. In the kernel we will create a mapping for
>> this using get_user_pages. Our page map table will look like
>> [Y, Y+PAGE_SIZE, Y+(PAGE_SIZE*2) ....] and the user start page will be
>> U (we will get it from X >> PAGE_SHIFT).
>>
>> For any user address Z, we will get the index into the page map table
>> using ((Z >> PAGE_SHIFT) - U). Adding the offset (Z & (PAGE_SIZE - 1))
>> to this page's address will give the kernel virtual address.
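
(As an aside, the lookup described above is just index/offset arithmetic.
A minimal sketch, with purely illustrative names - page_map, start_page
and user_to_kernel are not the actual patch identifiers - assuming 4 KiB
pages and a page map already populated via get_user_pages as described:

#include <stdint.h>

#define PAGE_SHIFT 12                   /* assuming 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct page_map_entry {
	void *kaddr;    /* kernel VA of one pinned user page */
};

/* page_map[i] holds the kernel address of user page (start_page + i) */
static inline void *
user_to_kernel(uint64_t user_addr, uint64_t start_page,
	       const struct page_map_entry *page_map)
{
	uint64_t index  = (user_addr >> PAGE_SHIFT) - start_page;
	uint64_t offset = user_addr & (PAGE_SIZE - 1);

	return (uint8_t *)page_map[index].kaddr + offset;
}

Of course, this only works for addresses that were covered by the
original get_user_pages() mapping - which is where the questions below
come in.)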
>>
>> Signed-off-by: Kiran Kumar K
>
> <...>
>
>> +int
>> +kni_pin_pages(void *address, size_t size, struct page_info *mem)
>> +{
>> +	unsigned int gup_flags = FOLL_WRITE;
>> +	long npgs;
>> +	int err;
>> +
>> +	/* Get at least one page */
>> +	if (size < PAGE_SIZE)
>> +		size = PAGE_SIZE;
>> +
>> +	/* Compute number of user pages based on page size */
>> +	mem->npgs = (size + PAGE_SIZE - 1) / PAGE_SIZE;
>> +
>> +	/* Allocate memory for the pages */
>> +	mem->pgs = kcalloc(mem->npgs, sizeof(*mem->pgs),
>> +			   GFP_KERNEL | __GFP_NOWARN);
>> +	if (!mem->pgs) {
>> +		pr_err("%s: -ENOMEM\n", __func__);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	down_write(&current->mm->mmap_sem);
>> +
>> +	/* Get the user pages from the user address*/
>> +#if LINUX_VERSION_CODE >= KERNEL_VERSION(4,9,0)
>> +	npgs = get_user_pages((u64)address, mem->npgs,
>> +			      gup_flags, &mem->pgs[0], NULL);
>> +#else
>> +	npgs = get_user_pages(current, current->mm, (u64)address, mem->npgs,
>> +			      gup_flags, 0, &mem->pgs[0], NULL);
>> +#endif
>> +	up_write(&current->mm->mmap_sem);
>
> This should work even if the memory is not physically contiguous, right?
> Where exactly does the physically contiguous requirement come from?
>
> <...>
>
>> +
>> +/* Get the kernel address from the user address using
>> + * page map table. Will be used only in IOVA=VA mode
>> + */
>> +static inline void*
>> +get_kva(uint64_t usr_addr, struct kni_dev *kni)
>> +{
>> +	uint32_t index;
>> +	/* User page - start user page will give the index
>> +	 * with in the page map table
>> +	 */
>> +	index = (usr_addr >> PAGE_SHIFT) - kni->va_info.start_page;
>> +
>> +	/* Add the offset to the page address */
>> +	return (kni->va_info.page_map[index].addr +
>> +		(usr_addr & kni->va_info.page_mask));
>> +
>> +}
>> +
>>  /* physical address to kernel virtual address */
>>  static void *
>>  pa2kva(void *pa)
>> @@ -186,7 +205,10 @@ kni_fifo_trans_pa2va(struct kni_dev *kni,
>>  		return;
>>  
>>  	for (i = 0; i < num_rx; i++) {
>> -		kva = pa2kva(kni->pa[i]);
>> +		if (likely(kni->iova_mode == 1))
>> +			kva = get_kva((u64)(kni->pa[i]), kni);
>
> kni->pa[] now has IOVA addresses; for 'get_kva()' to work, shouldn't
> 'va_info.start_page' be calculated from 'mempool_memhdr->iova' instead of
> 'mempool_memhdr->addr'?
>
> If this is working I must be missing something, but I am not able to find
> what it is.
>
> <...>
>
>> @@ -304,6 +304,27 @@ rte_kni_alloc(struct rte_mempool *pktmbuf_pool,
>>  	kni->group_id = conf->group_id;
>>  	kni->mbuf_size = conf->mbuf_size;
>>  
>> +	dev_info.iova_mode = (rte_eal_iova_mode() == RTE_IOVA_VA) ? 1 : 0;
>> +	if (dev_info.iova_mode) {
>> +		struct rte_mempool_memhdr *hdr;
>> +		uint64_t pool_size = 0;
>> +
>> +		/* In each pool header chunk, we will maintain the
>> +		 * base address of the pool. This chunk is physically and
>> +		 * virtually contiguous.
>> +		 * This approach will work, only if the allocated pool
>> +		 * memory is contiguous, else it won't work
>> +		 */
>> +		hdr = STAILQ_FIRST(&pktmbuf_pool->mem_list);
>> +		dev_info.mbuf_va = (void *)(hdr->addr);
>> +
>> +		/* Traverse the list and get the total size of the pool */
>> +		STAILQ_FOREACH(hdr, &pktmbuf_pool->mem_list, next) {
>> +			pool_size += hdr->len;
>> +		}
>
> This code is aware that there may be multiple chunks, but assumes they are
> all contiguous; I don't know if this assumption is correct.

It is in fact incorrect - mempool chunks can be located anywhere in memory.
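
To make that concrete, here is a quick sketch of how one could check it
(not part of the patch; it assumes an initialized EAL, an existing mempool
handle, and that rte_mem_virt2phy() can resolve physical addresses, e.g.
when running as root). It walks the chunk list and checks both whether
successive chunks happen to be adjacent in VA space and whether the pages
within a chunk are physically contiguous; the base page size is used for
simplicity, although with hugepage-backed memory the interesting
boundaries are the hugepage boundaries:

#include <stdio.h>
#include <unistd.h>

#include <rte_memory.h>
#include <rte_mempool.h>

static void
check_mempool_contig(struct rte_mempool *mp)
{
	struct rte_mempool_memhdr *hdr;
	size_t pg_sz = (size_t)getpagesize();
	char *prev_end = NULL;
	size_t off;

	STAILQ_FOREACH(hdr, &mp->mem_list, next) {
		/* successive chunks are not guaranteed to be VA-adjacent */
		if (prev_end != NULL && (char *)hdr->addr != prev_end)
			printf("chunk %p does not follow the previous chunk\n",
			       hdr->addr);
		prev_end = (char *)hdr->addr + hdr->len;

		/* pages within a chunk may be IOVA-contiguous yet not
		 * physically contiguous
		 */
		for (off = pg_sz; off < hdr->len; off += pg_sz) {
			phys_addr_t prev = rte_mem_virt2phy(
					(char *)hdr->addr + off - pg_sz);
			phys_addr_t cur = rte_mem_virt2phy(
					(char *)hdr->addr + off);

			if (prev == RTE_BAD_PHYS_ADDR || cur != prev + pg_sz)
				printf("offset %zu: not physically contiguous "
				       "with the previous page\n", off);
		}
	}
}

The point is that in IOVA=VA mode only VA/IOVA contiguity within a chunk
is guaranteed, so anything that relies on a single pool-wide base address,
or on real physical contiguity, has to verify it explicitly.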
>
> Also, I guess there is another assumption: that there will be a single
> pktmbuf_pool in the application, which is passed into KNI?
> What if there are multiple pktmbuf_pools, like one for each PMD - will
> this work? Then some mbufs in the KNI Rx FIFO will come from a different
> pktmbuf_pool whose pages we don't know, so we won't be able to get their
> kernel virtual addresses.
>

--
Thanks,
Anatoly