From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com (mx1.redhat.com [209.132.183.28]) by dpdk.org (Postfix) with ESMTP id 89F6B1094 for ; Thu, 6 Jul 2017 12:59:09 +0200 (CEST) Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 9294E4E024; Thu, 6 Jul 2017 10:59:08 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mx1.redhat.com 9294E4E024 Authentication-Results: ext-mx09.extmail.prod.ext.phx2.redhat.com; dmarc=none (p=none dis=none) header.from=redhat.com Authentication-Results: ext-mx09.extmail.prod.ext.phx2.redhat.com; spf=pass smtp.mailfrom=maxime.coquelin@redhat.com DKIM-Filter: OpenDKIM Filter v2.11.0 mx1.redhat.com 9294E4E024 Received: from [10.36.112.14] (unknown [10.36.112.14]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 375D97767D; Thu, 6 Jul 2017 10:59:05 +0000 (UTC) To: Jerin Jacob Cc: Santosh Shukla , thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org, hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com References: <20170608110513.22548-1-santosh.shukla@caviumnetworks.com> <20170608110513.22548-8-santosh.shukla@caviumnetworks.com> <730e333b-a9ab-df8b-cf7a-1e0186c6152d@redhat.com> <20170705154314.GA4635@jerin> <2fe366fb-15fa-f754-458e-3f4e8be18699@redhat.com> <20170706094939.GA1709@jerin> From: Maxime Coquelin Message-ID: <89425d75-3f79-d3e8-f0b1-330292866bbb@redhat.com> Date: Thu, 6 Jul 2017 12:59:04 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.1.0 MIME-Version: 1.0 In-Reply-To: <20170706094939.GA1709@jerin> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Thu, 06 Jul 2017 10:59:08 +0000 (UTC) Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode before mapping X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 06 Jul 2017 10:59:10 -0000 On 07/06/2017 11:49 AM, Jerin Jacob wrote: > -----Original Message----- >> Date: Thu, 6 Jul 2017 09:58:41 +0200 >> From: Maxime Coquelin >> To: Jerin Jacob >> CC: Santosh Shukla , >> thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org, >> hemant.agrawal@nxp.com, shreyansh.jain@nxp.com, gaetan.rivet@6wind.com >> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode >> before mapping >> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 >> Thunderbird/52.1.0 >> >> >> >> On 07/05/2017 05:43 PM, Jerin Jacob wrote: >>> -----Original Message----- >>>> Date: Wed, 5 Jul 2017 11:14:01 +0200 >>>> From: Maxime Coquelin >>>> To: Santosh Shukla , >>>> thomas@monjalon.net, bruce.richardson@intel.com, dev@dpdk.org >>>> CC: jerin.jacob@caviumnetworks.com, hemant.agrawal@nxp.com, >>>> shreyansh.jain@nxp.com, gaetan.rivet@6wind.com >>>> Subject: Re: [dpdk-dev] [PATCH 07/10] linuxapp/eal_vfio: honor iova mode >>>> before mapping >>>> User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 >>>> Thunderbird/52.1.0 >>>> >>>> >>>> >>>> On 06/08/2017 01:05 PM, Santosh Shukla wrote: >>>>> Check iova mode and accordingly map iova to pa or va. >>>>> >>>>> Signed-off-by: Santosh Shukla >>>>> Signed-off-by: Jerin Jacob >>>>> --- >>>>> lib/librte_eal/linuxapp/eal/eal_vfio.c | 10 ++++++++-- >>>>> 1 file changed, 8 insertions(+), 2 deletions(-) >>>>> >>>>> diff --git a/lib/librte_eal/linuxapp/eal/eal_vfio.c b/lib/librte_eal/linuxapp/eal/eal_vfio.c >>>>> index 04914406f..348b7a7f4 100644 >>>>> --- a/lib/librte_eal/linuxapp/eal/eal_vfio.c >>>>> +++ b/lib/librte_eal/linuxapp/eal/eal_vfio.c >>>>> @@ -706,7 +706,10 @@ vfio_type1_dma_map(int vfio_container_fd) >>>>> dma_map.argsz = sizeof(struct vfio_iommu_type1_dma_map); >>>>> dma_map.vaddr = ms[i].addr_64; >>>>> dma_map.size = ms[i].len; >>>>> - dma_map.iova = ms[i].phys_addr; >>>>> + if (rte_eal_iova_mode() == RTE_IOVA_VA) >>>>> + dma_map.iova = dma_map.vaddr; >>>>> + else >>>>> + dma_map.iova = ms[i].phys_addr; >>>>> dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE; >>>> >>>> IIUC, it is changing default behavior for VFIO devices. >>>> >>>> I see a possible problem, but I'm not sure the case is valid. >>>> >>>> Imagine you have two devices in the iommu group, and the two devices are >>>> used in separate processes. Each process could try two different >>>> physical addresses at the same virtual address, and so the second map >>>> would fail. >>> >>> IMO, Doesn't look like a problem. Here is the data flow >>> >>> 1) The vfio DMA map function(vfio_type1_dma_map()) will be called only >>> on primary process >>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_vfio.c#n359 >>> >>> 2) On secondary process, DPDK rte_eal_huge_page_attach() will make sure >>> that, the Secondary process has the _same_ virtual address as primary or >>> exit from on attach. >>> http://dpdk.org/browse/dpdk/tree/lib/librte_eal/linuxapp/eal/eal_memory.c#n1452 >>> >>> 3) Since secondary process adds the mapped the virtual address in step (2). >>> in the page table in OS. On SMMU entry miss(When device >>> request from I/O transaction), OS will load the mapping and update the SMMU >>> "context" with page tables from MMU. >> >> Ok thanks for the detailed info, but what about the case where the same >> iommu group is used by two primary processes? > > Does that case exist with DPDK? We always need to blacklist same BDF in > the secondary process to make things work with existing DPDK setup. Which > make sense as well. Only primary process configures the HW blocks. I meant the case when two BDF are in the same IOMMU group (if ACS is not supported at some point in the hierarchy). And I meant two primary processes running, like for example two containers running each a DPDK application. Maybe this is not a valid use-case (it is not secure, as it would break isolation between the two containers), but it seems that it is something DPDK allows today, if I'm not mistaken. >> >> I don't know how frequent it is, but if ACS is not supported by either the >> endpoint or the the root port, then you would have to share the same IOMMU >> group for all the ports of your card. Right? > > ACS is supported in our card(it not in bypass mode) and one mempool PCI BDF > comes as a IOMMU group. > > If it in bypass mode anyway you use in vfio-noiommu mode as > there is no protection anyway. Yes. >> >>> Let me add the background for why this feature is required in DPDK to >>> enable NPU style co-processors. >>> >>> The traditional NICs the Rx path code look like this: >>> 1) On control path, Fill the mempool with buffers >>> 2) on rx_burst(), alloc the mbuf from mempool >>> 3) SW has the mbuf in hand(which is a virtual address) and program the >>> HW with mbuf->buf_physaddr) >>> 4) Return the last pushed mbuf(will be updated by HW by now) >>> >>> >>> On NPU style co-processors, situation is different as the buffer recycling >>> has been done in HW unlike SW model. Here is the data flow: >>> 1) On control path, Fill the HW mempool with buffers(Obviously the IOVA >>> address, which is PA in existing model) >>> 2) on rx_burst, HW gives you IOVA address(as address as step 1) >>> 3) As application expects VA to operate on it, rx_burst() needs to >>> convert to VA from PAA. Which is very costly. >>> Instead with this IOVA as VA scheme, We can avoid the cost of converting >>> with help of IOMMU/SMMU. >>> >>> This patch set auto detects the mode based available of type devices in >>> bus and provides an option to override mode based on eal argument, so we >>> don't foresee any issue with this approach and welcome any alternative >>> approaches. >> >> I don't question the need of the feature for these kind of >> co-processors, using VA as IOVA in your case seems very valid. >> >> What concerns me is that we change the default behavior for all other >> devices. Having an option to override is fine to me, but the default >> mode should remain the same IMHO. > > Doesn't seems to be a technical point. But I agree with your concern. > we will address it. > I think, we have two ways to address it. > > option 1: > - In existing patch, > a) we are currently setting(internal_cfg->iova_mode = RTE_IOVA_PA) > http://dpdk.org/dev/patchwork/patch/25192 > b) only when with eal argument sets to RTE_IOVA_VA and then bus probed > value == RTE_IOVA_VA the final mode will be RTE_IOVA_VA > http://dpdk.org/dev/patchwork/patch/25193/ > check the code after rte_bus_scan() > > option 2: > On rte_pci_get_iommu_class() in http://dpdk.org/dev/patchwork/patch/25190/ > we can check the rte_pci_device.id.vendor_id == CAVIUM to select the > mode so other type of devices safe. > > I think, option 2 makes sense, as it gives foolproof auto detection scheme and > without effecting any other devices that not interested in this scheme > > Does that address your concern about the patchset? Yes it does, or maybe create a new flag in struct rte_pci_driver's drv_flags to provide a hint it prefers to use VA as IOVA? It, of course, would just be a hint, and should be set only when other conditions are met. >> Wouldn't it be possible to default to VA as IOVA only when an HW mempool >> is in use? > > It will be too late as in the normal scheme of things, application > creates the pool. OK, makes sense. Thanks, Maxime >> >>> Similar problem exists in another part of the code in DPDK, >>> http://dpdk.org/browse/dpdk/tree/drivers/bus/fslmc/fslmc_vfio.c#n231 >>> Its a conditional compilation based approach with duplicating the vfio >>> code and we are trying to fix the problem in a generic way so that >>> everyone can get benefited out of it. >>> >>> Comments are welcome. >> >> Thanks, >> Maxime >> >>> /Jerin >>> >>>> >>>> By using physical addresses, you are safe against this problem. >>>> >>>> Any thoughts? >>>> >>>> Cheers, >>>> Maxime