Subject: Re: [dpdk-stable] [dpdk-dev] [EXT] Re: [PATCH 2/2] vfio: fix partial DMA unmapping for VFIO type1
From: "Burakov, Anatoly"
To: Nithin Dabilpuram
Cc: Jerin Jacob, dev@dpdk.org, stable@dpdk.org
Date: Wed, 28 Oct 2020 16:07:17 +0000
Message-ID: <284e8b96-c61d-ce5b-8aab-ecafbb0faefa@intel.com>

On 28-Oct-20 2:17 PM, Nithin Dabilpuram wrote:
> On Wed, Oct 28, 2020 at 01:04:26PM +0000, Burakov, Anatoly wrote:
>> On 22-Oct-20 1:13 PM, Nithin Dabilpuram wrote:
>>> Ping.
>>>
>>> On Mon, Oct 19, 2020 at 03:13:15PM +0530, Nithin Dabilpuram wrote:
>>>> On Sat, Oct 17, 2020 at 05:14:55PM +0100, Burakov, Anatoly wrote:
>>>>> On 16-Oct-20 8:10 AM, Nithin Dabilpuram wrote:
>>>>>> On Thu, Oct 15, 2020 at 04:10:31PM +0100, Burakov, Anatoly wrote:
>>>>>>> On 15-Oct-20 12:57 PM, Nithin Dabilpuram wrote:
>>>>>>>> On Thu, Oct 15, 2020 at 3:31 PM Burakov, Anatoly wrote:
>>>>>>>>>
>>>>>>>>> On 15-Oct-20 7:09 AM, Nithin Dabilpuram wrote:
>>>>>>>>>> On Wed, Oct 14, 2020 at 04:07:10PM +0100, Burakov, Anatoly wrote:
>>>>>>>>>>> On 12-Oct-20 9:11 AM, Nithin Dabilpuram wrote:
>>>>>>>>>>>> Partial unmapping is not supported for VFIO IOMMU type1
>>>>>>>>>>>> by the kernel. Though the kernel returns zero, the unmapped size
>>>>>>>>>>>> returned will not be the same as expected. So check the
>>>>>>>>>>>> returned unmap size and return an error.
>>>>>>>>>>>>
>>>>>>>>>>>> For the case of DMA map/unmap triggered by heap allocations,
>>>>>>>>>>>> maintain granularity of memseg page size so that heap
>>>>>>>>>>>> expansion and contraction does not have this issue.
>>>>>>>>>>>
>>>>>>>>>>> This is quite unfortunate, because there was a different bug that had to do
>>>>>>>>>>> with the kernel having a very limited number of mappings available [1], as a
>>>>>>>>>>> result of which the page concatenation code was added.
>>>>>>>>>>>
>>>>>>>>>>> It should therefore be documented that the dma_entry_limit parameter should
>>>>>>>>>>> be adjusted should the user run out of DMA entries.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://lore.kernel.org/lkml/155414977872.12780.13728555131525362206.stgit@gimli.home/T/
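
The check the commit message describes boils down to comparing the size that
VFIO_IOMMU_UNMAP_DMA reports back against the size that was requested. A
minimal standalone sketch of that check (not the actual eal_vfio.c code; the
helper name, includes and reduced error handling are illustrative only):

/*
 * Sketch: issue VFIO_IOMMU_UNMAP_DMA on a type1 container and verify that
 * the kernel really unmapped the whole requested range. 'container_fd' is
 * assumed to be an open /dev/vfio/vfio fd with the type1 IOMMU already set.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int
unmap_dma_checked(int container_fd, uint64_t iova, uint64_t len)
{
	struct vfio_iommu_type1_dma_unmap dma_unmap;

	memset(&dma_unmap, 0, sizeof(dma_unmap));
	dma_unmap.argsz = sizeof(dma_unmap);
	dma_unmap.iova = iova;
	dma_unmap.size = len;

	if (ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &dma_unmap) != 0) {
		fprintf(stderr, "cannot clear DMA remapping: %s\n",
			strerror(errno));
		return -1;
	}
	/*
	 * Type1 reports the size it actually unmapped back in dma_unmap.size.
	 * A partial unmap of a larger original mapping comes back smaller than
	 * requested even though the ioctl itself reported success.
	 */
	if (dma_unmap.size != len) {
		fprintf(stderr, "unexpected unmap size %llu, expected %llu\n",
			(unsigned long long)dma_unmap.size,
			(unsigned long long)len);
		return -1;
	}
	return 0;
}
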
>>>>>>>>>>>>  			RTE_LOG(ERR, EAL, " cannot clear DMA remapping, error %i (%s)\n",
>>>>>>>>>>>>  					errno, strerror(errno));
>>>>>>>>>>>>  			return -1;
>>>>>>>>>>>> +		} else if (dma_unmap.size != len) {
>>>>>>>>>>>> +			RTE_LOG(ERR, EAL, " unexpected size %"PRIu64" of DMA "
>>>>>>>>>>>> +				"remapping cleared instead of %"PRIu64"\n",
>>>>>>>>>>>> +				(uint64_t)dma_unmap.size, len);
>>>>>>>>>>>> +			rte_errno = EIO;
>>>>>>>>>>>> +			return -1;
>>>>>>>>>>>>  		}
>>>>>>>>>>>>  	}
>>>>>>>>>>>> @@ -1853,6 +1869,12 @@ container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>>>>>>>>>>  		/* we're partially unmapping a previously mapped region, so we
>>>>>>>>>>>>  		 * need to split entry into two.
>>>>>>>>>>>>  		 */
>>>>>>>>>>>> +		if (!vfio_cfg->vfio_iommu_type->partial_unmap) {
>>>>>>>>>>>> +			RTE_LOG(DEBUG, EAL, "DMA partial unmap unsupported\n");
>>>>>>>>>>>> +			rte_errno = ENOTSUP;
>>>>>>>>>>>> +			ret = -1;
>>>>>>>>>>>> +			goto out;
>>>>>>>>>>>> +		}
>>>>>>>>>>>
>>>>>>>>>>> How would we ever arrive here if we never do more than 1 page worth of
>>>>>>>>>>> memory anyway? I don't think this is needed.
>>>>>>>>>>
>>>>>>>>>> container_dma_unmap() is called by the user via rte_vfio_container_dma_unmap(),
>>>>>>>>>> and when he maps we don't split it, as we don't know about his memory.
>>>>>>>>>> So if he maps multiple pages and tries to unmap partially, then we should fail.
>>>>>>>>>
>>>>>>>>> Should we map it in page granularity then, instead of adding this
>>>>>>>>> discrepancy between EAL and user mapping? I.e. instead of adding a
>>>>>>>>> workaround, how about we just do the same thing for user mem mappings?
>>>>>>>>>
>>>>>>>> In heap mappings we map and unmap at huge page granularity, as we will always
>>>>>>>> maintain that.
>>>>>>>>
>>>>>>>> But here I think we don't know if the user's allocation is a huge page or a
>>>>>>>> collection of system pages. The only thing we can do here is map it at system
>>>>>>>> page granularity, which could waste entries if he, say, really is working with
>>>>>>>> hugepages. Isn't it?
>>>>>>>>
>>>>>>> Yeah, we do. The API mandates page granularity, and it will check
>>>>>>> against page size and number of IOVA entries, so yes, we do enforce the fact
>>>>>>> that the IOVA addresses supplied by the user have to be page addresses.
>>>>>>
>>>>>> If I look at rte_vfio_container_dma_map(), there is no mention of the huge page
>>>>>> size the user is providing, nor do we compute it. He can call
>>>>>> rte_vfio_container_dma_map() with a 1GB huge page or a 4K system page.
>>>>>>
>>>>>> Am I missing something?
>>>>>
>>>>> Are you suggesting that a DMA mapping for hugepage-backed memory will be
>>>>> made at system page size granularity? E.g. will a 1GB page-backed segment be
>>>>> mapped for DMA as a contiguous 4K-based block?
>>>>
>>>> I'm not suggesting anything. My only thought is how to solve the problem below.
>>>> Say the application does the following.
>>>>
>>>> #1 Allocate 1GB memory from a huge page or some external mem.
>>>> #2 Do rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD, mem, mem, 1GB).
>>>>    In linux/eal_vfio.c, we map it as a single VFIO DMA entry of 1 GB, as we
>>>>    don't know where this memory is coming from or what it is backed by.
>>>> #3 After a while, call rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
>>>>    mem + 4KB, mem + 4KB, 4KB).
>>>>
>>>> Though rte_vfio_container_dma_unmap() supports #3 by splitting the entry as
>>>> shown below, in VFIO type1 IOMMU, #3 cannot be supported by the current kernel
>>>> interface. So how can we allow #3?
>>>>
>>>> static int
>>>> container_dma_unmap(struct vfio_config *vfio_cfg, uint64_t vaddr, uint64_t iova,
>>>> 		uint64_t len)
>>>> {
>>>> 	struct user_mem_map *map, *new_map = NULL;
>>>> 	struct user_mem_maps *user_mem_maps;
>>>> 	int ret = 0;
>>>>
>>>> 	user_mem_maps = &vfio_cfg->mem_maps;
>>>> 	rte_spinlock_recursive_lock(&user_mem_maps->lock);
>>>>
>>>> 	/* find our mapping */
>>>> 	map = find_user_mem_map(user_mem_maps, vaddr, iova, len);
>>>> 	if (!map) {
>>>> 		RTE_LOG(ERR, EAL, "Couldn't find previously mapped region\n");
>>>> 		rte_errno = EINVAL;
>>>> 		ret = -1;
>>>> 		goto out;
>>>> 	}
>>>> 	if (map->addr != vaddr || map->iova != iova || map->len != len) {
>>>> 		/* we're partially unmapping a previously mapped region, so we
>>>> 		 * need to split entry into two.
>>>> 		 */
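
For reference, the sequence in #1-#3 looks roughly like the sketch below from
the application's point of view (illustrative only: 'mem' is assumed to be a
1 GB, IOVA-contiguous buffer obtained elsewhere and VA is used as IOVA; with
the patch under discussion, step #3 is expected to fail with rte_errno ==
ENOTSUP on a type1 IOMMU):

#include <stdint.h>
#include <stdio.h>
#include <rte_errno.h>
#include <rte_vfio.h>

#define GB (1ULL << 30)
#define KB (1ULL << 10)

static void
partial_unmap_example(void *mem)
{
	uint64_t va = (uint64_t)(uintptr_t)mem;

	/* #2: one 1 GB mapping, a single VFIO DMA entry on the kernel side */
	if (rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
			va, va, 1 * GB) != 0) {
		printf("map failed: %s\n", rte_strerror(rte_errno));
		return;
	}

	/* #3: partial unmap of 4 KB out of the 1 GB entry. VFIO type1 cannot
	 * split the kernel-side entry, so with the proposed patch this is
	 * expected to fail with rte_errno == ENOTSUP.
	 */
	if (rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
			va + 4 * KB, va + 4 * KB, 4 * KB) != 0)
		printf("partial unmap failed: %s\n", rte_strerror(rte_errno));
}
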
>>
>> Hi,
>>
>> Apologies, I was on vacation.
>>
>> Yes, I can see the problem now. Does VFIO even support non-system page
>> sizes? Like, if I allocated a 1GB page, would I be able to map *this page*
>> for DMA, as opposed to the first 4K of this page? I suspect that the mapping
>> doesn't support page sizes other than the system page size.
>
> It does support mapping any multiple of the system page size.
> See vfio/vfio_iommu_type1.c vfio_pin_map_dma(). Also
> ./driver-api/vfio.rst doesn't mention any such restrictions even in its
> example.
>
> Also, my test case is passing, so that confirms the behavior.

Can we perhaps make it so that the API mandates mapping/unmapping the same
chunks? That would be the easiest solution here.

>
>> --
>> Thanks,
>> Anatoly

--
Thanks,
Anatoly
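
One way an application can follow that "same chunks" rule today, at the cost
of burning more entries against the type1 dma_entry_limit mentioned earlier in
the thread, is to create its user mappings in fixed-size chunks up front, so
that any later unmap exactly matches a chunk that was mapped. A rough sketch,
assuming VA == IOVA and a length that is a multiple of the chosen granularity
(the helper name and rollback policy are illustrative, not part of the patch):

#include <stdint.h>
#include <rte_vfio.h>

static int
map_in_chunks(uint64_t va, uint64_t len, uint64_t pg_sz)
{
	uint64_t off;

	for (off = 0; off < len; off += pg_sz) {
		if (rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
				va + off, va + off, pg_sz) != 0) {
			/* roll back the chunks mapped so far */
			while (off > 0) {
				off -= pg_sz;
				rte_vfio_container_dma_unmap(
					RTE_VFIO_DEFAULT_CONTAINER_FD,
					va + off, va + off, pg_sz);
			}
			return -1;
		}
	}
	return 0;
}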