From mboxrd@z Thu Jan 1 00:00:00 1970
To: David Christensen, dev@dpdk.org
References: <20200429232931.87233-1-drc@linux.vnet.ibm.com>
 <20200429232931.87233-3-drc@linux.vnet.ibm.com>
 <6cbb170a-3f13-47ba-e0ad-4a86cd6cb352@intel.com>
 <6763793c-265b-c5cf-228a-b2c177574c84@linux.vnet.ibm.com>
 <58df8aa5-e9b5-7f9a-2aee-fcb19b6dea04@intel.com>
 <782d6f04-f476-93d6-1a8f-2ed0b39dde10@linux.vnet.ibm.com>
From: "Burakov, Anatoly"
Message-ID: <8f83778f-fe74-6c69-e2d0-73b9a6e7ca79@intel.com>
Date: Tue, 5 May 2020 15:57:01 +0100
In-Reply-To: <782d6f04-f476-93d6-1a8f-2ed0b39dde10@linux.vnet.ibm.com>
Subject: Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use
 static window sizing
List-Id: DPDK patches and discussions

On 01-May-20 5:48 PM, David Christensen wrote:
>>>> I'm not sure of the algorithm for "memory size" here.
>>>>
>>>> Technically, DPDK can reserve memory segments anywhere in the VA
>>>> space allocated by memseg lists. That space may be far bigger than
>>>> system memory (on a typical Intel server board you'd see 128GB of
>>>> VA space preallocated even though the machine itself might only
>>>> have, say, 16GB of RAM installed). The same applies to any other
>>>> arch running on Linux, so the window needs to cover at least
>>>> RTE_MIN(base_virtaddr, lowest memseglist VA address) and up to the
>>>> highest memseglist VA address. That's not even mentioning the fact
>>>> that the user may register external memory for DMA, which may
>>>> cause the window to be of insufficient size to cover said external
>>>> memory.
>>>>
>>>> I also think that in general, the "system memory" metric is ill
>>>> suited for measuring VA space, because unlike system memory, the
>>>> VA space is sparse and can therefore span *a lot* of address space
>>>> even though in reality it may actually use very little physical
>>>> memory.
>>>
>>> I'm open to suggestions here.  Perhaps an alternative in
>>> /proc/meminfo:
>>>
>>> VmallocTotal:   549755813888 kB
>>>
>>> I tested it with 1GB hugepages and it works; I need to check with
>>> 2M as well.  If there's no alternative for sizing the window based
>>> on available system parameters, then I have another option which
>>> creates a new RTE_IOVA_TA mode that forces IOVA addresses into the
>>> range 0 to X, where X is configured on the EAL command line
>>> (--iova-base, --iova-len).  I use these command-line values to
>>> create a static window.
>>>
>>
>> A whole new IOVA mode, while being a cleaner solution, would require
>> a lot of testing, and it doesn't really solve the external memory
>> problem, because we're still reliant on the user to provide IOVA
>> addresses. Perhaps something akin to VA/IOVA address reservation
>> would solve the problem, but again, that's a lot of changes and
>> testing, all for a comparatively narrow use case.
>>
>> The vmalloc area seems big enough (512 terabytes on your machine, 32
>> terabytes on mine), so it'll probably be OK. I'd settle for:
>>
>> 1) start at base_virtaddr OR lowest memseg list address, whichever
>> is lowest
>
> The IOMMU only supports two starting addresses, 0 or 1<<59, so the
> implementation will need to start at 0.  (I've been bitten by this
> before; my understanding is that the processor only supports 54 bits
> of the address and that the PCI host bridge uses bit 59 of the IOVA
> as a signal to do the address translation for the second DMA window.)

Fair enough, 0 it is then.

>
>> 2) end at lowest addr + VmallocTotal OR highest memseglist addr,
>> whichever is higher
>
> So, instead of rte_memseg_walk(), execute rte_memseg_list_walk() to
> find the lowest/highest msl addresses?

Yep. rte_memseg_walk() will only cover allocated pages, while
rte_memseg_list_walk() will cover even empty page tables.

>
>> 3) a check in user DMA map function that would warn/throw an error
>> whenever there is an attempt to map an address for DMA that doesn't
>> fit into the DMA window
>
> Isn't this mostly prevented by the use of rte_mem_set_dma_mask() and
> rte_mem_check_dma_mask()?  I'd expect an error would be thrown by the
> kernel IOMMU API for an out-of-range mapping, which I would simply
> return to the caller (drivers/vfio/vfio_iommu_spapr_tce.c includes
> the comment /* iova is checked by the IOMMU API */).  Why do you
> think double checking this would help?

I don't think we check rte_mem_check_dma_mask() anywhere in the call
path of external memory code.
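To make (3) concrete, the sort of range check I have in mind for
user-supplied IOVA addresses looks roughly like this (a sketch only;
check_iova_in_mask() is a hypothetical helper mirroring what
rte_mem_check_dma_mask() does, applied to a single address range):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Check that [iova, iova + len) fits within a DMA mask of mask_bits
 * address bits. Returns 0 if the range is addressable, -1 otherwise.
 * Hypothetical helper for illustration; the real check would go
 * through rte_mem_check_dma_mask().
 */
static int
check_iova_in_mask(uint64_t iova, uint64_t len, uint8_t mask_bits)
{
	/* highest address reachable with mask_bits of addressing */
	uint64_t limit = (mask_bits >= 64) ?
			UINT64_MAX : (UINT64_C(1) << mask_bits) - 1;

	/* len - 1 > limit - iova is the overflow-safe form of
	 * iova + len - 1 > limit */
	if (len == 0 || iova > limit || len - 1 > limit - iova)
		return -1; /* range exceeds the DMA mask */
	return 0;
}
```

A check of this shape in rte_malloc_heap_memory_add() would catch
out-of-range external-memory IOVAs before they ever reach the page
table, rather than relying on the kernel IOMMU API to reject them at
mapping time.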
Also, I just checked, and you're right: rte_vfio_container_dma_map()
will fail if the kernel fails to map the memory. However, nothing will
fail for external memory, because the IOVA addresses aren't checked
for being within the DMA mask. See malloc_heap.c:1097 onwards: we
simply add user-specified IOVA addresses into the page table without
checking if they fit into the DMA mask. The DMA mapping will then
happen through a mem event callback, but we don't check the return
value of that callback either, so even if DMA mapping fails, we'll
only get a log message.

So, perhaps the real solution here is to add a DMA mask check into
rte_malloc_heap_memory_add(), so that we check the IOVA addresses
before we ever try to do anything with them. I'll submit a patch for
this.

>
>>
>> I think that would be the best approach. Thoughts?
>
> Dave

-- 
Thanks, Anatoly