To: David Christensen, dev@dpdk.org
References: <20200429232931.87233-1-drc@linux.vnet.ibm.com> <20200429232931.87233-3-drc@linux.vnet.ibm.com>
From: "Burakov, Anatoly"
Message-ID: <6cbb170a-3f13-47ba-e0ad-4a86cd6cb352@intel.com>
Date: Thu, 30 Apr 2020 12:34:36 +0100
In-Reply-To: <20200429232931.87233-3-drc@linux.vnet.ibm.com>
Subject: Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
List-Id: DPDK patches and discussions
On 30-Apr-20 12:29 AM, David Christensen wrote:
> Current SPAPR IOMMU support code dynamically modifies the DMA window
> size in response to every new memory allocation. This is potentially
> dangerous because all existing mappings need to be unmapped/remapped in
> order to resize the DMA window, leaving hardware holding IOVA addresses
> that are not properly prepared for DMA. The new SPAPR code statically
> assigns the DMA window size on first use, using the largest physical
> memory address when IOVA=PA and the base_virtaddr + physical memory size
> when IOVA=VA. As a result, memory will only be unmapped when
> specifically requested.
>
> Signed-off-by: David Christensen
> ---

Hi David,

I haven't yet looked at the code in detail (will do so later), but some
general comments and questions below.

> +	/* only create DMA window once */
> +	if (spapr_dma_win_len > 0)
> +		return 0;
> +
> +	if (rte_eal_iova_mode() == RTE_IOVA_PA) {
> +		/* Set the DMA window to cover the max physical address */
> +		const char proc_iomem[] = "/proc/iomem";
> +		const char str_sysram[] = "System RAM";
> +		uint64_t start, end, max = 0;
> +		char *line = NULL;
> +		char *dash, *space;
> +		size_t line_len;
> +
> +		/*
> +		 * Read "System RAM" in /proc/iomem:
> +		 * 00000000-1fffffffff : System RAM
> +		 * 200000000000-201fffffffff : System RAM
> +		 */
> +		FILE *fd = fopen(proc_iomem, "r");
> +		if (fd == NULL) {
> +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
> +			return -1;
> +		}
> +		/* Scan /proc/iomem for the highest PA in the system */
> +		while (getline(&line, &line_len, fd) != -1) {
> +			if (strstr(line, str_sysram) == NULL)
> +				continue;
> +
> +			space = strstr(line, " ");
> +			dash = strstr(line, "-");
> +
> +			/* Validate the format of the memory string */
> +			if (space == NULL || dash == NULL || space < dash) {
> +				RTE_LOG(ERR, EAL, "Can't parse line \"%s\" in file %s\n",
> +					line, proc_iomem);
> +				continue;
> +			}
> +
> +			start = strtoull(line, NULL, 16);
> +			end = strtoull(dash + 1, NULL, 16);
> +			RTE_LOG(DEBUG, EAL, "Found system RAM from 0x%"
> +				PRIx64 " to 0x%" PRIx64 "\n", start, end);
> +			if (end > max)
> +				max = end;
> +		}
> +		free(line);
> +		fclose(fd);
> +
> +		if (max == 0) {
> +			RTE_LOG(ERR, EAL, "Failed to find valid \"System RAM\" entry "
> +				"in file %s\n", proc_iomem);
> +			return -1;
> +		}
> +
> +		spapr_dma_win_len = rte_align64pow2(max + 1);
> +		rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));

A quick check on my machines shows that when cat'ing /proc/iomem as
non-root, you get zeroes everywhere, which leads me to believe that you
have to be root to get anything useful out of /proc/iomem. Since one of
the major selling points of VFIO is the ability to run as non-root,
depending on iomem kind of defeats the purpose a bit.

> +		return 0;
> +
> +	} else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
> +		/* Set the DMA window to base_virtaddr + system memory size */
> +		const char proc_meminfo[] = "/proc/meminfo";
> +		const char str_memtotal[] = "MemTotal:";
> +		int memtotal_len = sizeof(str_memtotal) - 1;
> +		char buffer[256];
> +		uint64_t size = 0;
> +
> +		FILE *fd = fopen(proc_meminfo, "r");
> +		if (fd == NULL) {
> +			RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
> +			return -1;
> +		}
> +		while (fgets(buffer, sizeof(buffer), fd)) {
> +			if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
> +				size = rte_str_to_size(&buffer[memtotal_len]);
> +				break;
> +			}
> +		}
> +		fclose(fd);
> +
> +		if (size == 0) {
> +			RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
> +				"in file %s\n", proc_meminfo);
> +			return -1;
> +		}
> +
> +		RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
> +		/* if no base virtual address is configured use 4GB */
> +		spapr_dma_win_len = rte_align64pow2(size +
> +				(internal_config.base_virtaddr > 0 ?
> +				(uint64_t)internal_config.base_virtaddr : 1ULL << 32));
> +		rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));

I'm not sure of the algorithm for "memory size" here. Technically, DPDK
can reserve memory segments anywhere in the VA space allocated by
memseg lists. That space may be far bigger than system memory (on a
typical Intel server board you'd see 128GB of VA space preallocated
even though the machine itself might only have, say, 16GB of RAM
installed). The same applies to any other arch running on Linux, so the
window needs to cover at least RTE_MIN(base_virtaddr, lowest memseglist
VA address) and up to the highest memseglist VA address. That's not even
mentioning the fact that the user may register external memory for DMA,
which may cause the window to be of insufficient size to cover said
external memory.

I also think that, in general, the "system memory" metric is ill suited
for measuring VA space, because unlike system memory, the VA space is
sparse and can therefore span *a lot* of address space even though in
reality it may actually use very little physical memory.

-- 
Thanks,
Anatoly