From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Burakov, Anatoly"
To: David Christensen, dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to
 use static window sizing
Date: Fri, 1 May 2020 10:06:28 +0100
Message-ID: <58df8aa5-e9b5-7f9a-2aee-fcb19b6dea04@intel.com>
In-Reply-To: <6763793c-265b-c5cf-228a-b2c177574c84@linux.vnet.ibm.com>
References: <20200429232931.87233-1-drc@linux.vnet.ibm.com>
 <20200429232931.87233-3-drc@linux.vnet.ibm.com>
 <6cbb170a-3f13-47ba-e0ad-4a86cd6cb352@intel.com>
 <6763793c-265b-c5cf-228a-b2c177574c84@linux.vnet.ibm.com>
List-Id: DPDK patches and discussions

On 30-Apr-20 6:36 PM, David Christensen wrote:
> On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
>> On 30-Apr-20 12:29 AM, David Christensen wrote:
>>> Current SPAPR IOMMU support code dynamically modifies the DMA window
>>> size in response to every new memory allocation. This is potentially
>>> dangerous because all existing mappings need to be unmapped/remapped
>>> in order to resize the DMA window, leaving hardware holding IOVA
>>> addresses that are not properly prepared for DMA. The new SPAPR code
>>> statically assigns the DMA window size on first use, using the
>>> largest physical memory address when IOVA=PA and the base_virtaddr +
>>> physical memory size when IOVA=VA. As a result, memory will only be
>>> unmapped when specifically requested.
>>>
>>> Signed-off-by: David Christensen
>>> ---
>>
>> Hi David,
>>
>> I haven't yet looked at the code in detail (will do so later), but
>> some general comments and questions below.
>>
>>> +        /*
>>> +         * Read "System RAM" in /proc/iomem:
>>> +         * 00000000-1fffffffff : System RAM
>>> +         * 200000000000-201fffffffff : System RAM
>>> +         */
>>> +        FILE *fd = fopen(proc_iomem, "r");
>>> +        if (fd == NULL) {
>>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
>>> +            return -1;
>>> +        }
>>
>> A quick check on my machines shows that when cat'ing /proc/iomem as
>> non-root, you get zeroes everywhere, which leads me to believe that
>> you have to be root to get anything useful out of /proc/iomem. Since
>> one of the major selling points of VFIO is the ability to run as
>> non-root, depending on iomem kind of defeats the purpose a bit.
>
> I observed the same thing on my system during development. I didn't
> see anything that precluded support for RTE_IOVA_PA in the VFIO code.
> Are you suggesting that I should explicitly not support that
> configuration? If you're attempting to use RTE_IOVA_PA then you're
> already required to run as root, so there shouldn't be an issue
> accessing this.

Oh, right, forgot about that. That's OK then.

>>> +        return 0;
>>> +
>>> +    } else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
>>> +        /* Set the DMA window to base_virtaddr + system memory size */
>>> +        const char proc_meminfo[] = "/proc/meminfo";
>>> +        const char str_memtotal[] = "MemTotal:";
>>> +        int memtotal_len = sizeof(str_memtotal) - 1;
>>> +        char buffer[256];
>>> +        uint64_t size = 0;
>>> +
>>> +        FILE *fd = fopen(proc_meminfo, "r");
>>> +        if (fd == NULL) {
>>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
>>> +            return -1;
>>> +        }
>>> +        while (fgets(buffer, sizeof(buffer), fd)) {
>>> +            if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
>>> +                size = rte_str_to_size(&buffer[memtotal_len]);
>>> +                break;
>>> +            }
>>> +        }
>>> +        fclose(fd);
>>> +
>>> +        if (size == 0) {
>>> +            RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
>>> +                "in file %s\n", proc_meminfo);
>>> +            return -1;
>>> +        }
>>> +
>>> +        RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
>>> +        /* if no base virtual address is configured use 4GB */
>>> +        spapr_dma_win_len = rte_align64pow2(size +
>>> +            (internal_config.base_virtaddr > 0 ?
>>> +            (uint64_t)internal_config.base_virtaddr : 1ULL << 32));
>>> +        rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
>>
>> I'm not sure of the algorithm for "memory size" here.
>>
>> Technically, DPDK can reserve memory segments anywhere in the VA
>> space allocated by memseg lists. That space may be far bigger than
>> system memory (on a typical Intel server board you'd see 128GB of VA
>> space preallocated even though the machine itself might only have,
>> say, 16GB of RAM installed). The same applies to any other arch
>> running on Linux, so the window needs to cover at least
>> RTE_MIN(base_virtaddr, lowest memseglist VA address) and up to the
>> highest memseglist VA address. That's not even mentioning the fact
>> that the user may register external memory for DMA, which may cause
>> the window to be of insufficient size to cover said external memory.
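
(To make the above point concrete, here is a rough, untested sketch of
the sizing I have in mind. rte_memseg_list_walk(), RTE_MIN/RTE_MAX and
rte_align64pow2() are existing EAL APIs; the callback name and the glue
around internal_config are invented for illustration, not code from
the patch.)

#include <inttypes.h>
#include <stdint.h>

#include <rte_common.h>
#include <rte_log.h>
#include <rte_memory.h>

/* rte_memseg_list_walk() callback: track the lowest and highest VA
 * covered by any memseg list; arg points to uint64_t[2] = {lo, hi} */
static int
msl_va_bounds(const struct rte_memseg_list *msl, void *arg)
{
	uint64_t *b = arg;
	uint64_t start = (uint64_t)(uintptr_t)msl->base_va;

	b[0] = RTE_MIN(b[0], start);
	b[1] = RTE_MAX(b[1], start + msl->len);
	return 0;
}

	/* in the window sizing code, instead of reading MemTotal: */
	uint64_t b[2] = { UINT64_MAX, 0 };

	rte_memseg_list_walk(msl_va_bounds, b);
	/* the window must also start no higher than base_virtaddr, if
	 * set; internal_config is the EAL-internal config, as used in
	 * the patch */
	if (internal_config.base_virtaddr != 0)
		b[0] = RTE_MIN(b[0],
			(uint64_t)internal_config.base_virtaddr);
	RTE_LOG(DEBUG, EAL, "VA range to cover: 0x%" PRIx64 "..0x%"
		PRIx64 "\n", b[0], b[1]);
	/* the sPAPR DMA window starts at IOVA 0, so it has to reach the
	 * highest VA; align to a power of two as the IOMMU requires */
	spapr_dma_win_len = rte_align64pow2(b[1]);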
>> I also think that, in general, the "system memory" metric is ill
>> suited for measuring VA space, because unlike system memory, the VA
>> space is sparse and can therefore span *a lot* of address space even
>> though in reality it may actually use very little physical memory.
>
> I'm open to suggestions here. Perhaps an alternative in /proc/meminfo:
>
> VmallocTotal:   549755813888 kB
>
> I tested it with 1GB hugepages and it works; I need to check with 2M
> as well. If there's no alternative for sizing the window based on
> available system parameters, then I have another option which creates
> a new RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to
> X, where X is configured on the EAL command line (--iova-base,
> --iova-len). I use these command-line values to create a static
> window.

A whole new IOVA mode, while being a cleaner solution, would require a
lot of testing, and it doesn't really solve the external memory
problem, because we're still reliant on the user to provide IOVA
addresses. Perhaps something akin to VA/IOVA address reservation would
solve the problem, but again, that's a lot of changes and testing, all
for a comparatively narrow use case.

The vmalloc area seems big enough (512 terabytes on your machine, 32
terabytes on mine), so it'll probably be OK. I'd settle for:

1) start at base_virtaddr OR the lowest memseg list address, whichever
   is lower
2) end at lowest addr + VmallocTotal OR the highest memseg list
   address, whichever is higher
3) a check in the user DMA map function that would warn/throw an error
   whenever there is an attempt to map an address for DMA that doesn't
   fit into the DMA window (see the sketch below my signature)

I think that would be the best approach. Thoughts?

> Dave

-- 
Thanks,
Anatoly
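
P.S. The check in (3) could look roughly like this. Untested sketch:
spapr_dma_win_len is the static window length from your patch, but the
helper name and where exactly it gets called from are invented for
illustration.

#include <inttypes.h>
#include <stdint.h>

#include <rte_log.h>

/* set once at first use, as in the patch under review */
static uint64_t spapr_dma_win_len;

/* hypothetical guard for the user DMA map path: the sPAPR DMA window
 * spans IOVAs [0, spapr_dma_win_len), so refuse any mapping that would
 * end beyond it rather than resizing the window */
static int
spapr_check_dma_window(uint64_t iova, uint64_t len)
{
	/* written this way to avoid overflow of iova + len */
	if (len > spapr_dma_win_len || iova > spapr_dma_win_len - len) {
		RTE_LOG(ERR, EAL, "DMA map attempt outside DMA window "
			"(iova 0x%" PRIx64 ", len 0x%" PRIx64 ")\n",
			iova, len);
		return -1;
	}
	return 0;
}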