From mboxrd@z Thu Jan 1 00:00:00 1970
To: "Burakov, Anatoly", dev@dpdk.org
References: <20200429232931.87233-1-drc@linux.vnet.ibm.com>
 <20200429232931.87233-3-drc@linux.vnet.ibm.com>
 <6cbb170a-3f13-47ba-e0ad-4a86cd6cb352@intel.com>
From: David Christensen
Message-ID: <6763793c-265b-c5cf-228a-b2c177574c84@linux.vnet.ibm.com>
Date: Thu, 30 Apr 2020 10:36:17 -0700
In-Reply-To: <6cbb170a-3f13-47ba-e0ad-4a86cd6cb352@intel.com>
Subject: Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use
 static window sizing
List-Id: DPDK patches and discussions

On 4/30/20 4:34 AM, Burakov, Anatoly wrote:
> On 30-Apr-20 12:29 AM, David Christensen wrote:
>> Current SPAPR IOMMU support code dynamically modifies the DMA window
>> size in response to every new memory allocation.
>> This is potentially dangerous because all existing mappings need to be
>> unmapped/remapped in order to resize the DMA window, leaving hardware
>> holding IOVA addresses that are not properly prepared for DMA.  The new
>> SPAPR code statically assigns the DMA window size on first use, using
>> the largest physical memory address when IOVA=PA and the base_virtaddr
>> + physical memory size when IOVA=VA.  As a result, memory will only be
>> unmapped when specifically requested.
>>
>> Signed-off-by: David Christensen
>> ---
>
> Hi David,
>
> I haven't yet looked at the code in detail (will do so later), but some
> general comments and questions below.
>
>> +        /*
>> +         * Read "System RAM" in /proc/iomem:
>> +         * 00000000-1fffffffff : System RAM
>> +         * 200000000000-201fffffffff : System RAM
>> +         */
>> +        FILE *fd = fopen(proc_iomem, "r");
>> +        if (fd == NULL) {
>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_iomem);
>> +            return -1;
>> +        }
>
> A quick check on my machines shows that when cat'ing /proc/iomem as
> non-root, you get zeroes everywhere, which leads me to believe that you
> have to be root to get anything useful out of /proc/iomem. Since one of
> the major selling points of VFIO is the ability to run as non-root,
> depending on iomem kind of defeats the purpose a bit.

I observed the same thing on my system during development.  I didn't see
anything that precluded support for RTE_IOVA_PA in the VFIO code.  Are
you suggesting that I should explicitly not support that configuration?
If you're attempting to use RTE_IOVA_PA then you're already required to
run as root, so there shouldn't be an issue accessing this file.

>> +        return 0;
>> +
>> +    } else if (rte_eal_iova_mode() == RTE_IOVA_VA) {
>> +        /* Set the DMA window to base_virtaddr + system memory size */
>> +        const char proc_meminfo[] = "/proc/meminfo";
>> +        const char str_memtotal[] = "MemTotal:";
>> +        int memtotal_len = sizeof(str_memtotal) - 1;
>> +        char buffer[256];
>> +        uint64_t size = 0;
>> +
>> +        FILE *fd = fopen(proc_meminfo, "r");
>> +        if (fd == NULL) {
>> +            RTE_LOG(ERR, EAL, "Cannot open %s\n", proc_meminfo);
>> +            return -1;
>> +        }
>> +        while (fgets(buffer, sizeof(buffer), fd)) {
>> +            if (strncmp(buffer, str_memtotal, memtotal_len) == 0) {
>> +                size = rte_str_to_size(&buffer[memtotal_len]);
>> +                break;
>> +            }
>> +        }
>> +        fclose(fd);
>> +
>> +        if (size == 0) {
>> +            RTE_LOG(ERR, EAL, "Failed to find valid \"MemTotal\" entry "
>> +                "in file %s\n", proc_meminfo);
>> +            return -1;
>> +        }
>> +
>> +        RTE_LOG(DEBUG, EAL, "MemTotal is 0x%" PRIx64 "\n", size);
>> +        /* if no base virtual address is configured use 4GB */
>> +        spapr_dma_win_len = rte_align64pow2(size +
>> +            (internal_config.base_virtaddr > 0 ?
>> +            (uint64_t)internal_config.base_virtaddr : 1ULL << 32));
>> +        rte_mem_set_dma_mask(__builtin_ctzll(spapr_dma_win_len));
>
> I'm not sure of the algorithm for "memory size" here.
>
> Technically, DPDK can reserve memory segments anywhere in the VA space
> allocated by memseg lists. That space may be far bigger than system
> memory (on a typical Intel server board you'd see 128GB of VA space
> preallocated even though the machine itself might only have, say, 16GB
> of RAM installed).
> The same applies to any other arch running on Linux,
> so the window needs to cover at least RTE_MIN(base_virtaddr, lowest
> memseglist VA address) and up to highest memseglist VA address. That's
> not even mentioning the fact that the user may register external memory
> for DMA which may cause the window to be of insufficient size to cover
> said external memory.
>
> I also think that in general, "system memory" metric is ill suited for
> measuring VA space, because unlike system memory, the VA space is sparse
> and can therefore span *a lot* of address space even though in reality
> it may actually use very little physical memory.

I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:

VmallocTotal:   549755813888 kB

I tested it with 1GB hugepages and it works; I need to check with 2MB
as well.

If there's no alternative for sizing the window based on available
system parameters, then I have another option, which creates a new
RTE_IOVA_TA mode that forces IOVA addresses into the range 0 to X, where
X is configured on the EAL command line (--iova-base, --iova-len).  I
use these command-line values to create a static window.

Dave
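
For readers following the thread, the window arithmetic under discussion
can be sketched in standalone C.  This is only an illustration, not the
patch code: align64pow2() is a hypothetical stand-in for
rte_align64pow2(), the other function names are made up for this sketch,
and the idea that the VA space starts at the 4GB default base is an
assumption layered on the review comments above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for rte_align64pow2(): round v up to the next
 * power of two (a value that already is a power of two is unchanged). */
static uint64_t align64pow2(uint64_t v)
{
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return v + 1;
}

/* IOVA=VA sizing as in the quoted patch: round (memory size + base) up
 * to a power of two, using a 4GB base when none is configured. */
static uint64_t window_len_from_memtotal(uint64_t mem_size,
                                         uint64_t base_virtaddr)
{
    uint64_t base = base_virtaddr > 0 ? base_virtaddr : 1ULL << 32;

    return align64pow2(mem_size + base);
}

/* Sizing along the lines raised in the review: the window starts at 0
 * and must reach the end of the highest VA range that may be mapped. */
static uint64_t window_len_from_va_end(uint64_t highest_va_end)
{
    return align64pow2(highest_va_end);
}

/* Number of address bits covered by a power-of-two window length. */
static int dma_mask_bits(uint64_t win_len)
{
    /* __builtin_ctzll(0) is undefined, so reject 0 and non-powers of 2 */
    assert(win_len != 0 && (win_len & (win_len - 1)) == 0);
    return __builtin_ctzll(win_len);
}
```

With 16GB of RAM and no base_virtaddr, window_len_from_memtotal() yields
a 32GB window (35 address bits).  If 128GB of VA space were preallocated
starting at 4GB, window_len_from_va_end() would need a 256GB window (38
bits), which is the gap the review comment points at.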