From: David Christensen
To: "Burakov, Anatoly", dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 2/2] vfio: modify spapr iommu support to use static window sizing
Date: Fri, 1 May 2020 09:48:56 -0700
Message-ID: <782d6f04-f476-93d6-1a8f-2ed0b39dde10@linux.vnet.ibm.com>
In-Reply-To: <58df8aa5-e9b5-7f9a-2aee-fcb19b6dea04@intel.com>
References: <20200429232931.87233-1-drc@linux.vnet.ibm.com>
 <20200429232931.87233-3-drc@linux.vnet.ibm.com>
 <6cbb170a-3f13-47ba-e0ad-4a86cd6cb352@intel.com>
 <6763793c-265b-c5cf-228a-b2c177574c84@linux.vnet.ibm.com>
 <58df8aa5-e9b5-7f9a-2aee-fcb19b6dea04@intel.com>
List-Id: DPDK patches and discussions

>>> I'm not sure of the algorithm for "memory size" here.
>>>
>>> Technically, DPDK can reserve memory segments anywhere in the VA
>>> space allocated by memseg lists. That space may be far bigger than
>>> system memory (on a typical Intel server board you'd see 128GB of VA
>>> space preallocated even though the machine itself might only have,
>>> say, 16GB of RAM installed). The same applies to any other arch
>>> running on Linux, so the window needs to cover at least
>>> RTE_MIN(base_virtaddr, lowest memseglist VA address) and up to the
>>> highest memseglist VA address. That's not even mentioning the fact
>>> that the user may register external memory for DMA, which may cause
>>> the window to be of insufficient size to cover said external memory.
>>>
>>> I also think that, in general, the "system memory" metric is ill
>>> suited for measuring VA space, because unlike system memory, the VA
>>> space is sparse and can therefore span *a lot* of address space even
>>> though in reality it may actually use very little physical memory.
>>
>> I'm open to suggestions here.  Perhaps an alternative in /proc/meminfo:
>>
>> VmallocTotal:   549755813888 kB
>>
>> I tested it with 1GB hugepages and it works; I still need to check
>> with 2M as well.  If there's no alternative for sizing the window
>> based on available system parameters, then I have another option that
>> creates a new RTE_IOVA_TA mode that forces IOVA addresses into the
>> range 0 to X, where X is configured on the EAL command line
>> (--iova-base, --iova-len).  I use these command-line values to create
>> a static window.
>>
>
> A whole new IOVA mode, while being a cleaner solution, would require a
> lot of testing, and it doesn't really solve the external memory
> problem, because we're still reliant on the user to provide IOVA
> addresses. Perhaps something akin to VA/IOVA address reservation would
> solve the problem, but again, lots of changes and testing, all for a
> comparatively narrow use case.
>
> The vmalloc area seems big enough (512 terabytes on your machine, 32
> terabytes on mine), so it'll probably be OK. I'd settle for:
>
> 1) start at base_virtaddr OR lowest memseg list address, whichever is
> lowest

The IOMMU only supports two starting addresses, 0 or 1<<59, so the
implementation will need to start at 0. (I've been bitten by this
before; my understanding is that the processor only supports 54 bits of
the address and that the PCI host bridge uses bit 59 of the IOVA as a
signal to perform the address translation for the second DMA window.)

> 2) end at lowest addr + VmallocTotal OR highest memseglist addr,
> whichever is higher

So, instead of rte_memseg_walk(), execute rte_memseg_list_walk() to find
the lowest/highest msl addresses?  (A rough sketch of what I have in
mind is at the end of this mail.)

> 3) a check in the user DMA map function that would warn/throw an error
> whenever there is an attempt to map an address for DMA that doesn't
> fit into the DMA window

Isn't this mostly prevented by the use of rte_mem_set_dma_mask() and
rte_mem_check_dma_mask()?  I'd expect an error to be returned by the
kernel IOMMU API for an out-of-range mapping, which I would simply pass
back to the caller (drivers/vfio/vfio_iommu_spapr_tce.c includes the
comment /* iova is checked by the IOMMU API */).  Why do you think
double-checking this would help?

>
> I think that would be the best approach. Thoughts?

Dave
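
P.S. Here's a rough sketch of the window sizing I have in mind, just to
make the rte_memseg_list_walk() idea concrete. It's illustrative only,
not the actual patch: the helper names (find_msl_bounds,
get_vmalloc_total, spapr_window_size) are made up for this mail, and I'm
assuming the window size still has to be rounded up to a power of two as
the existing code does.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#include <rte_common.h>
#include <rte_memory.h>

struct walk_ctx {
	uint64_t min_va;	/* lowest memseg list VA seen */
	uint64_t max_va;	/* highest memseg list end address seen */
};

/* rte_memseg_list_walk() callback: track the lowest and highest VA
 * covered by any memseg list. */
static int
find_msl_bounds(const struct rte_memseg_list *msl, void *arg)
{
	struct walk_ctx *ctx = arg;
	uint64_t start = (uint64_t)(uintptr_t)msl->base_va;
	uint64_t end = start + msl->len;

	if (start < ctx->min_va)
		ctx->min_va = start;
	if (end > ctx->max_va)
		ctx->max_va = end;
	return 0;
}

/* Read VmallocTotal from /proc/meminfo; returns bytes, 0 on failure. */
static uint64_t
get_vmalloc_total(void)
{
	char line[256];
	uint64_t kb = 0;
	FILE *f = fopen("/proc/meminfo", "r");

	if (f == NULL)
		return 0;
	while (fgets(line, sizeof(line), f) != NULL)
		if (sscanf(line, "VmallocTotal: %" SCNu64 " kB", &kb) == 1)
			break;
	fclose(f);
	return kb * 1024;
}

/* The window starts at 0 (the sPAPR IOMMU only accepts 0 or 1ULL << 59),
 * so the window size is whichever is higher: lowest VA + VmallocTotal,
 * or the highest memseg list end address, rounded up to a power of two. */
static uint64_t
spapr_window_size(void)
{
	struct walk_ctx ctx = { .min_va = UINT64_MAX, .max_va = 0 };
	uint64_t end;

	if (rte_memseg_list_walk(find_msl_bounds, &ctx) < 0 ||
			ctx.max_va == 0)
		return 0;

	end = ctx.min_va + get_vmalloc_total();
	if (ctx.max_va > end)
		end = ctx.max_va;

	return rte_align64pow2(end);
}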
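
And for 3), if you still want a check on our side in addition to the
kernel's, I imagine it would be something small like the below in the
spapr DMA map path (again just a sketch; dma_win_len stands for whatever
window size was computed at creation time):

#include <inttypes.h>
#include <stdint.h>

#include <rte_log.h>

/* Hypothetical guard for the vfio/spapr DMA map path: reject any mapping
 * that would fall outside the DMA window created at startup. */
static int
check_dma_window(uint64_t iova, uint64_t len, uint64_t dma_win_len)
{
	if (iova + len > dma_win_len) {
		RTE_LOG(ERR, EAL,
			"DMA map request (iova 0x%" PRIx64 ", len 0x%" PRIx64
			") outside DMA window of size 0x%" PRIx64 "\n",
			iova, len, dma_win_len);
		return -1;
	}
	return 0;
}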