From: Dmitry Kozlyuk
To: Thomas Monjalon
Cc: dev@dpdk.org, Harini Ramakrishnan, Omar Cardona, Dmitry Malloy,
 Narcisa Ana Maria Vasile, Pallavi Kadam, Ranjit Menon, Tal Shnaiderman,
 Fady Bader, Ophir Munk, Anatoly Burakov
Date: Fri, 17 Apr 2020 02:46:33 +0300
Subject: Re: [dpdk-dev] [Minutes 04/15/2020] Bi-Weekly DPDK Windows Community Call
Message-ID: <20200417024633.21d77a3b@Sovereign>
In-Reply-To: <4064679.AiC22s8V5E@thomas>
References: <4064679.AiC22s8V5E@thomas>

> * [AI Dmitry K, Harini] Dmitry K to send summary of conversation for feedback, Harini to follow-up for resolution.

On Windows community calls we've been discussing memory management
implementation approaches and plans. This summary aims to bring everyone
interested to the same page and to record information in one public place.

[Dmitry M] is Dmitry Malloy from Microsoft, [Dmitry K] is me.
Cc'ing Anatoly Burakov as DPDK memory subsystem maintainer.


Current State
-------------

Patches are sent for basic memory management that should be suitable for most
simple cases.
Relevant implementation traits are as follows:

* IOVA as PA only, PA is obtained via a kernel-mode driver.
* Hugepages are allocated dynamically in user-mode (2MB only),
  IOVA-contiguity is provided by the allocator to the extent possible.
* No multi-process support.


Background and Findings
-----------------------

Physical addresses are fundamentally limited and insecure because of the
following (this list is not specific to Windows, but provides context):

1. A user-mode application with access to DMA and PA can convince the
   device to overwrite arbitrary RAM content, bypassing OS security.

2. An IOMMU might be engaged, rendering PA invalid for a particular device.
   This mode is mandatory for PCI passthrough into a VM.

3. An IOMMU may be used even on a bare-metal system to protect against #1 by
   limiting DMA for a device to IOMMU mappings. Zero-copy forwarding using
   DMA from different RX and TX devices must take care of this. On Windows,
   this mechanism is called Kernel DMA Protection [1].

4. A device can be VA-only with an onboard IOMMU (e.g. Mellanox NICs).

5. In complex PCI topologies logical bus addresses may differ from PA,
   although a concrete example is missing for modern systems (IoT SoC?).

Within the Windows kernel there are two facilities to deal with the above:

1. DMA_ADAPTER interface and its AllocateDomainCommonBuffer() method [2].
   "DMA adapter" is an abstraction of bus-master mode or an allocated channel
   of a DMA controller. Also, each device belongs to a DMA domain, initially
   its so-called default domain. Only devices in the same domain can share
   a buffer suitable for DMA by all of them. In this respect, DMA domains
   are similar to IOMMU groups in Linux.

   Besides domain management, this interface allows allocation of such a
   common buffer, that is, a contiguous range of IOVA (logical addresses) and
   kernel VA (which can be mapped to user-space).

   Advantages of this interface: 1) it is universal w.r.t. PCI topology,
   IOMMU, etc; 2) it supports hugepages.
   One disadvantage is that the kernel controls IOVA and VA.

2. DMA_IOMMU interface, which is functionally similar to the Linux VFIO
   driver, that is, it allows management of IOMMU mappings within
   a domain [3].

[Dmitry M] Microsoft considers creating a generic memory-management driver
exposing (some of) these interfaces which will be shipped with Windows. This
is an idea at an early stage, not a commitment.

Notable DPDK memory management traits:

1. When memory is requested from EAL, it is unknown whether it will be used
   for DMA and with which device. The only hint is a call to
   rte_virt2iova(), but VA-only devices never make that call.

2. Memory is reserved and then committed in segments (basically, hugepages).

3. There is a callback for segment list allocation and deallocation. For
   example, Linux EAL uses it to create IOMMU mappings when VFIO is engaged.

4. There are drivers that explicitly request PA via rte_virt2phys().

Last but not least, user-mode memory management notes:

1. Windows doesn't report limits on the number of hugepages.

2. According to official documentation, only 2MB hugepages are supported.

   [Dmitry M] There are new, still undocumented Win32 API flags for 1GB [5].

   [Dmitry K] Found a novel allocator library using these new features [6].
   Failed to make use of [5] with AWE, unclear how to integrate into MM.

3. Address Windowing Extensions [4] allow allocating physical page frames
   (PFN) and then mapping them to VA, all in user-mode.

   [Dmitry K] Experiments show AWE cannot allocate hugepages (in a documented
   way at least) and cannot reliably provide contiguous ranges (and does not
   guarantee it). IMO, this interface is useless for common MM. Some drivers
   that do not need hugepages but require PA may benefit from it.


Opens
-----

IMO, the "Advanced memory management" milestone from the roadmap should be
split. There are three major points of MM improvement, each requiring research
and a complex patch:

1. Proper DMA buffers via AllocateDomainCommonBuffer (DPDK part is unclear).
2. VFIO-like code in Windows EAL using DMA_IOMMU.
3. Support for 1GB hugepages and related changes.

The Windows kernel interfaces described above have poor documentation. On the
Windows community call of 2020-04-01 Dmitry Malloy agreed to help with this
(concrete questions were raised and noted).

Hugepages of 1GB are desirable, but allocating them relies on undocumented
features. Also, because Windows does not provide hugepage limits, it may
require more work to manage multiple sizes in DPDK.


References
----------

[1]: Kernel DMA Protection for Thunderbolt™ 3
[2]: DMA_ADAPTER.AllocateDomainCommonBuffer
[3]: DMA_IOMMU interface
[4]: Address Windowing Extensions (AWE)
[5]: GitHub issue
[6]: mimalloc

-- 
Dmitry Kozlyuk