Date: Wed, 27 Jul 2022 23:36:44 +0300
From: Dmitry Kozlyuk
To: Don Wallwork
Cc: "dev@dpdk.org"
Subject: Re: [RFC] EAL: legacy memory fixed address translations
Message-ID: <20220727233644.21f0b2a3@sovereign>

I now understand more about _why_ you want this feature,
but have become less confident about _what_ you're proposing specifically.

>> 2022-07-26 14:33 (UTC-0400), Don Wallwork:
>>> This proposal describes a method for translating any huge page
>>> address from virtual to physical or vice versa using simple
>>> addition or subtraction of a single fixed value.

This works for one region with contiguous PA and VA.
At first I assumed you wanted all DPDK memory to be such a region
and that legacy mode was needed for the mapping to be static.
However, you say that the memory as a whole does not need to be PA-contiguous.
Hence, address translation requires a more complex data structure,
something like a tree to look up the region and then do the translation.
This means the lookup is still not a trivial operation.
Effectively, you want fast rte_mem_virt2iova() and rte_mem_iova2virt()
and are pointing to an optimization opportunity.
Am I missing anything?

Static mapping would also allow caching offsets for regions,
e.g. a device knows it works with some area and looks up the offset once.
There is no API to detect that the memory layout is static.
Is this the missing piece?
Because the proposed data structure can be built with the existing API:
EAL memory callbacks can be used to track all DPDK memory.

[...]
> Several examples where this could help include:
>
> 1. A device could return flow lookup results containing the physical
> address of a matching entry that needs to be translated to a virtual
> address.

See above, rte_mem_iova2virt().

> 2. Hardware can perform offloads on dynamically allocated heap
> memory objects and would need PA to avoid requiring IOMMU.

I wonder where these objects come from and how they are given to the HW.
If there are many similar objects, why not use mempools?
If there are few objects, why not use memzones?

> 3. It may be useful to prepare data such as descriptors in stack
> variables, then pass the PA to hardware which can DMA directly
> from stack memory.

pthread_attr_getstack() can give the base address,
which can be translated once and then used for VA-to-PA conversions.
The missing bit is PA-contiguity of the stack memory.
That can be done transparently when allocating worker stacks
(probably try to allocate a contiguous chunk, fall back to non-contiguous).

> 4. The CPU instruction set provides memory operations such as
> prefetch, atomics, ALU and so on which operate on virtual
> addresses with no software requirement to provide physical
> addresses.
> A device may be able to provide a more optimized
> implementation of such features that could avoid performance
> degradation associated with using a hardware IOMMU if provided
> virtual addresses. Having the ability to offload such operations
> without requiring data structure modifications to store an IOVA for
> every virtual address is desirable.

Either it's me lacking experience with such accelerators
or this item needs clarification.

> All of these cases can run at packet rate and are not operating on
> mbuf data. These would all benefit from efficient address translation
> in the same way that mbufs already do. Unlike mbuf translation
> that only covers VA to PA, this translation can perform both VA to PA
> and PA to VA with equal efficiency.
>
> >
> > When drivers need to process a large number of memory blocks,
> > these are typically packets in the form of mbufs,
> > which already have IOVA attached, so there is no translation.
> > Does translation of mbuf VA to PA with the proposed method
> > show significant improvement over reading mbuf->iova?
>
> This proposal does not relate to mbufs. As you say, there is
> already an efficient VA to PA mechanism in place for those.

Are you aware of externally-attached mbufs?
Those carry a pointer to arbitrary IOVA-contiguous memory and its IOVA.
They can be used to convey any object in memory to the API consuming mbufs.

> > When drivers need to process a few IOVA-contiguous memory blocks,
> > they can calculate VA-to-PA offsets in advance,
> > amortizing translation cost.
> > Hugepage stack falls within this category.
>
> As the cases listed above hopefully show, there are cases where
> it is not practical or desirable to precalculate the offsets.

Arguably only PA-to-VA tasks.

> >> When legacy memory mode is used, it is possible to map a single
> >> virtual memory region large enough to cover all huge pages. During
> >> legacy hugepage init, each hugepage is mapped into that region.
> > Legacy mode is called "legacy" with an intent to be deprecated :)
>
> Understood. For our initial implementation, we were okay with
> that limitation given that supporting it in legacy mode was simpler.
>
> > There is initial allocation (-m) and --socket-limit in dynamic mode.
> > When initial allocation is equal to the socket limit,
> > it should be the same behavior as in legacy mode:
> > the number of hugepages mapped is constant and cannot grow,
> > so the feature seems applicable as well.
>
> It seems feasible to implement this feature in non-legacy mode as
> well. The approach would be similar; reserve a region of virtual
> address space large enough to cover all huge pages before they are
> allocated. As huge pages are allocated, they are mapped into the
> appropriate location within that virtual address space.

This is what EAL is trying to do.

[...]
> >> This feature is applicable when rte_eal_iova_mode() == RTE_IOVA_PA
> > One can say it always works for RTE_IOVA_VA with VA-to-PA offset of 0.
>
> This is true, but requires the use of a hardware IOMMU which
> degrades performance.

What I meant is this: if there were an API to ask EAL
whether the fast translation is available,
in RTE_IOVA_VA mode it would always return true;
and if asked for an offset, it would always return 0.
Bottom line: the optimization is not limited to RTE_IOVA_PA,
it is just trivial in RTE_IOVA_VA mode.

> >> and could be enabled either by default when the legacy memory EAL
> >> option is given, or a new EAL option could be added to specifically
> >> enable this feature.
> >>
> >> It may be desirable to set a capability bit when this feature is
> >> enabled to allow drivers to behave differently depending on the
> >> state of that flag.
> > The feature requires, in IOVA-as-PA mode:
> > 1) that hugepage mapping is static (legacy mode or "-m" == "--socket-limit");
> > 2) that EAL has succeeded in mapping all hugepages in one PA-contiguous block.
>
> It does not require huge pages to be physically contiguous.
> Theoretically, mapping a giant VA region could fail, but
> we have not seen this in practice even when running on x86_64
> servers with multiple NUMA nodes, many cores and huge pages
> that span TBs of physical address space.

Size does not matter as long as there are free hugepages.
Physical contiguity does; it's unpredictable.
But it's best-effort for DPDK anyway.
My point: no need for a command-line argument to request this optimized mode,
DPDK can always try and report via API whether it has succeeded.
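
To make the "try and report" idea concrete, here is a minimal sketch in plain C.
Everything in it is hypothetical: struct xlate_region, xlate_va2pa() and
xlate_fixed_offset() are illustration names, not existing EAL API.
It assumes a table of PA-contiguous regions sorted by VA, which in DPDK
could be maintained from the EAL memory callbacks mentioned earlier.
The general case is a binary search for the region; the fast path is reported
as available only if every region happens to share the same VA-to-PA offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and names: none of this is existing EAL API. */
struct xlate_region {
	uintptr_t va;  /* start of the virtual range */
	uint64_t pa;   /* physical address backing `va` */
	size_t len;    /* length in bytes; the range is PA-contiguous */
};

struct xlate_table {
	const struct xlate_region *regions; /* sorted by va, non-overlapping */
	size_t n;
};

#define XLATE_BAD_PA UINT64_MAX

/* General case: binary search for the region containing `va`. */
static const struct xlate_region *
xlate_find(const struct xlate_table *t, uintptr_t va)
{
	size_t lo = 0, hi;

	assert(t != NULL);
	hi = t->n;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const struct xlate_region *r = &t->regions[mid];

		if (va < r->va)
			hi = mid;
		else if (va - r->va >= r->len)
			lo = mid + 1;
		else
			return r;
	}
	return NULL;
}

/* Translate VA to PA; returns XLATE_BAD_PA if `va` is not in the table. */
static uint64_t
xlate_va2pa(const struct xlate_table *t, uintptr_t va)
{
	const struct xlate_region *r = xlate_find(t, va);

	return r != NULL ? r->pa + (va - r->va) : XLATE_BAD_PA;
}

/*
 * Best-effort query: true if all regions share one VA-to-PA offset,
 * so that PA == VA + *offset (unsigned wrap-around arithmetic).
 * With IOVA equal to VA, every region would yield offset 0.
 */
static bool
xlate_fixed_offset(const struct xlate_table *t, uint64_t *offset)
{
	uint64_t off;
	size_t i;

	if (t->n == 0)
		return false;
	off = t->regions[0].pa - (uint64_t)t->regions[0].va;
	for (i = 1; i < t->n; i++)
		if (t->regions[i].pa - (uint64_t)t->regions[i].va != off)
			return false;
	*offset = off;
	return true;
}
```

A driver would call xlate_fixed_offset() once at setup: on success it caches
the offset and translates with a single addition, otherwise it falls back to
xlate_va2pa() per address. No command-line option is involved in either case.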