From: "Burakov, Anatoly"
To: Shahaf Shuler, "dev@dpdk.org"
Cc: Olga Shern, Yongseok Koh, "pawelx.wodkowski@intel.com", "gowrishankar.m@linux.vnet.ibm.com", "ferruh.yigit@intel.com", Thomas Monjalon, "arybchenko@solarflare.com", "shreyansh.jain@nxp.com"
Date: Thu, 22 Nov 2018 11:34:11 +0000
Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory

On 22-Nov-18 11:31 AM, Shahaf Shuler wrote:
> Thursday, November 22, 2018 12:42 PM, Burakov, Anatoly:
>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory
>>
>> On 22-Nov-18 10:06 AM, Shahaf Shuler wrote:
>>> Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
>>>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory
>>>>
>>>> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
>>>>> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external memory
>>>>>>
>>>>>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
>>>>>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external memory
>>>>>>>>
>>>>>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>>>>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external memory
>>>>>>>>>>
>>>>>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>>
>>>>>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external memory
>>>>>>>>>>>>
>>>>>
>>>>> [...]
>>>
>>> [...]
>>>
>>> I thought the 18.11 feature should serve both cases.
>>> To use memory not managed by DPDK, the user should create a heap and populate it with the memory, so that it can be used later for mempools/rings.
>>
>> The feature introduced in 18.11 expects you to use DPDK allocation methods with the added memory. It is not a supported use case to create an external memory area, register it with a heap *and* use manual memory management on that area (e.g. populate a mempool anonymously).
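For context, the flow that 18.11 does support looks roughly like the sketch below - heap/pool names and sizes are made up and error handling is omitted; the point is that DPDK does the allocation and the DMA mapping for you once the memory has been added to the heap:

#include <rte_malloc.h>
#include <rte_mbuf.h>
#include <rte_memory.h>

/* Sketch only: the application supplies the memory, DPDK handles allocation
 * and DMA mapping once the memory has been added to the named heap. */
static struct rte_mempool *
pool_from_external_memory(void *addr, size_t len, rte_iova_t iova[],
		unsigned int n_pages, size_t page_sz)
{
	int socket_id;

	rte_malloc_heap_create("ext_heap");
	rte_malloc_heap_memory_add("ext_heap", addr, len, iova, n_pages, page_sz);

	/* the external heap behaves like an extra NUMA node */
	socket_id = rte_malloc_heap_get_socket("ext_heap");

	return rte_pktmbuf_pool_create("ext_pool", 8192, 256, 0,
			RTE_MBUF_DEFAULT_BUF_SIZE, socket_id);
}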
>>
>> Perhaps there should be a new API, something like "register_memory", where we only register memory, but do not expect the user to use the DPDK allocator with said memory.
>
> Isn't it the existing rte_vfio_dma_map?

It does provide DMA mapping, but it doesn't register memory in DPDK's fbarrays, so any fbarray-related calls (i.e. rte_virt2iova etc.) will fail.

>
>> I wish i would've thought of that before, where have you been all this time! :D I think this feature should definitely go into 19.02. Are we past the v1 deadline?
>>
>>>
>>>>
>>>> The above scenario (working with anonymous memory in mempools) is still "broken" in this sense - and i see no general way to fix it. It could work with mempools in that we could automatically create new heaps behind the scenes and populate them with new memory, but for other data structures (e.g. rte_ring_init() over anonymous memory), it's just not possible to do that.
>>>>
>>>> I personally would prefer these API's to just go away and people started using the "official" way to manage external memory. Removing the API's may even be feasible for everything bar mempool, and mempool support we can manage more-or-less automatically in the above way (by automatically creating an external heap behind the scenes). But then by the time we're creating/adding external memory heaps inside mempool API's, you don't really need the external memory DMA mapping API calls, because the mempool would do that automatically by way of using the external heaps API.
>>>
>>> Yes, one way to go is to map the memory internally inside the mempool creation. But it will not work for all cases.
>>> For example, if the application wants to work with an mbuf with an external buffer (EXT_ATTACHED_MBUF), the buffer will be attached on a different call (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
>>> Maybe we will find more cases in the future. i would prefer to solve it for all cases. See more below.
>>>
>>>>
>>>> In other words, the problem here isn't so much usability as it is the fact that, for some stuff, a lot of internal code now relies on DPDK's internal page tables, and we have to either make allowances for that on a case-by-case basis (by the way, no one came to me with these issues!),
>>>
>>> One option is case by case, the other can be to allow the application to explicitly map this memory to the different devices.
>>>
>>>> or admit that using unregistered memory is a bad idea and let's not do that in the future.
>>>
>>> I think it is unavoidable (to use unregistered memory). The use case exists, there are applications which prefer to have their own memory management.
>>> For those apps, we need to have a solution.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> On the other hand, if you were to use it in a different way - for example, allocating hash tables or other DPDK data structures - then such a feature is essential.
>>>>>
>>>>> Hash tables that the NIC needs to read/write to/from?
>>>>> I didn't get the point here.
>>>>
>>>> The point wasn't hash tables, but support for use cases we might not have thought of. That includes doing strange things like allocating DPDK data structures in non-standard memory locations.
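As an aside, that is what the heaps approach buys us - any socket-aware API works on external memory unchanged. A sketch only, reusing the made-up heap name from the earlier example:

#include <rte_ring.h>
#include <rte_malloc.h>

/* "ext_heap" is the made-up heap created earlier from external memory */
static struct rte_ring *
ring_from_external_memory(void)
{
	int socket_id = rte_malloc_heap_get_socket("ext_heap");

	/* the ring is carved out of the external heap exactly as it would be
	 * out of a regular NUMA node - DMA mapping already taken care of */
	return rte_ring_create("ext_ring", 1024, socket_id,
			RING_F_SP_ENQ | RING_F_SC_DEQ);
}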
>>>>
>>>>>
>>>>>> The entire point was to allow using external memory with semantics identical to how you use the rest of DPDK.
>>>>>
>>>>> I understand this design choice, and it has the benefit of consistency vs. more API calls on the application side.
>>>>>
>>>>>>
>>>>>> Also, whether it's "intuitive" depends on perspective - you say "i expect to allocate memory and map it for DMA myself", i say "why do i have to care about DMA mapping, DPDK should do this for me" :)
>>>>>
>>>>> Right 😊. We should be driven by customers and not vendors, too bad there is not much feedback from them in the community.
>>>>
>>>> That implies the current design *wasn't* driven by customers, which is not true. They may not be the same as the ones you are aware of - but then this isn't really my fault :)
>>>>
>>>>> I know from the VPP side that they prefer not to use the rte_socket but rather to manage the memory by themselves, and they are perfectly fine with mapping it explicitly. The same is true for another company that has a vswitch solution (which I cannot mention by name).
>>>>
>>>> Well, it's not like i can do anything if no one from either community comes here and engages in the discussion. I wish i could read people's minds, but alas :)
>>>>
>>>>>
>>>>>
>>>>>> If you are using your own memory management and doing everything manually - you get to map everything for DMA manually. If you are using DPDK facilities - it is intuitive that for any DPDK-managed memory, internal or external, the same rules apply across the board - DMA mapping included.
>>>>>
>>>>> So if I understand your point here, your claim is "we need both". One for users who prefer DPDK to do that for them, and the other for users who want more control.
>>>>> However, when keeping the API of explicit mapping, we can no longer say that external memory behaves exactly the same as internal.
>>>>
>>>> Well, more or less that, yes. It would be good to have full support for "external" memory (in the "unregistered with DPDK" sense, as outlined above).
>>>>
>>>>> So going back to my RFC 😊, if we decide to keep both approaches, what I suggest is to change the VFIO mapping one to be more generic and per-device (can be rte_device).
>>>>> I think the biggest questions we need to answer are 1. are we OK with a generic, vendor-agnostic API?
>>>>
>>>> Yes, generic and vendor-agnostic FTW!
>>>>
>>>>> 2. are we OK that the explicit mapping is done per device?
>>>>
>>>> This question is slightly misleading, because that's not *all* there is to it. In general, yes, i'm OK with the idea. In practice, VFIO may get in the way.
>>>>
>>>> VFIO works by mapping memory for a VFIO container. The container can have one or more devices in it, and how many devices it can have depends on the IOMMU setup (i.e. it may not be possible to have one device per container on some setups, which is why currently everything is done with one container).
>>>>
>>>> So, "per-device" means we either keep each device in a separate container (which will not work in some cases), or we somehow work around the fact that mappings can be performed only once (i'm not sure if VFIO will return appropriate error codes that would allow us to distinguish failed mappings due to duplicates from failed mappings due to other reasons - i'll have to research this), and keep all devices in the same container.
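For reference, the check itself should be a small experiment - something like the sketch below (untested; it assumes rte_vfio_get_container_fd() hands back the default container, and that the type1 driver reports overlapping mappings with EEXIST, which is exactly the part that needs verifying):

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>
#include <rte_vfio.h>

/* map the same IOVA range twice into the default container and check what the
 * second attempt reports - hopefully EEXIST, so duplicates can be told apart
 * from genuine failures. Returns 1 if duplicate was reported as EEXIST,
 * 0 if the kernel silently accepted it, negative errno otherwise. */
static int
probe_duplicate_mapping(void *addr, uint64_t iova, uint64_t len)
{
	struct vfio_iommu_type1_dma_map dma_map = {
		.argsz = sizeof(dma_map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)addr,
		.iova = iova,
		.size = len,
	};
	int fd = rte_vfio_get_container_fd();

	if (ioctl(fd, VFIO_IOMMU_MAP_DMA, &dma_map) < 0)
		return -errno;
	if (ioctl(fd, VFIO_IOMMU_MAP_DMA, &dma_map) < 0)
		return errno == EEXIST ? 1 : -errno;
	return 0;
}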
>>>
>>> Yes, I see here 2 options:
>>> 1. VFIO has good error reporting that enables the user to understand that a mapping failed because it already exists.
>>> 2. to hold a database which will say which IOVAs were already mapped, so the DPDK VFIO mapping will know to avoid them.
>>
>> Option 2 is difficult to support with secondary processes. Secondary processes also use the same container and thus should have the same view of what is mapped and what isn't. Given that VFIO mapping may happen during memory allocation, we won't be able to allocate more memory to hold extra mappings, so shared memory won't work here. We could use fbarray for this, but that would make the amount of mappings fixed.
>>
>>>
>>>>
>>>> In other words, it may be possible to solve this issue for VFIO, but i would not be comfortable proceeding without knowing that this can be done in a sane way.
>>>
>>> +1.
>>>
>>> I think the simplest is #1 above. How can we get an answer on it?
>>
>> A few experiments should do it :) I'll get back to you on that.
>>
>>>
>>>>
>>>> There's also a question of automatic vs. manual mapping. The way i see it, the choices are:
>>>>
>>>> 1) drop automatic registration for DMA altogether, for both internal and external memory
>>>> 2) drop automatic registration for DMA for external memory only, but keep automatic DMA registration for internal memory
>>>> 3) do automatic DMA registration for both
>>>>
>>>> I don't like the second choice for API consistency reasons, so i would rather go either for 1) (and cause massive breakage of all existing applications along the way...), or 3) (the way it's working right now). The pragmatic choice *may* be 2), which is what i feel you are advocating for, but it has its own problems, mainly the mechanics of it. If we can get them worked out, i may just bite the bullet and ack the patches 😃
>>>
>>> I don't think we need to do such extensive changes.
>>>
>>> I think it is OK to have 2 ways for the application to work (the heaps and the explicit DMA call). After all, DPDK is a development kit and we should make the application's life easier for different use cases.
>>> If the application wants the DPDK to help, it can create the heap and everything will just happen. Otherwise the application will do the memory management by itself, and use the DMA call explicitly.
>>
>> In other words, what you are proposing is option 3), with the addition of support for manual memory/DMA mapping management by way of the new API proposed above (the register_memory() call). Am i understanding you correctly?
>
> Exact.
>
>>
>>>
>>>>
>>>> Specifically, let's suppose we want automatic memory registration for DMA on internal but not external memory. How does it happen? Currently, it happens via mem event callbacks.
>>>>
>>>> If we were to drop the callback approach for external memory, we would lose out on additional functionality provided by said callbacks, and i would really like to avoid that.
>>>>
>>>> If we were to switch to manual mapping, that means EAL will depend on ethdev (you have suggested rte_device as an alternative - would that fix the issue?)
>>>
>>> No it won't.
>>> It will help only for the cases where we have multiple devices on the same PCI address (like representors).
>>
>> So, should it be an rte_device call then?
>
> I think it should be rte_device. So that we can do the same for crypto devices (and others).
> For the representor case, PMDs can have an iterator to register the memory to all of the related eth devices.
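To make the shape of that concrete, a per-device call could look something like the sketch below - purely hypothetical, the names are made up and nothing like this exists yet:

#include <stdint.h>
#include <stddef.h>
#include <rte_dev.h>

/* Hypothetical sketch only - not an existing DPDK API. A generic,
 * vendor-agnostic, per-device mapping call might look like this: */
int rte_dev_dma_map(struct rte_device *dev, void *addr, uint64_t iova,
		size_t len);
int rte_dev_dma_unmap(struct rte_device *dev, void *addr, uint64_t iova,
		size_t len);

/* e.g. for an ethdev: rte_dev_dma_map(eth_dev->device, addr, iova, len);
 * the bus/PMD behind it would do the actual VFIO (or other) mapping, and a
 * representor PMD could fan the call out to all related devices. */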
>
>>
>>>
>>>> , and we will have to rewrite the current DMA mapping code to do mappings explicitly, rather than relying on the callbacks mechanism. So, instead of e.g. VFIO mapping being performed via a callback, we would instead iterate over each rte_device and call its respective DMA mapping function explicitly - for internal memory, but not for external. This *may* work, but i need to think it through a bit more :)
>>>>
>>>> This also still leaves the question of external unregistered memory - this is a wholly separate issue.
>>>>
>>>> TL;DR i'm tentatively OK with the proposal, but we need to work certain things out before we go too far along the path and find out that we've reached a dead end.
>>>
>>> Fully agree, this is what the RFC is for.
>>> I think our biggest question is whether or not VFIO will behave OK with double mapping, or whether we need some tracking mechanism in DPDK.
>>
>> Yes, i will check this for VFIO and get back to you.
>
> Great, thanks.
>
>>
>>>
>>>>
>>>> --
>>>> Thanks,
>>>> Anatoly
>>
>>
>> --
>> Thanks,
>> Anatoly

--
Thanks,
Anatoly