From: "Burakov, Anatoly"
Date: Mon, 19 Nov 2018 17:18:18 +0000
To: Shahaf Shuler, "dev@dpdk.org"
Cc: Olga Shern, Yongseok Koh, "pawelx.wodkowski@intel.com", "gowrishankar.m@linux.vnet.ibm.com", "ferruh.yigit@intel.com", Thomas Monjalon, "arybchenko@solarflare.com", "shreyansh.jain@nxp.com"
Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory

On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external memory
>>
>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external memory
>>>>
>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>> Hi Anatoly,
>>>>>
>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external memory
>>>>>>
>>>>>> Hi Shahaf,
>>>>>>
>>>>>> Great to see such effort! A few comments below.
>>>>>>
>>>>>> Note: halfway through writing my comments I realized that I am starting with an assumption that this API is a replacement for the current VFIO DMA mapping APIs. So, if my comments seem out of left field, this is probably why :)
>>>>>>
>>>>>> On 04-Nov-18 12:41 PM, Shahaf Shuler wrote:
>>>>>>> Request for comment on the high-level changes presented in this patch.
>>>>>>>
>>>>>>> The need to use external memory (memory belonging to the application and not part of the DPDK hugepages) is already present. It starts with storage apps which prefer to manage their own memory blocks for efficient use of the storage device, continues with GPU-based applications which strive to achieve zero copy while processing the packet payload on the GPU core, and ends with vSwitch/vRouter applications which simply prefer to have full control over the memory in use (e.g. VPP).
>>>>>>>
>>>>>>> Recent work[1] in DPDK enabled the use of external memory, however it mostly focuses on VFIO as the only way to map memory. While VFIO is common, there are other vendors which use different ways to map memory (e.g. Mellanox and NXP[2]).
>>>>>>>
>>>>>>> The work in this patch moves the DMA mapping to vendor-agnostic APIs located under ethdev. The choice of ethdev was made because a memory map should be associated with a specific port (or ports). Otherwise the memory is mapped multiple times to different frameworks, and ends up wasting memory on redundant translation tables in the host or in the device.
>>>>>>
>>>>>> So, anything other than ethdev (e.g. cryptodev) will not be able to map memory for DMA?
>>>>>
>>>>> That is a fair point.
>>>>>
>>>>>> I have thought about this for some length of time, and I think DMA mapping belongs in EAL (more specifically, somewhere at the bus layer), rather than at device level.
>>>>>
>>>>> I am not sure I agree here. For example, take Intel and Mellanox devices. Both are PCI devices, so how will you distinguish which mapping API to use?
>>>>> Also, I still think the mapping should be at device granularity and not bus/system granularity, since it is very typical for a memory area to be used for DMA by a specific device.
>>>>>
>>>>> Maybe we can say the DMA mapping is an rte_device attribute. It is the parent class for all the DPDK devices. We need to see w/ vport representors (which all have the same rte_device). In that case I believe the rte_device.map call can register the memory to all of the representors as well (if needed).
>>>>>
>>>>>> Placing this functionality at device level comes with more work to support different device types and puts a burden on device driver developers to implement their own mapping functions.
>>>>>
>>>>> The mapping function can be shared. For example, we can still maintain the VFIO mapping scheme as part of EAL and have all the related drivers call this function. The only overhead will be to maintain the function pointer for the DMA call.
>>>>> With this work, instead of the EAL layer guessing which type of DMA mapping the devices in the system need, or alternatively forcing them all to work w/ VFIO, each driver will select its own function. The driver is the only one which knows what type of DMA mapping its device needs.
>>>>>
>>>>>> However, I have no familiarity with how MLX/NXP devices do their DMA mapping, so maybe the device-centric approach would be better. We could provide "standard" mapping functions at the bus level (such as VFIO mapping functions for the PCI bus), so that these would not have to be reimplemented in the devices.
>>>>>
>>>>> Yes, like I said above, I wasn't intending to re-implement all the mapping functions again in each driver. Yet, I believe it should be per device.
>>>>>
>>>>>> Moreover, I'm not sure how this is going to work for VFIO. If this is to be called for each NIC that needs access to the memory, then we'll end up with double mappings for any NIC that uses VFIO, unless you want each NIC to be in a separate container.
>>>>>
>>>>> I am not very familiar w/ VFIO (you are the expert 😊).
>>>>> What will happen if we map the same memory twice (under the same container)? Will the translation on the IOMMU be doubled? Will the map return an error that this memory mapping already exists?
>>>>
>>>> The latter. You can't map the same memory twice in the same container.
>>>> You can't even keep NICs in separate containers, because then secondary processes won't work without serious rework.
>>>>
>>>> So, all VFIO-mapped things will need to share the mappings.
>>>>
>>>> It's not an insurmountable problem, but if we're going to share mapping status for VFIO (i.e. track which area is already mapped), then what's stopping us from doing the same for other DMA mapping mechanisms? I.e. instead of duplicating the mappings in each driver, provide some kind of mechanism for devices to share the DMA mapping caches. Apologies if I'm talking nonsense - I'm completely unfamiliar with how DMA mapping works for MLX/NXP devices :)
>>>
>>> Unfortunately it cannot be done, at least w/ Mellanox.
>>> In Mellanox the kernel driver is the one which maps the memory. The mapping returns a key which identifies a memory region that was just registered to the device.
>>> There is a complete separation between the ports, meaning one port's mapping cannot be used by the other port, even if the key is known.
>>>
>>> The separation is not only between ports, but also between processes (two primary ones; for secondary processes we have a way to share). If two processes work on the same device, they must register the memory independently.
>>
>> Ah, OK.
>>
>> So, we're right back to where we started. Right now, external memory expects to behave the same way as all other memory - you don't need to perform DMA mapping for it.
>>
>> That said, part of the reason *why* it was done that way was because there is no way to trigger VFIO DMA mapping for NXP (or was it MLX?) devices. If you look at initial versions of the patchset, the DMA mapping was actually done manually. Then, I became convinced that doing this automatically is the way to go, both because it erases the usability differences as far as memory types are concerned, and because it enables whatever services are subscribing to memory events to receive notifications about external memory as well (i.e. consistency).
>>
>> Given that it's still an experimental API, we can tinker with it all we like, so it's not set in stone. However, I would really like to keep the current automagic thing, because DMA mapping may not be the only user of memory callbacks - they can be used for debug purposes, or for other things.
>
> Memory callbacks are good to have regardless.
> The question we need to answer is whether or not we are going to provide the DMA map abstraction for external memory. See more below.
>
>>>>>>> For example, consider a host with Mellanox and Intel devices. Mapping memory without specifying to which port will end up with both IOMMU registration and Verbs (Mellanox DMA map) registration. Another example can be two Mellanox devices on the same host. The memory will be mapped for both, even though the application will use a mempool per device.
>>>>>>>
>>>>>>> To use the suggested APIs, the application will allocate a memory block and will call rte_eth_dma_map. It will map it to every port that needs DMA access to this memory.
>>>>>>
>>>>>> This bit is unclear to me. What do you mean "map it to every port that needs DMA access to this memory"? I don't see how this API solves the above problem of mapping the same memory to all devices. How does a device know which memory it will need? Does the user specifically have to call this API for each and every NIC they're using?
>>>>>
>>>>> Yes, the user will call this API for every port which needs to have DMA access to this memory.
>>>>> Remember, we are speaking here about external memory that the application allocated and wants to use for send/receive. The device doesn't guess which memory it will need; the user tells it explicitly.
>>>>>
>>>>>> For DPDK-managed memory, everything will still get mapped to every device automatically, correct?
>>>>>
>>>>> Yes, even though it is not the case today.
>>>>
>>>> What do you mean it is not the case? It is the case today. When an external memory chunk is registered at the heap, a mem event callback is triggered just like for regular memory, and this chunk does get mapped to VFIO as well as to any other subscribed entity. As I recall, NXP NICs are currently set up to ignore externally allocated memory, but for the general VFIO case, everything is mapped automatically.
>>>>
>>>>>> If so, then such a manual approach for external memory will be bad for both usability and drop-in replacement of internal memory with external memory, because it introduces inconsistency between using internal and external memory. From my point of view, either we do *everything* manually (i.e. register all memory for DMA explicitly) and thereby avoid this problem but keep the consistency, or we do *everything* automatically and deal with duplication of mappings somehow (say, by MLX/NXP drivers sharing their mappings through the bus interface).
>>>>>
>>>>> I understand your point, however I am not sure external and internal memory *must* be consistent.
>>>>> The DPDK-managed memory is part of the DPDK subsystems, and the DPDK libs are preparing it for optimal use by the underlying devices. The external memory is different: it is proprietary memory the application allocated, and DPDK cannot do anything with it in advance.
>>>>
>>>> My view when designing external memory support was that it should behave like regular DPDK memory for all intents and purposes, and be a drop-in replacement, should you choose to use it. I.e. the application should not care whether it uses internal or external memory - it all sits in the same malloc heaps, it all uses the same socket ID mechanisms, etc. - for all intents and purposes, they're one and the same.
>>>>
>>>>> Even today there is inconsistency, because if the user wants to use external memory they must map it (rte_vfio_dma_map), while they don't need to do that for the DPDK-managed memory.
>>>>
>>>> Well, now I see where your confusion stems from :) You didn't know about this:
>>>>
>>>> http://git.dpdk.org/dpdk/tree/lib/librte_eal/common/malloc_heap.c#n1169
>>>>
>>>> This will trigger all mem event callbacks, including the VFIO DMA mapping callback.
>>>
>>> I see, so I am indeed confused 😊.
>>> In which cases should the application call the existing rte_vfio_dma_map, if the memory is already mapped and the only way to work with it is through the rte_malloc mechanism?
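(For context, the existing call being asked about here is system-wide rather than per-port. A minimal sketch of that manual path, assuming the application already knows the IOVA of its buffer and using the experimental rte_vfio_dma_map() available in DPDK 18.11; error handling omitted:)

#include <stddef.h>
#include <stdint.h>
#include <rte_memory.h>
#include <rte_vfio.h>

/* Sketch: manually map an application-owned buffer for DMA via VFIO.
 * This maps [addr, addr + len) into the default VFIO container, i.e. it
 * is not tied to any particular port or device. */
static int
map_buf_for_dma(void *addr, rte_iova_t iova, size_t len)
{
	return rte_vfio_dma_map((uint64_t)(uintptr_t)addr, iova, len);
}

(The RFC under discussion would replace this with a per-port rte_eth_dma_map() call, issued once for every port that needs DMA access to the buffer; its exact signature is not shown in the quoted text.)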
>>
>> I have to be honest, I didn't consider this question before :D I guess there could be cases where using rte_malloc might not be suitable, because it wastes some memory on malloc elements, i.e. if you want to use N pages as memory, you'd have to allocate N+1 pages. If memory is at a premium, maybe manual management of it would be better in some cases.
>
> I had similar thoughts, more related to the usability from the user's side.
> When the application allocates external memory it just wants to use it for DMA, i.e. set it as the mbuf buf_addr or populate a mempool with it.
> It is an "overhead" to create a socket for this external memory, to populate it with the memory, and later on to malloc from this socket (or use the socket id for the mempool creation).
> Not to mention the fact that maybe the application wants to manage this memory differently than rte_malloc does.
>
> On the other hand, mapping memory to a device before using it for DMA is far more intuitive.

It is far more intuitive *if* you're doing all of the memory management yourself or "just" using this memory for a mempool. This was already working before, and if you had that as your use case, there is no need for the external memory feature. On the other hand, if you were to use it in a different way - for example, allocating hash tables or other DPDK data structures - then such a feature is essential. The entire point was to allow using external memory with semantics identical to how you use the rest of DPDK.

Also, whether it's "intuitive" depends on perspective - you say "I expect to allocate memory and map it for DMA myself", I say "why do I have to care about DMA mapping, DPDK should do this for me" :) If you are using your own memory management and doing everything manually - you get to map everything for DMA manually. If you are using DPDK facilities - it is intuitive that for any DPDK-managed memory, internal or external, the same rules apply across the board - DMA mapping included.

>>>>> I guess we can add a flag on the device mapping which will say MAP_TO_ALL_DEVICES, to ease the application's life in the presence of multiple devices in the host.

-- 
Thanks,
Anatoly
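For reference, a minimal sketch of the heap-based flow described in the thread, using the experimental external-memory API that shipped in DPDK 18.11 (rte_malloc_heap_create() and friends). The buffer, its per-page IOVA table and the page size are assumed to be supplied by the application, the heap and pool names are arbitrary, and error handling is trimmed:

#include <rte_malloc.h>
#include <rte_mbuf.h>
#include <rte_memory.h>

/*
 * Sketch: make application-owned memory usable as a regular DPDK "socket".
 * Adding the memory to a named heap fires the mem event callbacks, which is
 * what triggers the automatic DMA mapping (e.g. VFIO) discussed above.
 */
static struct rte_mempool *
pool_from_external_memory(void *addr, rte_iova_t iova[],
			  unsigned int n_pages, size_t pgsz)
{
	int socket;

	if (rte_malloc_heap_create("app_heap") != 0)
		return NULL;

	/* registration - this is where subscribers learn about the memory */
	if (rte_malloc_heap_memory_add("app_heap", addr, iova,
				       n_pages, pgsz) != 0)
		return NULL;

	socket = rte_malloc_heap_get_socket("app_heap");
	if (socket < 0)
		return NULL;

	/* from here on the external memory behaves like any other socket */
	return rte_pktmbuf_pool_create("ext_pool", 8192, 256, 0,
				       RTE_MBUF_DEFAULT_BUF_SIZE, socket);
}

Whether the DMA mapping triggered by rte_malloc_heap_memory_add() should stay automatic, or be replaced by explicit per-device calls such as the rte_eth_dma_map() proposed in this RFC, is the open question of the thread.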