DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
To: Shahaf Shuler <shahafs@mellanox.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: Olga Shern <olgas@mellanox.com>,
	Yongseok Koh <yskoh@mellanox.com>,
	"pawelx.wodkowski@intel.com" <pawelx.wodkowski@intel.com>,
	"ferruh.yigit@intel.com" <ferruh.yigit@intel.com>,
	Thomas Monjalon <thomas@monjalon.net>,
	"arybchenko@solarflare.com" <arybchenko@solarflare.com>,
	"shreyansh.jain@nxp.com" <shreyansh.jain@nxp.com>
Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory
Date: Thu, 22 Nov 2018 11:34:11 +0000
Message-ID: <ce73612b-d77b-d433-4159-2a24d9704a09@intel.com> (raw)
In-Reply-To: <DB7PR05MB442687896A9E13DE16E486C1C3DB0@DB7PR05MB4426.eurprd05.prod.outlook.com>

On 22-Nov-18 11:31 AM, Shahaf Shuler wrote:
> Thursday, November 22, 2018 12:42 PM, Burakov, Anatoly:
>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
>> external memory
>> On 22-Nov-18 10:06 AM, Shahaf Shuler wrote:
>>> Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
>>>> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping
>>>> for external memory
>>>> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
>>>>> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>> external memory
>>>>>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
>>>>>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>> external memory
>>>>>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>>>>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>>>> external memory
>>>>>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>>>>>>>> Hi Anatoly,
>>>>>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping
>> for
>>>>>>>>>>>> external memory
>>>>> [...]
>>> [...]
>>> I thought the 18.11 feature should serve both cases.
>>> To use memory not managed by DPDK user should create heap and
>> populate it w/ the memory, so that it can use it later for mempool/rings.
>> The feature introduced in 18.11 expects you to use DPDK allocation methods
>> with added memory. It is not a supported use case to create an external
>> memory area, register it with heap *and* use manual memory management
>> on that area (e.g. populate mempool anonymously).
>> Perhaps there should be a new API, something like "register_memory",
>> where we only register memory, but do not expect the user to use DPDK
>> allocator with said memory.
> Isn't it the existing rte_vfio_dma_map?

It does provide DMA mapping, but it doesn't register memory in DPDK's 
fbarrays, so any fbarray-related calls (i.e. rte_virt2iova etc.) will fail.

> I wish i would've thought of that before, where
>> have you been all this time! :D I think this should feature definitely go into
>> 19.02. Are we past the v1 deadline?
>>>> The above scenario (working with anonymous memory in mempools) is
>>>> still "broken" in this sense - and i see no general way to fix it. It
>>>> could work with mempools in that we could automatically create new
>>>> heaps behind the scenes and populate them with new memory, but for
>>>> other data structures (e.g. rte_ring_init() over anonymous memory),
>>>> it's just not possible to do that.
>>>> I personally would prefer these API's to just go away and people
>>>> started using the "official" way to manage external memory. Removing
>>>> the API's may even be feasible for everything bar mempool, and
>>>> mempool support we can manage more-or-less automatically in the
>> above
>>>> way (by automatically creating an external heap behind the scenes).
>>>> But then by the time we're creating/adding external memory heaps
>>>> inside mempool API's, you don't really need the external memory DMA
>>>> mapping API calls because the mempool would do that automatically by
>>>> way of using the external heaps API.
>>> Yes, one way to go is to map the memory internally inside the mempool
>> creation. But it will not work for all cases.
>>> For example, if application wants to work w/ mbuf with external buffer
>> (EXT_ATTACHED_MBUF), the buffer will be attached on a different call
>> (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
>>> Maybe we will find more cases in the future. i would prefer to solve it for all
>> cases.  See more below.
>>>> In other words, the problem here isn't so much usability as it is the
>>>> fact that, for some stuff, a lot of internal code now relies on
>>>> DPDK's internal page tables, and we have to either make allowances
>>>> for that on a case-by-case basis (by the way, no one came to me with
>>>> these issues!),
>>> One option is case by case, other can be to allow the application to explicit
>> map this memory to the different devices.
>>> or admit that using
>>>> unregistered memory is a bad idea and let's not do that in the future.
>>> I think it is un-avoidable (to use unregistered memory). The use case exists,
>> there are applications which prefer to have their own memory management.
>>> For those apps, we need to have a solution.
>>>>>> On the other hand, if you were to use it in a different way - for
>>>>>> example, allocating hash tables or other DPDK data structures -
>>>>>> then such a feature is essential.
>>>>> Hash tables that the NIC needs to read/write to/from?
>>>>> I didn't get the point here.
>>>> The point wasn't hash tables, but support for use cases we might not
>>>> have thought of. That includes doing strange things like allocating
>>>> DPDK data structures in non-standard memory locations.
>>>>> The entire point was to allow using external memory with
>>>>>> semantics identical to how you use the rest of DPDK.
>>>>> I understand this design choice, and it has the benefit of consistency v.s.
>>>> more API call on the application side.
>>>>>> Also, whether it's "intuitive" depends on perspective - you say "i
>>>>>> expect to allocate memory and map it for DMA myself", i say "why do
>>>>>> i have to care about DMA mapping, DPDK should do this for me" :)
>>>>> Right 😊. We should be driven by customers and not vendors, too bad
>>>>> there
>>>> is no much feedback from them on the community.
>>>> That implies current design *wasn't* driven by customers, which is not
>> true.
>>>> They may not be the same as ones you are aware of - but then this
>>>> isn't really my fault :)
>>>>> I know from VPP side they prefer not to use the rte_socket rather to
>>>> manage the memory by them self, and they are perfectly fine with
>>>> mapping it explicitly. This is true for other company that has
>>>> vswitch solution (which I cannot mention by name).
>>>> Well, it's not like i can do anything if no one from either
>>>> communities comes here and engages in the discussion. I wish i could
>>>> read people's minds, but alas :)
>>>>> If you are using your
>>>>>> own memory management and doing everything manually - you get to
>>>> map
>>>>>> everything for DMA manually. If you are using DPDK facilities
>>>>>> - it is intuitive that for any DPDK-managed memory, internal or
>>>>>> external, same rules apply across the board - DMA mapping included.
>>>>> So if understand you point here, your claim "we need both". One for
>>>>> users
>>>> which prefer the DPDK to do that for them and the other for the users
>>>> who wants more control.
>>>>> However, when keeping the API of explicit mapping, we can no longer
>>>>> say
>>>> that external memory is behaves exactly the same as internal.
>>>> Well, more or less that, yes. It would be good to have full support
>>>> for "external" memory (in the "unregistered with DPDK" sense, as
>>>> outlined above).
>>>>> So going back to my RFC 😊, if we decide to keep both approach, what
>>>>> I
>>>> suggest is to change the VFIO mapping one to be more generic and per
>>>> device (can be rte_device).
>>>>> I think the biggest questions we need to answer are 1. are we OK
>>>>> with generic, vendor agnostic API?
>>>> Yes, generic and vendor-agnostic FTW!
>>>>> 2.  are we OK that the explicit mapping is done per device?
>>>> This question is slightly misleading, because that's not *all* there
>>>> is to it. In general, yes, i'm OK with the idea. In practice, VFIO may get in
>> the way.
>>>> VFIO works by mapping memory for a VFIO container. The container can
>>>> have one or more devices in it, and how many devices it can have
>>>> depends on the IOMMU setup (i.e. it may not be possible to have one
>>>> device per container on some setups, which is why currently
>>>> everything is done with one container).
>>>    So, "per-device" means we either keep each device in
>>>> separate container (which will not work in some cases),
>>> or we somehow
>>>> work around the fact that mappings can be performed only once (i'm
>>>> not sure if VFIO will return appropriate error codes that would allow
>>>> us to distinguish failed mappings due to duplicate, or failed
>>>> mappings due to other reasons - i'll have to research this), and keep
>>>> all devices in the same container.
>>> Yes I see here 2 options:
>>> 1. the VFIO has good error reports that enables the user to understand
>> mapping failed because it is already exists.
>>> 2. to hold database which will say which IOVA were already mapped so the
>> DPDK vfio mapping will know to avoid it.
>> Option 2 is difficult to support with secondary processes. Secondary
>> processes also use the same container and thus should have the same view
>> of what is mapped and what isn't. Given that VFIO mapping may happen
>> during memory allocation, we won't be able to allocate more memory to hold
>> extra mappings, so shared memory won't work here. We could use fbarray
>> for this, but that would make the amount of mappings fixed.
>>>> In other words, it may be possible to solve this issue for VFIO, but
>>>> i would not be comfortable to proceed without knowing that this can
>>>> be done in a sane way.
>>> +1.
>>> I think the simplest is #1 above. How can we get an answer on it?
>> A few experiments should do it :) I'll get back to you on that.
>>>> There's also a question of automatic vs. manual mapping. The way i
>>>> see it, the choices are:
>>>> 1) drop automatic registration for DMA altogether, for both internal
>>>> and external memory
>>>> 2) drop automatic registration for DMA for external memory only, but
>>>> keep automatic DMA registration for internal memory
>>>> 3) do automatic DMA registration for both
>>>> I don't like the second choice for API consistency reasons, so i
>>>> would rather go either for 1) (and cause massive breakage of all
>>>> existing applications along the way...), or 3) (the way it's working right
>> now).
>>>> The pragmatic choice *may* be 2), which is what i feel you are
>>>> advocating for, but it has its own problems, mainly the mechanics of it.
>>>> If we can get them worked out, i may just bite the bullet and ack the
>>>> patches
>>>> 😃
>>> I don't think we need to do such extensive changes.
>>> I think it is ok to have 2 ways for application to work (the heaps and the
>> explicit DMA call). After all, DPDK is a development kit and we should make
>> application life easier for different use cases.
>>> If the application wants the DPDK to help, it can create the heap and
>> everything will just happen. Otherwise application will do the memory
>> management by itself, and use the DMA call explicitly.
>> In other words, what you are proposing is option 3), with an addition of
>> support for manual memory/DMA mapping management by way of the new
>> API proposed above (register_memory() call). Am i understanding you
>> correctly?
> Exact.
>>>> Specifically, let's suppose we want automatic memory registration for
>>>> DMA on internal but not external memory. How does it happen?
>>>> Currently, it happens via mem event callbacks.
>>>> If we were to drop the callback approach for external memory, we
>>>> would lose out on additional functionality provided by said
>>>> callbacks, and i would really like to avoid that.
>>>> If we were to switch to manual mapping, that means EAL will depend on
>>>> ethdev (you have suggested rte_device as an alternative - would that
>>>> fix the
>>>> issue?)
>>> No it won't.
>>> It will help only for the cases were we have multiple devices on the same
>> PCI address (like representors).
>> So, should it be an rte_device call then?
> I think it should be rte_device.  So that we can do the same for crypto devices (and others).
> For the representor case,  PMDs can have iterator to register the memory to all of the related eth devices.
>>> , and we will have to rewrite current DMA mapping code to do
>>>> mappings explicitly, rather than relying on callbacks mechanism. So,
>>>> instead of e.g. VFIO mapping being performed via callback, we would
>>>> instead iterate over each rte_device and call its respective DMA
>>>> mapping function explicitly - for internal memory, but not for
>>>> external. This
>>>> *may* work, but i need to think it through a bit more :)
>>>> This also still leaves a question of external unregistered memory -
>>>> this is a wholly separate issue.
>>>> TL;DR i'm tentatively OK with the proposal, but we need to work
>>>> certain things out before we go too far along the path and find out
>>>> that we've reached a dead end.
>>> Fully agree, this is why the RFC is for.
>>> I think our biggest question is whether or not VFIO will behave OK with
>> double mapping or we need some tracking mechanism in DPDK.
>> Yes, i will check this for VFIO and get back to you.
> Great thanks.
>>>> --
>>>> Thanks,
>>>> Anatoly
>> --
>> Thanks,
>> Anatoly


  reply	other threads:[~2018-11-22 11:34 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-04 12:41 Shahaf Shuler
2018-11-14 11:19 ` Burakov, Anatoly
2018-11-14 14:53   ` Shahaf Shuler
2018-11-14 17:06     ` Burakov, Anatoly
2018-11-15  9:46       ` Shahaf Shuler
2018-11-15 10:59         ` Burakov, Anatoly
2018-11-19 11:20           ` Shahaf Shuler
2018-11-19 17:18             ` Burakov, Anatoly
     [not found]               ` <DB7PR05MB442643DFD33B71797CD34B5EC3D90@DB7PR05MB4426.eurprd05.prod.outlook.com>
2018-11-20 10:55                 ` Burakov, Anatoly
2018-11-22 10:06                   ` Shahaf Shuler
2018-11-22 10:41                     ` Burakov, Anatoly
2018-11-22 11:31                       ` Shahaf Shuler
2018-11-22 11:34                         ` Burakov, Anatoly [this message]
2019-01-14  6:12                         ` Shahaf Shuler
2019-01-15 12:07                           ` Burakov, Anatoly
2019-01-16 11:04                             ` Shahaf Shuler
2018-11-19 17:04           ` Stephen Hemminger

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ce73612b-d77b-d433-4159-2a24d9704a09@intel.com \
    --to=anatoly.burakov@intel.com \
    --cc=arybchenko@solarflare.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=gowrishankar.m@linux.vnet.ibm.com \
    --cc=olgas@mellanox.com \
    --cc=pawelx.wodkowski@intel.com \
    --cc=shahafs@mellanox.com \
    --cc=shreyansh.jain@nxp.com \
    --cc=thomas@monjalon.net \
    --cc=yskoh@mellanox.com \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

DPDK patches and discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ https://inbox.dpdk.org/dev \
	public-inbox-index dev

Example config snippet for mirrors.
Newsgroup available over NNTP:

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git