DPDK patches and discussions
 help / color / mirror / Atom feed
From: "Burakov, Anatoly" <anatoly.burakov@intel.com>
To: Shahaf Shuler <shahafs@mellanox.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: Olga Shern <olgas@mellanox.com>,
	Yongseok Koh <yskoh@mellanox.com>,
	"pawelx.wodkowski@intel.com" <pawelx.wodkowski@intel.com>,
	"ferruh.yigit@intel.com" <ferruh.yigit@intel.com>,
	Thomas Monjalon <thomas@monjalon.net>,
	"arybchenko@solarflare.com" <arybchenko@solarflare.com>,
	"shreyansh.jain@nxp.com" <shreyansh.jain@nxp.com>
Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory
Date: Tue, 20 Nov 2018 10:55:47 +0000
Message-ID: <1c00477d-266a-400e-44b4-5bcb1f41e4b7@intel.com> (raw)
In-Reply-To: <DB7PR05MB442643DFD33B71797CD34B5EC3D90@DB7PR05MB4426.eurprd05.prod.outlook.com>

On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>> memory
>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for external
>>>> memory
>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>> external memory
>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
>>>>>>> Hi Anatoly,
>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
>>>>>>>> external memory
> [...]
>>>> I have to be honest, I didn't consider this question before :D I
>>>> guess there could be cases where using rte_malloc might not be
>>>> suitable because it wastes some memory on malloc elements, i.e. if
>>>> you want to use N pages as memory, you'd have to allocate N+1 pages.
>>>> If memory is at a premium, maybe manual management of it would be
>> better in some cases.
>>> I had similar thoughts, more related to the usability from the user side.
>>> When application allocated allocates external memory it just wants to use it
>> for DMA, i.e. put it as the mbuf buf_addr or to populate it w/ a mempool.
>>> It is an "overhead" to create a socket for this external memory, to populate
>> it w/ the memory, and later on to malloc from this socket (or use the socket
>> id for the mempool creation).
>>> Not to mention the fact that maybe the application wants to manage this
>> memory differently than how rte_malloc does.
>>> On the other hand, mapping memory to device before using it for dma is
>> far more intuitive.
>> It is far more intuitive *if* you're doing all of the memory management
>> yourself or "just" using this memory for a mempool. This was already working
>> before, and if you had that as your use case, there is no need for the
>> external memory feature.
> I think it was broken in some way when you did the first work on the new memory management.
> Because after this work, PMDs were adjusted to look for the memory only in the fbarrays, and such external memory doesn't exists there.
> otherwise, if working, why you added the support for the external memory? Which other use case you tried to cover?
> It is fixed in some way with the new work to support external memory, but requires changes in the application (the "overhead" I referred before).

I feel we have a difference of terminology here.

What i refer to as "external memory support" is the feature that was 
added in 18.11, that has to do with creating heaps etc. - this memory is 
now (hopefully) properly supported across entire DPDK.

What you're probably referring to as "external memory support" is not 
that - it's the feature of mempools, rings etc. that allow you to use 
memory that's not registered with and not managed by DPDK. This is the 
way "external memory" was done prior to 18.11.

The above scenario (working with anonymous memory in mempools) is still 
"broken" in this sense - and i see no general way to fix it. It could 
work with mempools in that we could automatically create new heaps 
behind the scenes and populate them with new memory, but for other data 
structures (e.g. rte_ring_init() over anonymous memory), it's just not 
possible to do that.

I personally would prefer these API's to just go away and people started 
using the "official" way to manage external memory. Removing the API's 
may even be feasible for everything bar mempool, and mempool support we 
can manage more-or-less automatically in the above way (by automatically 
creating an external heap behind the scenes). But then by the time we're 
creating/adding external memory heaps inside mempool API's, you don't 
really need the external memory DMA mapping API calls because the 
mempool would do that automatically by way of using the external heaps API.

In other words, the problem here isn't so much usability as it is the 
fact that, for some stuff, a lot of internal code now relies on DPDK's 
internal page tables, and we have to either make allowances for that on 
a case-by-case basis (by the way, no one came to me with these issues!), 
or admit that using unregistered memory is a bad idea and let's not do 
that in the future.

>> On the other hand, if you were to use it in a different way - for example,
>> allocating hash tables or other DPDK data structures - then such a feature is
>> essential.
> Hash tables that the NIC needs to read/write to/from?
> I didn't get the point here.

The point wasn't hash tables, but support for use cases we might not 
have thought of. That includes doing strange things like allocating DPDK 
data structures in non-standard memory locations.

> The entire point was to allow using external memory with
>> semantics identical to how you use the rest of DPDK.
> I understand this design choice, and it has the benefit of consistency v.s. more API call on the application side.
>> Also, whether it's "intuitive" depends on perspective - you say "i expect to
>> allocate memory and map it for DMA myself", i say "why do i have to care
>> about DMA mapping, DPDK should do this for me" :)
> Right 😊. We should be driven by customers and not vendors, too bad there is no much feedback from them on the community.

That implies current design *wasn't* driven by customers, which is not 
true. They may not be the same as ones you are aware of - but then this 
isn't really my fault :)

> I know from VPP side they prefer not to use the rte_socket rather to manage the memory by them self, and they are perfectly fine with mapping it explicitly. This is true for other company that has vswitch solution (which I cannot mention by name).

Well, it's not like i can do anything if no one from either communities 
comes here and engages in the discussion. I wish i could read people's 
minds, but alas :)

> If you are using your
>> own memory management and doing everything manually - you get to map
>> everything for DMA manually. If you are using DPDK facilities
>> - it is intuitive that for any DPDK-managed memory, internal or external,
>> same rules apply across the board - DMA mapping included.
> So if understand you point here, your claim "we need both". One for users which prefer the DPDK to do that for them and the other for the users who wants more control.
> However, when keeping the API of explicit mapping, we can no longer say that external memory is behaves exactly the same as internal.

Well, more or less that, yes. It would be good to have full support for 
"external" memory (in the "unregistered with DPDK" sense, as outlined 

> So going back to my RFC 😊, if we decide to keep both approach, what I suggest is to change the VFIO mapping one to be more generic and per device (can be rte_device).
> I think the biggest questions we need to answer are
> 1. are we OK with generic, vendor agnostic API?

Yes, generic and vendor-agnostic FTW!

> 2.  are we OK that the explicit mapping is done per device?

This question is slightly misleading, because that's not *all* there is 
to it. In general, yes, i'm OK with the idea. In practice, VFIO may get 
in the way.

VFIO works by mapping memory for a VFIO container. The container can 
have one or more devices in it, and how many devices it can have depends 
on the IOMMU setup (i.e. it may not be possible to have one device per 
container on some setups, which is why currently everything is done with 
one container). So, "per-device" means we either keep each device in 
separate container (which will not work in some cases), or we somehow 
work around the fact that mappings can be performed only once (i'm not 
sure if VFIO will return appropriate error codes that would allow us to 
distinguish failed mappings due to duplicate, or failed mappings due to 
other reasons - i'll have to research this), and keep all devices in the 
same container.

In other words, it may be possible to solve this issue for VFIO, but i 
would not be comfortable to proceed without knowing that this can be 
done in a sane way.

There's also a question of automatic vs. manual mapping. The way i see 
it, the choices are:

1) drop automatic registration for DMA altogether, for both internal and 
external memory
2) drop automatic registration for DMA for external memory only, but 
keep automatic DMA registration for internal memory
3) do automatic DMA registration for both

I don't like the second choice for API consistency reasons, so i would 
rather go either for 1) (and cause massive breakage of all existing 
applications along the way...), or 3) (the way it's working right now). 
The pragmatic choice *may* be 2), which is what i feel you are 
advocating for, but it has its own problems, mainly the mechanics of it. 
If we can get them worked out, i may just bite the bullet and ack the 
patches :D

Specifically, let's suppose we want automatic memory registration for 
DMA on internal but not external memory. How does it happen? Currently, 
it happens via mem event callbacks.

If we were to drop the callback approach for external memory, we would 
lose out on additional functionality provided by said callbacks, and i 
would really like to avoid that.

If we were to switch to manual mapping, that means EAL will depend on 
ethdev (you have suggested rte_device as an alternative - would that fix 
the issue?), and we will have to rewrite current DMA mapping code to do 
mappings explicitly, rather than relying on callbacks mechanism. So, 
instead of e.g. VFIO mapping being performed via callback, we would 
instead iterate over each rte_device and call its respective DMA mapping 
function explicitly - for internal memory, but not for external. This 
*may* work, but i need to think it through a bit more :)

This also still leaves a question of external unregistered memory - this 
is a wholly separate issue.

TL;DR i'm tentatively OK with the proposal, but we need to work certain 
things out before we go too far along the path and find out that we've 
reached a dead end.


  parent reply	other threads:[~2018-11-20 10:55 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-04 12:41 Shahaf Shuler
2018-11-14 11:19 ` Burakov, Anatoly
2018-11-14 14:53   ` Shahaf Shuler
2018-11-14 17:06     ` Burakov, Anatoly
2018-11-15  9:46       ` Shahaf Shuler
2018-11-15 10:59         ` Burakov, Anatoly
2018-11-19 11:20           ` Shahaf Shuler
2018-11-19 17:18             ` Burakov, Anatoly
     [not found]               ` <DB7PR05MB442643DFD33B71797CD34B5EC3D90@DB7PR05MB4426.eurprd05.prod.outlook.com>
2018-11-20 10:55                 ` Burakov, Anatoly [this message]
2018-11-22 10:06                   ` Shahaf Shuler
2018-11-22 10:41                     ` Burakov, Anatoly
2018-11-22 11:31                       ` Shahaf Shuler
2018-11-22 11:34                         ` Burakov, Anatoly
2019-01-14  6:12                         ` Shahaf Shuler
2019-01-15 12:07                           ` Burakov, Anatoly
2019-01-16 11:04                             ` Shahaf Shuler
2018-11-19 17:04           ` Stephen Hemminger

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1c00477d-266a-400e-44b4-5bcb1f41e4b7@intel.com \
    --to=anatoly.burakov@intel.com \
    --cc=arybchenko@solarflare.com \
    --cc=dev@dpdk.org \
    --cc=ferruh.yigit@intel.com \
    --cc=gowrishankar.m@linux.vnet.ibm.com \
    --cc=olgas@mellanox.com \
    --cc=pawelx.wodkowski@intel.com \
    --cc=shahafs@mellanox.com \
    --cc=shreyansh.jain@nxp.com \
    --cc=thomas@monjalon.net \
    --cc=yskoh@mellanox.com \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

DPDK patches and discussions

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://inbox.dpdk.org/dev/0 dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 dev dev/ https://inbox.dpdk.org/dev \
	public-inbox-index dev

Example config snippet for mirrors.
Newsgroup available over NNTP:

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git