From: Shahaf Shuler <shahafs@mellanox.com>
To: "Burakov, Anatoly" <anatoly.burakov@intel.com>,
"dev@dpdk.org" <dev@dpdk.org>
Cc: Olga Shern <olgas@mellanox.com>,
Yongseok Koh <yskoh@mellanox.com>,
"pawelx.wodkowski@intel.com" <pawelx.wodkowski@intel.com>,
"gowrishankar.m@linux.vnet.ibm.com"
<gowrishankar.m@linux.vnet.ibm.com>,
"ferruh.yigit@intel.com" <ferruh.yigit@intel.com>,
Thomas Monjalon <thomas@monjalon.net>,
"arybchenko@solarflare.com" <arybchenko@solarflare.com>,
"shreyansh.jain@nxp.com" <shreyansh.jain@nxp.com>
Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for external memory
Date: Thu, 22 Nov 2018 11:31:42 +0000 [thread overview]
Message-ID: <DB7PR05MB442687896A9E13DE16E486C1C3DB0@DB7PR05MB4426.eurprd05.prod.outlook.com> (raw)
In-Reply-To: <7c26e2d2-0c87-d416-87a2-737caa457ed6@intel.com>
Thursday, November 22, 2018 12:42 PM, Burakov, Anatoly:
> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping for
> external memory
>
> On 22-Nov-18 10:06 AM, Shahaf Shuler wrote:
> > Tuesday, November 20, 2018 12:56 PM, Burakov, Anatoly:
> >> Subject: Re: [dpdk-dev] [RFC] ethdev: introduce DMA memory mapping
> >> for external memory
> >>
> >> On 20-Nov-18 8:08 AM, Shahaf Shuler wrote:
> >>> Monday, November 19, 2018 7:18 PM, Burakov, Anatoly:
> >>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>> external memory
> >>>>
> >>>> On 19-Nov-18 11:20 AM, Shahaf Shuler wrote:
> >>>>> Thursday, November 15, 2018 1:00 PM, Burakov, Anatoly:
> >>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>>>> external memory
> >>>>>>
> >>>>>> On 15-Nov-18 9:46 AM, Shahaf Shuler wrote:
> >>>>>>> Wednesday, November 14, 2018 7:06 PM, Burakov, Anatoly:
> >>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping for
> >>>>>>>> external memory
> >>>>>>>>
> >>>>>>>> On 14-Nov-18 2:53 PM, Shahaf Shuler wrote:
> >>>>>>>>> Hi Anatoly,
> >>>>>>>>>
> >>>>>>>>> Wednesday, November 14, 2018 1:19 PM, Burakov, Anatoly:
> >>>>>>>>>> Subject: Re: [RFC] ethdev: introduce DMA memory mapping
> for
> >>>>>>>>>> external memory
> >>>>>>>>>>
> >>>
> >>> [...]
> >
> > [...]
> >
> > I thought the 18.11 feature should serve both cases.
> > To use memory not managed by DPDK, the user should create a heap and
> > populate it w/ the memory, so that it can be used later for mempools/rings.
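
Just to spell out the flow I refer to here, a rough sketch written from memory of the 18.11 rte_malloc.h additions; the app-side names (base, iovas, n_pages, page_sz) are placeholders coming from the application's own memory management, and the exact signatures should be checked against rte_malloc.h:

#include <rte_malloc.h>
#include <rte_mbuf.h>
#include <rte_mempool.h>

static struct rte_mempool *
pool_on_external_memory(void *base, rte_iova_t *iovas,
                        unsigned int n_pages, size_t page_sz)
{
        int socket;

        if (rte_malloc_heap_create("app_heap") != 0)
                return NULL;
        if (rte_malloc_heap_memory_add("app_heap", base, iovas,
                                       n_pages, page_sz) != 0)
                return NULL;
        socket = rte_malloc_heap_get_socket("app_heap");
        if (socket < 0)
                return NULL;
        /* From here on the heap behaves like any other socket:
         * allocation, mempool population and DMA mapping are
         * handled by DPDK (via the mem event callbacks). */
        return rte_pktmbuf_pool_create("app_pool", 8192, 256, 0,
                                       RTE_MBUF_DEFAULT_BUF_SIZE,
                                       socket);
}
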
>
> The feature introduced in 18.11 expects you to use DPDK allocation methods
> with added memory. It is not a supported use case to create an external
> memory area, register it with heap *and* use manual memory management
> on that area (e.g. populate mempool anonymously).
>
> Perhaps there should be a new API, something like "register_memory",
> where we only register memory, but do not expect the user to use DPDK
> allocator with said memory.
Isn't it the existing rte_vfio_dma_map?

> I wish i would've thought of that before, where
> have you been all this time! :D I think this feature should definitely go into
> 19.02. Are we past the v1 deadline?
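
To make my point concrete, the kind of flow I have in mind with the existing API is roughly the below (a sketch only; today this maps into the default VFIO container, which is part of what we discuss further down):

#include <stdint.h>
#include <rte_vfio.h>

/* 'buf' is application-managed memory, IOVA-contiguous at 'iova';
 * map it for DMA without adding it to any DPDK heap. */
static int
map_ext_buf(void *buf, uint64_t iova, uint64_t len)
{
        return rte_vfio_dma_map((uint64_t)(uintptr_t)buf, iova, len);
}
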
>
> >
> >>
> >> The above scenario (working with anonymous memory in mempools) is
> >> still "broken" in this sense - and i see no general way to fix it. It
> >> could work with mempools in that we could automatically create new
> >> heaps behind the scenes and populate them with new memory, but for
> >> other data structures (e.g. rte_ring_init() over anonymous memory),
> >> it's just not possible to do that.
> >>
> >> I personally would prefer these API's to just go away and people
> >> started using the "official" way to manage external memory. Removing
> >> the API's may even be feasible for everything bar mempool, and
> >> mempool support we can manage more-or-less automatically in the above
> >> way (by automatically creating an external heap behind the scenes).
> >> But then by the time we're creating/adding external memory heaps
> >> inside mempool API's, you don't really need the external memory DMA
> >> mapping API calls because the mempool would do that automatically by
> >> way of using the external heaps API.
> >
> > Yes, one way to go is to map the memory internally inside the mempool
> > creation. But it will not work for all cases.
> > For example, if the application wants to work w/ mbufs with external buffers
> > (EXT_ATTACHED_MBUF), the buffer will be attached by a different call
> > (rte_pktmbuf_attach_extbuf), so we will need to handle that as well.
> > Maybe we will find more cases in the future. I would prefer to solve it for all
> > cases. See more below.
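
For reference, the attach flow I mean is roughly the below - a sketch based on rte_mbuf.h, with the app-side buffer, IOVA, length and free callback as placeholders. The point is that the attach itself does not DMA-map anything; ext_iova must already be mapped for the device:

#include <rte_mbuf.h>

static int
attach_app_buf(struct rte_mbuf *m, void *ext_buf, rte_iova_t ext_iova,
               uint16_t ext_len, rte_mbuf_extbuf_free_callback_t free_cb)
{
        struct rte_mbuf_ext_shared_info *shinfo;

        /* shinfo is carved out of the tail of the external buffer */
        shinfo = rte_pktmbuf_ext_shinfo_init_helper(ext_buf, &ext_len,
                                                    free_cb, NULL);
        if (shinfo == NULL)
                return -1;
        rte_pktmbuf_attach_extbuf(m, ext_buf, ext_iova, ext_len, shinfo);
        rte_pktmbuf_reset_headroom(m);
        return 0;
}
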
> >
> >>
> >> In other words, the problem here isn't so much usability as it is the
> >> fact that, for some stuff, a lot of internal code now relies on
> >> DPDK's internal page tables, and we have to either make allowances
> >> for that on a case-by-case basis (by the way, no one came to me with
> >> these issues!),
> >
> > One option is case by case; another is to allow the application to explicitly
> > map this memory to the different devices.
> >
> >> or admit that using
> >> unregistered memory is a bad idea and let's not do that in the future.
> >
> > I think it is unavoidable (to use unregistered memory). The use case exists;
> > there are applications which prefer to have their own memory management.
> > For those apps, we need to have a solution.
> >
> >>
> >>>
> >>>>
> >>>> On the other hand, if you were to use it in a different way - for
> >>>> example, allocating hash tables or other DPDK data structures -
> >>>> then such a feature is essential.
> >>>
> >>> Hash tables that the NIC needs to read/write to/from?
> >>> I didn't get the point here.
> >>
> >> The point wasn't hash tables, but support for use cases we might not
> >> have thought of. That includes doing strange things like allocating
> >> DPDK data structures in non-standard memory locations.
> >>
> >>>
> >>>> The entire point was to allow using external memory with
> >>>> semantics identical to how you use the rest of DPDK.
> >>>
> >>> I understand this design choice, and it has the benefit of consistency vs.
> >>> more API calls on the application side.
> >>>
> >>>>
> >>>> Also, whether it's "intuitive" depends on perspective - you say "i
> >>>> expect to allocate memory and map it for DMA myself", i say "why do
> >>>> i have to care about DMA mapping, DPDK should do this for me" :)
> >>>
> >>> Right 😊. We should be driven by customers and not vendors; too bad
> >>> there is not much feedback from them in the community.
> >>
> >> That implies the current design *wasn't* driven by customers, which is not
> >> true.
> >> They may not be the same as ones you are aware of - but then this
> >> isn't really my fault :)
> >>
> >>> I know from the VPP side that they prefer not to use rte_socket but rather
> >>> to manage the memory by themselves, and they are perfectly fine with
> >>> mapping it explicitly. The same is true for another company that has a
> >>> vswitch solution (which I cannot mention by name).
> >>
> >> Well, it's not like i can do anything if no one from either
> >> communities comes here and engages in the discussion. I wish i could
> >> read people's minds, but alas :)
> >>
> >>>
> >>>
> >>>> If you are using your
> >>>> own memory management and doing everything manually - you get to map
> >>>> everything for DMA manually. If you are using DPDK facilities
> >>>> - it is intuitive that for any DPDK-managed memory, internal or
> >>>> external, the same rules apply across the board - DMA mapping included.
> >>>
> >>> So if I understand your point here, you claim "we need both": one for
> >>> users who prefer DPDK to do that for them, and the other for users
> >>> who want more control.
> >>> However, when keeping the explicit mapping API, we can no longer say
> >>> that external memory behaves exactly the same as internal.
> >>
> >> Well, more or less that, yes. It would be good to have full support
> >> for "external" memory (in the "unregistered with DPDK" sense, as
> >> outlined above).
> >>
> >>>
> >>> So going back to my RFC 😊, if we decide to keep both approaches, what I
> >>> suggest is to change the VFIO mapping one to be more generic and per
> >>> device (can be rte_device).
> >>> I think the biggest questions we need to answer are: 1. are we OK
> >>> with a generic, vendor-agnostic API?
> >>
> >> Yes, generic and vendor-agnostic FTW!
> >>
> >>> 2. are we OK that the explicit mapping is done per device?
> >>
> >> This question is slightly misleading, because that's not *all* there
> >> is to it. In general, yes, i'm OK with the idea. In practice, VFIO may get in
> >> the way.
> >>
> >> VFIO works by mapping memory for a VFIO container. The container can
> >> have one or more devices in it, and how many devices it can have
> >> depends on the IOMMU setup (i.e. it may not be possible to have one
> >> device per container on some setups, which is why currently
> >> everything is done with one container).
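
For reference, the container-level calls that already exist today look roughly like this (a sketch based on rte_vfio.h, error handling trimmed); whether one container per device is even possible depends on the IOMMU setup, as you note:

#include <stdint.h>
#include <rte_vfio.h>

/* create a dedicated container for one device's IOMMU group and
 * map an application buffer into it */
static int
map_for_own_container(int iommu_group_num, uint64_t va, uint64_t iova,
                      uint64_t len)
{
        int cfd = rte_vfio_container_create();

        if (cfd < 0)
                return -1;
        if (rte_vfio_container_group_bind(cfd, iommu_group_num) < 0)
                return -1;
        return rte_vfio_container_dma_map(cfd, va, iova, len);
}
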
> >
> >
> > So, "per-device" means we either keep each device in
> >> separate container (which will not work in some cases),
> > or we somehow
> >> work around the fact that mappings can be performed only once (i'm
> >> not sure if VFIO will return appropriate error codes that would allow
> >> us to distinguish failed mappings due to duplicate, or failed
> >> mappings due to other reasons - i'll have to research this), and keep
> >> all devices in the same container.
> >
> > Yes, I see here 2 options:
> > 1. VFIO has good error reporting that enables the user to understand the
> > mapping failed because it already exists (see the sketch below).
> > 2. Hold a database which records which IOVAs were already mapped, so the
> > DPDK VFIO mapping code will know to avoid them.
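
To spell out what I mean by option #1, something like the below - purely a sketch. Both assumptions here are exactly what needs to be verified: that the VFIO_IOMMU_MAP_DMA ioctl fails with a distinguishable errno (EEXIST/EBUSY) for an already-mapped range, and that the errno actually reaches the caller of the DPDK API:

#include <errno.h>
#include <stdint.h>
#include <rte_vfio.h>

/* treat "range already mapped by another device's request" as success */
static int
dma_map_idempotent(int cfd, uint64_t va, uint64_t iova, uint64_t len)
{
        if (rte_vfio_container_dma_map(cfd, va, iova, len) == 0)
                return 0;
        if (errno == EEXIST || errno == EBUSY)
                return 0;
        return -1;
}
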
>
> Option 2 is difficult to support with secondary processes. Secondary
> processes also use the same container and thus should have the same view
> of what is mapped and what isn't. Given that VFIO mapping may happen
> during memory allocation, we won't be able to allocate more memory to hold
> extra mappings, so shared memory won't work here. We could use fbarray
> for this, but that would make the amount of mappings fixed.
>
> >
> >>
> >> In other words, it may be possible to solve this issue for VFIO, but
> >> i would not be comfortable to proceed without knowing that this can
> >> be done in a sane way.
> >
> > +1.
> >
> > I think the simplest is #1 above. How can we get an answer on it?
>
> A few experiments should do it :) I'll get back to you on that.
>
> >
> >>
> >> There's also a question of automatic vs. manual mapping. The way i
> >> see it, the choices are:
> >>
> >> 1) drop automatic registration for DMA altogether, for both internal
> >> and external memory
> >> 2) drop automatic registration for DMA for external memory only, but
> >> keep automatic DMA registration for internal memory
> >> 3) do automatic DMA registration for both
> >>
> >> I don't like the second choice for API consistency reasons, so i
> >> would rather go either for 1) (and cause massive breakage of all
> >> existing applications along the way...), or 3) (the way it's working right
> >> now).
> >> The pragmatic choice *may* be 2), which is what i feel you are
> >> advocating for, but it has its own problems, mainly the mechanics of it.
> >> If we can get them worked out, i may just bite the bullet and ack the
> >> patches
> >> 😃
> >
> > I don't think we need to do such extensive changes.
> >
> > I think it is OK to have 2 ways for the application to work (the heaps and the
> > explicit DMA call). After all, DPDK is a development kit and we should make
> > the application's life easier for different use cases.
> > If the application wants the DPDK to help, it can create the heap and
> > everything will just happen. Otherwise, the application will do the memory
> > management by itself, and use the DMA call explicitly.
>
> In other words, what you are proposing is option 3), with an addition of
> support for manual memory/DMA mapping management by way of the new
> API proposed above (register_memory() call). Am i understanding you
> correctly?
Exactly.
>
> >
> >>
> >> Specifically, let's suppose we want automatic memory registration for
> >> DMA on internal but not external memory. How does it happen?
> >> Currently, it happens via mem event callbacks.
> >>
> >> If we were to drop the callback approach for external memory, we
> >> would lose out on additional functionality provided by said
> >> callbacks, and i would really like to avoid that.
> >>
> >> If we were to switch to manual mapping, that means EAL will depend on
> >> ethdev (you have suggested rte_device as an alternative - would that
> >> fix the
> >> issue?)
> >
> > No, it won't.
> > It will help only for the cases where we have multiple devices on the same
> > PCI address (like representors).
>
> So, should it be an rte_device call then?
I think it should be rte_device, so that we can do the same for crypto devices (and others).
For the representor case, PMDs can have an iterator to register the memory with all of the related eth devices.
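
Something along these lines is what I have in mind, just to make the discussion concrete (hypothetical signatures only - the names and arguments are completely open, nothing like this exists yet):

#include <stddef.h>
#include <stdint.h>
#include <rte_dev.h>

/* hypothetical generic, vendor-agnostic per-device calls for the RFC */
int rte_dev_dma_map(struct rte_device *dev, void *addr,
                    uint64_t iova, size_t len);
int rte_dev_dma_unmap(struct rte_device *dev, void *addr,
                      uint64_t iova, size_t len);
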
>
> >
> >> , and we will have to rewrite the current DMA mapping code to do
> >> mappings explicitly, rather than relying on the callback mechanism. So,
> >> instead of e.g. VFIO mapping being performed via callback, we would
> >> instead iterate over each rte_device and call its respective DMA
> >> mapping function explicitly - for internal memory, but not for
> >> external. This
> >> *may* work, but i need to think it through a bit more :)
> >>
> >> This also still leaves a question of external unregistered memory -
> >> this is a wholly separate issue.
> >>
> >> TL;DR i'm tentatively OK with the proposal, but we need to work
> >> certain things out before we go too far along the path and find out
> >> that we've reached a dead end.
> >
> > Fully agree, this is what the RFC is for.
> > I think our biggest question is whether or not VFIO will behave OK with
> > double mapping, or whether we need some tracking mechanism in DPDK.
>
> Yes, i will check this for VFIO and get back to you.
Great, thanks.
>
> >
> >>
> >> --
> >> Thanks,
> >> Anatoly
>
>
> --
> Thanks,
> Anatoly