From: "Hu, Jiayu" <jiayu.hu@intel.com>
To: "Richardson, Bruce" <bruce.richardson@intel.com>,
fengchengwen <fengchengwen@huawei.com>
Cc: "thomas@monjalon.net" <thomas@monjalon.net>,
"Yigit, Ferruh" <ferruh.yigit@intel.com>,
"dev@dpdk.org" <dev@dpdk.org>,
"nipun.gupta@nxp.com" <nipun.gupta@nxp.com>,
"hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>,
"maxime.coquelin@redhat.com" <maxime.coquelin@redhat.com>,
"honnappa.nagarahalli@arm.com" <honnappa.nagarahalli@arm.com>,
"jerinj@marvell.com" <jerinj@marvell.com>,
"david.marchand@redhat.com" <david.marchand@redhat.com>,
"jerinjacobk@gmail.com" <jerinjacobk@gmail.com>
Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
Date: Wed, 23 Jun 2021 05:34:06 +0000
Message-ID: <18dfcd9c6942429dbd1f674de8e6e337@intel.com>
In-Reply-To: <YMo1V/Trf7WH8dgN@bricha3-MOBL.ger.corp.intel.com>
> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Bruce Richardson
> Sent: Thursday, June 17, 2021 1:31 AM
> To: fengchengwen <fengchengwen@huawei.com>
> Cc: thomas@monjalon.net; Yigit, Ferruh <ferruh.yigit@intel.com>;
> dev@dpdk.org; nipun.gupta@nxp.com; hemant.agrawal@nxp.com;
> maxime.coquelin@redhat.com; honnappa.nagarahalli@arm.com;
> jerinj@marvell.com; david.marchand@redhat.com; jerinjacobk@gmail.com
> Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
>
> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> > On 2021/6/16 0:38, Bruce Richardson wrote:
> > > On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> > >> This patch introduces 'dmadevice' which is a generic type of DMA
> > >> device.
> > >>
> > >> The APIs of the dmadev library expose some generic operations which
> > >> enable configuration and I/O with DMA devices.
> > >>
> > >> Signed-off-by: Chengwen Feng <fengchengwen@huawei.com>
> > >> ---
> > > Thanks for sending this.
> > >
> > > Of most interest to me right now are the key data-plane APIs. While
> > > we are still in the prototyping phase, below is a draft of what we
> > > are thinking for the key enqueue/perform_ops/completed_ops APIs.
> > >
> > > Some key differences I note below vs. your original RFC:
> > > * Use of void pointers rather than iova addresses. While using iovas
> > >   makes sense in the general case when using hardware, in that it can
> > >   work with both physical addresses and virtual addresses, if we change
> > >   the APIs to use void pointers instead it will still work for DPDK in
> > >   VA mode, while at the same time allowing use of software fallbacks in
> > >   error cases, and also a stub driver that uses memcpy in the
> > >   background. Finally, using iovas makes the APIs a lot more awkward to
> > >   use with anything but mbufs or similar buffers where we already have
> > >   a pre-computed physical address.
> >
> > The iova is a hint to the application, and is widely used in DPDK.
> > If we switch to void pointers, how do we pass the address (iova or
> > just va?)? This may introduce implementation dependencies here.
> >
> > Or we always pass the va and the driver performs the address
> > translation, but that translation may cost too much CPU, I think.
> >
>
> On the latter point, about the driver doing address translation, I would
> agree. However, we probably need more discussion about the use of iovas
> vs. just virtual addresses. My thinking on this is that if we specify the
> API using iovas it will severely hurt usability of the API, since it
> forces the user to take more inefficient codepaths in a large number of
> cases. Given a pointer to the middle of an mbuf, one cannot just pass
> that pointer straight through as an iova; instead one must compute its
> offset from the mbuf's data pointer and then re-add that offset to the
> mbuf's base iova.
Agree. Vhost is one consumer of DMA devices. To support a SW fallback
in case of DMA copy errors, vhost needs to pass the VA of both the DPDK
mbuf and the guest buffer to the callback layer (a middle layer between
vhost and the DMA device). If DMA devices use iova, the callback layer
has to call rte_mem_virt2iova() to translate va to iova in the data
path, even in cases where the iova is the va. But if DMA devices claim
to use va, device differences can be hidden inside the driver, which
makes the DMA callback layer simpler and more efficient.
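
As a rough illustration of the difference on the data path (the copy
functions and variables below are hypothetical placeholders, not the
proposed dmadev API):

    void *src_va = rte_pktmbuf_mtod(m, void *);

    /* iova-based API: the callback layer must translate every plain
     * pointer before it can enqueue the copy */
    rte_iova_t src_iova = rte_mem_virt2iova(src_va);
    hypothetical_dma_copy_iova(dev_id, src_iova, dst_iova, len);

    /* va-based API: the pointer is passed straight through; any
     * translation, if needed at all, stays inside the driver */
    hypothetical_dma_copy_va(dev_id, src_va, dst_va, len);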
Thanks,
Jiayu
>
> My preference therefore is to require the use of an IOMMU when using a
> dmadev, so that it can be a much closer analog of memcpy. Once an IOMMU
> is present, DPDK will run in VA mode, allowing virtual addresses to our
> hugepage memory to be sent directly to hardware. Also, when using
> dmadevs on top of an in-kernel driver, that kernel driver may do all
> IOMMU management for the app, further removing the restrictions on what
> memory can be addressed by hardware.
>
> > > * Use of id values rather than user-provided handles. Allowing the
> > >   user/app to manage the amount of data stored per operation is a
> > >   better solution, I feel, than prescribing a certain amount of
> > >   in-driver tracking. Some apps may not care about anything other than
> > >   a job being completed, while other apps may have significant metadata
> > >   to be tracked. Taking the user-context handles out of the API also
> > >   makes the driver code simpler.
> >
> > The user-provided handle was mainly meant to simplify the application
> > implementation; it provides the ability to quickly locate contexts.
> >
> > The "use of id values" seems like the dma_cookie of the Linux DMA
> > engine framework: the user gets a unique dma_cookie after calling
> > dmaengine_submit(), and can then pass it to dma_async_is_tx_complete()
> > to get the completion status.
> >
>
> Yes, the idea of the id is the same - to locate contexts. The main
> difference is that if we have the driver manage contexts, or pointers to
> contexts, as well as giving more work to the driver, it complicates the
> APIs for reporting completions. If we use an ID-based approach, where the
> app maintains its own ring of contexts (if any), it avoids the need to
> have an "out" parameter array for returning those contexts, which would
> need to be appropriately sized. Instead we can just report that all ids
> up to N are completed. [This would be similar to your suggestion that N
> jobs be reported as done, in that no contexts are provided; it's just
> that knowing the ID of what is completed is generally more useful than
> the number (which can obviously be obtained by subtracting the old
> value).]
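
To make the id-based scheme a little more concrete, an app-side context
ring could look roughly like the sketch below (all names are
illustrative only, not the proposed API):

    #define APP_RING_SZ 1024                 /* power of two, app's choice */
    struct app_job_ctx ring[APP_RING_SZ];    /* app-defined per-job metadata */

    /* enqueue: the driver only hands back an id, the app keeps its own
     * context, so no in-driver tracking is needed */
    uint16_t id = hypothetical_enqueue_copy(dev_id, src, dst, len);
    ring[id & (APP_RING_SZ - 1)] = my_ctx;

    /* completion: "all ids up to last_completed are done", so the driver
     * never has to return an "out" array of contexts */
    while (last_done != last_completed) {
            last_done++;
            process_ctx(&ring[last_done & (APP_RING_SZ - 1)]);
    }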
>
> We are still working on prototyping all this, but would hope to have a
> functional example of all this soon.
>
> > How about defining the copy prototype as follows:
> >   dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> > where dma_cookie_t is an int32 and is monotonically increasing; a
> > return value >= 0 means the enqueue succeeded, otherwise it failed.
> > On completion the dmadev will return the latest completed dma_cookie,
> > and the application can use that dma_cookie to quickly locate its
> > contexts.
> >
>
> If I understand this correctly, I believe this is largely what I was suggesting -
> just with the typedef for the type? In which case it obviously looks good to
> me.
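
A minimal sketch of that cookie idea, keeping the argument list open as
in the proposal above (the body here is illustrative only):

    typedef int32_t dma_cookie_t;   /* monotonically increasing per device */

    dma_cookie_t c = rte_dmadev_copy(dev_id, xxx);  /* xxx = copy args, TBD */
    if (c < 0) {
            /* enqueue failed */
    } else {
            /* c identifies this job; completion later reports the latest
             * completed cookie, letting the app locate its contexts */
    }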
>
> > > * I've kept a single combined API for completions, which differs from
> > >   the separate error-handling completion API you propose. I need to
> > >   give the two-function approach a bit of thought, but likely both
> > >   could work. If we (likely) never expect failed ops, then the
> > >   specifics of error handling should not matter that much.
> >
> > The rte_ioat_completed_ops API is too complex, and considering that
> > some applications may never see a copy fail, we split it into two
> > APIs. That is indeed not friendly to other scenarios that always
> > require error handling.
> >
> > I prefer to use the number of completed operations as the return
> > value, rather than the ID, so that the application can easily judge
> > whether there are new completed operations. The new prototype:
> >   uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie,
> >                                 uint32_t *status, uint16_t max_status,
> >                                 uint16_t *num_fails);
> >
> > 1) For the normal case, which never expects failed ops, just call:
> >      ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> > 2) For the other cases:
> >      ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status,
> >                                 &fails);
> >    at this point, fails <= ret <= max_status
> >
> Completely agree that we need to plan for the happy-day case where
> everything is passing. Looking at the prototypes you have above, I am ok
> with returning the number of completed ops as the return value, with the
> final completed cookie as an "out" parameter.
> For handling errors, I'm ok with what you propose above, just with one
> small adjustment - I would remove the restriction that ret <= max_status.
>
> In the case of zero failures, we can report as many ops succeeding as we
> like, and even in the case of failure, we can still report as many
> successful ops as we like before we start filling in the status field.
> For example, if 32 ops are completed and the last one fails, we can just
> fill in one entry into status and return 32. Alternatively, if the
> 4th-from-last one fails, we fill in 4 entries and return 32. The only
> requirements would be:
> * fails <= max_status
> * fails <= ret
> * cookie holds the id of the last entry in status.
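
Using the rte_dmadev_completed() prototype proposed earlier in the
thread (still subject to change), the 32-ops example above might then
look like this:

    uint32_t status[MAX_STATUS];     /* MAX_STATUS is the app's choice */
    uint16_t fails = 0;
    dma_cookie_t cookie;

    uint16_t ret = rte_dmadev_completed(dev_id, &cookie, status,
                                        MAX_STATUS, &fails);
    /* e.g. 32 ops completed and only the last one failed:
     *   ret == 32            (no longer capped by max_status)
     *   fails == 1           (fails <= max_status, fails <= ret)
     *   status[0] holds the error code of the failed op
     *   cookie is the id of the last entry reported in status */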
>
> A further possible variation is to have separate "completed" and
> "completed_status" APIs, where "completed_status" is as above, but
> "completed" skips the final 3 parameters and returns -1 on error. In that case
> the user can fall back to the completed_status call.
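
That variation might look roughly like the following (signatures
illustrative only; under this scheme the five-argument call above would
become rte_dmadev_completed_status()):

    /* fast path: no status array; returns -1 if any op failed */
    int n = rte_dmadev_completed(dev_id, &cookie);
    if (n < 0)
            /* slow path: same information as before, with per-op status */
            n = rte_dmadev_completed_status(dev_id, &cookie, status,
                                            MAX_STATUS, &fails);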
>
> > >
> > > For the rest, the control / setup APIs are likely to be rather
> > > uncontroversial, I suspect. However, I think that rather than xstats
> > > APIs, the library should first provide a set of standardized stats
> > > like ethdev does. If driver-specific stats are needed, we can add
> > > xstats later to the API.
> >
> > Agree, will fix in v2
> >
> Thanks. In parallel, we will be working on our prototype implementation
> too, taking in the feedback here, and hope to send it as an RFC soon.
> Then we can look to compare and contrast and arrive at an agreed API. It
> might also be worthwhile to set up a community call for all interested parties
> in this API to discuss things with a more rapid turnaround. That was done in
> the past for other new device class APIs that were developed, e.g. eventdev.
>
> Regards,
> /Bruce