Date: Wed, 7 Jul 2021 12:01:41 +0100
From: Bruce Richardson
To: Jerin Jacob
Cc: Chengwen Feng, Thomas Monjalon, Ferruh Yigit, Jerin Jacob, dpdk-dev,
 Morten Brørup, Nipun Gupta, Hemant Agrawal, Maxime Coquelin,
 Honnappa Nagarahalli, David Marchand, Satananda Burla, Prasun Kapoor,
 "Ananyev, Konstantin", liangma@liangbit.com, Radha Mohan Chintakuntla
References: <1625231891-2963-1-git-send-email-fengchengwen@huawei.com>
Subject: Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

On Wed, Jul 07, 2021 at 04:04:16PM +0530, Jerin Jacob wrote:
> On Wed, Jul 7, 2021 at 2:05 PM Bruce Richardson wrote:
> >
> > On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote:
> > > On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
> > > > >
> > > > > On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson wrote:
> > > > > >
> > > > > > On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
> > > > > > > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng wrote:
> > > > > > > >
> > > > > > > > This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > > > > device.
> > > > > > > >
> > > > > >
> > > > > > +1, and the terminology with regards to queues and channels. With our
> > > > > > ioat hardware, each HW queue was called a channel for instance.
> > > > >
> > > > > Looks like <> can cover all the use cases; if the HW has more than
> > > > > 1 queue it can be exposed as a separate dmadev dev.
> > > >
> > > > Fine for me.
> > > >
> > > > However, just to confirm that Morten's suggestion of using a
> > > > (device-specific void *) channel pointer rather than a dev_id + channel_id
> > > > pair of parameters won't work for you? You can't store a pointer or dev
> > > > index in the channel struct in the driver?
> > >
> > > Yes. That will work. To confirm, the suggestion is to use a void *
> > > object instead of channel_id; that will avoid one more indirection
> > > (index -> pointer).
> >
> > The proposal was to use it in place of "dev_id + channel_id", i.e.
> >
> > copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags)
> >
> > Where the channel pointer implicitly indicates the device too. However, I
> > realise now that this would be something completely transparent to the
> > driver, as it would all have to be implemented at the dmadev level, and
> > lead to lots of duplication of function pointers, etc. Therefore, let's
> > just go with the original scheme. :-(
>
> Yes. Just go with the original scheme.
>
+1

> > > > > Got it. In order to save space in the first CL for the fastpath
> > > > > (saving 8B for the pointer) and to avoid function overhead, can we
> > > > > use one bit of the flags of the op function to enable the fence?
> > > >
> > > > The original ioat implementation did exactly that. However, I then
> > > > discovered that because a fence logically belongs between two operations,
> > > > does the fence flag on an operation mean "don't do any jobs after this
> > > > until this job has completed" or does it mean "don't start this job until
> > > > all previous jobs have completed"? [Or theoretically does it mean both :-)]
> > > > Naturally, some hardware does it the former way (i.e. the fence flag goes
> > > > on the last op before the fence), while other hardware does it the latter
> > > > way (i.e. the fence flag goes on the first op after the fence). Therefore,
> > > > since fencing is about ordering *between* two (sets of) jobs, I decided
> > > > that it should do exactly that and go between two jobs, so there is no
> > > > ambiguity!
> > > >
> > > > However, I'm happy enough to switch to having a fence flag, but I think if
> > > > we do that, it should be put in the "first job after fence" case, because
> > > > it is always easier to modify a previously written job if we need to, than
> > > > to save the flag for a future one.
> > > >
> > > > Alternatively, if we keep the fence as a separate function, I'm happy
> > > > enough for it not to be on the same cacheline as the "hot" operations,
> > > > since fencing will always introduce a small penalty anyway.
> > >
> > > Ack.
> > > You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE (if there
> > > is any use case for this or it makes sense for your HW).
> > >
> > > For us, fence is a NOP as we have an implicit fence between each HW job
> > > descriptor.
> >
> > I actually still think that having a separate fence function in the "ops"
> > section is the best approach here. It's unambiguous as to whether it's
> > fence-before or fence-after, and if we have it in the ops, it doesn't use a
> > "fast-path" slot.
> >
> > However, if we *really* want to use a flag instead, I don't see the value
> > in having two flags, it will be really confusing. Instead, if we do go
> > with a flag, I think "RTE_DMA_PRE_FENCE" should be the name, indicating
> > that the fence occurs before the job in question.
>
> IMO, we need to use flags, and the name can be RTE_DMA_PRE_FENCE,
> due to the overhead of the driver implementation where the fence request
> can be a NOP, and to save the first cache line occupancy.
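
Just to make sure we're talking about the same usage model, here is a rough
sketch of how I would expect a PRE_FENCE flag to look from the application
side. The function name and parameter list below are only illustrative, not
the final API:

	/* jobs 0 and 1 may be reordered or overlapped by the HW */
	rte_dma_copy(dev_id, ch_id, src0, dst0, len0, 0);
	rte_dma_copy(dev_id, ch_id, src1, dst1, len1, 0);
	/* the fence flag goes on the first job *after* the fence point,
	 * so this job must not start until jobs 0 and 1 have completed */
	rte_dma_copy(dev_id, ch_id, src2, dst2, len2, RTE_DMA_PRE_FENCE);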

> > > > > > > Since we have additional function call overhead in all the
> > > > > > > applications for this scheme, I would like to understand
> > > > > > > the use of doing it this way vs the enq doing the doorbell
> > > > > > > implicitly, from a driver/application PoV?
> > > > > >
> > > > > > In our benchmarks it's just faster. When we tested it, the overhead of
> > > > > > the function calls was noticeably less than the cost of building up the
> > > > > > parameter array(s) for passing the jobs in as a burst. [We don't see
> > > > > > this cost with things like NIC I/O since DPDK tends to already have the
> > > > > > mbuf fully populated before the TX call anyway.]
> > > > >
> > > > > OK. I agree with the stack population.
> > > > >
> > > > > My question was more on doing an implicit doorbell update in enq. Is the
> > > > > doorbell write costly in other HW compared to a function call? In our HW,
> > > > > it is just a write of the number of instructions written to a register.
> > > > >
> > > > > Also, we need to again access the internal PMD memory structure to find
> > > > > where to write etc. if it is a separate function.
> > > >
> > > > The cost varies depending on a number of factors - even writing to a single
> > > > HW register can be very slow if that register is mapped as device
> > > > (uncacheable) memory, since (AFAIK) it will act as a full fence and wait
> > >
> > > I don't know; at least in our case, writes are write-back, so the core does
> > > not need to wait (if there is no read operation).
> > >
> > > > for the write to go all the way to hardware. For more modern HW, the cost
> > > > can be lighter. However, any cost of HW writes is going to be the same
> > > > whether it's a separate function call or not.
> > > >
> > > > However, the main thing about the doorbell update is that it's a
> > > > once-per-burst thing, rather than a once-per-job. Therefore, even if you
> > > > have to re-read the struct memory (which is likely still somewhere in your
> > > > cores' cache), any extra small cost of doing so is to be amortized over the
> > > > cost of a whole burst of copies.
> > >
> > > The Linux kernel has the xmit_more flag in the skb to address a similar
> > > thing, i.e. the enq job flag can have one more bit field to say whether to
> > > update the ring bell or not, rather than having yet another function
> > > overhead. IMO, it is the best of both worlds.
> >
> > It's just more conditionals and branches all through the code. Inside the
> > user application, the user has to check whether to set the flag or not (or
> > special-case the last transaction outside the loop), and within the driver,
> > there has to be a branch whether or not to call the doorbell function. The
> > code on both sides is far simpler and more readable if the doorbell
> > function is exactly that - a separate function.
>
> I disagree. The reason is:
>
> We will have two classes of applications:
>
> a) those that issue a DMA copy request as and when they have data (I think
> this is the prime use case); for those, I think it is considerable overhead
> to have two function invocations per transfer, i.e. rte_dma_copy() and
> rte_dma_perform().
>
> b) those that do a DMA copy when the data has reached a logical state, like
> copying an IP frame from Ethernet packets or so; in that case, the
> application will have the logic to detect when to perform it, so at the end
> of that logic the rte_dma_copy() flag can be updated to fire the doorbell.
>
> IMO, we are comparing a branch (the flag is already in a register) against
> a set of instructions for
> 1) function pointer overhead
> 2) the need to use the channel context again in another function.
>
> IMO, a single branch is most optimal from a performance PoV.

Ok, let's try it and see how it goes.
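
So that we are comparing like with like, the two styles under discussion
would look roughly as below from the application side. rte_dma_copy() and
rte_dma_perform() are the names used above; RTE_DMA_SUBMIT is just a
placeholder for whatever the doorbell flag ends up being called:

	/* style 1: separate doorbell function, called once per burst */
	for (i = 0; i < n; i++)
		rte_dma_copy(dev_id, ch_id, src[i], dst[i], len[i], 0);
	rte_dma_perform(dev_id, ch_id);

	/* style 2: doorbell requested via a flag on the last job of the burst */
	for (i = 0; i < n; i++)
		rte_dma_copy(dev_id, ch_id, src[i], dst[i], len[i],
				(i == n - 1) ? RTE_DMA_SUBMIT : 0);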

> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * @warning
> > > > > > > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > > > > > > + *
> > > > > > > > + * Returns the number of operations that failed to complete.
> > > > > > > > + * NOTE: This API was used when rte_dmadev_completed has_error was set.
> > > > > > > > + *
> > > > > > > > + * @param dev_id
> > > > > > > > + *   The identifier of the device.
> > > > > > > > + * @param vq_id
> > > > > > > > + *   The identifier of virt queue.
> > > > > > > > + * @param nb_status
> > > > > > > > + *   Indicates the size of status array.
> > > > > > > > + * @param[out] status
> > > > > > > > + *   The error code of operations that failed to complete.
> > > > > > > > + * @param[out] cookie
> > > > > > > > + *   The last failed completed operation's cookie.
> > > > > > > > + *
> > > > > > > > + * @return
> > > > > > > > + *   The number of operations that failed to complete.
> > > > > > > > + *
> > > > > > > > + * NOTE: The caller must ensure that the input parameter is valid and the
> > > > > > > > + *   corresponding device supports the operation.
> > > > > > > > + */
> > > > > > > > +__rte_experimental
> > > > > > > > +static inline uint16_t
> > > > > > > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id,
> > > > > > > > +                           const uint16_t nb_status, uint32_t *status,
> > > > > > > > +                           dma_cookie_t *cookie)
> > > > > > >
> > > > > > > IMO, it is better to move cookie/ring_idx to 3. Why would it return an
> > > > > > > array of errors, since it is called after rte_dmadev_completed() has
> > > > > > > has_error set? Is it better to change it to
> > > > > > >
> > > > > > > rte_dmadev_error_status(uint16_t dev_id, uint16_t vq_id,
> > > > > > >                         dma_cookie_t *cookie, uint32_t *status)
> > > > > > >
> > > > > > > I also think we may need to set status as a bitmask and enumerate all
> > > > > > > the combinations of error codes of all the drivers, and return a string
> > > > > > > from the driver like the existing rte_flow_error.
> > > > > > >
> > > > > > > See
> > > > > > > struct rte_flow_error {
> > > > > > >         enum rte_flow_error_type type; /**< Cause field and error types. */
> > > > > > >         const void *cause; /**< Object responsible for the error. */
> > > > > > >         const char *message; /**< Human-readable error message. */
> > > > > > > };
> > > > > >
> > > > > > I think we need a multi-return value API here, as we may add
> > > > > > operations in future which have non-error status values to return.
> > > > > > The obvious case is DMA engines which support "compare" operations.
> > > > > > In that case a successful compare (as in there were no DMA or HW
> > > > > > errors) can return "equal" or "not-equal" as statuses. For general
> > > > > > "copy" operations, the faster completion op can be used to just
> > > > > > return successful values (and only call this status version on
> > > > > > error), while apps using those compare ops, or a mixture of copy and
> > > > > > compare ops, would always use the slower one that returns status
> > > > > > values for each and every op.
> > > > > >
> > > > > > The ioat APIs used 32-bit integer values for this status array so
> > > > > > as to allow e.g. 16 bits for the error code and 16 bits for future
> > > > > > status values. For most operations there should be a fairly small
> > > > > > set of things that can go wrong, i.e. bad source address, bad
> > > > > > destination address or invalid length. Within that we may have a
> > > > > > couple of specifics for why an address is bad, but even so I don't
> > > > > > think we need to start having multiple bit combinations.
> > > > >
> > > > > OK. What is the purpose of the error status? Is it for the application
> > > > > to print it, or does the application need to take any action based on
> > > > > specific error requests?
> > > >
> > > > It's largely for information purposes, but in the case of SVA/SVM
> > > > errors could occur due to the memory not being pinned, i.e. a page
> > > > fault, in some cases. If that happens, then it's up to the app to either
> > > > touch the memory and retry the copy, or to do a SW memcpy as a
> > > > fallback.
> > > >
> > > > In other error cases, I think it's good to tell the application if it's
> > > > passing around bad data, or data that is beyond the scope of hardware,
> > > > e.g. a copy that is beyond what can be done in a single transaction
> > > > for a HW instance. Given that there are always things that can go
> > > > wrong, I think we need some error reporting mechanism.
> > > >
> > > > > If the former is the scope, then we need to define standard enum
> > > > > values for the error, right? i.e. uint32_t *status needs to change to
> > > > > enum rte_dma_error or so.
> > > >
> > > > Sure. Perhaps an error/status structure is also an option, where we
> > > > explicitly call out error info from status info.
> > >
> > > Agree. Better to have a structure with fields like:
> > >
> > > 1) enum rte_dma_error_type
> > > 2) memory to store an informative message on the finer aspects of the
> > >    error, like the address that caused the issue etc. (which will be
> > >    driver-specific information).
> >
> > The only issue I have with that is that once we have driver-specific
> > information it is of little use to the application, since it can't know
> > anything about it except maybe log it. I'd much rather have a set of error
> > codes telling the user that "source address is wrong", "dest address is
> > wrong", and a generic "an address is wrong" in case the driver/HW cannot
> > distinguish the source of the error. Can we see how far we get with just
> > error codes before we start passing string messages around and all the
> > memory management headaches that implies?
>
> Works for me. It should be "enum rte_dma_error_type" then, which has
> standard error types. That is missing in the spec now.

+1
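
As a possible starting point for filling that gap in the spec, the enum
could cover the cases discussed above. Every name below is a placeholder
for illustration only; nothing here is in the patch yet:

	enum rte_dma_error_type {
		RTE_DMA_ERR_NONE = 0,       /* operation completed successfully */
		RTE_DMA_ERR_SRC_ADDRESS,    /* source address is wrong/unreachable */
		RTE_DMA_ERR_DST_ADDRESS,    /* destination address is wrong/unreachable */
		RTE_DMA_ERR_ADDRESS,        /* an address is wrong, HW cannot tell which */
		RTE_DMA_ERR_INVALID_LENGTH, /* length beyond a single HW transaction */
		RTE_DMA_ERR_PAGE_FAULT,     /* SVA/SVM page fault; app may retry or memcpy */
		RTE_DMA_ERR_UNKNOWN,        /* unspecified driver/HW error */
	};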