Date: Wed, 7 Jul 2021 12:01:41 +0100
From: Bruce Richardson
To: Jerin Jacob
Cc: Chengwen Feng, Thomas Monjalon, Ferruh Yigit, Jerin Jacob, dpdk-dev,
 Morten Brørup, Nipun Gupta, Hemant Agrawal, Maxime Coquelin,
 Honnappa Nagarahalli, David Marchand, Satananda Burla, Prasun Kapoor,
 "Ananyev, Konstantin", liangma@liangbit.com, Radha Mohan Chintakuntla
References: <1625231891-2963-1-git-send-email-fengchengwen@huawei.com>
Subject: Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

On Wed, Jul 07, 2021 at 04:04:16PM +0530, Jerin Jacob wrote:
> On Wed, Jul 7, 2021 at 2:05 PM Bruce Richardson wrote:
> >
> > On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote:
> > > On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
> > > > >
> > > > > On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson wrote:
> > > > > >
> > > > > > On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
> > > > > > > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng wrote:
> > > > > > > >
> > > > > > > > This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > > > > device.
> > > > > > > >
> > > > > >
> > > > > > +1, and the terminology with regards to queues and channels. With our
> > > > > > ioat hardware, each HW queue was called a channel for instance.
> > > > >
> > > > > Looks like <> can cover all the use cases; if the HW has more than
> > > > > 1 queue it can be exposed as a separate dmadev dev.
> > > >
> > > > Fine for me.
> > > >
> > > > However, just to confirm that Morten's suggestion of using a
> > > > (device-specific void *) channel pointer rather than a dev_id + channel_id
> > > > pair of parameters won't work for you? You can't store a pointer or dev
> > > > index in the channel struct in the driver?
> > >
> > > Yes. That will work. To confirm, the suggestion is to use a void *
> > > object instead of channel_id; that will avoid one more indirection
> > > (index -> pointer).
> >
> > The proposal was to use it in place of "dev_id + channel_id", i.e.
> >
> > copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags)
> >
> > Where the channel pointer implicitly indicates the device too. However, I
> > realise now that this would be something completely transparent to the
> > driver, as it would all have to be implemented at the dmadev level, and
> > lead to lots of duplication of function pointers, etc. Therefore, let's
> > just go with the original scheme. :-(
>
> Yes. Just go with the original scheme.
>
+1

> > > > > Got it. In order to save space in the first CL for the fastpath
> > > > > (saving 8B for the pointer) and to avoid function overhead, can we
> > > > > use one bit of the flags of the op function to enable the fence?
> > > >
> > > > The original ioat implementation did exactly that. However, I then
> > > > discovered that because a fence logically belongs between two operations,
> > > > does the fence flag on an operation mean "don't do any jobs after this
> > > > until this job has completed" or does it mean "don't start this job until
> > > > all previous jobs have completed"? [Or theoretically does it mean both :-)]
> > > > Naturally, some hardware does it the former way (i.e. the fence flag goes
> > > > on the last op before the fence), while other hardware does it the latter
> > > > way (i.e. the fence flag goes on the first op after the fence). Therefore,
> > > > since fencing is about ordering *between* two (sets of) jobs, I decided
> > > > that it should do exactly that and go between two jobs, so there is no
> > > > ambiguity!
> > > >
> > > > However, I'm happy enough to switch to having a fence flag, but I think if
> > > > we do that, it should be put in the "first job after fence" case, because
> > > > it is always easier to modify a previously written job if we need to, than
> > > > to save the flag for a future one.
> > > >
> > > > Alternatively, if we keep the fence as a separate function, I'm happy
> > > > enough for it not to be on the same cacheline as the "hot" operations,
> > > > since fencing will always introduce a small penalty anyway.
> > >
> > > Ack.
> > > You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE (if there
> > > is any use case for this or it makes sense for your HW).
> > >
> > > For us, fence is a NOP as we have an implicit fence between each HW job
> > > descriptor.
> >
> > I actually still think that having a separate fence function in the "ops"
> > section is the best approach here. It's unambiguous as to whether it's
> > fence-before or fence-after, and if we have it in the ops, it doesn't use a
> > "fast-path" slot.
> >
> > However, if we *really* want to use a flag instead, I don't see the value
> > in having two flags, it will be really confusing. Instead, if we do go
> > with a flag, I think "RTE_DMA_PRE_FENCE" should be the name, indicating
> > that the fence occurs before the job in question.
>
> IMO, we need to use flags, and the name can be RTE_DMA_PRE_FENCE,
> due to the overhead of the driver implementation where the fence request
> can be a NOP, and to save the first cache line occupancy.
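
Just to make sure we're talking about the same usage model, here is a rough
sketch of how I would expect a PRE_FENCE flag to look from the application
side. The function name and parameter list below are only illustrative, not
the final API:

	/* jobs 0 and 1 may be reordered or overlapped by the HW */
	rte_dma_copy(dev_id, ch_id, src0, dst0, len0, 0);
	rte_dma_copy(dev_id, ch_id, src1, dst1, len1, 0);
	/* the fence flag goes on the first job *after* the fence point,
	 * so this job must not start until jobs 0 and 1 have completed */
	rte_dma_copy(dev_id, ch_id, src2, dst2, len2, RTE_DMA_PRE_FENCE);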

> > > > > > > Since we have additional function call overhead in all the
> > > > > > > applications for this scheme, I would like to understand
> > > > > > > the use of doing it this way vs the enq doing the doorbell
> > > > > > > implicitly, from a driver/application PoV?
> > > > > >
> > > > > > In our benchmarks it's just faster. When we tested it, the overhead of
> > > > > > the function calls was noticeably less than the cost of building up the
> > > > > > parameter array(s) for passing the jobs in as a burst. [We don't see
> > > > > > this cost with things like NIC I/O since DPDK tends to already have the
> > > > > > mbuf fully populated before the TX call anyway.]
> > > > >
> > > > > OK. I agree with the stack population.
> > > > >
> > > > > My question was more on doing an implicit doorbell update in enq. Is the
> > > > > doorbell write costly in other HW compared to a function call? In our HW,
> > > > > it is just a write of the number of instructions written to a register.
> > > > >
> > > > > Also, we need to again access the internal PMD memory structure to find
> > > > > where to write etc. if it is a separate function.
> > > >
> > > > The cost varies depending on a number of factors - even writing to a single
> > > > HW register can be very slow if that register is mapped as device
> > > > (uncacheable) memory, since (AFAIK) it will act as a full fence and wait
> > >
> > > I don't know; at least in our case, writes are write-back, so the core does
> > > not need to wait (if there is no read operation).
> > >
> > > > for the write to go all the way to hardware. For more modern HW, the cost
> > > > can be lighter. However, any cost of HW writes is going to be the same
> > > > whether it's a separate function call or not.
> > > >
> > > > However, the main thing about the doorbell update is that it's a
> > > > once-per-burst thing, rather than a once-per-job. Therefore, even if you
> > > > have to re-read the struct memory (which is likely still somewhere in your
> > > > cores' cache), any extra small cost of doing so is to be amortized over the
> > > > cost of a whole burst of copies.
> > >
> > > The Linux kernel has the xmit_more flag in the skb to address a similar
> > > thing, i.e. the enq job flag can have one more bit field to say whether to
> > > update the ring bell or not, rather than having yet another function
> > > overhead. IMO, it is the best of both worlds.
> >
> > It's just more conditionals and branches all through the code. Inside the
> > user application, the user has to check whether to set the flag or not (or
> > special-case the last transaction outside the loop), and within the driver,
> > there has to be a branch whether or not to call the doorbell function. The
> > code on both sides is far simpler and more readable if the doorbell
> > function is exactly that - a separate function.
>
> I disagree. The reason is:
>
> We will have two classes of applications:
>
> a) those that issue a DMA copy request as and when they have data (I think
> this is the prime use case); for those, I think it is considerable overhead
> to have two function invocations per transfer, i.e. rte_dma_copy() and
> rte_dma_perform().
>
> b) those that do a DMA copy when the data has reached a logical state, like
> copying an IP frame from Ethernet packets or so; in that case, the
> application will have the logic to detect when to perform it, so at the end
> of that logic the rte_dma_copy() flag can be updated to fire the doorbell.
>
> IMO, we are comparing a branch (the flag is already in a register) against
> a set of instructions for
> 1) function pointer overhead
> 2) the need to use the channel context again in another function.
>
> IMO, a single branch is most optimal from a performance PoV.

Ok, let's try it and see how it goes.
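
So that we are comparing like with like, the two styles under discussion
would look roughly as below from the application side. rte_dma_copy() and
rte_dma_perform() are the names used above; RTE_DMA_SUBMIT is just a
placeholder for whatever the doorbell flag ends up being called:

	/* style 1: separate doorbell function, called once per burst */
	for (i = 0; i < n; i++)
		rte_dma_copy(dev_id, ch_id, src[i], dst[i], len[i], 0);
	rte_dma_perform(dev_id, ch_id);

	/* style 2: doorbell requested via a flag on the last job of the burst */
	for (i = 0; i < n; i++)
		rte_dma_copy(dev_id, ch_id, src[i], dst[i], len[i],
				(i == n - 1) ? RTE_DMA_SUBMIT : 0);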

> > > > > > > > +
> > > > > > > > +/**
> > > > > > > > + * @warning
> > > > > > > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > > > > > > + *
> > > > > > > > + * Returns the number of operations that failed to complete.
> > > > > > > > + * NOTE: This API was used when rte_dmadev_completed has_error was set.
> > > > > > > > + *
> > > > > > > > + * @param dev_id
> > > > > > > > + *   The identifier of the device.
> > > > > > > > + * @param vq_id
> > > > > > > > + *   The identifier of virt queue.
> > > > > > > > + * @param nb_status
> > > > > > > > + *   Indicates the size of status array.
> > > > > > > > + * @param[out] status
> > > > > > > > + *   The error code of operations that failed to complete.
> > > > > > > > + * @param[out] cookie
> > > > > > > > + *   The last failed completed operation's cookie.
> > > > > > > > + *
> > > > > > > > + * @return
> > > > > > > > + *   The number of operations that failed to complete.
> > > > > > > > + *
> > > > > > > > + * NOTE: The caller must ensure that the input parameter is valid and the
> > > > > > > > + *   corresponding device supports the operation.
> > > > > > > > + */
> > > > > > > > +__rte_experimental
> > > > > > > > +static inline uint16_t
> > > > > > > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id,
> > > > > > > > +                           const uint16_t nb_status, uint32_t *status,
> > > > > > > > +                           dma_cookie_t *cookie)
> > > > > > >
> > > > > > > IMO, it is better to move cookie/ring_idx to 3. Why would it return an
> > > > > > > array of errors, since it is called after rte_dmadev_completed() has
> > > > > > > has_error set? Is it better to change it to
> > > > > > >
> > > > > > > rte_dmadev_error_status(uint16_t dev_id, uint16_t vq_id,
> > > > > > >                         dma_cookie_t *cookie, uint32_t *status)
> > > > > > >
> > > > > > > I also think we may need to set status as a bitmask and enumerate all
> > > > > > > the combinations of error codes of all the drivers, and return a string
> > > > > > > from the driver like the existing rte_flow_error.
> > > > > > >
> > > > > > > See
> > > > > > > struct rte_flow_error {
> > > > > > >         enum rte_flow_error_type type; /**< Cause field and error types. */
> > > > > > >         const void *cause; /**< Object responsible for the error. */
> > > > > > >         const char *message; /**< Human-readable error message. */
> > > > > > > };
> > > > > >
> > > > > > I think we need a multi-return value API here, as we may add
> > > > > > operations in future which have non-error status values to return.
> > > > > > The obvious case is DMA engines which support "compare" operations.
> > > > > > In that case a successful compare (as in there were no DMA or HW
> > > > > > errors) can return "equal" or "not-equal" as statuses. For general
> > > > > > "copy" operations, the faster completion op can be used to just
> > > > > > return successful values (and only call this status version on
> > > > > > error), while apps using those compare ops, or a mixture of copy and
> > > > > > compare ops, would always use the slower one that returns status
> > > > > > values for each and every op.
> > > > > >
> > > > > > The ioat APIs used 32-bit integer values for this status array so
> > > > > > as to allow e.g. 16 bits for the error code and 16 bits for future
> > > > > > status values. For most operations there should be a fairly small
> > > > > > set of things that can go wrong, i.e. bad source address, bad
> > > > > > destination address or invalid length. Within that we may have a
> > > > > > couple of specifics for why an address is bad, but even so I don't
> > > > > > think we need to start having multiple bit combinations.
> > > > >
> > > > > OK. What is the purpose of the error status? Is it for the application
> > > > > to print it, or does the application need to take any action based on
> > > > > specific error requests?
> > > >
> > > > It's largely for information purposes, but in the case of SVA/SVM
> > > > errors could occur due to the memory not being pinned, i.e. a page
> > > > fault, in some cases. If that happens, then it's up to the app to either
> > > > touch the memory and retry the copy, or to do a SW memcpy as a
> > > > fallback.
> > > >
> > > > In other error cases, I think it's good to tell the application if it's
> > > > passing around bad data, or data that is beyond the scope of hardware,
> > > > e.g. a copy that is beyond what can be done in a single transaction
> > > > for a HW instance. Given that there are always things that can go
> > > > wrong, I think we need some error reporting mechanism.
> > > >
> > > > > If the former is the scope, then we need to define standard enum
> > > > > values for the error, right? i.e. uint32_t *status needs to change to
> > > > > enum rte_dma_error or so.
> > > >
> > > > Sure. Perhaps an error/status structure is also an option, where we
> > > > explicitly call out error info from status info.
> > >
> > > Agree. Better to have a structure with fields like:
> > >
> > > 1) enum rte_dma_error_type
> > > 2) memory to store an informative message on the finer aspects of the
> > >    error, like the address that caused the issue etc. (which will be
> > >    driver-specific information).
> >
> > The only issue I have with that is that once we have driver-specific
> > information it is of little use to the application, since it can't know
> > anything about it except maybe log it. I'd much rather have a set of error
> > codes telling the user that "source address is wrong", "dest address is
> > wrong", and a generic "an address is wrong" in case the driver/HW cannot
> > distinguish the source of the error. Can we see how far we get with just
> > error codes before we start passing string messages around and all the
> > memory management headaches that implies?
>
> Works for me. It should be "enum rte_dma_error_type" then, which has
> standard error types. That is missing in the spec now.

+1
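
As a possible starting point for filling that gap in the spec, the enum
could cover the cases discussed above. Every name below is a placeholder
for illustration only; nothing here is in the patch yet:

	enum rte_dma_error_type {
		RTE_DMA_ERR_NONE = 0,       /* operation completed successfully */
		RTE_DMA_ERR_SRC_ADDRESS,    /* source address is wrong/unreachable */
		RTE_DMA_ERR_DST_ADDRESS,    /* destination address is wrong/unreachable */
		RTE_DMA_ERR_ADDRESS,        /* an address is wrong, HW cannot tell which */
		RTE_DMA_ERR_INVALID_LENGTH, /* length beyond a single HW transaction */
		RTE_DMA_ERR_PAGE_FAULT,     /* SVA/SVM page fault; app may retry or memcpy */
		RTE_DMA_ERR_UNKNOWN,        /* unspecified driver/HW error */
	};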