Date: Wed, 7 Jul 2021 09:35:15 +0100
From: Bruce Richardson
To: Jerin Jacob
Cc: Chengwen Feng, Thomas Monjalon, Ferruh Yigit, Jerin Jacob, dpdk-dev,
 Morten Brørup, Nipun Gupta, Hemant Agrawal, Maxime Coquelin,
 Honnappa Nagarahalli, David Marchand, Satananda Burla, Prasun Kapoor,
 "Ananyev, Konstantin", liangma@liangbit.com, Radha Mohan Chintakuntla
Subject: Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote:
> On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson wrote:
> >
> > On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
> > >
> > > On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson wrote:
> > > >
> > > > On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
> > > > > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng wrote:
> > > > > >
> > > > > > This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > > device.
> > > > >
> > > > +1, and the terminology with regards to queues and channels - with our
> > > > ioat hardware, each HW queue was called a channel, for instance.
> > >
> > > Looks like <> can cover all the use cases: if the HW has more than
> > > one queue, it can be exposed as a separate dmadev.
> > >
> > Fine for me.
> >
> > However, just to confirm: Morten's suggestion of using a
> > (device-specific void *) channel pointer rather than a dev_id + channel_id
> > pair of parameters won't work for you? You can't store a pointer or dev
> > index in the channel struct in the driver?
>
> Yes, that will work. To confirm, the suggestion is to use a void *
> object instead of channel_id, which will avoid one more indirection
> (index -> pointer).
>
The proposal was to use it in place of "dev_id + channel_id", i.e.

  copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags)

where the channel pointer implicitly indicates the device too.

However, I realise now that this would be something completely transparent
to the driver, as it would all have to be implemented at the dmadev level,
and it would lead to lots of duplication of function pointers, etc.
Therefore, let's just go with the original scheme. :-(
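
As a rough sketch, the two call styles compare as below. The function names,
types and signatures here are illustrative only, not part of the patch under
review (only dma_cookie_t comes from the RFC):

  #include <stdint.h>

  typedef int64_t dma_cookie_t;         /* per-op handle, as in the RFC */

  /* Style 1: device id + virtual queue/channel id passed on every call. */
  dma_cookie_t dma_copy(uint16_t dev_id, uint16_t ch_id,
  		void *src, void *dst, uint32_t len, uint64_t flags);

  /* Style 2: an opaque per-channel handle, looked up once at setup time.
   * The handle identifies the device as well, saving one indirection
   * (index -> pointer) on the fast path. */
  struct dma_chan;                      /* driver/library-private object */
  dma_cookie_t dma_copy_ptr(struct dma_chan *ch_ptr,
  		void *src, void *dst, uint32_t len, uint64_t flags);

As noted above, the second style would have to be implemented entirely at the
dmadev layer, which is why the first form is being kept.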
> > > Got it. In order to save space in the first cacheline for the fastpath
> > > (saving 8B for the pointer) and to avoid function overhead, can we use
> > > one bit of the flags of the op function to enable the fence?
> > >
> > The original ioat implementation did exactly that. However, I then
> > discovered that, because a fence logically belongs between two operations,
> > does the fence flag on an operation mean "don't do any jobs after this
> > until this job has completed", or does it mean "don't start this job until
> > all previous jobs have completed"? [Or theoretically does it mean both :-)]
> > Naturally, some hardware does it the former way (i.e. the fence flag goes
> > on the last op before the fence), while other hardware does it the latter
> > way (i.e. the fence flag goes on the first op after the fence). Therefore,
> > since fencing is about ordering *between* two (sets of) jobs, I decided
> > that it should do exactly that and go between two jobs, so there is no
> > ambiguity!
> >
> > However, I'm happy enough to switch to having a fence flag, but I think if
> > we do that, it should be put in the "first job after fence" case, because
> > it is always easier to modify a previously written job if we need to than
> > to save the flag for a future one.
> >
> > Alternatively, if we keep the fence as a separate function, I'm happy
> > enough for it not to be on the same cacheline as the "hot" operations,
> > since fencing will always introduce a small penalty anyway.
>
> Ack.
> You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE (if there is
> any use case for this, or it makes sense for your HW).
>
> For us, fence is a NOP, as we have an implicit fence between each HW job
> descriptor.
>
I actually still think that having a separate fence function in the "ops"
section is the best approach here. It's unambiguous as to whether it's
fence-before or fence-after, and if we have it in the ops, it doesn't use a
"fast-path" slot. However, if we *really* want to use a flag instead, I
don't see the value in having two flags - it would be really confusing.
Instead, if we do go with a flag, I think "RTE_DMA_PRE_FENCE" should be the
name, indicating that the fence occurs before the job in question.
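
To illustrate the two options side by side (the enqueue/fence prototypes
below are made up for the example and are not the actual dmadev API; only
the RTE_DMA_PRE_FENCE name comes from the suggestion above):

  #include <stdint.h>

  #define RTE_DMA_PRE_FENCE (1ULL << 0) /* fence *before* the job carrying it */

  /* hypothetical prototypes for the example only */
  int dma_copy(uint16_t dev_id, uint16_t vq_id, void *src, void *dst,
  		uint32_t len, uint64_t flags);
  int dma_fence(uint16_t dev_id, uint16_t vq_id);

  static void
  enqueue_with_fence(uint16_t dev, uint16_t vq,
  		void *src1, void *dst1, void *src2, void *dst2, uint32_t len)
  {
  	/* Option A: a separate fence op between the two jobs, so it is
  	 * unambiguous what is ordered against what. */
  	dma_copy(dev, vq, src1, dst1, len, 0);
  	dma_fence(dev, vq);
  	dma_copy(dev, vq, src2, dst2, len, 0);

  	/* Option B: a flag on the first job after the fence point, meaning
  	 * "don't start this job until all previous jobs have completed". */
  	dma_copy(dev, vq, src1, dst1, len, 0);
  	dma_copy(dev, vq, src2, dst2, len, RTE_DMA_PRE_FENCE);
  }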
> > > > > Since we have additional function call overhead in all the
> > > > > applications for this scheme, I would like to understand the use of
> > > > > doing it this way vs the enq doing the doorbell implicitly, from a
> > > > > driver/application PoV?
> > > > >
> > > > In our benchmarks it's just faster. When we tested it, the overhead of
> > > > the function calls was noticeably less than the cost of building up the
> > > > parameter array(s) for passing the jobs in as a burst. [We don't see
> > > > this cost with things like NIC I/O since DPDK tends to already have the
> > > > mbuf fully populated before the TX call anyway.]
> > >
> > > OK. I agree with the stack population.
> > >
> > > My question was more on doing an implicit doorbell update in the enq. Is
> > > a doorbell write costly in other HW compared to a function call? In our
> > > HW, it is just a write of the number of instructions written to a
> > > register.
> > >
> > > Also, we need to again access the internal PMD memory structure to find
> > > where to write etc. if it is a separate function.
> > >
> > The cost varies depending on a number of factors - even writing to a single
> > HW register can be very slow if that register is mapped as device
> > (uncacheable) memory, since (AFAIK) it will act as a full fence and wait
>
> I don't know; at least in our case, writes are write-back, so the core does
> not need to wait (if there is no read operation).
>
> > for the write to go all the way to hardware. For more modern HW, the cost
> > can be lighter. However, any cost of HW writes is going to be the same
> > whether it's a separate function call or not.
> >
> > However, the main thing about the doorbell update is that it's a
> > once-per-burst thing, rather than a once-per-job. Therefore, even if you
> > have to re-read the struct memory (which is likely still somewhere in your
> > cores' cache), any extra small cost of doing so is amortized over the cost
> > of a whole burst of copies.
>
> The Linux kernel has an xmit_more flag in the skb to address a similar
> thing, i.e. the enq job flags could have one more bit field to say whether
> to update the ring bell or not, rather than having yet another function's
> overhead. IMO, it is the best of both worlds.
>
It's just more conditionals and branches all through the code. Inside the
user application, the user has to check whether to set the flag or not (or
special-case the last transaction outside the loop), and within the driver,
there has to be a branch on whether or not to call the doorbell function.
The code on both sides is far simpler and more readable if the doorbell
function is exactly that - a separate function.
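
The usage pattern being argued for looks roughly as below (again, the names
and prototypes are illustrative only, not the actual dmadev API):

  #include <stdint.h>

  /* hypothetical prototypes for the example only */
  int dma_copy(uint16_t dev_id, uint16_t vq_id, void *src, void *dst,
  		uint32_t len, uint64_t flags);
  void dma_perform(uint16_t dev_id, uint16_t vq_id);   /* ring the doorbell */

  static void
  copy_burst(uint16_t dev, uint16_t vq, void **src, void **dst,
  		uint32_t *len, uint16_t nb_jobs)
  {
  	uint16_t i;

  	/* enqueue each job individually - descriptor writes only, with no
  	 * per-job branch to decide about the doorbell */
  	for (i = 0; i < nb_jobs; i++)
  		dma_copy(dev, vq, src[i], dst[i], len[i], 0);

  	/* one doorbell write for the whole burst, so any cost of the device
  	 * write is amortized over nb_jobs operations */
  	dma_perform(dev, vq);
  }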
> > > > > >
> > > > > > +/**
> > > > > > + * @warning
> > > > > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > > > > + *
> > > > > > + * Returns the number of operations that failed to complete.
> > > > > > + * NOTE: This API was used when rte_dmadev_completed has_error was set.
> > > > > > + *
> > > > > > + * @param dev_id
> > > > > > + *   The identifier of the device.
> > > > > > + * @param vq_id
> > > > > > + *   The identifier of virt queue.
> > > > > > + * @param nb_status
> > > > > > + *   Indicates the size of status array.
> > > > > > + * @param[out] status
> > > > > > + *   The error code of operations that failed to complete.
> > > > > > + * @param[out] cookie
> > > > > > + *   The last failed completed operation's cookie.
> > > > > > + *
> > > > > > + * @return
> > > > > > + *   The number of operations that failed to complete.
> > > > > > + *
> > > > > > + * NOTE: The caller must ensure that the input parameter is valid and the
> > > > > > + *   corresponding device supports the operation.
> > > > > > + */
> > > > > > +__rte_experimental
> > > > > > +static inline uint16_t
> > > > > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id,
> > > > > > +			const uint16_t nb_status, uint32_t *status,
> > > > > > +			dma_cookie_t *cookie)
> > > > >
> > > > > IMO, it is better to move cookie/ring_idx to position 3. Why would it
> > > > > return an array of errors, since it is called after
> > > > > rte_dmadev_completed() has has_error? Is it better to change it to
> > > > >
> > > > > rte_dmadev_error_status(uint16_t dev_id, uint16_t vq_id,
> > > > >			dma_cookie_t *cookie, uint32_t *status)
> > > > >
> > > > > I also think we may need to set status as a bitmask and enumerate all
> > > > > the combinations of error codes of all the drivers, and return a
> > > > > string from the driver like the existing rte_flow_error.
> > > > >
> > > > > See
> > > > > struct rte_flow_error {
> > > > >	enum rte_flow_error_type type; /**< Cause field and error types. */
> > > > >	const void *cause; /**< Object responsible for the error. */
> > > > >	const char *message; /**< Human-readable error message. */
> > > > > };
> > > > >
> > > > I think we need a multi-return value API here, as we may add operations
> > > > in future which have non-error status values to return. The obvious
> > > > case is DMA engines which support "compare" operations. In that case a
> > > > successful compare (as in there were no DMA or HW errors) can return
> > > > "equal" or "not-equal" as statuses. For general "copy" operations, the
> > > > faster completion op can be used to just return successful values (and
> > > > only call this status version on error), while apps using those compare
> > > > ops, or a mixture of copy and compare ops, would always use the slower
> > > > one that returns status values for each and every op.
> > > >
> > > > The ioat APIs used 32-bit integer values for this status array so as to
> > > > allow e.g. 16 bits for an error code and 16 bits for future status
> > > > values. For most operations there should be a fairly small set of
> > > > things that can go wrong, i.e. bad source address, bad destination
> > > > address or invalid length. Within that we may have a couple of
> > > > specifics for why an address is bad, but even so I don't think we need
> > > > to start having multiple bit combinations.
> > >
> > > OK. What is the purpose of the error status? Is it for the application
> > > to print, or does the application need to take any action based on
> > > specific errors?
> >
> > It's largely for information purposes, but in the case of SVA/SVM, errors
> > could occur due to the memory not being pinned, i.e. a page fault, in some
> > cases. If that happens, then it's up to the app to either touch the memory
> > and retry the copy, or to do a SW memcpy as a fallback.
> >
> > In other error cases, I think it's good to tell the application if it's
> > passing around bad data, or data that is beyond the scope of hardware,
> > e.g. a copy that is beyond what can be done in a single transaction for a
> > HW instance. Given that there are always things that can go wrong, I think
> > we need some error reporting mechanism.
> >
> > > If the former is the scope, then we need to define standard enum values
> > > for the errors, right? I.e. uint32_t *status needs to change to
> > > enum rte_dma_error or so.
> > >
> > Sure. Perhaps an error/status structure is an option, where we explicitly
> > call out error info from status info.
>
> Agree. Better to have a structure with fields like:
>
> 1) enum rte_dma_error_type
> 2) memory to store an informative message on the finer aspects of the
>    error, like the address that caused the issue etc. (which will be
>    driver-specific information).
>
The only issue I have with that is that once we have driver-specific
information it is of little use to the application, since it can't know
anything about it except maybe log it. I'd much rather have a set of error
codes telling the user that "source address is wrong", "dest address is
wrong", and a generic "an address is wrong" in case the driver/HW cannot
distinguish the source of the error. Can we see how far we get with just
error codes before we start passing string messages around, and all the
memory management headaches that implies.

/Bruce
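
P.S. For illustration, the two directions discussed above might look roughly
like this; the names are invented for the example and are not actual dmadev
definitions:

  #include <stdint.h>

  /* Direction 1: a fixed set of error codes the application can act on,
   * e.g. retry after a page fault or fall back to a SW memcpy. */
  enum dma_status_code {
  	DMA_STATUS_SUCCESSFUL = 0,
  	DMA_STATUS_INVALID_SRC_ADDR,
  	DMA_STATUS_INVALID_DST_ADDR,
  	DMA_STATUS_INVALID_ADDR,      /* HW cannot tell which address is bad */
  	DMA_STATUS_INVALID_LENGTH,
  	DMA_STATUS_PAGE_FAULT         /* e.g. SVA/SVM memory not pinned */
  };

  /* Direction 2: an rte_flow_error-style structure carrying driver-specific
   * detail alongside the code. */
  struct dma_error {
  	enum dma_status_code code;
  	const void *cause;            /* object/address responsible for the error */
  	const char *message;          /* driver-specific, human-readable text */
  };

The argument above is to start with just the first form and only add the
string/struct form if plain codes prove insufficient.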