Date: Mon, 5 Jul 2021 18:16:07 +0100
From: Bruce Richardson
To: Jerin Jacob
Cc: Chengwen Feng, Thomas Monjalon, Ferruh Yigit, Jerin Jacob, dpdk-dev,
 Morten Brørup, Nipun Gupta, Hemant Agrawal, Maxime Coquelin,
 Honnappa Nagarahalli, David Marchand, Satananda Burla, Prasun Kapoor,
 "Ananyev, Konstantin", liangma@liangbit.com, Radha Mohan Chintakuntla
Subject: Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
>
> On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson wrote:
> >
> > On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
> > > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng wrote:
> > > >
> > > > This patch introduces 'dmadevice' which is a generic type of DMA
> > > > device.
> >
> > +1 and the terminology with regards to queues and channels. With our ioat
> > hardware, each HW queue was called a channel for instance.
>
> Looks like <> can cover all the use cases; if the HW has more than
> 1 queue it can be exposed as separate dmadev devs.
> Fine for me.
>

However, just to confirm: Morten's suggestion of using a (device-specific
void *) channel pointer rather than a dev_id + channel_id pair of
parameters won't work for you? You can't store a pointer or dev index in
the channel struct in the driver?

> > > > + *
> > > > + * If dma_cookie_t is >=0 it's a DMA operation request cookie, <0 it's a error
> > > > + * code.
> > > > + * When using cookies, comply with the following rules:
> > > > + * a) Cookies for each virtual queue are independent.
> > > > + * b) For a virt queue, the cookie are monotonically incremented, when it reach
> > > > + *    the INT_MAX, it wraps back to zero.
> >
> > I disagree with the INT_MAX (or INT32_MAX) value here. If we use that
> > value, it means that we cannot use implicit wrap-around inside the CPU and
> > have to check for the INT_MAX value. Better to:
> > 1. Specify that it wraps at UINT16_MAX, which allows us to just use a
> >    uint16_t internally and wrap around automatically, or:
> > 2. Specify that it wraps at a power-of-2 value >= UINT16_MAX, giving
> >    drivers the flexibility at what value to wrap around.
>
> I think (2) is better than (1). I think it is even better to wrap around
> the number of descriptors configured in dev_configure() (we can make this
> a power of 2).
>

Interesting, I hadn't really considered that before. My only concern would
be if an app wants to keep values in the app ring for a while after they
have been returned from dmadev. I thought it easier to have the full 16-bit
counter value returned to the user to give the most flexibility, given that
going from that to any smaller power-of-2 ring size is a trivial operation.

Overall, while my ideal situation is to always have a 0..UINT16_MAX return
value from the function, I can live with your suggestion of wrapping at
ring_size, since drivers will likely do that internally anyway. I think
wrapping at INT32_MAX is too awkward and will be error prone, since we
can't rely on hardware automatically wrapping to zero, nor on the driver
having pre-masked the value.

> > > > + * c) The initial cookie of a virt queue is zero, after the device is stopped or
> > > > + *    reset, the virt queue's cookie needs to be reset to zero.
> > >
> > > Please add some good amount of reserved bits and have API to init this
> > > structure for future ABI stability, say rte_dmadev_queue_config_init()
> > > or so.
> > >
> >
> > I don't think that is necessary. Since the config struct is used only as
> > parameter to the config function, any changes to it can be managed by
> > versioning that single function. Padding would only be necessary if we had
> > an array of these config structs somewhere.
>
> OK.
>
> For some reason, the versioning API looks ugly to me in code, whereas
> keeping some rsvd fields with an init function looks cool to me.
>
> But I agree, function versioning works in this case. No need to find
> another approach if it is not general DPDK API practice.
>

The one thing I would suggest, instead of the padding, is for the internal
APIs to pass the struct size through, since we can't version those - and
with padding alone we can't know whether padding later replaced by a real
field should be used or not. Specifically:

    typedef int (*rte_dmadev_configure_t)(struct rte_dmadev *dev,
                    struct rte_dmadev_conf *cfg, size_t cfg_size);

but for the public function:

    int
    rte_dmadev_configure(struct rte_dmadev *dev, struct rte_dmadev_conf *cfg)
    {
            ...
            ret = dev->ops.configure(dev, cfg, sizeof(*cfg));
            ...
    }

Then if we change the structure and version the config API, the driver can
tell from the size what struct version it is and act accordingly. Without
that, each time the struct changed, we'd have to add a new function pointer
to the device ops.

> In other libraries, I have seen such an _init function that can be used
> for this, as well as for filling in default values (in some cases the
> implementation's default values are not zero).
> That way the application can avoid a memset for the param structure.
> Added rte_event_queue_default_conf_get() in the eventdev spec for this.
>

I think that would largely have the same issues, unless it returned a
pointer to data inside the driver - which therefore could not be modified.
Alternatively it would mean that the memory would have been allocated in
the driver and we would need to ensure proper cleanup functions were called
to free memory afterwards. Supporting having the config parameter as a
local variable I think makes things a lot easier.

> No strong opinion on this.
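To make the cfg_size idea above a little more concrete, the driver side
could look something like the snippet below. This is just a rough sketch -
the priv struct and the field names are invented purely for illustration:

    /* Hypothetical PMD configure callback. If a later version of the
     * library appends a field (here "new_option") to rte_dmadev_conf,
     * an application built against the old struct passes a smaller
     * cfg_size and the driver falls back to a default value. */
    static int
    my_pmd_configure(struct rte_dmadev *dev, struct rte_dmadev_conf *cfg,
                    size_t cfg_size)
    {
            struct my_pmd_priv *priv = dev->dev_private;

            /* fields present in every version of the struct */
            priv->nb_hw_queues = cfg->nb_hw_queues;

            /* field only present in the newer, larger struct */
            if (cfg_size >= sizeof(struct rte_dmadev_conf))
                    priv->new_option = cfg->new_option;
            else
                    priv->new_option = MY_PMD_DEFAULT_OPTION;

            return 0;
    }

The nice part is that the driver never has to guess: the size tells it
exactly which version of the struct the application was built against.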
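Going back to the cookie wrap-around discussion further up, a quick sketch
of the arithmetic I'm assuming (variable names invented): with a wrap at a
power of 2, both the driver and the app get from a cookie to a ring slot
with a single mask, while an INT32_MAX wrap needs an explicit check on
every increment.

    /* free-running 16-bit counter wraps automatically at UINT16_MAX */
    uint16_t cookie = vq->next_cookie++;

    /* any power-of-2 ring size, driver- or app-side, is then just a mask */
    uint16_t drv_slot = cookie & (vq->ring_size - 1);
    uint16_t app_slot = cookie & (app_ring_size - 1);

    /* by contrast, a wrap at INT32_MAX needs an explicit
     * compare-and-reset on every increment */

The same masking works whether the spec says the wrap point is UINT16_MAX
or the configured ring size, as long as it is a power of 2.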
> > > > +
> > > > +/**
> > > > + * A structure used to retrieve information of a DMA virt queue.
> > > > + */
> > > > +struct rte_dmadev_queue_info {
> > > > +	enum dma_transfer_direction direction;
> > >
> > > A queue may support all directions so I think it should be a bitfield.
> > >
> > > > +	/**< Associated transfer direction */
> > > > +	uint16_t hw_queue_id; /**< The HW queue on which to create virt queue */
> > > > +	uint16_t nb_desc; /**< Number of descriptor for this virt queue */
> > > > +	uint64_t dev_flags; /**< Device specific flags */
> > > > +};
> > > > +
> > > > +__rte_experimental
> > > > +static inline dma_cookie_t
> > > > +rte_dmadev_copy_sg(uint16_t dev_id, uint16_t vq_id,
> > > > +		const struct dma_scatterlist *sg,
> > > > +		uint32_t sg_len, uint64_t flags)
> > >
> > > I would like to change this as:
> > > rte_dmadev_copy_sg(uint16_t dev_id, uint16_t vq_id,
> > >		const struct rte_dma_sg *src, uint32_t nb_src,
> > >		const struct rte_dma_sg *dst, uint32_t nb_dst)
> > > or so, to allow use cases like a 30 MB src copy being split and written
> > > as 1 MB x 30 dst.
> > >

Out of interest, do you see much benefit (and in what way) from having the
scatter-gather support? Unlike sending 5 buffers in one packet rather than
5 buffers in 5 packets to a NIC, copying an array of memory in one op vs
multiple is functionally identical.

> Got it. In order to save space in the first CL for fastpath (saving 8B
> for the pointer) and to avoid function overhead, can we use one bit of
> the op function's flags to enable the fence?
>

The original ioat implementation did exactly that. However, I then
discovered that, because a fence logically belongs between two operations,
it was unclear whether the fence flag on an operation means "don't do any
jobs after this until this job has completed" or "don't start this job
until all previous jobs have completed". [Or theoretically does it mean
both? :-)] Naturally, some hardware does it the former way (i.e. the fence
flag goes on the last op before the fence), while other hardware does it
the latter way (i.e. the fence flag goes on the first op after the fence).
Therefore, since fencing is about ordering *between* two (sets of) jobs, I
decided that it should do exactly that and go between two jobs, so there is
no ambiguity!

However, I'm happy enough to switch to having a fence flag, but I think if
we do that, it should be put in the "first job after fence" case, because
it is always easier to modify a previously written job if we need to, than
to save the flag for a future one.

Alternatively, if we keep the fence as a separate function, I'm happy
enough for it not to be on the same cacheline as the "hot" operations,
since fencing will always introduce a small penalty anyway.

> > > Since we have additional function call overhead in all the
> > > applications for this scheme, I would like to understand
> > > the use of doing this way vs enq does the doorbell implicitly from
> > > driver/application PoV?
> > >
> >
> > In our benchmarks it's just faster. When we tested it, the overhead of the
> > function calls was noticeably less than the cost of building up the
> > parameter array(s) for passing the jobs in as a burst. [We don't see this
> > cost with things like NIC I/O since DPDK tends to already have the mbuf
> > fully populated before the TX call anyway.]
>
> OK. I agree with stack population.
>
> My question was more on doing the doorbell update implicitly in enq. Is a
> doorbell write costly in other HW compared to a function call?
> In our HW, it is just a write of the number of instructions written to a
> register.
>
> Also, we need to again access the internal PMD memory structure to find
> where to write etc. if it is a separate function.
>

The cost varies depending on a number of factors - even writing to a single
HW register can be very slow if that register is mapped as device
(uncacheable) memory, since (AFAIK) it will act as a full fence and wait
for the write to go all the way to hardware. For more modern HW, the cost
can be lighter. However, any cost of HW writes is going to be the same
whether it's a separate function call or not.

More importantly, the doorbell update is a once-per-burst thing, rather
than a once-per-job one. Therefore, even if you have to re-read the struct
memory (which is likely still somewhere in your core's cache), any extra
small cost of doing so is amortized over the cost of a whole burst of
copies.

> > > > +
> > > > +/**
> > > > + * @warning
> > > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > > + *
> > > > + * Returns the number of operations that failed to complete.
> > > > + * NOTE: This API was used when rte_dmadev_completed has_error was set.
> > > > + *
> > > > + * @param dev_id
> > > > + *   The identifier of the device.
> > > > + * @param vq_id
> > > > + *   The identifier of virt queue.
> > > > + * @param nb_status
> > > > + *   Indicates the size of status array.
> > > > + * @param[out] status
> > > > + *   The error code of operations that failed to complete.
> > > > + * @param[out] cookie
> > > > + *   The last failed completed operation's cookie.
> > > > + *
> > > > + * @return
> > > > + *   The number of operations that failed to complete.
> > > > + *
> > > > + * NOTE: The caller must ensure that the input parameter is valid and the
> > > > + *   corresponding device supports the operation.
> > > > + */
> > > > +__rte_experimental
> > > > +static inline uint16_t
> > > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id,
> > > > +		const uint16_t nb_status, uint32_t *status,
> > > > +		dma_cookie_t *cookie)
> > >
> > > IMO, it is better to move cookie/ring_idx to the 3rd argument.
> > > Why would it return an array of errors, since it is called after
> > > rte_dmadev_completed() has has_error set? Is it better to change to
> > >
> > > rte_dmadev_error_status(uint16_t dev_id, uint16_t vq_id, dma_cookie_t
> > > *cookie, uint32_t *status)
> > >
> > > I also think we may need to set status as a bitmask and enumerate all
> > > the combinations of error codes of all the drivers, and return a string
> > > from the driver, like the existing rte_flow_error.
> > >
> > > See
> > > struct rte_flow_error {
> > >	enum rte_flow_error_type type; /**< Cause field and error types. */
> > >	const void *cause; /**< Object responsible for the error. */
> > >	const char *message; /**< Human-readable error message. */
> > > };
> > >
> >
> > I think we need a multi-return value API here, as we may add operations in
> > future which have non-error status values to return. The obvious case is
> > DMA engines which support "compare" operations. In that case a successful
> > compare (as in there were no DMA or HW errors) can return "equal" or
> > "not-equal" as statuses. For general "copy" operations, the faster
> > completion op can be used to just return successful values (and only call
> > this status version on error), while apps using those compare ops, or a
> > mixture of copy and compare ops, would always use the slower one that
> > returns status values for each and every op.
> >
> > The ioat APIs used 32-bit integer values for this status array so as to
> > allow e.g. 16-bits for error code and 16-bits for future status values.
> > For most operations there should be a fairly small set of things that can
> > go wrong, i.e. bad source address, bad destination address or invalid
> > length. Within that we may have a couple of specifics for why an address
> > is bad, but even so I don't think we need to start having multiple bit
> > combinations.
>
> OK. What is the purpose of the error status? Is it for the application to
> print it, or does the application need to take any action based on the
> specific error?

It's largely for information purposes, but in the case of SVA/SVM, errors
could in some cases occur due to the memory not being pinned, i.e. a page
fault. If that happens, then it's up to the app to either touch the memory
and retry the copy, or to do a SW memcpy as a fallback.

In other error cases, I think it's good to tell the application if it's
passing around bad data, or data that is beyond the scope of hardware, e.g.
a copy that is beyond what can be done in a single transaction for a HW
instance. Given that there are always things that can go wrong, I think we
need some error reporting mechanism.

> If the former is the scope, then we need to define the standard enum
> values for the errors, right?
> i.e. uint32_t *status needs to change to enum rte_dma_error or so.
>

Sure. Perhaps an error/status structure is another option, where we
explicitly call out error info from status info.

/Bruce
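P.S. To make the error/status structure idea just above a bit more
concrete, something along the lines below is roughly what I have in mind.
Purely illustrative - the struct and field names are made up:

    struct rte_dmadev_op_status {
            uint16_t status; /* op result, e.g. ok, or not-equal for compare ops */
            uint16_t error;  /* failure cause, e.g. bad address, page fault */
    };

That keeps the 32 bits per operation discussed above, while making it
explicit which half is a hard error and which is just an operation result.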