From: Jerin Jacob
Date: Wed, 7 Jul 2021 13:38:58 +0530
To: Bruce Richardson
Cc: Chengwen Feng, Thomas Monjalon, Ferruh Yigit, Jerin Jacob, dpdk-dev,
 Morten Brørup, Nipun Gupta, Hemant Agrawal, Maxime Coquelin,
 Honnappa Nagarahalli, David Marchand, Satananda Burla, Prasun Kapoor,
 "Ananyev, Konstantin", liangma@liangbit.com, Radha Mohan Chintakuntla
References: <1625231891-2963-1-git-send-email-fengchengwen@huawei.com>
Subject: Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson wrote:
>
> On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
> >
> > On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson wrote:
> > >
> > > On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
> > > > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng wrote:
> > > > >
> > > > > This patch introduces 'dmadevice' which is a generic type of DMA
> > > > > device.
> > >
> > > +1 and the terminology with regards to queues and channels. With our ioat
> > > hardware, each HW queue was called a channel for instance.
> >
> > Looks like <> can cover all the use cases; if the HW has more than
> > 1 queue, it can be exposed as separate dmadev devices.
> >
> > Fine for me.
>
> However, just to confirm that Morten's suggestion of using a
> (device-specific void *) channel pointer rather than dev_id + channel_id
> pair of parameters won't work for you? You can't store a pointer or dev
> index in the channel struct in the driver?

Yes. That will work. To confirm, the suggestion is to use a void * object
instead of channel_id. That will avoid one more indirection (index -> pointer).

> > > > > + *
> > > > > + * If dma_cookie_t is >=0 it's a DMA operation request cookie, <0 it's an error
> > > > > + * code.
> > > > > + * When using cookies, comply with the following rules:
> > > > > + * a) Cookies for each virtual queue are independent.
> > > > > + * b) For a virt queue, the cookie are monotonically incremented, when it reach
> > > > > + *    the INT_MAX, it wraps back to zero.
> > >
> > > I disagree with the INT_MAX (or INT32_MAX) value here. If we use that
> > > value, it means that we cannot use implicit wrap-around inside the CPU and
> > > have to check for the INT_MAX value. Better to:
> > > 1. Specify that it wraps at UINT16_MAX which allows us to just use a
> > >    uint16_t internally and wrap-around automatically, or:
> > > 2. Specify that it wraps at a power-of-2 value >= UINT16_MAX, giving
> > >    drivers the flexibility at what value to wrap around.
> >
> > I think (2) is better than 1. I think it is even better to wrap around the
> > number of descriptors configured in dev_configure() (we can make this a
> > power of 2).
> >
>
> Interesting, I hadn't really considered that before. My only concern
> would be if an app wants to keep values in the app ring for a while after
> they have been returned from dmadev. I thought it easier to have the full
> 16-bit counter value returned to the user to give the most flexibility,
> given that going from that to any power-of-2 ring size smaller is a trivial
> operation.
>
> Overall, while my ideal situation is to always have a 0..UINT16_MAX return
> value from the function, I can live with your suggestion of wrapping at
> ring_size, since drivers will likely do that internally anyway.
> I think wrapping at INT32_MAX is too awkward and will be error prone since
> we can't rely on hardware automatically wrapping to zero, nor on the driver
> having pre-masked the value.

OK. +1 for UINT16_MAX. (A small sketch of the masking arithmetic follows
further below.)

> > > > > + * c) The initial cookie of a virt queue is zero, after the device is stopped or
> > > > > + *    reset, the virt queue's cookie needs to be reset to zero.
> > > >
> > > > Please add some good amount of reserved bits and have API to init this
> > > > structure for future ABI stability, say rte_dmadev_queue_config_init()
> > > > or so.
> > >
> > > I don't think that is necessary. Since the config struct is used only as
> > > parameter to the config function, any changes to it can be managed by
> > > versioning that single function. Padding would only be necessary if we had
> > > an array of these config structs somewhere.
> >
> > OK.
> >
> > For some reason, the versioning API looks ugly to me in code; keeping
> > some rsvd fields with an init function looks cleaner to me.
> >
> > But I agree, function versioning works in this case. No need to find another
> > API if it is not general DPDK API practice.
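Going back to the wrap-around point settled above, a minimal sketch of the
masking arithmetic, assuming a power-of-2 ring size; all names here are
hypothetical and not part of the proposed dmadev API:

    #include <stdint.h>

    /* Hypothetical driver-internal ring; ring_size must be a power of 2. */
    struct hypo_dma_ring {
            uint16_t ring_size;  /* e.g. 1024, fixed at configure time */
            uint16_t next_idx;   /* free-running 16-bit counter */
    };

    /* Pick the slot for the next job: the 16-bit counter wraps implicitly
     * at UINT16_MAX, and the mask maps it into the ring, so no explicit
     * INT_MAX-style check is needed. */
    static inline uint16_t
    hypo_dma_next_slot(struct hypo_dma_ring *r)
    {
            uint16_t idx = r->next_idx++;
            return idx & (r->ring_size - 1);
    }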
> >
>
> The one thing I would suggest instead of the padding is for the internal
> APIs, to pass the struct size through, since we can't version those - and
> for padding we can't know whether any replaced padding should be used or
> not. Specifically:
>
> typedef int (*rte_dmadev_configure_t)(struct rte_dmadev *dev,
>                                       struct rte_dmadev_conf *cfg,
>                                       size_t cfg_size);
>
> but for the public function:
>
> int
> rte_dmadev_configure(struct rte_dmadev *dev, struct rte_dmadev_conf *cfg)
> {
>         ...
>         ret = dev->ops.configure(dev, cfg, sizeof(*cfg));
>         ...
> }

Makes sense.

>
> Then if we change the structure and version the config API, the driver can
> tell from the size what struct version it is and act accordingly. Without
> that, each time the struct changed, we'd have to add a new function pointer
> to the device ops.
>
> > In other libraries, I have seen such an _init function that can be used
> > for this, as well as for filling default values
> > (in some cases the implementation's default values are not zero),
> > so that the application can avoid a memset for the param structure.
> > Added rte_event_queue_default_conf_get() in eventdev spec for this.
> >
>
> I think that would largely have the same issues, unless it returned a
> pointer to data inside the driver - and which therefore could not be
> modified. Alternatively it would mean that the memory would have been
> allocated in the driver and we would need to ensure proper cleanup
> functions were called to free memory afterwards. Supporting having the
> config parameter as a local variable I think makes things a lot easier.

No strong opinion on this.

> > > > > +
> > > > > +/**
> > > > > + * A structure used to retrieve information of a DMA virt queue.
> > > > > + */
> > > > > +struct rte_dmadev_queue_info {
> > > > > +	enum dma_transfer_direction direction;
> > > >
> > > > A queue may support all directions, so I think it should be a bitfield.
> > > >
> > > > > +	/**< Associated transfer direction */
> > > > > +	uint16_t hw_queue_id; /**< The HW queue on which to create virt queue */
> > > > > +	uint16_t nb_desc; /**< Number of descriptor for this virt queue */
> > > > > +	uint64_t dev_flags; /**< Device specific flags */
> > > > > +};
> > > > > +
> > > > >
> > > > > +__rte_experimental
> > > > > +static inline dma_cookie_t
> > > > > +rte_dmadev_copy_sg(uint16_t dev_id, uint16_t vq_id,
> > > > > +		   const struct dma_scatterlist *sg,
> > > > > +		   uint32_t sg_len, uint64_t flags)
> > > >
> > > > I would like to change this as:
> > > > rte_dmadev_copy_sg(uint16_t dev_id, uint16_t vq_id,
> > > >                    const struct rte_dma_sg *src, uint32_t nb_src,
> > > >                    const struct rte_dma_sg *dst, uint32_t nb_dst)
> > > > or so, to allow use cases like

In the above syntax, @Chengchang Tang, rte_dma_sg needs to contain only ptr and size.

> > > > a 30 MB src copy can be split and written as 1 MB x 30 dst.
> >
>
> Out of interest, do you see much benefit (and in what way) from having the
> scatter-gather support? Unlike sending 5 buffers in one packet rather than
> 5 buffers in 5 packets to a NIC, copying an array of memory in one op vs
> multiple is functionally identical.

Knowing upfront, in one shot, that the transfer is expressed as such segments
allows better optimization in drivers, like:
1) In one DMA job request the HW can fill multiple segments, vs multiple DMA
   job requests with one segment each.
2) A single completion, i.e. less overhead on the system.
3) Less latency for the job requests.
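For clarity, the split src/dst form suggested above might look roughly like
the sketch below. The rte_dma_sg layout with just a pointer and a size
follows the comment above; the typedef and exact prototype are assumptions,
not the final API:

    #include <stdint.h>

    typedef int32_t dma_cookie_t;  /* as in the RFC under discussion */

    /* Per the comment above: only a pointer and a length per segment. */
    struct rte_dma_sg {
            void *addr;
            uint32_t length;
    };

    /* Hypothetical prototype: separate source and destination segment
     * lists, so e.g. one 30 MB source can be written out as 30 x 1 MB
     * destinations. */
    dma_cookie_t
    rte_dmadev_copy_sg(uint16_t dev_id, uint16_t vq_id,
                       const struct rte_dma_sg *src, uint32_t nb_src,
                       const struct rte_dma_sg *dst, uint32_t nb_dst);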
> >
> > Got it. In order to save space in the first CL for fastpath (saving 8B
> > for the pointer) and to avoid function overhead, can we use one bit of
> > the op function's flags to enable the fence?
> >
>
> The original ioat implementation did exactly that. However, I then
> discovered that because a fence logically belongs between two operations,
> does the fence flag on an operation mean "don't do any jobs after this
> until this job has completed" or does it mean "don't start this job until
> all previous jobs have completed". [Or theoretically does it mean both :-)]
> Naturally, some hardware does it the former way (i.e. fence flag goes on
> last op before fence), while other hardware the latter way (i.e. fence flag
> goes on first op after the fence). Therefore, since fencing is about
> ordering *between* two (sets of) jobs, I decided that it should do exactly
> that and go between two jobs, so there is no ambiguity!
>
> However, I'm happy enough to switch to having a fence flag, but I think if
> we do that, it should be put in the "first job after fence" case, because
> it is always easier to modify a previously written job if we need to, than
> to save the flag for a future one.
>
> Alternatively, if we keep the fence as a separate function, I'm happy
> enough for it not to be on the same cacheline as the "hot" operations,
> since fencing will always introduce a small penalty anyway.

Ack. You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE (if there
is any use case for this, or if it makes sense for your HW).

For us, fence is a NOP, as we have an implicit fence between each HW job
descriptor.

> > > >
> > > > Since we have additional function call overhead in all the
> > > > applications for this scheme, I would like to understand
> > > > the use of doing this way vs enq does the doorbell implicitly from
> > > > driver/application PoV?
> > > >
> > >
> > > In our benchmarks it's just faster. When we tested it, the overhead of the
> > > function calls was noticeably less than the cost of building up the
> > > parameter array(s) for passing the jobs in as a burst. [We don't see this
> > > cost with things like NIC I/O since DPDK tends to already have the mbuf
> > > fully populated before the TX call anyway.]
> >
> > OK. I agree with the stack population point.
> >
> > My question was more on doing the implicit doorbell update in enq. Is the
> > doorbell write costly in other HW compared to a function call? In our HW,
> > it is just a write of the number of instructions written in a register.
> >
> > Also, we need to again access the internal PMD memory structure to find
> > where to write etc. if it is a separate function.
> >
>
> The cost varies depending on a number of factors - even writing to a single
> HW register can be very slow if that register is mapped as device
> (uncacheable) memory, since (AFAIK) it will act as a full fence and wait

I don't know; at least in our case, writes are write-back, so the core does
not need to wait (if there is no read operation).

> for the write to go all the way to hardware. For more modern HW, the cost
> can be lighter. However, any cost of HW writes is going to be the same
> whether it's a separate function call or not.
>
> However, the main thing about the doorbell update is that it's a
> once-per-burst thing, rather than a once-per-job. Therefore, even if you
> have to re-read the struct memory (which is likely still somewhere in your
> cores' cache), any extra small cost of doing so is to be amortized over the
> cost of a whole burst of copies.
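As an illustration of the once-per-burst doorbell pattern described above, a
rough sketch; the rte_dmadev_copy()/rte_dmadev_perform() prototypes are
assumed here, loosely following the RFC's style, and are not the final API:

    #include <stdint.h>

    typedef int32_t dma_cookie_t;  /* as in the RFC under discussion */

    /* Assumed prototypes, loosely following the RFC's style. */
    dma_cookie_t rte_dmadev_copy(uint16_t dev_id, uint16_t vq_id,
                                 void *src, void *dst, uint32_t length,
                                 uint64_t flags);
    int rte_dmadev_perform(uint16_t dev_id, uint16_t vq_id); /* doorbell */

    /* Enqueue a burst of copies, then ring the doorbell once, so the cost
     * of the HW doorbell write is amortized over the whole burst. */
    static void
    copy_burst(uint16_t dev_id, uint16_t vq_id,
               void **src, void **dst, uint32_t *len, uint16_t n)
    {
            for (uint16_t i = 0; i < n; i++)
                    rte_dmadev_copy(dev_id, vq_id, src[i], dst[i], len[i], 0);

            rte_dmadev_perform(dev_id, vq_id);  /* single doorbell update */
    }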
The Linux kernel has the xmit_more flag in the skb to address a similar
thing, i.e. the enq job flags can have one more bit to say whether to update
the ring doorbell or not, rather than having yet another function call
overhead. IMO, it is the best of both worlds.

> > > > >
> > > > > +
> > > > > +/**
> > > > > + * @warning
> > > > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > > > + *
> > > > > + * Returns the number of operations that failed to complete.
> > > > > + * NOTE: This API was used when rte_dmadev_completed has_error was set.
> > > > > + *
> > > > > + * @param dev_id
> > > > > + *   The identifier of the device.
> > > > > + * @param vq_id
> > > > > + *   The identifier of virt queue.
> > > > > + * @param nb_status
> > > > > + *   Indicates the size of status array.
> > > > > + * @param[out] status
> > > > > + *   The error code of operations that failed to complete.
> > > > > + * @param[out] cookie
> > > > > + *   The last failed completed operation's cookie.
> > > > > + *
> > > > > + * @return
> > > > > + *   The number of operations that failed to complete.
> > > > > + *
> > > > > + * NOTE: The caller must ensure that the input parameter is valid and the
> > > > > + *   corresponding device supports the operation.
> > > > > + */
> > > > > +__rte_experimental
> > > > > +static inline uint16_t
> > > > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id,
> > > > > +			   const uint16_t nb_status, uint32_t *status,
> > > > > +			   dma_cookie_t *cookie)
> > > >
> > > > IMO, it is better to move cookie/ring_idx to 3.
> > > > Why would it return an array of errors, since it is called after
> > > > rte_dmadev_completed() has has_error set? Is it better to change it to
> > > >
> > > > rte_dmadev_error_status(uint16_t dev_id, uint16_t vq_id, dma_cookie_t
> > > > *cookie, uint32_t *status)
> > > >
> > > > I also think we may need to set status as a bitmask and enumerate all
> > > > the combinations of error codes of all the drivers, and return a string
> > > > from the driver like the existing rte_flow_error.
> > > >
> > > > See
> > > > struct rte_flow_error {
> > > >         enum rte_flow_error_type type; /**< Cause field and error types. */
> > > >         const void *cause; /**< Object responsible for the error. */
> > > >         const char *message; /**< Human-readable error message. */
> > > > };
> > > >
> > >
> > > I think we need a multi-return value API here, as we may add operations in
> > > future which have non-error status values to return. The obvious case is
> > > DMA engines which support "compare" operations. In that case a successful
> > > compare (as in there were no DMA or HW errors) can return "equal" or
> > > "not-equal" as statuses. For general "copy" operations, the faster
> > > completion op can be used to just return successful values (and only call
> > > this status version on error), while apps using those compare ops or a
> > > mixture of copy and compare ops, would always use the slower one that
> > > returns status values for each and every op.
> > >
> > > The ioat APIs used 32-bit integer values for this status array so as to
> > > allow e.g. 16-bits for error code and 16-bits for future status values. For
> > > most operations there should be a fairly small set of things that can go
> > > wrong, i.e. bad source address, bad destination address or invalid length.
> > > Within that we may have a couple of specifics for why an address is bad,
> > > but even so I don't think we need to start having multiple bit
> > > combinations.
> >
> > OK. What is the purpose of the error status?
> > Is it for the application to print it, or does the application need to
> > take any action based on specific errors?
>
> It's largely for information purposes, but in the case of SVA/SVM, errors
> could occur due to the memory not being pinned, i.e. a page fault, in some
> cases. If that happens, then it's up to the app to either touch the memory
> and retry the copy, or to do a SW memcpy as a fallback.
>
> In other error cases, I think it's good to tell the application if it's
> passing around bad data, or data that is beyond the scope of hardware, e.g.
> a copy that is beyond what can be done in a single transaction for a HW
> instance. Given that there are always things that can go wrong, I think we
> need some error reporting mechanism.
>
> > If the former is the scope, then we need to define the standard enum
> > values for the errors, right?
> > i.e. uint32_t *status needs to change to enum rte_dma_error or so.
> >
>
> Sure. Perhaps an error/status structure is also an option, where we
> explicitly call out error info from status info.

Agree. Better to have a structure with fields like:
1) enum rte_dma_error_type
2) memory to store an informative message on the fine aspects of the error,
   like the address that caused the issue etc. (which will be driver-specific
   information).

> >
>
> /Bruce
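To illustrate the kind of error/status structure being discussed, a rough
sketch; the enum members and field names here are hypothetical, drawn from
the error causes mentioned in this thread, and are not the final API:

    #include <stdint.h>

    /* Hypothetical error classes; the actual set would be defined by the spec. */
    enum rte_dma_error_type {
            RTE_DMA_ERR_NONE = 0,
            RTE_DMA_ERR_INVALID_SRC_ADDR,
            RTE_DMA_ERR_INVALID_DST_ADDR,
            RTE_DMA_ERR_INVALID_LENGTH,
            RTE_DMA_ERR_PAGE_FAULT,   /* e.g. unpinned memory with SVA/SVM */
            RTE_DMA_ERR_UNKNOWN,
    };

    /* Hypothetical per-operation error report. */
    struct rte_dma_error {
            enum rte_dma_error_type type;  /* standard error class */
            const char *message;           /* driver-specific detail, e.g.
                                            * the faulting address */
    };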