References: <1625231891-2963-1-git-send-email-fengchengwen@huawei.com>
From: Jerin Jacob
Date: Wed, 7 Jul 2021 16:04:16 +0530
To: Bruce Richardson
Cc: Chengwen Feng, Thomas Monjalon, Ferruh Yigit, Jerin Jacob, dpdk-dev,
 Morten Brørup, Nipun Gupta, Hemant Agrawal, Maxime Coquelin,
 Honnappa Nagarahalli, David Marchand, Satananda Burla, Prasun Kapoor,
 "Ananyev, Konstantin", liangma@liangbit.com, Radha Mohan Chintakuntla
Subject: Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library

On Wed, Jul 7, 2021 at 2:05 PM Bruce Richardson wrote:
>
> On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote:
> > On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson wrote:
> > >
> > > On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote:
> > > >
> > > > On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson wrote:
> > > > >
> > > > > On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote:
> > > > > > On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng wrote:
> > > > > > >
> > > > > > > This patch introduces 'dmadevice', which is a generic type of DMA
> > > > > > > device.
> > > > > >
> > > > > +1, and the terminology with regards to queues and channels. With our
> > > > > ioat hardware, each HW queue was called a channel, for instance.
> > > >
> > > > Looks like <> can cover all the use cases; if the HW has more than
> > > > 1 queue, it can be exposed as a separate dmadev dev.
> > > >
> > >
> > > Fine for me.
> > >
> > > However, just to confirm that Morten's suggestion of using a
> > > (device-specific void *) channel pointer rather than a dev_id + channel_id
> > > pair of parameters won't work for you? You can't store a pointer or dev
> > > index in the channel struct in the driver?
> >
> > Yes. That will work. To confirm, the suggestion is to use a void *
> > object instead of channel_id.
> > That will avoid one more indirection. (index -> pointer)
> >
>
> The proposal was to use it in place of "dev_id + channel_id", i.e.
>
>   copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags)
>
> where the channel pointer implicitly indicates the device too. However, I
> realise now that this would be something completely transparent to the
> driver, as it would all have to be implemented at the dmadev level, and
> lead to lots of duplication of function pointers, etc. Therefore, let's
> just go with the original scheme. :-(

Yes. Just go with the original scheme.

> > > >
> > > > Got it. In order to save space in the first CL (cache line) for the
> > > > fastpath (saving 8B for the pointer) and to avoid function overhead,
> > > > can we use one bit of the flags of the op function to enable the fence?
> > > >
> > >
> > > The original ioat implementation did exactly that. However, I then
> > > discovered that, because a fence logically belongs between two operations,
> > > does the fence flag on an operation mean "don't do any jobs after this
> > > until this job has completed", or does it mean "don't start this job until
> > > all previous jobs have completed"? [Or theoretically does it mean both :-)]
> > > Naturally, some hardware does it the former way (i.e. the fence flag goes
> > > on the last op before the fence), while other hardware does it the latter
> > > way (i.e. the fence flag goes on the first op after the fence). Therefore,
> > > since fencing is about ordering *between* two (sets of) jobs, I decided
> > > that it should do exactly that and go between two jobs, so there is no
> > > ambiguity!
> > >
> > > However, I'm happy enough to switch to having a fence flag, but I think if
> > > we do that, it should be put in the "first job after fence" case, because
> > > it is always easier to modify a previously written job if we need to than
> > > to save the flag for a future one.
> > >
> > > Alternatively, if we keep the fence as a separate function, I'm happy
> > > enough for it not to be on the same cacheline as the "hot" operations,
> > > since fencing will always introduce a small penalty anyway.
> >
> > Ack.
> > You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE (if there is
> > any use case for this, or it makes sense for your HW).
> >
> > For us, fence is a NOP, as we have an implicit fence between each
> > HW job descriptor.
> >
>
> I actually still think that having a separate fence function in the "ops"
> section is the best approach here. It's unambiguous as to whether it's
> fence-before or fence-after, and if we have it in the ops, it doesn't use a
> "fast-path" slot.
>
> However, if we *really* want to use a flag instead, I don't see the value
> in having two flags; it will be really confusing. Instead, if we do go
> with a flag, I think "RTE_DMA_PRE_FENCE" should be the name, indicating
> that the fence occurs before the job in question.

IMO, we need to use flags, and the name can be RTE_DMA_PRE_FENCE, due to the
overhead of the driver implementation where the fence request can be a NOP,
and to save the first cache line occupancy.

> > > > > >
> > > > > > Since we have additional function call overhead in all the
> > > > > > applications for this scheme, I would like to understand
> > > > > > the use of doing it this way vs. the enq doing the doorbell
> > > > > > implicitly, from a driver/application PoV?
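To make the two flag proposals above concrete, here is a minimal, self-contained mock (illustrative names only; none of this is the proposed dmadev API) showing both a PRE_FENCE bit, translated by the driver onto the previously written descriptor for hardware that puts the fence on the last op before it, and an xmit_more-style SUBMIT bit that rings the doorbell once per burst:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative op flags, not the dmadev API:
 * PRE_FENCE: do not start this job until all previously enqueued jobs finish.
 * SUBMIT:    ring the doorbell after writing this job (xmit_more-style). */
#define DMA_OP_FLAG_PRE_FENCE (1u << 0)
#define DMA_OP_FLAG_SUBMIT    (1u << 1)

/* Mock HW descriptor: this imaginary HW wants the fence bit on the *last op
 * before* the fence, so a PRE_FENCE on job i is written to descriptor i-1,
 * which is always an already-written descriptor, per the argument above. */
struct hw_desc {
	uint32_t fence_after;
};

struct mock_chan {
	struct hw_desc ring[64];
	size_t head;                  /* next free descriptor slot          */
	unsigned int doorbell_writes; /* stand-in for MMIO doorbell writes  */
};

static void
mock_copy(struct mock_chan *ch, uint32_t flags)
{
	struct hw_desc *d = &ch->ring[ch->head];

	d->fence_after = 0;
	if ((flags & DMA_OP_FLAG_PRE_FENCE) && ch->head > 0)
		ch->ring[ch->head - 1].fence_after = 1;
	ch->head++;

	if (flags & DMA_OP_FLAG_SUBMIT) /* single branch, once per job */
		ch->doorbell_writes++;  /* doorbell amortized per burst */
}
```

With a separate perform()-style function, the SUBMIT branch disappears from the enqueue path and the doorbell write moves into that function instead; that is exactly the trade-off debated in this thread.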
> > > > >
> > > > > In our benchmarks it's just faster. When we tested it, the overhead of
> > > > > the function calls was noticeably less than the cost of building up the
> > > > > parameter array(s) for passing the jobs in as a burst. [We don't see
> > > > > this cost with things like NIC I/O, since DPDK tends to already have
> > > > > the mbuf fully populated before the TX call anyway.]
> > > >
> > > > OK. I agree with the stack population.
> > > >
> > > > My question was more on doing an implicit doorbell update on enq. Is the
> > > > doorbell write costly in other HW compared to a function call? In our
> > > > HW, it is just a write of the number of instructions written to a
> > > > register.
> > > >
> > > > Also, we need to again access the internal PMD memory structure to find
> > > > where to write etc. if it is a separate function.
> > > >
> > >
> > > The cost varies depending on a number of factors - even writing to a
> > > single HW register can be very slow if that register is mapped as device
> > > (uncacheable) memory, since (AFAIK) it will act as a full fence and wait
> >
> > I don't know. At least in our case, writes are write-back, so the core
> > does not need to wait (if there is no read operation).
> >
> > > for the write to go all the way to hardware. For more modern HW, the cost
> > > can be lighter. However, any cost of HW writes is going to be the same
> > > whether it's a separate function call or not.
> > >
> > > However, the main thing about the doorbell update is that it's a
> > > once-per-burst thing, rather than once-per-job. Therefore, even if you
> > > have to re-read the struct memory (which is likely still somewhere in
> > > your cores' cache), any extra small cost of doing so is to be amortized
> > > over the cost of a whole burst of copies.
> >
> > The Linux kernel has the xmit_more flag in the skb to address a similar
> > thing, i.e. the enq job flag can have one more bit field to say whether to
> > update the ring bell or not, rather than having yet another function
> > overhead. IMO, it is the best of both worlds.
> >
>
> It's just more conditionals and branches all through the code. Inside the
> user application, the user has to check whether to set the flag or not (or
> special-case the last transaction outside the loop), and within the driver,
> there has to be a branch on whether or not to call the doorbell function.
> The code on both sides is far simpler and more readable if the doorbell
> function is exactly that - a separate function.

I disagree. The reason is: we will have two classes of applications:

a) Those that do a DMA copy request as and when they have data (I think this
is the prime use case); for those, I think it is a considerable overhead to
have two function invocations per transfer, i.e. rte_dma_copy() and
rte_dma_perform().

b) Those that do a DMA copy when the data has reached a logical state, like
copying an IP frame from Ethernet packets or so. In that case, the
application will have the LOGIC to detect when to perform it, so at the end
of that, the rte_dma_copy() flag can be updated to fire the doorbell.

IMO, we are comparing a branch (the flag is already in a register) against a
set of instructions for 1) function pointer overhead and 2) the need to
access the channel context again in another function. IMO, a single branch
is most optimal from a performance PoV.

> > > > > > >
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * @warning
> > > > > > > + * @b EXPERIMENTAL: this API may change without prior notice.
> > > > > > > + *
> > > > > > > + * Returns the number of operations that failed to complete.
> > > > > > > + * NOTE: This API was used when rte_dmadev_completed has_error was set.
> > > > > > > + *
> > > > > > > + * @param dev_id
> > > > > > > + *   The identifier of the device.
> > > > > > > + * @param vq_id
> > > > > > > + *   The identifier of virt queue.
> > > > > > > + * @param nb_status
> > > > > > > + *   Indicates the size of status array.
> > > > > > > + * @param[out] status
> > > > > > > + *   The error code of operations that failed to complete.
> > > > > > > + * @param[out] cookie
> > > > > > > + *   The last failed completed operation's cookie.
> > > > > > > + *
> > > > > > > + * @return
> > > > > > > + *   The number of operations that failed to complete.
> > > > > > > + *
> > > > > > > + * NOTE: The caller must ensure that the input parameter is valid and the
> > > > > > > + *   corresponding device supports the operation.
> > > > > > > + */
> > > > > > > +__rte_experimental
> > > > > > > +static inline uint16_t
> > > > > > > +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id,
> > > > > > > +                           const uint16_t nb_status, uint32_t *status,
> > > > > > > +                           dma_cookie_t *cookie)
> > > > > >
> > > > > > IMO, it is better to move cookie/ring_idx to 3rd position. Why would
> > > > > > it return an array of errors, since it is called after
> > > > > > rte_dmadev_completed() has has_error set? Is it better to change it
> > > > > > to
> > > > > >
> > > > > > rte_dmadev_error_status(uint16_t dev_id, uint16_t vq_id,
> > > > > >                         dma_cookie_t *cookie, uint32_t *status)
> > > > > >
> > > > > > I also think we may need to set status as a bitmask, enumerate all
> > > > > > the combinations of error codes of all the drivers, and return a
> > > > > > string from the driver, like the existing rte_flow_error:
> > > > > >
> > > > > > struct rte_flow_error {
> > > > > >         enum rte_flow_error_type type; /**< Cause field and error types. */
> > > > > >         const void *cause; /**< Object responsible for the error. */
> > > > > >         const char *message; /**< Human-readable error message. */
> > > > > > };
> > > > > >
> > > > >
> > > > > I think we need a multi-return value API here, as we may add
> > > > > operations in future which have non-error status values to return.
> > > > > The obvious case is DMA engines which support "compare" operations.
> > > > > In that case a successful compare (as in there were no DMA or HW
> > > > > errors) can return "equal" or "not-equal" as statuses. For general
> > > > > "copy" operations, the faster completion op can be used to just
> > > > > return successful values (and only call this status version on
> > > > > error), while apps using those compare ops, or a mixture of copy and
> > > > > compare ops, would always use the slower one that returns status
> > > > > values for each and every op.
> > > > >
> > > > > The ioat APIs used 32-bit integer values for this status array so
> > > > > as to allow e.g. 16 bits for an error code and 16 bits for future
> > > > > status values. For most operations there should be a fairly small
> > > > > set of things that can go wrong, i.e. bad source address, bad
> > > > > destination address or invalid length. Within that we may have a
> > > > > couple of specifics for why an address is bad, but even so I don't
> > > > > think we need to start having multiple bit combinations.
> > > >
> > > > OK. What is the purpose of the error status? Is it for the application
> > > > to print it, or does the application need to take any action based on
> > > > specific error requests?
> > >
> > > It's largely for information purposes, but in the case of SVA/SVM,
> > > errors could occur due to the memory not being pinned, i.e. a page
> > > fault, in some cases. If that happens, then it's up to the app to either
> > > touch the memory and retry the copy, or to do a SW memcpy as a
> > > fallback.
> > >
> > > In other error cases, I think it's good to tell the application if it's
> > > passing around bad data, or data that is beyond the scope of hardware,
> > > e.g. a copy that is beyond what can be done in a single transaction
> > > for a HW instance. Given that there are always things that can go
> > > wrong, I think we need some error reporting mechanism.
> > >
> > > > If the former is the scope, then we need to define the standard enum
> > > > values for the errors, right? i.e. uint32_t *status needs to change to
> > > > enum rte_dma_error or so.
> > > >
> > > Sure.
> > > Perhaps an error/status structure is an option, too, where we
> > > explicitly call out error info from status info.
> >
> > Agree. Better to have a structure with fields like:
> >
> > 1) enum rte_dma_error_type
> > 2) memory to store an informative message on the finer aspects of the
> > error, like which address caused the issue etc. (which will be
> > driver-specific information).
> >
>
> The only issue I have with that is that once we have driver-specific
> information, it is of little use to the application, since it can't know
> anything about it except maybe log it. I'd much rather have a set of error
> codes telling the user that "source address is wrong", "dest address is
> wrong", and a generic "an address is wrong" in case the driver/HW cannot
> distinguish the source of the error. Can we see how far we get with just
> error codes before we start passing string messages around and all the
> memory management headaches that implies?

Works for me. It should be "enum rte_dma_error_type" then, which has a
standard error type - which is missing in the spec now.

>
> /Bruce
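To make the conclusion concrete, a first cut of the standard error enum, together with the ioat-style 16/16-bit status packing Bruce describes above, might look like the following sketch. All names here are hypothetical - nothing in this block is in the spec yet - and the cases simply mirror the ones raised in the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standard error codes covering the cases raised above. */
enum rte_dma_error_type {
	RTE_DMA_ERR_NONE = 0,
	RTE_DMA_ERR_SRC_ADDR,   /* source address is wrong                    */
	RTE_DMA_ERR_DST_ADDR,   /* destination address is wrong               */
	RTE_DMA_ERR_ADDR,       /* an address is wrong, HW can't tell which   */
	RTE_DMA_ERR_LENGTH,     /* copy beyond a single-transaction limit     */
	RTE_DMA_ERR_PAGE_FAULT, /* SVA/SVM: page not pinned; the app may touch
	                         * the memory and retry, or fall back to a SW
	                         * memcpy()                                   */
};

/* 32-bit status word: low 16 bits carry the error code, high 16 bits carry
 * a non-error status such as a compare op's "equal"/"not-equal" result.   */
static inline uint32_t
dma_status_pack(enum rte_dma_error_type err, uint16_t op_status)
{
	return (uint32_t)op_status << 16 | (uint16_t)err;
}

static inline enum rte_dma_error_type
dma_status_err(uint32_t s)
{
	return (enum rte_dma_error_type)(s & 0xffff);
}

static inline uint16_t
dma_status_op(uint32_t s)
{
	return (uint16_t)(s >> 16);
}
```

With this split, the fast completion path can report plain success, while the slower status-returning path fills one such 32-bit word per op, leaving room for future non-error statuses without driver-specific strings.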