From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124]) by inbox.dpdk.org (Postfix) with ESMTP id 079EAA0C49; Thu, 8 Jul 2021 05:11:58 +0200 (CEST) Received: from [217.70.189.124] (localhost [127.0.0.1]) by mails.dpdk.org (Postfix) with ESMTP id 79D554069C; Thu, 8 Jul 2021 05:11:58 +0200 (CEST) Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [45.249.212.188]) by mails.dpdk.org (Postfix) with ESMTP id 0779940687 for ; Thu, 8 Jul 2021 05:11:55 +0200 (CEST) Received: from dggemv711-chm.china.huawei.com (unknown [172.30.72.55]) by szxga02-in.huawei.com (SkyGuard) with ESMTP id 4GL1TN07k7zZnZc; Thu, 8 Jul 2021 11:08:40 +0800 (CST) Received: from dggpeml500024.china.huawei.com (7.185.36.10) by dggemv711-chm.china.huawei.com (10.1.198.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Thu, 8 Jul 2021 11:11:52 +0800 Received: from [10.40.190.165] (10.40.190.165) by dggpeml500024.china.huawei.com (7.185.36.10) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2176.2; Thu, 8 Jul 2021 11:11:52 +0800 To: Bruce Richardson , Jerin Jacob CC: Thomas Monjalon , Ferruh Yigit , Jerin Jacob , dpdk-dev , =?UTF-8?Q?Morten_Br=c3=b8rup?= , Nipun Gupta , Hemant Agrawal , Maxime Coquelin , Honnappa Nagarahalli , David Marchand , Satananda Burla , Prasun Kapoor , "Ananyev, Konstantin" , , Radha Mohan Chintakuntla References: <1625231891-2963-1-git-send-email-fengchengwen@huawei.com> From: fengchengwen Message-ID: <9b063d9f-5b52-8e1b-e12a-f24f4ea3b122@huawei.com> Date: Thu, 8 Jul 2021 11:11:51 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:68.0) Gecko/20100101 Thunderbird/68.11.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset="utf-8" Content-Language: en-US Content-Transfer-Encoding: 7bit X-Originating-IP: [10.40.190.165] X-ClientProxiedBy: dggems702-chm.china.huawei.com (10.3.19.179) To dggpeml500024.china.huawei.com (7.185.36.10) X-CFilter-Loop: Reflected Subject: Re: [dpdk-dev] [PATCH] dmadev: introduce DMA device library X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: DPDK patches and discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: dev-bounces@dpdk.org Sender: "dev" On 2021/7/7 19:01, Bruce Richardson wrote: > On Wed, Jul 07, 2021 at 04:04:16PM +0530, Jerin Jacob wrote: >> On Wed, Jul 7, 2021 at 2:05 PM Bruce Richardson >> wrote: >>> >>> On Wed, Jul 07, 2021 at 01:38:58PM +0530, Jerin Jacob wrote: >>>> On Mon, Jul 5, 2021 at 10:46 PM Bruce Richardson >>>> wrote: >>>>> >>>>> On Mon, Jul 05, 2021 at 09:25:34PM +0530, Jerin Jacob wrote: >>>>>> >>>>>> On Mon, Jul 5, 2021 at 4:22 PM Bruce Richardson >>>>>> wrote: >>>>>>> >>>>>>> On Sun, Jul 04, 2021 at 03:00:30PM +0530, Jerin Jacob wrote: >>>>>>>> On Fri, Jul 2, 2021 at 6:51 PM Chengwen Feng wrote: >>>>>>>>> >>>>>>>>> This patch introduces 'dmadevice' which is a generic type of DMA >>>>>>>>> device. >>>>> >>>>>>> >>>>>>> +1 and the terminology with regards to queues and channels. With our ioat >>>>>>> hardware, each HW queue was called a channel for instance. >>>>>> >>>>>> Looks like <> can cover all the use cases, if the >>>>>> HW has more than >>>>>> 1 queues it can be exposed as separate dmadev dev. >>>>>> >>>>> >>>>> Fine for me. 
>>>>> >>>>> However, just to confirm that Morten's suggestion of using a >>>>> (device-specific void *) channel pointer rather than dev_id + channel_id >>>>> pair of parameters won't work for you? You can't store a pointer or dev >>>>> index in the channel struct in the driver? >>>> >>>> Yes. That will work. To confirm, the suggestion is to use, void * >>>> object instead of channel_id, >>>> That will avoid one more indirection.(index -> pointer) >>>> >>> >>> The proposal was to use it in place of "dev_id + channel_id", i.e. >>> >>> copy(dev_id, ch_id, src, dst, len, flags) --> copy(ch_ptr, src, dst, len, flags) >>> >>> Where the channel pointer implicitly indicates the device too. However, I >>> realise now that this would be something completely transparent to the >>> driver as it would all have to be implemented in the dmadev level, and >>> lead to lots of duplication of function pointers, etc. Therefore, let's >>> just go with original scheme. :-( >> >> Yes. Just go with the original scheme. >> > +1 > >>> >>>> >>>>> >>>>>> >>> >>> >>> >>>>>> Got it. In order to save space if first CL size for fastpath(Saving 8B >>>>>> for the pointer) and to avoid >>>>>> function overhead, Can we use one bit of flags of op function to >>>>>> enable the fence? >>>>>> >>>>> >>>>> The original ioat implementation did exactly that. However, I then >>>>> discovered that because a fence logically belongs between two operations, >>>>> does the fence flag on an operation mean "don't do any jobs after this >>>>> until this job has completed" or does it mean "don't start this job until >>>>> all previous jobs have completed". [Or theoretically does it mean both :-)] >>>>> Naturally, some hardware does it the former way (i.e. fence flag goes on >>>>> last op before fence), while other hardware the latter way (i.e. fence flag >>>>> goes on first op after the fence). Therefore, since fencing is about >>>>> ordering *between* two (sets of) jobs, I decided that it should do exactly >>>>> that and go between two jobs, so there is no ambiguity! >>>>> >>>>> However, I'm happy enough to switch to having a fence flag, but I think if >>>>> we do that, it should be put in the "first job after fence" case, because >>>>> it is always easier to modify a previously written job if we need to, than >>>>> to save the flag for a future one. >>>>> >>>>> Alternatively, if we keep the fence as a separate function, I'm happy >>>>> enough for it not to be on the same cacheline as the "hot" operations, >>>>> since fencing will always introduce a small penalty anyway. >>>> >>>> Ack. >>>> You may consider two flags, FENCE_THEN_JOB and JOB_THEN_FENCE( If >>>> there any use case for this or it makes sense for your HW) >>>> >>>> >>>> For us, Fence is NOP for us as we have an implicit fence between each >>>> HW job descriptor. >>>> >>> >>> I actually still think that having a separate fence function in the "ops" >>> section is the best approach here. It's unabiguous as to whether it's >>> fence-before or fence-after, and if we have it in the ops, it doesn't use a >>> "fast-path" slot. >>> >>> However, if we *really* want to use a flag instead, I don't see the value >>> in having two flags, it will be really confusing. Instead, if we do go >>> with a flag, I think "RTE_DMA_PRE_FENCE" should be the name, indicating >>> that the fence occurs before the job in question. 
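Just to make sure we all mean the same thing by "fence before this job", a small sketch from the application point of view. The flag name and the copy/perform prototypes below are only stand-ins for the API still under discussion, not the final spec:

/* Sketch only: RTE_DMA_PRE_FENCE here means "all previously enqueued jobs
 * on this virt-queue must complete before this job starts". The prototypes
 * are placeholders for the dmadev API being discussed, not real functions.
 */
#include <stdint.h>

#define RTE_DMA_PRE_FENCE (1u << 0)

int rte_dma_copy(uint16_t dev_id, uint16_t vchan, void *src, void *dst,
                 uint32_t len, uint64_t flags);    /* placeholder prototype */
int rte_dma_perform(uint16_t dev_id, uint16_t vchan); /* placeholder prototype */

static void enqueue_two_ordered_copies(uint16_t dev_id, uint16_t vchan,
                                       void *src[2], void *dst[2], uint32_t len)
{
        /* first copy: no ordering constraint */
        rte_dma_copy(dev_id, vchan, src[0], dst[0], len, 0);
        /* second copy: fenced, must not start before the first completes */
        rte_dma_copy(dev_id, vchan, src[1], dst[1], len, RTE_DMA_PRE_FENCE);
        /* ring the doorbell once for the pair */
        rte_dma_perform(dev_id, vchan);
}
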
>> >> IMO, We need to use flags and the name can be RTE_DMA_PRE_FENCE >> due to overhead of the driver implementation where the fence request >> can be NOP and >> to save the first cache line occupancy. >> >>> >>>> >>>>> >>>>>>> >>>>>>>> >>>>> >>>>>>>> Since we have additional function call overhead in all the >>>>>>>> applications for this scheme, I would like to understand >>>>>>>> the use of doing this way vs enq does the doorbell implicitly from >>>>>>>> driver/application PoV? >>>>>>>> >>>>>>> >>>>>>> In our benchmarks it's just faster. When we tested it, the overhead of the >>>>>>> function calls was noticably less than the cost of building up the >>>>>>> parameter array(s) for passing the jobs in as a burst. [We don't see this >>>>>>> cost with things like NIC I/O since DPDK tends to already have the mbuf >>>>>>> fully populated before the TX call anyway.] >>>>>> >>>>>> OK. I agree with stack population. >>>>>> >>>>>> My question was more on doing implicit doorbell update enq. Is doorbell write >>>>>> costly in other HW compare to a function call? In our HW, it is just write of >>>>>> the number of instructions written in a register. >>>>>> >>>>>> Also, we need to again access the internal PMD memory structure to find >>>>>> where to write etc if it is a separate function. >>>>>> >>>>> >>>>> The cost varies depending on a number of factors - even writing to a single >>>>> HW register can be very slow if that register is mapped as device >>>>> (uncacheable) memory, since (AFAIK) it will act as a full fence and wait >>>> >>>> I don't know, At least in our case, writes are write-back. so core does not need >>>> to wait.(If there is no read operation). >>>> >>>>> for the write to go all the way to hardware. For more modern HW, the cost >>>>> can be lighter. However, any cost of HW writes is going to be the same >>>>> whether its a separate function call or not. >>>>> >>>>> However, the main thing about the doorbell update is that it's a >>>>> once-per-burst thing, rather than a once-per-job. Therefore, even if you >>>>> have to re-read the struct memory (which is likely still somewhere in your >>>>> cores' cache), any extra small cost of doing so is to be amortized over the >>>>> cost of a whole burst of copies. >>>> >>>> Linux kernel has xmit_more flag in skb to address similar thing. >>>> i.e enq job flag can have one more bit field to say update ring bell or not? >>>> Rather having yet another function overhead.IMO, it is the best of both worlds. >>>> >>> >>> It's just more conditionals and branches all through the code. Inside the >>> user application, the user has to check whether to set the flag or not (or >>> special-case the last transaction outside the loop), and within the driver, >>> there has to be a branch whether or not to call the doorbell function. The >>> code on both sides is far simpler and more readable if the doorbell >>> function is exactly that - a separate function. >> >> I disagree. The reason is: >> >> We will have two classes of applications >> >> a) do dma copy request as and when it has data(I think, this is the >> prime use case), for those, >> I think, it is considerable overhead to have two function invocation >> per transfer i.e >> rte_dma_copy() and rte_dma_perform() >> >> b) do dma copy when the data is reached to a logical state, like copy >> IP frame from Ethernet packets or so, >> In that case, the application will have a LOGIC to detect when to >> perform it so on the end of >> that rte_dma_copy() flag can be updated to fire the doorbell. 
>> >> IMO, We are comparing against a branch(flag is already in register) vs >> a set of instructions for >> 1) function pointer overhead >> 2) Need to use the channel context again back in another function. >> >> IMO, a single branch is most optimal from performance PoV. >> > Ok, let's try it and see how it goes.

Test results show:
1) The Kunpeng platform (ARMv8) benefits very little from the doorbell-in-flags scheme.
2) The Xeon E5-2690 v2 (x86) benefits from the separate doorbell function.
3) Both platforms benefit from the doorbell-in-flags scheme only when burst < 5.

So the performance gain only appears for small bursts (< 5). Given the extensive use
of bursts in DPDK applications, and since users are already accustomed to that concept,
I do not recommend putting the doorbell in the flags; it may also confuse users about
the doorbell operation.

Kunpeng platform test result:
[root@SZ tmp]# ./a1 1
burst = 1
perform_after_multiple_enqueue: burst:1 cost:0s.554422us
doorbell_for_every_enqueue: burst:1 cost:0s.450927us
last_enqueue_issue_doorbell: burst:1 cost:0s.450479us
[root@SZ tmp]#
[root@SZ tmp]# ./a1 2
burst = 2
perform_after_multiple_enqueue: burst:2 cost:0s.900884us
doorbell_for_every_enqueue: burst:2 cost:0s.866732us
last_enqueue_issue_doorbell: burst:2 cost:0s.732469us
[root@SZ tmp]# ./a1 5
burst = 5
perform_after_multiple_enqueue: burst:5 cost:1s.732410us
doorbell_for_every_enqueue: burst:5 cost:2s.115479us
last_enqueue_issue_doorbell: burst:5 cost:1s.759349us
[root@SZ tmp]# ./a1 10
burst = 10
perform_after_multiple_enqueue: burst:10 cost:3s.490716us
doorbell_for_every_enqueue: burst:10 cost:4s.194691us
last_enqueue_issue_doorbell: burst:10 cost:3s.331825us
[root@SZ tmp]# ./a1 30
burst = 30
perform_after_multiple_enqueue: burst:30 cost:9s.61761us
doorbell_for_every_enqueue: burst:30 cost:12s.517082us
last_enqueue_issue_doorbell: burst:30 cost:9s.614802us

X86 platform test result:
fengchengwen@SZ:~/tmp$ ./a1 1
burst = 1
perform_after_multiple_enqueue: burst:1 cost:0s.406331us
doorbell_for_every_enqueue: burst:1 cost:0s.331109us
last_enqueue_issue_doorbell: burst:1 cost:0s.381782us
fengchengwen@SZ:~/tmp$ ./a1 2
burst = 2
perform_after_multiple_enqueue: burst:2 cost:0s.569024us
doorbell_for_every_enqueue: burst:2 cost:0s.643449us
last_enqueue_issue_doorbell: burst:2 cost:0s.486639us
fengchengwen@SZ:~/tmp$ ./a1 5
burst = 5
perform_after_multiple_enqueue: burst:5 cost:1s.166384us
doorbell_for_every_enqueue: burst:5 cost:1s.602369us
last_enqueue_issue_doorbell: burst:5 cost:1s.209392us
fengchengwen@SZ:~/tmp$ ./a1 10
burst = 10
perform_after_multiple_enqueue: burst:10 cost:2s.229901us
doorbell_for_every_enqueue: burst:10 cost:3s.754802us
last_enqueue_issue_doorbell: burst:10 cost:2s.328705us
fengchengwen@SZ:~/tmp$
fengchengwen@SZ:~/tmp$ ./a1 30
burst = 30
perform_after_multiple_enqueue: burst:30 cost:6s.132817us
doorbell_for_every_enqueue: burst:30 cost:9s.944619us
last_enqueue_issue_doorbell: burst:30 cost:7s.73551us

test-code:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <sys/time.h>

struct dmadev;

/* fake device registers, so the ring/doorbell writes hit real memory */
unsigned int dev_reg[10240];
volatile unsigned int *ring;
volatile unsigned int *doorbell;

void init_global(void)
{
	ring = &dev_reg[100];
	doorbell = &dev_reg[10000];
}

#define rte_wmb() asm volatile("dmb oshst" : : : "memory")  /* ARMv8 store barrier */
//#define rte_wmb() asm volatile ("" : : : "memory")        /* compiler barrier (x86 run) */

typedef int (*enqueue_t)(struct dmadev *dev, int vchan, void *src, void *dst,
			 int len, int flags);
typedef void (*perform_t)(struct dmadev *dev, int vchan);

struct dmadev {
	enqueue_t enqueue;
	perform_t perform;
	char rsv[512];
};

int hisi_dma_enqueue(struct dmadev *dev, int vchan, void *src, void *dst,
		     int len, int flags)
{
	*ring = 1;
	return 0;
}

int hisi_dma_enqueue_doorbell(struct dmadev *dev, int vchan, void *src,
			      void *dst, int len, int flags)
{
	*ring = 1;
	if (flags == 1) {
		rte_wmb();
		*doorbell = 1;
	}
	return 0;
}

void hisi_dma_perform(struct dmadev *dev, int vchan)
{
	rte_wmb();
	*doorbell = 1;
}

struct dmadev devlist[64];

void init_devlist(bool enq_doorbell)
{
	int i;
	for (i = 0; i < 64; i++) {
		devlist[i].enqueue = enq_doorbell ? hisi_dma_enqueue_doorbell :
						    hisi_dma_enqueue;
		devlist[i].perform = hisi_dma_perform;
	}
}

static inline int dma_enqueue(int dev_id, int vchan, void *src, void *dst,
			      int len, int flags)
{
	struct dmadev *dev = &devlist[dev_id];
	return dev->enqueue(dev, vchan, src, dst, len, flags);
}

static inline void dma_perform(int dev_id, int vchan)
{
	struct dmadev *dev = &devlist[dev_id];
	return dev->perform(dev, vchan);
}

#define MAX_LOOPS 90000000

/* one doorbell (perform) after the whole burst of enqueues */
void test_for_perform_after_multiple_enqueue(int burst)
{
	struct timeval start, end, delta;
	unsigned int i, j;
	init_devlist(false);
	gettimeofday(&start, NULL);
	for (i = 0; i < MAX_LOOPS; i++) {
		for (j = 0; j < burst; j++)
			(void)dma_enqueue(10, 0, NULL, NULL, 0, 0);
		dma_perform(10, 0);
	}
	gettimeofday(&end, NULL);
	timersub(&end, &start, &delta);
	printf("perform_after_multiple_enqueue: burst:%d cost:%us.%uus \n",
	       burst, (unsigned int)delta.tv_sec, (unsigned int)delta.tv_usec);
}

/* doorbell flag set on every enqueue */
void test_for_doorbell_for_every_enqueue(int burst)
{
	struct timeval start, end, delta;
	unsigned int i, j;
	init_devlist(true);
	gettimeofday(&start, NULL);
	for (i = 0; i < MAX_LOOPS; i++) {
		for (j = 0; j < burst; j++)
			(void)dma_enqueue(10, 0, NULL, NULL, 0, 1);
	}
	gettimeofday(&end, NULL);
	timersub(&end, &start, &delta);
	printf("doorbell_for_every_enqueue: burst:%d cost:%us.%uus \n",
	       burst, (unsigned int)delta.tv_sec, (unsigned int)delta.tv_usec);
}

/* doorbell flag set only on the last enqueue of the burst */
void test_for_last_enqueue_issue_doorbell(int burst)
{
	struct timeval start, end, delta;
	unsigned int i, j;
	init_devlist(true);
	gettimeofday(&start, NULL);
	for (i = 0; i < MAX_LOOPS; i++) {
		for (j = 0; j < burst - 1; j++)
			(void)dma_enqueue(10, 0, NULL, NULL, 0, 0);
		dma_enqueue(10, 0, NULL, NULL, 0, 1);
	}
	gettimeofday(&end, NULL);
	timersub(&end, &start, &delta);
	printf("last_enqueue_issue_doorbell: burst:%d cost:%us.%uus \n",
	       burst, (unsigned int)delta.tv_sec, (unsigned int)delta.tv_usec);
}

int main(int argc, char *argv[])
{
	if (argc < 2) {
		printf("please input burst parameter!\n");
		return -1;
	}
	init_global();
	int burst = atol(argv[1]);
	printf("burst = %d \n", burst);
	test_for_perform_after_multiple_enqueue(burst);
	test_for_doorbell_for_every_enqueue(burst);
	test_for_last_enqueue_issue_doorbell(burst);
	return 0;
}

> >> >>> >>>> >>>>> >>>>>> >>>>>>> >>>>>>>> >>>>> >>>>>>>>> + +/** + * @warning + * @b EXPERIMENTAL: this API may change >>>>>>>>> without prior notice. + * + * Returns the number of operations >>>>>>>>> that failed to complete. + * NOTE: This API was used when >>>>>>>>> rte_dmadev_completed has_error was set. + * + * @param dev_id >>>>>>>>> + * The identifier of the device. + * @param vq_id + * The >>>>>>>>> identifier of virt queue. >>>>>>>> (> + * @param nb_status >>>>>>>>> + * Indicates the size of status array. + * @param[out] >>>>>>>>> status + * The error code of operations that failed to >>>>>>>>> complete. + * @param[out] cookie + * The last failed >>>>>>>>> completed operation's cookie. + * + * @return + * The number >>>>>>>>> of operations that failed to complete. + * + * NOTE: The >>>>>>>>> caller must ensure that the input parameter is valid and the + >>>>>>>>> * corresponding device supports the operation.
+ */ >>>>>>>>> +__rte_experimental +static inline uint16_t >>>>>>>>> +rte_dmadev_completed_fails(uint16_t dev_id, uint16_t vq_id, + >>>>>>>>> const uint16_t nb_status, uint32_t *status, + >>>>>>>>> dma_cookie_t *cookie) >>>>>>>> >>>>>>>> IMO, it is better to move cookie/rind_idx at 3. Why it would >>>>>>>> return any array of errors? since it called after >>>>>>>> rte_dmadev_completed() has has_error. Is it better to change >>>>>>>> >>>>>>>> rte_dmadev_error_status((uint16_t dev_id, uint16_t vq_id, >>>>>>>> dma_cookie_t *cookie, uint32_t *status) >>>>>>>> >>>>>>>> I also think, we may need to set status as bitmask and enumerate >>>>>>>> all the combination of error codes of all the driver and return >>>>>>>> string from driver existing rte_flow_error >>>>>>>> >>>>>>>> See struct rte_flow_error { enum rte_flow_error_type type; /**< >>>>>>>> Cause field and error types. */ const void *cause; /**< Object >>>>>>>> responsible for the error. */ const char *message; /**< >>>>>>>> Human-readable error message. */ }; >>>>>>>> >>>>>>> >>>>>>> I think we need a multi-return value API here, as we may add >>>>>>> operations in future which have non-error status values to return. >>>>>>> The obvious case is DMA engines which support "compare" operations. >>>>>>> In that case a successful compare (as in there were no DMA or HW >>>>>>> errors) can return "equal" or "not-equal" as statuses. For general >>>>>>> "copy" operations, the faster completion op can be used to just >>>>>>> return successful values (and only call this status version on >>>>>>> error), while apps using those compare ops or a mixture of copy and >>>>>>> compare ops, would always use the slower one that returns status >>>>>>> values for each and every op.. >>>>>>> >>>>>>> The ioat APIs used 32-bit integer values for this status array so >>>>>>> as to allow e.g. 16-bits for error code and 16-bits for future >>>>>>> status values. For most operations there should be a fairly small >>>>>>> set of things that can go wrong, i.e. bad source address, bad >>>>>>> destination address or invalid length. Within that we may have a >>>>>>> couple of specifics for why an address is bad, but even so I don't >>>>>>> think we need to start having multiple bit combinations. >>>>>> >>>>>> OK. What is the purpose of errors status? Is it for application >>>>>> printing it or Does the application need to take any action based on >>>>>> specific error requests? >>>>> >>>>> It's largely for information purposes, but in the case of SVA/SVM >>>>> errors could occur due to the memory not being pinned, i.e. a page >>>>> fault, in some cases. If that happens, then it's up the app to either >>>>> touch the memory and retry the copy, or to do a SW memcpy as a >>>>> fallback. >>>>> >>>>> In other error cases, I think it's good to tell the application if it's >>>>> passing around bad data, or data that is beyond the scope of hardware, >>>>> e.g. a copy that is beyond what can be done in a single transaction >>>>> for a HW instance. Given that there are always things that can go >>>>> wrong, I think we need some error reporting mechanism. >>>>> >>>>>> If the former is scope, then we need to define the standard enum >>>>>> value for the error right? ie. uint32_t *status needs to change to >>>>>> enum rte_dma_error or so. >>>>>> >>>>> Sure. Perhaps an error/status structure either is an option, where we >>>>> explicitly call out error info from status info. >>>> >>>> Agree. 
Better to have a structure with fields like, >>>> >>>> 1) enum rte_dma_error_type 2) memory to store, informative message on >>>> fine aspects of error. Like address caused issue etc. (Which will be >>>> driver-specific information). >>>> >>> The only issue I have with that is that once we have driver specific >>> information it is of little use to the application, since it can't know >>> anything about it except maybe log it. I'd much rather have a set of error >>> codes telling user that "source address is wrong", "dest address is wrong", >>> and a generic "an address is wrong" in case driver/HW cannot distinguish >>> source of error. Can we see how far we get with just error codes before we >>> start into passing string messages around and all the memory management >>> headaches that implies. >> >> Works for me. It should be "enum rte_dma_error_type" then, which has a standard >> error type. Which is missing in the spec now. >> > +1 > . >
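
For the next version of the spec, a first rough sketch of what that standard error enum
could look like. The names below are only placeholders drawn from the cases mentioned in
this thread, not part of the current patch:

/* Placeholder sketch: standard error codes covering the cases discussed
 * above (bad source/dest address, generic bad address when HW cannot tell
 * which, bad length, SVA/SVM page fault, and a catch-all for anything
 * driver-specific). Exact names and set to be refined.
 */
enum rte_dma_error_type {
	RTE_DMA_ERR_NONE = 0,      /* operation completed successfully */
	RTE_DMA_ERR_SRC_ADDRESS,   /* source address is wrong */
	RTE_DMA_ERR_DST_ADDRESS,   /* destination address is wrong */
	RTE_DMA_ERR_ADDRESS,       /* an address is wrong, source unknown */
	RTE_DMA_ERR_LENGTH,        /* length invalid or beyond HW capability */
	RTE_DMA_ERR_PAGE_FAULT,    /* SVA/SVM page fault, memory not pinned */
	RTE_DMA_ERR_UNKNOWN,       /* unclassified / driver-specific error */
};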