Date: Fri, 18 Jun 2021 10:30:55 +0100
From: Bruce Richardson
To: fengchengwen
Cc: thomas@monjalon.net, ferruh.yigit@intel.com, dev@dpdk.org, nipun.gupta@nxp.com, hemant.agrawal@nxp.com, maxime.coquelin@redhat.com, honnappa.nagarahalli@arm.com, jerinj@marvell.com, david.marchand@redhat.com, jerinjacobk@gmail.com
Subject: Re: [dpdk-dev] [RFC PATCH] dmadev: introduce DMA device library
References: <1623763327-30987-1-git-send-email-fengchengwen@huawei.com> <25d29598-c26d-8497-2867-9b650c79df49@huawei.com>
In-Reply-To: <25d29598-c26d-8497-2867-9b650c79df49@huawei.com>

On Fri, Jun 18, 2021 at 04:52:00PM +0800, fengchengwen wrote:
> On 2021/6/17 22:18, Bruce Richardson wrote:
> > On Thu, Jun 17, 2021 at 12:02:00PM +0100, Bruce Richardson wrote:
> >> On Thu, Jun 17, 2021 at 05:48:05PM +0800, fengchengwen wrote:
> >>> On 2021/6/17 1:31, Bruce Richardson wrote:
> >>>> On Wed, Jun 16, 2021 at 05:41:45PM +0800, fengchengwen wrote:
> >>>>> On 2021/6/16 0:38, Bruce Richardson wrote:
> >>>>>> On Tue, Jun 15, 2021 at 09:22:07PM +0800, Chengwen Feng wrote:
> >>>>>>> This patch introduces 'dmadevice', which is a generic type of DMA device.
> >>>>>>>
> >>>>>>> The APIs of the dmadev library expose some generic operations which can enable configuration of, and I/O with, DMA devices.
> >>>>>>>
> >>>>>>> Signed-off-by: Chengwen Feng
> >>>>>>> ---
> >>>>>>
> >>>>>> Thanks for sending this.
> >>>>>>
> >>>>>> Of most interest to me right now are the key data-plane APIs. While we are still in the prototyping phase, below is a draft of what we are thinking for the key enqueue/perform_ops/completed_ops APIs.
> >>>>>>
> >>>>>> Some key differences I note below vs your original RFC:
> >>>>>>
> >>>>>> * Use of void pointers rather than iova addresses. While using iovas makes sense in the general case when using hardware, in that it can work with both physical addresses and virtual addresses, if we change the APIs to use void pointers instead it will still work for DPDK in VA mode, while at the same time allowing use of software fallbacks in error cases, and also a stub driver that uses memcpy in the background. Finally, using iovas makes the APIs a lot more awkward to use with anything but mbufs or similar buffers where we already have a pre-computed physical address.
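
To make the contrast concrete, hypothetical signatures for the two styles might look like this (illustration only, not part of any agreed API):

    /* iova-based enqueue: the caller must already hold an IO address */
    int rte_dmadev_enqueue_copy(uint16_t dev_id, rte_iova_t src,
                    rte_iova_t dst, uint32_t length);

    /* pointer-based alternative: in VA mode any virtual address can be
     * passed straight through, and a memcpy-based fallback or stub
     * driver can use the same pointers directly */
    int rte_dmadev_enqueue_copy(uint16_t dev_id, const void *src,
                    void *dst, uint32_t length);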
> >>>>>
> >>>>> The iova is a hint to the application and is widely used in DPDK. If we switch to void pointers, how do we pass the address (iova or just va)? This may introduce implementation dependencies here.
> >>>>>
> >>>>> Or we always pass the va and the driver performs address translation, but this translation may cost too much CPU, I think.
> >>>>>
> >>>>
> >>>> On the latter point, about the driver doing address translation, I would agree. However, we probably need more discussion about the use of iova vs just virtual addresses. My thinking on this is that if we specify the API using iovas it will severely hurt usability of the API, since it forces the user to take more inefficient codepaths in a large number of cases. Given a pointer to the middle of an mbuf, one cannot just pass that straight as an iova but must instead do a translation into an offset from the mbuf pointer and then re-add the offset to the mbuf base address.
> >>>>
> >>>> My preference therefore is to require the use of an IOMMU when using a dmadev, so that it can be a much closer analog of memcpy. Once an IOMMU is present, DPDK will run in VA mode, allowing virtual addresses to our hugepage memory to be sent directly to hardware. Also, when using dmadevs on top of an in-kernel driver, that kernel driver may do all IOMMU management for the app, further removing the restrictions on what memory can be addressed by hardware.
> >>>
> >>> Some DMA devices may not support IOMMU or IOMMU bypass by default, so the driver may need to call rte_mem_virt2phy() to do the address translation, but rte_mem_virt2phy() costs too many CPU cycles.
> >>>
> >>> If the API is defined to take an iova, it will work fine in both cases: 1) if the DMA device doesn't support IOMMU or IOMMU bypass, then start the application with --iova-mode=pa; 2) if the DMA device supports IOMMU, then --iova-mode=pa and =va both work fine.
> >>>
> >>
> >> I suppose if we keep the iova as the datatype, we can just cast "void *" pointers to that in the case that virtual addresses can be used directly. I believe your RFC included a capability query API - "uses void * as iova" should probably be one of those capabilities, and that would resolve this. If DPDK is in iova=va mode because of the presence of an iommu, all drivers could report this capability too.
> >>
> >>>>
> >>>>>> * Use of id values rather than user-provided handles. Allowing the user/app to manage the amount of data stored per operation is a better solution, I feel, than prescribing a certain amount of in-driver tracking. Some apps may not care about anything other than a job being completed, while other apps may have significant metadata to be tracked. Taking the user-context handles out of the API also makes the driver code simpler.
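
A rough sketch of the app-side pattern this enables - all names here are hypothetical, the point being that the app keeps its own ring of job contexts indexed by the returned id, so the driver stores nothing per job:

    #define CTX_RING_SZ 1024   /* power of two, sized >= device queue depth */
    static void *ctx_ring[CTX_RING_SZ];
    static int last_done;      /* id of the last job already processed */

    /* at enqueue time: store the context under the id the enqueue returned */
    ctx_ring[id & (CTX_RING_SZ - 1)] = job_ctx;

    /* at completion time: the device reports "all ids up to N are done" */
    while (last_done < N) {
            last_done++;
            process_job(ctx_ring[last_done & (CTX_RING_SZ - 1)]);  /* hypothetical handler */
    }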
> >>>>>
> >>>>> The user-provided handle was mainly used to simplify the application implementation; it provides the ability to quickly locate contexts.
> >>>>>
> >>>>> The "use of id values" seems like the dma_cookie of the Linux DMA engine framework: the user gets a unique dma_cookie after calling dmaengine_submit(), and can then use it in a call to dma_async_is_tx_complete() to get the completion status.
> >>>>>
> >>>>
> >>>> Yes, the idea of the id is the same - to locate contexts. The main difference is that having the driver manage the contexts, or pointers to contexts, as well as giving more work to the driver, complicates the APIs for measuring completions. If we use an ID-based approach, where the app maintains its own ring of contexts (if any), it avoids the need to have an "out" parameter array for returning those contexts, which would need to be appropriately sized. Instead we can just report that all ids up to N are completed. [This would be similar to your suggestion that N jobs be reported as done, in that no contexts are provided; it's just that knowing the ID of what is completed is generally more useful than the number (which can obviously be got by subtracting the old value).]
> >>>>
> >>>> We are still working on prototyping all this, but would hope to have a functional example of it all soon.
> >>>>
> >>>>> How about defining the copy prototype as follows:
> >>>>>
> >>>>>     dma_cookie_t rte_dmadev_copy(uint16_t dev_id, xxx)
> >>>>>
> >>>>> where dma_cookie_t is an int32 and is monotonically increasing: >= 0 means the enqueue was successful, otherwise it failed. On completion the dmadev will return the latest completed dma_cookie, and the application can use that dma_cookie to quickly locate its contexts.
> >>>>>
> >>>>
> >>>> If I understand this correctly, I believe this is largely what I was suggesting - just with the typedef for the type? In which case it obviously looks good to me.
> >>>>
> >>>>>> * I've kept a single combined API for completions, which differs from the separate error handling completion API you propose. I need to give the two-function approach a bit of thought, but likely both could work. If we (likely) never expect failed ops, then the specifics of error handling should not matter that much.
> >>>>>
> >>>>> The rte_ioat_completed_ops API is too complex, and considering that some applications may never see a copy fail, we split it into two APIs. This split is indeed not friendly to other scenarios that always require error handling.
> >>>>>
> >>>>> I prefer to use the number of completed operations as the return value rather than the ID, so that the application can easily judge whether there are new completed operations. The new prototype:
> >>>>>
> >>>>>     uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint32_t *status, uint16_t max_status, uint16_t *num_fails);
> >>>>>
> >>>>> 1) For the normal case, which never expects failed ops, just call:
> >>>>>        ret = rte_dmadev_completed(dev_id, &cookie, NULL, 0, NULL);
> >>>>> 2) For other cases:
> >>>>>        ret = rte_dmadev_completed(dev_id, &cookie, &status, max_status, &fails);
> >>>>>    at which point fails <= ret <= max_status.
> >>>>>
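
Putting the cookie scheme from this sub-thread into code form - a sketch only, where src, dst and len stand in for the real copy parameters (the "xxx" above):

    typedef int32_t dma_cookie_t;   /* monotonically increasing job id */

    dma_cookie_t c = rte_dmadev_copy(dev_id, src, dst, len);
    if (c < 0) {
            /* enqueue failed, e.g. the queue is full - back off and retry */
    }

    /* happy path: no status array needed */
    dma_cookie_t last;
    uint16_t n = rte_dmadev_completed(dev_id, &last, NULL, 0, NULL);
    /* jobs with cookies (last - n + 1) .. last have now completed */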
> >>>>
> >>>> Completely agree that we need to plan for the happy-day case where all is passing. Looking at the prototypes you have above, I am ok with returning the number of completed ops as the return value, with the final completed cookie as an "out" parameter. For handling errors, I'm ok with what you propose above, just with one small adjustment - I would remove the restriction that ret <= max_status.
> >>>>
> >>>> In the case of zero failures, we can report as many ops succeeding as we like, and even in the case of failure, we can still report as many successful ops as we like before we start filling in the status field. For example, if 32 ops are completed and the last one fails, we can just fill in one entry in status and return 32. Alternatively, if the 4th-last one fails, we fill in 4 entries and return 32. The only requirements would be:
> >>>> * fails <= max_status
> >>>> * fails <= ret
> >>>> * cookie holds the id of the last entry in status.
> >>>
> >>> I think we understand the same:
> >>>
> >>> fails <= ret <= max_status covers the following situations:
> >>> 1) If max_status is 32 and there are 32 completed ops, then ret will be 32 no matter which op failed.
> >>> 2) If max_status is 33 and there are 32 completed ops, then ret will be 32.
> >>> 3) If max_status is 16 and there are 32 completed ops, then ret will be 16.
> >>>
> >>> And the cookie always holds the id of the last returned completed op, no matter whether it completed successfully or failed.
> >>>
> >>
> >> I actually disagree on #3. If max_status is 16, there are 32 completed ops, and *no failures*, then ret will be 32, not 16, because we are not returning any status entries, so max_status need not apply. Keeping that same scenario #3, depending on the number of failures and where they occur, the return value may similarly vary. For example:
> >> * If job #28 fails, then ret could still be 32, cookie would be the cookie for that job, the "fails" parameter would return as 4, and status would hold the failure of #28 plus the succeeded statuses of jobs #29-31, i.e. 4 elements.
> >> * If job #5 fails, then we can't fit the status list from #5 through #31 in an array of 16, so "fails" == 16 (max_status) and status contains the 16 statuses starting from #5, which means that cookie contains the value for job #20 and ret is 21.
> >>
> >> In other words, ignore the max_status and status parameters *unless we have an error to return*, meaning the fast-path/happy-day case works as fast as possible. You don't need to worry about sizing your status array to be big, and you always get back a large number of completions when available. Your fast-path code only needs to check the "fails" parameter to see if status ever needs to be consulted, and in the normal case it doesn't.
> >>
> >> If this is too complicated, maybe we can simplify a little by returning just one failure at a time, though at the cost of making error handling slower:
> >>
> >>     rte_dmadev_completed(dev_id, &cookie, &failure_status)
> >>
> >> In this case, we always return the number of completed ops on success, while on failure we return the first error code. For a single error this works fine, but if we get a burst of errors together, things will work more slowly - which may be acceptable if errors are very rare. However, for idxd at least, if a fence occurs after a failure, all jobs in the batch after the fence would be skipped, which would lead to the "burst of errors" case. Therefore, I'd prefer to have the original suggestion allowing multiple errors to be reported at a time.
> >>
> >> /Bruce
> >
> > Apologies for the self-reply, but thinking about it more, a combination of normal-case and error-case APIs may be just simpler:
> >
> >     int rte_dmadev_completed(dev_id, &cookie)
> >
> > returns the number of items completed and the cookie of the last item. If there is an error, it returns all successful values up to the error entry, and returns -1 on the subsequent call.
> >
> >     int rte_dmadev_completed_status(dev_id, &cookie, max_status, status_array, &error_count)
> >
> > This is a slower completion API which behaves as you originally described above, returning the number of completions x, 0 <= x <= max_status, with x status values filled into the array, and the number of unsuccessful values in error_count.
> >
> > This would allow application code to use rte_dmadev_completed() in the normal case and, on getting a "-1" value, to use rte_dmadev_completed_status() to get the error details. If strings of errors might be expected, the app can continually use the completed_status() function until error_count returns 0, and then switch back to the faster/simpler version.
>
> This two-function approach simplifies maintenance of the status_array, because we don't need to initialize it to zero. I think it's a good trade-off between performance and rich error info (status codes).
>
> Here I'd like to discuss the 'burst size', which is widely used in DPDK applications (e.g. NIC polling or ring en/dequeue). Currently we don't define a max number of completed ops in the rte_dmadev_completed() API, so the return value may be greater than the 'burst size' of the application. This may mean the application has to maintain (or remember) the return value of the function and do special handling at the next poll.
>
> Also, considering that there may be multiple calls to rte_dmadev_completed to check for failures, it may make the API difficult for the application to use.
>
> So I prefer the following prototypes:
>
>     uint16_t rte_dmadev_completed(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_cpls, bool *has_error)
>
> -- nb_cpls: indicates the max number of operations to process
> -- has_error: indicates whether there is an error
> -- return value: the number of successfully completed operations
> -- examples:
>    1) If there are already 32 completed ops, the 4th is an error, and nb_cpls is 32, then ret will be 3 (because the 1st/2nd/3rd are OK) and has_error will be true.
>    2) If there are already 32 completed ops, all completed successfully, then ret will be min(32, nb_cpls) and has_error will be false.
>    3) If there are already 32 completed ops, all failed, then ret will be 0 and has_error will be true.
>
>     uint16_t rte_dmadev_completed_status(uint16_t dev_id, dma_cookie_t *cookie, uint16_t nb_status, uint32_t *status)
>
> -- return value: the number of failed completed operations
>
> The application would use the following invocation order when polling:
>
>     has_error = false; // could be initialized to false by the dmadev API - we need to discuss this
>     ret = rte_dmadev_completed(dev_id, &cookie, burst_size, &has_error);
>     // process the successfully completed ops
>     for (int i = 0; i < ret; i++) {
>     }
>     if (unlikely(has_error)) {
>         // process the failed completed ops
>         ret = rte_dmadev_completed_status(dev_id, &cookie, burst_size - ret, status_array);
>         for (int i = 0; i < ret; i++) {
>             // ...
>         }
>     }
>
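Fleshing that invocation order out into a fuller sketch - the dmadev names are still only the proposals from this thread, app_process() and app_handle_error() are hypothetical application hooks, and it assumes cookies are sequential with the returned cookie identifying the last op covered by each call:

    #define BURST_SIZE 32

    static void
    poll_dma(uint16_t dev_id)
    {
            uint32_t status_array[BURST_SIZE];
            dma_cookie_t cookie;
            bool has_error = false;
            uint16_t i, ret;

            /* fast path: only the count of successful ops is returned */
            ret = rte_dmadev_completed(dev_id, &cookie, BURST_SIZE, &has_error);
            for (i = 0; i < ret; i++)
                    app_process(cookie - ret + 1 + i);

            if (unlikely(has_error)) {
                    /* slow path: fetch per-op status codes for the failures */
                    ret = rte_dmadev_completed_status(dev_id, &cookie,
                                    BURST_SIZE - ret, status_array);
                    for (i = 0; i < ret; i++)
                            app_handle_error(cookie - ret + 1 + i,
                                            status_array[i]);
            }
    }
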
Seems reasonable. Let's go with this as an option for now - I just want to check the perf impact of all these on the offload cost. Once I get our prototype working with some hardware (hopefully very soon), I can check this out directly.