From: fengchengwen
To: Bruce Richardson, Jerin Jacob, Morten Brørup, Nipun Gupta
Cc: Thomas Monjalon, Ferruh Yigit, dpdk-dev, Nipun Gupta, Hemant Agrawal,
    Maxime Coquelin, Honnappa Nagarahalli, David Marchand, Satananda Burla,
    Prasun Kapoor
Date: Sat, 26 Jun 2021 11:59:49 +0800
Subject: [dpdk-dev] dmadev discussion summary

Hi, all
  I analyzed the current DPDK DMA drivers and drew this summary in conjunction
with the previous discussion, and this will serve as a basis for the V2
implementation.
  Feedback is welcome, thanks.

dpaa2_qdma:
  [probe]: mainly obtains the number of hardware queues.
  [dev_configure]: has the following parameters:
      max_hw_queues_per_core: max number of HW-queues per core
      max_vqs: max number of virt-queues
      fle_queue_pool_cnt: the size of the FLE pool
  [queue_setup]: sets up one virt-queue, has the following parameters:
      lcore_id:
      flags: some control params, e.g. sg-list, long-format desc,
             exclusive HW-queue...
      rbp: some misc fields which impact the descriptor
      Note: this API returns the index of the virt-queue which was
            successfully set up.
  [enqueue_bufs]: data-plane API, the key fields:
      vq_id: the index of the virt-queue
      job: the pointer to the job array
      nb_jobs:
      Note: one job has src/dest/len/flag/cnxt/status/vq_id/use_elem fields;
            the flag field indicates whether src/dst are PHY addresses.
            A rough sketch of the job layout follows below.
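  To make the field list above concrete, here is a rough sketch of what one
  job carries. The field names are the ones listed above; the types, the
  comments and the struct name are my guesses, not the exact rte_qdma_job
  definition.

      struct qdma_job_sketch {
          uint64_t src;       /* source address; flag says if it is a PHY addr */
          uint64_t dest;      /* destination address */
          uint32_t len;       /* transfer length */
          uint32_t flag;      /* control bits, e.g. PHY vs virtual addressing */
          uint64_t cnxt;      /* user context */
          uint16_t status;    /* completion status written back on dequeue */
          uint16_t vq_id;     /* owning virt-queue */
          uint8_t  use_elem;
      };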
  [dequeue_bufs]: gets the completed jobs' pointers.
  [key point]:
      -----------    -----------
      |virt-queue|   |virt-queue|
      -----------    -----------
             \          /
              \        /
               \      /
      ------------  ------------
      | HW-queue |  | HW-queue |
      ------------  ------------
              \        /
               \      /
                \    /
              core/rawdev
    1) In the probe stage, the driver tells how many HW-queues can be used.
    2) User could specify the maximum number of HW-queues managed by a single
       core in the dev_configure stage.
    3) User could create one virt-queue by the queue_setup API; the virt-queue
       has two types: a) exclusive HW-queue, b) shared HW-queue (as described
       above); this is selected by the corresponding bit of the flags field.
    4) In this mode, queue management is simplified. Users do not need to
       specify which HW-queue to request and then create a virt-queue on it;
       all they need to do is say on which core the virt-queue should be
       created.
    5) The virt-queues could have different capabilities, e.g. virt-queue-0
       supports scatter-gather format while virt-queue-1 doesn't; this is
       controlled by the flags and rbp fields in the queue_setup stage.
    6) The data-plane APIs use definitions similar to rte_mbuf and
       rte_eth_rx/tx_burst().
    PS: I still don't understand how sg-list enqueue/dequeue works, and how
        users should use RTE_QDMA_VQ_NO_RESPONSE.

    Overall, I think it's a flexible design with good scalability. In
    particular, the queue resource pool architecture simplifies user
    invocations, although the 'core' is introduced a bit abruptly.

octeontx2_dma:
  [dev_configure]: has one parameter:
      chunk_pool: it's strange that it's not managed internally by the driver
                  but passed in through the API.
  [enqueue_bufs]: has three important parameters:
      context: this is what Jerin referred to as the 'channel'; it could hold
               the completion ring of the jobs.
      buffers: holds the pointer array of dpi_dma_buf_ptr_s
      count: how many dpi_dma_buf_ptr_s
      Note: one dpi_dma_buf_ptr_s may have many src and dst pairs (it's a
            scatter-gather list), and has one completed_ptr (when the HW
            completes, it will write one value to this ptr); currently the
            completed_ptr points to the struct:
                struct dpi_dma_req_compl_s {
                    uint64_t cdata;  --driver inits this, HW updates the result here
                    void (*compl_cb)(void *dev, void *arg);
                    void *cb_data;
                };
  [dequeue_bufs]: has two important parameters:
      context: the driver will scan its completion ring to get the completion
               info.
      buffers: holds the pointer array of completed_ptr.
  [key point]:
      -----------    -----------
      | channel  |   | channel  |
      -----------    -----------
             \          /
              \        /
               \      /
            ------------
            | HW-queue |
            ------------
                  |
               --------
               |rawdev|
               --------
    1) User could create one channel by initializing a context
       (dpi_dma_queue_ctx_s); this interface is not standardized and needs to
       be implemented by users.
    2) Different channels can support different transmissions, e.g. one for
       inner m2m, and another for inbound copy.

    Overall, I think the 'channel' is similar to the 'virt-queue' of
    dpaa2_qdma. The difference is that dpaa2_qdma supports multiple hardware
    queues. The 'channel' has the following characteristics:
    1) A channel is an operable unit at the user level. Users can create a
       channel for each transfer type, for example, a local-to-local channel
       and a local-to-host channel. Users could also get the completed status
       of one channel.
    2) Multiple channels can run on the same HW-queue. In terms of API design,
       this reduces the number of data-plane API parameters. The channel could
       hold context info which will be referred to when the data-plane APIs
       execute.

ioat:
  [probe]: creates multiple rawdevs if it's a DSA device with multiple
           HW-queues.
  [dev_configure]: has three parameters:
      ring_size: the size of the HW descriptor ring
      hdls_disable: whether to ignore the user-supplied handle params
      no_prefetch_completions:
  [rte_ioat_enqueue_copy]: has dev_id/src/dst/length/src_hdl/dst_hdl
      parameters.
  [rte_ioat_completed_ops]: has dev_id/max_copies/status/num_unsuccessful/
      src_hdls/dst_hdls parameters.
  (A rough usage sketch of these calls follows below.)

  Overall, it's one rawdev per HW-queue, and there is no 'channel' concept
  similar to octeontx2_dma.
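  Since I only listed the parameter names above, here is a rough usage sketch
  of the two ioat data-plane calls plus the doorbell call. It is written from
  my reading of rte_ioat_rawdev.h, so treat the exact signatures as
  approximate; dev_id, the addresses and the handles are placeholders.

      #include <rte_ioat_rawdev.h>

      static int
      ioat_copy_one(int dev_id, rte_iova_t src, rte_iova_t dst,
                    unsigned int len)
      {
          uint32_t status;
          uint8_t num_unsuccessful = 0;
          uintptr_t src_hdl = 0, dst_hdl = 0;

          /* enqueue returns the number of ops accepted: 1, or 0 if the
           * descriptor ring is full */
          if (rte_ioat_enqueue_copy(dev_id, src, dst, len,
                                    src_hdl, dst_hdl) != 1)
              return -1;

          rte_ioat_perform_ops(dev_id);   /* ring the doorbell */

          /* poll for at most one completion; status and num_unsuccessful
           * report per-op errors */
          return rte_ioat_completed_ops(dev_id, 1, &status, &num_unsuccessful,
                                        &src_hdl, &dst_hdl);
      }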
Kunpeng_dma:
  1) The hardware supports multiple modes (e.g. local-to-local /
     local-to-pciehost / pciehost-to-local / immediate-to-local copy).
     Note: Currently, we only implement local-to-local copy.
  2) The hardware supports multiple HW-queues.

Summary:
  1) The dpaa2/octeontx2/Kunpeng are all ARM SoCs which may act as endpoints
     of an x86 host (e.g. smart NIC), so multiple memory transfer requirements
     may exist, e.g. local-to-local/local-to-host...; from the point of view
     of API design, I think we should adopt a similar 'channel' or
     'virt-queue' concept.
  2) Whether to create a separate dmadev for each HW-queue? We previously
     discussed this, and because HW-queues can be managed independently (like
     Kunpeng_dma and Intel DSA), we preferred creating a separate dmadev for
     each HW-queue. But I'm not sure if that's the case with dpaa. I think
     that can be left to the specific driver; no restriction is imposed at the
     framework API layer.
  3) I think we could set up the following abstraction at the dmadev device:
      -----------    -----------
      |virt-queue|   |virt-queue|
      -----------    -----------
             \          /
              \        /
               \      /
      ------------  ------------
      | HW-queue |  | HW-queue |
      ------------  ------------
              \        /
               \      /
                \    /
               dmadev
  4) The driver's ops design (here we only list key points):
     [dev_info_get]: mainly returns the number of HW-queues
     [dev_configure]: nothing important
     [queue_setup]: creates one virt-queue, has the following main parameters:
         HW-queue-index: the HW-queue index used
         nb_desc: the number of HW descriptors
         opaque: driver-specific info
         Note1: this API returns a virt-queue index which will be used in
                later APIs. If users want to create multiple virt-queues on
                the same HW-queue, they can do so by calling queue_setup with
                the same HW-queue-index.
         Note2: I think it's hard to define a queue_setup config parameter,
                and since this is a control API, I think it's OK to use an
                opaque pointer to implement it.
     [dma_copy/memset/sg]: all have a vq_id input parameter.
         Note: I notice dpaa can't support single and sg in one virt-queue,
               and I think it's maybe a software implementation policy rather
               than a HW restriction, because virt-queues could share the
               same HW-queue. Here we use vq_id to tackle different
               scenarios, like local-to-local/local-to-host etc.
  5) And the dmadev public data-plane API (just a prototype):

     dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
       -- flags: used as an extended parameter, it could be uint32_t

     dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)

     dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
       -- sg: struct dma_scatterlist array

     uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
                                   uint16_t nb_cpls, bool *has_error)
       -- nb_cpls: indicates the max number of operations to process
       -- has_error: indicates if there is an error
       -- return value: the number of successfully completed operations.
       -- example:
          1) If there are already 32 completed ops, the 4th is an error, and
             nb_cpls is 32, then the ret will be 3 (because the 1st/2nd/3rd
             are OK), and has_error will be true.
          2) If there are already 32 completed ops and all completed
             successfully, then the ret will be min(32, nb_cpls), and
             has_error will be false.
          3) If there are already 32 completed ops and all failed, then the
             ret will be 0, and has_error will be true.

     uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
                                          uint16_t nb_status, uint32_t *status)
       -- return value: the number of failed completed operations.

     And here I agree with Morten: we should design an API which adapts to
     DPDK service scenarios. So we don't support some sound-card DMA, nor 2D
     memory copy which is mainly used in video scenarios.
     (A header-style sketch collecting these prototypes is below.)
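  To make 5) easier to scan, here is the same set of prototypes collected into
  a header-style sketch. The concrete types I picked for the device handle,
  the addresses and struct dma_scatterlist are only placeholders so that it
  compiles; they are still open points (see 12) below), not part of the
  proposal.

      #include <stdint.h>
      #include <stdbool.h>

      typedef int32_t dma_cookie_t;   /* signed; <0 means error, see 6) below */

      struct dma_scatterlist {        /* placeholder layout for the sg variant */
          void    *src;
          void    *dst;
          uint32_t length;
      };

      dma_cookie_t rte_dmadev_memset(uint16_t dev_id, uint16_t vq_id,
                                     uint64_t pattern, void *dst,
                                     uint32_t len, uint32_t flags);
      dma_cookie_t rte_dmadev_memcpy(uint16_t dev_id, uint16_t vq_id,
                                     void *src, void *dst,
                                     uint32_t len, uint32_t flags);
      dma_cookie_t rte_dmadev_memcpy_sg(uint16_t dev_id, uint16_t vq_id,
                                        const struct dma_scatterlist *sg,
                                        uint32_t sg_len, uint32_t flags);
      uint16_t rte_dmadev_completed(uint16_t dev_id, uint16_t vq_id,
                                    dma_cookie_t *cookie, uint16_t nb_cpls,
                                    bool *has_error);
      uint16_t rte_dmadev_completed_status(uint16_t dev_id, uint16_t vq_id,
                                           dma_cookie_t *cookie,
                                           uint16_t nb_status,
                                           uint32_t *status);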
  6) The dma_cookie_t is a signed int type; when <0 it means error. It is
     monotonically increasing based on the HW-queue (rather than the
     virt-queue). The driver needs to ensure this because the dmadev framework
     doesn't manage the dma_cookie's creation (a minimal sketch of what I mean
     is at the end of this mail).
  7) Because the data-plane APIs are not thread-safe, and the user determines
     the virt-queue to HW-queue mapping (at the queue_setup stage), it is the
     user's duty to ensure thread safety.
  8) One example:
       vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
       if (vq_id < 0) {
           // create virt-queue failed
           return;
       }
       // submit memcpy task
       cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
       if (cookie < 0) {
           // submit failed
           return;
       }
       // get completed task
       ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
       if (!has_error && ret == 1) {
           // the memcpy completed successfully
       }
  9) As octeontx2_dma supports a sg-list which has many valid buffers in
     dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
  10) As for ioat, it could declare support for one HW-queue at the
      dev_configure stage, and only support creating one virt-queue.
  11) As for dpaa2_qdma, I think it could migrate to the new framework, but
      we are still waiting for feedback from the dpaa2_qdma guys.
  12) About the prototype src/dst parameters of the rte_dmadev_memcpy API, we
      have two candidates, iova and void *; how about introducing a
      dma_addr_t type which could be va or iova?
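  PS: the minimal sketch mentioned in 6): the cookie is generated per HW-queue
  by the driver. The struct and function names here are hypothetical driver
  internals, not proposed API.

      #include <stdint.h>

      typedef int32_t dma_cookie_t;

      struct drv_hw_queue {
          dma_cookie_t last_cookie;   /* per HW-queue, monotonically increasing */
      };

      static inline dma_cookie_t
      drv_hw_queue_next_cookie(struct drv_hw_queue *hq)
      {
          /* stay non-negative so that <0 remains reserved for errors */
          hq->last_cookie = (hq->last_cookie + 1) & INT32_MAX;
          return hq->last_cookie;
      }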