From: fengchengwen
To: Bruce Richardson, Jerin Jacob, Morten Brørup, Nipun Gupta
Cc: Thomas Monjalon, Ferruh Yigit, dpdk-dev, Nipun Gupta, Hemant Agrawal,
    Maxime Coquelin, Honnappa Nagarahalli, David Marchand, Satananda Burla,
    Prasun Kapoor
Date: Sat, 26 Jun 2021 11:59:49 +0800
Subject: [dpdk-dev] dmadev discussion summary

Hi, all
  I analyzed the current DPDK DMA drivers and drew this summary in conjunction
with the previous discussion, and this will serve as a basis for the V2
implementation.
  Feedback is welcome, thanks.

dpaa2_qdma:
  [probe]: mainly obtains the number of hardware queues.
  [dev_configure]: has the following parameters:
      max_hw_queues_per_core: max number of HW-queues per core
      max_vqs: max number of virt-queues
      fle_queue_pool_cnt: the size of the FLE pool
  [queue_setup]: sets up one virt-queue, has the following parameters:
      lcore_id:
      flags: some control params, e.g. sg-list, long-format desc,
             exclusive HW-queue...
      rbp: some misc fields which impact the descriptor
      Note: this API returns the index of the virt-queue which was
            successfully set up.
  [enqueue_bufs]: data-plane API, the key fields:
      vq_id: the index of the virt-queue
      job: the pointer to the job array
      nb_jobs:
      Note: one job has src/dest/len/flag/cnxt/status/vq_id/use_elem fields;
            the flag field indicates whether src/dst are PHY addresses.
            A rough sketch of the job layout follows below.
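  To make the field list above concrete, here is a rough sketch of what one
  job carries. The field names are the ones listed above; the types, the
  comments and the struct name are my guesses, not the exact rte_qdma_job
  definition.

      struct qdma_job_sketch {
          uint64_t src;       /* source address; flag says if it is a PHY addr */
          uint64_t dest;      /* destination address */
          uint32_t len;       /* transfer length */
          uint32_t flag;      /* control bits, e.g. PHY vs virtual addressing */
          uint64_t cnxt;      /* user context */
          uint16_t status;    /* completion status written back on dequeue */
          uint16_t vq_id;     /* owning virt-queue */
          uint8_t  use_elem;
      };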
  [dequeue_bufs]: gets the completed jobs' pointers.
  [key point]:
      -----------    -----------
      |virt-queue|   |virt-queue|
      -----------    -----------
             \          /
              \        /
               \      /
      ------------  ------------
      | HW-queue |  | HW-queue |
      ------------  ------------
              \        /
               \      /
                \    /
              core/rawdev
    1) In the probe stage, the driver tells how many HW-queues can be used.
    2) User could specify the maximum number of HW-queues managed by a single
       core in the dev_configure stage.
    3) User could create one virt-queue by the queue_setup API; the virt-queue
       has two types: a) exclusive HW-queue, b) shared HW-queue (as described
       above); this is selected by the corresponding bit of the flags field.
    4) In this mode, queue management is simplified. Users do not need to
       specify which HW-queue to request and then create a virt-queue on it;
       all they need to do is say on which core the virt-queue should be
       created.
    5) The virt-queues could have different capabilities, e.g. virt-queue-0
       supports scatter-gather format while virt-queue-1 doesn't; this is
       controlled by the flags and rbp fields in the queue_setup stage.
    6) The data-plane APIs use definitions similar to rte_mbuf and
       rte_eth_rx/tx_burst().
    PS: I still don't understand how sg-list enqueue/dequeue works, and how
        users should use RTE_QDMA_VQ_NO_RESPONSE.

    Overall, I think it's a flexible design with good scalability. In
    particular, the queue resource pool architecture simplifies user
    invocations, although the 'core' is introduced a bit abruptly.

octeontx2_dma:
  [dev_configure]: has one parameter:
      chunk_pool: it's strange that it's not managed internally by the driver
                  but passed in through the API.
  [enqueue_bufs]: has three important parameters:
      context: this is what Jerin referred to as the 'channel'; it could hold
               the completion ring of the jobs.
      buffers: holds the pointer array of dpi_dma_buf_ptr_s
      count: how many dpi_dma_buf_ptr_s
      Note: one dpi_dma_buf_ptr_s may have many src and dst pairs (it's a
            scatter-gather list), and has one completed_ptr (when the HW
            completes, it will write one value to this ptr); currently the
            completed_ptr points to the struct:
                struct dpi_dma_req_compl_s {
                    uint64_t cdata;  --driver inits this, HW updates the result here
                    void (*compl_cb)(void *dev, void *arg);
                    void *cb_data;
                };
  [dequeue_bufs]: has two important parameters:
      context: the driver will scan its completion ring to get the completion
               info.
      buffers: holds the pointer array of completed_ptr.
  [key point]:
      -----------    -----------
      | channel  |   | channel  |
      -----------    -----------
             \          /
              \        /
               \      /
            ------------
            | HW-queue |
            ------------
                  |
               --------
               |rawdev|
               --------
    1) User could create one channel by initializing a context
       (dpi_dma_queue_ctx_s); this interface is not standardized and needs to
       be implemented by users.
    2) Different channels can support different transmissions, e.g. one for
       inner m2m, and another for inbound copy.

    Overall, I think the 'channel' is similar to the 'virt-queue' of
    dpaa2_qdma. The difference is that dpaa2_qdma supports multiple hardware
    queues. The 'channel' has the following characteristics:
    1) A channel is an operable unit at the user level. Users can create a
       channel for each transfer type, for example, a local-to-local channel
       and a local-to-host channel. Users could also get the completed status
       of one channel.
    2) Multiple channels can run on the same HW-queue. In terms of API design,
       this reduces the number of data-plane API parameters. The channel could
       hold context info which will be referred to when the data-plane APIs
       execute.

ioat:
  [probe]: creates multiple rawdevs if it's a DSA device with multiple
           HW-queues.
  [dev_configure]: has three parameters:
      ring_size: the size of the HW descriptor ring
      hdls_disable: whether to ignore the user-supplied handle params
      no_prefetch_completions:
  [rte_ioat_enqueue_copy]: has dev_id/src/dst/length/src_hdl/dst_hdl
      parameters.
  [rte_ioat_completed_ops]: has dev_id/max_copies/status/num_unsuccessful/
      src_hdls/dst_hdls parameters.
  (A rough usage sketch of these calls follows below.)

  Overall, it's one rawdev per HW-queue, and there is no 'channel' concept
  similar to octeontx2_dma.
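  Since I only listed the parameter names above, here is a rough usage sketch
  of the two ioat data-plane calls plus the doorbell call. It is written from
  my reading of rte_ioat_rawdev.h, so treat the exact signatures as
  approximate; dev_id, the addresses and the handles are placeholders.

      #include <rte_ioat_rawdev.h>

      static int
      ioat_copy_one(int dev_id, rte_iova_t src, rte_iova_t dst,
                    unsigned int len)
      {
          uint32_t status;
          uint8_t num_unsuccessful = 0;
          uintptr_t src_hdl = 0, dst_hdl = 0;

          /* enqueue returns the number of ops accepted: 1, or 0 if the
           * descriptor ring is full */
          if (rte_ioat_enqueue_copy(dev_id, src, dst, len,
                                    src_hdl, dst_hdl) != 1)
              return -1;

          rte_ioat_perform_ops(dev_id);   /* ring the doorbell */

          /* poll for at most one completion; status and num_unsuccessful
           * report per-op errors */
          return rte_ioat_completed_ops(dev_id, 1, &status, &num_unsuccessful,
                                        &src_hdl, &dst_hdl);
      }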
Kunpeng_dma:
  1) The hardware supports multiple modes (e.g. local-to-local /
     local-to-pciehost / pciehost-to-local / immediate-to-local copy).
     Note: Currently, we only implement local-to-local copy.
  2) The hardware supports multiple HW-queues.

Summary:
  1) The dpaa2/octeontx2/Kunpeng are all ARM SoCs which may act as endpoints
     of an x86 host (e.g. smart NIC), so multiple memory transfer requirements
     may exist, e.g. local-to-local/local-to-host...; from the point of view
     of API design, I think we should adopt a similar 'channel' or
     'virt-queue' concept.
  2) Whether to create a separate dmadev for each HW-queue? We previously
     discussed this, and because HW-queues can be managed independently (like
     Kunpeng_dma and Intel DSA), we preferred creating a separate dmadev for
     each HW-queue. But I'm not sure if that's the case with dpaa. I think
     that can be left to the specific driver; no restriction is imposed at the
     framework API layer.
  3) I think we could set up the following abstraction at the dmadev device:
      -----------    -----------
      |virt-queue|   |virt-queue|
      -----------    -----------
             \          /
              \        /
               \      /
      ------------  ------------
      | HW-queue |  | HW-queue |
      ------------  ------------
              \        /
               \      /
                \    /
               dmadev
  4) The driver's ops design (here we only list key points):
     [dev_info_get]: mainly returns the number of HW-queues
     [dev_configure]: nothing important
     [queue_setup]: creates one virt-queue, has the following main parameters:
         HW-queue-index: the HW-queue index used
         nb_desc: the number of HW descriptors
         opaque: driver-specific info
         Note1: this API returns a virt-queue index which will be used in
                later APIs. If users want to create multiple virt-queues on
                the same HW-queue, they can do so by calling queue_setup with
                the same HW-queue-index.
         Note2: I think it's hard to define a queue_setup config parameter,
                and since this is a control API, I think it's OK to use an
                opaque pointer to implement it.
     [dma_copy/memset/sg]: all have a vq_id input parameter.
         Note: I notice dpaa can't support single and sg in one virt-queue,
               and I think it's maybe a software implementation policy rather
               than a HW restriction, because virt-queues could share the
               same HW-queue. Here we use vq_id to tackle different
               scenarios, like local-to-local/local-to-host etc.
  5) And the dmadev public data-plane API (just a prototype):

     dma_cookie_t rte_dmadev_memset(dev, vq_id, pattern, dst, len, flags)
       -- flags: used as an extended parameter, it could be uint32_t

     dma_cookie_t rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags)

     dma_cookie_t rte_dmadev_memcpy_sg(dev, vq_id, sg, sg_len, flags)
       -- sg: struct dma_scatterlist array

     uint16_t rte_dmadev_completed(dev, vq_id, dma_cookie_t *cookie,
                                   uint16_t nb_cpls, bool *has_error)
       -- nb_cpls: indicates the max number of operations to process
       -- has_error: indicates if there is an error
       -- return value: the number of successfully completed operations.
       -- example:
          1) If there are already 32 completed ops, the 4th is an error, and
             nb_cpls is 32, then the ret will be 3 (because the 1st/2nd/3rd
             are OK), and has_error will be true.
          2) If there are already 32 completed ops and all completed
             successfully, then the ret will be min(32, nb_cpls), and
             has_error will be false.
          3) If there are already 32 completed ops and all failed, then the
             ret will be 0, and has_error will be true.

     uint16_t rte_dmadev_completed_status(dev_id, vq_id, dma_cookie_t *cookie,
                                          uint16_t nb_status, uint32_t *status)
       -- return value: the number of failed completed operations.

     And here I agree with Morten: we should design an API which adapts to
     DPDK service scenarios. So we don't support some sound-card DMA, nor 2D
     memory copy which is mainly used in video scenarios.
     (A header-style sketch collecting these prototypes is below.)
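  To make 5) easier to scan, here is the same set of prototypes collected into
  a header-style sketch. The concrete types I picked for the device handle,
  the addresses and struct dma_scatterlist are only placeholders so that it
  compiles; they are still open points (see 12) below), not part of the
  proposal.

      #include <stdint.h>
      #include <stdbool.h>

      typedef int32_t dma_cookie_t;   /* signed; <0 means error, see 6) below */

      struct dma_scatterlist {        /* placeholder layout for the sg variant */
          void    *src;
          void    *dst;
          uint32_t length;
      };

      dma_cookie_t rte_dmadev_memset(uint16_t dev_id, uint16_t vq_id,
                                     uint64_t pattern, void *dst,
                                     uint32_t len, uint32_t flags);
      dma_cookie_t rte_dmadev_memcpy(uint16_t dev_id, uint16_t vq_id,
                                     void *src, void *dst,
                                     uint32_t len, uint32_t flags);
      dma_cookie_t rte_dmadev_memcpy_sg(uint16_t dev_id, uint16_t vq_id,
                                        const struct dma_scatterlist *sg,
                                        uint32_t sg_len, uint32_t flags);
      uint16_t rte_dmadev_completed(uint16_t dev_id, uint16_t vq_id,
                                    dma_cookie_t *cookie, uint16_t nb_cpls,
                                    bool *has_error);
      uint16_t rte_dmadev_completed_status(uint16_t dev_id, uint16_t vq_id,
                                           dma_cookie_t *cookie,
                                           uint16_t nb_status,
                                           uint32_t *status);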
  6) The dma_cookie_t is a signed int type; when <0 it means error. It is
     monotonically increasing based on the HW-queue (rather than the
     virt-queue). The driver needs to ensure this because the dmadev framework
     doesn't manage the dma_cookie's creation (a minimal sketch of what I mean
     is at the end of this mail).
  7) Because the data-plane APIs are not thread-safe, and the user determines
     the virt-queue to HW-queue mapping (at the queue_setup stage), it is the
     user's duty to ensure thread safety.
  8) One example:
       vq_id = rte_dmadev_queue_setup(dev, config.{HW-queue-index=x, opaque});
       if (vq_id < 0) {
           // create virt-queue failed
           return;
       }
       // submit memcpy task
       cookie = rte_dmadev_memcpy(dev, vq_id, src, dst, len, flags);
       if (cookie < 0) {
           // submit failed
           return;
       }
       // get completed task
       ret = rte_dmadev_completed(dev, vq_id, &cookie, 1, &has_error);
       if (!has_error && ret == 1) {
           // the memcpy completed successfully
       }
  9) As octeontx2_dma supports a sg-list which has many valid buffers in
     dpi_dma_buf_ptr_s, it could call the rte_dmadev_memcpy_sg API.
  10) As for ioat, it could declare support for one HW-queue at the
      dev_configure stage, and only support creating one virt-queue.
  11) As for dpaa2_qdma, I think it could migrate to the new framework, but
      we are still waiting for feedback from the dpaa2_qdma guys.
  12) About the prototype src/dst parameters of the rte_dmadev_memcpy API, we
      have two candidates, iova and void *; how about introducing a
      dma_addr_t type which could be va or iova?
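  PS: the minimal sketch mentioned in 6): the cookie is generated per HW-queue
  by the driver. The struct and function names here are hypothetical driver
  internals, not proposed API.

      #include <stdint.h>

      typedef int32_t dma_cookie_t;

      struct drv_hw_queue {
          dma_cookie_t last_cookie;   /* per HW-queue, monotonically increasing */
      };

      static inline dma_cookie_t
      drv_hw_queue_next_cookie(struct drv_hw_queue *hq)
      {
          /* stay non-negative so that <0 remains reserved for errors */
          hq->last_cookie = (hq->last_cookie + 1) & INT32_MAX;
          return hq->last_cookie;
      }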