From: "Fu, Patrick"
To: "dev@dpdk.org"
CC: Maxime Coquelin, "Ye, Xiaolong", "Hu, Jiayu", "Wang, Zhihong", "Liang, Cunming"
Date: Fri, 17 Apr 2020 07:26:34 +0000
Subject: [dpdk-dev] [RFC] Accelerating Data Movement for DPDK vHost with DMA Engines

Background
======================================

The DPDK vhost library implements a user-space VirtIO net backend that allows host applications to communicate directly with VirtIO front-ends in VMs and containers. However, every vhost enqueue/dequeue operation requires copying packet buffers between guest and host memory. The overhead of copying large amounts of data makes the vhost backend an I/O bottleneck. DMA engines, including uncore DMA accelerators such as Crystal Beach DMA (CBDMA) and the Data Streaming Accelerator (DSA), as well as general-purpose DMA engines on discrete cards, are extremely efficient at moving data within system memory. Therefore, we propose a set of asynchronous DMA data movement APIs in the vhost library for DMA acceleration.
Offloading packet copies in the vhost data-path from the CPU to a DMA engine not only accelerates data transfers, but also saves precious CPU core resources.

New API Overview
======================================

The proposed APIs in the vhost library support various DMA engines to accelerate data transfers in the data-path. For higher performance, DMA engines work in an asynchronous manner, in which DMA data transfers and CPU computations are executed in parallel. The proposal consists of control-path APIs and data-path APIs. The control-path APIs comprise the registration APIs and the DMA operation callbacks; the data-path APIs comprise the asynchronous APIs. To remove dependencies on vendor-specific DMA engines, the DMA operation callbacks provide generic abstractions for DMA data transfers. To support asynchronous DMA data movement, the new async APIs provide asynchronous ring operation semantics in the data-path. To enable or disable DMA acceleration for a virtqueue, applications use the registration APIs to register/unregister DMA callback implementations with the vhost library and to bind/unbind a DMA channel to/from a virtqueue. The DMA channels used by virtqueues are provided by the DPDK application and are backed by virtual or physical DMA devices.

The proposed APIs consist of 3 sub-sets:
1. DMA Registration APIs
2. DMA Operation Callbacks
3. Async Data APIs

DMA Registration APIs
======================================

DMA acceleration is enabled on a per-queue basis. DPDK applications need to explicitly decide whether a virtqueue needs DMA acceleration and which DMA channel to use. In addition, a DMA channel is dedicated to a virtqueue and cannot be bound to multiple virtqueues at the same time. To enable DMA acceleration for a virtqueue, a DPDK application first implements the DMA operation callbacks for a specific DMA type (e.g. CBDMA), then registers the callbacks with the vhost library to bind a DMA channel to the virtqueue, and finally uses the new async APIs to perform data-path operations on the virtqueue (see the sketch after the callback definitions below).

The definitions of the registration APIs are shown below:

int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
                struct rte_vhost_async_channel_ops *ops);

int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);

"rte_vhost_async_channel_register" registers implemented DMA operation callbacks with the vhost library and binds a DMA channel to a virtqueue. DPDK applications must implement the corresponding DMA operation callbacks for the DMA engines they use. To enable DMA acceleration for a virtqueue, a DPDK application needs to explicitly call "rte_vhost_async_channel_register" for that virtqueue. The "ops" parameter points to the callback implementation.

"rte_vhost_async_channel_unregister" unregisters the DMA operation callbacks and unbinds the DMA channel from the virtqueue. If a virtqueue is not bound to a DMA channel, it uses the SW data-path without DMA acceleration.

DMA Operation Callbacks
======================================

The definitions of the DMA operation callbacks are shown below:

struct iovec { /** this is the kernel uapi structure */
        void *iov_base;         /** buffer address */
        size_t iov_len;         /** buffer length */
};

struct iov_iter {
        size_t iov_offset;      /** offset into the first iovec */
        size_t count;           /** total bytes of a packet */
        struct iovec *iov;      /** array of data buffers */
        unsigned long nr_segs;  /** number of iovec structures */
        uintptr_t usr_data;     /** app-specific memory handle */
};

struct dma_trans_desc {
        struct iov_iter *src;   /** source memory iov_iter */
        struct iov_iter *dst;   /** destination memory iov_iter */
};

struct dma_trans_status {
        uintptr_t src_usr_data; /** handle of the completed source iov_iter */
        uintptr_t dst_usr_data; /** handle of the completed destination iov_iter */
};

struct rte_vhost_async_channel_ops {
        /** Instruct a DMA channel to perform copies for a batch of packets */
        int (*transfer_data)(struct dma_trans_desc *descs,
                        uint16_t count);
        /** Check copy-completed packets from a DMA channel */
        int (*check_completed_copies)(struct dma_trans_status *usr_data,
                        uint16_t max_packets);
};

The first callback, "transfer_data", submits a batch of packet copies to a DMA channel. As a packet's source or destination buffer can be a vector of buffers or a single data stream, we use "struct dma_trans_desc" to describe the source and destination buffers of a packet. Copying a packet means moving data from the source iov_iter structure to the destination iov_iter structure. The "count" is the number of packets to copy.

The second callback, "check_completed_copies", queries the completion status of the DMA. A "usr_data" member variable is embedded in the "iov_iter" structure; it serves as a unique identifier of the memory region described by the "iov_iter". As the source/destination buffers can be scatter-gather, the DMA channel may perform its copies out of order. When all copies of an iov_iter are completed by the DMA channel, "check_completed_copies" should return the associated "usr_data" via the "dma_trans_status" structure.
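To make the callback contract concrete, below is a minimal sketch of a pure-software implementation of the two callbacks against the structures above. It copies with the CPU, so every packet is complete as soon as "transfer_data" returns; "completion_ring_push"/"completion_ring_pop" are hypothetical application helpers (e.g. backed by an rte_ring) for queuing finished identifiers, and the return-value conventions are assumptions, since this RFC does not pin them down. A real CBDMA implementation would instead program descriptors through the IOAT PMD and poll the hardware:

#include <rte_common.h>
#include <rte_memcpy.h>

/* Hypothetical helpers; a real application might back these with an
 * rte_ring of completion records. */
void completion_ring_push(uintptr_t src_usr_data, uintptr_t dst_usr_data);
int completion_ring_pop(struct dma_trans_status *out, uint16_t max);

static int
sw_transfer_data(struct dma_trans_desc *descs, uint16_t count)
{
        uint16_t i;

        for (i = 0; i < count; i++) {
                struct iov_iter *src = descs[i].src;
                struct iov_iter *dst = descs[i].dst;
                unsigned long s = 0, d = 0;
                size_t s_off = src->iov_offset;
                size_t d_off = dst->iov_offset;
                size_t left = src->count;

                /* Walk both scatter-gather lists, copying the smaller
                 * of the two current segments at each step. */
                while (left > 0 && s < src->nr_segs && d < dst->nr_segs) {
                        size_t len = RTE_MIN(left,
                                RTE_MIN(src->iov[s].iov_len - s_off,
                                        dst->iov[d].iov_len - d_off));

                        rte_memcpy((char *)dst->iov[d].iov_base + d_off,
                                   (char *)src->iov[s].iov_base + s_off,
                                   len);
                        s_off += len;
                        d_off += len;
                        left -= len;
                        if (s_off == src->iov[s].iov_len) {
                                s++;
                                s_off = 0;
                        }
                        if (d_off == dst->iov[d].iov_len) {
                                d++;
                                d_off = 0;
                        }
                }
                /* CPU copies finish inline; record the identifiers so
                 * that check_completed_copies() can report them. */
                completion_ring_push(src->usr_data, dst->usr_data);
        }
        return count; /* assumed: number of packets accepted */
}

static int
sw_check_completed_copies(struct dma_trans_status *usr_data,
                uint16_t max_packets)
{
        /* Drain up to max_packets completion records into usr_data[]. */
        return completion_ring_pop(usr_data, max_packets);
}

static struct rte_vhost_async_channel_ops sw_copy_ops = {
        .transfer_data = sw_transfer_data,
        .check_completed_copies = sw_check_completed_copies,
};

Binding this implementation to a virtqueue is then a single call: rte_vhost_async_channel_register(vid, queue_id, &sw_copy_ops).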
Async Data APIs
======================================

The definitions of the new enqueue APIs are shown below:

uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
                struct rte_mbuf **pkts, uint16_t count);

uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
                struct rte_mbuf **pkts, uint16_t count);

"rte_vhost_submit_enqueue_burst" enqueues a batch of packets to a virtqueue, handing ownership of the enqueued packets to the vhost library. DPDK applications cannot reuse the enqueued packets until they get the ownership back. For a virtqueue with DMA acceleration enabled via "rte_vhost_async_channel_register", "rte_vhost_submit_enqueue_burst" uses the bound DMA channel to perform the packet copies; moreover, the function is non-blocking: it merely submits the packet copies to the DMA channel without waiting for their completion. For a virtqueue without DMA acceleration, "rte_vhost_submit_enqueue_burst" uses the SW data-path, where the CPU performs the packet copies. It is worth noting that DPDK applications cannot directly reuse packet buffers enqueued by "rte_vhost_submit_enqueue_burst" even on the SW data-path.

"rte_vhost_poll_enqueue_completed" returns ownership of the packets whose copies are all completed at that point, whether performed by the DMA channel or by the CPU. It is a non-blocking function that does not wait for DMA copy completion. After getting back the ownership of packets enqueued by "rte_vhost_submit_enqueue_burst", DPDK applications can further process the packet buffers, e.g. free the pktmbufs.
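For illustration, here is a minimal sketch of one iteration of an enqueue data-path under the proposed semantics. The port id, virtqueue id and the burst size of 32 are arbitrary example values; the handling of packets the submit call does not accept is an assumption, as the retry policy is left to the application:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_vhost.h> /* assumed home of the proposed async APIs */

static void
enqueue_iteration(int vid, uint16_t queue_id, uint16_t port_id)
{
        struct rte_mbuf *pkts[32];
        struct rte_mbuf *done[32];
        uint16_t nb_rx, nb_enq, nb_done, i;

        nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);

        /* Ownership of the accepted mbufs passes to the vhost library;
         * they must not be touched until polled back below. */
        nb_enq = rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, nb_rx);

        /* Assumed: mbufs beyond nb_enq were not accepted and remain
         * owned by the application, which may retry or drop them. */
        for (i = nb_enq; i < nb_rx; i++)
                rte_pktmbuf_free(pkts[i]);

        /* Non-blocking: reclaim packets whose copies have fully
         * completed, whether done by the DMA channel or by the CPU. */
        nb_done = rte_vhost_poll_enqueue_completed(vid, queue_id, done, 32);
        for (i = 0; i < nb_done; i++)
                rte_pktmbuf_free(done[i]);
}

Note that completions reclaimed in one iteration may belong to packets submitted in earlier iterations, which is why the completed mbufs are returned through a separate array rather than inferred from the submit call.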
The "rte_vhost_poll_enqueue_completed" returns ownership for the packets wh= ose copies are all completed currently, either by the DMA channel or the CP= U. It is a non-blocking function, which will not wait for DMA copies comple= tion. After getting back the ownership of packets enqueued by "rte_vhost_su= bmit_enqueue_burst", DPDK applications can further process the packet buffe= rs, e.g. free pktmbufs. Sample Work Flow =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=20 Some DMA engines, like CBDMA, need to use physical addresses and do not sup= port I/O page fault. In addition, some guests may want to avoid memory swap= ping out. For these cases, we can pin guest memory by setting a new flag "R= TE_VHOST_USER_DMA_COPY" in rte_vhost_driver_register(). Here is an example = of how to use CBDMA to accelerate vhost enqueue operation: Step1: Implement DMA operation callbacks for CBDMA via IOAT PMD Step2: call rte_vhost_driver_register with flag "RTE_VHOST_USER_DMA_COPY" (= pin guest memory) Step3: call rte_vhost_async_channel_register to register DMA channel Step4: call rte_vhost_submit_enqueue_burst to enqueue packets Step5: call rte_vhost_poll_enqueue_completed get back the ownership of the = packets whose copies are completed Step6: call rte_pktmbuf_free to free packet mbuf Signed-off-by: Patrick Fu Signed-off-by: Jiayu Hu =20