From: "Fu, Patrick"
To: "dev@dpdk.org"
CC: Maxime Coquelin, "Ye, Xiaolong", "Hu, Jiayu", "Wang, Zhihong", "Liang, Cunming"
Date: Fri, 17 Apr 2020 07:26:34 +0000
Subject: [dpdk-dev] [RFC] Accelerating Data Movement for DPDK vHost with DMA Engines

Background
======================================

The DPDK vhost library implements a user-space VirtIO net backend that allows host applications to communicate directly with VirtIO front-ends in VMs and containers. However, every vhost enqueue/dequeue operation requires copying packet buffers between guest and host memory. The overhead of copying large amounts of data makes the vhost backend an I/O bottleneck. DMA engines, including uncore DMA accelerators such as Crystal Beach DMA (CBDMA) and the Data Streaming Accelerator (DSA), as well as general-purpose DMA engines on discrete cards, are extremely efficient at moving data within system memory. Therefore, we propose a set of asynchronous DMA data movement APIs in the vhost library for DMA acceleration.
Offloading packet copies in the vhost data-path from the CPU to a DMA engine not only accelerates data transfers, but also saves precious CPU core resources.

New API Overview
======================================

The proposed APIs in the vhost library support various DMA engines to accelerate data transfers in the data-path. For higher performance, DMA engines work in an asynchronous manner, in which DMA data transfers and CPU computations are executed in parallel. The proposal consists of control-path APIs and data-path APIs. The control-path APIs comprise the registration APIs and the DMA operation callbacks; the data-path APIs comprise the asynchronous APIs. To remove dependencies on vendor-specific DMA engines, the DMA operation callbacks provide generic abstractions for DMA data transfers. To support asynchronous DMA data movement, the new async APIs provide asynchronous ring operation semantics in the data-path. To enable or disable DMA acceleration for a virtqueue, applications use the registration APIs to register/unregister DMA callback implementations with the vhost library and to bind/unbind a DMA channel to/from a virtqueue. The DMA channels used by virtqueues are provided by the DPDK application and are backed by virtual or physical DMA devices.

The proposed APIs consist of 3 sub-sets:
1. DMA Registration APIs
2. DMA Operation Callbacks
3. Async Data APIs

DMA Registration APIs
======================================

DMA acceleration is enabled on a per-queue basis. DPDK applications need to explicitly decide whether a virtqueue needs DMA acceleration and which DMA channel to use. In addition, a DMA channel is dedicated to a virtqueue and cannot be bound to multiple virtqueues at the same time. To enable DMA acceleration for a virtqueue, a DPDK application first implements the DMA operation callbacks for a specific DMA type (e.g. CBDMA), then registers the callbacks with the vhost library to bind a DMA channel to the virtqueue, and finally uses the new async APIs to perform data-path operations on the virtqueue (see the sketch after the callback definitions below).

The definitions of the registration APIs are shown below:

int rte_vhost_async_channel_register(int vid, uint16_t queue_id,
                struct rte_vhost_async_channel_ops *ops);

int rte_vhost_async_channel_unregister(int vid, uint16_t queue_id);

"rte_vhost_async_channel_register" registers implemented DMA operation callbacks with the vhost library and binds a DMA channel to a virtqueue. DPDK applications must implement the corresponding DMA operation callbacks for the DMA engines they use. To enable DMA acceleration for a virtqueue, a DPDK application needs to explicitly call "rte_vhost_async_channel_register" for that virtqueue. The "ops" parameter points to the callback implementation.

"rte_vhost_async_channel_unregister" unregisters the DMA operation callbacks and unbinds the DMA channel from the virtqueue. If a virtqueue is not bound to a DMA channel, it uses the SW data-path without DMA acceleration.

DMA Operation Callbacks
======================================

The definitions of the DMA operation callbacks are shown below:

struct iovec { /** this is the kernel uapi structure */
        void *iov_base;         /** buffer address */
        size_t iov_len;         /** buffer length */
};

struct iov_iter {
        size_t iov_offset;      /** offset into the first iovec */
        size_t count;           /** total bytes of a packet */
        struct iovec *iov;      /** array of data buffers */
        unsigned long nr_segs;  /** number of iovec structures */
        uintptr_t usr_data;     /** app-specific memory handle */
};

struct dma_trans_desc {
        struct iov_iter *src;   /** source memory iov_iter */
        struct iov_iter *dst;   /** destination memory iov_iter */
};

struct dma_trans_status {
        uintptr_t src_usr_data; /** handle of the completed source iov_iter */
        uintptr_t dst_usr_data; /** handle of the completed destination iov_iter */
};

struct rte_vhost_async_channel_ops {
        /** Instruct a DMA channel to perform copies for a batch of packets */
        int (*transfer_data)(struct dma_trans_desc *descs,
                        uint16_t count);
        /** Check copy-completed packets from a DMA channel */
        int (*check_completed_copies)(struct dma_trans_status *usr_data,
                        uint16_t max_packets);
};

The first callback, "transfer_data", submits a batch of packet copies to a DMA channel. As a packet's source or destination buffer can be a vector of buffers or a single data stream, we use "struct dma_trans_desc" to describe the source and destination buffers of a packet. Copying a packet means moving data from the source iov_iter structure to the destination iov_iter structure. The "count" is the number of packets to copy.

The second callback, "check_completed_copies", queries the completion status of the DMA. A "usr_data" member variable is embedded in the "iov_iter" structure; it serves as a unique identifier of the memory region described by the "iov_iter". As the source/destination buffers can be scatter-gather, the DMA channel may perform its copies out of order. When all copies of an iov_iter are completed by the DMA channel, "check_completed_copies" should return the associated "usr_data" via the "dma_trans_status" structure.
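To make the callback contract concrete, below is a minimal sketch of a pure-software implementation of the two callbacks against the structures above. It copies with the CPU, so every packet is complete as soon as "transfer_data" returns; "completion_ring_push"/"completion_ring_pop" are hypothetical application helpers (e.g. backed by an rte_ring) for queuing finished identifiers, and the return-value conventions are assumptions, since this RFC does not pin them down. A real CBDMA implementation would instead program descriptors through the IOAT PMD and poll the hardware:

#include <rte_common.h>
#include <rte_memcpy.h>

/* Hypothetical helpers; a real application might back these with an
 * rte_ring of completion records. */
void completion_ring_push(uintptr_t src_usr_data, uintptr_t dst_usr_data);
int completion_ring_pop(struct dma_trans_status *out, uint16_t max);

static int
sw_transfer_data(struct dma_trans_desc *descs, uint16_t count)
{
        uint16_t i;

        for (i = 0; i < count; i++) {
                struct iov_iter *src = descs[i].src;
                struct iov_iter *dst = descs[i].dst;
                unsigned long s = 0, d = 0;
                size_t s_off = src->iov_offset;
                size_t d_off = dst->iov_offset;
                size_t left = src->count;

                /* Walk both scatter-gather lists, copying the smaller
                 * of the two current segments at each step. */
                while (left > 0 && s < src->nr_segs && d < dst->nr_segs) {
                        size_t len = RTE_MIN(left,
                                RTE_MIN(src->iov[s].iov_len - s_off,
                                        dst->iov[d].iov_len - d_off));

                        rte_memcpy((char *)dst->iov[d].iov_base + d_off,
                                   (char *)src->iov[s].iov_base + s_off,
                                   len);
                        s_off += len;
                        d_off += len;
                        left -= len;
                        if (s_off == src->iov[s].iov_len) {
                                s++;
                                s_off = 0;
                        }
                        if (d_off == dst->iov[d].iov_len) {
                                d++;
                                d_off = 0;
                        }
                }
                /* CPU copies finish inline; record the identifiers so
                 * that check_completed_copies() can report them. */
                completion_ring_push(src->usr_data, dst->usr_data);
        }
        return count; /* assumed: number of packets accepted */
}

static int
sw_check_completed_copies(struct dma_trans_status *usr_data,
                uint16_t max_packets)
{
        /* Drain up to max_packets completion records into usr_data[]. */
        return completion_ring_pop(usr_data, max_packets);
}

static struct rte_vhost_async_channel_ops sw_copy_ops = {
        .transfer_data = sw_transfer_data,
        .check_completed_copies = sw_check_completed_copies,
};

Binding this implementation to a virtqueue is then a single call: rte_vhost_async_channel_register(vid, queue_id, &sw_copy_ops).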
Async Data APIs
======================================

The definitions of the new enqueue APIs are shown below:

uint16_t rte_vhost_submit_enqueue_burst(int vid, uint16_t queue_id,
                struct rte_mbuf **pkts, uint16_t count);

uint16_t rte_vhost_poll_enqueue_completed(int vid, uint16_t queue_id,
                struct rte_mbuf **pkts, uint16_t count);

"rte_vhost_submit_enqueue_burst" enqueues a batch of packets to a virtqueue, handing ownership of the enqueued packets to the vhost library. DPDK applications cannot reuse the enqueued packets until they get the ownership back. For a virtqueue with DMA acceleration enabled via "rte_vhost_async_channel_register", "rte_vhost_submit_enqueue_burst" uses the bound DMA channel to perform the packet copies; moreover, the function is non-blocking: it merely submits the packet copies to the DMA channel without waiting for their completion. For a virtqueue without DMA acceleration, "rte_vhost_submit_enqueue_burst" uses the SW data-path, where the CPU performs the packet copies. It is worth noting that DPDK applications cannot directly reuse packet buffers enqueued by "rte_vhost_submit_enqueue_burst" even on the SW data-path.

"rte_vhost_poll_enqueue_completed" returns ownership of the packets whose copies are all completed at that point, whether performed by the DMA channel or by the CPU. It is a non-blocking function that does not wait for DMA copy completion. After getting back the ownership of packets enqueued by "rte_vhost_submit_enqueue_burst", DPDK applications can further process the packet buffers, e.g. free the pktmbufs.
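For illustration, here is a minimal sketch of one iteration of an enqueue data-path under the proposed semantics. The port id, virtqueue id and the burst size of 32 are arbitrary example values; the handling of packets the submit call does not accept is an assumption, as the retry policy is left to the application:

#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_vhost.h> /* assumed home of the proposed async APIs */

static void
enqueue_iteration(int vid, uint16_t queue_id, uint16_t port_id)
{
        struct rte_mbuf *pkts[32];
        struct rte_mbuf *done[32];
        uint16_t nb_rx, nb_enq, nb_done, i;

        nb_rx = rte_eth_rx_burst(port_id, 0, pkts, 32);

        /* Ownership of the accepted mbufs passes to the vhost library;
         * they must not be touched until polled back below. */
        nb_enq = rte_vhost_submit_enqueue_burst(vid, queue_id, pkts, nb_rx);

        /* Assumed: mbufs beyond nb_enq were not accepted and remain
         * owned by the application, which may retry or drop them. */
        for (i = nb_enq; i < nb_rx; i++)
                rte_pktmbuf_free(pkts[i]);

        /* Non-blocking: reclaim packets whose copies have fully
         * completed, whether done by the DMA channel or by the CPU. */
        nb_done = rte_vhost_poll_enqueue_completed(vid, queue_id, done, 32);
        for (i = 0; i < nb_done; i++)
                rte_pktmbuf_free(done[i]);
}

Note that completions reclaimed in one iteration may belong to packets submitted in earlier iterations, which is why the completed mbufs are returned through a separate array rather than inferred from the submit call.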
The "rte_vhost_poll_enqueue_completed" returns ownership for the packets wh= ose copies are all completed currently, either by the DMA channel or the CP= U. It is a non-blocking function, which will not wait for DMA copies comple= tion. After getting back the ownership of packets enqueued by "rte_vhost_su= bmit_enqueue_burst", DPDK applications can further process the packet buffe= rs, e.g. free pktmbufs. Sample Work Flow =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=20 Some DMA engines, like CBDMA, need to use physical addresses and do not sup= port I/O page fault. In addition, some guests may want to avoid memory swap= ping out. For these cases, we can pin guest memory by setting a new flag "R= TE_VHOST_USER_DMA_COPY" in rte_vhost_driver_register(). Here is an example = of how to use CBDMA to accelerate vhost enqueue operation: Step1: Implement DMA operation callbacks for CBDMA via IOAT PMD Step2: call rte_vhost_driver_register with flag "RTE_VHOST_USER_DMA_COPY" (= pin guest memory) Step3: call rte_vhost_async_channel_register to register DMA channel Step4: call rte_vhost_submit_enqueue_burst to enqueue packets Step5: call rte_vhost_poll_enqueue_completed get back the ownership of the = packets whose copies are completed Step6: call rte_pktmbuf_free to free packet mbuf Signed-off-by: Patrick Fu Signed-off-by: Jiayu Hu =20