From: "Hu, Jiayu"
To: Maxime Coquelin, "dev@dpdk.org"
Cc: "Ye, Xiaolong", "Wang, Zhihong"
Subject: Re: [dpdk-dev] [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD
Date: Thu, 19 Mar 2020 11:47:15 +0000
Message-ID: <39c02e0f982a4ce88ae421e6a9c7d26e@intel.com>
References: <1584436885-18651-1-git-send-email-jiayu.hu@intel.com>
 <370798e0-b006-4a33-d8d9-1aea7bf4af49@redhat.com>
 <33221483053a41e8bd8d4bd0cb582634@intel.com>
 <63c79431-96bf-79e9-fb75-1714e194257f@redhat.com>
In-Reply-To: <63c79431-96bf-79e9-fb75-1714e194257f@redhat.com>

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin
> Sent: Thursday, March 19, 2020 5:10 PM
> To: Hu, Jiayu; dev@dpdk.org
> Cc: Ye, Xiaolong; Wang, Zhihong
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> vhost-user PMD
>
> Hi Jiayu,
>
> On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> > Hi Maxime,
> >
> > Thanks for your comments. Replies are inline.
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin
> >> Sent: Tuesday, March 17, 2020 5:54 PM
> >> To: Hu, Jiayu; dev@dpdk.org
> >> Cc: Ye, Xiaolong; Wang, Zhihong
> >> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> >> vhost-user PMD
> >>
> >> Hi Jiayu,
> >>
> >> On 3/17/20 10:21 AM, Jiayu Hu wrote:
> >>> In vhost-user PMD's Tx operations, where data movement is heavily
> >>> involved, performing large memory copies usually takes up a major part
> >>> of the CPU cycles and becomes the hot spot. To offload expensive memory
> >>> operations from the CPU, this patch set proposes to leverage DMA
> >>> engines, e.g. I/OAT, a DMA engine in Intel's processors, to accelerate
> >>> large copies for vhost-user.
> >>>
> >>> Large copies are offloaded from the CPU to the DMA in an asynchronous
> >>> manner. The CPU just submits copy jobs to the DMA without waiting for
> >>> their completion. Thus, there is no CPU intervention during data
> >>> transfer; we can save precious CPU cycles and improve the overall
> >>> throughput for vhost-user PMD based applications, like OVS. During
> >>> packet transmission, large copies are offloaded to the DMA while small
> >>> copies are still performed by the CPU, due to the startup overheads
> >>> associated with the DMA.
> >>>
> >>> The vhost-user PMD is designed to support various DMA engines, but it
> >>> currently supports only I/OAT devices. In addition, I/OAT acceleration
> >>> is only enabled for Tx operations of split rings. Users can explicitly
> >>> assign an I/OAT device to a queue with the 'dmas' parameter. However,
> >>> one I/OAT device can only be used by one queue, and a queue can use
> >>> only one I/OAT device at a time.
> >>>
> >>> We measured the performance in testpmd. With 1024-byte packets,
> >>> compared with the original SW data path, the DMA-enabled vhost-user
> >>> PMD improves the throughput by around 20%~30% in the VM2VM and PVP
> >>> cases. Furthermore, with larger packets, the throughput improvement
> >>> will be higher.
> >>
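
(The large/small copy split described in the cover letter above boils down
to a per-segment size check, roughly as in the sketch below. The threshold
value and the dma_copy_submit() helper are illustrative placeholders only,
not the actual patch code and not the I/OAT rawdev API.)

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative threshold: copies at least this big go to the DMA engine. */
#define DMA_COPY_THRESHOLD 256

/*
 * Stand-in for the real "submit one copy descriptor to I/OAT" call; it
 * always fails here so that the sketch stays self-contained.
 */
static bool
dma_copy_submit(void *dst, const void *src, uint32_t len)
{
    (void)dst; (void)src; (void)len;
    return false;
}

/*
 * Copy one packet segment into the vring buffer: large copies are handed
 * to the DMA engine asynchronously, small ones are done by the CPU since
 * the DMA startup overhead would dominate for them.
 */
static inline void
copy_segment(void *dst, const void *src, uint32_t len)
{
    if (len >= DMA_COPY_THRESHOLD && dma_copy_submit(dst, src, len))
        return;                /* copy in flight, completion polled later */

    memcpy(dst, src, len);     /* small copy, or DMA submission failed */
}
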
> >>
> >> I'm not sure it should be done like that, for several reasons.
> >>
> >> First, it seems really complex for the user to get the command line
> >> right. There is no mention in the doc patch on how to bind the DMAs to
> >> the DPDK application. Are all the DMAs on the system capable of doing
> >> it?
> >
> > DMA engines in Intel CPUs are able to move data within system memory.
> > Currently we have I/OAT, and we will have DSA in the future.
>
> OK, can you give me an example of how many I/OAT instances there are on
> a given CPU?

One Xeon Platinum 8180 CPU has 8 I/OAT instances.

>
> >> I think it should be made transparent to the user, who should not have
> >> to specify the DMA device address on the command line. The user should
> >> just pass a devarg specifying that he wants to use DMAs, if available.
> >
> > What do you think of replacing the DMA address with specific DMA
> > capabilities, like "dmas=[txq0@DMACOPY]"? As I/OAT only supports data
> > movement, we can just provide a basic DMA copy capability now. When
> > there are more DMA devices, we can add more capabilities to the devargs
> > later.
>
> "dmas=[txq0@DMACOPY]" is still too complex IMHO. We should just have a
> flag to enable DMA or not (tx_dma=1 / tx_dma=0), and this would be used
> for all queues, as we do for zero-copy.
>
> >>
> >> Second, it looks too vendor-specific. IMHO, we should have a DMA
> >> framework, so that the driver can request DMA channels based on
> >> capabilities.
> >
> > We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> > a rawdev. IMO, it will be very hard to provide a generic DMA abstraction
> > currently. In addition, the I/OAT-specific API is called inside the
> > vhost-user PMD, so we can replace these function calls when we have a
> > DMA framework in the future. Users would be unaware of the changes.
> > Does it make sense to you?
>
> Having an abstraction might be hard, but it does not seem impossible.
> Such a DMA abstraction has been done in the kernel for I/OAT. For example:
> https://lore.kernel.org/patchwork/cover/56714/
>
> >>
> >> Also, I don't think implementing ring processing in the Vhost PMD is
> >> welcome; the Vhost PMD should just be a wrapper for the Vhost library.
> >> Doing that in the Vhost PMD causes code duplication, and will be a
> >> maintenance burden in the long run.
> >>
> >> As I/OAT is a kind of acceleration, why not implement it through the
> >> vDPA framework? The vDPA framework should be extended to support this
> >> kind of acceleration, which requires some CPU processing, as opposed to
> >> the full offload of ring processing, which is all it supports today.
> >
> > The main reason for implementing the data path in the vhost PMD is to
> > avoid impacting the SW data path in the vhost library. Even if we
> > implement it as an instance of vDPA, we still have to implement the data
> > path in a new vdev PMD, as the DMA just accelerates memory copies and
> > all ring operations have to be done by the CPU. So there is still the
> > code duplication issue.
>
> Ok, so what about:
>
> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
> dequeue.
>
> lib/librte_vhost/ioat.c which would implement the dma_enqueue and
> dma_dequeue callbacks for I/OAT. As it will live in the vhost lib
> directory, it will be easy to refactor the code to share as much as
> possible and so avoid code duplication.
>
> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
> call it instead of the SW datapath. It adds a few cycles, but this is
> much more sane IMHO.

The problem is that the current semantics of the rte_vhost_enqueue/dequeue
APIs conflict with the I/OAT-accelerated data path. To improve performance,
the I/OAT works in an asynchronous manner: the CPU just submits copy jobs
to the I/OAT without waiting for their completion. With
rte_vhost_enqueue_burst, users cannot reuse the enqueued pktmbufs when it
returns, as the I/OAT may still be using them. With rte_vhost_dequeue_burst,
users will not get the incoming packets, as the I/OAT may still be
performing the packet copies. As you can see, when I/OAT acceleration is
enabled, the semantics of the two APIs change. Keeping the same API names
while changing their semantics may confuse users, IMHO.
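
Roughly, the flow looks like the sketch below; every name in it is just a
placeholder for illustration, not the actual I/OAT rawdev API nor the code
in this patch set.

#include <stdint.h>
#include <stdbool.h>

/*
 * Stand-ins for the real I/OAT submit/poll calls -- the stubs below do
 * nothing, they only exist so the sketch compiles.
 */
static bool dma_submit_copy(void *dst, const void *src, uint32_t len, void *tag)
{
    (void)dst; (void)src; (void)len; (void)tag;
    return true;                       /* pretend the descriptor was queued */
}

static int dma_poll_completions(void **tags, int max)
{
    (void)tags; (void)max;
    return 0;                          /* pretend nothing has completed yet */
}

/*
 * Tx with DMA acceleration: copies are only *submitted* here. When this
 * returns, the DMA engine may still be reading the mbufs, so the caller
 * must not free or reuse them yet -- unlike the current
 * rte_vhost_enqueue_burst() contract, where the copy is done on return.
 */
static uint16_t
dma_accelerated_enqueue(void **pkts, void **ring_bufs, uint32_t *lens,
                        uint16_t count)
{
    uint16_t i;

    for (i = 0; i < count; i++)
        if (!dma_submit_copy(ring_bufs[i], pkts[i], lens[i], pkts[i]))
            break;                     /* DMA ring full, stop submitting */

    return i;                          /* number submitted, not completed */
}

/*
 * Completions are harvested later (e.g. on the next Tx call); only then
 * can the corresponding mbufs be freed and the used ring be updated.
 */
static void
dma_harvest_completions(void)
{
    void *done[64];
    int n = dma_poll_completions(done, 64);

    /* free mbufs / update the used ring for the n completed copies */
    (void)n;
}

The key point is that completions are only harvested on a later call, which
is why the current enqueue/dequeue contracts no longer hold.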

Thanks,
Jiayu

>
> What do you think?
>
> Thanks,
> Maxime
>
> > Thanks,
> > Jiayu
> >
> >>
> >> Kind regards,
> >> Maxime
> >>
> >>> Jiayu Hu (4):
> >>>   vhost: populate guest memory for DMA-accelerated vhost-user
> >>>   net/vhost: setup vrings for DMA-accelerated datapath
> >>>   net/vhost: leverage DMA engines to accelerate Tx operations
> >>>   doc: add I/OAT acceleration support for vhost-user PMD
> >>>
> >>>  doc/guides/nics/vhost.rst         |  14 +
> >>>  drivers/Makefile                  |   2 +-
> >>>  drivers/net/vhost/Makefile        |   6 +-
> >>>  drivers/net/vhost/internal.h      | 160 +++++++
> >>>  drivers/net/vhost/meson.build     |   5 +-
> >>>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
> >>>  drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
> >>>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
> >>>  lib/librte_vhost/rte_vhost.h      |   1 +
> >>>  lib/librte_vhost/socket.c         |  20 +
> >>>  lib/librte_vhost/vhost.h          |   2 +
> >>>  lib/librte_vhost/vhost_user.c     |   3 +-
> >>>  12 files changed, 1597 insertions(+), 73 deletions(-)
> >>>  create mode 100644 drivers/net/vhost/internal.h
> >>>  create mode 100644 drivers/net/vhost/virtio_net.c
> >>>  create mode 100644 drivers/net/vhost/virtio_net.h
> >>>
> >