From: "Hu, Jiayu"
To: Maxime Coquelin, "dev@dpdk.org"
Cc: "Ye, Xiaolong", "Wang, Zhihong"
Subject: Re: [dpdk-dev] [PATCH 0/4] Support DMA-accelerated Tx operations for vhost-user PMD
Date: Thu, 19 Mar 2020 11:47:15 +0000
Message-ID: <39c02e0f982a4ce88ae421e6a9c7d26e@intel.com>
References: <1584436885-18651-1-git-send-email-jiayu.hu@intel.com>
 <370798e0-b006-4a33-d8d9-1aea7bf4af49@redhat.com>
 <33221483053a41e8bd8d4bd0cb582634@intel.com>
 <63c79431-96bf-79e9-fb75-1714e194257f@redhat.com>
In-Reply-To: <63c79431-96bf-79e9-fb75-1714e194257f@redhat.com>

Hi Maxime,

> -----Original Message-----
> From: Maxime Coquelin
> Sent: Thursday, March 19, 2020 5:10 PM
> To: Hu, Jiayu; dev@dpdk.org
> Cc: Ye, Xiaolong; Wang, Zhihong
> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> vhost-user PMD
>
> Hi Jiayu,
>
> On 3/19/20 8:33 AM, Hu, Jiayu wrote:
> > Hi Maxime,
> >
> > Thanks for your comments. Replies are inline.
> >
> >> -----Original Message-----
> >> From: Maxime Coquelin
> >> Sent: Tuesday, March 17, 2020 5:54 PM
> >> To: Hu, Jiayu; dev@dpdk.org
> >> Cc: Ye, Xiaolong; Wang, Zhihong
> >> Subject: Re: [PATCH 0/4] Support DMA-accelerated Tx operations for
> >> vhost-user PMD
> >>
> >> Hi Jiayu,
> >>
> >> On 3/17/20 10:21 AM, Jiayu Hu wrote:
> >>> In vhost-user PMD's Tx operations, where data movement is heavily
> >>> involved, performing large memory copies usually takes up a major part
> >>> of the CPU cycles and becomes the hot spot. To offload expensive memory
> >>> operations from the CPU, this patch set proposes to leverage DMA
> >>> engines, e.g. I/OAT, a DMA engine in Intel's processors, to accelerate
> >>> large copies for vhost-user.
> >>>
> >>> Large copies are offloaded from the CPU to the DMA in an asynchronous
> >>> manner. The CPU just submits copy jobs to the DMA without waiting for
> >>> their completion. Thus, there is no CPU intervention during data
> >>> transfer; we can save precious CPU cycles and improve the overall
> >>> throughput for vhost-user PMD based applications, like OVS. During
> >>> packet transmission, large copies are offloaded to the DMA while small
> >>> copies are still performed by the CPU, due to the startup overheads
> >>> associated with the DMA.
> >>>
> >>> The vhost-user PMD is designed to support various DMA engines, but it
> >>> currently supports only I/OAT devices. In addition, I/OAT acceleration
> >>> is only enabled for Tx operations of split rings. Users can explicitly
> >>> assign an I/OAT device to a queue with the 'dmas' parameter. However,
> >>> one I/OAT device can only be used by one queue, and a queue can use
> >>> only one I/OAT device at a time.
> >>>
> >>> We measured the performance in testpmd. With 1024-byte packets,
> >>> compared with the original SW data path, the DMA-enabled vhost-user
> >>> PMD improves the throughput by around 20%~30% in the VM2VM and PVP
> >>> cases. Furthermore, with larger packets, the throughput improvement
> >>> will be higher.
> >>
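
(The large/small copy split described in the cover letter above boils down
to a per-segment size check, roughly as in the sketch below. The threshold
value and the dma_copy_submit() helper are illustrative placeholders only,
not the actual patch code and not the I/OAT rawdev API.)

#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative threshold: copies at least this big go to the DMA engine. */
#define DMA_COPY_THRESHOLD 256

/*
 * Stand-in for the real "submit one copy descriptor to I/OAT" call; it
 * always fails here so that the sketch stays self-contained.
 */
static bool
dma_copy_submit(void *dst, const void *src, uint32_t len)
{
    (void)dst; (void)src; (void)len;
    return false;
}

/*
 * Copy one packet segment into the vring buffer: large copies are handed
 * to the DMA engine asynchronously, small ones are done by the CPU since
 * the DMA startup overhead would dominate for them.
 */
static inline void
copy_segment(void *dst, const void *src, uint32_t len)
{
    if (len >= DMA_COPY_THRESHOLD && dma_copy_submit(dst, src, len))
        return;                /* copy in flight, completion polled later */

    memcpy(dst, src, len);     /* small copy, or DMA submission failed */
}
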
> >>
> >> I'm not sure it should be done like that, for several reasons.
> >>
> >> First, it seems really complex for the user to get the command line
> >> right. There is no mention in the doc patch on how to bind the DMAs to
> >> the DPDK application. Are all the DMAs on the system capable of doing
> >> it?
> >
> > DMA engines in Intel CPUs are able to move data within system memory.
> > Currently we have I/OAT, and we will have DSA in the future.
>
> OK, can you give me an example of how many I/OAT instances there are on
> a given CPU?

One Xeon Platinum 8180 CPU has 8 I/OAT instances.

>
> >> I think it should be made transparent to the user, who should not have
> >> to specify the DMA device address on the command line. The user should
> >> just pass a devarg specifying that he wants to use DMAs, if available.
> >
> > What do you think of replacing the DMA address with specific DMA
> > capabilities, like "dmas=[txq0@DMACOPY]"? As I/OAT only supports data
> > movement, we can just provide a basic DMA copy capability now. When
> > there are more DMA devices, we can add more capabilities to the devargs
> > later.
>
> "dmas=[txq0@DMACOPY]" is still too complex IMHO. We should just have a
> flag to enable DMA or not (tx_dma=1 / tx_dma=0), and this would be used
> for all queues, as we do for zero-copy.
>
> >>
> >> Second, it looks too vendor-specific. IMHO, we should have a DMA
> >> framework, so that the driver can request DMA channels based on
> >> capabilities.
> >
> > We only have one DMA engine, I/OAT, in DPDK, and it is implemented as
> > a rawdev. IMO, it will be very hard to provide a generic DMA abstraction
> > currently. In addition, the I/OAT-specific API is called inside the
> > vhost-user PMD, so we can replace these function calls when we have a
> > DMA framework in the future. Users would be unaware of the changes.
> > Does it make sense to you?
>
> Having an abstraction might be hard, but it does not seem impossible.
> Such a DMA abstraction has been done in the kernel for I/OAT. For example:
> https://lore.kernel.org/patchwork/cover/56714/
>
> >>
> >> Also, I don't think implementing ring processing in the Vhost PMD is
> >> welcome; the Vhost PMD should just be a wrapper for the Vhost library.
> >> Doing that in the Vhost PMD causes code duplication, and will be a
> >> maintenance burden in the long run.
> >>
> >> As I/OAT is a kind of acceleration, why not implement it through the
> >> vDPA framework? The vDPA framework should be extended to support this
> >> kind of acceleration, which requires some CPU processing, as opposed to
> >> the full offload of ring processing, which is all it supports today.
> >
> > The main reason for implementing the data path in the vhost PMD is to
> > avoid impacting the SW data path in the vhost library. Even if we
> > implement it as an instance of vDPA, we still have to implement the data
> > path in a new vdev PMD, as the DMA just accelerates memory copies and
> > all ring operations have to be done by the CPU. So there is still the
> > code duplication issue.
>
> Ok, so what about:
>
> Introducing a pair of callbacks in struct virtio_net for DMA enqueue and
> dequeue.
>
> lib/librte_vhost/ioat.c which would implement the dma_enqueue and
> dma_dequeue callbacks for I/OAT. As it will live in the vhost lib
> directory, it will be easy to refactor the code to share as much as
> possible and so avoid code duplication.
>
> In rte_vhost_enqueue/dequeue_burst, if the dma callback is set, then
> call it instead of the SW datapath. It adds a few cycles, but this is
> much more sane IMHO.

The problem is that the current semantics of the rte_vhost_enqueue/dequeue
APIs conflict with the I/OAT-accelerated data path. To improve performance,
the I/OAT works in an asynchronous manner: the CPU just submits copy jobs
to the I/OAT without waiting for their completion. With
rte_vhost_enqueue_burst, users cannot reuse the enqueued pktmbufs when it
returns, as the I/OAT may still be using them. With rte_vhost_dequeue_burst,
users will not get the incoming packets, as the I/OAT may still be
performing the packet copies. As you can see, when I/OAT acceleration is
enabled, the semantics of the two APIs change. Keeping the same API names
while changing their semantics may confuse users, IMHO.
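
Roughly, the flow looks like the sketch below; every name in it is just a
placeholder for illustration, not the actual I/OAT rawdev API nor the code
in this patch set.

#include <stdint.h>
#include <stdbool.h>

/*
 * Stand-ins for the real I/OAT submit/poll calls -- the stubs below do
 * nothing, they only exist so the sketch compiles.
 */
static bool dma_submit_copy(void *dst, const void *src, uint32_t len, void *tag)
{
    (void)dst; (void)src; (void)len; (void)tag;
    return true;                       /* pretend the descriptor was queued */
}

static int dma_poll_completions(void **tags, int max)
{
    (void)tags; (void)max;
    return 0;                          /* pretend nothing has completed yet */
}

/*
 * Tx with DMA acceleration: copies are only *submitted* here. When this
 * returns, the DMA engine may still be reading the mbufs, so the caller
 * must not free or reuse them yet -- unlike the current
 * rte_vhost_enqueue_burst() contract, where the copy is done on return.
 */
static uint16_t
dma_accelerated_enqueue(void **pkts, void **ring_bufs, uint32_t *lens,
                        uint16_t count)
{
    uint16_t i;

    for (i = 0; i < count; i++)
        if (!dma_submit_copy(ring_bufs[i], pkts[i], lens[i], pkts[i]))
            break;                     /* DMA ring full, stop submitting */

    return i;                          /* number submitted, not completed */
}

/*
 * Completions are harvested later (e.g. on the next Tx call); only then
 * can the corresponding mbufs be freed and the used ring be updated.
 */
static void
dma_harvest_completions(void)
{
    void *done[64];
    int n = dma_poll_completions(done, 64);

    /* free mbufs / update the used ring for the n completed copies */
    (void)n;
}

The key point is that completions are only harvested on a later call, which
is why the current enqueue/dequeue contracts no longer hold.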

Thanks,
Jiayu

>
> What do you think?
>
> Thanks,
> Maxime
>
> > Thanks,
> > Jiayu
> >
> >>
> >> Kind regards,
> >> Maxime
> >>
> >>> Jiayu Hu (4):
> >>>   vhost: populate guest memory for DMA-accelerated vhost-user
> >>>   net/vhost: setup vrings for DMA-accelerated datapath
> >>>   net/vhost: leverage DMA engines to accelerate Tx operations
> >>>   doc: add I/OAT acceleration support for vhost-user PMD
> >>>
> >>>  doc/guides/nics/vhost.rst         |  14 +
> >>>  drivers/Makefile                  |   2 +-
> >>>  drivers/net/vhost/Makefile        |   6 +-
> >>>  drivers/net/vhost/internal.h      | 160 +++++++
> >>>  drivers/net/vhost/meson.build     |   5 +-
> >>>  drivers/net/vhost/rte_eth_vhost.c | 308 +++++++++++---
> >>>  drivers/net/vhost/virtio_net.c    | 861 ++++++++++++++++++++++++++++++++++++++
> >>>  drivers/net/vhost/virtio_net.h    | 288 +++++++++++++
> >>>  lib/librte_vhost/rte_vhost.h      |   1 +
> >>>  lib/librte_vhost/socket.c         |  20 +
> >>>  lib/librte_vhost/vhost.h          |   2 +
> >>>  lib/librte_vhost/vhost_user.c     |   3 +-
> >>>  12 files changed, 1597 insertions(+), 73 deletions(-)
> >>>  create mode 100644 drivers/net/vhost/internal.h
> >>>  create mode 100644 drivers/net/vhost/virtio_net.c
> >>>  create mode 100644 drivers/net/vhost/virtio_net.h
> >>>
> >