From: "Xie, Huawei"
To: Tetsuya Mukawa, Zhuangyanying, dev@dpdk.org
Cc: nakajima.yoshihiro@lab.ntt.co.jp, zhbzg@huawei.com, gaoxiaoqiu@huawei.com,
    oscar.zhangbo@huawei.com, zhoujingbin@huawei.com, guohongzhen@huawei.com
Date: Tue, 25 Aug 2015 09:56:59 +0000
Subject: Re: [dpdk-dev] vhost compliant virtio based networking interface in container

On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
> Hi Xie and Yanping,
>
> May I ask you some questions?
> It seems we are also developing almost the same thing.

Good to know that we are tackling the same problem and have a similar idea.
What is your status now? We have the POC running, and it is compliant with
dpdk vhost. Interrupt-like notification isn't supported.

> On 2015/08/20 19:14, Xie, Huawei wrote:
>> Added dev@dpdk.org
>>
>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>> Yanping:
>>> I read your mail; it seems what we did is quite similar. Here I wrote a
>>> quick mail to describe our design. Let me know if it is the same thing.
>>>
>>> Problem statement:
>>> We don't have a high performance networking interface in containers for
>>> NFV. The current veth pair based interface can't be easily accelerated.
>>>
>>> The key components involved:
>>> 1. DPDK based virtio PMD driver in the container.
>>> 2. Device simulation framework in the container.
>>> 3. dpdk (or kernel) vhost running on the host.
>>>
>>> How is virtio created?
>>> A: There is no "real" virtio-pci device in the container environment.
>>> 1) The host maintains pools of memory and shares memory with the
>>> container. This could be accomplished by the host sharing a hugepage
>>> file with the container.
>>> 2) The container creates virtio rings on the shared memory.
>>> 3) The container creates mbuf memory pools on the shared memory.
>>> 4) The container sends the memory and vring information to vhost
>>> through vhost messages. This could be done either through ioctl calls
>>> or vhost-user messages.
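
To make step 1) above concrete, here is a minimal sketch of the container
side mapping a hugepage file shared by the host. The path and size are just
placeholders and error handling is trimmed; it illustrates the idea, not our
actual code:

/* Minimal sketch of step 1): the container maps a hugepage file
 * that the host created and shared with it (e.g. via a bind-mounted
 * hugetlbfs). The path and size are placeholders for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_MEM_SIZE (256UL * 1024 * 1024)   /* illustrative size */

int main(void)
{
    /* hugetlbfs file the host shared into the container */
    int fd = open("/dev/hugepages/container_shm", O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Map it into the container's virtual address space. Vrings and
     * mbuf pools (steps 2) and 3)) are then carved out of this region. */
    void *base = mmap(NULL, SHARED_MEM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("shared region mapped at %p\n", base);
    /* ... create vrings / mempools inside [base, base + size) ... */

    munmap(base, SHARED_MEM_SIZE);
    close(fd);
    return 0;
}

The region mapped here is what the memory table message in step 4) later
describes to vhost.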
>>>
>>> How is the vhost message sent?
>>> A: There are two alternative ways to do this.
>>> 1) The customized virtio PMD is responsible for all the vring creation
>>> and vhost message sending.
> Above is our approach so far.
> It seems Yanping also takes this kind of approach.
> We are using the vhost-user functionality instead of the vhost-net
> kernel module.
> Probably this is the difference between Yanping and us.

In my current implementation, the device simulation layer talks to "user
space" vhost through the cuse interface. It could also be done through the
vhost-user socket. This isn't the key point.
Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
accurate, covering either cuse or a unix domain socket. :)

As for Yanping, they are now connecting to the vhost-net kernel module, but
they are also trying to connect to "user space" vhost. Correct me if I am
wrong.
Yes, there is some difference between these two. The vhost-net kernel module
can directly access another process's memory, while with user space vhost
(cuse/socket) we need to do the memory mapping.

> BTW, we are going to submit a vhost PMD for DPDK 2.2.
> This PMD is implemented on top of librte_vhost.
> It allows a DPDK application to handle a vhost-user (cuse) backend as a
> normal NIC port.
> This PMD should work with both Xie's and Yanping's approaches.
> (In the case of Yanping's approach, we may need vhost-cuse.)

>>> 2) We could do this through a lightweight device simulation framework.
>>> The device simulation creates a simple PCI bus. On the PCI bus,
>>> virtio-net PCI devices are created. The device simulation provides an
>>> IOAPI for MMIO/IO access.
> Does it mean you implemented a kernel module?
> If so, do you still need the vhost-cuse functionality to handle vhost
> messages in userspace?

The device simulation is a library running in user space in the container.
It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
devices.
The virtio-container-PMD configures the virtio-net pseudo devices through
the IOAPI provided by the device simulation rather than through IO
instructions as in KVM.
Why do we use device simulation?
We could create other virtio devices in the container, and provide a common
way to talk to the vhost-xx modules.

>>> 2.1 The virtio PMD configures the pseudo virtio device just as it does
>>> in the KVM guest environment.
>>> 2.2 Rather than using IO instructions, the virtio PMD uses the IOAPI
>>> for IO operations on the virtio-net PCI device.
>>> 2.3 The device simulation is responsible for the device state machine
>>> simulation.
>>> 2.4 The device simulation is responsible for talking to vhost.
>>> With this approach, we could minimize the virtio PMD modifications.
>>> For the virtio PMD it is like configuring a real virtio-net PCI device.
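
To give a feel for 2.2, below is a rough sketch of what the IOAPI could look
like: the device simulation exposes read/write callbacks for a pseudo
virtio-net device, and the PMD calls those instead of issuing IO
instructions. The struct names and register offsets here are made up for
illustration; they are not our actual interface:

/* Sketch of the IOAPI idea: callbacks provided by the device simulation,
 * called by the virtio PMD in place of real port IO. Illustrative only. */
#include <stdint.h>
#include <stdio.h>

struct pseudo_virtio_dev;

/* IOAPI exposed by the device simulation for one pseudo virtio-net device */
struct virtio_ioapi {
    uint32_t (*io_read)(struct pseudo_virtio_dev *dev, uint64_t off, int len);
    void     (*io_write)(struct pseudo_virtio_dev *dev, uint64_t off,
                         uint32_t val, int len);
};

/* Device state kept by the simulation (state machine, vring info, ...) */
struct pseudo_virtio_dev {
    uint32_t status;            /* device status register */
    uint64_t queue_pfn;         /* frame number written by the PMD */
    struct virtio_ioapi ops;
};

/* Illustrative register offsets, not the real virtio layout */
#define REG_STATUS     0x00
#define REG_QUEUE_PFN  0x08

static uint32_t sim_io_read(struct pseudo_virtio_dev *dev, uint64_t off, int len)
{
    (void)len;
    return off == REG_STATUS ? dev->status : 0;
}

static void sim_io_write(struct pseudo_virtio_dev *dev, uint64_t off,
                         uint32_t val, int len)
{
    (void)len;
    if (off == REG_STATUS)
        dev->status = val;      /* drive the device state machine here */
    else if (off == REG_QUEUE_PFN)
        dev->queue_pfn = val;   /* the simulation would turn this into a
                                   SET_VRING_ADDR style vhost message */
}

int main(void)
{
    struct pseudo_virtio_dev dev = {
        .ops = { .io_read = sim_io_read, .io_write = sim_io_write },
    };

    /* What the virtio PMD would do instead of outl()/inl(): */
    dev.ops.io_write(&dev, REG_STATUS, 0x1, 4);
    printf("status = 0x%x\n", (unsigned)dev.ops.io_read(&dev, REG_STATUS, 4));
    return 0;
}

Because the PMD only ever touches the device through these callbacks, the
same PMD code path works whether the registers are backed by a real device
(KVM) or by the simulation library.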
>>>
>>> Memory mapping?
>>> A: QEMU could access the whole guest memory in the KVM environment. We
>>> need to fill that gap.
>>> The container maps the shared memory into the container's virtual
>>> address space and the host maps it into the host's virtual address
>>> space. There is a fixed offset mapping.
>>> The container creates the shared vrings on that memory. The container
>>> also creates the mbuf memory pool on the shared memory.
>>> In the VHOST_SET_MEM_TABLE message, we send the memory mapping
>>> information for the shared memory. As we require the mbuf pool to be
>>> created on the shared memory, and buffers are allocated from the mbuf
>>> pools, dpdk vhost can translate the GPA in a vring desc to a host
>>> virtual address.
>>>
>>> GPA or CVA in the vring desc?
>>> A: To ease the memory translation, rather than using GPA, here we use
>>> CVA (container virtual address). This is the tricky thing here.
>>> 1) The virtio PMD writes the vring's VFN rather than the PFN to the PFN
>>> register through the IOAPI.
>>> 2) The device simulation framework will use the VFN as the PFN.
>>> 3) The device simulation sends SET_VRING_ADDR with the CVA.
>>> 4) The virtio PMD fills the vring desc with the CVA of the mbuf data
>>> pointer rather than the GPA.
>>> So when the host sees the CVA, it can translate it to an HVA (host
>>> virtual address).
>>>
>>> Worth noting:
>>> The virtio interface in the container follows the vhost message format
>>> and is compliant with the dpdk vhost implementation, i.e., no dpdk
>>> vhost modification is needed.
>>> vhost isn't aware of whether the incoming virtio comes from a KVM guest
>>> or a container.
>>>
>>> That pretty much covers the high level design. There are quite a few
>>> low level issues. For example, a 32-bit PFN is enough for a KVM guest,
>>> but since we use a 64-bit VFN (virtual page frame number), the trick is
>>> done here through a special IOAPI.
> In addition to the above, we might consider the "namespace" kernel
> functionality.
> Technically, it would not be a big problem, but it is related to
> security, so it would be nice to take it into account.

There is no namespace concept here because we don't generate kernel netdev
devices. It might be useful if we could extend our work to support a kernel
netdev interface and assign it to the container's namespace.

> Regards,
> Tetsuya

>>> /huawei
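
One more note on the "fixed offset mapping" mentioned above: once the host
knows where the shared region sits in the container's address space and in
its own, translating a CVA found in a vring desc is just pointer arithmetic.
A tiny sketch follows; the names, the single-region assumption and the
example addresses are illustrative only, not the actual dpdk vhost code:

/* Sketch of the fixed-offset CVA -> HVA translation described above.
 * Assumes a single shared region; real code would walk a memory table. */
#include <stdint.h>
#include <stdio.h>

struct shared_region {
    uint64_t container_va;  /* CVA base: where the container mapped the region */
    uint64_t host_va;       /* HVA base: where the host mapped the same region */
    uint64_t size;
};

static void *cva_to_hva(const struct shared_region *r, uint64_t cva)
{
    if (cva < r->container_va || cva >= r->container_va + r->size)
        return NULL;                 /* address not in the shared region */
    /* Fixed offset between the two mappings of the same memory */
    return (void *)(uintptr_t)(cva - r->container_va + r->host_va);
}

int main(void)
{
    struct shared_region r = {
        .container_va = 0x7f0000000000ULL,   /* made-up example addresses */
        .host_va      = 0x7e0000000000ULL,
        .size         = 256ULL * 1024 * 1024,
    };
    uint64_t cva = 0x7f0000001000ULL;        /* e.g. a CVA from a vring desc */

    printf("cva 0x%llx -> hva %p\n",
           (unsigned long long)cva, cva_to_hva(&r, cva));
    return 0;
}

The same arithmetic covers both the vring addresses and the mbuf data
pointers, which is why we require the mbuf pools to live on the shared
memory.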