From: "Xie, Huawei"
To: Tetsuya Mukawa , "dev@dpdk.org"
Cc: "nakajima.yoshihiro@lab.ntt.co.jp" , "zhbzg@huawei.com" ,
 "gaoxiaoqiu@huawei.com" , "oscar.zhangbo@huawei.com" , Zhuangyanying ,
 "zhoujingbin@huawei.com" , "guohongzhen@huawei.com"
Date: Mon, 7 Sep 2015 05:54:13 +0000
Subject: Re: [dpdk-dev] vhost compliant virtio based networking interface
 in container

On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
> On 2015/08/25 18:56, Xie, Huawei wrote:
>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>> Hi Xie and Yanping,
>>>
>>> May I ask you some questions?
>>> It seems we are also developing an almost identical one.
>> Good to know that we are tackling the same problem and have a similar
>> idea.
>> What is your status now? We have the POC running, and it is compliant
>> with dpdk vhost.
>> Interrupt-like notification isn't supported yet.
> We implemented the vhost PMD first, so we have just started implementing it.
>
>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>> Added dev@dpdk.org
>>>>
>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>> Yanping:
>>>>> I read your mail; it seems what we did is quite similar. Here I wrote
>>>>> a quick mail to describe our design. Let me know if it is the same
>>>>> thing.
>>>>>
>>>>> Problem statement:
>>>>> We don't have a high performance networking interface in containers
>>>>> for NFV. The current veth pair based interface can't easily be
>>>>> accelerated.
>>>>>
>>>>> The key components involved:
>>>>> 1. DPDK based virtio PMD driver in the container.
>>>>> 2. Device simulation framework in the container.
>>>>> 3. dpdk (or kernel) vhost running in the host.
>>>>>
>>>>> How is virtio created?
>>>>> A: There is no "real" virtio-pci device in the container environment.
>>>>> 1) The host maintains pools of memory, and shares memory with the
>>>>> container. This could be accomplished by the host sharing a hugepage
>>>>> file with the container.
>>>>> 2) The container creates virtio rings on the shared memory.
>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>> 4) The container sends the memory and vring information to vhost
>>>>> through vhost messages. This could be done either through an ioctl
>>>>> call or a vhost-user message.
>>>>>
>>>>> How is the vhost message sent?
>>>>> A: There are two alternative ways to do this.
>>>>> 1) The customized virtio PMD is responsible for all the vring
>>>>> creation and vhost message sending.
>>> The above is our approach so far.
>>> It seems Yanping also takes this kind of approach.
>>> We are using the vhost-user functionality instead of the vhost-net
>>> kernel module.
>>> Probably this is the difference between Yanping and us.
>> In my current implementation, the device simulation layer talks to "user
>> space" vhost through the cuse interface. It could also be done through
>> the vhost-user socket. This isn't the key point.
>> Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
>> accurate, covering either cuse or the unix domain socket. :)
>>
>> As for Yanping, they are now connecting to the vhost-net kernel module,
>> but they are also trying to connect to "user space" vhost. Correct me if
>> I am wrong.
>> Yes, there is some difference between these two. The vhost-net kernel
>> module can directly access another process's memory, while with
>> vhost-user (cuse or socket) we need to do the memory mapping.
>>> BTW, we are going to submit a vhost PMD for DPDK 2.2.
>>> This PMD is implemented on librte_vhost.
>>> It allows a DPDK application to handle a vhost-user (cuse) backend as a
>>> normal NIC port.
>>> This PMD should work with both Xie's and Yanping's approaches.
>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>
>>>>> 2) We could do this through a lightweight device simulation framework.
>>>>> The device simulation creates a simple PCI bus. On the PCI bus,
>>>>> virtio-net PCI devices are created. The device simulation provides an
>>>>> IOAPI for MMIO/IO access.
>>> Does it mean you implemented a kernel module?
>>> If so, do you still need the vhost-cuse functionality to handle vhost
>>> messages in userspace?
>> The device simulation is a library running in user space in the
>> container. It is linked with the DPDK app. It creates pseudo buses and
>> virtio-net PCI devices.
>> The virtio-container-PMD configures the virtio-net pseudo devices
>> through the IOAPI provided by the device simulation rather than through
>> IO instructions as in a KVM guest.
>> Why do we use device simulation?
>> We could create other virtio devices in the container, and provide a
>> common way to talk to the vhost-xx module.
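To make the memory sharing and vhost message part above a bit more
concrete, below is a rough sketch of what the container side does.
Illustrative only, not our actual code: struct mem_region_info and
share_region_with_vhost() are made-up names, and the real message layout
has to follow the vhost-user/cuse protocol that librte_vhost expects. The
container mmaps the hugepage file shared by the host and passes the fd
plus the address/size information to user space vhost over the unix
domain socket with SCM_RIGHTS.

/*
 * Simplified sketch: share a hugepage-backed region with user space vhost.
 */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

struct mem_region_info {            /* hypothetical, simplified payload */
        uint64_t guest_phys_addr;   /* here: the container virtual address */
        uint64_t size;
        uint64_t mmap_offset;
};

static int share_region_with_vhost(const char *sock_path,
                                   const char *hugefile, size_t size)
{
        int memfd = open(hugefile, O_RDWR);
        if (memfd < 0)
                return -1;

        void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, memfd, 0);
        if (base == MAP_FAILED)
                return -1;

        int s = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        strncpy(sa.sun_path, sock_path, sizeof(sa.sun_path) - 1);
        if (connect(s, (struct sockaddr *)&sa, sizeof(sa)) < 0)
                return -1;

        struct mem_region_info info = {
                .guest_phys_addr = (uintptr_t)base, /* CVA used in place of GPA */
                .size = size,
                .mmap_offset = 0,
        };

        /* The payload goes as normal data; the fd of the hugepage file is
         * passed as ancillary data (SCM_RIGHTS) so that the vhost process
         * can mmap the same memory itself. */
        struct iovec iov = { .iov_base = &info, .iov_len = sizeof(info) };
        union {
                char buf[CMSG_SPACE(sizeof(int))];
                struct cmsghdr align;
        } control;
        struct msghdr msg = {
                .msg_iov = &iov, .msg_iovlen = 1,
                .msg_control = control.buf,
                .msg_controllen = sizeof(control.buf),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &memfd, sizeof(int));

        return sendmsg(s, &msg, 0) < 0 ? -1 : 0;
}

On the vhost side the fd is mmapped again, which gives the fixed offset
between the two mappings mentioned later in this mail.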
> Thanks for the explanation.
> At first reading, I thought the difference between approach 1 and
> approach 2 was whether we need to implement a new kernel module or not.
> But now I understand how you implemented it.
>
> Please let me explain our design more.
> We might use a somewhat similar approach to handle a pseudo virtio-net
> device in DPDK.
> (Anyway, we haven't finished implementing it yet, so this overview might
> have some technical problems.)
>
> Step 1. Separate the virtio-net and vhost-user socket related code from
> QEMU, then implement it as a separate program.
> The program also has the features below.
> - Create a directory that contains almost the same files as
> /sys/bus/pci/devices/<addr>/*
> (To scan these files located outside of sysfs, we need to fix the EAL.)
> - This dummy device is driven by dummy-virtio-net-driver. This name is
> specified by the '<addr>/driver' file.
> - Create a shared file that represents the PCI configuration space, then
> mmap it, and also specify the path in '<addr>/resource_path'.
>
> The program will be GPL, but it will be like a bridge on the shared
> memory between the virtio-net PMD and the DPDK vhost backend.
> Actually, it will work under the virtio-net PMD, but we don't need to
> link it.
> So I guess we don't have a GPL license issue.
>
> Step 2. Fix the PCI scan code of the EAL to scan dummy devices.
> - To scan the above files, extend pci_scan() of the EAL.
>
> Step 3. Add a new kdrv type to the EAL.
> - To handle the 'dummy-virtio-net-driver', add a new kdrv type to the EAL.
>
> Step 4. Implement pci_dummy_virtio_net_map/unmap().
> - It will have almost the same functionality as pci_uio_map(), but for
> the dummy virtio-net device.
> - The dummy device will be mmapped using the path specified in
> '<addr>/resource_path'.
>
> Step 5. Add a new compile option for the virtio-net device to replace the
> IO functions.
> - The IO functions of the virtio-net PMD will be replaced by read() and
> write() accesses to the shared memory.
> - Add a notification mechanism to the IO functions. This will be used
> when a write() to the shared memory is done.
> (Not sure exactly, but probably we need it.)
>
> Does it make sense?
> I guess Steps 1 and 2 are different from your approach, but the rest
> might be similar.
>
> Actually, we just need sysfs entries for a virtio-net dummy device, but
> so far I don't have a good way to register them from user space without
> loading a kernel module.
Tetsuya:
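Just to check my understanding of Step 4 and Step 5: is something like the
rough sketch below what you have in mind? This is only my guess, with
hypothetical names; pci_dummy_virtio_net_map() and resource_path are the
ones taken from your description.

/* Rough sketch of my reading of Step 4/5 -- not real code. */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile uint8_t *dummy_bar; /* shared file standing in for the BAR */

static int pci_dummy_virtio_net_map(const char *resource_path, size_t len)
{
        int fd = open(resource_path, O_RDWR);
        if (fd < 0)
                return -1;
        dummy_bar = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return dummy_bar == MAP_FAILED ? -1 : 0;
}

/* Replacements for the PMD's inb()/outb() style accessors. */
static uint8_t dummy_io_read8(uint64_t off)
{
        return dummy_bar[off];
}

static void dummy_io_write8(uint64_t off, uint8_t val)
{
        dummy_bar[off] = val;
        /* plus some notification to the peer that a register write
         * happened, e.g. an eventfd kick or a byte on the unix socket */
}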
I don't quite get the details of the sysfs part: who will create those
sysfs entries? A kernel module, right?
The virtio-net device is configured through reads/writes to the shared
memory (between host and guest), right?
Where are the shared vring and the shared memory created, on a shared
hugepage between host and guest?
Who will talk to dpdk vhost?
> This is because I need to change pci_scan() as well.
>
> It seems you have implemented a virtio-net pseudo device under the BSD
> license.
> If so, it would be nice for this kind of PMD to use it.
Currently it is based on the native Linux KVM tool.
> In the case that it takes much time to implement some missing
> functionalities like interrupt mode, using the QEMU code might be one of
> the options.
For interrupt mode, I plan to use eventfd for sleep/wake, but have not
tried it yet.
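Roughly the idea is the untested sketch below (the notify_* names are
hypothetical): the eventfd would be handed to the vhost side the same way
as the other fds, the backend write()s to it to wake the PMD, and the PMD
read()s on it to sleep until the next kick.

/* Untested sketch of eventfd-based sleep/wake for "interrupt" mode. */
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int kick_fd; /* shared with the vhost side, e.g. via SCM_RIGHTS */

static int notify_init(void)
{
        kick_fd = eventfd(0, 0);
        return kick_fd < 0 ? -1 : 0;
}

/* backend side: wake up the sleeping PMD thread */
static void notify_wake(void)
{
        uint64_t one = 1;
        (void)write(kick_fd, &one, sizeof(one));
}

/* PMD side: sleep until the backend kicks us */
static void notify_wait(void)
{
        uint64_t cnt;
        (void)read(kick_fd, &cnt, sizeof(cnt)); /* blocks until a write() */
}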
> Anyway, we just need a fine virtual NIC between containers and the host.
> So we don't hold to our approach and implementation.

Do you have comments on my implementation?
We could publish the version without the device framework first for
reference.
>
> Thanks,
> Tetsuya
>
>>>>> 2.1 The virtio PMD configures the pseudo virtio device just as it
>>>>> does in a KVM guest environment.
>>>>> 2.2 Rather than using IO instructions, the virtio PMD uses the IOAPI
>>>>> for IO operations on the virtio-net PCI device.
>>>>> 2.3 The device simulation is responsible for the device state machine
>>>>> simulation.
>>>>> 2.4 The device simulation is responsible for talking to vhost.
>>>>> With this approach, we could minimize the virtio PMD modifications.
>>>>> The virtio PMD works as if it were configuring a real virtio-net PCI
>>>>> device.
>>>>>
>>>>> Memory mapping?
>>>>> A: QEMU can access the whole guest memory in the KVM environment. We
>>>>> need to fill that gap.
>>>>> The container maps the shared memory into the container's virtual
>>>>> address space and the host maps it into the host's virtual address
>>>>> space. There is a fixed offset mapping.
>>>>> The container creates the shared vring on that memory. The container
>>>>> also creates the mbuf memory pool on the shared memory.
>>>>> In the VHOST_SET_MEM_TABLE message, we send the memory mapping
>>>>> information for the shared memory. As we require the mbuf pool to be
>>>>> created on the shared memory, and buffers are allocated from the mbuf
>>>>> pools, dpdk vhost can translate the GPA in a vring desc to a host
>>>>> virtual address.
>>>>>
>>>>> GPA or CVA in the vring desc?
>>>>> To ease the memory translation, rather than using GPA, here we use
>>>>> CVA (container virtual address). This is the tricky thing here.
>>>>> 1) The virtio PMD writes the vring's VFN rather than the PFN to the
>>>>> PFN register through the IOAPI.
>>>>> 2) The device simulation framework will use the VFN as the PFN.
>>>>> 3) The device simulation sends SET_VRING_ADDR with the CVA.
>>>>> 4) The virtio PMD fills the vring desc with the CVA of the mbuf data
>>>>> pointer rather than the GPA.
>>>>> So when the host sees the CVA, it can translate it to an HVA (host
>>>>> virtual address).
>>>>>
>>>>> Worth noting:
>>>>> The virtio interface in the container follows the vhost message
>>>>> format, and is compliant with the dpdk vhost implementation, i.e., no
>>>>> dpdk vhost modification is needed.
>>>>> vhost isn't aware of whether the incoming virtio comes from a KVM
>>>>> guest or a container.
>>>>>
>>>>> That pretty much covers the high level design. There are quite a few
>>>>> low level issues. For example, a 32-bit PFN is enough for a KVM guest,
>>>>> but since we use a 64-bit VFN (virtual page frame number), a trick is
>>>>> done here through a special IOAPI.
>>> In addition to the above, we might consider the "namespace" kernel
>>> functionality.
>>> Technically, it would not be a big problem, but it is related to
>>> security.
>>> So it would be nice to take it into account.
>> There is no namespace concept here because we don't generate kernel
>> netdev devices. It might be useful if we could extend our work to
>> support a kernel netdev interface and assign it to a container's
>> namespace.
Yes, it would be great if we could extend this to support both kernel
networking and user space networking.
No progress so far.

>>> Regards,
>>> Tetsuya
>>>
>>>>> /huawei
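
P.S. On the "GPA or CVA in the vring desc" part quoted above: with the CVA
trick, the translation on the vhost side reduces to the fixed offset
between the two mappings of the same shared file, conceptually something
like the following (illustrative only, hypothetical names).

/* container_base is the CVA where the container mapped the shared file,
 * host_base is where the vhost process mapped the same file. */
static inline void *cva_to_hva(uint64_t cva, uint64_t container_base,
                               void *host_base)
{
        return (uint8_t *)host_base + (cva - container_base);
}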