From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Xie, Huawei"
To: Tetsuya Mukawa, "dev@dpdk.org"
Cc: "nakajima.yoshihiro@lab.ntt.co.jp", "zhbzg@huawei.com",
 "gaoxiaoqiu@huawei.com", "oscar.zhangbo@huawei.com", Zhuangyanying,
 "zhoujingbin@huawei.com", "guohongzhen@huawei.com"
Date: Mon, 14 Sep 2015 03:15:52 +0000
References: <55DBD9E1.3050609@igel.co.jp> <55DD8578.70208@igel.co.jp>
 <55EE67C2.5040301@igel.co.jp>
Subject: Re: [dpdk-dev] vhost compliant virtio based networking interface in container

On 9/8/2015 12:45 PM, Tetsuya Mukawa wrote:
> On 2015/09/07 14:54, Xie, Huawei wrote:
>> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>>> Hi Xie and Yanping,
>>>>>
>>>>> May I ask you some questions?
>>>>> It seems we are also developing an almost identical one.
>>>> Good to know that we are tackling the same problem and have a similar
>>>> idea.
>>>> What is your status now? We have the POC running, and it is compliant
>>>> with dpdkvhost. Interrupt-like notification isn't supported yet.
>>> We implemented the vhost PMD first, so we have just started
>>> implementing it.
>>>
>>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>>> Added dev@dpdk.org
>>>>>>
>>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>>> Yanping:
>>>>>>> I read your mail; it seems what we did is quite similar. Here I
>>>>>>> wrote a quick mail to describe our design. Let me know if it is the
>>>>>>> same thing.
>>>>>>>
>>>>>>> Problem statement:
>>>>>>> We don't have a high-performance networking interface in containers
>>>>>>> for NFV. The current veth-pair-based interface cannot easily be
>>>>>>> accelerated.
>>>>>>>
>>>>>>> The key components involved:
>>>>>>> 1. A DPDK-based virtio PMD driver in the container.
>>>>>>> 2. A device simulation framework in the container.
>>>>>>> 3. DPDK (or kernel) vhost running on the host.
>>>>>>>
>>>>>>> How is virtio created?
>>>>>>> A: There is no "real" virtio-pci device in the container
>>>>>>> environment.
>>>>>>> 1) The host maintains pools of memory and shares memory with the
>>>>>>> container. This can be accomplished by having the host share a
>>>>>>> hugepage file with the container.
>>>>>>> 2) The container creates virtio rings on the shared memory.
>>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>>> 4) The container sends the memory and vring information to vhost
>>>>>>> through vhost messages. This can be done either through an ioctl
>>>>>>> call or a vhost-user message.
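
To make steps 1) and 2) concrete, a minimal sketch (illustrative path,
size and ring size, not our actual POC code) of the container side could
look like this, using vring_init() from linux/virtio_ring.h:

    /* Container side: map the hugepage file shared by the host and lay
     * a vring out on it. SHARED_HUGEPAGE_PATH, SHARED_SIZE and RING_NUM
     * are assumptions for illustration. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <linux/virtio_ring.h>

    #define SHARED_HUGEPAGE_PATH "/mnt/huge/container1/shm0"
    #define SHARED_SIZE (2UL * 1024 * 1024)  /* one 2MB hugepage */
    #define RING_NUM 256                     /* queue size */

    int main(void)
    {
        int fd = open(SHARED_HUGEPAGE_PATH, O_RDWR);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        void *base = mmap(NULL, SHARED_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            close(fd);
            return 1;
        }

        /* vring_init() only computes the desc/avail/used pointers
         * inside the shared area; nothing is allocated. */
        struct vring vr;
        vring_init(&vr, RING_NUM, base, 4096);

        printf("desc=%p avail=%p used=%p\n",
               (void *)vr.desc, (void *)vr.avail, (void *)vr.used);

        /* These addresses, expressed as offsets into the shared file,
         * are what the vhost message in step 4) would carry. */
        munmap(base, SHARED_SIZE);
        close(fd);
        return 0;
    }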
>>>>>>> How is the vhost message sent?
>>>>>>> A: There are two alternative ways to do this.
>>>>>>> 1) The customized virtio PMD is responsible for all the vring
>>>>>>> creation and vhost message sending.
>>>>> The above is our approach so far.
>>>>> It seems Yanping also takes this kind of approach.
>>>>> We are using the vhost-user functionality instead of the vhost-net
>>>>> kernel module.
>>>>> Probably this is the difference between Yanping and us.
>>>> In my current implementation, the device simulation layer talks to
>>>> "user space" vhost through the cuse interface. It could also be done
>>>> through the vhost-user socket. This isn't the key point.
>>>> Here vhost-user is kind of confusing; maybe "user space vhost" is more
>>>> accurate, covering either cuse or the unix domain socket. :)
>>>>
>>>> As for Yanping, they are now connecting to the vhost-net kernel
>>>> module, but they are also trying to connect to "user space" vhost.
>>>> Correct me if wrong.
>>>> Yes, there is some difference between these two. The vhost-net kernel
>>>> module can directly access another process's memory, while with
>>>> vhost-user (cuse/socket) we need to do the memory mapping ourselves.
>>>>> BTW, we are going to submit a vhost PMD for DPDK 2.2.
>>>>> This PMD is implemented on librte_vhost.
>>>>> It allows a DPDK application to handle a vhost-user (or cuse) backend
>>>>> as a normal NIC port.
>>>>> This PMD should work with both Xie's and Yanping's approaches.
>>>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>>>
>>>>>>> 2) We could do this through a lightweight device simulation
>>>>>>> framework. The device simulation creates a simple PCI bus. On the
>>>>>>> PCI bus, virtio-net PCI devices are created. The device simulation
>>>>>>> provides an IOAPI for MMIO/IO access.
>>>>> Does this mean you implemented a kernel module?
>>>>> If so, do you still need the vhost-cuse functionality to handle vhost
>>>>> messages in userspace?
>>>> The device simulation is a library running in user space in the
>>>> container. It is linked with the DPDK app. It creates pseudo buses and
>>>> virtio-net PCI devices.
>>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>>> through the IOAPI provided by the device simulation rather than
>>>> through IO instructions as in KVM.
>>>> Why do we use device simulation?
>>>> We could create other virtio devices in the container, and it provides
>>>> a common way to talk to the vhost-xx module.
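
As a rough sketch of what that IOAPI could look like (the names below are
illustrative, not our framework's actual API): the PMD calls read/write
hooks on the simulated device instead of issuing IO instructions.

    /* Sketch only: illustrative names for the device simulation IOAPI. */
    #include <stdint.h>

    struct sim_pci_dev;

    /* IO operations the PMD calls instead of inb()/outb() as under KVM. */
    struct sim_io_ops {
        uint32_t (*read)(struct sim_pci_dev *dev, uint64_t offset, int len);
        void     (*write)(struct sim_pci_dev *dev, uint64_t offset,
                          uint32_t value, int len);
    };

    /* A virtio-net device simulated on the pseudo PCI bus. */
    struct sim_pci_dev {
        uint16_t vendor_id;            /* 0x1af4 for virtio */
        uint16_t device_id;            /* 0x1000 for virtio-net */
        const struct sim_io_ops *ops;  /* installed by the simulation */
        void *priv;                    /* backend state, e.g. the vhost
                                        * connection */
    };

    /* What the virtio PMD would do to set a queue's pfn, for example.
     * The simulation intercepts this write and can forward the vring
     * address to vhost. */
    static inline void
    sim_set_queue_pfn(struct sim_pci_dev *dev, uint32_t pfn)
    {
        /* VIRTIO_PCI_QUEUE_PFN is at offset 8 in the legacy virtio
         * PCI configuration header. */
        dev->ops->write(dev, 8, pfn, 4);
    }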
>>> Thanks for the explanation.
>>> At first reading, I thought the difference between approach 1 and
>>> approach 2 was whether we need to implement a new kernel module or not.
>>> But now I understand how you implemented it.
>>>
>>> Please let me explain our design more.
>>> We might use a somewhat similar approach to handle a pseudo virtio-net
>>> device in DPDK.
>>> (Anyway, we haven't finished implementing it yet, so this overview
>>> might have some technical problems.)
>>>
>>> Step 1. Separate the virtio-net and vhost-user socket related code from
>>> QEMU, then implement it as a separate program.
>>> The program also has the features below.
>>> - Create a directory that contains almost the same files as
>>> /sys/bus/pci/devices/<pci addr>/*
>>> (To scan these files located outside sysfs, we need to fix EAL.)
>>> - This dummy device is driven by 'dummy-virtio-net-driver'. This name
>>> is specified by the '<pci addr>/driver' file.
>>> - Create a shared file that represents the pci configuration space,
>>> then mmap it, and also specify the path in '<pci addr>/resource_path'.
>>>
>>> The program will be GPL, but it will be like a bridge on the shared
>>> memory between the virtio-net PMD and the DPDK vhost backend.
>>> Actually, it will work under the virtio-net PMD, but we don't need to
>>> link it. So I guess we don't have a GPL license issue.
>>>
>>> Step 2. Fix the pci scan code of EAL to scan dummy devices.
>>> - To scan the above files, extend pci_scan() of EAL.
>>>
>>> Step 3. Add a new kdrv type to EAL.
>>> - To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
>>>
>>> Step 4. Implement pci_dummy_virtio_net_map/unmap().
>>> - It will have almost the same functionality as pci_uio_map(), but for
>>> the dummy virtio-net device.
>>> - The dummy device will be mmapped using the path specified in '<pci
>>> addr>/resource_path'.
>>>
>>> Step 5. Add a new compile option for the virtio-net device to replace
>>> the IO functions.
>>> - The IO functions of the virtio-net PMD will be replaced by read() and
>>> write() accesses to the shared memory.
>>> - Add a notification mechanism to the IO functions. This will be used
>>> when a write() to the shared memory is done.
>>> (Not sure exactly, but we probably need it.)
>>>
>>> Does this make sense?
>>> I guess Steps 1 & 2 are different from your approach, but the rest
>>> might be similar.
>>>
>>> Actually, we just need sysfs entries for a virtio-net dummy device, but
>>> so far I don't have a good way to register them from user space without
>>> loading a kernel module.
>> Tetsuya:
>> I don't quite get the details. Who will create those sysfs entries? A
>> kernel module, right?
> Hi Xie,
>
> I don't create sysfs entries. I just create a directory that contains
> files that look like sysfs entries.
> And I initialize EAL with not only sysfs but also the above directory.
>
> In the quoted last sentence, I wanted to say we just needed files that
> look like sysfs entries.
> But I don't know a good way to create files under sysfs without loading
> a kernel module.
> That is why I try to create the additional directory.
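
If I read the dummy directory idea right, a minimal sketch of the
pseudo-device process creating it could be as below (the directory root,
PCI address and values are just for illustration):

    /* Create a sysfs-lookalike directory that an extended pci_scan()
     * could parse. Everything here is an illustrative assumption. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    static void put_file(const char *dir, const char *name,
                         const char *val)
    {
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/%s", dir, name);
        f = fopen(path, "w");
        if (f != NULL) {
            fputs(val, f);
            fclose(f);
        }
    }

    int main(void)
    {
        const char *dir = "/tmp/dummy-pci/0000:00:01.0"; /* assumed root */

        mkdir("/tmp/dummy-pci", 0755);
        mkdir(dir, 0755);
        put_file(dir, "vendor", "0x1af4\n"); /* virtio vendor id */
        put_file(dir, "device", "0x1000\n"); /* legacy virtio-net id */
        put_file(dir, "driver", "dummy-virtio-net-driver\n");
        /* Path of the mmap()-able file standing in for the PCI
         * resource, per '<pci addr>/resource_path' above. */
        put_file(dir, "resource_path",
                 "/tmp/dummy-pci/0000:00:01.0/bar0\n");
        return 0;
    }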
>> The virtio-net is configured through read/write to shared memory
>> (between the host and guest), right?
> Yes, I agree.
>
>> Where are the shared vring and shared memory created, on a shared
>> hugepage between host and guest?
> The virtqueues (vrings) are on a guest hugepage.
>
> Let me explain.
> The guest container should have read/write access to a part of the
> hugepage directory on the host.
> (For example, /mnt/huge/container1/ is shared between host and guest.)
> Also, the host and guest need to communicate through a unix domain
> socket.
> (For example, host and guest can communicate using
> "/tmp/container1/sock".)
>
> If we can do as above, the virtio-net PMD in the guest can create
> virtqueues (vrings) on its hugepage and write this information to a
> pseudo virtio-net device, which is a process created in the guest
> container.
> Then the pseudo virtio-net device sends it to the vhost-user backend
> (the host DPDK application) through the unix domain socket.
>
> So with my plan, there are 3 processes:
> the DPDK applications on the host and in the guest, plus a process that
> works like a virtio-net device.
>
>> Who will talk to dpdkvhost?
> If we need to talk to a cuse device or the vhost-net kernel module, the
> above pseudo virtio-net device could talk to it.
> (But, so far, my target is only vhost-user.)
>
>>> This is because I need to change pci_scan() also.
>>>
>>> It seems you have implemented a virtio-net pseudo device under a BSD
>>> license. If so, it would be nice for this kind of PMD to use it.
>> Currently it is based on the native linux kvm tool.
> Great, I hadn't noticed this option.
>
>>> If it takes much time to implement some missing functionality like
>>> interrupt mode, using the QEMU code might be one of the options.
>> For interrupt mode, I plan to use eventfd for sleep/wake; I have not
>> tried it yet.
>>> Anyway, we just need a good virtual NIC between containers and the
>>> host, so we don't insist on our approach and implementation.
>> Do you have comments on my implementation?
>> We could publish the version without the device framework first for
>> reference.
> No, I don't. Could you please share it?
> I am looking forward to seeing it.
OK, we are removing the device framework. Hope to publish it in one
month's time.

>
> Tetsuya
>
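P.S. For the unix domain socket piece of the design above, the kind of fd
passing involved is sketched below; the socket message layout and names
are hypothetical, not the vhost-user ABI.

    /* Hypothetical sketch: the pseudo virtio-net device process hands
     * the hugepage fd to the host backend over the unix domain socket
     * with SCM_RIGHTS, so the backend can mmap() the memory that holds
     * the vrings and mbuf pools. */
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    struct mem_region_msg {   /* illustrative, not a real wire format */
        uint64_t guest_addr;  /* where the region sits for the guest PMD */
        uint64_t size;        /* region size */
        uint64_t mmap_offset; /* offset within the passed fd */
    };

    int send_region(int sock, int hugepage_fd, struct mem_region_msg *msg)
    {
        struct iovec iov = { .iov_base = msg, .iov_len = sizeof(*msg) };
        union {               /* ensure cmsg buffer alignment */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr mh = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };
        struct cmsghdr *cm = CMSG_FIRSTHDR(&mh);

        /* Attach the hugepage fd as SCM_RIGHTS ancillary data. */
        cm->cmsg_level = SOL_SOCKET;
        cm->cmsg_type = SCM_RIGHTS;
        cm->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cm), &hugepage_fd, sizeof(int));

        return sendmsg(sock, &mh, 0) < 0 ? -1 : 0;
    }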