From: Tetsuya Mukawa
Date: Tue, 08 Sep 2015 13:44:50 +0900
To: "Xie, Huawei", "dev@dpdk.org"
Cc: "nakajima.yoshihiro@lab.ntt.co.jp", "zhbzg@huawei.com", "gaoxiaoqiu@huawei.com", "oscar.zhangbo@huawei.com", Zhuangyanying, "zhoujingbin@huawei.com", "guohongzhen@huawei.com"
Message-ID: <55EE67C2.5040301@igel.co.jp>
Subject: Re: [dpdk-dev] vhost compliant virtio based networking interface in container

On 2015/09/07 14:54, Xie, Huawei wrote:
> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>> Hi Xie and Yanping,
>>>>
>>>> May I ask you some questions?
>>>> It seems we are also developing almost the same thing.
>>> Good to know that we are tackling the same problem and have a similar idea.
>>> What is your status now? We had the PoC running, compliant with dpdkvhost.
>>> Interrupt-like notification isn't supported.
>> We implemented the vhost PMD first, so we have just started implementing it.
>>
>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>> Added dev@dpdk.org
>>>>>
>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>> Yanping:
>>>>>> I read your mail; it seems what we did is quite similar. Here I wrote a quick mail to describe our design. Let me know if it is the same thing.
>>>>>>
>>>>>> Problem statement:
>>>>>> We don't have a high-performance networking interface in containers for NFV. The current veth pair based interface cannot be easily accelerated.
>>>>>>
>>>>>> The key components involved:
>>>>>> 1. DPDK based virtio PMD driver in the container.
>>>>>> 2. Device simulation framework in the container.
>>>>>> 3. DPDK (or kernel) vhost running in the host.
>>>>>>
>>>>>> How is virtio created?
>>>>>> A: There is no "real" virtio-pci device in the container environment.
>>>>>> 1) The host maintains pools of memory and shares memory with the container. This could be accomplished by the host sharing a hugepage file with the container.
>>>>>> 2) The container creates virtio rings based on the shared memory.
>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>> 4) The container sends the memory and vring information to vhost through vhost messages. This could be done either through an ioctl call or a vhost-user message.
>>>>>>
>>>>>> How is the vhost message sent?
>>>>>> A: There are two alternative ways to do this.
>>>>>> 1) The customized virtio PMD is responsible for all the vring creation and vhost message sending.
>>>> The above is our approach so far.
>>>> It seems Yanping also takes this kind of approach.
>>>> We are using the vhost-user functionality instead of the vhost-net kernel module.
>>>> Probably this is the difference between Yanping and us.
>>> In my current implementation, the device simulation layer talks to "user space" vhost through the cuse interface. It could also be done through a vhost-user socket; this isn't the key point.
>>> Here vhost-user is kind of confusing; maybe "user space vhost" is more accurate, covering either cuse or a unix domain socket. :)
>>>
>>> As for Yanping, they are now connecting to the vhost-net kernel module, but they are also trying to connect to "user space" vhost. Correct me if I am wrong.
>>> Yes, there is some difference between these two. The vhost-net kernel module can directly access another process's memory, while with user-space vhost (cuse or socket) we need to do the memory mapping.
>>>> BTW, we are going to submit a vhost PMD for DPDK 2.2.
>>>> This PMD is implemented on top of librte_vhost.
>>>> It allows a DPDK application to handle a vhost-user (or cuse) backend as a normal NIC port.
>>>> This PMD should work with both Xie's and Yanping's approaches.
>>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>>
>>>>>> 2) We could do this through a lightweight device simulation framework.
>>>>>> The device simulation creates a simple PCI bus. On the PCI bus, virtio-net PCI devices are created. The device simulation provides an IO API for MMIO/IO access.
>>>> Does this mean you implemented a kernel module?
>>>> If so, do you still need the vhost-cuse functionality to handle vhost messages in userspace?
>>> The device simulation is a library running in user space in the container. It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI devices.
>>> The virtio-container-PMD configures the virtio-net pseudo devices through the IO API provided by the device simulation rather than through IO instructions as in KVM.
>>> Why do we use device simulation?
>>> We could create other virtio devices in the container, and it provides a common way to talk to the vhost-xx module.
>> Thanks for the explanation.
>> At first reading, I thought the difference between approach 1 and approach 2 was whether we need to implement a new kernel module or not, but now I understand how you implemented it.
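
(Just to check that I follow the IO API idea: I imagine the interface looking roughly like the hypothetical sketch below. The sim_* names are placeholders I made up for illustration only; they are not your actual code.)

  /*
   * Hypothetical sketch of the device-simulation IO API as I understand it.
   * The point is that the container virtio PMD calls library functions
   * instead of issuing IO instructions as it would under KVM.
   */
  #include <stdint.h>
  #include <stddef.h>

  /* Opaque handle to a pseudo virtio-net PCI device on the pseudo bus. */
  struct sim_pci_dev;

  /* Create the pseudo PCI bus and a virtio-net device on it; vhost_path
   * would point at the cuse device or vhost-user socket to talk to. */
  struct sim_pci_dev *sim_virtio_net_create(const char *vhost_path);

  /* Register-space accessors used by the PMD instead of inb()/outb(). */
  uint32_t sim_io_read(struct sim_pci_dev *dev, uint64_t offset, size_t len);
  void sim_io_write(struct sim_pci_dev *dev, uint64_t offset,
                    uint32_t value, size_t len);

  /* Kick the backend after the PMD updates the avail ring. */
  void sim_notify(struct sim_pci_dev *dev, uint16_t queue_id);

  /* Tear the device down and close the connection to user-space vhost. */
  void sim_virtio_net_destroy(struct sim_pci_dev *dev);

If that is roughly right, then on the PMD side we only need the read/write pair plus a notify hook, which is close to what I describe in Step 5 below.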
>>
>> Please let me explain our design in more detail.
>> We might use a somewhat similar approach to handle a pseudo virtio-net device in DPDK.
>> (Anyway, we haven't finished implementing it yet, so this overview might have some technical problems.)
>>
>> Step 1. Separate the virtio-net and vhost-user socket related code from QEMU, then implement it as a separate program.
>> The program also has the features below:
>> - Create a directory that contains almost the same files as /sys/bus/pci/devices/<addr>/*.
>>   (To scan these files located outside sysfs, we need to fix the EAL.)
>> - This dummy device is driven by 'dummy-virtio-net-driver'. The name is specified by the '<addr>/driver' file.
>> - Create a shared file that represents the PCI configuration space, mmap it, and specify its path in '<addr>/resource_path'.
>>
>> The program will be GPL, but it will be like a bridge on the shared memory between the virtio-net PMD and the DPDK vhost backend.
>> Actually, it will work under the virtio-net PMD, but we don't need to link against it, so I guess we don't have a GPL license issue.
>>
>> Step 2. Fix the PCI scan code of the EAL to scan dummy devices.
>> - To scan the above files, extend pci_scan() of the EAL.
>>
>> Step 3. Add a new kdrv type to the EAL.
>> - To handle the 'dummy-virtio-net-driver', add a new kdrv type to the EAL.
>>
>> Step 4. Implement pci_dummy_virtio_net_map()/unmap().
>> - They will have almost the same functionality as pci_uio_map(), but for the dummy virtio-net device.
>> - The dummy device will be mmapped using the path specified in '<addr>/resource_path'.
>>
>> Step 5. Add a new compile option for the virtio-net device to replace the IO functions.
>> - The IO functions of the virtio-net PMD will be replaced by read() and write() accesses to the shared memory.
>> - Add a notification mechanism to the IO functions, to be used when a write() to the shared memory is done.
>>   (Not sure exactly, but we probably need it.)
>>
>> Does it make sense?
>> I guess Steps 1 and 2 are different from your approach, but the rest might be similar.
>>
>> Actually, we just need sysfs entries for a virtio-net dummy device, but so far I don't have a good way to register them from user space without loading a kernel module.
> Tetsuya:
> I don't quite get the details. Who will create those sysfs entries? A kernel module, right?

Hi Xie,

I don't create sysfs entries. I just create a directory that contains files that look like sysfs entries, and initialize the EAL with not only sysfs but also that directory.
In the last quoted sentence above, I wanted to say that we just need files that look like sysfs entries, but I don't know a good way to create files under sysfs without loading a kernel module. That is why I create the additional directory.

> The virtio-net is configured through read/write to shared memory (between host and guest), right?

Yes, I agree.

> Where are the shared vring and the shared memory created, on a hugepage shared between host and guest?

The virtqueues (vrings) are on the guest hugepage. Let me explain.
The guest container should have read/write access to a part of the hugepage directory on the host. (For example, /mnt/huge/container1/ is shared between host and guest.)
Also, host and guest need to communicate through a unix domain socket. (For example, host and guest can communicate using "/tmp/container1/sock".)
If we can do the above, a virtio-net PMD in the guest can create virtqueues (vrings) on its hugepage and write that information to a pseudo virtio-net device, which is a process created in the guest container. The pseudo virtio-net device then sends it to the vhost-user backend (the host DPDK application) through the unix domain socket.
So with my plan there are 3 processes: DPDK applications on the host and in the guest, plus a process that works like a virtio-net device.
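
To make the flow concrete, the guest-side setup I have in mind is roughly the untested sketch below. The paths reuse the /mnt/huge/container1/ and /tmp/container1/sock examples above; the file name "vring_mem", the 2 MB size and the error handling are only placeholders, and the actual vhost-user protocol handling is left out.

  /*
   * Rough sketch (untested): map a hugepage file that the host side also
   * maps, so that virtqueues and mbuf pools live in memory the vhost-user
   * backend can reach, then connect to the vhost-user unix domain socket
   * over which the pseudo virtio-net device would send the memory and
   * vring information (SET_MEM_TABLE, SET_VRING_ADDR, ...).
   */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  #define HUGEPAGE_FILE "/mnt/huge/container1/vring_mem" /* shared with host */
  #define VHOST_SOCK    "/tmp/container1/sock"           /* vhost-user socket */
  #define MAP_SIZE      (2UL * 1024 * 1024)              /* one 2 MB hugepage */

  int main(void)
  {
      /* Map the shared hugepage file; vrings and mbuf pools go in here. */
      int mem_fd = open(HUGEPAGE_FILE, O_CREAT | O_RDWR, 0600);
      if (mem_fd < 0) {
          perror("open hugepage file");
          return 1;
      }
      void *shm = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, mem_fd, 0);
      if (shm == MAP_FAILED) {
          perror("mmap hugepage file");
          return 1;
      }

      /* Connect to the vhost-user backend (the host DPDK application). */
      int sock = socket(AF_UNIX, SOCK_STREAM, 0);
      if (sock < 0) {
          perror("socket");
          return 1;
      }
      struct sockaddr_un addr;
      memset(&addr, 0, sizeof(addr));
      addr.sun_family = AF_UNIX;
      strncpy(addr.sun_path, VHOST_SOCK, sizeof(addr.sun_path) - 1);
      if (connect(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          perror("connect vhost-user socket");
          return 1;
      }

      printf("shared memory at %p, vhost-user socket fd %d\n", shm, sock);

      close(sock);
      munmap(shm, MAP_SIZE);
      close(mem_fd);
      return 0;
  }

In the real design the socket side would of course live in the pseudo virtio-net device process rather than in the PMD itself, as described above.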
> Who will talk to dpdkvhost?

If we need to talk to a cuse device or the vhost-net kernel module, the pseudo virtio-net device described above could do it. (But so far, my target is only vhost-user.)

>> This is because I need to change pci_scan() also.
>>
>> It seems you have implemented a virtio-net pseudo device under the BSD license.
>> If so, it would be nice for this kind of PMD to use it.
> Currently it is based on the native linux kvm tool.

Great, I hadn't noticed that option.

>> In case it takes much time to implement some missing functionality like interrupt mode, using the QEMU code might be one of the options.
> For interrupt mode, I plan to use eventfd for sleep/wake; I have not tried it yet.
>> Anyway, we just need a good virtual NIC between containers and the host, so we are not wedded to our approach and implementation.
> Do you have comments on my implementation?
> We could publish the version without the device framework first, for reference.

No, I don't. Could you please share it? I am looking forward to seeing it.

Tetsuya