From: Tetsuya Mukawa <mukawa@igel.co.jp>
To: "Xie, Huawei" <huawei.xie@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "nakajima.yoshihiro@lab.ntt.co.jp"
<nakajima.yoshihiro@lab.ntt.co.jp>,
"zhbzg@huawei.com" <zhbzg@huawei.com>,
"gaoxiaoqiu@huawei.com" <gaoxiaoqiu@huawei.com>,
"oscar.zhangbo@huawei.com" <oscar.zhangbo@huawei.com>,
Zhuangyanying <ann.zhuangyanying@huawei.com>,
"zhoujingbin@huawei.com" <zhoujingbin@huawei.com>,
"guohongzhen@huawei.com" <guohongzhen@huawei.com>
Subject: Re: [dpdk-dev] vhost compliant virtio based networking interface in container
Date: Tue, 08 Sep 2015 13:44:50 +0900 [thread overview]
Message-ID: <55EE67C2.5040301@igel.co.jp> (raw)
In-Reply-To: <C37D651A908B024F974696C65296B57B2BDBDDCD@SHSMSX101.ccr.corp.intel.com>
On 2015/09/07 14:54, Xie, Huawei wrote:
> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>> Hi Xie and Yanping,
>>>>
>>>>
>>>> May I ask you some questions?
>>>> It seems we are developing almost the same thing.
>>> Good to know that we are tackling the same problem and have a similar
>>> idea.
>>> What is your status now? We had the POC running, and it is compliant with
>>> dpdkvhost.
>>> Interrupt-like notification isn't supported.
>> We implemented the vhost PMD first, so we have just started implementing it.
>>
>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>> Added dev@dpdk.org
>>>>>
>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>> Yanping:
>>>>>> I read your mail; it seems what we did is quite similar. Here I wrote a
>>>>>> quick mail to describe our design. Let me know if it is the same thing.
>>>>>>
>>>>>> Problem Statement:
>>>>>> We don't have a high-performance networking interface in containers for
>>>>>> NFV. The current veth-pair-based interface can't be easily accelerated.
>>>>>>
>>>>>> The key components involved:
>>>>>> 1. A DPDK-based virtio PMD driver in the container.
>>>>>> 2. A device simulation framework in the container.
>>>>>> 3. DPDK (or kernel) vhost running on the host.
>>>>>>
>>>>>> How is the virtio device created?
>>>>>> A: There is no "real" virtio-pci device in the container environment.
>>>>>> 1) The host maintains memory pools and shares memory with the container.
>>>>>> This could be accomplished by having the host share a hugepage file with
>>>>>> the container.
>>>>>> 2) The container creates virtio rings on the shared memory.
>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>> 4) The container sends the memory and vring information to vhost through
>>>>>> vhost messages. This could be done either through an ioctl call or a
>>>>>> vhost-user message.
>>>>>>
>>>>>> How is the vhost message sent?
>>>>>> A: There are two alternative ways to do this.
>>>>>> 1) The customized virtio PMD is responsible for all the vring creation
>>>>>> and vhost message sending.
>>>> Above is our approach so far.
>>>> It seems Yanping also takes this kind of approach.
>>>> We are using vhost-user functionality instead of using the vhost-net
>>>> kernel module.
>>>> Probably this is the difference between Yanping and us.
>>> In my current implementation, the device simulation layer talks to "user
>>> space" vhost through the cuse interface. It could also be done through a
>>> vhost-user socket. This isn't the key point.
>>> Here "vhost-user" is kind of confusing; maybe "user space vhost" is more
>>> accurate, covering either cuse or a unix domain socket. :)
>>>
>>> As for Yanping, they are now connecting to the vhost-net kernel module, but
>>> they are also trying to connect to "user space" vhost. Correct me if wrong.
>>> Yes, there is some difference between these two. The vhost-net kernel module
>>> can directly access another process's memory, while with
>>> vhost-user (cuse/user) we need to do the memory mapping.
>>>> BTW, we are going to submit a vhost PMD for DPDK 2.2.
>>>> This PMD is implemented on top of librte_vhost.
>>>> It allows a DPDK application to handle a vhost-user (cuse) backend as a
>>>> normal NIC port.
>>>> This PMD should work with both Xie's and Yanping's approaches.
>>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>>
>>>>>> 2) We could do this through a lightweight device simulation framework.
>>>>>> The device simulation creates a simple PCI bus. On the PCI bus,
>>>>>> virtio-net PCI devices are created. The device simulation provides an
>>>>>> IOAPI for MMIO/IO access.
>>>> Does it mean you implemented a kernel module?
>>>> If so, do you still need vhost-cuse functionality to handle vhost
>>>> messages in userspace?
>>> The device simulation is a library running in user space in the container.
>>> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
>>> devices.
>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>> through the IOAPI provided by the device simulation, rather than through IO
>>> instructions as in KVM.
>>> Why do we use device simulation?
>>> We could create other virtio devices in the container, and provide a common
>>> way to talk to the vhost-xx module.
>> Thanks for the explanation.
>> At first reading, I thought the difference between approach 1 and
>> approach 2 was whether we need to implement a new kernel module or not.
>> But now I understand how you implemented it.
>>
>> Please let me explain our design in more detail.
>> We might use a somewhat similar approach to handle a pseudo virtio-net
>> device in DPDK.
>> (We haven't finished implementing it yet, so this overview might have
>> some technical problems.)
>>
>> Step 1. Separate the virtio-net and vhost-user socket related code from
>> QEMU, then implement it as a separate program.
>> The program also has the features below.
>> - Create a directory that contains almost the same files as
>> /sys/bus/pci/device/<pci address>/*
>> (To scan these files located outside sysfs, we need to fix the EAL.)
>> - This dummy device is driven by 'dummy-virtio-net-driver'. This name is
>> specified by the '<pci addr>/driver' file.
>> - Create a shared file that represents the PCI configuration space, then
>> mmap it; also specify the path in '<pci addr>/resource_path'.
>>
>> The program will be GPL, but it will be like a bridge on the shared
>> memory between the virtio-net PMD and the DPDK vhost backend.
>> Actually, it will work under the virtio-net PMD, but we don't need to link
>> against it, so I guess we don't have a GPL license issue.
>>
>> Step 2. Fix the PCI scan code of the EAL to scan dummy devices.
>> - To scan the above files, extend pci_scan() of the EAL.
>>
>> Step 3. Add a new kdrv type to the EAL.
>> - To handle 'dummy-virtio-net-driver', add a new kdrv type to the EAL.
>>
>> Step 4. Implement pci_dummy_virtio_net_map/unmap().
>> - It will have almost the same functionality as pci_uio_map(), but for
>> the dummy virtio-net device.
>> - The dummy device will be mmaped using the path specified in '<pci
>> addr>/resource_path'.
>>
>> Step 5. Add a new compile option for the virtio-net device to replace IO
>> functions.
>> - The IO functions of the virtio-net PMD will be replaced by read() and
>> write() accesses to the shared memory.
>> - Add a notification mechanism to the IO functions. This will be used when
>> a write() to the shared memory is done.
>> (Not sure exactly, but we probably need it.)
>>
>> Does it make sense?
>> I guess Steps 1 and 2 are different from your approach, but the rest might
>> be similar.
>>
>> Actually, we just need sysfs entries for a virtio-net dummy device, but
>> so far I don't have a good way to register them from user space without
>> loading a kernel module.
> Tetsuya:
> I don't quite get the details. Who will create those sysfs entries? A
> kernel module, right?
Hi Xie,
I don't create sysfs entries. I just create a directory that contains
files that look like sysfs entries,
and initialize the EAL with not only sysfs but also the above directory.
In the quoted last sentence, I wanted to say that we just need files that
look like sysfs entries.
But I don't know a good way to create files under sysfs without loading a
kernel module.
This is why I try to create the additional directory.
> The virtio-net device is configured through reads/writes to the shared
> memory (between host and guest), right?
Yes, I agree.
> Where are the shared vring and shared memory created? On a shared huge
> page between host and guest?
The virtqueues (vrings) are on the guest hugepage.
Let me explain.
The guest container should have read/write access to a part of the hugepage
directory on the host.
(For example, /mnt/huge/container1/ is shared between host and guest.)
Also, host and guest need to communicate through a unix domain socket.
(For example, host and guest can communicate using
"/tmp/container1/sock".)
If we can do the above, a virtio-net PMD on the guest can create
virtqueues (vrings) on its hugepage, and write their information to a
pseudo virtio-net device, which is a process created in the guest container.
Then the pseudo virtio-net device sends the information to the vhost-user
backend (a host DPDK application) through the unix domain socket.
So with my plan, there are 3 processes:
the DPDK applications on host and guest, plus a process that works like a
virtio-net device.
> Who will talk to dpdkvhost?
If we need to talk to a cuse device or the vhost-net kernel module, the
pseudo virtio-net device above could do that.
(But so far, my target is only vhost-user.)
>> This is because I need to change pci_scan() also.
>>
>> It seems you have implemented a virtio-net pseudo device under the BSD
>> license.
>> If so, it would be nice for this kind of PMD to use it.
> Currently it is based on native linux kvm tool.
Great, I hadn't noticed this option.
>> In case it takes much time to implement some missing
>> functionalities like interrupt mode, using the QEMU code might be one of
>> the options.
> For interrupt mode, I plan to use eventfd for sleep/wake; I have not tried
> it yet.
>> Anyway, we just need a good virtual NIC between containers and the host.
>> So we don't insist on our approach and implementation.
> Do you have comments on my implementation?
> We could publish the version without the device framework first for
> reference.
No, I don't. Could you please share it?
I am looking forward to seeing it.
Tetsuya