From: Tetsuya Mukawa <mukawa@igel.co.jp>
To: "Xie, Huawei" <huawei.xie@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Cc: "nakajima.yoshihiro@lab.ntt.co.jp"
	<nakajima.yoshihiro@lab.ntt.co.jp>,
	"zhbzg@huawei.com" <zhbzg@huawei.com>,
	"gaoxiaoqiu@huawei.com" <gaoxiaoqiu@huawei.com>,
	"oscar.zhangbo@huawei.com" <oscar.zhangbo@huawei.com>,
	Zhuangyanying <ann.zhuangyanying@huawei.com>,
	"zhoujingbin@huawei.com" <zhoujingbin@huawei.com>,
	"guohongzhen@huawei.com" <guohongzhen@huawei.com>
Subject: Re: [dpdk-dev] vhost compliant virtio based networking interface in container
Date: Tue, 08 Sep 2015 13:44:50 +0900	[thread overview]
Message-ID: <55EE67C2.5040301@igel.co.jp> (raw)
In-Reply-To: <C37D651A908B024F974696C65296B57B2BDBDDCD@SHSMSX101.ccr.corp.intel.com>

On 2015/09/07 14:54, Xie, Huawei wrote:
> On 8/26/2015 5:23 PM, Tetsuya Mukawa wrote:
>> On 2015/08/25 18:56, Xie, Huawei wrote:
>>> On 8/25/2015 10:59 AM, Tetsuya Mukawa wrote:
>>>> Hi Xie and Yanping,
>>>>
>>>>
>>>> May I ask you some questions?
>>>> It seems we are also developing an almost same one.
>>> Good to know that we are tackling the same problem and have the similar
>>> idea.
>>> What is your status now? We had the POC running, and compliant with
>>> dpdkvhost.
>>> Interrupt like notification isn't supported.
>> We implemented vhost PMD first, so we just start implementing it.
>>
>>>> On 2015/08/20 19:14, Xie, Huawei wrote:
>>>>> Added dev@dpdk.org
>>>>>
>>>>> On 8/20/2015 6:04 PM, Xie, Huawei wrote:
>>>>>> Yanping:
>>>>>> I read your mail, seems what we did are quite similar. Here i wrote a
>>>>>> quick mail to describe our design. Let me know if it is the same thing.
>>>>>>
>>>>>> Problem Statement:
>>>>>> We don't have a high-performance networking interface in containers for
>>>>>> NFV. The current veth-pair-based interface cannot easily be accelerated.
>>>>>>
>>>>>> The key components involved:
>>>>>>     1.    DPDK-based virtio PMD driver in the container.
>>>>>>     2.    Device simulation framework in the container.
>>>>>>     3.    DPDK (or kernel) vhost running on the host.
>>>>>>
>>>>>> How is virtio created?
>>>>>> A: There is no "real" virtio-pci device in the container environment.
>>>>>> 1) The host maintains pools of memory and shares memory with the
>>>>>> container. This can be accomplished by the host sharing a hugepage file
>>>>>> with the container.
>>>>>> 2) The container creates virtio rings on the shared memory.
>>>>>> 3) The container creates mbuf memory pools on the shared memory.
>>>>>> 4) The container sends the memory and vring information to vhost through
>>>>>> vhost messages. This can be done either through an ioctl call or a
>>>>>> vhost-user message.
>>>>>>
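Just to check my understanding of steps 1)-3) quoted above, here is a
minimal, untested sketch of what I think the container side does: map a
hugepage file shared by the host and carve a vring area out of it. The
path, sizes and offsets are made-up examples, not from either
implementation.

/* Sketch: container maps a host-shared hugepage file and lays out a
 * vring in it. Path, size and offsets are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHARED_HUGEPAGE "/mnt/huge/container1/shared0" /* shared by host */
#define SHARED_SIZE     (2UL * 1024 * 1024)            /* one 2MB hugepage */

int main(void)
{
	int fd = open(SHARED_HUGEPAGE, O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *base = mmap(NULL, SHARED_SIZE, PROT_READ | PROT_WRITE,
			  MAP_SHARED, fd, 0);
	if (base == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* Carve the vring and mbuf pool out of the shared area; these
	 * offsets are what gets reported to vhost in step 4). */
	uint64_t desc_off = 0, avail_off = 4096, used_off = 8192;
	printf("vring offsets: desc=%lu avail=%lu used=%lu (base %p)\n",
	       (unsigned long)desc_off, (unsigned long)avail_off,
	       (unsigned long)used_off, base);

	munmap(base, SHARED_SIZE);
	close(fd);
	return 0;
}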
>>>>>> How is the vhost message sent?
>>>>>> A: There are two alternative ways to do this.
>>>>>> 1) The customized virtio PMD is responsible for all vring creation
>>>>>> and vhost message sending.
>>>> Above is our approach so far.
>>>> It seems Yanping also takes this kind of approach.
>>>> We are using vhost-user functionality instead of using the vhost-net
>>>> kernel module.
>>>> Probably this is the difference between Yanping and us.
>>> In my current implementation, the device simulation layer talks to "user
>>> space" vhost through the cuse interface. It could also be done through the
>>> vhost-user socket; this isn't the key point.
>>> Here vhost-user is kind of confusing; maybe "user space vhost" is more
>>> accurate, covering either cuse or a unix domain socket. :)
>>>
>>> As for Yanping, they are now connecting to the vhost-net kernel module, but
>>> they are also trying to connect to "user space" vhost. Correct me if I am
>>> wrong. Yes, there is some difference between these two. The vhost-net kernel
>>> module can directly access another process's memory, while with
>>> vhost-user (cuse/user) we need to do the memory mapping.
>>>> BTW, we are going to submit a vhost PMD for DPDK 2.2.
>>>> This PMD is implemented on top of librte_vhost.
>>>> It allows a DPDK application to handle a vhost-user (cuse) backend as a
>>>> normal NIC port.
>>>> This PMD should work with both Xie's and Yanping's approaches.
>>>> (In the case of Yanping's approach, we may need vhost-cuse.)
>>>>
>>>>>> 2) We could do this through a lightweight device simulation framework.
>>>>>>     The device simulation creates a simple PCI bus. On the PCI bus,
>>>>>> virtio-net PCI devices are created. The device simulation provides an
>>>>>> IO API for MMIO/IO access.
>>>> Does it mean you implemented a kernel module?
>>>> If so, do you still need vhost-cuse functionality to handle vhost
>>>> messages in userspace?
>>> The device simulation is a library running in user space in the container.
>>> It is linked with the DPDK app. It creates pseudo buses and virtio-net PCI
>>> devices.
>>> The virtio-container-PMD configures the virtio-net pseudo devices
>>> through the IO API provided by the device simulation rather than through IO
>>> instructions as in KVM.
>>> Why do we use device simulation?
>>> We can create other virtio devices in the container, and provide a common
>>> way to talk to the vhost-xx module.
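Just so I am sure I understand the shape of that IO API, I imagine
something conceptually like the following (my guess only, with hypothetical
names, not your actual framework code): the PMD calls these callbacks
instead of issuing port IO instructions, and the device simulation routes
them to the emulated virtio-net device.

/* Hypothetical IO API shape (a guess, not the real framework). */
#include <stddef.h>
#include <stdint.h>

struct sim_pci_dev; /* emulated virtio-net PCI device */

struct sim_io_ops {
	uint32_t (*io_read)(struct sim_pci_dev *dev, uint64_t off, size_t len);
	void (*io_write)(struct sim_pci_dev *dev, uint64_t off,
			 uint32_t val, size_t len);
};

/* Example: a legacy virtio register access done through the API.
 * 0x12 is VIRTIO_PCI_STATUS in the legacy layout, shown for illustration. */
static inline uint8_t
virtio_get_status(struct sim_pci_dev *dev, const struct sim_io_ops *ops)
{
	return (uint8_t)ops->io_read(dev, 0x12, 1);
}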
>> Thanks for the explanation.
>> At first reading, I thought the difference between approach 1 and
>> approach 2 was whether we need to implement a new kernel module or not,
>> but now I understand how you implemented it.
>>
>> Please let me explain our design in more detail.
>> We might use a somewhat similar approach to handle a pseudo virtio-net
>> device in DPDK.
>> (We haven't finished implementing it yet, so this overview might have
>> some technical problems.)
>>
>> Step1. Separate the virtio-net and vhost-user socket related code from QEMU,
>> then implement it as a separate program.
>> The program also has the features below.
>>  - Create a directory that contains almost the same files as
>> /sys/bus/pci/device/<pci address>/*
>>    (To scan these files located outside sysfs, we need to fix EAL.)
>>  - This dummy device is driven by 'dummy-virtio-net-driver'. This name is
>> specified in the '<pci addr>/driver' file.
>>  - Create a shared file that represents the PCI configuration space, mmap
>> it, and specify its path in '<pci addr>/resource_path'.
>>
>> The program will be GPL, but it will act like a bridge on the shared
>> memory between the virtio-net PMD and the DPDK vhost backend.
>> It will work under the virtio-net PMD, but we don't need to link against
>> it, so I guess we don't have a GPL license issue.
>>
>> Step2. Fix the PCI scan code of EAL to scan dummy devices.
>>  - To scan the above files, extend pci_scan() of EAL.
>>
>> Step3. Add a new kdrv type to EAL.
>>  - To handle the 'dummy-virtio-net-driver', add a new kdrv type to EAL.
>>
>> Step4. Implement pci_dummy_virtio_net_map/unmap().
>>  - It will have almost the same functionality as pci_uio_map(), but for the
>> dummy virtio-net device.
>>  - The dummy device will be mmapped using the path specified in '<pci
>> addr>/resource_path'.
>>
>> Step5. Add a new compile option for the virtio-net device to replace its IO
>> functions.
>>  - The IO functions of the virtio-net PMD will be replaced by read() and
>> write() accesses to the shared memory.
>>  - Add a notification mechanism to the IO functions. This will be used when
>> a write() to the shared memory is done.
>>  (Not sure exactly, but we probably need it.)
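To give a clearer idea of what I mean by Step4 and Step5 above, a rough,
untested sketch follows. Nothing here exists yet; the function names and
the resource_path layout are placeholders from the plan, not working code.

/* Step4 sketch: map the shared file named in '<pci addr>/resource_path',
 * similar in spirit to what pci_uio_map() does for real UIO devices. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
pci_dummy_virtio_net_map(const char *resource_path, size_t len)
{
	int fd = open(resource_path, O_RDWR);
	void *addr;

	if (fd < 0)
		return NULL;
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	return addr == MAP_FAILED ? NULL : addr;
}

/* Step5 sketch: the PMD's IO accessors become plain reads/writes on the
 * mapping, plus a notification so the peer process notices writes. */
static uint32_t
dummy_virtio_ioport_read(void *base, uint64_t off)
{
	uint32_t val;

	memcpy(&val, (uint8_t *)base + off, sizeof(val));
	return val;
}

static void
dummy_virtio_ioport_write(void *base, uint64_t off, uint32_t val)
{
	memcpy((uint8_t *)base + off, &val, sizeof(val));
	/* TODO: kick the pseudo virtio-net device process here
	 * (eventfd, unix socket, etc.). */
}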
>>
>> Does it make sense?
>> I guess Step1 and Step2 are different from your approach, but the rest might
>> be similar.
>>
>> Actually, we just need sysfs entries for a virtio-net dummy device, but so
>> far I don't have a good way to register them from user space without
>> loading a kernel module.
> Tetsuya:
> I don't quite get the details. Who will create those sysfs entries? A
> kernel module, right?

Hi Xie,

I don't create sysfs entries. I just create a directory that contains files
that look like sysfs entries, and I initialize EAL with not only sysfs but
also this additional directory.

In the last quoted sentence, I meant that we only need files that look like
sysfs entries. Since I don't know a good way to create files under sysfs
without loading a kernel module, I create the additional directory instead.

> The virtio-net device is configured through reads/writes to shared
> memory (between host and guest), right?

Yes, I agree.

> Where are the shared vring and shared memory created, on a shared hugepage
> between host and guest?

The virtqueues (vrings) are on the guest hugepage.

Let me explain.
The guest container should have read/write access to a part of the hugepage
directory on the host.
(For example, /mnt/huge/container1/ is shared between host and guest.)
The host and guest also need to communicate through a unix domain socket.
(For example, host and guest can communicate using "/tmp/container1/sock".)

If we can do the above, a virtio-net PMD in the guest can create
virtqueues (vrings) on its hugepage and write their information to a
pseudo virtio-net device, which is a process created in the guest container.
The pseudo virtio-net device then sends this information to the vhost-user
backend (the host DPDK application) through the unix domain socket.

So in my plan there are three processes: the DPDK applications on the host
and in the guest, plus a process that works like the virtio-net device.
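To illustrate the role of that unix domain socket, here is a minimal,
untested sketch of how the pseudo virtio-net device process could hand the
shared hugepage fd to the vhost-user backend. This only shows generic
SCM_RIGHTS fd passing; the real vhost-user messages of course have their
own header and payload format.

/* Sketch: send the shared-memory fd over the unix domain socket. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int
send_mem_fd(int sock, int mem_fd, const void *payload, size_t payload_len)
{
	struct iovec iov = {
		.iov_base = (void *)payload,
		.iov_len = payload_len,
	};
	char cmsgbuf[CMSG_SPACE(sizeof(int))];
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = cmsgbuf,
		.msg_controllen = sizeof(cmsgbuf),
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* the fd itself travels here */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &mem_fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}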

> Who will talk to dpdkvhost?

If we need to talk to a cuse device or the vhost-net kernel module, the
pseudo virtio-net device described above could do that.
(But so far, my target is only vhost-user.)

>> This is because I need to change pci_scan() as well.
>>
>> It seems you have implemented a virtio-net pseudo device under the BSD
>> license. If so, it would be nice for this kind of PMD to use it.
> Currently it is based on the native Linux KVM tool.

Great, I hadn't noticed this option.

>> If it takes a long time to implement some missing functionality like
>> interrupt mode, using QEMU code might be one of the options.
> For interrupt mode, I plan to use eventfd for sleep/wake; I have not tried
> it yet.
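For reference, the eventfd-based sleep/wake I would imagine is roughly the
following (an untested sketch, not from either implementation): the RX side
blocks in read() until the TX side writes to the same eventfd as a kick.

#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	int kick = eventfd(0, 0);	/* shared between the two sides */
	uint64_t one = 1, cnt;

	if (kick < 0)
		return 1;

	/* TX side: notify that new descriptors are available. */
	write(kick, &one, sizeof(one));

	/* RX side: sleep until kicked; read() blocks while the counter
	 * is zero, then drains it. */
	read(kick, &cnt, sizeof(cnt));

	close(kick);
	return 0;
}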
>> Anyway, we just need a good virtual NIC between containers and the host,
>> so we are not tied to our approach and implementation.
> Do you have comments on my implementation?
> We could publish the version without the device framework first, for
> reference.

No, I don't. Could you please share it?
I am looking forward to seeing it.

Tetsuya
