From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ann.zhuangyanying@huawei.com>
Received: from szxga02-in.huawei.com (szxga02-in.huawei.com [119.145.14.65])
 by dpdk.org (Postfix) with ESMTP id BE126567E
 for <dev@dpdk.org>; Tue, 24 Nov 2015 04:53:09 +0100 (CET)
Received: from 172.24.1.48 (EHLO SZXEMA412-HUB.china.huawei.com)
 ([172.24.1.48])
 by szxrg02-dlp.huawei.com (MOS 4.3.7-GA FastPath queued)
 with ESMTP id CWT92086; Tue, 24 Nov 2015 11:53:06 +0800 (CST)
Received: from SZXEMA502-MBX.china.huawei.com ([169.254.3.230]) by
 SZXEMA412-HUB.china.huawei.com ([10.82.72.71]) with mapi id 14.03.0235.001;
 Tue, 24 Nov 2015 11:53:00 +0800
From: Zhuangyanying <ann.zhuangyanying@huawei.com>
To: Jianfeng Tan <jianfeng.tan@intel.com>, "dev@dpdk.org" <dev@dpdk.org>
Thread-Topic: [RFC 0/5] virtio support for container
Thread-Index: AQHRGDLlgW++w4c2a0WI/Z/rSJv22J6qpLhg
Date: Tue, 24 Nov 2015 03:53:00 +0000
Message-ID: <EC9759BC1E3E98429B5DE9A03DF86D8B592FAF25@SZXEMA502-MBX.china.huawei.com>
References: <1446748276-132087-1-git-send-email-jianfeng.tan@intel.com>
In-Reply-To: <1446748276-132087-1-git-send-email-jianfeng.tan@intel.com>
Accept-Language: zh-CN, en-US
Content-Language: zh-CN
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.177.21.2]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-CFilter-Loop: Reflected
X-Mirapoint-Virus-RAPID-Raw: score=unknown(0),
 refid=str=0001.0A020201.5653DF23.00B5, ss=1, re=0.000, recu=0.000, reip=0.000,
 cl=1, cld=1, fgs=0, ip=169.254.3.230,
 so=2013-06-18 04:22:30, dmn=2013-03-21 17:37:32
X-Mirapoint-Loop-Id: 1940705ed36e013c8607676c0e0e3a94
Cc: "nakajima.yoshihiro@lab.ntt.co.jp" <nakajima.yoshihiro@lab.ntt.co.jp>,
 Zhbzg <zhbzg@huawei.com>, "mst@redhat.com" <mst@redhat.com>,
 gaoxiaoqiu <gaoxiaoqiu@huawei.com>,
 "Zhangbo \(Oscar\)" <oscar.zhangbo@huawei.com>,
 Zhoujingbin <zhoujingbin@huawei.com>, Guohongzhen <guohongzhen@huawei.com>
Subject: Re: [dpdk-dev] [RFC 0/5] virtio support for container
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Nov 2015 03:53:11 -0000



> -----Original Message-----
> From: Jianfeng Tan [mailto:jianfeng.tan@intel.com]
> Sent: Friday, November 06, 2015 2:31 AM
> To: dev@dpdk.org
> Cc: mst@redhat.com; mukawa@igel.co.jp; nakajima.yoshihiro@lab.ntt.co.jp;
> michael.qiu@intel.com; Guohongzhen; Zhoujingbin; Zhuangyanying; Zhangbo
> (Oscar); gaoxiaoqiu; Zhbzg; huawei.xie@intel.com; Jianfeng Tan
> Subject: [RFC 0/5] virtio support for container
>
> This patchset only acts as a PoC to request the community for comments.
>
> This patchset is to provide a high-performance networking interface
> (virtio) for container-based DPDK applications. Starting DPDK
> applications in containers with exclusive ownership of NIC devices is
> beyond its scope. The basic idea here is to present a new virtual device
> (named eth_cvio), which can be discovered and initialized in a
> container-based DPDK application's rte_eal_init(). To minimize the
> change, we reuse the already-existing virtio frontend driver code
> (drivers/net/virtio/).
>
> Compared to the QEMU/VM case, the virtio device framework (which
> translates I/O port r/w operations into the unix socket/cuse protocol,
> and is originally provided in QEMU) is integrated into the virtio
> frontend driver. In other words, this new converged driver plays both
> the role of the original frontend driver and the role of the QEMU
> device framework.
>
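To make the "converged driver" point concrete, here is a rough standalone
sketch (invented names, not the patch's actual code) of what hiding ioport
r/w behind an ops table could look like:

	#include <stdint.h>

	/* One ops table per device; eth_cvio would plug in handlers that talk
	 * to the vhost backend, while the PCI path keeps real ioport access. */
	struct vdev_io_ops {
		uint32_t (*read)(void *dev, uint64_t off, int len);
		void     (*write)(void *dev, uint64_t off, uint32_t val, int len);
		void     (*notify)(void *dev, uint16_t queue_id);
	};

	/* For eth_cvio, notify() would end up writing to the queue's kick fd
	 * shared with the backend, instead of doing an outb() to the NIC. */
	static inline void
	vdev_notify_queue(const struct vdev_io_ops *ops, void *dev, uint16_t qid)
	{
		ops->notify(dev, qid);
	}
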
> The biggest difference here lies in how to calculate the relative address
> for the backend. The principle of virtio is: based on one or multiple
> shared memory segments, vhost maintains a reference table with the base
> address and length of each segment, so that when an address comes from the
> VM (usually a GPA, Guest Physical Address), vhost can translate it into an
> address it can dereference itself (aka VVA, Vhost Virtual Address). To
> decrease the overhead of address translation, we should maintain as few
> segments as possible. In the context of virtual machines, GPA is always
> locally contiguous, so it's a good choice. In the container case, the CVA
> (Container Virtual Address) can be used instead. This means that:
> a. when set_base_addr is done, the CVA is used;
> b. when preparing RX descriptors, the CVA is used;
> c. when transmitting packets, the CVA is filled into TX descriptors;
> d. in the TX and CQ headers, the CVA is used.
>
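To make the translation concrete, a minimal standalone sketch of the
region-based lookup on the vhost side (struct and field names are
illustrative, not the real rte_vhost definitions). The key is a GPA for a VM
and simply the CVA of the shared segment for a container, but the lookup
logic is identical, and fewer regions means a shorter scan per address:

	#include <stddef.h>
	#include <stdint.h>

	struct mem_region {
		uint64_t guest_phys_addr;   /* GPA for a VM, CVA for a container */
		uint64_t host_user_addr;    /* where vhost mmap'ed this segment  */
		uint64_t size;
	};

	/* Translate a frontend address into one vhost can dereference (VVA). */
	static void *
	to_vhost_va(const struct mem_region *regions, int nregions, uint64_t addr)
	{
		int i;

		for (i = 0; i < nregions; i++) {
			const struct mem_region *r = &regions[i];

			if (addr >= r->guest_phys_addr &&
			    addr <  r->guest_phys_addr + r->size)
				return (void *)(uintptr_t)
					(r->host_user_addr + (addr - r->guest_phys_addr));
		}
		return NULL; /* not covered by any shared region */
	}
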
> How to share memory? In the VM case, qemu always shares the whole physical
> memory layout with the backend. But it's not feasible for a container, as a
> process, to share all of its virtual memory regions with the backend. So
> only specific virtual memory regions (of shared type) are sent to the
> backend. This leads to a limitation: only addresses in these areas can be
> used to transmit or receive packets. For now, the shared memory is created
> in /dev/shm using shm_open() during the memory initialization process.
>
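For readers who have not seen the /dev/shm trick: I assume it is essentially
the usual shm_open() + ftruncate() + mmap(MAP_SHARED) sequence, so that the
segment has an fd which can be handed to the backend over the unix socket
(the function name below is made up, not the patch's actual code):

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* Create a shareable segment; the fd (not a pointer) is what later
	 * gets passed to vhost, and only memory inside such segments can be
	 * used for packet buffers. */
	static void *
	cvio_alloc_shared(const char *name, size_t len, int *out_fd)
	{
		void *va;
		int fd = shm_open(name, O_CREAT | O_RDWR, 0600);

		if (fd < 0)
			return NULL;
		if (ftruncate(fd, len) < 0) {
			close(fd);
			return NULL;
		}
		va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (va == MAP_FAILED) {
			close(fd);
			return NULL;
		}
		*out_fd = fd;
		return va;
	}
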
> How to use?
>
> a. Apply the patch of virtio for container. We need two copies of the
> patched code (referred to as dpdk-app/ and dpdk-vhost/).
>
> b. To compile container apps:
> $: cd dpdk-app
> $: vim config/common_linuxapp (uncomment "CONFIG_RTE_VIRTIO_VDEV=y")
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>
> c. To build a docker image, use the Dockerfile below.
> $: cat ./Dockerfile
> FROM ubuntu:latest
> WORKDIR /usr/src/dpdk
> COPY . /usr/src/dpdk
> CMD ["/usr/src/dpdk/examples/l2fwd/build/l2fwd", "-c", "0xc", "-n", "4",
> "--no-huge", "--no-pci",
> "--vdev=eth_cvio0,queue_num=256,rx=1,tx=1,cq=0,path=/var/run/usvhost",
> "--", "-p", "0x1"]
> $: docker build -t dpdk-app-l2fwd .
>
> d. To compile vhost:
> $: cd dpdk-vhost
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>
> e. Start vhost-switch:
> $: ./examples/vhost/build/vhost-switch -c 3 -n 4 --socket-mem 1024,1024 -- -p 0x1 --stats 1
>
> f. Start docker:
> $: docker run -i -t -v <path to vhost unix socket>:/var/run/usvhost dpdk-app-l2fwd
>
> Signed-off-by: Huawei Xie <huawei.xie@intel.com>
> Signed-off-by: Jianfeng Tan <jianfeng.tan@intel.com>
>
> Jianfeng Tan (5):
>   virtio/container: add handler for ioport rd/wr
>   virtio/container: add a new virtual device named eth_cvio
>   virtio/container: unify desc->addr assignment
>   virtio/container: adjust memory initialization process
>   vhost/container: change mode of vhost listening socket
>
>  config/common_linuxapp                       |   5 +
>  drivers/net/virtio/Makefile                  |   4 +
>  drivers/net/virtio/vhost-user.c              | 433 +++++++++++++++++++++++++++
>  drivers/net/virtio/vhost-user.h              | 137 +++++++++
>  drivers/net/virtio/virtio_ethdev.c           | 319 +++++++++++++++-----
>  drivers/net/virtio/virtio_ethdev.h           |  16 +
>  drivers/net/virtio/virtio_pci.h              |  32 +-
>  drivers/net/virtio/virtio_rxtx.c             |   9 +-
>  drivers/net/virtio/virtio_rxtx_simple.c      |   9 +-
>  drivers/net/virtio/virtqueue.h               |   9 +-
>  lib/librte_eal/common/include/rte_memory.h   |   5 +
>  lib/librte_eal/linuxapp/eal/eal_memory.c     |  58 +++-
>  lib/librte_mempool/rte_mempool.c             |  16 +-
>  lib/librte_vhost/vhost_user/vhost-net-user.c |   5 +
>  14 files changed, 967 insertions(+), 90 deletions(-)
>  create mode 100644 drivers/net/virtio/vhost-user.c
>  create mode 100644 drivers/net/virtio/vhost-user.h
>
> --
> 2.1.4

This patch raises a good idea: adding an extra abstracted IO layer would make
it simple to extend the function to a kernel-mode switch (such as OVS). That's
great.
But I have one question here:
    It's the issue of VHOST_USER_SET_MEM_TABLE. You allocate memory from a
    tmpfs filesystem, so there is just one fd, and rte_memseg_info_get() can
    be used to get the memory topology directly. However, things change in
    the kernel-space case, because the mempool has to be created on each
    container's hugetlbfs (rather than tmpfs), which is separate per
    container; and the ioctl's parameter also has to be taken into account.
    My solution is as follows, for your reference:
	/* Region 0: covers the RX queue's mempool element area. */
	reg = mem->regions;
	reg->guest_phys_addr = (__u64)((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_start;
	reg->userspace_addr = reg->guest_phys_addr;
	reg->memory_size = ((struct virtqueue *)(dev->data->rx_queues[0]))->mpool->elt_va_end -
			   reg->guest_phys_addr;

	/* Region 1: covers the TX queue's virtio-net header area. */
	reg = mem->regions + 1;
	reg->guest_phys_addr = (__u64)(((struct virtqueue *)(dev->data->tx_queues[0]))->virtio_net_hdr_mem);
	reg->userspace_addr = reg->guest_phys_addr;
	reg->memory_size = vq_size * internals->vtnet_hdr_size;
    But it's a little ugly. Any better idea?