From: "Qiu, Michael"
To: "Tan, Jianfeng", "dev@dpdk.org"
Cc: "nakajima.yoshihiro@lab.ntt.co.jp", "mst@redhat.com",
 "ann.zhuangyanying@huawei.com"
Subject: Re: [dpdk-dev] [PATCH 0/4] virtio support for container
Date: Tue, 26 Jan 2016 06:02:03 +0000
Message-ID: <533710CFB86FA344BFBF2D6802E6028622F226FD@SHSMSX101.ccr.corp.intel.com>
References: <1446748276-132087-1-git-send-email-jianfeng.tan@intel.com>
 <1452426182-86851-1-git-send-email-jianfeng.tan@intel.com>

On 1/11/2016 2:43 AM, Tan, Jianfeng wrote:
> This patchset provides a high performance networking interface
> (virtio) for container-based DPDK applications. Starting DPDK apps in
> containers with exclusive ownership of NIC devices is beyond its
> scope. The basic idea is to present a new virtual device (named
> eth_cvio), which can be discovered and initialized in container-based
> DPDK apps using rte_eal_init(). To minimize the change, we reuse the
> already-existing virtio frontend driver code (drivers/net/virtio/).
>
> Compared to the QEMU/VM case, the virtio device framework (which
> translates I/O port r/w operations into the unix socket/cuse protocol,
> and is originally provided by QEMU) is integrated into the virtio
> frontend driver. So this converged driver plays both the role of the
> original frontend driver and the role of the QEMU device framework.
>
> The major difference lies in how to calculate the relative address for
> vhost. The principle of virtio is: based on one or multiple shared
> memory segments, vhost maintains a reference table with the base
> address and length of each segment, so that an address coming from the
> VM (usually a GPA, Guest Physical Address) can be translated into a
> vhost-recognizable address (named VVA, Vhost Virtual Address). To
> decrease the overhead of address translation, we should maintain as
> few segments as possible. In the VM's case, the GPA space is always
> locally continuous. In the container's case, the CVA (Container
> Virtual Address) can be used instead. Specifically:
> a. when set_base_addr is called, the CVA is used;
> b. when preparing RX descriptors, the CVA is used;
> c. when transmitting packets, the CVA is filled into TX descriptors;
> d. in the TX and CQ headers, the CVA is used.
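
For reference, the translation described above comes down to a lookup
in a small table of regions. A minimal sketch in C, using hypothetical
structure and function names rather than the patch's actual code:

    #include <stdint.h>
    #include <stddef.h>

    /* One shared segment as registered with vhost: its base address in
     * the guest/container address space (GPA or CVA), its length, and
     * where vhost itself mapped it (the VVA base). */
    struct mem_region {
            uint64_t gpa_or_cva;  /* base announced by the frontend */
            uint64_t size;        /* length of the segment */
            uint64_t vva;         /* vhost's own mapping of the segment */
    };

    /* Translate a frontend address into a vhost virtual address, or
     * return 0 if it falls outside every shared region (and therefore
     * cannot be used for packet buffers). */
    static uint64_t
    to_vva(const struct mem_region *reg, size_t nregions, uint64_t addr)
    {
            size_t i;

            for (i = 0; i < nregions; i++) {
                    if (addr >= reg[i].gpa_or_cva &&
                        addr < reg[i].gpa_or_cva + reg[i].size)
                            return reg[i].vva + (addr - reg[i].gpa_or_cva);
            }
            return 0;
    }

The fewer regions there are, the cheaper this lookup is, which is
presumably one motivation for the --single-file option this series adds.
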
>
> How to share memory? In the VM's case, qemu always shares the whole
> physical memory layout with the backend. But it's not feasible for a
> container, as a process, to share all of its virtual memory regions
> with the backend. So only specified virtual memory regions (of shared
> type) are sent to the backend. It's a limitation that only addresses
> in these areas can be used to transmit or receive packets.
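
The cover letter does not show how such a region actually reaches the
backend; with vhost-user the usual mechanism is to mmap() a file-backed
(e.g. hugetlbfs) segment with MAP_SHARED and pass its file descriptor
to the backend over the unix socket as SCM_RIGHTS ancillary data. A
rough, illustrative sketch (paths and function names are placeholders,
not the patch's actual code):

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Map one shareable segment backed by a (hugetlbfs) file. */
    static void *
    map_shared_segment(const char *path, size_t size, int *fdp)
    {
            void *addr;
            int fd = open(path, O_CREAT | O_RDWR, 0600);

            if (fd < 0)
                    return NULL;
            if (ftruncate(fd, (off_t)size) < 0) {
                    close(fd);
                    return NULL;
            }
            addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                    close(fd);
                    return NULL;
            }
            *fdp = fd;      /* this fd is what the backend needs */
            return addr;
    }

    /* Hand the descriptor to the backend over an already-connected
     * unix socket; the payload would normally describe the region's
     * base address and size. */
    static int
    send_region_fd(int sock, int fd, const void *msg, size_t len)
    {
            char cbuf[CMSG_SPACE(sizeof(int))];
            struct iovec iov = { .iov_base = (void *)msg, .iov_len = len };
            struct msghdr mh = {
                    .msg_iov = &iov, .msg_iovlen = 1,
                    .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
            };
            struct cmsghdr *cm;

            memset(cbuf, 0, sizeof(cbuf));
            cm = CMSG_FIRSTHDR(&mh);
            cm->cmsg_level = SOL_SOCKET;
            cm->cmsg_type = SCM_RIGHTS;
            cm->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(cm), &fd, sizeof(int));
            return sendmsg(sock, &mh, 0) < 0 ? -1 : 0;
    }

Only buffers inside such mapped-and-shared areas are visible to the
backend, which is exactly the limitation noted above.
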
>
> Known issues:
>
> a. When used with vhost-net, root privilege is required to create the
>    tap device inside the container.
> b. Control queue and multi-queue are not supported yet.
> c. When the --single-file option is used, the socket_id of the memory
>    may be wrong. (Use "numactl -N x -m x" to work around this for now.)
>
> How to use?
>
> a. Apply this patchset.
>
> b. To compile container apps:
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>
> c. To build a docker image, use the Dockerfile below:
> $: cat ./Dockerfile
> FROM ubuntu:latest
> WORKDIR /usr/src/dpdk
> COPY . /usr/src/dpdk
> ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
> $: docker build -t dpdk-app-l2fwd .
>
> d. Used with vhost-user:
> $: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
>        --socket-mem 1024,1024 -- -p 0x1 --stats 1
> $: docker run -i -t -v :/var/run/usvhost \
>        -v /dev/hugepages:/dev/hugepages \
>        dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>        --vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
>
> e. Used with vhost-net:
> $: modprobe vhost
> $: modprobe vhost-net
> $: docker run -i -t --privileged \
>        -v /dev/vhost-net:/dev/vhost-net \
>        -v /dev/net/tun:/dev/net/tun \
>        -v /dev/hugepages:/dev/hugepages \
>        dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>        --vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1

We'd better add an ifname argument, like
--vdev=eth_cvio0,path=/dev/vhost-net,ifname=tap0, so that the user can
add the tap to the bridge first (see the sketch at the end of this
mail).

Thanks,
Michael

> By the way, it's not necessary to run in a container.
>
> Signed-off-by: Huawei Xie
> Signed-off-by: Jianfeng Tan
>
> Jianfeng Tan (4):
>   mem: add --single-file to create single mem-backed file
>   mem: add API to obtain memory-backed file info
>   virtio/vdev: add ways to interact with vhost
>   virtio/vdev: add a new vdev named eth_cvio
>
>  config/common_linuxapp                     |   5 +
>  drivers/net/virtio/Makefile                |   4 +
>  drivers/net/virtio/vhost.c                 | 734 +++++++++++++++++++++++++++++
>  drivers/net/virtio/vhost.h                 | 192 ++++++++
>  drivers/net/virtio/virtio_ethdev.c         | 338 ++++++++++---
>  drivers/net/virtio/virtio_ethdev.h         |   4 +
>  drivers/net/virtio/virtio_pci.h            |  52 +-
>  drivers/net/virtio/virtio_rxtx.c           |  11 +-
>  drivers/net/virtio/virtio_rxtx_simple.c    |  14 +-
>  drivers/net/virtio/virtqueue.h             |  13 +-
>  lib/librte_eal/common/eal_common_options.c |  17 +
>  lib/librte_eal/common/eal_internal_cfg.h   |   1 +
>  lib/librte_eal/common/eal_options.h        |   2 +
>  lib/librte_eal/common/include/rte_memory.h |  16 +
>  lib/librte_eal/linuxapp/eal/eal_memory.c   |  82 +++-
>  15 files changed, 1392 insertions(+), 93 deletions(-)
>  create mode 100644 drivers/net/virtio/vhost.c
>  create mode 100644 drivers/net/virtio/vhost.h
>
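
On the ifname suggestion: the standard TUN/TAP interface already allows
attaching to a device by name, so a persistent tap0 that was created and
enslaved to a bridge beforehand (e.g. "ip tuntap add dev tap0 mode tap"
followed by "brctl addif br0 tap0") could simply be reused. A minimal,
illustrative sketch of that part, not the patch's actual code:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/if.h>
    #include <linux/if_tun.h>

    /* Attach to a tap device by name via /dev/net/tun.  With a fixed
     * name such as "tap0", the device can be pre-created and added to
     * a bridge before the container app starts. */
    static int
    tap_open(const char *ifname)
    {
            struct ifreq ifr;
            int fd = open("/dev/net/tun", O_RDWR);

            if (fd < 0)
                    return -1;

            memset(&ifr, 0, sizeof(ifr));
            ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
            strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

            if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
                    close(fd);
                    return -1;
            }
            /* The returned fd would then be registered with
             * /dev/vhost-net as the backend. */
            return fd;
    }

Whether eth_cvio would also need flags such as IFF_VNET_HDR here is a
driver detail this sketch does not cover.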