From: "Tan, Jianfeng"
To: Neil Horman
Cc: Neil Horman, dev@dpdk.org
Date: Fri, 25 Mar 2016 09:25:49 +0800
Message-ID: <56F4939D.6020906@intel.com>
In-Reply-To: <20160324134540.GA19236@hmsreliant.think-freely.org>
References: <1446748276-132087-1-git-send-email-jianfeng.tan@intel.com> <1454671228-33284-1-git-send-email-jianfeng.tan@intel.com> <20160323191743.GB13829@hmsreliant.think-freely.org> <56F35ABA.40403@intel.com> <20160324134540.GA19236@hmsreliant.think-freely.org>
Subject: Re: [dpdk-dev] [PATCH v2 0/5] virtio support for container
List-Id: patches and discussions about DPDK

On 3/24/2016 9:45 PM, Neil Horman wrote:
> On Thu, Mar 24, 2016 at 11:10:50AM +0800, Tan, Jianfeng wrote:
>> Hi Neil,
>>
>> On 3/24/2016 3:17 AM, Neil Horman wrote:
>>> On Fri, Feb 05, 2016 at 07:20:23PM +0800, Jianfeng Tan wrote:
>>>> v1->v2:
>>>> - Rebase on the patchset of virtio 1.0 support.
>>>> - Fix failure to create non-hugepage memory.
>>>> - Fix wrong size of memory region when "single-file" is used.
>>>> - Fix setting of offset in virtqueue to use virtual address.
>>>> - Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
>>>> - Add a mac option to specify the mac address of this virtual device.
>>>> - Update doc.
>>>>
>>>> This patchset provides a high-performance networking interface
>>>> (virtio) for container-based DPDK applications. Starting DPDK apps
>>>> in containers with exclusive ownership of NIC devices is beyond its
>>>> scope. The basic idea is to present a new virtual device (named
>>>> eth_cvio), which can be discovered and initialized in
>>>> container-based DPDK apps using rte_eal_init(). To minimize the
>>>> change, we reuse the existing virtio frontend driver code
>>>> (driver/net/virtio/).
>>>>
>>>> Compared to the QEMU/VM case, the virtio device framework (which
>>>> translates I/O port r/w operations into the unix socket/cuse
>>>> protocol, and is originally provided by QEMU) is integrated into the
>>>> virtio frontend driver. So this converged driver actually plays both
>>>> the role of the original frontend driver and the role of the QEMU
>>>> device framework.
>>>>
>>>> The major difference lies in how to calculate the relative address
>>>> for vhost. The principle of virtio is: based on one or more shared
>>>> memory segments, vhost maintains a reference table with the base
>>>> address and length of each segment, so that an address coming from
>>>> the VM (usually a GPA, Guest Physical Address) can be translated
>>>> into a vhost-recognizable address (named VVA, Vhost Virtual
>>>> Address). To decrease the overhead of address translation, we should
>>>> maintain as few segments as possible. In the VM case, GPA space is
>>>> locally contiguous. In the container case, the CVA (Container
>>>> Virtual Address) can be used instead. Specifically:
>>>> a. when set_base_addr is called, the CVA is used;
>>>> b. when preparing RX descriptors, the CVA is used;
>>>> c. when transmitting packets, the CVA is filled into TX descriptors;
>>>> d. in the TX and CQ headers, the CVA is used.
>>>> How is memory shared? In the VM case, qemu always shares the whole
>>>> physical layout with the backend. But that is not feasible for a
>>>> container, which, as a process, cannot share all of its virtual
>>>> memory regions with the backend. So only specified virtual memory
>>>> regions (of shared type) are sent to the backend. The limitation is
>>>> that only addresses in these areas can be used to transmit or
>>>> receive packets.
>>>>
>>>> Known issues:
>>>> a. When used with vhost-net, root privilege is required to create
>>>>    the tap device inside.
>>>> b. Control queue and multi-queue are not supported yet.
>>>> c. When the --single-file option is used, the socket_id of the
>>>>    memory may be wrong. (Use "numactl -N x -m x" to work around
>>>>    this for now.)
>>>>
>>>> How to use?
>>>>
>>>> a. Apply this patchset.
>>>>
>>>> b. Compile the container apps:
>>>> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>>>> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>>>> $: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>>>> $: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>>>>
>>>> c. Build a docker image using the Dockerfile below:
>>>> $: cat ./Dockerfile
>>>> FROM ubuntu:latest
>>>> WORKDIR /usr/src/dpdk
>>>> COPY . /usr/src/dpdk
>>>> ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
>>>> $: docker build -t dpdk-app-l2fwd .
>>>>
>>>> d. Use with vhost-user:
>>>> $: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
>>>>    --socket-mem 1024,1024 -- -p 0x1 --stats 1
>>>> $: docker run -i -t -v :/var/run/usvhost \
>>>>    -v /dev/hugepages:/dev/hugepages \
>>>>    dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>>>>    --vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
>>>>
>>>> e. Use with vhost-net:
>>>> $: modprobe vhost
>>>> $: modprobe vhost-net
>>>> $: docker run -i -t --privileged \
>>>>    -v /dev/vhost-net:/dev/vhost-net \
>>>>    -v /dev/net/tun:/dev/net/tun \
>>>>    -v /dev/hugepages:/dev/hugepages \
>>>>    dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>>>>    --vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
>>>>
>>>> By the way, it is not necessary to run in a container.
>>>>
>>>> Signed-off-by: Huawei Xie
>>>> Signed-off-by: Jianfeng Tan
>>>>
>>>> Jianfeng Tan (5):
>>>>   mem: add --single-file to create single mem-backed file
>>>>   mem: add API to obtain memory-backed file info
>>>>   virtio/vdev: add embedded device emulation
>>>>   virtio/vdev: add a new vdev named eth_cvio
>>>>   docs: add release note for virtio for container
>>>>
>>>>  config/common_linuxapp                     |   5 +
>>>>  doc/guides/rel_notes/release_2_3.rst       |   4 +
>>>>  drivers/net/virtio/Makefile                |   4 +
>>>>  drivers/net/virtio/vhost.h                 | 194 +++++++
>>>>  drivers/net/virtio/vhost_embedded.c        | 809 +++++++++++++++++++++++++++++
>>>>  drivers/net/virtio/virtio_ethdev.c         | 329 +++++++++---
>>>>  drivers/net/virtio/virtio_ethdev.h         |   6 +-
>>>>  drivers/net/virtio/virtio_pci.h            |  15 +-
>>>>  drivers/net/virtio/virtio_rxtx.c           |   6 +-
>>>>  drivers/net/virtio/virtio_rxtx_simple.c    |  13 +-
>>>>  drivers/net/virtio/virtqueue.h             |  15 +-
>>>>  lib/librte_eal/common/eal_common_options.c |  17 +
>>>>  lib/librte_eal/common/eal_internal_cfg.h   |   1 +
>>>>  lib/librte_eal/common/eal_options.h        |   2 +
>>>>  lib/librte_eal/common/include/rte_memory.h |  16 +
>>>>  lib/librte_eal/linuxapp/eal/eal.c          |   4 +-
>>>>  lib/librte_eal/linuxapp/eal/eal_memory.c   |  88 +++-
>>>>  17 files changed, 1435 insertions(+), 93 deletions(-)
>>>>  create mode 100644 drivers/net/virtio/vhost.h
>>>>  create mode 100644 drivers/net/virtio/vhost_embedded.c
>>>>
>>>> --
>>>> 2.1.4
>>>>
>>> So, first off, apologies for being so late to review this patch;
>>> it's been on my todo list forever, and I've just not gotten to it.
>>>
>>> I've taken a cursory look at the code, and I can't find anything
>>> glaringly wrong with it.
>> Thanks very much for reviewing this series.
>>
>>> That said, I'm a bit confused about the overall purpose of this PMD.
>>> I've read the description several times now, and I _think_ I
>>> understand the purpose and construction of the PMD. Please correct
>>> me if this is not the (admittedly very generalized) overview:
>>>
>>> 1) You've created a vdev PMD that is generally named eth_cvio%n,
>>> which serves as a virtual NIC suitable for use in a containerized
>>> space
>>>
>>> 2) The PMD in (1) establishes a connection to the host via the vhost
>>> backend (which is either a socket or a character device), which it
>>> uses to forward data from the containerized dpdk application
>>
>> The socket or the character device is used just for control plane
>> messages to set up the datapath. The data does not go through the
>> socket or the character device.
>>
>>> 3) The system hosting the containerized dpdk application ties the
>>> other end of the tun/tap interface established in (2) to some other
>>> forwarding mechanism (ostensibly a host based dpdk forwarder) to
>>> send the frame out on the physical wire.
>> There are two kinds of vhost backend:
>> (1) vhost-user: no need for a tun/tap. The cvio PMD connects to the
>> backend socket and communicates memory region information with the
>> vhost-user backend (the backend is another DPDK application using the
>> vhost PMD by Tetsuya, or using the vhost library like the vhost
>> example does).
>> (2) vhost-net: here we need a tun/tap. When we open the /dev/vhost-net
>> char device and do some ioctls on it, it starts a kthread (the
>> backend). We need an interface (tun/tap) as an agent to blend into
>> kernel networking, so that the kthread knows where to send the
>> packets sent by the frontend, and where to receive packets to send to
>> the frontend.
>>
>> To be honest, vhost-user is the preferred way to achieve high
>> performance.
>> As far as vhost-net is concerned, it goes through the kernel network
>> stack, which is the performance bottleneck.
>>
> Sure, that makes sense. So in the vhost-user case, we just read/write
> to a shared memory region? I.e. no user/kernel space transition for
> the nominal data path? If that's the case, then that's the piece I'm
> missing.
> Neil

Yes, exactly, for now (both sides are in polling mode). Plus, we are
trying to add an interrupt mode so that a large number of containers
can run with this new PMD. In interrupt mode, the "user/kernel
transition" can be handled smartly because it is the other side's
responsibility to tell this side when it needs to be woken up, so the
user/kernel space transition happens only when a wakeup is necessary.

Thanks,
Jianfeng