From: Neil Horman
To: jianfeng.tan@intel.com
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 0/5] virtio support for container
Date: Wed, 23 Mar 2016 15:17:43 -0400
Message-ID: <20160323191743.GB13829@hmsreliant.think-freely.org>
In-Reply-To: <1454671228-33284-1-git-send-email-jianfeng.tan@intel.com>

On Fri, Feb 05, 2016 at 07:20:23PM +0800, Jianfeng Tan wrote:
> v1->v2:
>  - Rebase on the patchset of virtio 1.0 support.
>  - Fix failure to create non-hugepage memory.
>  - Fix wrong size of memory region when "single-file" is used.
>  - Fix setting of offset in virtqueue to use virtual address.
>  - Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
>  - Add a mac option to specify the mac address of this virtual device.
>  - Update doc.
>
> This patchset provides a high performance networking interface (virtio)
> for container-based DPDK applications. Starting DPDK apps in containers
> with exclusive ownership of NIC devices is beyond its scope. The basic
> idea is to present a new virtual device (named eth_cvio) which can be
> discovered and initialized in container-based DPDK apps using
> rte_eal_init(). To minimize the change, we reuse the already-existing
> virtio frontend driver code (drivers/net/virtio/).
>
> Compared to the QEMU/VM case, the virtio device framework (which
> translates I/O port r/w operations into the unix socket/cuse protocol
> and is originally provided by QEMU) is integrated into the virtio
> frontend driver. So this converged driver plays both the role of the
> original frontend driver and the role of the QEMU device framework.
>
> The major difference lies in how to calculate the relative address for
> vhost. The principle of virtio is that, based on one or multiple shared
> memory segments, vhost maintains a reference table with the base address
> and length of each segment, so that an address coming from the VM
> (usually a GPA, Guest Physical Address) can be translated into a
> vhost-recognizable address (a VVA, Vhost Virtual Address). To decrease
> the overhead of address translation, we should maintain as few segments
> as possible. In the VM's case, the GPA is always locally contiguous. In
> the container's case, the CVA (Container Virtual Address) can be used.
> Specifically:
>    a. when set_base_addr is called, the CVA is used;
>    b. when preparing RX descriptors, the CVA is used;
>    c. when transmitting packets, the CVA is filled into TX descriptors;
>    d. in the TX and CQ headers, the CVA is used.
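To make the address translation described above concrete, here is a
minimal sketch of the lookup a vhost backend performs against its table
of registered segments; the names (mem_region, to_vhost_va) are
illustrative only and are not taken from the patchset:

#include <stdint.h>
#include <stddef.h>

/* One shared memory segment registered by the frontend. */
struct mem_region {
    uint64_t guest_addr;     /* GPA in the VM case, CVA in the container case */
    uint64_t host_user_addr; /* where vhost mmap'ed the same segment (VVA) */
    uint64_t size;
};

/*
 * Translate a frontend address (GPA or CVA) into a vhost virtual
 * address by walking the region table; returns 0 if the address does
 * not fall inside any registered segment.
 */
uint64_t
to_vhost_va(const struct mem_region *regions, size_t nregions, uint64_t addr)
{
    for (size_t i = 0; i < nregions; i++) {
        const struct mem_region *r = &regions[i];

        if (addr >= r->guest_addr && addr < r->guest_addr + r->size)
            return r->host_user_addr + (addr - r->guest_addr);
    }
    return 0;
}

The fewer segments the frontend registers, the shorter this walk, which
is why the cover letter aims to keep the number of segments small.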
> How to share memory? In the VM's case, qemu always shares the whole
> physical memory layout with the backend. But it is not feasible for a
> container, as a process, to share all of its virtual memory regions
> with the backend. So only designated virtual memory regions (of a
> shared type) are sent to the backend. It is a limitation that only
> addresses inside these areas can be used to transmit or receive
> packets.
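As an illustration of the kind of region that can be exported this way
(a sketch only, not code from the patchset; the region name and size
are made up), a file-backed MAP_SHARED mapping can be created up front
and its file descriptor later passed to the backend over the vhost-user
unix socket:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (64UL << 20)   /* 64 MB, arbitrary */

int main(void)
{
    /* A named, file-backed object: unlike ordinary anonymous heap
     * memory, its fd can be handed to the vhost backend, so the backend
     * can mmap the very same pages. */
    int fd = shm_open("/cvio_region0", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) < 0) {
        perror("shm_open/ftruncate");
        return EXIT_FAILURE;
    }

    void *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Only buffers placed inside [base, base + REGION_SIZE) are
     * visible to the backend; any other address in this process
     * cannot be used for packet buffers. */
    printf("shared region at %p, %lu bytes\n", base, REGION_SIZE);

    munmap(base, REGION_SIZE);
    close(fd);
    shm_unlink("/cvio_region0");
    return 0;
}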
> Known issues
>
>  a. When used with vhost-net, root privilege is required to create the
>     tap device inside.
>  b. Control queue and multi-queue are not supported yet.
>  c. When the --single-file option is used, the socket_id of the memory
>     may be wrong. (Use "numactl -N x -m x" to work around this for now.)
>
> How to use?
>
> a. Apply this patchset.
>
> b. To compile container apps:
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>
> c. To build a docker image using the Dockerfile below:
> $: cat ./Dockerfile
> FROM ubuntu:latest
> WORKDIR /usr/src/dpdk
> COPY . /usr/src/dpdk
> ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
> $: docker build -t dpdk-app-l2fwd .
>
> d. Used with vhost-user:
> $: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
>        --socket-mem 1024,1024 -- -p 0x1 --stats 1
> $: docker run -i -t -v :/var/run/usvhost \
>        -v /dev/hugepages:/dev/hugepages \
>        dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>        --vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
>
> f. Used with vhost-net:
> $: modprobe vhost
> $: modprobe vhost-net
> $: docker run -i -t --privileged \
>        -v /dev/vhost-net:/dev/vhost-net \
>        -v /dev/net/tun:/dev/net/tun \
>        -v /dev/hugepages:/dev/hugepages \
>        dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>        --vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
>
> By the way, it is not necessary to run in a container.
>
> Signed-off-by: Huawei Xie
> Signed-off-by: Jianfeng Tan
>
> Jianfeng Tan (5):
>   mem: add --single-file to create single mem-backed file
>   mem: add API to obtain memory-backed file info
>   virtio/vdev: add embeded device emulation
>   virtio/vdev: add a new vdev named eth_cvio
>   docs: add release note for virtio for container
>
>  config/common_linuxapp                     |   5 +
>  doc/guides/rel_notes/release_2_3.rst       |   4 +
>  drivers/net/virtio/Makefile                |   4 +
>  drivers/net/virtio/vhost.h                 | 194 +++++
>  drivers/net/virtio/vhost_embedded.c        | 809 ++++++++++++++++++++++
>  drivers/net/virtio/virtio_ethdev.c         | 329 ++++++++---
>  drivers/net/virtio/virtio_ethdev.h         |   6 +-
>  drivers/net/virtio/virtio_pci.h            |  15 +-
>  drivers/net/virtio/virtio_rxtx.c           |   6 +-
>  drivers/net/virtio/virtio_rxtx_simple.c    |  13 +-
>  drivers/net/virtio/virtqueue.h             |  15 +-
>  lib/librte_eal/common/eal_common_options.c |  17 +
>  lib/librte_eal/common/eal_internal_cfg.h   |   1 +
>  lib/librte_eal/common/eal_options.h        |   2 +
>  lib/librte_eal/common/include/rte_memory.h |  16 +
>  lib/librte_eal/linuxapp/eal/eal.c          |   4 +-
>  lib/librte_eal/linuxapp/eal/eal_memory.c   |  88 +++-
>  17 files changed, 1435 insertions(+), 93 deletions(-)
>  create mode 100644 drivers/net/virtio/vhost.h
>  create mode 100644 drivers/net/virtio/vhost_embedded.c
>
> --
> 2.1.4
>

So, first off, apologies for being so late to review this patch; it's
been on my todo list forever, and I've just not gotten to it.

I've taken a cursory look at the code, and I can't find anything
glaringly wrong with it.

That said, I'm a bit confused about the overall purpose of this PMD.
I've read the description several times now, and I _think_ I understand
the purpose and construction of the PMD. Please correct me if this is
not the (admittedly very generalized) overview:

1) You've created a vdev PMD that is generically named eth_cvio%n,
   which serves as a virtual NIC suitable for use in a containerized
   space.

2) The PMD in (1) establishes a connection to the host via the vhost
   backend (which is either a socket or a character device), which it
   uses to forward data from the containerized dpdk application.

3) The system hosting the containerized dpdk application ties the other
   end of the tun/tap interface established in (2) to some other
   forwarding mechanism (ostensibly a host-based dpdk forwarder) to
   send the frame out on the physical wire.

If I understand that, it seems reasonable, but I have to ask: why? It
feels a bit like a re-invention of the wheel to me. That is to say, for
whatever optimization this PMD may have, the by-far larger bottleneck
is the tun/tap interface in step (2). If that's the case, then why
create a new PMD at all? Why not instead just use a tun/tap interface
into the container, along with the af_packet PMD for communication?
That has the ability to do memory mapping of an interface for
relatively fast packet writes, so I expect it would be just as
performant as this solution, and without the need to write and maintain
a new PMD's worth of code.

I feel like I'm missing something here, so please clarify if I am, but
at the moment I'm having a hard time seeing the advantage of a new PMD
here.

Regards
Neil
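For reference on the af_packet alternative raised above, the "memory
mapping of an interface" it relies on is the kernel's PACKET_MMAP ring.
The sketch below is illustrative only (it is not part of either mail;
the ring sizes are arbitrary and CAP_NET_RAW is required to run it):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Raw packet socket; the af_packet PMD opens one per queue. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }

    /* Ask for a kernel/user shared RX ring: 64 blocks of 4 KB, two
     * 2 KB frame slots per block.  The sizes here are arbitrary. */
    struct tpacket_req req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 4096;
    req.tp_block_nr   = 64;
    req.tp_frame_size = 2048;
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;

    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("setsockopt(PACKET_RX_RING)");
        close(fd);
        return 1;
    }

    /* Map the ring once; received frames then show up in this area
     * without a per-packet copy through read()/recv(). */
    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    void *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("PACKET_MMAP RX ring mapped at %p (%zu bytes)\n", ring, ring_len);

    munmap(ring, ring_len);
    close(fd);
    return 0;
}

Whether this path is actually as fast as the proposed eth_cvio device
is exactly the question Neil raises; the sketch only shows the
mechanism being compared against.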