Date: Fri, 25 Mar 2016 07:06:06 -0400
From: Neil Horman
To: "Tan, Jianfeng"
Cc: Neil Horman, dev@dpdk.org
Message-ID: <20160325110606.GA16676@hmsreliant.think-freely.org>
In-Reply-To: <56F4939D.6020906@intel.com>
Subject: Re: [dpdk-dev] [PATCH v2 0/5] virtio support for container

On Fri, Mar 25, 2016 at 09:25:49AM +0800, Tan, Jianfeng wrote:
>
>
> On 3/24/2016 9:45 PM, Neil Horman wrote:
> >On Thu, Mar 24, 2016 at 11:10:50AM +0800, Tan, Jianfeng wrote:
> >>Hi Neil,
> >>
> >>On 3/24/2016 3:17 AM, Neil Horman wrote:
> >>>On Fri, Feb 05, 2016 at 07:20:23PM +0800, Jianfeng Tan wrote:
> >>>>v1->v2:
> >>>> - Rebase on the patchset of virtio 1.0 support.
> >>>> - Fix failure to create non-hugepage memory.
> >>>> - Fix wrong size of memory region when "single-file" is used.
> >>>> - Fix setting of offset in virtqueue to use virtual address.
> >>>> - Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
> >>>> - Add mac option to specify the mac address of this virtual device.
> >>>> - Update doc.
> >>>>
> >>>>This patchset provides a high-performance networking interface
> >>>>(virtio) for container-based DPDK applications. How to start DPDK
> >>>>apps in containers with exclusive ownership of NIC devices is beyond
> >>>>its scope.
> >>>>
> >>>>The basic idea is to present a new virtual device (named eth_cvio),
> >>>>which can be discovered and initialized in container-based DPDK apps
> >>>>using rte_eal_init(). To minimize the change, we reuse the existing
> >>>>virtio frontend driver code (drivers/net/virtio/).
> >>>>
> >>>>Compared to the QEMU/VM case, the virtio device framework (which
> >>>>translates I/O port read/write operations into the unix socket/cuse
> >>>>protocol, and is originally provided by QEMU) is integrated into the
> >>>>virtio frontend driver. So this converged driver actually plays both
> >>>>the role of the original frontend driver and the role of the QEMU
> >>>>device framework.
> >>>>
> >>>>The major difference lies in how to calculate the relative address
> >>>>for vhost. The principle of virtio is: based on one or more shared
> >>>>memory segments, vhost maintains a reference system with the base
> >>>>address and length of each segment, so that an address coming from
> >>>>the VM (usually a GPA, Guest Physical Address) can be translated into
> >>>>a vhost-recognizable address (a VVA, Vhost Virtual Address). To
> >>>>decrease the overhead of address translation, we should maintain as
> >>>>few segments as possible.
> >>>>In the VM case, the GPA is always locally contiguous. In the
> >>>>container case, the CVA (Container Virtual Address) can be used.
> >>>>Specifically:
> >>>>a. when set_base_addr is called, the CVA is used;
> >>>>b. when preparing RX descriptors, the CVA is used;
> >>>>c. when transmitting packets, the CVA is filled into TX descriptors;
> >>>>d. in TX and CQ headers, the CVA is used.
> >>>>
> >>>>How to share memory? In the VM case, QEMU always shares the whole
> >>>>physical memory layout with the backend. It is not feasible for a
> >>>>container, as a process, to share all of its virtual memory regions
> >>>>with the backend, so only specified virtual memory regions (of shared
> >>>>type) are sent to the backend. It is a limitation that only addresses
> >>>>in these areas can be used to transmit or receive packets.
> >>>>
> >>>>Known issues
> >>>>
> >>>>a. When used with vhost-net, root privilege is required to create the
> >>>>tap device inside the container.
> >>>>b. Control queue and multi-queue are not supported yet.
> >>>>c. When the --single-file option is used, the socket_id of the memory
> >>>>may be wrong. (Use "numactl -N x -m x" to work around this for now.)
> >>>>
> >>>>How to use?
> >>>>
> >>>>a. Apply this patchset.
> >>>>
> >>>>b. To compile container apps:
> >>>>$: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>$: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>$: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>$: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> >>>>
> >>>>c. To build a docker image, use the Dockerfile below:
> >>>>$: cat ./Dockerfile
> >>>>FROM ubuntu:latest
> >>>>WORKDIR /usr/src/dpdk
> >>>>COPY . /usr/src/dpdk
> >>>>ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
> >>>>$: docker build -t dpdk-app-l2fwd .
> >>>>
> >>>>d. Use with vhost-user:
> >>>>$: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
> >>>>    --socket-mem 1024,1024 -- -p 0x1 --stats 1
> >>>>$: docker run -i -t -v :/var/run/usvhost \
> >>>>    -v /dev/hugepages:/dev/hugepages \
> >>>>    dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> >>>>    --vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
> >>>>
> >>>>e. Use with vhost-net:
> >>>>$: modprobe vhost
> >>>>$: modprobe vhost-net
> >>>>$: docker run -i -t --privileged \
> >>>>    -v /dev/vhost-net:/dev/vhost-net \
> >>>>    -v /dev/net/tun:/dev/net/tun \
> >>>>    -v /dev/hugepages:/dev/hugepages \
> >>>>    dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
> >>>>    --vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
> >>>>
> >>>>By the way, it is not necessary to run inside a container.
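To make the address-translation scheme in the quoted cover letter concrete, here is a minimal C sketch of the per-segment lookup a vhost backend performs when it receives a GPA (VM case) or CVA (container case); the struct and function names are illustrative only and are not the actual DPDK or vhost API:

/* Hypothetical sketch of the vhost "reference system": every shared
 * segment is described by its start address in the frontend's address
 * space (GPA for a VM, CVA for a container), its length, and the
 * address at which the backend has mapped the same memory (VVA). */
#include <stdint.h>
#include <stddef.h>

struct mem_region {
	uint64_t front_addr;   /* GPA (VM) or CVA (container) */
	uint64_t size;         /* length of the segment */
	uint64_t backend_addr; /* VVA: where the backend mapped it */
};

static void *
front_to_vva(const struct mem_region *reg, int nregions, uint64_t addr)
{
	int i;

	for (i = 0; i < nregions; i++) {
		if (addr >= reg[i].front_addr &&
		    addr < reg[i].front_addr + reg[i].size)
			return (void *)(uintptr_t)(reg[i].backend_addr +
						   (addr - reg[i].front_addr));
	}
	return NULL; /* not inside any shared segment */
}

Every address placed in a descriptor has to survive this lookup, which is why the cover letter stresses sharing as few segments as possible (e.g. via --single-file) and why only memory inside the shared regions can be used for packet buffers.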
> >>>>
> >>>>Signed-off-by: Huawei Xie
> >>>>Signed-off-by: Jianfeng Tan
> >>>>
> >>>>Jianfeng Tan (5):
> >>>>  mem: add --single-file to create single mem-backed file
> >>>>  mem: add API to obtain memory-backed file info
> >>>>  virtio/vdev: add embeded device emulation
> >>>>  virtio/vdev: add a new vdev named eth_cvio
> >>>>  docs: add release note for virtio for container
> >>>>
> >>>> config/common_linuxapp                     |   5 +
> >>>> doc/guides/rel_notes/release_2_3.rst       |   4 +
> >>>> drivers/net/virtio/Makefile                |   4 +
> >>>> drivers/net/virtio/vhost.h                 | 194 +++++++
> >>>> drivers/net/virtio/vhost_embedded.c        | 809 +++++++++++++++++++++++++++++
> >>>> drivers/net/virtio/virtio_ethdev.c         | 329 +++++++++---
> >>>> drivers/net/virtio/virtio_ethdev.h         |   6 +-
> >>>> drivers/net/virtio/virtio_pci.h            |  15 +-
> >>>> drivers/net/virtio/virtio_rxtx.c           |   6 +-
> >>>> drivers/net/virtio/virtio_rxtx_simple.c    |  13 +-
> >>>> drivers/net/virtio/virtqueue.h             |  15 +-
> >>>> lib/librte_eal/common/eal_common_options.c |  17 +
> >>>> lib/librte_eal/common/eal_internal_cfg.h   |   1 +
> >>>> lib/librte_eal/common/eal_options.h        |   2 +
> >>>> lib/librte_eal/common/include/rte_memory.h |  16 +
> >>>> lib/librte_eal/linuxapp/eal/eal.c          |   4 +-
> >>>> lib/librte_eal/linuxapp/eal/eal_memory.c   |  88 +++-
> >>>> 17 files changed, 1435 insertions(+), 93 deletions(-)
> >>>> create mode 100644 drivers/net/virtio/vhost.h
> >>>> create mode 100644 drivers/net/virtio/vhost_embedded.c
> >>>>
> >>>>--
> >>>>2.1.4
> >>>>
> >>>So, first off, apologies for being so late to review this patch; it's
> >>>been on my todo list forever, and I've just not gotten to it.
> >>>
> >>>I've taken a cursory look at the code, and I can't find anything
> >>>glaringly wrong with it.
> >>
> >>Thanks very much for reviewing this series.
> >>
> >>>That said, I'm a bit confused about the overall purpose of this PMD.
> >>>I've read the description several times now, and I _think_ I
> >>>understand the purpose and construction of the PMD. Please correct me
> >>>if this is not the (admittedly very generalized) overview:
> >>>
> >>>1) You've created a vdev PMD, generally named eth_cvio%n, which serves
> >>>as a virtual NIC suitable for use in a containerized space.
> >>>
> >>>2) The PMD in (1) establishes a connection to the host via the vhost
> >>>backend (which is either a socket or a character device), which it
> >>>uses to forward data from the containerized dpdk application.
> >>
> >>The socket or the character device is used just for control-plane
> >>messages to set up the datapath. The data does not go through the
> >>socket or the character device.
> >>
> >>>3) The system hosting the containerized dpdk application ties the
> >>>other end of the tun/tap interface established in (2) to some other
> >>>forwarding mechanism (ostensibly a host-based dpdk forwarder) to send
> >>>the frame out on the physical wire.
> >>
> >>There are two kinds of vhost backend:
> >>(1) vhost-user: no need for a tun/tap. The cvio PMD connects to the
> >>backend socket and exchanges memory region information with the
> >>vhost-user backend (the backend is another DPDK application using the
> >>vhost PMD by Tetsuya, or using the vhost library as the vhost example
> >>does).
> >>(2) vhost-net: here we need a tun/tap. When we open the /dev/vhost-net
> >>char device and issue some ioctls on it, it just starts a kthread (the
> >>backend). We need an interface (tun/tap) as an agent to blend into
> >>kernel networking, so that the kthread knows where to send the packets
> >>coming from the frontend, and where to receive the packets destined
> >>for the frontend.
> >>
> >>To be honest, vhost-user is the preferred way to achieve high
> >>performance. With vhost-net, traffic goes through the kernel network
> >>stack, which is the performance bottleneck.
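As a rough illustration of the vhost-net control path just described, the sketch below shows how opening /dev/vhost-net starts the backend kthread and how a tap interface is attached to it. The ioctls are the kernel's tun and vhost UAPI, but the function, the interface name and the amount of setup shown are simplified assumptions, not the actual eth_cvio code; in a real driver, feature negotiation, VHOST_SET_MEM_TABLE and the vring setup ioctls all have to happen before the backend can be attached.

/* Simplified sketch: attach a tap interface to the vhost-net kthread. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <linux/vhost.h>
#include <linux/virtio_net.h>

int setup_vhost_net_backend(void)
{
	struct ifreq ifr;
	int hdr_len = sizeof(struct virtio_net_hdr);
	int tap_fd, vhost_fd, q;

	/* The tap interface is the kthread's hook into kernel networking. */
	tap_fd = open("/dev/net/tun", O_RDWR);
	if (tap_fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
	snprintf(ifr.ifr_name, IFNAMSIZ, "cvio-tap0"); /* name is illustrative */
	if (ioctl(tap_fd, TUNSETIFF, &ifr) < 0 ||
	    ioctl(tap_fd, TUNSETVNETHDRSZ, &hdr_len) < 0)
		return -1;

	/* Opening /dev/vhost-net and claiming ownership is what starts the
	 * per-device vhost kthread (the backend). */
	vhost_fd = open("/dev/vhost-net", O_RDWR);
	if (vhost_fd < 0)
		return -1;
	if (ioctl(vhost_fd, VHOST_SET_OWNER) < 0)
		return -1;

	/* ... feature negotiation, VHOST_SET_MEM_TABLE and vring setup
	 * (VHOST_SET_VRING_NUM/ADDR/BASE/KICK/CALL) omitted here ... */

	/* Point both the RX and TX virtqueues at the tap fd. */
	for (q = 0; q < 2; q++) {
		struct vhost_vring_file backend = { .index = q, .fd = tap_fd };
		if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend) < 0)
			return -1;
	}
	return vhost_fd;
}

With vhost-user, this exchange is replaced by messages over the unix socket (including the memory-region table), and the kernel stays out of the data path entirely.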
> >
> >Sure, that makes sense. So in the vhost-user case, we just read/write
> >to a shared memory region? I.e. no user/kernel space transition for
> >the nominal data path? If that's the case, then that's the piece I'm
> >missing.
> >Neil
>
> Yes, exactly, for now (both sides are in polling mode). In addition, we
> are trying to add an interrupt mode so that a large number of
> containers can run with this new PMD. In interrupt mode, the
> user/kernel transition is handled more intelligently: it is the other
> side's responsibility to tell this side whether it needs to be woken
> up, so the user/kernel space transition happens only when a wakeup is
> necessary.
>
> Thanks,
> Jianfeng
>

Ok, thank you for the clarification.

Acked-By: Neil Horman
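As a footnote on the interrupt-mode point above (crossing into the kernel only when the other side actually has to be woken), here is a minimal sketch using the standard virtio notification-suppression flag and an eventfd kick. The function name and kickfd are illustrative, under the assumption that the kick eventfd was registered with the backend (e.g. via VHOST_SET_VRING_KICK); this is not code from the patchset.

/* Sketch: kick the backend only if it has not suppressed notifications. */
#include <stdint.h>
#include <unistd.h>
#include <linux/virtio_ring.h>

static void
notify_backend_if_needed(const struct vring *vr, int kickfd)
{
	uint64_t one = 1;

	/* While the backend is busy polling it sets VRING_USED_F_NO_NOTIFY,
	 * telling us a kick would be wasted.  (Real code also needs a read
	 * barrier before checking the flag.) */
	if (vr->used->flags & VRING_USED_F_NO_NOTIFY)
		return;

	/* The eventfd write is the only point where the fast path crosses
	 * into the kernel to wake the other side. */
	(void)write(kickfd, &one, sizeof(one));
}

In a VM the suppressed kick saves a VM exit; in the container case it saves this write() syscall, which is the user/kernel transition being discussed.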