From: Neil Horman
To: jianfeng.tan@intel.com
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v2 0/5] virtio support for container
Date: Wed, 23 Mar 2016 15:17:43 -0400
Message-ID: <20160323191743.GB13829@hmsreliant.think-freely.org>
In-Reply-To: <1454671228-33284-1-git-send-email-jianfeng.tan@intel.com>

On Fri, Feb 05, 2016 at 07:20:23PM +0800, Jianfeng Tan wrote:
> v1->v2:
>  - Rebase on the patchset of virtio 1.0 support.
>  - Fix failure to create non-hugepage memory.
>  - Fix wrong size of memory region when "single-file" is used.
>  - Fix setting of offset in virtqueue to use virtual address.
>  - Fix setting TUNSETVNETHDRSZ in vhost-user's branch.
>  - Add a mac option to specify the mac address of this virtual device.
>  - Update doc.
>
> This patchset provides a high performance networking interface (virtio)
> for container-based DPDK applications. Starting DPDK apps in containers
> with exclusive ownership of NIC devices is beyond its scope. The basic
> idea is to present a new virtual device (named eth_cvio) which can be
> discovered and initialized in container-based DPDK apps using
> rte_eal_init(). To minimize the change, we reuse the already-existing
> virtio frontend driver code (drivers/net/virtio/).
>
> Compared to the QEMU/VM case, the virtio device framework (which
> translates I/O port r/w operations into the unix socket/cuse protocol
> and is originally provided by QEMU) is integrated into the virtio
> frontend driver. So this converged driver plays both the role of the
> original frontend driver and the role of the QEMU device framework.
>
> The major difference lies in how to calculate the relative address for
> vhost. The principle of virtio is that, based on one or multiple shared
> memory segments, vhost maintains a reference table with the base address
> and length of each segment, so that an address coming from the VM
> (usually a GPA, Guest Physical Address) can be translated into a
> vhost-recognizable address (a VVA, Vhost Virtual Address). To decrease
> the overhead of address translation, we should maintain as few segments
> as possible. In the VM's case, the GPA is always locally contiguous. In
> the container's case, the CVA (Container Virtual Address) can be used.
> Specifically:
>    a. when set_base_addr is called, the CVA is used;
>    b. when preparing RX descriptors, the CVA is used;
>    c. when transmitting packets, the CVA is filled into TX descriptors;
>    d. in the TX and CQ headers, the CVA is used.
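To make the address translation described above concrete, here is a
minimal sketch of the lookup a vhost backend performs against its table
of registered segments; the names (mem_region, to_vhost_va) are
illustrative only and are not taken from the patchset:

#include <stdint.h>
#include <stddef.h>

/* One shared memory segment registered by the frontend. */
struct mem_region {
    uint64_t guest_addr;     /* GPA in the VM case, CVA in the container case */
    uint64_t host_user_addr; /* where vhost mmap'ed the same segment (VVA) */
    uint64_t size;
};

/*
 * Translate a frontend address (GPA or CVA) into a vhost virtual
 * address by walking the region table; returns 0 if the address does
 * not fall inside any registered segment.
 */
uint64_t
to_vhost_va(const struct mem_region *regions, size_t nregions, uint64_t addr)
{
    for (size_t i = 0; i < nregions; i++) {
        const struct mem_region *r = &regions[i];

        if (addr >= r->guest_addr && addr < r->guest_addr + r->size)
            return r->host_user_addr + (addr - r->guest_addr);
    }
    return 0;
}

The fewer segments the frontend registers, the shorter this walk, which
is why the cover letter aims to keep the number of segments small.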
> How to share memory? In the VM's case, qemu always shares the whole
> physical memory layout with the backend. But it is not feasible for a
> container, as a process, to share all of its virtual memory regions
> with the backend. So only designated virtual memory regions (of a
> shared type) are sent to the backend. It is a limitation that only
> addresses inside these areas can be used to transmit or receive
> packets.
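As an illustration of the kind of region that can be exported this way
(a sketch only, not code from the patchset; the region name and size
are made up), a file-backed MAP_SHARED mapping can be created up front
and its file descriptor later passed to the backend over the vhost-user
unix socket:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_SIZE (64UL << 20)   /* 64 MB, arbitrary */

int main(void)
{
    /* A named, file-backed object: unlike ordinary anonymous heap
     * memory, its fd can be handed to the vhost backend, so the backend
     * can mmap the very same pages. */
    int fd = shm_open("/cvio_region0", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, REGION_SIZE) < 0) {
        perror("shm_open/ftruncate");
        return EXIT_FAILURE;
    }

    void *base = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) {
        perror("mmap");
        return EXIT_FAILURE;
    }

    /* Only buffers placed inside [base, base + REGION_SIZE) are
     * visible to the backend; any other address in this process
     * cannot be used for packet buffers. */
    printf("shared region at %p, %lu bytes\n", base, REGION_SIZE);

    munmap(base, REGION_SIZE);
    close(fd);
    shm_unlink("/cvio_region0");
    return 0;
}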
> Known issues
>
>  a. When used with vhost-net, root privilege is required to create the
>     tap device inside.
>  b. Control queue and multi-queue are not supported yet.
>  c. When the --single-file option is used, the socket_id of the memory
>     may be wrong. (Use "numactl -N x -m x" to work around this for now.)
>
> How to use?
>
> a. Apply this patchset.
>
> b. To compile container apps:
> $: make config RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make install RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/l2fwd RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
> $: make -C examples/vhost RTE_SDK=`pwd` T=x86_64-native-linuxapp-gcc
>
> c. To build a docker image using the Dockerfile below:
> $: cat ./Dockerfile
> FROM ubuntu:latest
> WORKDIR /usr/src/dpdk
> COPY . /usr/src/dpdk
> ENV PATH "$PATH:/usr/src/dpdk/examples/l2fwd/build/"
> $: docker build -t dpdk-app-l2fwd .
>
> d. Used with vhost-user:
> $: ./examples/vhost/build/vhost-switch -c 3 -n 4 \
>        --socket-mem 1024,1024 -- -p 0x1 --stats 1
> $: docker run -i -t -v :/var/run/usvhost \
>        -v /dev/hugepages:/dev/hugepages \
>        dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>        --vdev=eth_cvio0,path=/var/run/usvhost -- -p 0x1
>
> f. Used with vhost-net:
> $: modprobe vhost
> $: modprobe vhost-net
> $: docker run -i -t --privileged \
>        -v /dev/vhost-net:/dev/vhost-net \
>        -v /dev/net/tun:/dev/net/tun \
>        -v /dev/hugepages:/dev/hugepages \
>        dpdk-app-l2fwd l2fwd -c 0x4 -n 4 -m 1024 --no-pci \
>        --vdev=eth_cvio0,path=/dev/vhost-net -- -p 0x1
>
> By the way, it is not necessary to run in a container.
>
> Signed-off-by: Huawei Xie
> Signed-off-by: Jianfeng Tan
>
> Jianfeng Tan (5):
>   mem: add --single-file to create single mem-backed file
>   mem: add API to obtain memory-backed file info
>   virtio/vdev: add embeded device emulation
>   virtio/vdev: add a new vdev named eth_cvio
>   docs: add release note for virtio for container
>
>  config/common_linuxapp                     |   5 +
>  doc/guides/rel_notes/release_2_3.rst       |   4 +
>  drivers/net/virtio/Makefile                |   4 +
>  drivers/net/virtio/vhost.h                 | 194 +++++
>  drivers/net/virtio/vhost_embedded.c        | 809 ++++++++++++++++++++++
>  drivers/net/virtio/virtio_ethdev.c         | 329 ++++++++---
>  drivers/net/virtio/virtio_ethdev.h         |   6 +-
>  drivers/net/virtio/virtio_pci.h            |  15 +-
>  drivers/net/virtio/virtio_rxtx.c           |   6 +-
>  drivers/net/virtio/virtio_rxtx_simple.c    |  13 +-
>  drivers/net/virtio/virtqueue.h             |  15 +-
>  lib/librte_eal/common/eal_common_options.c |  17 +
>  lib/librte_eal/common/eal_internal_cfg.h   |   1 +
>  lib/librte_eal/common/eal_options.h        |   2 +
>  lib/librte_eal/common/include/rte_memory.h |  16 +
>  lib/librte_eal/linuxapp/eal/eal.c          |   4 +-
>  lib/librte_eal/linuxapp/eal/eal_memory.c   |  88 +++-
>  17 files changed, 1435 insertions(+), 93 deletions(-)
>  create mode 100644 drivers/net/virtio/vhost.h
>  create mode 100644 drivers/net/virtio/vhost_embedded.c
>
> --
> 2.1.4
>

So, first off, apologies for being so late to review this patch; it's
been on my todo list forever, and I've just not gotten to it.

I've taken a cursory look at the code, and I can't find anything
glaringly wrong with it.

That said, I'm a bit confused about the overall purpose of this PMD.
I've read the description several times now, and I _think_ I understand
the purpose and construction of the PMD. Please correct me if this is
not the (admittedly very generalized) overview:

1) You've created a vdev PMD that is generically named eth_cvio%n,
   which serves as a virtual NIC suitable for use in a containerized
   space.

2) The PMD in (1) establishes a connection to the host via the vhost
   backend (which is either a socket or a character device), which it
   uses to forward data from the containerized dpdk application.

3) The system hosting the containerized dpdk application ties the other
   end of the tun/tap interface established in (2) to some other
   forwarding mechanism (ostensibly a host-based dpdk forwarder) to
   send the frame out on the physical wire.

If I understand that, it seems reasonable, but I have to ask: why? It
feels a bit like a re-invention of the wheel to me. That is to say, for
whatever optimization this PMD may have, the by-far larger bottleneck
is the tun/tap interface in step (2). If that's the case, then why
create a new PMD at all? Why not instead just use a tun/tap interface
into the container, along with the af_packet PMD for communication?
That has the ability to do memory mapping of an interface for
relatively fast packet writes, so I expect it would be just as
performant as this solution, and without the need to write and maintain
a new PMD's worth of code.

I feel like I'm missing something here, so please clarify if I am, but
at the moment I'm having a hard time seeing the advantage of a new PMD
here.

Regards
Neil
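For reference on the af_packet alternative raised above, the "memory
mapping of an interface" it relies on is the kernel's PACKET_MMAP ring.
The sketch below is illustrative only (it is not part of either mail;
the ring sizes are arbitrary and CAP_NET_RAW is required to run it):

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* Raw packet socket; the af_packet PMD opens one per queue. */
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }

    /* Ask for a kernel/user shared RX ring: 64 blocks of 4 KB, two
     * 2 KB frame slots per block.  The sizes here are arbitrary. */
    struct tpacket_req req;
    memset(&req, 0, sizeof(req));
    req.tp_block_size = 4096;
    req.tp_block_nr   = 64;
    req.tp_frame_size = 2048;
    req.tp_frame_nr   = (req.tp_block_size / req.tp_frame_size) * req.tp_block_nr;

    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0) {
        perror("setsockopt(PACKET_RX_RING)");
        close(fd);
        return 1;
    }

    /* Map the ring once; received frames then show up in this area
     * without a per-packet copy through read()/recv(). */
    size_t ring_len = (size_t)req.tp_block_size * req.tp_block_nr;
    void *ring = mmap(NULL, ring_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ring == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    printf("PACKET_MMAP RX ring mapped at %p (%zu bytes)\n", ring, ring_len);

    munmap(ring, ring_len);
    close(fd);
    return 0;
}

Whether this path is actually as fast as the proposed eth_cvio device
is exactly the question Neil raises; the sketch only shows the
mechanism being compared against.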