From: "Tan, Jianfeng" <jianfeng.tan@intel.com>
To: Thomas Monjalon
Cc: dev@dpdk.org, nakajima.yoshihiro@lab.ntt.co.jp, mst@redhat.com,
 ann.zhuangyanying@huawei.com
Subject: Re: [dpdk-dev] [PATCH v2 0/5] virtio support for container
Date: Thu, 14 Apr 2016 14:08:19 +0800
Message-ID: <570F33D3.6030009@intel.com>
In-Reply-To: <1642018.IWC2Tt5SYA@xps13>
References: <1446748276-132087-1-git-send-email-jianfeng.tan@intel.com>
 <1454671228-33284-1-git-send-email-jianfeng.tan@intel.com>
 <1642018.IWC2Tt5SYA@xps13>

Hi Thomas,

On 4/14/2016 12:14 AM, Thomas Monjalon wrote:
> Hi Jianfeng,
>
> Thanks for raising the container issues and proposing some solutions.
> General comments below.
>
> 2016-02-05 19:20, Jianfeng Tan:
>> This patchset provides a high-performance networking interface (virtio)
>> for container-based DPDK applications. Starting DPDK apps in containers
>> with exclusive ownership of NIC devices is beyond its scope.
>> The basic idea is to present a new virtual device (named eth_cvio),
>> which can be discovered and initialized in container-based DPDK apps
>> using rte_eal_init(). To minimize the change, we reuse the existing
>> virtio frontend driver code (drivers/net/virtio/).
>>
>> Compared to the QEMU/VM case, the virtio device framework (which
>> translates I/O port r/w operations into the unix socket/cuse protocol,
>> and is originally provided by QEMU) is integrated into the virtio
>> frontend driver. So this converged driver actually plays both the role
>> of the original frontend driver and the role of the QEMU device
>> framework.
>>
>> The major difference lies in how to calculate the relative address for
>> vhost. The principle of virtio is: based on one or multiple shared
>> memory segments, vhost maintains a reference system with the base
>> address and length of each segment, so that an address coming from the
>> VM (usually a GPA, Guest Physical Address) can be translated into a
>> vhost-recognizable address (named VVA, Vhost Virtual Address). To
>> decrease the overhead of address translation, we should maintain as few
>> segments as possible. In the VM's case, the GPA is always locally
>> continuous. In the container's case, the CVA (Container Virtual
>> Address) can be used. Specifically:
>> a. when set_base_addr is called, the CVA is used;
>> b. when preparing RX's descriptors, the CVA is used;
>> c. when transmitting packets, the CVA is filled in TX's descriptors;
>> d. in TX and CQ's header, the CVA is used.
>>
>> How to share memory? In the VM's case, qemu always shares the whole
>> physical layout with the backend. But it is not feasible for a
>> container, as a process, to share all of its virtual memory regions
>> with the backend. So only specified virtual memory regions (of the
>> shared type) are sent to the backend. It is a limitation that only
>> addresses in these areas can be used to transmit or receive packets.
>>
>> Known issues
>>
>> a. When used with vhost-net, root privilege is required to create the
>>    tap device inside.
>> b. Control queue and multi-queue are not supported yet.
>> c. When the --single-file option is used, the socket_id of the memory
>>    may be wrong. (Use "numactl -N x -m x" to work around this for now.)
>
> There are 2 different topics in this patchset:
> 1/ How to provide networking in containers
> 2/ How to provide memory in containers
>
> 1/ You have decided to use the virtio spec to bridge the host
> with its containers. But there is no virtio device in a container
> and no vhost interface in the host (except the kernel one).
> So you are extending virtio to work as a vdev inside the container.
> Could you explain what is the datapath between virtio and the host app?

The datapath is based on shared memory, which is negotiated via the
vhost-user protocol over a unix socket. So the key requirement of this
approach is to map the unix socket into the container.

> Does it need to use a fake device from Qemu as Tetsuya has done?

In this implementation, we don't need a fake device from Qemu as Tetsuya
is doing. We just maintain a virtual virtio device in the DPDK EAL layer
and talk to vhost via the unix socket.

It is worth pointing out the difference between the two implementations:
this approach hooks into the existing virtio PMD at the layer of struct
virtio_pci_ops, whereas Tetsuya's solution intercepts r/w accesses to the
I/O port or PCI configuration space.
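To make that concrete, here is a rough sketch of what "hooking in at the
virtio_pci_ops layer" means. The vtable below is cut down and the names
(virtio_ops, the cvio_* callbacks) are illustrative, not the exact fields
of the driver's struct virtio_pci_ops; it only shows where the
vdev-backed callbacks diverge from the I/O-port-backed ones.

#include <stdint.h>

struct virtio_hw;                        /* opaque device context */

/* Simplified vtable in the spirit of struct virtio_pci_ops. */
struct virtio_ops {
        uint8_t  (*get_status)(struct virtio_hw *hw);
        void     (*set_status)(struct virtio_hw *hw, uint8_t status);
        uint64_t (*get_features)(struct virtio_hw *hw);
        void     (*set_features)(struct virtio_hw *hw, uint64_t feat);
        int      (*setup_queue)(struct virtio_hw *hw, uint16_t qidx);
        void     (*notify_queue)(struct virtio_hw *hw, uint16_t qidx);
};

/* QEMU/VM case: every hook ends up as an I/O port (or PCI config space)
 * access, which QEMU traps and forwards to vhost. */
extern const struct virtio_ops legacy_ioport_ops;

/* Container case (eth_cvio): the hooks talk to the vhost-user unix
 * socket directly; there is no I/O port and no QEMU in between. */
static void
cvio_set_status(struct virtio_hw *hw, uint8_t status)
{
        /* e.g. on DRIVER_OK, send the memory table and vring addresses
         * (CVAs) to the backend over the socket. */
        (void)hw; (void)status;
}

static void
cvio_notify_queue(struct virtio_hw *hw, uint16_t qidx)
{
        /* kick the backend through the queue's eventfd instead of
         * writing to the notify register. */
        (void)hw; (void)qidx;
}

static const struct virtio_ops cvio_ops = {
        .set_status   = cvio_set_status,
        .notify_queue = cvio_notify_queue,
        /* remaining hooks omitted for brevity */
};

As noted above, Tetsuya's approach instead leaves the PMD untouched and
intercepts the accesses one layer lower.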
> Do you think there can be some alternatives to vhost/virtio in containers?

Yes, we were considering another way: creating the virtual virtio device
in kernel space, driven by a new kernel module (instead of virtio-net)
plus a new library (maybe in DPDK). The control path would then go from
app -> library -> kernel -> vhost-user (or vhost-net), and the data path
would still be based on the negotiated shared memory and some vring
structures inside that memory. However, since this involves yet another
new kernel module, I don't think it is an easy way to go.

> 2/ The memory management is already a mess and it's going worst.
> I think we need to think the requirements first and then write a proper
> implementation to cover every identified needs.
> I have started a new thread to cover this part:
> http://thread.gmane.org/gmane.comp.networking.dpdk.devel/37445

I agree we should isolate the memory problem from the network interface
problem. The memory problem is not a blocker for this patchset; we can go
ahead without changing the memory part, although that makes it harder to
use. We'll join that thread to discuss this further.

Thanks,
Jianfeng
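P.S. For readers not familiar with how the vhost backend resolves the
addresses discussed in the cover letter, below is a minimal sketch of the
region-based translation, assuming a simple region table (the names are
illustrative, not the actual vhost code). In the VM case the incoming
address is a GPA; in the container case it is simply the CVA, so ideally
one region covers all of the shared memory and the lookup is trivial.

#include <stdint.h>

/* One entry of the reference system built from the memory table the
 * frontend sends: base and length of one shared segment. */
struct mem_region {
        uint64_t guest_addr;  /* GPA (VM case) or CVA (container case) */
        uint64_t host_addr;   /* VVA: where the backend mapped it */
        uint64_t size;
};

/* Translate an address found in a descriptor into a VVA the backend can
 * dereference; returns 0 when it falls outside the shared regions,
 * which is exactly the container limitation mentioned above. */
static uint64_t
to_vva(const struct mem_region *regions, int nregions, uint64_t addr)
{
        for (int i = 0; i < nregions; i++) {
                const struct mem_region *r = &regions[i];

                if (addr >= r->guest_addr && addr < r->guest_addr + r->size)
                        return r->host_addr + (addr - r->guest_addr);
        }
        return 0;
}

The fewer regions there are, the cheaper this lookup gets, which is why
the cover letter argues for keeping the number of memory segments small.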