From: "Xie, Huawei"
To: Zhuangyanying, "dev@dpdk.org"
Cc: "gaoxiaoqiu@huawei.com", "oscar.zhangbo@huawei.com", "zhbzg@huawei.com", "guohongzhen@huawei.com", "zhoujingbin@huawei.com"
Date: Thu, 20 Aug 2015 10:14:55 +0000
Subject: Re: [dpdk-dev] vhost compliant virtio based networking interface in container

Added dev@dpdk.org

On 8/20/2015 6:04 PM, Xie, Huawei wrote:
> Yanping:
> I read your mail; it seems what we did is quite similar. Here is a
> quick mail describing our design. Let me know if it is the same thing.
>
> Problem Statement:
> We don't have a high-performance networking interface for containers in
> NFV. The current veth-pair based interface cannot easily be accelerated.
>
> The key components involved:
> 1. DPDK-based virtio PMD driver in the container.
> 2. Device simulation framework in the container.
> 3. DPDK (or kernel) vhost running on the host.
>
> How is virtio created?
> A: There is no "real" virtio-pci device in the container environment.
> 1) The host maintains pools of memory and shares memory with the
>    container. This can be accomplished by having the host share a
>    hugepage file with the container.
> 2) The container creates the virtio rings on the shared memory.
> 3) The container creates mbuf memory pools on the shared memory.
> 4) The container sends the memory and vring information to vhost
>    through vhost messages. This can be done either through ioctl calls
>    or through vhost-user messages.
>
> How is the vhost message sent?
> A: There are two alternative ways to do this.
> 1) The customized virtio PMD is responsible for all the vring creation
>    and vhost message sending.
> 2) We could do this through a lightweight device simulation framework.
>    The device simulation creates a simple PCI bus. On the PCI bus,
>    virtio-net PCI devices are created. The device simulation provides
>    an IOAPI for MMIO/IO access.
>    2.1 The virtio PMD configures the pseudo virtio device the same way
>        it does in a KVM guest environment.
>    2.2 Rather than using IO instructions, the virtio PMD uses the IOAPI
>        for IO operations on the virtio-net PCI device.
>    2.3 The device simulation is responsible for the device state
>        machine simulation.
>    2.4 The device simulation is responsible for talking to vhost.
>    With this approach, we can minimize the virtio PMD modifications;
>    to the virtio PMD it is like configuring a real virtio-net PCI
>    device.
>
> Memory mapping?
> A: QEMU can access the whole guest memory in the KVM environment; we
> need to fill that gap here.
> The container maps the shared memory into the container's virtual
> address space and the host maps it into the host's virtual address
> space, so there is a fixed-offset mapping between the two.
> The container creates the shared vrings on this memory, and also
> creates the mbuf memory pools on it.
> In the VHOST_SET_MEM_TABLE message, we send the memory mapping
> information for the shared memory. Because the mbuf pools are created
> on the shared memory and all buffers are allocated from those pools,
> DPDK vhost can translate the addresses in the vring descriptors to
> host virtual addresses.
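
To make the sharing concrete, here is a minimal sketch of the container
side, assuming the host has already placed a hugepage file at a path the
container can see (the path, size and error handling are placeholders,
not our actual code):

    /*
     * Illustrative only: the container maps the hugepage file shared by
     * the host; vrings and mbuf pools are later carved out of this area.
     */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHARED_FILE "/dev/hugepages/container0"  /* shared by the host */
    #define SHARED_SIZE (1ULL << 30)                 /* e.g. 1 GB */

    void *shared_base;  /* base address of the region in the container */

    int map_shared_region(void)
    {
        int fd = open(SHARED_FILE, O_RDWR);

        if (fd < 0)
            return -1;

        shared_base = mmap(NULL, SHARED_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_POPULATE, fd, 0);
        close(fd);
        return shared_base == MAP_FAILED ? -1 : 0;
    }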
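
On the host side, the memory-table message carries one or more region
descriptions, and translation becomes a fixed-offset calculation per
region. A rough sketch follows; field names loosely follow the vhost
convention, and this is an illustration, not the DPDK structures
themselves:

    #include <stdint.h>

    /* One region as announced in the memory-table message. */
    struct shm_region {
        uint64_t guest_phys_addr;  /* address space used by the vring descriptors */
        uint64_t memory_size;      /* size of the shared region */
        uint64_t userspace_addr;   /* the container's mapping address */
    };

    /*
     * Host side: the same hugepage file is mapped at host_base, so any
     * descriptor address inside the region translates with a fixed
     * offset, just as DPDK vhost already does for a QEMU guest.
     */
    static inline void *
    region_addr_to_hva(const struct shm_region *r, uint64_t addr,
                       void *host_base)
    {
        if (addr < r->guest_phys_addr ||
            addr >= r->guest_phys_addr + r->memory_size)
            return NULL;
        return (char *)host_base + (addr - r->guest_phys_addr);
    }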

> GPA or CVA in the vring descriptors?
> A: To ease the memory translation, rather than using the GPA we use the
> CVA (container virtual address). This is the tricky part:
> 1) The virtio PMD writes the vring's VFN rather than its PFN to the PFN
>    register through the IOAPI.
> 2) The device simulation framework uses the VFN as the PFN.
> 3) The device simulation sends SET_VRING_ADDR with the CVA.
> 4) The virtio PMD fills the vring descriptors with the CVA of the mbuf
>    data pointer rather than the GPA.
> So when the host sees the CVA, it can translate it to an HVA (host
> virtual address).
>
> Worth noting:
> The virtio interface in the container follows the vhost message format
> and is compliant with the DPDK vhost implementation, i.e. no DPDK vhost
> modification is needed.
> vhost isn't aware of whether the incoming virtio comes from a KVM guest
> or from a container.
>
> That pretty much covers the high-level design. There are quite a few
> low-level issues. For example, a 32-bit PFN is enough for a KVM guest,
> but since we use a 64-bit VFN (virtual page frame number), a trick is
> needed; it is done through a special IOAPI (a rough sketch follows
> below the quote).
>
> /huawei
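
To make that last point concrete, here is a rough sketch of the VFN
write. The register offset and the 12-bit shift follow the legacy
virtio-pci layout; ioapi_write64() is a placeholder name for the
device-simulation IO hook, not an existing DPDK API:

    #include <stdint.h>

    #define VIRTIO_PCI_QUEUE_PFN        0x08  /* legacy virtio-pci register */
    #define VIRTIO_PCI_QUEUE_ADDR_SHIFT 12

    /* Placeholder for the 64-bit variant of the device-simulation IO hook. */
    void ioapi_write64(unsigned int reg, uint64_t val);

    static void
    set_queue_base(void *vring_cva)
    {
        /*
         * A KVM guest writes GPA >> 12 into the 32-bit PFN register; here
         * we write CVA >> 12 (the VFN), which no longer fits in 32 bits,
         * hence the special 64-bit IO hook instead of a plain 32-bit
         * register write.
         */
        uint64_t vfn = (uintptr_t)vring_cva >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;

        ioapi_write64(VIRTIO_PCI_QUEUE_PFN, vfn);
    }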