From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <jianfeng.tan@intel.com>
Received: from mga14.intel.com (mga14.intel.com [192.55.52.115])
 by dpdk.org (Postfix) with ESMTP id DF9848E83
 for <dev@dpdk.org>; Tue, 24 Nov 2015 07:19:11 +0100 (CET)
Received: from fmsmga004.fm.intel.com ([10.253.24.48])
 by fmsmga103.fm.intel.com with ESMTP; 23 Nov 2015 22:19:10 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.20,338,1444719600"; 
   d="scan'208";a="1385164"
Received: from fmsmsx106.amr.corp.intel.com ([10.18.124.204])
 by fmsmga004.fm.intel.com with ESMTP; 23 Nov 2015 22:19:10 -0800
Received: from fmsmsx117.amr.corp.intel.com (10.18.116.17) by
 FMSMSX106.amr.corp.intel.com (10.18.124.204) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Mon, 23 Nov 2015 22:19:10 -0800
Received: from shsmsx102.ccr.corp.intel.com (10.239.4.154) by
 fmsmsx117.amr.corp.intel.com (10.18.116.17) with Microsoft SMTP Server (TLS)
 id 14.3.248.2; Mon, 23 Nov 2015 22:19:09 -0800
Received: from shsmsx152.ccr.corp.intel.com ([169.254.6.193]) by
 shsmsx102.ccr.corp.intel.com ([169.254.2.42]) with mapi id 14.03.0248.002;
 Tue, 24 Nov 2015 14:19:07 +0800
From: "Tan, Jianfeng" <jianfeng.tan@intel.com>
To: Zhuangyanying <ann.zhuangyanying@huawei.com>, "dev@dpdk.org" <dev@dpdk.org>
Thread-Topic: [RFC 0/5] virtio support for container
Thread-Index: AQHRGDLgBZCTKplke0WF7Abcv5iuGZ6qILcAgACgdHA=
Date: Tue, 24 Nov 2015 06:19:07 +0000
Message-ID: <ED26CBA2FAD1BF48A8719AEF02201E366485DC@SHSMSX152.ccr.corp.intel.com>
References: <1446748276-132087-1-git-send-email-jianfeng.tan@intel.com>
 <EC9759BC1E3E98429B5DE9A03DF86D8B592FAF25@SZXEMA502-MBX.china.huawei.com>
In-Reply-To: <EC9759BC1E3E98429B5DE9A03DF86D8B592FAF25@SZXEMA502-MBX.china.huawei.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
x-originating-ip: [10.239.127.40]
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
Cc: "nakajima.yoshihiro@lab.ntt.co.jp" <nakajima.yoshihiro@lab.ntt.co.jp>,
 Zhbzg <zhbzg@huawei.com>, "mst@redhat.com" <mst@redhat.com>,
 gaoxiaoqiu <gaoxiaoqiu@huawei.com>,
 "Zhangbo \(Oscar\)" <oscar.zhangbo@huawei.com>,
 Zhoujingbin <zhoujingbin@huawei.com>, Guohongzhen <guohongzhen@huawei.com>
Subject: Re: [dpdk-dev] [RFC 0/5] virtio support for container
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK <dev.dpdk.org>
List-Unsubscribe: <http://dpdk.org/ml/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://dpdk.org/ml/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <http://dpdk.org/ml/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
X-List-Received-Date: Tue, 24 Nov 2015 06:19:12 -0000



> -----Original Message-----
> From: Zhuangyanying [mailto:ann.zhuangyanying@huawei.com]
> Sent: Tuesday, November 24, 2015 11:53 AM
> To: Tan, Jianfeng; dev@dpdk.org
> Cc: mst@redhat.com; mukawa@igel.co.jp; nakajima.yoshihiro@lab.ntt.co.jp;
> Qiu, Michael; Guohongzhen; Zhoujingbin; Zhangbo (Oscar); gaoxiaoqiu;
> Zhbzg; Xie, Huawei
> Subject: RE: [RFC 0/5] virtio support for container
>
>
>
> > -----Original Message-----
> > From: Jianfeng Tan [mailto:jianfeng.tan@intel.com]
> > Sent: Friday, November 06, 2015 2:31 AM
> > To: dev@dpdk.org
> > Cc: mst@redhat.com; mukawa@igel.co.jp;
> nakajima.yoshihiro@lab.ntt.co.jp;
> > michael.qiu@intel.com; Guohongzhen; Zhoujingbin; Zhuangyanying;
> Zhangbo
> > (Oscar); gaoxiaoqiu; Zhbzg; huawei.xie@intel.com; Jianfeng Tan
> > Subject: [RFC 0/5] virtio support for container
> >
...
> > 2.1.4
>
> This patch raises a good idea: adding an extra abstracted IO layer, which
> would make it simple to extend the function to a kernel-mode switch (such
> as OVS). That's great.
> But I have one question here:
>     It's about VHOST_USER_SET_MEM_TABLE. You allocate memory from the
> tmpfs filesystem, just one fd, and can use rte_memseg_info_get() to
> directly get the memory topology. However, things change in kernel
> space, because the mempool should be created on each container's
> hugetlbfs (rather than tmpfs), which is separate for each container;
> and finally there is the ioctl's parameter to consider.
>        My solution is as follows for your reference:
> /*
> 	reg = mem->regions;
> 	reg->guest_phys_addr = (__u64)((struct virtqueue *)
> 		(dev->data->rx_queues[0]))->mpool->elt_va_start;
> 	reg->userspace_addr = reg->guest_phys_addr;
> 	reg->memory_size = ((struct virtqueue *)
> 		(dev->data->rx_queues[0]))->mpool->elt_va_end
> 		- reg->guest_phys_addr;
>
> 	reg = mem->regions + 1;
> 	reg->guest_phys_addr = (__u64)(((struct virtqueue *)
> 		(dev->data->tx_queues[0]))->virtio_net_hdr_mem);
> 	reg->userspace_addr = reg->guest_phys_addr;
> 	reg->memory_size = vq_size * internals->vtnet_hdr_size;
> */
> 	   But it's a little ugly. Any better idea?

Hi Yanying,

Your solution seems OK to me when used with kernel vhost-net, because the
vhost kthread shares the same mm_struct with the virtio process. But it
will not work with vhost-user, which realizes memory sharing by passing
fds in sendmsg(). Worse, it will not work with userspace vhost_cuse (see
lib/librte_vhost/vhost_cuse/) either, because the current implementation
supposes that the VM's physical memory is backed by one huge file.
Actually, what we need to do is enhance userspace vhost_cuse so that it
supports cross-file memory regions.
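
Just to illustrate the vhost-user point, here is a minimal sketch (not
DPDK code; the send_fds() helper and the 8-fd cap are only illustrative)
of how the fds backing each memory region are carried as SCM_RIGHTS
ancillary data in sendmsg(), so the backend can mmap() the same pages:

	#include <string.h>
	#include <sys/socket.h>
	#include <sys/uio.h>

	/* Hypothetical helper: send a vhost-user message together with
	 * the fds of the files backing its regions. Assumes nfds <= 8. */
	static int
	send_fds(int sock, const void *payload, size_t len, int *fds, int nfds)
	{
		char control[CMSG_SPACE(8 * sizeof(int))];
		struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
		struct msghdr msg;
		struct cmsghdr *cmsg;

		memset(&msg, 0, sizeof(msg));
		memset(control, 0, sizeof(control));
		msg.msg_iov = &iov;
		msg.msg_iovlen = 1;
		msg.msg_control = control;
		msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));

		cmsg = CMSG_FIRSTHDR(&msg);
		cmsg->cmsg_level = SOL_SOCKET;
		cmsg->cmsg_type = SCM_RIGHTS;	/* pass file descriptors */
		cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
		memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));

		return sendmsg(sock, &msg, 0);	/* fds travel with the message */
	}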

Below are some possible solutions to support hugetlbfs, FYI:

To support hugetlbfs, my previous idea was to use the -v option of "docker run"
to map hugetlbfs into the container's /dev/shm, so that we can create a "huge"
shm file on hugetlbfs. But this does not seem to have been accepted by others.

You mentioned that DPDK now creates a file for each hugepage. Maybe we just
need to share all these hugepage files with vhost. To minimize the memory
translation effort, we would need to use as few pages as possible. Can you
accept this solution?
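
As a rough sketch of that direction (this is not the actual patch;
struct vhost_memory_region and fill_mem_table() below are only
illustrative, and obtaining the fd of the hugepage file backing each
segment is still left out), one could walk DPDK's memseg layout and
describe one region per segment for VHOST_USER_SET_MEM_TABLE:

	#include <stdint.h>
	#include <rte_memory.h>

	/* Mirrors the region layout of the vhost-user SET_MEM_TABLE message. */
	struct vhost_memory_region {
		uint64_t guest_phys_addr;
		uint64_t memory_size;
		uint64_t userspace_addr;
		uint64_t mmap_offset;
	};

	/* Illustrative only: fill one region per DPDK memory segment. */
	static int
	fill_mem_table(struct vhost_memory_region *regions, int max_regions)
	{
		const struct rte_memseg *ms = rte_eal_get_physmem_layout();
		int i, n = 0;

		for (i = 0; i < RTE_MAX_MEMSEG && n < max_regions; i++) {
			if (ms[i].addr == NULL)
				break;			/* end of layout */
			regions[n].guest_phys_addr = ms[i].phys_addr;
			regions[n].userspace_addr = (uintptr_t)ms[i].addr;
			regions[n].memory_size = ms[i].len;
			regions[n].mmap_offset = 0;
			n++;
		}
		return n;	/* number of regions described */
	}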

Thanks,
Jianfeng