Date: Tue, 8 Mar 2016 10:44:37 +0800
From: Yuanhan Liu
To: "Tan, Jianfeng"
Cc: nakajima.yoshihiro@lab.ntt.co.jp, mst@redhat.com, dev@dpdk.org,
	p.fedin@samsung.com, ann.zhuangyanying@huawei.com
Message-ID: <20160308024437.GJ14300@yliu-dev.sh.intel.com>
In-Reply-To: <56DE30FE.7020809@intel.com>
Subject: Re: [dpdk-dev] [PATCH v2 1/5] mem: add --single-file to create single mem-backed file

On Tue, Mar 08, 2016 at 09:55:10AM +0800, Tan, Jianfeng wrote:
> Hi Yuanhan,
>
> On 3/7/2016 9:13 PM, Yuanhan Liu wrote:
> >CC'ed the EAL hugepage maintainer, which is something you should do
> >when sending a patch.
>
> Thanks for doing this.
>
> >On Fri, Feb 05, 2016 at 07:20:24PM +0800, Jianfeng Tan wrote:
> >>Originally, there are two cons in using hugepages: a. it needs root
> >>privilege to touch /proc/self/pagemap, which is a premise to
> >>allocate physically contiguous memsegs; b. possibly too many
> >>hugepage files are created, especially when used with 2M hugepages.
> >>
> >>Virtual devices don't care about the physical contiguity of the
> >>allocated hugepages at all. Option --single-file provides a way to
> >>allocate all hugepages into a single mem-backed file.
> >>
> >>Known issues:
> >>a. The single-file option relies on the kernel to allocate
> >>numa-affinitive memory.
> >>b. Possible ABI break: originally, --no-huge uses anonymous memory
> >>instead of a file-backed way to create memory.
> >>
> >>Signed-off-by: Huawei Xie
> >>Signed-off-by: Jianfeng Tan
> >...
> >>@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
> >> 			"be specified together with --"OPT_NO_HUGE"\n");
> >> 		return -1;
> >> 	}
> >>+	if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
> >>+		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
> >>+			"be specified together with --"OPT_SOCKET_MEM"\n");
> >>+		return -1;
> >>+	}
> >>+	if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
> >>+		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
> >>+			"be specified together with --"OPT_SINGLE_FILE"\n");
> >>+		return -1;
> >>+	}
> >
> >The two limitations don't make sense to me.
>
> For the force_sockets option, my original thought on the --single-file
> option is: we don't sort those pages (which requires root/cap_sys_admin)
> and we don't even look up numa information, because the file may contain
> both sockets' memory.
>
> For the hugepage_unlink option, those hugepage files get closed at the
> end of memory initialization, even if we don't unlink them, so we
> cannot share them with other processes (say a backend).

Yeah, I know where the two limitations come from in your implementation.
I was just wondering whether they both are __truly__ limitations. I mean,
can we get rid of them somehow?

For the --socket-mem option, if we can't handle it well, or if we could
just ignore the socket_id of the allocated huge pages, then yes, the
limitation is a true one.

But for the second option, no, we should be able to make it work well
together. One extra action is that you should not invoke "close(fd)" on
those huge page files. Then you can get all the information, as I stated
in a reply to your 2nd patch.

>
> >
> >>diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>index 6008533..68ef49a 100644
> >>--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
> >> 	/* get pointer to global configuration */
> >> 	mcfg = rte_eal_get_configuration()->mem_config;
> >>-	/* hugetlbfs can be disabled */
> >>-	if (internal_config.no_hugetlbfs) {
> >>-		addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
> >>-			MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> >>+	/* when hugetlbfs is disabled or single-file option is specified */
> >>+	if (internal_config.no_hugetlbfs || internal_config.single_file) {
> >>+		int fd;
> >>+		uint64_t pagesize;
> >>+		unsigned socket_id = rte_socket_id();
> >>+		char filepath[MAX_HUGEPAGE_PATH];
> >>+
> >>+		if (internal_config.no_hugetlbfs) {
> >>+			eal_get_hugefile_path(filepath, sizeof(filepath),
> >>+					"/dev/shm", 0);
> >>+			pagesize = RTE_PGSIZE_4K;
> >>+		} else {
> >>+			struct hugepage_info *hpi;
> >>+
> >>+			hpi = &internal_config.hugepage_info[0];
> >>+			eal_get_hugefile_path(filepath, sizeof(filepath),
> >>+					hpi->hugedir, 0);
> >>+			pagesize = hpi->hugepage_sz;
> >>+		}
> >>+		fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
> >>+		if (fd < 0) {
> >>+			RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
> >>+					__func__, filepath, strerror(errno));
> >>+			return -1;
> >>+		}
> >>+
> >>+		if (ftruncate(fd, internal_config.memory) < 0) {
> >>+			RTE_LOG(ERR, EAL, "ftuncate %s failed: %s\n",
> >>+					filepath, strerror(errno));
> >>+			return -1;
> >>+		}
> >>+
> >>+		addr = mmap(NULL, internal_config.memory,
> >>+				PROT_READ | PROT_WRITE,
> >>+				MAP_SHARED | MAP_POPULATE, fd, 0);
> >> 		if (addr == MAP_FAILED) {
> >>-			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
> >>-				strerror(errno));
> >>+			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
> >>+				__func__, strerror(errno));
> >> 			return -1;
> >> 		}
> >> 		mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
> >> 		mcfg->memseg[0].addr = addr;
> >>-		mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
> >>+		mcfg->memseg[0].hugepage_sz = pagesize;
> >> 		mcfg->memseg[0].len = internal_config.memory;
> >>-		mcfg->memseg[0].socket_id = 0;
> >>+		mcfg->memseg[0].socket_id = socket_id;
> >
> >I saw quite a few issues:
> >
> >- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
> >  mounted at /dev/hugepages and /mnt, respectively.
> >
> >  Here we then get a 5G internal_config.memory, and your code will
> >  try to mmap 5G on the first mount point (/dev/hugepages) due to the
> >  hardcoded logic in your code:
> >
> >      hpi = &internal_config.hugepage_info[0];
> >      eal_get_hugefile_path(filepath, sizeof(filepath),
> >                            hpi->hugedir, 0);
> >
> >  But it has only 4G in total; therefore, it will fail.
>
> As mentioned above, this case is not in the original design of
> --single-file.

But it's such a common case, isn't it?

>
> >- As you stated, socket_id is hardcoded, which could be wrong.
>
> We rely on the OS to allocate hugepages, and cannot promise that the
> physical hugepages in the big hugepage file are from the same socket.
>
> >- As stated above, the option limitation doesn't seem right to me.
> >
> >  I mean, --single-file should be able to work with the --socket-mem
> >  option semantically.
>
> If we'd like to work well with the --socket-mem option, we need to use
> syscalls like set_mempolicy() and mbind(). So it'll bring a bigger
> change compared to the current one. I don't know if that's acceptable?

Yes, if that's the right way to go. But, as you also stated, I doubt we
really need to handle the numa affinity here, since it's complex.

>
> >And I have been thinking about how to deal with those issues properly,
> >and a __very immature__ solution came to my mind (which could simply
> >not work), but anyway, here it is FYI: we go through the same process
> >to handle normal huge page initialization for the --single-file option
> >as well, but we take different actions, or no actions at all, at some
> >stages when that option is given, which is a bit similar to the way
> >RTE_EAL_SINGLE_FILE_SEGMENTS is handled.
> >
> >And we create one hugepage file for each node and each page size. For
> >a system like mine above (2 nodes), it may populate files like the
> >following:
> >
> >- 1G x 2 on node0
> >- 1G x 2 on node1
> >- 2M x 256 on node0
> >- 2M x 256 on node1
> >
> >That could normally fit your case. Though 4 nodes looks like the
> >maximum node number, the --socket-mem option may relieve the limit a
> >bit.
> >
> >And if we "could" not care about the socket_id being set correctly, we
> >could just simply allocate one file for each hugepage size. That would
> >work well for your container enabling.
>
> This way seems a good option at first sight. Let's compare this new way
> with the original design.
>
> The original design just covers the simplest scenario:
> a. just one hugetlbfs mount (the new way can support multiple
>    hugetlbfs mounts)
> b. does not require root privilege (the new way can achieve this by
>    using the above-mentioned mbind() or set_mempolicy() syscalls)
> c. no sorting (both ways are OK)
> d. performance: from the perspective of virtio for container, we care
>    more about the performance of address translation in the vhost. In
>    the vhost we now adopt an O(n) linear comparison to translate
>    addresses (this can be optimized to O(log n) using a segment tree,
>    or even better using a cache; sorry, that's just another problem),
>    so we should maintain as few files as possible. (The new way can
>    achieve this when used with --socket-mem, --huge-dir.)
> e. numa awareness is not required (and it's complex). (The new way can
>    solve this, though without any promise.)
>
> In all, this new way seems great to me.
>
> Another thing: if we "go through the same process to handle normal huge
> page initialization", my consideration is that RTE_EAL_SINGLE_FILE_SEGMENTS
> goes that way to maximize code reuse, but the new way has little code in
> common with the original ways.
> And mixing these options together leads to bad readability.
>
> What do you think?

Indeed. I've already found that the code is a bit hard to read, due to
the many "#ifdef ... #else ... #endif" blocks, for
RTE_EAL_SINGLE_FILE_SEGMENTS as well as some special archs.

Therefore, I would suggest doing it as below: add another option based
on the SINGLE_FILE_SEGMENTS implementation. I mean, SINGLE_FILE_SEGMENTS
already tries to generate as few files as possible. If we add another
option, say --no-sort (or --no-phys-continuity), we could add just a few
lines of code to let it generate one file for each huge page size (if we
don't consider the numa affinity).

>
> >BTW, since we already have the SINGLE_FILE_SEGMENTS (config) option,
> >adding another option --single-file looks really confusing to me.
> >
> >To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and
> >add another option, say --no-sort (I confess this name sucks, but you
> >get my point). With that, we could make sure to create as few huge
> >page files as possible, to fit your case.
>
> This is great advice. So what do you think of --converged, or
> --no-scattered-mem, or any better idea?

TBH, none of them looks great to me, either. But I have no better
options. Well, --no-phys-continuity looks like the best option to me so
far :)

	--yliu

> Thanks for the valuable input.
>
> Jianfeng
>
> >	--yliu
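
P.S. For illustration only, and not part of the patch under review: a
minimal sketch of the mbind() idea mentioned above, i.e. mapping one
mem-backed file and binding it to a given node without root privilege.
The file path, permissions and the map_hugefile_on_node() helper name
are made up for this example; real EAL code would have to wire this into
internal_config and the memseg setup.

/*
 * Sketch only: map one file-backed region and bind it to a NUMA node.
 * Build with -lnuma for the mbind() wrapper.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <numaif.h>

static void *
map_hugefile_on_node(const char *path, size_t len, int node)
{
	unsigned long nodemask = 1UL << node;
	void *va;
	int fd;

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, len) < 0) {
		close(fd);
		return NULL;
	}

	/* Reserve the range first; no MAP_POPULATE here, otherwise the
	 * pages get allocated before the policy below is applied. */
	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (va == MAP_FAILED) {
		close(fd);
		return NULL;
	}

	/* Ask the kernel to back this range from the given node ... */
	if (mbind(va, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0) != 0) {
		munmap(va, len);
		close(fd);
		return NULL;
	}

	/* ... then fault the pages in so the policy takes effect. */
	memset(va, 0, len);

	/* Keep fd open on purpose, so the file can still be shared with
	 * another process (e.g. a vhost backend). */
	return va;
}

The ordering matters: with MAP_POPULATE (as in the quoted patch), the
pages would already be allocated before mbind() could apply the policy.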