Date: Tue, 8 Mar 2016 10:44:37 +0800
From: Yuanhan Liu
To: "Tan, Jianfeng"
Cc: nakajima.yoshihiro@lab.ntt.co.jp, mst@redhat.com, dev@dpdk.org,
	p.fedin@samsung.com, ann.zhuangyanying@huawei.com
Message-ID: <20160308024437.GJ14300@yliu-dev.sh.intel.com>
In-Reply-To: <56DE30FE.7020809@intel.com>
Subject: Re: [dpdk-dev] [PATCH v2 1/5] mem: add --single-file to create single mem-backed file

On Tue, Mar 08, 2016 at 09:55:10AM +0800, Tan, Jianfeng wrote:
> Hi Yuanhan,
>
> On 3/7/2016 9:13 PM, Yuanhan Liu wrote:
> >CC'ed the EAL hugepage maintainer, which is something you should do
> >when sending a patch.
>
> Thanks for doing this.
>
> >On Fri, Feb 05, 2016 at 07:20:24PM +0800, Jianfeng Tan wrote:
> >>Originally, there are two cons in using hugepages: a. it needs root
> >>privilege to touch /proc/self/pagemap, which is a premise to
> >>allocate physically contiguous memsegs; b. possibly too many
> >>hugepage files are created, especially when used with 2M hugepages.
> >>
> >>Virtual devices don't care about the physical contiguity of the
> >>allocated hugepages at all. Option --single-file provides a way to
> >>allocate all hugepages into a single mem-backed file.
> >>
> >>Known issues:
> >>a. The single-file option relies on the kernel to allocate
> >>numa-affinitive memory.
> >>b. Possible ABI break: originally, --no-huge uses anonymous memory
> >>instead of a file-backed way to create memory.
> >>
> >>Signed-off-by: Huawei Xie
> >>Signed-off-by: Jianfeng Tan
> >...
> >>@@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
> >> 			"be specified together with --"OPT_NO_HUGE"\n");
> >> 		return -1;
> >> 	}
> >>+	if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
> >>+		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
> >>+			"be specified together with --"OPT_SOCKET_MEM"\n");
> >>+		return -1;
> >>+	}
> >>+	if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
> >>+		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
> >>+			"be specified together with --"OPT_SINGLE_FILE"\n");
> >>+		return -1;
> >>+	}
> >
> >The two limitations don't make sense to me.
>
> For the force_sockets option, my original thought on the --single-file
> option is: we don't sort those pages (which requires root/cap_sys_admin)
> and we don't even look up numa information, because the file may contain
> both sockets' memory.
>
> For the hugepage_unlink option, those hugepage files get closed at the
> end of memory initialization, even if we don't unlink them, so we
> cannot share them with other processes (say a backend).

Yeah, I know where the two limitations come from in your implementation.
I was just wondering whether they both are __truly__ limitations. I mean,
can we get rid of them somehow?

For the --socket-mem option, if we can't handle it well, or if we could
just ignore the socket_id of the allocated huge pages, then yes, the
limitation is a true one.

But for the second option, no, we should be able to make it work well
together. One extra action is that you should not invoke "close(fd)" on
those huge page files. Then you can get all the information, as I stated
in a reply to your 2nd patch.

>
> >
> >>diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>index 6008533..68ef49a 100644
> >>--- a/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>+++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
> >>@@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
> >> 	/* get pointer to global configuration */
> >> 	mcfg = rte_eal_get_configuration()->mem_config;
> >>-	/* hugetlbfs can be disabled */
> >>-	if (internal_config.no_hugetlbfs) {
> >>-		addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
> >>-			MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> >>+	/* when hugetlbfs is disabled or single-file option is specified */
> >>+	if (internal_config.no_hugetlbfs || internal_config.single_file) {
> >>+		int fd;
> >>+		uint64_t pagesize;
> >>+		unsigned socket_id = rte_socket_id();
> >>+		char filepath[MAX_HUGEPAGE_PATH];
> >>+
> >>+		if (internal_config.no_hugetlbfs) {
> >>+			eal_get_hugefile_path(filepath, sizeof(filepath),
> >>+					"/dev/shm", 0);
> >>+			pagesize = RTE_PGSIZE_4K;
> >>+		} else {
> >>+			struct hugepage_info *hpi;
> >>+
> >>+			hpi = &internal_config.hugepage_info[0];
> >>+			eal_get_hugefile_path(filepath, sizeof(filepath),
> >>+					hpi->hugedir, 0);
> >>+			pagesize = hpi->hugepage_sz;
> >>+		}
> >>+		fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
> >>+		if (fd < 0) {
> >>+			RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
> >>+					__func__, filepath, strerror(errno));
> >>+			return -1;
> >>+		}
> >>+
> >>+		if (ftruncate(fd, internal_config.memory) < 0) {
> >>+			RTE_LOG(ERR, EAL, "ftuncate %s failed: %s\n",
> >>+					filepath, strerror(errno));
> >>+			return -1;
> >>+		}
> >>+
> >>+		addr = mmap(NULL, internal_config.memory,
> >>+				PROT_READ | PROT_WRITE,
> >>+				MAP_SHARED | MAP_POPULATE, fd, 0);
> >> 		if (addr == MAP_FAILED) {
> >>-			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
> >>-				strerror(errno));
> >>+			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
> >>+				__func__, strerror(errno));
> >> 			return -1;
> >> 		}
> >> 		mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
> >> 		mcfg->memseg[0].addr = addr;
> >>-		mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
> >>+		mcfg->memseg[0].hugepage_sz = pagesize;
> >> 		mcfg->memseg[0].len = internal_config.memory;
> >>-		mcfg->memseg[0].socket_id = 0;
> >>+		mcfg->memseg[0].socket_id = socket_id;
> >
> >I saw quite a few issues:
> >
> >- Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
> >  mounted at /dev/hugepages and /mnt, respectively.
> >
> >  Here we then get a 5G internal_config.memory, and your code will
> >  try to mmap 5G on the first mount point (/dev/hugepages) due to the
> >  hardcoded logic in your code:
> >
> >      hpi = &internal_config.hugepage_info[0];
> >      eal_get_hugefile_path(filepath, sizeof(filepath),
> >                            hpi->hugedir, 0);
> >
> >  But it has only 4G in total; therefore, it will fail.
>
> As mentioned above, this case is not in the original design of
> --single-file.

But it's such a common case, isn't it?

>
> >- As you stated, socket_id is hardcoded, which could be wrong.
>
> We rely on the OS to allocate hugepages, and cannot promise that the
> physical hugepages in the big hugepage file are from the same socket.
>
> >- As stated above, the option limitation doesn't seem right to me.
> >
> >  I mean, --single-file should be able to work with the --socket-mem
> >  option semantically.
>
> If we'd like to work well with the --socket-mem option, we need to use
> syscalls like set_mempolicy() and mbind(). So it'll bring a bigger
> change compared to the current one. I don't know if that's acceptable?

Yes, if that's the right way to go. But, as you also stated, I doubt we
really need to handle the numa affinity here, since it's complex.

>
> >And I have been thinking about how to deal with those issues properly,
> >and a __very immature__ solution came to my mind (which could simply
> >not work), but anyway, here it is FYI: we go through the same process
> >to handle normal huge page initialization for the --single-file option
> >as well, but we take different actions, or no actions at all, at some
> >stages when that option is given, which is a bit similar to the way
> >RTE_EAL_SINGLE_FILE_SEGMENTS is handled.
> >
> >And we create one hugepage file for each node and each page size. For
> >a system like mine above (2 nodes), it may populate files like the
> >following:
> >
> >- 1G x 2 on node0
> >- 1G x 2 on node1
> >- 2M x 256 on node0
> >- 2M x 256 on node1
> >
> >That could normally fit your case. Though 4 nodes looks like the
> >maximum node number, the --socket-mem option may relieve the limit a
> >bit.
> >
> >And if we "could" not care about the socket_id being set correctly, we
> >could just simply allocate one file for each hugepage size. That would
> >work well for your container enabling.
>
> This way seems a good option at first sight. Let's compare this new way
> with the original design.
>
> The original design just covers the simplest scenario:
> a. just one hugetlbfs mount (the new way can support multiple
>    hugetlbfs mounts)
> b. does not require root privilege (the new way can achieve this by
>    using the above-mentioned mbind() or set_mempolicy() syscalls)
> c. no sorting (both ways are OK)
> d. performance: from the perspective of virtio for container, we care
>    more about the performance of address translation in the vhost. In
>    the vhost we now adopt an O(n) linear comparison to translate
>    addresses (this can be optimized to O(log n) using a segment tree,
>    or even better using a cache; sorry, that's just another problem),
>    so we should maintain as few files as possible. (The new way can
>    achieve this when used with --socket-mem, --huge-dir.)
> e. numa awareness is not required (and it's complex). (The new way can
>    solve this, though without any promise.)
>
> In all, this new way seems great to me.
>
> Another thing: if we "go through the same process to handle normal huge
> page initialization", my consideration is that RTE_EAL_SINGLE_FILE_SEGMENTS
> goes that way to maximize code reuse, but the new way has little code in
> common with the original ways.
> And mixing these options together leads to bad readability.
>
> What do you think?

Indeed. I've already found that the code is a bit hard to read, due to
the many "#ifdef ... #else ... #endif" blocks, for
RTE_EAL_SINGLE_FILE_SEGMENTS as well as some special archs.

Therefore, I would suggest doing it as below: add another option based
on the SINGLE_FILE_SEGMENTS implementation. I mean, SINGLE_FILE_SEGMENTS
already tries to generate as few files as possible. If we add another
option, say --no-sort (or --no-phys-continuity), we could add just a few
lines of code to let it generate one file for each huge page size (if we
don't consider the numa affinity).

>
> >BTW, since we already have the SINGLE_FILE_SEGMENTS (config) option,
> >adding another option --single-file looks really confusing to me.
> >
> >To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and
> >add another option, say --no-sort (I confess this name sucks, but you
> >get my point). With that, we could make sure to create as few huge
> >page files as possible, to fit your case.
>
> This is great advice. So what do you think of --converged, or
> --no-scattered-mem, or any better idea?

TBH, none of them looks great to me, either. But I have no better
options. Well, --no-phys-continuity looks like the best option to me so
far :)

	--yliu

> Thanks for the valuable input.
>
> Jianfeng
>
> >	--yliu
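
P.S. For illustration only, and not part of the patch under review: a
minimal sketch of the mbind() idea mentioned above, i.e. mapping one
mem-backed file and binding it to a given node without root privilege.
The file path, permissions and the map_hugefile_on_node() helper name
are made up for this example; real EAL code would have to wire this into
internal_config and the memseg setup.

/*
 * Sketch only: map one file-backed region and bind it to a NUMA node.
 * Build with -lnuma for the mbind() wrapper.
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <numaif.h>

static void *
map_hugefile_on_node(const char *path, size_t len, int node)
{
	unsigned long nodemask = 1UL << node;
	void *va;
	int fd;

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, len) < 0) {
		close(fd);
		return NULL;
	}

	/* Reserve the range first; no MAP_POPULATE here, otherwise the
	 * pages get allocated before the policy below is applied. */
	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (va == MAP_FAILED) {
		close(fd);
		return NULL;
	}

	/* Ask the kernel to back this range from the given node ... */
	if (mbind(va, len, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0) != 0) {
		munmap(va, len);
		close(fd);
		return NULL;
	}

	/* ... then fault the pages in so the policy takes effect. */
	memset(va, 0, len);

	/* Keep fd open on purpose, so the file can still be shared with
	 * another process (e.g. a vhost backend). */
	return va;
}

The ordering matters: with MAP_POPULATE (as in the quoted patch), the
pages would already be allocated before mbind() could apply the policy.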