From: "Tan, Jianfeng"
To: Yuanhan Liu
Cc: nakajima.yoshihiro@lab.ntt.co.jp, mst@redhat.com, dev@dpdk.org,
 p.fedin@samsung.com, ann.zhuangyanying@huawei.com
Subject: Re: [dpdk-dev] [PATCH v2 1/5] mem: add --single-file to create single mem-backed file
Date: Tue, 8 Mar 2016 09:55:10 +0800
Message-ID: <56DE30FE.7020809@intel.com>
In-Reply-To: <20160307131322.GH14300@yliu-dev.sh.intel.com>
References: <1446748276-132087-1-git-send-email-jianfeng.tan@intel.com>
 <1454671228-33284-1-git-send-email-jianfeng.tan@intel.com>
 <1454671228-33284-2-git-send-email-jianfeng.tan@intel.com>
 <20160307131322.GH14300@yliu-dev.sh.intel.com>

Hi Yuanhan,

On 3/7/2016 9:13 PM, Yuanhan Liu wrote:
> CC'ed the EAL hugepage maintainer, which is something you should do when
> sending a patch.

Thanks for doing this.

>
> On Fri, Feb 05, 2016 at 07:20:24PM +0800, Jianfeng Tan wrote:
>> Originally, there are two drawbacks in using hugepages: a. it needs root
>> privilege to touch /proc/self/pagemap, which is a prerequisite for
>> allocating physically contiguous memsegs; b. possibly too many
>> hugepage files are created, especially when used with 2M hugepages.
>>
>> Virtual devices do not care about the physical contiguity
>> of the allocated hugepages at all. The --single-file option
>> provides a way to allocate all hugepages in a single mem-backed
>> file.
>>
>> Known issues:
>> a. The single-file option relies on the kernel to allocate NUMA-affinitive
>> memory.
>> b. Possible ABI break: originally, --no-huge uses anonymous memory
>> instead of a file-backed mapping.
>>
>> Signed-off-by: Huawei Xie
>> Signed-off-by: Jianfeng Tan
> ...
>> @@ -956,6 +961,16 @@ eal_check_common_options(struct internal_config *internal_cfg)
>>  			"be specified together with --"OPT_NO_HUGE"\n");
>>  		return -1;
>>  	}
>> +	if (internal_cfg->single_file && internal_cfg->force_sockets == 1) {
>> +		RTE_LOG(ERR, EAL, "Option --"OPT_SINGLE_FILE" cannot "
>> +			"be specified together with --"OPT_SOCKET_MEM"\n");
>> +		return -1;
>> +	}
>> +	if (internal_cfg->single_file && internal_cfg->hugepage_unlink) {
>> +		RTE_LOG(ERR, EAL, "Option --"OPT_HUGE_UNLINK" cannot "
>> +			"be specified together with --"OPT_SINGLE_FILE"\n");
>> +		return -1;
>> +	}
> The two limitations don't make sense to me.

For the force_sockets option: my original thought for the --single-file
option was that we do not sort those pages (which requires
root/cap_sys_admin) and do not even look up NUMA information, because the
single file may contain memory from both sockets.
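
(Just to illustrate why the sorting needs root: resolving the physical frame
behind each virtual page means reading /proc/self/pagemap, roughly as in the
sketch below. This is only a simplified illustration, not the exact EAL code;
the helper name virt2phys and the stripped-down error handling are made up
for this mail.)

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

static uint64_t
virt2phys(const void *virt)
{
	int fd;
	uint64_t entry, pfn;
	long pagesz = sysconf(_SC_PAGESIZE);
	off_t ofs = ((uintptr_t)virt / pagesz) * sizeof(uint64_t);

	/* Reading the PFN bits needs CAP_SYS_ADMIN on recent kernels;
	 * otherwise they are reported as zero. */
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return 0;
	if (pread(fd, &entry, sizeof(entry), ofs) != (ssize_t)sizeof(entry)) {
		close(fd);
		return 0;
	}
	close(fd);

	if (!(entry & (1ULL << 63)))		/* bit 63: page present */
		return 0;
	pfn = entry & ((1ULL << 55) - 1);	/* bits 0-54: page frame number */
	return pfn * pagesz + ((uintptr_t)virt % pagesz);
}
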
For the hugepage_unlink option: those hugepage files get closed at the end of
memory initialization; if we additionally unlink them, we cannot share them
with other processes (say, the backend).

>
>> diff --git a/lib/librte_eal/linuxapp/eal/eal_memory.c b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> index 6008533..68ef49a 100644
>> --- a/lib/librte_eal/linuxapp/eal/eal_memory.c
>> +++ b/lib/librte_eal/linuxapp/eal/eal_memory.c
>> @@ -1102,20 +1102,54 @@ rte_eal_hugepage_init(void)
>>  	/* get pointer to global configuration */
>>  	mcfg = rte_eal_get_configuration()->mem_config;
>>
>> -	/* hugetlbfs can be disabled */
>> -	if (internal_config.no_hugetlbfs) {
>> -		addr = mmap(NULL, internal_config.memory, PROT_READ | PROT_WRITE,
>> -			MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
>> +	/* when hugetlbfs is disabled or single-file option is specified */
>> +	if (internal_config.no_hugetlbfs || internal_config.single_file) {
>> +		int fd;
>> +		uint64_t pagesize;
>> +		unsigned socket_id = rte_socket_id();
>> +		char filepath[MAX_HUGEPAGE_PATH];
>> +
>> +		if (internal_config.no_hugetlbfs) {
>> +			eal_get_hugefile_path(filepath, sizeof(filepath),
>> +				"/dev/shm", 0);
>> +			pagesize = RTE_PGSIZE_4K;
>> +		} else {
>> +			struct hugepage_info *hpi;
>> +
>> +			hpi = &internal_config.hugepage_info[0];
>> +			eal_get_hugefile_path(filepath, sizeof(filepath),
>> +				hpi->hugedir, 0);
>> +			pagesize = hpi->hugepage_sz;
>> +		}
>> +		fd = open(filepath, O_CREAT | O_RDWR, S_IRUSR | S_IWUSR);
>> +		if (fd < 0) {
>> +			RTE_LOG(ERR, EAL, "%s: open %s failed: %s\n",
>> +				__func__, filepath, strerror(errno));
>> +			return -1;
>> +		}
>> +
>> +		if (ftruncate(fd, internal_config.memory) < 0) {
>> +			RTE_LOG(ERR, EAL, "ftruncate %s failed: %s\n",
>> +				filepath, strerror(errno));
>> +			return -1;
>> +		}
>> +
>> +		addr = mmap(NULL, internal_config.memory,
>> +			PROT_READ | PROT_WRITE,
>> +			MAP_SHARED | MAP_POPULATE, fd, 0);
>>  		if (addr == MAP_FAILED) {
>> -			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n", __func__,
>> -				strerror(errno));
>> +			RTE_LOG(ERR, EAL, "%s: mmap() failed: %s\n",
>> +				__func__, strerror(errno));
>>  			return -1;
>>  		}
>>  		mcfg->memseg[0].phys_addr = (phys_addr_t)(uintptr_t)addr;
>>  		mcfg->memseg[0].addr = addr;
>> -		mcfg->memseg[0].hugepage_sz = RTE_PGSIZE_4K;
>> +		mcfg->memseg[0].hugepage_sz = pagesize;
>>  		mcfg->memseg[0].len = internal_config.memory;
>> -		mcfg->memseg[0].socket_id = 0;
>> +		mcfg->memseg[0].socket_id = socket_id;
> I saw quite a few issues:
>
> - Assume I have a system with two hugepage sizes: 1G (x4) and 2M (x512),
>   mounted at /dev/hugepages and /mnt, respectively.
>
>   Here we then get 5G of internal_config.memory, and your code will
>   try to mmap 5G on the first mount point (/dev/hugepages) due to the
>   hardcoded logic in your code:
>
>       hpi = &internal_config.hugepage_info[0];
>       eal_get_hugefile_path(filepath, sizeof(filepath),
>               hpi->hugedir, 0);
>
>   But it has 4G in total; therefore, it will fail.

As mentioned above, this case is not covered by the original design of
--single-file.

>
> - As you stated, socket_id is hardcoded, which could be wrong.

We rely on the OS to allocate hugepages, and cannot promise that the physical
hugepages in the big hugepage file all come from the same socket.

>
> - As stated above, the option limitations don't seem right to me.
>
>   I mean, --single-file should be able to work with the --socket-mem option
>   semantically.

If we'd like --single-file to work well with the --socket-mem option, we need
to use syscalls like set_mempolicy() and mbind(). That would be a much bigger
change than the current one; I don't know if that's acceptable?
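
To be a bit more concrete, what I have in mind is roughly something like the
untested sketch below, using libnuma's mbind() (link with -lnuma). The helper
name map_on_socket and the simplified error handling are made up for this
mail; the real change would live in eal_memory.c.

#include <numaif.h>		/* mbind(), MPOL_BIND */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static void *
map_on_socket(int fd, size_t len, unsigned socket_id)
{
	unsigned long nodemask = 1UL << socket_id;	/* assumes socket_id < 64 */
	void *addr;

	/* No MAP_POPULATE here: pages must not be faulted in before
	 * the memory policy is applied. */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return NULL;

	/* Bind this mapping to the requested node so that --socket-mem
	 * amounts could be honoured. */
	if (mbind(addr, len, MPOL_BIND, &nodemask,
		  8 * sizeof(nodemask), 0) < 0) {
		munmap(addr, len);
		return NULL;
	}

	/* Touch the pages now so they are actually allocated on the
	 * bound node (what MAP_POPULATE would otherwise have done). */
	memset(addr, 0, len);
	return addr;
}

One mapping like this per requested socket would replace the single hardcoded
mmap() in the current patch.
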
>
>
> And I have been thinking how to deal with those issues properly, and a
> __very immature__ solution came to my mind (which could simply not work),
> but anyway, here it is, FYI: we go through the same process used to
> handle normal hugepage initialization for the --single-file option as well,
> but take different actions, or no action at all, at some stages when
> that option is given, which is a bit similar to the way
> RTE_EAL_SINGLE_FILE_SEGMENTS is handled.
>
> And we create one hugepage file for each node and each page size. For a
> system like mine above (2 nodes), it may populate files like the following:
>
> - 1G x 2 on node0
> - 1G x 2 on node1
> - 2M x 256 on node0
> - 2M x 256 on node1
>
> That could normally fit your case. Though 4 nodes looks like the maximum
> node number, the --socket-mem option may relieve the limit a bit.
>
> And if we "could" not care about socket_id being set correctly, we could
> just simply allocate one file for each hugepage size. That would work
> well for your container enabling.

This way seems a good option at first sight. Let's compare this new way with
the original design.

The original design just covers the simplest scenario:
a. just one hugetlbfs (the new way can support multiple hugetlbfs mounts)
b. does not require root privilege (the new way can achieve this by using the
above-mentioned mbind() or set_mempolicy() syscalls)
c. no sorting (both ways are OK)
d. performance: from the perspective of virtio for container, what we care
most about is the performance of address translation in vhost. In vhost we
currently adopt an O(n) linear comparison to translate addresses (this could
be optimized to O(log n) using a segment tree, or even better with a cache,
but that is another problem), so we should maintain as few files as possible.
(The new way can achieve this when used with --socket-mem and --huge-dir.)
See the small sketch at the end of this mail for what I mean by the linear
translation.
e. NUMA awareness is not required (and it's complex). (The new way can solve
this, though without a hard guarantee.)

In all, this new way seems great to me.

Another thing: if we "go through the same process to handle normal hugepage
initialization", my concern is that RTE_EAL_SINGLE_FILE_SEGMENTS goes that way
to maximize code reuse, but the new way shares little code with the original
path, and mixing these options together hurts readability. What do you think?

>
> BTW, since we already have the SINGLE_FILE_SEGMENTS (config) option, adding
> another option --single-file looks really confusing to me.
>
> To me, maybe you could base it on the SINGLE_FILE_SEGMENTS option, and add
> another option, say --no-sort (I confess this name sucks, but you get
> my point). With that, we could make sure to create as few hugepage
> files as possible, to fit your case.

This is great advice. So what do you think of --converged, or
--no-scattered-mem, or any better idea?

Thanks for the valuable input.

Jianfeng

>
> --yliu
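
P.S. About point d above: the O(n) translation I mentioned is basically a
linear walk over the guest memory regions, roughly as sketched below. This is
a simplified illustration, not the exact vhost code; struct mem_region and
gpa_to_vva here are just names for this mail. Each mmap'ed hugepage file ends
up as one region, which is why fewer files means a shorter walk.

#include <stddef.h>
#include <stdint.h>

struct mem_region {
	uint64_t guest_phys_addr;	/* start of region in guest physical space */
	uint64_t size;			/* region length in bytes */
	uint64_t host_user_addr;	/* where the region is mapped in vhost */
};

/* Translate a guest physical address to a vhost virtual address. */
static uint64_t
gpa_to_vva(const struct mem_region *regions, uint32_t nregions, uint64_t gpa)
{
	uint32_t i;

	for (i = 0; i < nregions; i++) {	/* O(n) in number of regions */
		const struct mem_region *r = &regions[i];

		if (gpa >= r->guest_phys_addr &&
		    gpa < r->guest_phys_addr + r->size)
			return gpa - r->guest_phys_addr + r->host_user_addr;
	}
	return 0;	/* not found */
}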