Date: Thu, 9 Jan 2020 14:32:34 +0100
From: Olivier Matz
To: "Burakov, Anatoly"
Cc: Bao-Long Tran, arybchenko@solarflare.com, dev@dpdk.org,
 users@dpdk.org, ricudis@niometrics.com
Subject: Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation
Message-ID: <20200109133234.GH22738@platinum>
In-Reply-To: <737005e1-643a-0690-49e4-b050a1cd14bb@intel.com>

On Tue, Jan 07, 2020 at 01:06:01PM +0000, Burakov, Anatoly wrote:
> On 27-Dec-19 11:11 AM, Olivier Matz wrote:
> > Hi Bao-Long,
> > 
> > On Fri, Dec 27, 2019 at 06:05:57PM +0800, Bao-Long Tran wrote:
> > > Hi Olivier,
> > > 
> > > > On 27 Dec 2019, at 4:11 PM, Olivier Matz wrote:
> > > > 
> > > > On Thu, Dec 26, 2019 at 04:45:24PM +0100, Olivier Matz wrote:
> > > > > Hi Bao-Long,
> > > > > 
> > > > > On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I'm not sure if this is a bug, but I've seen an inconsistency in the
> > > > > > behavior of DPDK with regards to hugepage allocation for rte_mempool.
> > > > > > Basically, for the same mempool size, the number of hugepages
> > > > > > allocated changes from run to run.
> > > > > > 
> > > > > > Here's how I reproduce it with DPDK 19.11, IOVA=pa (the default):
> > > > > > 
> > > > > > 1. Reserve 16x 1G hugepages on socket 0
> > > > > > 2. Replace examples/skeleton/basicfwd.c with the code below, then
> > > > > >    build and run: make && ./build/basicfwd
> > > > > > 3. At the same time, watch the number of hugepages allocated:
> > > > > >    watch -n.1 ls /dev/hugepages
> > > > > > 4. Repeat step 2
> > > > > > 
> > > > > > If you can reproduce, you should see that for some runs DPDK
> > > > > > allocates 5 hugepages, other times it allocates 6. When it allocates
> > > > > > 6, if you watch the output from step 3, you will see that DPDK first
> > > > > > tries to allocate 5 hugepages, then unmaps all 5, retries, and gets 6.
> > > > > 
> > > > > I cannot reproduce under the same conditions as yours (with 16
> > > > > hugepages on socket 0), but I think I can see a similar issue:
> > > > > 
> > > > > If I reserve at least 6 hugepages, it seems reproducible (6 hugepages
> > > > > are used). If I reserve 5 hugepages, it takes more time,
> > > > > taking/releasing hugepages several times, and it finally succeeds
> > > > > with 5 hugepages.
> > > 
> > > My apologies: I just checked again, and I was using DPDK 19.05, not
> > > 19.11 or master. Let me try to see if I can repro my issue with 19.11.
> > > Sorry for the confusion.
> > > 
> > > I also saw your patch to reduce wasted memory (eba11e). It seems like it
> > > resolves the problem with the IOVA-contig constraint that I described in
> > > my first message. I'll look into it to confirm.
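
As a side note on observing this: besides watching /dev/hugepages, the
heap growth can be printed from inside the application. Something like
this (untested sketch, assuming the rte_malloc_get_socket_stats() API;
call it after rte_eal_init(), before and after the pool creation):

#include <stdio.h>
#include <rte_malloc.h>

/* print how much memory EAL currently has mapped on a socket */
static void
dump_heap_size(int socket)
{
        struct rte_malloc_socket_stats stats;

        if (rte_malloc_get_socket_stats(socket, &stats) == 0)
                printf("socket %d: heap total %zu bytes, free %zu bytes\n",
                       socket, stats.heap_totalsz_bytes,
                       stats.heap_freesz_bytes);
}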
> > > 
> > > If I cannot repro my issue (different number of hugepages) with 19.11,
> > > from our side we can upgrade to 19.11 and that's all we need for now.
> > > But let me also try to repro the issue you described (multiple attempts
> > > to allocate hugepages).
> > 
> > OK, thanks.
> > 
> > Anyway, I think there is an issue on 19.11, and it is even worse with
> > 2M hugepages. Let's say we reserve 500x 2M hugepages and try to
> > allocate a mempool of 5G:
> > 
> > 1/ mempool_populate tries to allocate in one virtually contiguous
> >    block, which maps all 500 hugepages, then fails, unmapping them
> > 2/ it tries to allocate the largest zone, which returns ~2MB
> > 3/ this zone is added to the mempool, and for that it allocates a
> >    mem_header struct, which triggers the mapping of a new page
> > 4/ back to 1... until it finally fails, after 3 minutes
> > 
> > The memzone allocation of the "largest available area" does not have
> > the same semantics depending on the memory model (pre-mapped hugepages
> > or not). When using dynamic hugepage mapping, it won't map any
> > additional hugepage.
> > 
> > To solve the issue, we could either change it to allocate all available
> > hugepages, or change mempool populate to not use the "largest available
> > area" allocation and do the search ourselves.
> 
> Yep, this is one of the things that is currently an unsolved problem in
> the allocator. I am not sure that any one behavior is "more correct"
> than the other, so I don't think allocating "all available" hugepages is
> more correct than not doing it.
> 
> Besides, there's no reliable way to get the "biggest" chunk of memory,
> because while you might get *some* memory from 2M pages, there's no
> guarantee that the amount you may get from 1G pages isn't bigger. So we
> either momentarily take over the user's entire memory and figure out
> what we need and what we don't, or we use the first available page size
> and hope that that's enough.
> 
> That said, there's an internal API to allocate "up to X" pages, so in
> principle we could build this kind of infrastructure.

I tried to solve the issue in mempool, without using the
memzone_alloc(size=0) feature. See https://patches.dpdk.org/patch/64370/

> > > > > > For our use case, it's important that DPDK allocates the same
> > > > > > number of hugepages on every run so we can get reproducible
> > > > > > results.
> > > > > 
> > > > > One possibility is to use the --legacy-mem EAL option. It will try
> > > > > to reserve all hugepages first.
> > > > 
> > > > Passing --socket-mem=5120,0 also does the job.
> > > 
> > > > > > Studying the code, this seems to be the behavior of
> > > > > > rte_mempool_populate_default(). If I understand correctly, if the
> > > > > > first try fails to get 5 IOVA-contiguous pages, it retries,
> > > > > > relaxing the IOVA-contiguous condition, and eventually winds up
> > > > > > with 6 hugepages.
> > > > > 
> > > > > No, I think you don't have the IOVA-contiguous constraint in your
> > > > > case.
> > > > > This is what I see:
> > > > > 
> > > > > a- reserve 5 hugepages on socket 0, and start your patched basicfwd
> > > > > b- it tries to allocate 2097151 objects of size 2304,
> > > > >    pg_size = 1073741824
> > > > > c- the total element size (with header) is 2304 + 64 = 2368
> > > > > d- in rte_mempool_op_calc_mem_size_helper(), it calculates
> > > > >    obj_per_page = 453438 (453438 * 2368 = 1073741184)
> > > > >    mem_size = 4966058495
> > > > > e- it tries to allocate 4966058495 bytes, which is less than
> > > > >    5 x 1G, with:
> > > > >    rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
> > > > >      mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY, align=64)
> > > > >    For some reason, it fails: we can see that the number of mapped
> > > > >    hugepages increases in /dev/hugepages, then returns to its
> > > > >    original value. I don't think it should fail here.
> > > > > f- then, it tries to allocate the biggest available contiguous
> > > > >    zone. In my case, it is 1055291776 bytes (almost all of the
> > > > >    single mapped hugepage). This is a second problem: if we call it
> > > > >    again, it returns NULL, because it won't map another hugepage.
> > > > > g- by luck, calling rte_mempool_populate_virt() allocates a small
> > > > >    area (the mempool header), and this triggers the mapping of a
> > > > >    new hugepage that will be used in the next loop, back at step d
> > > > >    with a smaller mem_size.
> > > > > 
> > > > > > Questions:
> > > > > > 1. Why does the API sometimes fail to get IOVA-contiguous memory,
> > > > > >    when hugepage memory is abundant?
> > > > > 
> > > > > In my case, it looks like we have a bit less than 1G free at the
> > > > > end of the heap when we call rte_memzone_reserve_aligned(size=5G).
> > > > > The allocator ends up mapping 5 pages (and fails), while only 4 are
> > > > > needed.
> > > > > 
> > > > > Anatoly, do you have any idea? Shouldn't we take into account the
> > > > > amount of free space at the end of the heap when expanding?
> > > > > 
> > > > > > 2. Why does the 2nd retry need N+1 hugepages?
> > > > > 
> > > > > When the first alloc fails, the mempool code tries to allocate in
> > > > > several chunks which are not virtually contiguous. This is needed
> > > > > in case the memory is fragmented.
> > > > > 
> > > > > > Some insights for Q1: from my experiments, it seems that the IOVA
> > > > > > of the first hugepage is not guaranteed to be at the start of the
> > > > > > IOVA space (understandably). It could explain the retry when the
> > > > > > IOVA of the first hugepage is near the end of the IOVA space. But
> > > > > > I have also seen situations where the 1st hugepage is near the
> > > > > > beginning of the IOVA space and it still failed the 1st time.
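
By the way, the numbers in step d can be checked outside DPDK. Here is
an untested sketch of the size computation as I understand it (full
pages, plus the partially filled last page, plus a worst-case
misalignment slack of total_elt_sz - 1; the constants are taken from
the trace above):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
        uint64_t obj_num = 2097151;          /* objects requested */
        uint64_t total_elt_sz = 2304 + 64;   /* element + header = 2368 */
        uint64_t pg_sz = 1073741824;         /* 1G hugepage */

        uint64_t obj_per_page = pg_sz / total_elt_sz;        /* 453438 */
        uint64_t objs_in_last_page = obj_num % obj_per_page; /* 283399 */
        uint64_t mem_size =
                (obj_num - objs_in_last_page) / obj_per_page * pg_sz +
                objs_in_last_page * total_elt_sz +
                total_elt_sz - 1;

        /* prints obj_per_page=453438 mem_size=4966058495 */
        printf("obj_per_page=%" PRIu64 " mem_size=%" PRIu64 "\n",
               obj_per_page, mem_size);
        return 0;
}

This gives mem_size = 4966058495, matching the trace: a bit more than
4.6G, which indeed fits in 5 x 1G pages.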
> > > > > > 
> > > > > > Here's the code:
> > > > > > 
> > > > > > /* Note: the #include targets were stripped by the mail archive;
> > > > > >  * these are the headers the snippet likely needs to build. */
> > > > > > #include <stdio.h>
> > > > > > #include <stdlib.h>
> > > > > > #include <rte_eal.h>
> > > > > > #include <rte_debug.h>
> > > > > > #include <rte_mbuf.h>
> > > > > > 
> > > > > > int
> > > > > > main(int argc, char *argv[])
> > > > > > {
> > > > > >         struct rte_mempool *mbuf_pool;
> > > > > >         unsigned mbuf_pool_size = 2097151;
> > > > > > 
> > > > > >         int ret = rte_eal_init(argc, argv);
> > > > > >         if (ret < 0)
> > > > > >                 rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
> > > > > > 
> > > > > >         printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
> > > > > >         mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
> > > > > >                 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
> > > > > > 
> > > > > >         printf("mbuf_pool %p\n", mbuf_pool);
> > > > > > 
> > > > > >         return 0;
> > > > > > }
> > > > > > 
> > > > > > Best regards,
> > > > > > BL
> > > > > 
> > > > > Regards,
> > > > > Olivier
> > > 
> > > Thanks,
> > > BL
> > 
> -- 
> Thanks,
> Anatoly