Date: Thu, 26 Dec 2019 16:45:24 +0100
From: Olivier Matz
To: Bao-Long Tran
Cc: anatoly.burakov@intel.com, arybchenko@solarflare.com, dev@dpdk.org, users@dpdk.org, ricudis@niometrics.com
Message-ID: <20191226154524.GG22738@platinum>
Subject: Re: [dpdk-dev] Inconsistent behavior of mempool with regards to hugepage allocation

Hi Bao-Long,

On Mon, Dec 23, 2019 at 07:09:29PM +0800, Bao-Long Tran wrote:
> Hi,
>
> I'm not sure if this is a bug, but I've seen an inconsistency in the
> behavior of DPDK with regard to hugepage allocation for rte_mempool.
> Basically, for the same mempool size, the number of hugepages allocated
> changes from run to run.
>
> Here's how I reproduce it with DPDK 19.11, IOVA=pa (the default):
>
> 1. Reserve 16x1G hugepages on socket 0
> 2. Replace examples/skeleton/basicfwd.c with the code below, then build and run:
>    make && ./build/basicfwd
> 3. At the same time, watch the number of hugepages allocated:
>    "watch -n.1 ls /dev/hugepages"
> 4. Repeat step 2
>
> If you can reproduce this, you should see that for some runs DPDK
> allocates 5 hugepages, and other times it allocates 6. When it allocates
> 6, if you watch the output from step 3, you will see that DPDK first
> tries to allocate 5 hugepages, then unmaps all 5, retries, and gets 6.

I cannot reproduce under the same conditions as yours (with 16 hugepages
on socket 0), but I think I can see a similar issue:

If I reserve at least 6 hugepages, it seems reproducible (6 hugepages are
used). If I reserve 5 hugepages, it takes more time, taking/releasing
hugepages several times, and it finally succeeds with 5 hugepages.

> For our use case, it's important that DPDK allocate the same number of
> hugepages on every run so we can get reproducible results.

One possibility is to use the --legacy-mem EAL option. It will try to
reserve all hugepages first.
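For example (just a sketch, reusing the patched basicfwd from your steps
above; the core list is arbitrary):

  ./build/basicfwd -l 0 --legacy-mem

In legacy memory mode, EAL maps the reserved hugepages up front at
rte_eal_init() time instead of mapping them on demand, so the number of
mapped pages should not change from run to run.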
> Studying the code, this seems to be the behavior of
> rte_mempool_populate_default(). If I understand correctly, if the first
> try fails to get 5 IOVA-contiguous pages, it retries, relaxing the
> IOVA-contiguous condition, and eventually winds up with 6 hugepages.

No, I think you don't have the IOVA-contiguous constraint in your case.
This is what I see:

a- reserve 5 hugepages on socket 0, and start your patched basicfwd
b- it tries to allocate 2097151 objects of size 2304, pg_size = 1073741824
c- the total element size (with header) is 2304 + 64 = 2368
d- in rte_mempool_op_calc_mem_size_helper(), it calculates
   obj_per_page = 453438 (453438 * 2368 = 1073741184)
   mem_size = 4966058495
   (see the quick check after this list)
e- it tries to allocate 4966058495 bytes, which is less than 5 x 1G, with:
   rte_memzone_reserve_aligned(name, size=4966058495, socket=0,
     mz_flags=RTE_MEMZONE_1GB|RTE_MEMZONE_SIZE_HINT_ONLY, align=64)
   For some reason, it fails: we can see that the number of mapped
   hugepages increases in /dev/hugepages, then returns to its original
   value. I don't think it should fail here.
f- then, it tries to allocate the biggest available contiguous zone. In
   my case, it is 1055291776 bytes (almost all of the single mapped
   hugepage). This is a second problem: if we call it again, it returns
   NULL, because it won't map another hugepage.
g- by luck, calling rte_mempool_populate_virt() allocates a small area
   (the mempool header), and this triggers the mapping of a new hugepage,
   which will be used in the next loop, back at step d with a smaller
   mem_size.
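To double-check the figures in step d, here is a quick back-of-the-envelope
computation. It is only a sketch: it simply packs whole objects onto 1G
pages and is not the exact code of rte_mempool_op_calc_mem_size_helper(),
whose result is a couple of KB larger than the packed size computed here.

  #include <stdio.h>
  #include <stdint.h>
  #include <inttypes.h>

  int main(void)
  {
          uint64_t pg_size = 1073741824;      /* 1G hugepage */
          uint64_t total_elt_sz = 2304 + 64;  /* element + header = 2368 */
          uint64_t n_obj = 2097151;

          /* objects that fit in one page without crossing its boundary */
          uint64_t obj_per_page = pg_size / total_elt_sz;   /* 453438 */
          uint64_t full_pages = n_obj / obj_per_page;       /* 4 */
          uint64_t last_objs = n_obj % obj_per_page;        /* 283399 */
          /* 4 full pages plus a partial last page: 4966056128 bytes */
          uint64_t mem_size = full_pages * pg_size +
                  last_objs * total_elt_sz;

          printf("obj_per_page=%" PRIu64 " mem_size=%" PRIu64 " (~%.2f GB)\n",
                 obj_per_page, mem_size, (double)mem_size / pg_size);
          return 0;
  }

So slightly less than 5 hugepages should indeed be enough to hold all the
objects, which is why the failed reservation in step e is surprising.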
> Questions:
> 1. Why does the API sometimes fail to get IOVA contig mem, when hugepage
>    memory is abundant?

In my case, it looks like we have a bit less than 1G free at the end of
the heap, then we call rte_memzone_reserve_aligned(size=5G). The
allocator ends up mapping 5 pages (and fails), while only 4 are needed.

Anatoly, do you have any idea? Shouldn't we take into account the amount
of free space at the end of the heap when expanding?

> 2. Why does the 2nd retry need N+1 hugepages?

When the first allocation fails, the mempool code tries to allocate in
several chunks which are not virtually contiguous. This is needed in case
the memory is fragmented.

> Some insights for Q1: From my experiments, it seems like the IOVA of the
> first hugepage is not guaranteed to be at the start of the IOVA space
> (understandably). It could explain the retry when the IOVA of the first
> hugepage is near the end of the IOVA space. But I have also seen
> situations where the 1st hugepage is near the beginning of the IOVA
> space and it still failed the 1st time.
>
> Here's the code:
>
> /* Headers needed by this snippet; the original #include targets were
>  * stripped by the mail archive. */
> #include <stdio.h>
> #include <stdlib.h>
> #include <rte_eal.h>
> #include <rte_debug.h>
> #include <rte_mbuf.h>
>
> int
> main(int argc, char *argv[])
> {
>         struct rte_mempool *mbuf_pool;
>         unsigned mbuf_pool_size = 2097151;
>
>         int ret = rte_eal_init(argc, argv);
>         if (ret < 0)
>                 rte_exit(EXIT_FAILURE, "Error with EAL initialization\n");
>
>         printf("Creating mbuf pool size=%u\n", mbuf_pool_size);
>         mbuf_pool = rte_pktmbuf_pool_create("MBUF_POOL", mbuf_pool_size,
>                 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, 0);
>
>         printf("mbuf_pool %p\n", mbuf_pool);
>
>         return 0;
> }
>
> Best regards,
> BL

Regards,
Olivier