From: Sergio Gonzalez Monroy
To: Imre Pinter, users@dpdk.org
Cc: Gabor Halász, Péter Suskovics
Subject: Re: [dpdk-users] Slow DPDK startup with many 1G hugepages
Date: Thu, 1 Jun 2017 10:02:57 +0100

On 01/06/2017 08:55, Imre Pinter wrote:
> Hi,
>
> We experience slow startup times with DPDK-OVS when backing memory with 1G hugepages instead of 2M hugepages.
> Currently we map 2M hugepages as the memory backend for DPDK OVS. In the future we would like to allocate this memory from the 1G hugepage pool. In our current deployments we have a significant amount of 1G hugepages allocated (min. 54G) for the VMs and only 2G of memory on 2M hugepages.
>
> Typical setup for 2M hugepages:
> GRUB:
> hugepagesz=2M hugepages=1024 hugepagesz=1G hugepages=54 default_hugepagesz=1G
>
> $ grep hugetlbfs /proc/mounts
> nodev /mnt/huge_ovs_2M hugetlbfs rw,relatime,pagesize=2M 0 0
> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
>
> Typical setup for 1G hugepages:
> GRUB:
> hugepagesz=1G hugepages=56 default_hugepagesz=1G
>
> $ grep hugetlbfs /proc/mounts
> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
>
> DPDK OVS startup times based on ovs-vswitchd.log:
>
> * 2M (2G memory allocated) - startup time ~3 sec:
> 2017-05-03T08:13:50.177Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1 --huge-dir /mnt/huge_ovs_2M --socket-mem 1024,1024
> 2017-05-03T08:13:50.708Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
>
> * 1G (56G memory allocated) - startup time ~13 sec:
> 2017-05-03T08:09:22.114Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1 --huge-dir /mnt/huge_qemu_1G --socket-mem 1024,1024
> 2017-05-03T08:09:32.706Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
>
> I used DPDK 16.11 for OVS and testpmd, and tested on Ubuntu 14.04 with kernels 3.13.0-117-generic and 4.4.0-78-generic.
>
> We had a discussion with Mark Gray (from Intel), and he came up with the following items:
>
> * The ~10 sec time difference is there with testpmd as well.
> * They believe it is kernel overhead (mmap is slow, perhaps it is zeroing the pages).
> The following code from eal_memory.c, instrumented with a timing printout, measures the mapping of each page during EAL startup:
>
> /* map the segment, and populate page tables,
>  * the kernel fills this segment with zeros */
> uint64_t start = rte_rdtsc();
> virtaddr = mmap(vma_addr, hugepage_sz, PROT_READ | PROT_WRITE,
>                 MAP_SHARED | MAP_POPULATE, fd, 0);
> if (virtaddr == MAP_FAILED) {
>         RTE_LOG(DEBUG, EAL, "%s(): mmap failed: %s\n", __func__,
>                 strerror(errno));
>         close(fd);
>         return i;
> }
>
> if (orig) {
>         hugepg_tbl[i].orig_va = virtaddr;
>         printf("Original mapping of page %u took: %"PRIu64" ticks, %"PRIu64" ms\n",
>                 i, rte_rdtsc() - start,
>                 (rte_rdtsc() - start) * 1000 /
>                 rte_get_timer_hz());
> }
>
> A solution could be to mount 1G hugepages on two separate directories: 2G for OVS and the remainder for the VMs. However, the NUMA location of these hugepages is non-deterministic, since mount cannot handle NUMA-related parameters when mounting hugetlbfs, and the fstab mounts are performed during boot.
>
> Do you have a solution on how to use 1G hugepages for the VMs and still have a reasonable DPDK EAL startup time?

In theory, one solution would be to use cgroups, as described here:
http://dpdk.org/ml/archives/dev/2017-February/057742.html
http://dpdk.org/ml/archives/dev/2017-April/063442.html
and then use the 'numactl --interleave' policy.

I say "in theory" because it does not seem to work as one would expect, so the patch proposed in the above threads would be a solution, forcing allocation from a specific NUMA node for each page.

Thanks,
Sergio

> Thanks,
> Imre
>
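
The mmap(MAP_POPULATE) cost discussed above can be reproduced outside of DPDK with a small standalone test along these lines. This is only a sketch: the mount point /mnt/huge_qemu_1G comes from the setup above, the test file name is made up, and it assumes at least one free 1G hugepage on the system.

/* Sketch: time how long the kernel takes to populate (and zero) a single
 * 1G hugepage via mmap(MAP_POPULATE), mirroring what EAL does per page. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define HUGEPAGE_SZ (1ULL << 30)   /* 1G */

int main(void)
{
        /* Hypothetical test file on the hugetlbfs mount. */
        const char *path = "/mnt/huge_qemu_1G/map_time_test";
        struct timespec t0, t1;

        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return EXIT_FAILURE;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* MAP_POPULATE makes the kernel fault in and zero the page now,
         * which is where the per-page startup cost shows up. */
        void *va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (va == MAP_FAILED) {
                perror("mmap");
                close(fd);
                unlink(path);
                return EXIT_FAILURE;
        }

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("mmap(MAP_POPULATE) of one 1G page took %.1f ms\n", ms);

        munmap(va, HUGEPAGE_SZ);
        close(fd);
        unlink(path);
        return EXIT_SUCCESS;
}

Multiplying the per-page time by the 56 allocated pages gives a rough idea of where the extra ~10 sec reported in the logs above goes.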
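
And one way to force a single hugepage onto a chosen NUMA node, in the spirit of the patch Sergio refers to: map the page without populating it, bind the range to the node with mbind(), then touch it so it is faulted in from that node. Again a sketch only: the file name and node number are illustrative, and it needs the libnuma headers (build with -lnuma).

/* Sketch: steer one 1G hugepage to a chosen NUMA node by binding the
 * mapping with mbind() before the page is faulted in. */
#include <fcntl.h>
#include <numaif.h>        /* mbind(), MPOL_PREFERRED */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SZ (1ULL << 30)   /* 1G */

static void *map_hugepage_on_node(const char *path, int node)
{
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0)
                return MAP_FAILED;

        /* Reserve the virtual range first, without populating it. */
        void *va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (va == MAP_FAILED) {
                close(fd);
                return MAP_FAILED;
        }

        /* Prefer the chosen node for this range ... */
        unsigned long nodemask = 1UL << node;
        if (mbind(va, HUGEPAGE_SZ, MPOL_PREFERRED, &nodemask,
                  sizeof(nodemask) * 8, 0) != 0)
                perror("mbind");

        /* ... then touch the page so it is allocated (and zeroed) now,
         * from that node if it has free 1G pages. */
        *(volatile char *)va = 0;

        close(fd);
        return va;
}

int main(void)
{
        const char *path = "/mnt/huge_qemu_1G/numa_test_page";
        void *va = map_hugepage_on_node(path, 0);
        if (va == MAP_FAILED) {
                perror("map_hugepage_on_node");
                return EXIT_FAILURE;
        }
        printf("Mapped one 1G hugepage, preferred node 0, at %p\n", va);
        munmap(va, HUGEPAGE_SZ);
        unlink(path);
        return EXIT_SUCCESS;
}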