From: Marco Varlese
To: "Tan, Jianfeng", Imre Pinter, users@dpdk.org
Cc: Gabor Halász, Péter Suskovics
Date: Thu, 01 Jun 2017 12:12:08 +0200
Subject: Re: [dpdk-users] Slow DPDK startup with many 1G hugepages

On Thu, 2017-06-01 at 08:50 +0000, Tan, Jianfeng wrote:
> > -----Original Message-----
> > From: users [mailto:users-bounces@dpdk.org] On Behalf Of Imre Pinter
> > Sent: Thursday, June 1, 2017 3:55 PM
> > To: users@dpdk.org
> > Cc: Gabor Halász; Péter Suskovics
> > Subject: [dpdk-users] Slow DPDK startup with many 1G hugepages
> > 
> > Hi,
> > 
> > We experience slow startup times in DPDK-OVS when backing memory with
> > 1G hugepages instead of 2M hugepages.
> > Currently we map 2M hugepages as the memory backend for DPDK OVS.
> > In the future we would like to allocate this memory from the 1G hugepage
> > pool. Currently our deployments have a significant amount of 1G hugepages
> > allocated (min. 54G) for the VMs and only 2G of memory on 2M hugepages.
> > 
> > Typical setup for 2M hugepages:
> > GRUB:
> > hugepagesz=2M hugepages=1024 hugepagesz=1G hugepages=54 default_hugepagesz=1G
> > 
> > $ grep hugetlbfs /proc/mounts
> > nodev /mnt/huge_ovs_2M hugetlbfs rw,relatime,pagesize=2M 0 0
> > nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
> > 
> > Typical setup for 1G hugepages:
> > GRUB:
> > hugepagesz=1G hugepages=56 default_hugepagesz=1G
> > 
> > $ grep hugetlbfs /proc/mounts
> > nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
> > 
> > DPDK OVS startup times based on the ovs-vswitchd.log logs:
> > 
> >   *   2M (2G memory allocated) - startup time ~3 sec:
> > 2017-05-03T08:13:50.177Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1 --huge-dir /mnt/huge_ovs_2M --socket-mem 1024,1024
> > 2017-05-03T08:13:50.708Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
> > 
> >   *   1G (56G memory allocated) - startup time ~13 sec:
> > 2017-05-03T08:09:22.114Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1 --huge-dir /mnt/huge_qemu_1G --socket-mem 1024,1024
> > 2017-05-03T08:09:32.706Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
> > 
> > I used DPDK 16.11 for OVS and testpmd, and tested on Ubuntu 14.04 with
> > kernels 3.13.0-117-generic and 4.4.0-78-generic.
> 
> You can shorten the time like this:
> 
> (1) Mount the 1G hugepages into two directories.
> nodev /mnt/huge_ovs_1G hugetlbfs rw,relatime,pagesize=1G,size=<how much memory you want to use in OVS> 0 0
> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0

I understood (reading Imre) that this does not really work because of the
non-deterministic allocation of hugepages on a NUMA architecture, e.g. we
could (potentially) end up using hugepages allocated on different nodes even
when accessing only the OVS directory.

Did I understand this correctly?
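In case it helps: a quick way to check where the pages behind such a mount
actually land could be something like the sketch below. It is untested; it
maps a single page from the OVS directory with the same flags the EAL uses
and asks the kernel which node backs it via get_mempolicy() from libnuma's
<numaif.h>. The /mnt/huge_ovs_1G/probe file name is only an example.

/* Untested sketch: report which NUMA node backs a page from a hugetlbfs mount.
 * Build: gcc -o hugenode hugenode.c -lnuma
 * (the "probe" file name and the mount point are examples only) */
#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_1G (1UL << 30)

int main(void)
{
        const char *path = "/mnt/huge_ovs_1G/probe";
        int fd = open(path, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* MAP_POPULATE, as in the EAL, so the page is actually faulted in */
        void *va = mmap(NULL, PAGE_1G, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_POPULATE, fd, 0);
        if (va == MAP_FAILED) { perror("mmap"); return 1; }

        int node = -1;
        /* MPOL_F_NODE | MPOL_F_ADDR: return the node holding the page at 'va' */
        if (get_mempolicy(&node, NULL, 0, va, MPOL_F_NODE | MPOL_F_ADDR) < 0) {
                perror("get_mempolicy");
                return 1;
        }
        printf("page from %s is on NUMA node %d\n", path, node);

        munmap(va, PAGE_1G);
        close(fd);
        unlink(path);
        return 0;
}

Running it a few times (also after the VMs have taken their pages) should show
whether the pages handed out through that mount stay on one node or not.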
> (2) Force the memory interleave policy:
> $ numactl --interleave=all ovs-vswitchd ...
> 
> Note: keep the huge-dir and socket-mem options, "--huge-dir /mnt/huge_ovs_1G --socket-mem 1024,1024".
> 
> > We had a discussion with Mark Gray (from Intel), and he came up with the
> > following items:
> > 
> >   *   The ~10 sec time difference is there with testpmd as well.
> >   *   They believe it is kernel overhead (mmap is slow, perhaps it is
> >       zeroing the pages). The following code from eal_memory.c produces
> >       the above mentioned printout during EAL startup:
> 
> Yes, correct.
> 
> > 	/* map the segment, and populate page tables,
> > 	 * the kernel fills this segment with zeros */
> > 	uint64_t start = rte_rdtsc();
> > 	virtaddr = mmap(vma_addr, hugepage_sz, PROT_READ | PROT_WRITE,
> > 			MAP_SHARED | MAP_POPULATE, fd, 0);
> > 	if (virtaddr == MAP_FAILED) {
> > 		RTE_LOG(DEBUG, EAL, "%s(): mmap failed: %s\n", __func__,
> > 				strerror(errno));
> > 		close(fd);
> > 		return i;
> > 	}
> > 
> > 	if (orig) {
> > 		hugepg_tbl[i].orig_va = virtaddr;
> > 		printf("Original mapping of page %u took: %"PRIu64" ticks, %"PRIu64" ms\n",
> > 			i, rte_rdtsc() - start,
> > 			(rte_rdtsc() - start) * 1000 / rte_get_timer_hz());
> > 	}
> > 
> > A solution could be to mount 1G hugepages on two separate directories: 2G
> > for OVS and the remaining pages for the VMs. However, the NUMA location of
> > these hugepages is non-deterministic, since mount cannot handle NUMA-related
> > parameters when mounting hugetlbfs, and fstab performs the mounts during boot.
> 
> Oh, similar idea :-)
> 
> > Do you have a solution on how to use 1G hugepages for the VMs and still
> > have a reasonable DPDK EAL startup time?
> 
> No, we still don't have such an option.
> 
> Thanks,
> Jianfeng
> 
> > Thanks,
> > Imre
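One more thought on the mmap()/zeroing point above: the kernel-side cost
should be reproducible outside of OVS/DPDK with a small standalone test along
the lines of the (untested) sketch below; the mount point and the page count
are placeholders only.

/* Untested sketch: time how long it takes to map 1G hugepages with
 * MAP_POPULATE (as the EAL does), outside of OVS/DPDK.
 * Usage: ./maptime [hugetlbfs-dir] [npages] */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define PAGE_1G (1UL << 30)

static double now_ms(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

int main(int argc, char **argv)
{
        const char *dir = argc > 1 ? argv[1] : "/mnt/huge_qemu_1G";
        int npages = argc > 2 ? atoi(argv[2]) : 2;
        char path[256];

        for (int i = 0; i < npages; i++) {
                snprintf(path, sizeof(path), "%s/map_test_%d", dir, i);
                int fd = open(path, O_CREAT | O_RDWR, 0600);
                if (fd < 0) { perror("open"); return 1; }

                double t0 = now_ms();
                /* MAP_POPULATE makes the kernel fault in (and zero) the whole page now */
                void *va = mmap(NULL, PAGE_1G, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_POPULATE, fd, 0);
                double t1 = now_ms();
                if (va == MAP_FAILED) { perror("mmap"); return 1; }

                printf("page %d: %.1f ms\n", i, t1 - t0);

                munmap(va, PAGE_1G);
                close(fd);
                unlink(path);
        }
        return 0;
}

Comparing runs with and without MAP_POPULATE should confirm whether the time
really goes into populating (zeroing) the pages at map time, which is what the
EAL printout above suggests.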