From: "Tan, Jianfeng"
To: Imre Pinter, "users@dpdk.org"
CC: Gabor Halász, Péter Suskovics
Date: Thu, 1 Jun 2017 08:50:35 +0000
Subject: Re: [dpdk-users] Slow DPDK startup with many 1G hugepages

> -----Original Message-----
> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Imre Pinter
> Sent: Thursday, June 1, 2017 3:55 PM
> To: users@dpdk.org
> Cc: Gabor Halász; Péter Suskovics
> Subject: [dpdk-users] Slow DPDK startup with many 1G hugepages
>
> Hi,
>
> We experience slow startup times in DPDK-OVS when backing memory with
> 1G hugepages instead of 2M hugepages.
> Currently we map 2M hugepages as the memory backend for DPDK OVS.
> In the future we would like to allocate this memory from the 1G hugepage
> pool. Currently our deployments have a significant amount of 1G
> hugepages allocated (min. 54G) for VMs and only 2G of memory on 2M
> hugepages.
>
> Typical setup for 2M hugepages:
>
> GRUB:
> hugepagesz=2M hugepages=1024 hugepagesz=1G hugepages=54 default_hugepagesz=1G
>
> $ grep hugetlbfs /proc/mounts
> nodev /mnt/huge_ovs_2M hugetlbfs rw,relatime,pagesize=2M 0 0
> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
>
> Typical setup for 1G hugepages:
>
> GRUB:
> hugepagesz=1G hugepages=56 default_hugepagesz=1G
>
> $ grep hugetlbfs /proc/mounts
> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
>
> DPDK OVS startup times, based on the ovs-vswitchd.log logs:
>
> * 2M (2G memory allocated) - startup time ~3 sec:
>
> 2017-05-03T08:13:50.177Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1
> --huge-dir /mnt/huge_ovs_2M --socket-mem 1024,1024
> 2017-05-03T08:13:50.708Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev:
> Datapath supports recirculation
>
> * 1G (56G memory allocated) - startup time ~13 sec:
>
> 2017-05-03T08:09:22.114Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1
> --huge-dir /mnt/huge_qemu_1G --socket-mem 1024,1024
> 2017-05-03T08:09:32.706Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev:
> Datapath supports recirculation
>
> I used DPDK 16.11 for OVS and testpmd, and tested on Ubuntu 14.04 with
> kernels 3.13.0-117-generic and 4.4.0-78-generic.

You can shorten the time like this:

(1) Mount 1G hugepages into two directories.

nodev /mnt/huge_ovs_1G hugetlbfs rw,relatime,pagesize=1G,size= 0 0
nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0

(2) Force the memory interleave policy:

$ numactl --interleave=all ovs-vswitchd ...

Note: keep the huge-dir and socket-mem options, "--huge-dir /mnt/huge_ovs_1G --socket-mem 1024,1024".

>
> We had a discussion with Mark Gray (from Intel), and he came up with the
> following items:
>
> · The ~10 sec time difference is there with testpmd as well.
>
> · They believe it is kernel overhead (mmap is slow, perhaps it is zeroing
> pages). The following code from eal_memory.c does the above-mentioned
> printout during EAL startup:

Yes, correct. A minimal standalone sketch that reproduces this measurement outside of DPDK follows at the end of this mail.

> /* map the segment, and populate page tables,
>  * the kernel fills this segment with zeros */
> uint64_t start = rte_rdtsc();
> virtaddr = mmap(vma_addr, hugepage_sz, PROT_READ | PROT_WRITE,
>                 MAP_SHARED | MAP_POPULATE, fd, 0);
> if (virtaddr == MAP_FAILED) {
>         RTE_LOG(DEBUG, EAL, "%s(): mmap failed: %s\n", __func__,
>                 strerror(errno));
>         close(fd);
>         return i;
> }
>
> if (orig) {
>         hugepg_tbl[i].orig_va = virtaddr;
>         printf("Original mapping of page %u took: %"PRIu64" ticks, %"PRIu64" ms\n",
>                i, rte_rdtsc() - start,
>                (rte_rdtsc() - start) * 1000 /
>                rte_get_timer_hz());
> }
>
> A solution could be to mount 1G hugepages to two separate directories (2G
> for OVS and the remainder for the VMs), but the NUMA location of these
> hugepages is non-deterministic, since mount cannot handle NUMA-related
> parameters when mounting hugetlbfs, and fstab performs the mounts during
> boot.

Oh, similar idea :-)

>
> Do you have a solution on how to use 1G hugepages for VMs and still have a
> reasonable DPDK EAL startup time?

No, we still don't have such an option.

Thanks,
Jianfeng

> Thanks,
> Imre
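
For reference, here is a minimal standalone sketch (not DPDK code) that reproduces the measurement above: it times a single mmap() with MAP_POPULATE on a hugetlbfs-backed file, which is where the startup time goes. The mount point, file name, and 1G page size are assumptions taken from the setup in this thread; adjust them to your system, and run it as root.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define HUGEPAGE_SZ (1UL << 30)             /* one 1G page, matching pagesize=1G */
#define HUGE_FILE "/mnt/huge_qemu_1G/probe" /* hypothetical file on your hugetlbfs mount */

int main(void)
{
        struct timespec t0, t1;
        void *va;
        int fd;

        fd = open(HUGE_FILE, O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* MAP_POPULATE makes the kernel allocate and zero the whole
         * page right here, instead of on first touch. */
        va = mmap(NULL, HUGEPAGE_SZ, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_POPULATE, fd, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (va == MAP_FAILED) {
                perror("mmap");
                close(fd);
                return 1;
        }

        printf("mmap + populate of one 1G page took %ld ms\n",
               (t1.tv_sec - t0.tv_sec) * 1000 +
               (t1.tv_nsec - t0.tv_nsec) / 1000000);

        munmap(va, HUGEPAGE_SZ);
        close(fd);
        unlink(HUGE_FILE);
        return 0;
}

Build with "gcc -O2 -o hugeprobe hugeprobe.c". Multiplying the per-page time by the page count (56 here) should roughly match the ~10 sec startup difference reported above.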