From: "Tan, Jianfeng" <jianfeng.tan@intel.com>
To: Imre Pinter, Marco Varlese, users@dpdk.org
Cc: Gabor Halász, Péter Suskovics
Date: Tue, 6 Jun 2017 22:31:55 +0800
Message-ID: <0f24fe8c-9294-9656-7338-1c09e5c83340@intel.com>
References: <1496311928.3871.7.camel@suse.com>
Subject: Re: [dpdk-users] Slow DPDK startup with many 1G hugepages

On 6/6/2017 8:39 PM, Imre Pinter wrote:
> Hi guys,
>
> Thanks for the replies. See my comments inline.
>
> -----Original Message-----
> From: Tan, Jianfeng [mailto:jianfeng.tan@intel.com]
> Sent: 2 June 2017 3:40
> To: Marco Varlese; Imre Pinter; users@dpdk.org
> Cc: Gabor Halász; Péter Suskovics
> Subject: RE: [dpdk-users] Slow DPDK startup with many 1G hugepages
>
>> -----Original Message-----
>> From: Marco Varlese [mailto:marco.varlese@suse.com]
>> Sent: Thursday, June 1, 2017 6:12 PM
>> To: Tan, Jianfeng; Imre Pinter; users@dpdk.org
>> Cc: Gabor Halász; Péter Suskovics
>> Subject: Re: [dpdk-users] Slow DPDK startup with many 1G hugepages
>>
>> On Thu, 2017-06-01 at 08:50 +0000, Tan, Jianfeng wrote:
>>>> -----Original Message-----
>>>> From: users [mailto:users-bounces@dpdk.org] On Behalf Of Imre Pinter
>>>> Sent: Thursday, June 1, 2017 3:55 PM
>>>> To: users@dpdk.org
>>>> Cc: Gabor Halász; Péter Suskovics
>>>> Subject: [dpdk-users] Slow DPDK startup with many 1G hugepages
>>>>
>>>> Hi,
>>>>
>>>> We experience slow startup times in DPDK-OVS when backing memory with
>>>> 1G hugepages instead of 2M hugepages.
>>>> Currently we map 2M hugepages as the memory backend for DPDK OVS.
>>>> In the future we would like to allocate this memory from the 1G hugepage
>>>> pool. Currently in our deployments we have a significant amount of
>>>> 1G hugepages allocated (min. 54G) for VMs and only 2G of memory on 2M
>>>> hugepages.
>>>>
>>>> Typical setup for 2M hugepages:
>>>> GRUB:
>>>> hugepagesz=2M hugepages=1024 hugepagesz=1G hugepages=54 default_hugepagesz=1G
>>>>
>>>> $ grep hugetlbfs /proc/mounts
>>>> nodev /mnt/huge_ovs_2M hugetlbfs rw,relatime,pagesize=2M 0 0
>>>> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
>>>>
>>>> Typical setup for 1G hugepages:
>>>> GRUB:
>>>> hugepagesz=1G hugepages=56 default_hugepagesz=1G
>>>>
>>>> $ grep hugetlbfs /proc/mounts
>>>> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
>>>>
>>>> DPDK OVS startup times based on the ovs-vswitchd.log logs:
>>>>
>>>> * 2M (2G memory allocated) - startup time ~3 sec:
>>>> 2017-05-03T08:13:50.177Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1 --huge-dir /mnt/huge_ovs_2M --socket-mem 1024,1024
>>>> 2017-05-03T08:13:50.708Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
>>>>
>>>> * 1G (56G memory allocated) - startup time ~13 sec:
>>>> 2017-05-03T08:09:22.114Z|00009|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 0x1 --huge-dir /mnt/huge_qemu_1G --socket-mem 1024,1024
>>>> 2017-05-03T08:09:32.706Z|00010|ofproto_dpif|INFO|netdev@ovs-netdev: Datapath supports recirculation
>>>>
>>>> I used DPDK 16.11 for OVS and testpmd, and tested on Ubuntu 14.04
>>>> with kernels 3.13.0-117-generic and 4.4.0-78-generic.
>>>
>>> You can shorten the time by doing this:
>>>
>>> (1) Mount 1G hugepages into two directories.
>>> nodev /mnt/huge_ovs_1G hugetlbfs rw,relatime,pagesize=1G,size=<how much you want to use in OVS> 0 0
>>> nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G 0 0
>>
>> I understood (reading Imre) that this does not really work because of
>> non-deterministic allocation of hugepages in a NUMA architecture,
>> e.g. we would end up (potentially) using hugepages allocated on
>> different nodes even when accessing the OVS directory.
>> Did I understand this correctly?
>
> Did you try step 2? And Sergio also gave more options in another email in this thread for your reference.
>
> Thanks,
> Jianfeng
>
> @Jianfeng: Step (1) will not help in our case, because 'mount' will not
> allocate hugepages from NUMA1 while the system still has free hugepages
> on NUMA0.
> I have 56G of hugepages allocated with 1G page size. This means 28G of
> hugepages available on each NUMA node. If the mounts are performed via
> fstab, then we'll end up in one of the following scenarios at random.
>
> First mount for OVS, then for VMs:
> +-------------------------------+-------------------------------+
> |             NUMA0             |             NUMA1             |
> +-------------------------------+-------------------------------+
> |   OVS (2G)    |   VMs (26G)   |           VMs (28G)           |
> +-------------------------------+-------------------------------+
>
> First mount for VMs, then OVS:
> +-------------------------------+-------------------------------+
> |             NUMA0             |             NUMA1             |
> +-------------------------------+-------------------------------+
> |           VMs (28G)           |   VMs (26G)   |   OVS (2G)    |
> +-------------------------------+-------------------------------+

This is why I suggested step 2, to allocate memory in an interleaved way. Did you try that?

Thanks,
Jianfeng

> @Marco: After the hugepages were allocated, the ones in the OVS directory
> were either from NUMA0 or NUMA1, but not from both (a different setup can
> come up after a reboot). This caused an error on DPDK startup, since 1G of
> hugepages was requested from each NUMA node, but no hugepages had been
> allocated on one of the nodes.
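A quick way to see how the 1G pool actually got split between the nodes after mounting is to read the per-node counters in sysfs (a minimal check, assuming two NUMA nodes and the standard sysfs hugepage layout for 1G pages):

$ # nr_hugepages (total) and free_hugepages per node for the 1G pool
$ grep . /sys/devices/system/node/node*/hugepages/hugepages-1048576kB/*_hugepages

Comparing the free counts before and after mounting and starting OVS shows which node the pages in each hugetlbfs mount were taken from.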
>>> (2) Force use of the memory interleave policy:
>>> $ numactl --interleave=all ovs-vswitchd ...
>>>
>>> Note: keep the huge-dir and socket-mem options, "--huge-dir /mnt/huge_ovs_1G --socket-mem 1024,1024".
>
> @Jianfeng: If I perform Step (1), then Step (2) 'numactl --interleave=all
> ovs-vswitchd ...' cannot help, because all the hugepages mounted to the OVS
> directory will be from one of the NUMA nodes. The DPDK application requires
> 1G of hugepages from each of the NUMA nodes, so DPDK returns an error.
> I have also tried without Step (1), and we still see the slower startup.
> Currently I'm looking into Sergio's mail.
>
> Br,
> Imre
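For completeness, the two suggestions discussed in this thread, combined into one sketch. The size=2G cap on the OVS mount is an assumption chosen to match --socket-mem 1024,1024, and the exact way the EAL options reach ovs-vswitchd depends on the OVS version (they may be set via ovsdb other_config rather than the command line shown in the EAL ARGS log above); adjust paths and sizes to the actual deployment.

# /etc/fstab (step 1): cap the OVS mount so the VM mount cannot drain its pages
nodev /mnt/huge_ovs_1G  hugetlbfs rw,relatime,pagesize=1G,size=2G 0 0
nodev /mnt/huge_qemu_1G hugetlbfs rw,relatime,pagesize=1G         0 0

# Step 2: start OVS under an interleaved memory policy, keeping the EAL options
$ numactl --interleave=all ovs-vswitchd -c 0x1 --huge-dir /mnt/huge_ovs_1G --socket-mem 1024,1024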