From: László Vadkerti
To: Bruce Richardson, Neil Horman, Matt Laswell
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] A question about hugepage initialization time
Date: Wed, 10 Dec 2014 19:16:59 +0000
In-Reply-To: <20141210143558.GB1632@bricha3-MOBL3>
References: <20141209141032.5fa2db0d@urahara> <20141210103225.GA10056@bricha3-MOBL3> <20141210142926.GA17040@localhost.localdomain> <20141210143558.GB1632@bricha3-MOBL3>

well, here it is :)

On Wed, 10 Dec 2014, Bruce Richardson wrote:
> On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
>> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
>>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
>>>> On Tue, 9 Dec 2014 11:45:07 -0800
>>>> &rew wrote:
>>>>
>>>>>> Hey Folks,
>>>>>>
>>>>>> Our DPDK application deals with very large in-memory data
>>>>>> structures, and can potentially use tens or even hundreds of
>>>>>> gigabytes of hugepage memory. During the course of development,
>>>>>> we've noticed that as the number of huge pages increases, the
>>>>>> memory initialization time during EAL init gets to be quite
>>>>>> long, lasting several minutes at present. The growth in init
>>>>>> time doesn't appear to be linear, which is concerning.
>>>>>>
>>>>>> This is a minor inconvenience for us and our customers, as
>>>>>> memory initialization makes our boot times a lot longer than
>>>>>> they would otherwise be. Also, my experience has been that
>>>>>> really long operations are often hiding errors - what you think
>>>>>> is merely a slow operation is actually a timeout of some sort,
>>>>>> often due to misconfiguration. This leads to two questions:
>>>>>>
>>>>>> 1. Does the long initialization time suggest that there's an
>>>>>> error happening under the covers?
>>>>>> 2. If not, is there any simple way that we can shorten memory
>>>>>> initialization time?
>>>>>>
>>>>>> Thanks in advance for your insights.
>>>>>>
>>>>>> --
>>>>>> Matt Laswell
>>>>>> laswell@infiniteio.com
>>>>>> infinite io, inc.
>>>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> please find some quick comments on the questions:
>>>>> 1.) In our experience, long initialization time is normal with a
>>>>> large amount of memory. However, this time depends on a few
>>>>> things:
>>>>> - the number of hugepages (a page fault handled by the kernel is
>>>>> pretty expensive)
>>>>> - the size of the hugepages (the memset at initialization)
>>>>>
>>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
>>>>> time significantly. Using wmemset instead of memset adds an
>>>>> additional 20-30% boost by our measurements. Or, by just touching
>>>>> the pages but not clearing them, you can get still more speedup.
>>>>> But in this case your layer or the applications above need to do
>>>>> the cleanup at allocation time (e.g. by using rte_zmalloc).
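>>>>>
>>>>> A minimal sketch of the page-touching idea (a hypothetical
>>>>> helper, not the actual EAL code): write one byte per hugepage so
>>>>> the kernel faults each page in now, and leave the zeroing to
>>>>> allocation time:
>>>>>
>>>>> #include <stddef.h>
>>>>>
>>>>> /* Fault in every hugepage in [base, base + len) by touching one
>>>>>  * byte per page. Pages are left dirty, so whatever allocates out
>>>>>  * of this memory must zero it itself (e.g. rte_zmalloc()). */
>>>>> static void
>>>>> touch_hugepages(void *base, size_t len, size_t page_sz)
>>>>> {
>>>>>         volatile char *p = base;
>>>>>         size_t off;
>>>>>
>>>>>         for (off = 0; off < len; off += page_sz)
>>>>>                 p[off] = 0;
>>>>> }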
>>>>>
>>>>> Cheers,
>>>>> &rew
>>>>
>>>> I wonder if the whole rte_malloc code is even worth it with a
>>>> modern kernel with transparent huge pages? rte_malloc adds very
>>>> little value, and is less safe and slower than glibc or other
>>>> allocators. Plus you lose the ability to get all the benefit out
>>>> of valgrind or electric fence.
>>>
>>> While I'd dearly love to not have our own custom malloc lib to
>>> maintain, for DPDK multiprocess, rte_malloc will be hard to
>>> replace, as we would need a replacement solution that similarly
>>> guarantees that memory mapped in process A is also available at
>>> the same address in process B. :-(
>>>
>> Just out of curiosity, why even bother with multiprocess support?
>> What you're talking about above is a multithread model, and you're
>> shoehorning multiple processes into it.
>> Neil
>>
>
> Yep, that's pretty much what it is, alright. However, this
> multiprocess support is very widely used by our customers in
> building their applications, and has been in place and supported
> since some of the earliest DPDK releases. If it is to be removed, it
> needs to be replaced by something that provides equivalent
> capabilities to application writers (perhaps something with more
> fine-grained sharing etc.)
>
> /Bruce
>

It is probably time to start discussing how to pull in the
multi-process and memory management improvements we talked about in
our DPDK Summit presentation:
https://www.youtube.com/watch?v=907VShi799k#t=647

A multi-process model can have several benefits, mostly in the high
availability area (a telco requirement), due to better separation:
controlling permissions (per-process RO or RW page mappings), single
process restartability, improved startup and core dumping times, etc.

As a summary of our memory management additions: they allow an
application to describe its memory model in a configuration (or via
an API). For example, a simplified config would say that every
instance needs 4GB of private memory and 2GB of shared memory. In a
multi-process model this results in mapping only 6GB of memory in
each process, instead of the current DPDK model where the 4GB
per-process private memory is mapped into all other processes,
resulting in unnecessary mappings, e.g. 16x4GB + 2GB in every
process.

What we've chosen is to use DPDK's NUMA-aware allocator for this
purpose: the above example with 16 instances results in allocating 17
DPDK NUMA sockets (1 default shared + 16 private), and we can
selectively map a given "NUMA socket" (set of memsegs) into a
process.
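To make that concrete, a rough sketch of what the description could
look like (all names here are made up for illustration; this is not
the proposed API):

/* Hypothetical memory model: one shared region plus one private
 * region per instance, each backed by its own "NUMA socket". */
struct mem_model {
        unsigned shared_gb;   /* mapped into every process: 2GB */
        unsigned private_gb;  /* mapped only into its owner: 4GB */
        unsigned instances;   /* 16 -> 1 shared + 16 private sockets */
};

static const struct mem_model model = {
        .shared_gb  = 2,
        .private_gb = 4,
        .instances  = 16,
};

/* Hypothetical: instance N maps only the shared socket plus its own
 * private socket, i.e. 2GB + 4GB = 6GB per process, instead of
 * 16x4GB + 2GB with today's map-everything-everywhere model. */
int map_instance_memory(const struct mem_model *m, unsigned instance_id);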
This also opens up many other possibilities to play with, e.g.:
- clearing the full private memory if a process dies, including the
  memzones on it
- pop-up memory support
etc. etc.

Another option could be to use page-aligned memzones and control the
mapping/permissions at the memzone level.

/Laszlo