Date: Thu, 11 Dec 2014 10:14:49 +0000
From: Bruce Richardson
To: László Vadkerti
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] A question about hugepage initialization time
Message-ID: <20141211101449.GB5668@bricha3-MOBL3>
References: <20141209141032.5fa2db0d@urahara>
 <20141210103225.GA10056@bricha3-MOBL3>
 <20141210142926.GA17040@localhost.localdomain>
 <20141210143558.GB1632@bricha3-MOBL3>

On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> well, this :)
>
> On Wed, 10 Dec 2014, Bruce Richardson wrote:
>
> > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> >>>> On Tue, 9 Dec 2014 11:45:07 -0800
> >>>> &rew wrote:
> >>>>
> >>>>>> Hey Folks,
> >>>>>>
> >>>>>> Our DPDK application deals with very large in-memory data
> >>>>>> structures, and can potentially use tens or even hundreds of
> >>>>>> gigabytes of hugepage memory. During the course of development,
> >>>>>> we've noticed that as the number of huge pages increases, the
> >>>>>> memory initialization time during EAL init gets to be quite long,
> >>>>>> lasting several minutes at present. The growth in init time
> >>>>>> doesn't appear to be linear, which is concerning.
> >>>>>>
> >>>>>> This is a minor inconvenience for us and our customers, as memory
> >>>>>> initialization makes our boot times a lot longer than they would
> >>>>>> otherwise be. Also, my experience has been that really long
> >>>>>> operations are often hiding errors - what you think is merely a
> >>>>>> slow operation is actually a timeout of some sort, often due to
> >>>>>> misconfiguration. This leads to two questions:
> >>>>>>
> >>>>>> 1. Does the long initialization time suggest that there's an
> >>>>>> error happening under the covers?
> >>>>>> 2. If not, is there any simple way that we can shorten memory
> >>>>>> initialization time?
> >>>>>>
> >>>>>> Thanks in advance for your insights.
> >>>>>>
> >>>>>> --
> >>>>>> Matt Laswell
> >>>>>> laswell@infiniteio.com
> >>>>>> infinite io, inc.
> >>>>>>
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> please find some quick comments on the questions:
> >>>>> 1.) In our experience, long initialization time is normal in the
> >>>>> case of large amounts of memory.
> >>>>> However, this time depends on several factors:
> >>>>> - the number of hugepages (page faults handled by the kernel are
> >>>>> pretty expensive)
> >>>>> - the size of the hugepages (memset at initialization)
> >>>>>
> >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> >>>>> time significantly. Using wmemset instead of memset adds an
> >>>>> additional 20-30% boost by our measurements. Or, by just touching
> >>>>> the pages but not cleaning them you can get still more speedup.
> >>>>> But in this case your layer or the applications above need to do
> >>>>> the cleanup at allocation time (e.g. by using rte_zmalloc).
> >>>>>
> >>>>> Cheers,
> >>>>> &rew
> >>>>
> >>>> I wonder if the whole rte_malloc code is even worth it with a
> >>>> modern kernel with transparent huge pages? rte_malloc adds very
> >>>> little value and is less safe and slower than glibc or other
> >>>> allocators. Plus you lose the ability to get all the benefit out
> >>>> of valgrind or electric fence.
> >>>
> >>> While I'd dearly love to not have our own custom malloc lib to
> >>> maintain, for DPDK multiprocess, rte_malloc will be hard to replace,
> >>> as we would need a replacement solution that similarly guarantees
> >>> that memory mapped in process A is also available at the same
> >>> address in process B. :-(
> >>>
> >> Just out of curiosity, why even bother with multiprocess support?
> >> What you're talking about above is a multithread model, and you're
> >> shoehorning multiple processes into it.
> >> Neil
> >>
> >
> > Yep, that's pretty much what it is alright. However, this multiprocess
> > support is very widely used by our customers in building their
> > applications, and has been in place and supported since some of the
> > earliest DPDK releases. If it is to be removed, it needs to be
> > replaced by something that provides equivalent capabilities to
> > application writers (perhaps something with more fine-grained sharing
> > etc.)
> >
> > /Bruce
> >
>
> It is probably time to start discussing how to pull in our multi-process
> and memory management improvements we were talking about in our
> DPDK Summit presentation:
> https://www.youtube.com/watch?v=907VShi799k#t=647
>
> A multi-process model could have several benefits, mostly in the high
> availability area (a telco requirement), due to better separation,
> control over permissions (per-process RO or RW page mappings), single
> process restartability, improved startup and core dump times, etc.
>
> As a summary of our memory management additions: they allow an
> application to describe its memory model in a configuration (or via an
> API). For example, a simplified config would say that every instance
> needs 4GB of private memory and 2GB of shared memory. In a multi-process
> model this results in mapping only 6GB of memory in each process,
> instead of the current DPDK model where the 4GB per-process private
> memory is mapped into all other processes as well, resulting in
> unnecessary mappings, e.g. 16x4GB + 2GB in every process.
>
> What we've chosen is to use DPDK's NUMA-aware allocator for this
> purpose: the above example with 16 instances results in allocating
> 17 DPDK NUMA sockets (1 default shared + 16 private), and we can
> selectively map a given "NUMA socket" (set of memsegs) into a process.
> This also opens up many other possibilities to play with, e.g.
> - clearing the full private memory if a process dies, including the
> memzones on it
> - pop-up memory support
> etc. etc.
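
As a rough, purely hypothetical illustration of the memory model described
above (this is not the API from the presentation, just one way the
"1 default shared + 16 private" layout could be expressed in C):

/*
 * Hypothetical sketch only -- not an existing DPDK API. It mirrors the
 * memory model described above: one shared region plus one private
 * region per instance, each backed by its own pseudo "NUMA socket"
 * (set of memsegs), so a process maps only the regions it needs.
 */
#include <stdint.h>

#define NUM_INSTANCES 16
#define GB (1ULL << 30)

struct mem_region_cfg {
	const char *name;   /* label for the region */
	uint64_t size;      /* bytes to reserve */
	int socket_id;      /* pseudo NUMA socket backing the region */
	int shared;         /* 1 = mapped into every process */
};

/* 1 default shared region + 16 private regions => 17 pseudo sockets. */
static struct mem_region_cfg mem_model[NUM_INSTANCES + 1];

static void build_mem_model(void)
{
	int i;

	mem_model[0] = (struct mem_region_cfg){ "shared", 2 * GB, 0, 1 };
	for (i = 0; i < NUM_INSTANCES; i++)
		mem_model[i + 1] = (struct mem_region_cfg){
			"private", 4 * GB, i + 1, 0
		};
	/*
	 * A process running instance i would then map mem_model[0] and
	 * mem_model[i + 1] only: 6GB per process instead of the current
	 * 16 x 4GB + 2GB mapped into every process.
	 */
}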
>
> Another option could be to use page-aligned memzones and control the
> mapping/permissions on a memzone level.
>
> /Laszlo

Those enhancements sound really, really good. Do you have code for these
that you can share, so that we can start looking at it with a view to
pulling it in?

/Bruce
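
For illustration, the page-aligned memzone idea above can be sketched with
nothing more than mmap() and mprotect(); the example below uses anonymous
shared memory as a stand-in for a hugepage-backed memzone and is not code
from the patches discussed in this thread.

/*
 * Minimal standalone sketch of the per-memzone permission idea: if a
 * memzone is page aligned and sized in whole pages, its protection can
 * be changed per mapping, e.g. read-only in consumer processes while
 * the owning process keeps it read-write.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = 4 * pgsz;	/* page-aligned, page-sized "memzone" */

	void *zone = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (zone == MAP_FAILED)
		return 1;

	memset(zone, 0, len);	/* the owning process initializes it */

	/* Because the zone covers whole pages, this mapping can now be
	 * downgraded to read-only without affecting other regions. */
	if (mprotect(zone, len, PROT_READ) != 0)
		return 1;

	printf("memzone at %p now mapped read-only\n", zone);
	munmap(zone, len);
	return 0;
}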