Date: Thu, 11 Dec 2014 10:14:49 +0000
From: Bruce Richardson
To: László Vadkerti
Cc: "dev@dpdk.org"
Subject: Re: [dpdk-dev] A question about hugepage initialization time
Message-ID: <20141211101449.GB5668@bricha3-MOBL3>
References: <20141209141032.5fa2db0d@urahara>
 <20141210103225.GA10056@bricha3-MOBL3>
 <20141210142926.GA17040@localhost.localdomain>
 <20141210143558.GB1632@bricha3-MOBL3>

On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> well, this :)
>
> On Wed, 10 Dec 2014, Bruce Richardson wrote:
>
> > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> >>>> On Tue, 9 Dec 2014 11:45:07 -0800
> >>>> &rew wrote:
> >>>>
> >>>>>> Hey Folks,
> >>>>>>
> >>>>>> Our DPDK application deals with very large in-memory data
> >>>>>> structures, and can potentially use tens or even hundreds of
> >>>>>> gigabytes of hugepage memory. During the course of development,
> >>>>>> we've noticed that as the number of huge pages increases, the
> >>>>>> memory initialization time during EAL init gets to be quite long,
> >>>>>> lasting several minutes at present. The growth in init time
> >>>>>> doesn't appear to be linear, which is concerning.
> >>>>>>
> >>>>>> This is a minor inconvenience for us and our customers, as memory
> >>>>>> initialization makes our boot times a lot longer than they would
> >>>>>> otherwise be. Also, my experience has been that really long
> >>>>>> operations are often hiding errors - what you think is merely a
> >>>>>> slow operation is actually a timeout of some sort, often due to
> >>>>>> misconfiguration. This leads to two questions:
> >>>>>>
> >>>>>> 1. Does the long initialization time suggest that there's an
> >>>>>> error happening under the covers?
> >>>>>> 2. If not, is there any simple way that we can shorten memory
> >>>>>> initialization time?
> >>>>>>
> >>>>>> Thanks in advance for your insights.
> >>>>>>
> >>>>>> --
> >>>>>> Matt Laswell
> >>>>>> laswell@infiniteio.com
> >>>>>> infinite io, inc.
> >>>>>>
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> please find some quick comments on the questions:
> >>>>> 1.) In our experience, long initialization time is normal in the
> >>>>> case of large amounts of memory.
> >>>>> However, this time depends on several factors:
> >>>>> - the number of hugepages (page faults handled by the kernel are
> >>>>> pretty expensive)
> >>>>> - the size of the hugepages (memset at initialization)
> >>>>>
> >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> >>>>> time significantly. Using wmemset instead of memset adds an
> >>>>> additional 20-30% boost by our measurements. Or, by just touching
> >>>>> the pages but not cleaning them you can get still more speedup.
> >>>>> But in this case your layer or the applications above need to do
> >>>>> the cleanup at allocation time (e.g. by using rte_zmalloc).
> >>>>>
> >>>>> Cheers,
> >>>>> &rew
> >>>>
> >>>> I wonder if the whole rte_malloc code is even worth it with a
> >>>> modern kernel with transparent huge pages? rte_malloc adds very
> >>>> little value and is less safe and slower than glibc or other
> >>>> allocators. Plus you lose the ability to get all the benefit out
> >>>> of valgrind or electric fence.
> >>>
> >>> While I'd dearly love to not have our own custom malloc lib to
> >>> maintain, for DPDK multiprocess, rte_malloc will be hard to replace,
> >>> as we would need a replacement solution that similarly guarantees
> >>> that memory mapped in process A is also available at the same
> >>> address in process B. :-(
> >>>
> >> Just out of curiosity, why even bother with multiprocess support?
> >> What you're talking about above is a multithread model, and you're
> >> shoehorning multiple processes into it.
> >> Neil
> >>
> >
> > Yep, that's pretty much what it is alright. However, this multiprocess
> > support is very widely used by our customers in building their
> > applications, and has been in place and supported since some of the
> > earliest DPDK releases. If it is to be removed, it needs to be
> > replaced by something that provides equivalent capabilities to
> > application writers (perhaps something with more fine-grained sharing
> > etc.)
> >
> > /Bruce
> >
>
> It is probably time to start discussing how to pull in our multi-process
> and memory management improvements we were talking about in our
> DPDK Summit presentation:
> https://www.youtube.com/watch?v=907VShi799k#t=647
>
> A multi-process model could have several benefits, mostly in the high
> availability area (a telco requirement), due to better separation,
> control over permissions (per-process RO or RW page mappings), single
> process restartability, improved startup and core dump times, etc.
>
> As a summary of our memory management additions: they allow an
> application to describe its memory model in a configuration (or via an
> API). For example, a simplified config would say that every instance
> needs 4GB of private memory and 2GB of shared memory. In a multi-process
> model this results in mapping only 6GB of memory in each process,
> instead of the current DPDK model where the 4GB per-process private
> memory is mapped into all other processes as well, resulting in
> unnecessary mappings, e.g. 16x4GB + 2GB in every process.
>
> What we've chosen is to use DPDK's NUMA-aware allocator for this
> purpose: the above example with 16 instances results in allocating
> 17 DPDK NUMA sockets (1 default shared + 16 private), and we can
> selectively map a given "NUMA socket" (set of memsegs) into a process.
> This also opens up many other possibilities to play with, e.g.
> - clearing the full private memory if a process dies, including the
> memzones on it
> - pop-up memory support
> etc. etc.
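
As a rough, purely hypothetical illustration of the memory model described
above (this is not the API from the presentation, just one way the
"1 default shared + 16 private" layout could be expressed in C):

/*
 * Hypothetical sketch only -- not an existing DPDK API. It mirrors the
 * memory model described above: one shared region plus one private
 * region per instance, each backed by its own pseudo "NUMA socket"
 * (set of memsegs), so a process maps only the regions it needs.
 */
#include <stdint.h>

#define NUM_INSTANCES 16
#define GB (1ULL << 30)

struct mem_region_cfg {
	const char *name;   /* label for the region */
	uint64_t size;      /* bytes to reserve */
	int socket_id;      /* pseudo NUMA socket backing the region */
	int shared;         /* 1 = mapped into every process */
};

/* 1 default shared region + 16 private regions => 17 pseudo sockets. */
static struct mem_region_cfg mem_model[NUM_INSTANCES + 1];

static void build_mem_model(void)
{
	int i;

	mem_model[0] = (struct mem_region_cfg){ "shared", 2 * GB, 0, 1 };
	for (i = 0; i < NUM_INSTANCES; i++)
		mem_model[i + 1] = (struct mem_region_cfg){
			"private", 4 * GB, i + 1, 0
		};
	/*
	 * A process running instance i would then map mem_model[0] and
	 * mem_model[i + 1] only: 6GB per process instead of the current
	 * 16 x 4GB + 2GB mapped into every process.
	 */
}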
>
> Another option could be to use page-aligned memzones and control the
> mapping/permissions on a memzone level.
>
> /Laszlo

Those enhancements sound really, really good. Do you have code for these
that you can share, so that we can start looking at it with a view to
pulling it in?

/Bruce
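
For illustration, the page-aligned memzone idea above can be sketched with
nothing more than mmap() and mprotect(); the example below uses anonymous
shared memory as a stand-in for a hugepage-backed memzone and is not code
from the patches discussed in this thread.

/*
 * Minimal standalone sketch of the per-memzone permission idea: if a
 * memzone is page aligned and sized in whole pages, its protection can
 * be changed per mapping, e.g. read-only in consumer processes while
 * the owning process keeps it read-write.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = 4 * pgsz;	/* page-aligned, page-sized "memzone" */

	void *zone = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (zone == MAP_FAILED)
		return 1;

	memset(zone, 0, len);	/* the owning process initializes it */

	/* Because the zone covers whole pages, this mapping can now be
	 * downgraded to read-only without affecting other regions. */
	if (mprotect(zone, len, PROT_READ) != 0)
		return 1;

	printf("memzone at %p now mapped read-only\n", zone);
	munmap(zone, len);
	return 0;
}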