From: Thomas Monjalon
To: László Vadkerti
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] A question about hugepage initialization time
Date: Fri, 12 Dec 2014 16:50:33 +0100
Message-ID: <2123951.k0dJfZKBPF@xps13>
In-Reply-To: <20141212095940.GA2100@bricha3-MOBL3>
Organization: 6WIND
List-Id: patches and discussions about DPDK

2014-12-12 09:59, Bruce Richardson:
> On Fri, Dec 12, 2014 at 04:07:40AM +0000, László Vadkerti wrote:
> > On Thu, 11 Dec 2014, Bruce Richardson wrote:
> > > On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> > > >
> > > > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> > > >
> > > > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > > > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > > > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > > > >>>> On Tue, 9 Dec 2014 11:45:07 -0800 &rew wrote:
> > > > >>>>
> > > > >>>>>> Hey Folks,
> > > > >>>>>>
> > > > >>>>>> Our DPDK application deals with very large in-memory data
> > > > >>>>>> structures, and can potentially use tens or even hundreds of
> > > > >>>>>> gigabytes of hugepage memory.
> > > > >>>>>> During the course of development, we've noticed that as the
> > > > >>>>>> number of huge pages increases, the memory initialization time
> > > > >>>>>> during EAL init gets to be quite long, lasting several minutes
> > > > >>>>>> at present. The growth in init time doesn't appear to be
> > > > >>>>>> linear, which is concerning.
> > > > >>>>>>
> > > > >>>>>> This is a minor inconvenience for us and our customers, as
> > > > >>>>>> memory initialization makes our boot times a lot longer than
> > > > >>>>>> they would otherwise be. Also, my experience has been that
> > > > >>>>>> really long operations are often hiding errors: what you think
> > > > >>>>>> is merely a slow operation is actually a timeout of some sort,
> > > > >>>>>> often due to misconfiguration. This leads to two questions:
> > > > >>>>>>
> > > > >>>>>> 1. Does the long initialization time suggest that there's an
> > > > >>>>>> error happening under the covers?
> > > > >>>>>> 2. If not, is there any simple way that we can shorten memory
> > > > >>>>>> initialization time?
> > > > >>>>>>
> > > > >>>>>> Thanks in advance for your insights.
> > > > >>>>>>
> > > > >>>>>> --
> > > > >>>>>> Matt Laswell
> > > > >>>>>> laswell@infiniteio.com
> > > > >>>>>> infinite io, inc.
> > > > >>>>>
> > > > >>>>> Hello,
> > > > >>>>>
> > > > >>>>> Please find some quick comments on the questions:
> > > > >>>>> 1.) In our experience, long initialization time is normal with
> > > > >>>>> a large amount of memory. However, this time depends on a few
> > > > >>>>> things:
> > > > >>>>> - the number of hugepages (page faults handled by the kernel
> > > > >>>>> are pretty expensive)
> > > > >>>>> - the size of the hugepages (memset at initialization)
> > > > >>>>>
> > > > >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> > > > >>>>> time significantly. Using wmemset instead of memset adds an
> > > > >>>>> additional 20-30% boost by our measurements.
> > > > >>>>> Or, by just touching the pages but not clearing them, you can
> > > > >>>>> get still more speedup. But in this case your layer or the
> > > > >>>>> applications above need to do the cleanup at allocation time
> > > > >>>>> (e.g. by using rte_zmalloc).
> > > > >>>>>
> > > > >>>>> Cheers,
> > > > >>>>> &rew
> > > > >>>>
> > > > >>>> I wonder if the whole rte_malloc code is even worth it with a
> > > > >>>> modern kernel with transparent huge pages? rte_malloc adds very
> > > > >>>> little value and is less safe and slower than glibc or other
> > > > >>>> allocators. Plus you lose the ability to get all the benefit out
> > > > >>>> of valgrind or electric fence.
> > > > >>>
> > > > >>> While I'd dearly love to not have our own custom malloc lib to
> > > > >>> maintain, for DPDK multiprocess, rte_malloc will be hard to
> > > > >>> replace, as we would need a replacement solution that similarly
> > > > >>> guarantees that memory mapped in process A is also available at
> > > > >>> the same address in process B. :-(
> > > > >>
> > > > >> Just out of curiosity, why even bother with multiprocess support?
> > > > >> What you're talking about above is a multithread model, and you're
> > > > >> shoehorning multiple processes into it.
> > > > >> Neil
> > > > >
> > > > > Yep, that's pretty much what it is alright. However, this
> > > > > multiprocess support is very widely used by our customers in
> > > > > building their applications, and has been in place and supported
> > > > > since some of the earliest DPDK releases. If it is to be removed,
> > > > > it needs to be replaced by something that provides equivalent
> > > > > capabilities to application writers (perhaps something with more
> > > > > fine-grained sharing etc.)
> > > > > /Bruce
> > > >
> > > > It is probably time to start discussing how to pull in our
> > > > multiprocess and memory management improvements that we were talking
> > > > about in our DPDK Summit presentation:
> > > > https://www.youtube.com/watch?v=907VShi799k#t=647
> > > >
> > > > A multiprocess model could have several benefits, mostly in the
> > > > high-availability area (a telco requirement), due to better
> > > > separation, control over permissions (per-process RO or RW page
> > > > mappings), single-process restartability, improved startup and core
> > > > dumping time, etc.
> > > >
> > > > As a summary of our memory management additions: they allow an
> > > > application to describe its memory model in a configuration (or via
> > > > an API). For example, a simplified config would say that every
> > > > instance needs 4GB private memory and 2GB shared memory. In a
> > > > multiprocess model this results in mapping only 6GB of memory in
> > > > each process, instead of the current DPDK model where the 4GB
> > > > per-process private memory is mapped into all other processes,
> > > > resulting in unnecessary mappings, e.g. 16x4GB + 2GB in every
> > > > process.
> > > >
> > > > What we've chosen is to use DPDK's NUMA-aware allocator for this
> > > > purpose. E.g. the above example with 16 instances results in
> > > > allocating 17 DPDK NUMA sockets (1 default shared + 16 private), and
> > > > we can selectively map a given "NUMA socket" (set of memsegs) into a
> > > > process. This also opens up many other possibilities to play with,
> > > > e.g.
> > > > - clearing the full private memory if a process dies, including the
> > > > memzones on it
> > > > - pop-up memory support
> > > > etc.
> > > >
> > > > Another option could be to use page-aligned memzones and control the
> > > > mapping/permissions at the memzone level.
> > > >
> > > > /Laszlo
> > >
> > > Those enhancements sound really, really good.
> > > Do you have code for these that you can share, that we can start
> > > looking at with a view to pulling it in?
> > >
> > > /Bruce
> >
> > Our approach when we started implementing these enhancements was to
> > build an additional layer on top of DPDK, so our changes cannot just be
> > pulled in as-is, and unfortunately we do not yet have permission to
> > share our code. However, we can share ideas and start discussing what
> > would interest the community most, and whether there is something we
> > can easily pull in or put on the DPDK roadmap.
> >
> > As mentioned in the presentation, we implemented a new EAL layer which
> > we also rely on, although this may not be necessary for all our
> > enhancements. For example, our named memory partition pools
> > ("memdomains"), which are the basis of our selective memory mapping and
> > permission control, could be implemented either above or below the
> > memzones, or DPDK could even be just a user of it. Our implementation
> > relies on our new EAL layer, but another option may be to pull this in
> > as a new library that relies on the memzone allocator.
> >
> > We have a whole set of features with the main goal of environment
> > independence, and of course performance first, mainly focusing on NFV
> > deployments, e.g. allowing applications to adapt to different
> > environments (without any code change) while still getting the highest
> > possible performance.
> > The key to this is our new split EAL layer, which I think should be the
> > first step to start with. It can co-exist with the current linuxapp and
> > bsdapp, and would allow supporting both Linux and BSD with separate
> > publisher components that could rely on the existing linuxapp/bsdapp
> > code :)
> > This new EAL layer would open up many possibilities to play with, e.g.
> > expose NUMA in a non-NUMA-aware VM, pretend that every CPU is in a new
> > NUMA domain, emulate a multi-CPU multi-socket system on a single CPU,
> > etc.
> >
> > What do you think would be the right way to start these discussions?
> > We should probably open a new thread on this, as it is no longer fully
> > related to the subject. Or should we have an internal discussion and
> > then present and discuss the ideas in a community call?
> > We have been working with DPDK for a long time, but we are new to the
> > community and need to understand the ways of working here...
>
> A new thread describing the details of how you have implemented things
> would be great.

+1

Please also explain which problems you are trying to solve.
Maybe some of your constraints do not apply here, so the implementation
could be different.

If your work can be split into different features, it may be easier to
discuss each feature in a separate thread.

Thank you
-- 
Thomas