From: László Vadkerti
To: Bruce Richardson, Thomas Monjalon
Cc: dev@dpdk.org
Date: Fri, 12 Dec 2014 04:07:40 +0000
Subject: Re: [dpdk-dev] A question about hugepage initialization time

On Thu, 11 Dec 2014, Bruce Richardson wrote:
> On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> >
> > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> >
> > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > >>>> On Tue, 9 Dec 2014 11:45:07 -0800, &rew wrote:
> > >>>>
> > >>>>>> Hey Folks,
> > >>>>>>
> > >>>>>> Our DPDK application deals with very large in-memory data
> > >>>>>> structures, and can potentially use tens or even hundreds of
> > >>>>>> gigabytes of hugepage memory. During the course of development,
> > >>>>>> we've noticed that as the number of huge pages increases, the
> > >>>>>> memory initialization time during EAL init gets to be quite
> > >>>>>> long, lasting several minutes at present. The growth in init
> > >>>>>> time doesn't appear to be linear, which is concerning.
> > >>>>>>
> > >>>>>> This is a minor inconvenience for us and our customers, as
> > >>>>>> memory initialization makes our boot times a lot longer than
> > >>>>>> they would otherwise be. Also, my experience has been that
> > >>>>>> really long operations are often hiding errors - what you think
> > >>>>>> is merely a slow operation is actually a timeout of some sort,
> > >>>>>> often due to misconfiguration. This leads to two questions:
> > >>>>>>
> > >>>>>> 1. Does the long initialization time suggest that there's an
> > >>>>>>    error happening under the covers?
> > >>>>>> 2. If not, is there any simple way that we can shorten memory
> > >>>>>>    initialization time?
> > >>>>>>
> > >>>>>> Thanks in advance for your insights.
> > >>>>>>
> > >>>>>> --
> > >>>>>> Matt Laswell
> > >>>>>> laswell@infiniteio.com
> > >>>>>> infinite io, inc.
> > >>>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> please find some quick comments on the questions:
> > >>>>> 1.) In our experience, long initialization time is normal with
> > >>>>> large amounts of memory. However, this time depends on a few
> > >>>>> things:
> > >>>>> - number of hugepages (a page fault handled by the kernel is
> > >>>>>   pretty expensive)
> > >>>>> - size of hugepages (memset at initialization)
> > >>>>>
> > >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> > >>>>> time significantly. Using wmemset instead of memset adds an
> > >>>>> additional 20-30% boost by our measurements. Or, by just touching
> > >>>>> the pages but not clearing them you can get still more speedup,
> > >>>>> but in this case your layer or the applications above need to do
> > >>>>> the cleanup at allocation time (e.g. by using rte_zmalloc).
> > >>>>>
> > >>>>> Cheers,
> > >>>>> &rew
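For illustration, here is a minimal sketch of the "touch instead of
memset" idea described above. It assumes a hugetlbfs mount to create the
backing file in and 2MB pages (both assumptions, not taken from the
thread), and it is not the actual EAL code:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SZ (2UL * 1024 * 1024)   /* assumed 2MB hugepages */

/* Map a hugetlbfs-backed file and fault every hugepage in by writing a
 * single byte per page, instead of memset()ing the whole range. The
 * explicit clear is skipped, so the layer above must zero objects at
 * allocation time (rte_zmalloc-style), as noted above. */
static void *map_and_touch(const char *path, size_t len)
{
    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, len) < 0) {
        close(fd);
        return NULL;
    }
    uint8_t *va = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    close(fd);
    if (va == MAP_FAILED)
        return NULL;
    for (size_t off = 0; off < len; off += HUGEPAGE_SZ)
        va[off] = 0;   /* one page fault per hugepage, no full memset */
    return va;
}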
> > >>>> I wonder if the whole rte_malloc code is even worth it with a
> > >>>> modern kernel with transparent huge pages? rte_malloc adds very
> > >>>> little value and is less safe and slower than glibc or other
> > >>>> allocators. Plus you lose the ability to get all the benefit out
> > >>>> of valgrind or electric fence.
> > >>>
> > >>> While I'd dearly love to not have our own custom malloc lib to
> > >>> maintain, for DPDK multiprocess, rte_malloc will be hard to
> > >>> replace as we would need a replacement solution that similarly
> > >>> guarantees that memory mapped in process A is also available at
> > >>> the same address in process B. :-(
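To illustrate the constraint: objects in the shared area hold raw
pointers, so a secondary process has to see the region at exactly the
same virtual address as the primary. A rough sketch of that requirement
follows, with a made-up shm name and the base address passed out of
band; this is not DPDK's actual mapping code:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHM_NAME "/dpdk_example_region"   /* hypothetical name */
#define SHM_SIZE (1UL << 30)              /* 1GB, for illustration */

/* Primary process: create and map the region, then publish its base
 * address out of band (e.g. in a shared runtime config file). */
void *primary_map(void)
{
    int fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, SHM_SIZE) < 0) {
        close(fd);
        return NULL;
    }
    void *base = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);
    return base == MAP_FAILED ? NULL : base;
}

/* Secondary process: ask for the same virtual address and verify the
 * kernel honoured the hint; pointers stored inside the region are only
 * meaningful if this succeeds. */
void *secondary_map(void *base)
{
    int fd = shm_open(SHM_NAME, O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    void *va = mmap(base, SHM_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    close(fd);
    if (va == MAP_FAILED)
        return NULL;
    if (va != base) {   /* the address range was already taken here */
        munmap(va, SHM_SIZE);
        return NULL;
    }
    return va;
}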
> > >> Just out of curiosity, why even bother with multiprocess support?
> > >> What you're talking about above is a multithread model, and you're
> > >> shoehorning multiple processes into it.
> > >> Neil
> > >
> > > Yep, that's pretty much what it is alright. However, this
> > > multiprocess support is very widely used by our customers in
> > > building their applications, and has been in place and supported
> > > since some of the earliest DPDK releases. If it is to be removed,
> > > it needs to be replaced by something that provides equivalent
> > > capabilities to application writers (perhaps something with more
> > > fine-grained sharing etc.)
> > >
> > > /Bruce
> >
> > It is probably time to start discussing how to pull in our
> > multi-process and memory management improvements we were talking
> > about in our DPDK Summit presentation:
> > https://www.youtube.com/watch?v=907VShi799k#t=647
> >
> > A multi-process model can have several benefits, mostly in the high
> > availability area (a telco requirement), due to better separation,
> > control of permissions (per-process RO or RW page mappings), single
> > process restartability, improved startup and core dumping time, etc.
> >
> > As a summary of our memory management additions: they allow an
> > application to describe its memory model in a configuration (or via
> > an API). For example, a simplified config would say that every
> > instance needs 4GB of private memory and 2GB of shared memory. In a
> > multi-process model this results in mapping only 6GB of memory in
> > each process, instead of the current DPDK model where the 4GB
> > per-process private memory is mapped into all other processes,
> > resulting in unnecessary mappings, e.g. 16x4GB + 2GB in every
> > process.
> >
> > What we've chosen is to use DPDK's NUMA-aware allocator for this
> > purpose; the above example with 16 instances results in allocating
> > 17 DPDK NUMA sockets (1 default shared + 16 private), and we can
> > selectively map a given "NUMA socket" (set of memsegs) into a
> > process. This also opens many other possibilities to play with, e.g.
> > - clearing of the full private memory if a process dies, including
> >   the memzones on it
> > - pop-up memory support
> > etc. etc.
> >
> > Another option could be to use page-aligned memzones and control the
> > mapping/permissions at the memzone level.
> >
> > /Laszlo
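To make the quoted example concrete: with 16 instances, each needing
4GB private plus 2GB shared memory, the current model maps
16 x 4GB + 2GB = 66GB into every process, while per-instance mapping
needs only 4GB + 2GB = 6GB per process. A hypothetical sketch of how
such a memory model might be described follows; the struct, field names
and values are invented for illustration and are not from the
presentation or from DPDK:

#include <stddef.h>

/* Hypothetical descriptor for a named memory partition ("memdomain"). */
struct memdomain_cfg {
    const char *name;     /* e.g. "shared0", "inst3_private"       */
    size_t      size;     /* hugepage memory to reserve, in bytes  */
    unsigned    socket;   /* physical NUMA node to allocate from   */
    int         shared;   /* map into every process, or owner only */
};

/* 16 instances with 4GB private memory each, plus one 2GB shared
 * domain: each process maps its own private domain and the shared
 * one (6GB) rather than everybody's private memory (66GB). */
static const struct memdomain_cfg example_model[] = {
    { "shared0",       2UL << 30, 0, 1 },
    { "inst0_private", 4UL << 30, 0, 0 },
    /* ... inst1_private through inst15_private ... */
};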
>
> Those enhancements sound really, really good. Do you have code for
> these that you can share, that we can start looking at with a view to
> pulling it in?
>
> /Bruce

Our approach when we started implementing these enhancements was to have
an additional layer on top of DPDK, so our changes cannot just be pulled
in as is, and unfortunately we do not yet have permission to share our
code. However, we can share ideas and start discussing what would most
interest the community, and whether there is something we can easily
pull in or put on the DPDK roadmap.

As mentioned in the presentation, we implemented a new EAL layer which
we also rely on, although this may not be necessary for all of our
enhancements. For example, our named memory partition pools
("memdomains"), which are the basis of our selective memory mapping and
permission control, could be implemented either above or below the
memzones, or DPDK could even be just a user of it. Our implementation
relies on our new EAL layer, but another option may be to pull this in
as a new library which relies on the memzone allocator.

We have a whole set of features with the main goal of environment
independence and, of course, performance first, mainly focusing on NFV
deployments, e.g. allowing applications to adapt to different
environments (without any code change) while still getting the highest
possible performance.

The key to this is our new split EAL layer, which I think should be the
first step to start with. It can co-exist with the current linuxapp and
bsdapp and would allow supporting both Linux and BSD with separate
publisher components, which could rely on the existing linuxapp/bsdapp
code :)

This new EAL layer would open up many possibilities to play with, e.g.
expose NUMA in a non-NUMA-aware VM, pretend that every CPU is in a new
NUMA domain, emulate a multi-CPU, multi-socket system on a single CPU,
etc. etc.

What do you think would be the right way to start these discussions?
Should we open a new thread on this, as it is now not fully related to
the subject, or should we have an internal discussion and then present
and discuss the ideas in a community call?

We have been working with DPDK for a long time, but we are new to the
community and need to understand the ways of working here...