* [dpdk-dev] A question about hugepage initialization time
@ 2014-12-09 16:33 Matt Laswell
2014-12-09 16:50 ` Burakov, Anatoly
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Matt Laswell @ 2014-12-09 16:33 UTC (permalink / raw)
To: dev
Hey Folks,
Our DPDK application deals with very large in memory data structures, and
can potentially use tens or even hundreds of gigabytes of hugepage memory.
During the course of development, we've noticed that as the number of huge
pages increases, the memory initialization time during EAL init gets to be
quite long, lasting several minutes at present. The growth in init time
doesn't appear to be linear, which is concerning.
This is a minor inconvenience for us and our customers, as memory
initialization makes our boot times a lot longer than it would otherwise
be. Also, my experience has been that really long operations often are
hiding errors - what you think is merely a slow operation is actually a
timeout of some sort, often due to misconfiguration. This leads to two
questions:
1. Does the long initialization time suggest that there's an error
happening under the covers?
2. If not, is there any simple way that we can shorten memory
initialization time?
Thanks in advance for your insights.
--
Matt Laswell
laswell@infiniteio.com
infinite io, inc.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-09 16:33 [dpdk-dev] A question about hugepage initialization time Matt Laswell
@ 2014-12-09 16:50 ` Burakov, Anatoly
2014-12-09 19:06 ` Matthew Hall
2014-12-09 19:45 ` &rew
2 siblings, 0 replies; 14+ messages in thread
From: Burakov, Anatoly @ 2014-12-09 16:50 UTC (permalink / raw)
To: Matt Laswell, dev
Hi
> Hey Folks,
>
> Our DPDK application deals with very large in memory data structures, and
> can potentially use tens or even hundreds of gigabytes of hugepage
> memory.
> During the course of development, we've noticed that as the number of
> huge pages increases, the memory initialization time during EAL init gets to
> be quite long, lasting several minutes at present. The growth in init time
> doesn't appear to be linear, which is concerning.
>
> This is a minor inconvenience for us and our customers, as memory
> initialization makes our boot times a lot longer than it would otherwise be.
> Also, my experience has been that really long operations often are hiding
> errors - what you think is merely a slow operation is actually a timeout of
> some sort, often due to misconfiguration. This leads to two
> questions:
>
> 1. Does the long initialization time suggest that there's an error happening
> under the covers?
> 2. If not, is there any simple way that we can shorten memory initialization
> time?
>
> Thanks in advance for your insights.
I've seen similar behavior on some systems with large amounts of memory.
Basically, the more hugepages there are, the longer it takes for EAL to
process them: map them, sort them, remap them, find contiguous segments, etc.
The slowdown is not linear because at each stage there are multiple loops
over the hugepage list, with the search for contiguous memory segments taking
the longest.
I am not aware of any mechanism to speed up EAL startup in such cases (not
without code changes, anyway). The only thing I could suggest is to avoid
using 2MB pages in this scenario.
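To make the sort-then-scan step concrete, here is a minimal sketch in C. It is not the EAL code: `struct page_info` and `count_contiguous_segments` are hypothetical names used only for illustration, and the sketch assumes each page record carries its physical address. It sorts the records and counts runs of physically contiguous pages. The page count is what drives the cost: 128 GB is 65,536 pages of 2 MB but only 128 pages of 1 GB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-hugepage record (not DPDK's real struct). */
struct page_info {
    uint64_t physaddr; /* physical address of the page */
};

/* qsort comparator: order pages by ascending physical address. */
static int cmp_physaddr(const void *a, const void *b)
{
    const struct page_info *pa = a;
    const struct page_info *pb = b;

    if (pa->physaddr < pb->physaddr)
        return -1;
    if (pa->physaddr > pb->physaddr)
        return 1;
    return 0;
}

/*
 * Sort the pages by physical address, then count how many runs of
 * physically contiguous pages exist. A new run starts whenever a page
 * does not begin exactly page_sz bytes after the previous one.
 */
size_t count_contiguous_segments(struct page_info *pages, size_t n,
                                 uint64_t page_sz)
{
    assert(page_sz > 0);
    if (n == 0)
        return 0;

    qsort(pages, n, sizeof(pages[0]), cmp_physaddr);

    size_t segments = 1;
    for (size_t i = 1; i < n; i++) {
        if (pages[i].physaddr != pages[i - 1].physaddr + page_sz)
            segments++;
    }
    return segments;
}
```

Even this simplified version does more than linear work because of the sort, and any additional passes over the list add to that, which is consistent with init time growing faster than the page count.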
Thanks,
Anatoly
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-09 16:33 [dpdk-dev] A question about hugepage initialization time Matt Laswell
2014-12-09 16:50 ` Burakov, Anatoly
@ 2014-12-09 19:06 ` Matthew Hall
2014-12-09 22:05 ` Matt Laswell
2014-12-09 19:45 ` &rew
2 siblings, 1 reply; 14+ messages in thread
From: Matthew Hall @ 2014-12-09 19:06 UTC (permalink / raw)
To: Matt Laswell; +Cc: dev
On Tue, Dec 09, 2014 at 10:33:59AM -0600, Matt Laswell wrote:
> Our DPDK application deals with very large in memory data structures, and
> can potentially use tens or even hundreds of gigabytes of hugepage memory.
What you're doing is an unusual use case and this is open source code where
nobody might have tested and QA'ed this yet.
So my recommendation would be to add some rte_log statements to measure the
various steps in the process and see what's going on. I would also use the
Linux perf framework for low-overhead sampling-based profiling, and make sure
you've got everything compiled with debug symbols so you can see what's
consuming the execution time.
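As a rough sketch of the timing idea, the C below wraps one step in two monotonic clock samples and reports the difference. The names `elapsed_sec`, `time_step` and `example_step` are hypothetical, and the sketch prints to stderr so that it stays self-contained; inside a DPDK application the same message would presumably go through the rte_log facility instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC samples. */
double elapsed_sec(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) +
           (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Placeholder for a real init step (hypothetical, does a little busy work). */
void example_step(void)
{
    volatile unsigned long sink = 0;

    for (unsigned long i = 0; i < 1000000UL; i++)
        sink += i;
}

/* Run one step, report how long it took, and return the elapsed seconds. */
double time_step(const char *name, void (*step)(void))
{
    struct timespec t0, t1;

    assert(name != NULL && step != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    step();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = elapsed_sec(&t0, &t1);
    fprintf(stderr, "init step %-20s took %.6f s\n", name, secs);
    return secs;
}
```

Calling `time_step("example", example_step)` around each suspect stage gives a coarse breakdown; for finer detail, `perf record -g` followed by `perf report` on a build with debug symbols shows where the samples land.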
You might find that it makes sense to use some custom allocators like jemalloc
alongside of the DPDK allocators, including perhaps "transparent hugepage
mode" in your process, and some larger page sizes to reduce the number of
pages.
You can also use these handy kernel options: hugepagesz=<size> hugepages=N.
This creates guaranteed-contiguous known-good hugepages during boot, which
initialize much more quickly and with fewer glitches in my experience.
https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
https://www.kernel.org/doc/Documentation/vm/transhuge.txt
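After booting with something like hugepagesz=1G hugepages=128 on the kernel command line, it is worth confirming that the kernel really reserved the pages before blaming EAL. The sketch below parses text in the format of /proc/meminfo and pulls out a field such as HugePages_Total or Hugepagesize. The helper name `meminfo_value` is hypothetical, and the sketch only parses a string; reading the file itself is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for a line starting with "<key>:" in meminfo-formatted text and
 * return the integer that follows it. Returns -1 if the key is absent
 * or no number can be parsed.
 */
long meminfo_value(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);

    size_t klen = strlen(key);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Require the colon so "HugePages" does not match "HugePages_Total". */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long value;

            if (sscanf(line + klen + 1, "%ld", &value) == 1)
                return value;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}
```

Feeding it the contents of /proc/meminfo and checking that HugePages_Total matches the requested count is a quick way to rule out a misconfigured reservation.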
There is no one-size-fits-all solution but these are some possibilities.
Good Luck,
Matthew.
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-09 16:33 [dpdk-dev] A question about hugepage initialization time Matt Laswell
2014-12-09 16:50 ` Burakov, Anatoly
2014-12-09 19:06 ` Matthew Hall
@ 2014-12-09 19:45 ` &rew
2014-12-09 22:10 ` Stephen Hemminger
2 siblings, 1 reply; 14+ messages in thread
From: &rew @ 2014-12-09 19:45 UTC (permalink / raw)
To: Matt Laswell; +Cc: dev
> Hey Folks,
>
> Our DPDK application deals with very large in memory data structures, and
> can potentially use tens or even hundreds of gigabytes of hugepage memory.
> During the course of development, we've noticed that as the number of huge
> pages increases, the memory initialization time during EAL init gets to be
> quite long, lasting several minutes at present. The growth in init time
> doesn't appear to be linear, which is concerning.
>
> This is a minor inconvenience for us and our customers, as memory
> initialization makes our boot times a lot longer than it would otherwise
> be. Also, my experience has been that really long operations often are
> hiding errors - what you think is merely a slow operation is actually a
> timeout of some sort, often due to misconfiguration. This leads to two
> questions:
>
> 1. Does the long initialization time suggest that there's an error
> happening under the covers?
> 2. If not, is there any simple way that we can shorten memory
> initialization time?
>
> Thanks in advance for your insights.
>
> --
> Matt Laswell
> laswell@infiniteio.com
> infinite io, inc.
>
Hello,
please find some quick comments on the questions:
1.) In our experience, a long initialization time is normal with a
large amount of memory. However, the time depends on a few things:
- number of hugepages (a page fault handled by the kernel is pretty expensive)
- size of hugepages (memset at initialization)
2.) Using 1G pages instead of 2M will reduce the initialization time
significantly. Using wmemset instead of memset gives an additional 20-30%
boost by our measurements. Alternatively, just touching the pages without
clearing them gives some further speedup, but in that case your layer or
the applications above need to do the cleanup at allocation time
(e.g. by using rte_zmalloc).
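The difference between clearing and merely touching pages can be sketched in C as follows. This is an illustration under stated assumptions, not the EAL implementation: `clear_region` and `touch_region` are hypothetical names, and the sketch works on any writable buffer rather than on real hugepage mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Zero the whole region: the cost scales with the total number of bytes. */
void clear_region(void *base, size_t len)
{
    assert(base != NULL);
    memset(base, 0, len);
}

/*
 * Touch one byte per page so that every page is faulted in, but leave
 * the contents as they were. Returns the number of pages touched.
 * The cost scales with the number of pages, not the number of bytes.
 */
size_t touch_region(void *base, size_t len, size_t page_sz)
{
    assert(base != NULL && page_sz > 0);

    volatile unsigned char *p = base;
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page_sz) {
        p[off] = p[off]; /* read and write back: forces a writable mapping */
        touched++;
    }
    return touched;
}
```

With the touch-only variant the memory is not guaranteed to be zero, which is why the zeroing then has to happen at allocation time in the layer above.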
Cheers,
&rew
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-09 19:06 ` Matthew Hall
@ 2014-12-09 22:05 ` Matt Laswell
0 siblings, 0 replies; 14+ messages in thread
From: Matt Laswell @ 2014-12-09 22:05 UTC (permalink / raw)
To: Matthew Hall; +Cc: dev
Hey Everybody,
Thanks for the feedback. Yeah, we're pretty sure that the amount of memory
we work with is atypical, and we're hitting something that isn't an issue
for most DPDK users.
To clarify, yes, we're using 1GB hugepages, and we set them up via
hugepagesz= and hugepages= in our kernel's grub line. We find that when we
use four 1GB huge pages, EAL memory init takes a couple of seconds, which
is no big deal. When we use 128 1GB pages, though, memory init can take
several minutes. The concern is that we will very likely use even more
memory in the future. Our boot time is mostly just a nuisance now;
nonlinear growth in memory init time may transform it into a larger problem.
We've had to disable transparent hugepages due to latency issues with
in-memory databases. I'll have to look at the possibility of alternative
memset implementations. Perhaps some profiler time is in my future.
Again, thanks to everybody for the useful information.
--
Matt Laswell
laswell@infiniteio.com
infinite io, inc.
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-09 19:45 ` &rew
@ 2014-12-09 22:10 ` Stephen Hemminger
2014-12-10 10:32 ` Bruce Richardson
0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2014-12-09 22:10 UTC (permalink / raw)
To: &rew; +Cc: dev
On Tue, 9 Dec 2014 11:45:07 -0800
&rew <andras.kovacs@ericsson.com> wrote:
> Hello,
>
> please find some quick comments on the questions:
> 1.) By our experience long initialization time is normal in case of
> large amount of memory. However this time depends on some things:
> - number of hugepages (pagefault handled by kernel is pretty expensive)
> - size of hugepages (memset at initialization)
>
> 2.) Using 1G pages instead of 2M will reduce the initialization time
> significantly. Using wmemset instead of memset adds an additional 20-30%
> boost by our measurements. Or, just by touching the pages but not cleaning
> them you can have still some more speedup. But in this case your layer or
> the applications above need to do the cleanup at allocation time
> (e.g. by using rte_zmalloc).
>
> Cheers,
> &rew
I wonder if the whole rte_malloc code is even worth it with a modern kernel
with transparent huge pages? rte_malloc adds very little value and is less safe
and slower than glibc or other allocators. Plus you lose the ability to get
the full benefit of valgrind or electric fence.
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-09 22:10 ` Stephen Hemminger
@ 2014-12-10 10:32 ` Bruce Richardson
2014-12-10 14:29 ` Neil Horman
0 siblings, 1 reply; 14+ messages in thread
From: Bruce Richardson @ 2014-12-10 10:32 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev
On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> I wonder if the whole rte_malloc code is even worth it with a modern kernel
> with transparent huge pages? rte_malloc adds very little value and is less safe
> and slower than glibc or other allocators. Plus you lose the ability to get
> all the benefit out of valgrind or electric fence.
While I'd dearly love to not have our own custom malloc lib to maintain, for DPDK
multiprocess, rte_malloc will be hard to replace as we would need a replacement
solution that similarly guarantees that memory mapped in process A is also
available at the same address in process B. :-(
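The same-address requirement matters because absolute pointers get stored inside the shared memory itself. The C sketch below illustrates this with a shared anonymous mapping and fork(); the names `struct shared_node` and `shared_pointer_demo` are hypothetical. Note the assumption: a forked child inherits the mapping at the same virtual address for free, whereas independently started processes would have to arrange for that explicitly.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A node that lives in shared memory and stores an absolute pointer. */
struct shared_node {
    int value;
    struct shared_node *next;
};

/*
 * Build a two-node list inside a shared mapping, then let a child
 * process follow the stored pointer. Returns 0 if the child read the
 * expected value, nonzero on any failure.
 */
int shared_pointer_demo(void)
{
    size_t len = 2 * sizeof(struct shared_node);
    struct shared_node *nodes = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (nodes == MAP_FAILED)
        return -1;

    nodes[0].value = 1;
    nodes[0].next = &nodes[1]; /* only valid where the mapping sits here */
    nodes[1].value = 42;
    nodes[1].next = NULL;
    assert(nodes[0].next->value == 42);

    pid_t pid = fork();
    if (pid < 0) {
        munmap(nodes, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: the mapping is at the same address, so the pointer works. */
        _exit(nodes[0].next->value == 42 ? 0 : 1);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(nodes, len);
    if (waited != pid)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

If a second process mapped the same memory at a different virtual address, `nodes[0].next` would point somewhere meaningless in that process, which is the problem a replacement allocator would have to solve.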
/Bruce
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-10 10:32 ` Bruce Richardson
@ 2014-12-10 14:29 ` Neil Horman
2014-12-10 14:35 ` Bruce Richardson
0 siblings, 1 reply; 14+ messages in thread
From: Neil Horman @ 2014-12-10 14:29 UTC (permalink / raw)
To: Bruce Richardson; +Cc: dev
On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> While I'd dearly love to not have our own custom malloc lib to maintain, for DPDK
> multiprocess, rte_malloc will be hard to replace as we would need a replacement
> solution that similarly guarantees that memory mapped in process A is also
> available at the same address in process B. :-(
>
Just out of curiosity, why even bother with multiprocess support? What you're
talking about above is a multithread model, and you're shoehorning multiple
processes into it.
Neil
> /Bruce
>
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-10 14:29 ` Neil Horman
@ 2014-12-10 14:35 ` Bruce Richardson
2014-12-10 19:16 ` László Vadkerti
0 siblings, 1 reply; 14+ messages in thread
From: Bruce Richardson @ 2014-12-10 14:35 UTC (permalink / raw)
To: Neil Horman; +Cc: dev
On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> Just out of curiosity, why even bother with multiprocess support? What you're
> talking about above is a multithread model, and you're shoehorning multiple
> processes into it.
> Neil
>
Yep, that's pretty much what it is alright. However, this multiprocess support
is very widely used by our customers in building their applications, and has
been in place and supported since some of the earliest DPDK releases. If it
is to be removed, it needs to be replaced by something that provides equivalent
capabilities to application writers (perhaps something with more fine-grained
sharing etc.)
/Bruce
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-10 14:35 ` Bruce Richardson
@ 2014-12-10 19:16 ` László Vadkerti
2014-12-11 10:14 ` Bruce Richardson
0 siblings, 1 reply; 14+ messages in thread
From: László Vadkerti @ 2014-12-10 19:16 UTC (permalink / raw)
To: Bruce Richardson, Neil Horman, Matt Laswell; +Cc: dev
Well, here it is :)
On Wed, 10 Dec 2014, Bruce Richardson wrote:
> Yep, that's pretty much what it is alright. However, this multiprocess
> support is very widely used by our customers in building their
> applications, and has been in place and supported since some of the
> earliest DPDK releases. If it is to be removed, it needs to be
> replaced by something that provides equivalent capabilities to
> application writers (perhaps something with more fine-grained sharing
> etc.)
>
> /Bruce
>
It is probably time to start discussing how to pull in our multi process and
memory management improvements we were talking about in our
DPDK Summit presentation:
https://www.youtube.com/watch?v=907VShi799k#t=647
A multi-process model could have several benefits, mostly in the high
availability area (a telco requirement), due to better separation, permission
control (per-process RO or RW page mappings), single-process restartability,
improved startup and core dumping time, etc.
As a summary of our memory management additions: they allow an application
to describe its memory model in a configuration (or via an API),
e.g. a simplified config would say that every instance needs 4GB of private
memory and 2GB of shared memory. In a multi-process model this results in
mapping only 6GB of memory in each process, instead of the current DPDK model
where the 4GB of per-process private memory is mapped into all other processes,
resulting in unnecessary mappings, e.g. 16x4GB + 2GB in every process.
What we've chosen is to use DPDK's NUMA-aware allocator for this purpose,
e.g. the above example for 16 instances results in allocating
17 DPDK NUMA sockets (1 default shared + 16 private), and we can selectively
map a given "NUMA socket" (set of memsegs) into a process.
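The numbers in the example above can be checked with a few lines of C. The function names are hypothetical and only restate the arithmetic from the text: mapping every instance's private memory everywhere versus mapping only one's own private memory plus the shared region.

```c
#include <assert.h>
#include <stdint.h>

/*
 * GB mapped in each process when the private memory of every instance
 * is mapped into all processes (the model described as current).
 */
uint64_t mapped_gb_everything(uint64_t instances, uint64_t private_gb,
                              uint64_t shared_gb)
{
    assert(instances > 0);
    return instances * private_gb + shared_gb;
}

/* GB mapped in each process when only its own private memory is mapped. */
uint64_t mapped_gb_selective(uint64_t private_gb, uint64_t shared_gb)
{
    return private_gb + shared_gb;
}

/* Logical "NUMA sockets" needed: one default shared plus one per instance. */
uint64_t logical_sockets_needed(uint64_t instances)
{
    assert(instances > 0);
    return instances + 1;
}
```

For 16 instances with 4GB private and 2GB shared memory this gives 66GB mapped per process in the first model, 6GB in the second, and 17 logical sockets.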
This also opens many other possibilities to play with, e.g.
- clearing of the full private memory if a process dies including memzones on it
- pop-up memory support
etc. etc.
Another option could be to use page-aligned memzones and control the
mapping/permissions on a memzone level.
/Laszlo
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-10 19:16 ` László Vadkerti
@ 2014-12-11 10:14 ` Bruce Richardson
2014-12-12 4:07 ` László Vadkerti
0 siblings, 1 reply; 14+ messages in thread
From: Bruce Richardson @ 2014-12-11 10:14 UTC (permalink / raw)
To: László Vadkerti; +Cc: dev
On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> Well, here it is :)
>
> On Wed, 10 Dec 2014, Bruce Richardson wrote:
>
> > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> >>>> On Tue, 9 Dec 2014 11:45:07 -0800
> >>>> &rew <andras.kovacs@ericsson.com> wrote:
> >>>>
> >>>>>> Hey Folks,
> >>>>>>
> >>>>>> Our DPDK application deals with very large in memory data
> >>>>>> structures, and can potentially use tens or even hundreds of gigabytes of hugepage memory.
> >>>>>> During the course of development, we've noticed that as the
> >>>>>> number of huge pages increases, the memory initialization time
> >>>>>> during EAL init gets to be quite long, lasting several minutes at
> >>>>>> present. The growth in init time doesn't appear to be linear, which is concerning.
> >>>>>>
> >>>>>> This is a minor inconvenience for us and our customers, as memory
> >>>>>> initialization makes our boot times a lot longer than it would
> >>>>>> otherwise be. Also, my experience has been that really long
> >>>>>> operations often are hiding errors - what you think is merely a
> >>>>>> slow operation is actually a timeout of some sort, often due to
> >>>>>> misconfiguration. This leads to two
> >>>>>> questions:
> >>>>>>
> >>>>>> 1. Does the long initialization time suggest that there's an
> >>>>>> error happening under the covers?
> >>>>>> 2. If not, is there any simple way that we can shorten memory
> >>>>>> initialization time?
> >>>>>>
> >>>>>> Thanks in advance for your insights.
> >>>>>>
> >>>>>> --
> >>>>>> Matt Laswell
> >>>>>> laswell@infiniteio.com
> >>>>>> infinite io, inc.
> >>>>>>
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> please find some quick comments on the questions:
> >>>>> 1.) By our experience long initialization time is normal in case
> >>>>> of large amount of memory. However this time depends on some things:
> >>>>> - number of hugepages (pagefault handled by kernel is pretty
> >>>>> expensive)
> >>>>> - size of hugepages (memset at initialization)
> >>>>>
> >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> >>>>> time significantly. Using wmemset instead of memset adds an
> >>>>> additional 20-30% boost by our measurements. Or, just by touching
> >>>>> the pages but not cleaning them you can have still some more
> >>>>> speedup. But in this case your layer or the applications above
> >>>>> need to do the cleanup at allocation time (e.g. by using rte_zmalloc).
> >>>>>
> >>>>> Cheers,
> >>>>> &rew
> >>>>
> >>>> I wonder if the whole rte_malloc code is even worth it with a
> >>>> modern kernel with transparent huge pages? rte_malloc adds very
> >>>> little value and is less safe and slower than glibc or other
> >>>> allocators. Plus you lose the ability to get all the benefit out of
> >>>> valgrind or electric fence.
> >>>
> >>> While I'd dearly love to not have our own custom malloc lib to
> >>> maintain, for DPDK multiprocess, rte_malloc will be hard to replace
> >>> as we would need a replacement solution that similarly guarantees
> >>> that memory mapped in process A is also available at the same
> >>> address in process B. :-(
> >>>
> >> Just out of curiosity, why even bother with multiprocess support?
> >> What you're talking about above is a multithread model, and you're
> >> shoehorning multiple processes into it.
> >> Neil
> >>
> >
> > Yep, that's pretty much what it is alright. However, this multiprocess
> > support is very widely used by our customers in building their
> > applications, and has been in place and supported since some of the
> > earliest DPDK releases. If it is to be removed, it needs to be
> > replaced by something that provides equivalent capabilities to
> > application writers (perhaps something with more fine-grained sharing
> > etc.)
> >
> > /Bruce
> >
>
> It is probably time to start discussing how to pull in our multi process and
> memory management improvements we were talking about in our
> DPDK Summit presentation:
> https://www.youtube.com/watch?v=907VShi799k#t=647
>
> Multi-process model could have several benefits mostly in the high availability
> area (telco requirement) due to better separation, controlling permissions
> (per process RO or RW page mappings), single process restartability, improved
> startup and core dumping time etc.
>
> As a summary of our memory management additions, it allows an application
> to describe their memory model in a configuration (or via an API),
> e.g. a simplified config would say that every instance will need 4GB private
> memory and 2GB shared memory. In a multi process model this will result
> mapping only 6GB memory in each process instead of the current DPDK model
> where the 4GB per process private memory is mapped into all other processes
> resulting in unnecessary mappings, e.g. 16x4GB + 2GB in every process.
>
> What we've chosen is to use DPDK's NUMA aware allocator for this purpose,
> e.g. the above example for 16 instances will result allocating
> 17 DPDK NUMA sockets (1 default shared + 16 private) and we can selectively
> map a given "NUMA socket" (set of memsegs) into a process.
> This also opens many other possibilities to play with, e.g.
> - clearing of the full private memory if a process dies including memzones on it
> - pop-up memory support
> etc. etc.
>
> Other option could be to use page aligned memzones and control the
> mapping/permissions on a memzone level.
>
> /Laszlo
Those enhancements sound really, really good. Do you have code for these that
you can share that we can start looking at with a view to pulling it in?
/Bruce
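To make the mapping arithmetic in the quoted example concrete, here is a small Python sketch (not DPDK code). It uses only the numbers given above: 16 instances, each needing 4 GB of private and 2 GB of shared memory. The function names are made up for illustration.

```python
# Illustrative arithmetic only; the function names are hypothetical and
# nothing here calls into DPDK.


def mapped_gb_current_model(instances: int, private_gb: int, shared_gb: int) -> int:
    """Every process maps every instance's private memory plus the shared memory."""
    return instances * private_gb + shared_gb


def mapped_gb_selective_model(private_gb: int, shared_gb: int) -> int:
    """Each process maps only its own private memory plus the shared memory."""
    return private_gb + shared_gb


if __name__ == "__main__":
    current = mapped_gb_current_model(16, 4, 2)
    selective = mapped_gb_selective_model(4, 2)
    print(f"current model:   {current} GB mapped per process")
    print(f"selective model: {selective} GB mapped per process")
```

With the quoted figures this gives 16 x 4 + 2 = 66 GB mapped in every process under the current model, against 4 + 2 = 6 GB with selective mapping.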
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-11 10:14 ` Bruce Richardson
@ 2014-12-12 4:07 ` László Vadkerti
2014-12-12 9:59 ` Bruce Richardson
0 siblings, 1 reply; 14+ messages in thread
From: László Vadkerti @ 2014-12-12 4:07 UTC (permalink / raw)
To: Bruce Richardson, Thomas Monjalon; +Cc: dev
On Thu, 11 Dec, 2014, Bruce Richardson wrote:
> On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> >
> > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> >
> > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > >>>> On Tue, 9 Dec 2014 11:45:07 -0800 &rew
> > >>>> <andras.kovacs@ericsson.com> wrote:
> > >>>>
> > >>>>>> Hey Folks,
> > >>>>>>
> > >>>>>> Our DPDK application deals with very large in memory data
> > >>>>>> structures, and can potentially use tens or even hundreds of
> > >>>>>> gigabytes of hugepage memory.
> > >>>>>> During the course of development, we've noticed that as the
> > >>>>>> number of huge pages increases, the memory initialization time
> > >>>>>> during EAL init gets to be quite long, lasting several minutes
> > >>>>>> at present. The growth in init time doesn't appear to be linear,
> > >>>>>> which is concerning.
> > >>>>>>
> > >>>>>> This is a minor inconvenience for us and our customers, as
> > >>>>>> memory initialization makes our boot times a lot longer than it
> > >>>>>> would otherwise be. Also, my experience has been that really
> > >>>>>> long operations often are hiding errors - what you think is
> > >>>>>> merely a slow operation is actually a timeout of some sort,
> > >>>>>> often due to misconfiguration. This leads to two
> > >>>>>> questions:
> > >>>>>>
> > >>>>>> 1. Does the long initialization time suggest that there's an
> > >>>>>> error happening under the covers?
> > >>>>>> 2. If not, is there any simple way that we can shorten memory
> > >>>>>> initialization time?
> > >>>>>>
> > >>>>>> Thanks in advance for your insights.
> > >>>>>>
> > >>>>>> --
> > >>>>>> Matt Laswell
> > >>>>>> laswell@infiniteio.com
> > >>>>>> infinite io, inc.
> > >>>>>>
> > >>>>>
> > >>>>> Hello,
> > >>>>>
> > >>>>> please find some quick comments on the questions:
> > >>>>> 1.) By our experience long initialization time is normal in case
> > >>>>> of large amount of memory. However this time depends on some
> > >>>>> things:
> > >>>>> - number of hugepages (pagefault handled by kernel is pretty
> > >>>>> expensive)
> > >>>>> - size of hugepages (memset at initialization)
> > >>>>>
> > >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> > >>>>> time significantly. Using wmemset instead of memset adds an
> > >>>>> additional 20-30% boost by our measurements. Or, just by
> > >>>>> touching the pages but not cleaning them you can have still some
> > >>>>> more speedup. But in this case your layer or the applications
> > >>>>> above need to do the cleanup at allocation time (e.g. by using
> > >>>>> rte_zmalloc).
> > >>>>>
> > >>>>> Cheers,
> > >>>>> &rew
> > >>>>
> > >>>> I wonder if the whole rte_malloc code is even worth it with a
> > >>>> modern kernel with transparent huge pages? rte_malloc adds very
> > >>>> little value and is less safe and slower than glibc or other
> > >>>> allocators. Plus you lose the ability to get all the benefit out of
> > >>>> valgrind or electric fence.
> > >>>
> > >>> While I'd dearly love to not have our own custom malloc lib to
> > >>> maintain, for DPDK multiprocess, rte_malloc will be hard to
> > >>> replace as we would need a replacement solution that similarly
> > >>> guarantees that memory mapped in process A is also available at
> > >>> the same address in process B. :-(
> > >>>
> > >> Just out of curiosity, why even bother with multiprocess support?
> > >> What you're talking about above is a multithread model, and you're
> > >> shoehorning multiple processes into it.
> > >> Neil
> > >>
> > >
> > > Yep, that's pretty much what it is alright. However, this
> > > multiprocess support is very widely used by our customers in
> > > building their applications, and has been in place and supported
> > > since some of the earliest DPDK releases. If it is to be removed, it
> > > needs to be replaced by something that provides equivalent
> > > capabilities to application writers (perhaps something with more
> > > fine-grained sharing
> > > etc.)
> > >
> > > /Bruce
> > >
> >
> > It is probably time to start discussing how to pull in our multi
> > process and memory management improvements we were talking about in
> > our DPDK Summit presentation:
> > https://www.youtube.com/watch?v=907VShi799k#t=647
> >
> > Multi-process model could have several benefits mostly in the high
> > availability area (telco requirement) due to better separation,
> > controlling permissions (per process RO or RW page mappings), single
> > process restartability, improved startup and core dumping time etc.
> >
> > As a summary of our memory management additions, it allows an
> > application to describe their memory model in a configuration (or via
> > an API), e.g. a simplified config would say that every instance will
> > need 4GB private memory and 2GB shared memory. In a multi process
> > model this will result mapping only 6GB memory in each process instead
> > of the current DPDK model where the 4GB per process private memory is
> > mapped into all other processes resulting in unnecessary mappings, e.g.
> > 16x4GB + 2GB in every process.
> >
> > What we've chosen is to use DPDK's NUMA aware allocator for this
> > purpose, e.g. the above example for 16 instances will result
> > allocating
> > 17 DPDK NUMA sockets (1 default shared + 16 private) and we can
> > selectively map a given "NUMA socket" (set of memsegs) into a process.
> > This also opens many other possibilities to play with, e.g.
> > - clearing of the full private memory if a process dies including
> > memzones on it
> > - pop-up memory support
> > etc. etc.
> >
> > Other option could be to use page aligned memzones and control the
> > mapping/permissions on a memzone level.
> >
> > /Laszlo
>
> Those enhancements sound really, really good. Do you have code for these
> that you can share that we can start looking at with a view to pulling it in?
>
> /Bruce
Our approach when we started implementing these enhancements was to have
an additional layer on top of DPDK, so our changes cannot just be pulled in as is,
and unfortunately we do not yet have permission to share our code.
However, we can share ideas and start discussing what would interest the
community most, and whether there is something we can easily pull in or put on
the DPDK roadmap.

As mentioned in the presentation, we implemented a new EAL layer which we
also rely on, although this may not be necessary for all our enhancements.
For example, our named memory partition pools ("memdomains"), which are the
basis of our selective memory mapping and permission control, could be
implemented either above or below the memzones, or DPDK could even be just a
user of them. Our implementation relies on our new EAL layer, but another option
may be to pull this in as a new library which relies on the memzone allocator.

We have a whole set of features whose main goals are environment independence
and, of course, performance, focusing mainly on NFV deployments,
e.g. allowing applications to adapt to different environments (without any code
change) while still getting the highest possible performance.
The key to this is our new split EAL layer, which I think should be the first
step. It can co-exist with the current linuxapp and bsdapp, and would allow
supporting both Linux and BSD with separate publisher components which could
rely on the existing linuxapp/bsdapp code :)
This new EAL layer would open up many possibilities to play with,
e.g. exposing NUMA in a non-NUMA-aware VM, pretending that every CPU is in a new
NUMA domain, emulating a multi-CPU multi-socket system on a single CPU, etc.

What do you think would be the right way to start these discussions?
Should we open a new thread on this, as it is no longer fully related
to the subject, or should we have an internal discussion and then present and
discuss the ideas in a community call?
We have been working with DPDK for a long time, but we are new to the community
and need to understand the ways of working here...
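
The selective mapping idea described above can be pictured with a toy model. This is a hedged Python sketch, not the implementation discussed in this thread (which was not shared) and not a DPDK API: every class and function name is hypothetical. It assumes the 1 shared + 16 private layout from the quoted example, and that each process maps the shared domain and its own private domain read-write.

```python
from dataclasses import dataclass, field

# Hypothetical toy model; none of these names exist in DPDK.


@dataclass(frozen=True)
class MemDomain:
    """A named memory partition with a size in GB."""
    name: str
    size_gb: int


@dataclass
class ProcessView:
    """The set of domains one process has mapped, with a permission for each."""
    name: str
    mappings: dict = field(default_factory=dict)  # domain name -> "ro" or "rw"

    def map_domain(self, domain: MemDomain, perm: str) -> None:
        if perm not in ("ro", "rw"):
            raise ValueError(f"unknown permission: {perm}")
        self.mappings[domain.name] = perm

    def can_read(self, domain: MemDomain) -> bool:
        # Any mapped domain is readable; unmapped domains are invisible.
        return domain.name in self.mappings

    def can_write(self, domain: MemDomain) -> bool:
        return self.mappings.get(domain.name) == "rw"


def build_layout(instances: int, private_gb: int = 4, shared_gb: int = 2):
    """Create one shared domain plus one private domain per instance.

    Assumption for this sketch: each process maps the shared domain and
    only its own private domain, both read-write.
    """
    shared = MemDomain("shared", shared_gb)
    private = [MemDomain(f"private{i}", private_gb) for i in range(instances)]
    views = []
    for i, dom in enumerate(private):
        view = ProcessView(f"proc{i}")
        view.map_domain(shared, "rw")
        view.map_domain(dom, "rw")
        views.append(view)
    return shared, private, views
```

In this model, build_layout(16) yields 17 domains in total, and process 0 can write to its own private domain but has no mapping at all for the private domain of process 1.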
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-12 4:07 ` László Vadkerti
@ 2014-12-12 9:59 ` Bruce Richardson
2014-12-12 15:50 ` Thomas Monjalon
0 siblings, 1 reply; 14+ messages in thread
From: Bruce Richardson @ 2014-12-12 9:59 UTC (permalink / raw)
To: László Vadkerti; +Cc: dev
On Fri, Dec 12, 2014 at 04:07:40AM +0000, László Vadkerti wrote:
> On Thu, 11 Dec, 2014, Bruce Richardson wrote:
> > On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> > >
> > > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> > >
> > > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > > >>>> On Tue, 9 Dec 2014 11:45:07 -0800 &rew
> > > >>>> <andras.kovacs@ericsson.com> wrote:
> > > >>>>
> > > >>>>>> Hey Folks,
> > > >>>>>>
> > > >>>>>> Our DPDK application deals with very large in memory data
> > > >>>>>> structures, and can potentially use tens or even hundreds of
> > > >>>>>> gigabytes of hugepage memory.
> > > >>>>>> During the course of development, we've noticed that as the
> > > >>>>>> number of huge pages increases, the memory initialization time
> > > >>>>>> during EAL init gets to be quite long, lasting several minutes
> > > >>>>>> at present. The growth in init time doesn't appear to be linear,
> > > >>>>>> which is concerning.
> > > >>>>>>
> > > >>>>>> This is a minor inconvenience for us and our customers, as
> > > >>>>>> memory initialization makes our boot times a lot longer than it
> > > >>>>>> would otherwise be. Also, my experience has been that really
> > > >>>>>> long operations often are hiding errors - what you think is
> > > >>>>>> merely a slow operation is actually a timeout of some sort,
> > > >>>>>> often due to misconfiguration. This leads to two
> > > >>>>>> questions:
> > > >>>>>>
> > > >>>>>> 1. Does the long initialization time suggest that there's an
> > > >>>>>> error happening under the covers?
> > > >>>>>> 2. If not, is there any simple way that we can shorten memory
> > > >>>>>> initialization time?
> > > >>>>>>
> > > >>>>>> Thanks in advance for your insights.
> > > >>>>>>
> > > >>>>>> --
> > > >>>>>> Matt Laswell
> > > >>>>>> laswell@infiniteio.com
> > > >>>>>> infinite io, inc.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Hello,
> > > >>>>>
> > > >>>>> please find some quick comments on the questions:
> > > >>>>> 1.) By our experience long initialization time is normal in case
> > > >>>>> of large amount of memory. However this time depends on some
> > > >>>>> things:
> > > >>>>> - number of hugepages (pagefault handled by kernel is pretty
> > > >>>>> expensive)
> > > >>>>> - size of hugepages (memset at initialization)
> > > >>>>>
> > > >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> > > >>>>> time significantly. Using wmemset instead of memset adds an
> > > >>>>> additional 20-30% boost by our measurements. Or, just by
> > > >>>>> touching the pages but not cleaning them you can have still some
> > > >>>>> more speedup. But in this case your layer or the applications
> > > >>>>> above need to do the cleanup at allocation time (e.g. by using
> > > >>>>> rte_zmalloc).
> > > >>>>>
> > > >>>>> Cheers,
> > > >>>>> &rew
> > > >>>>
> > > >>>> I wonder if the whole rte_malloc code is even worth it with a
> > > >>>> modern kernel with transparent huge pages? rte_malloc adds very
> > > >>>> little value and is less safe and slower than glibc or other
> > > >>>> allocators. Plus you lose the ability to get all the benefit out of
> > > >>>> valgrind or electric fence.
> > > >>>
> > > >>> While I'd dearly love to not have our own custom malloc lib to
> > > >>> maintain, for DPDK multiprocess, rte_malloc will be hard to
> > > >>> replace as we would need a replacement solution that similarly
> > > >>> guarantees that memory mapped in process A is also available at
> > > >>> the same address in process B. :-(
> > > >>>
> > > >> Just out of curiosity, why even bother with multiprocess support?
> > > >> What you're talking about above is a multithread model, and you're
> > > >> shoehorning multiple processes into it.
> > > >> Neil
> > > >>
> > > >
> > > > Yep, that's pretty much what it is alright. However, this
> > > > multiprocess support is very widely used by our customers in
> > > > building their applications, and has been in place and supported
> > > > since some of the earliest DPDK releases. If it is to be removed, it
> > > > needs to be replaced by something that provides equivalent
> > > > capabilities to application writers (perhaps something with more
> > > > fine-grained sharing
> > > > etc.)
> > > >
> > > > /Bruce
> > > >
> > >
> > > It is probably time to start discussing how to pull in our multi
> > > process and memory management improvements we were talking about in
> > > our DPDK Summit presentation:
> > > https://www.youtube.com/watch?v=907VShi799k#t=647
> > >
> > > Multi-process model could have several benefits mostly in the high
> > > availability area (telco requirement) due to better separation,
> > > controlling permissions (per process RO or RW page mappings), single
> > > process restartability, improved startup and core dumping time etc.
> > >
> > > As a summary of our memory management additions, it allows an
> > > application to describe their memory model in a configuration (or via
> > > an API), e.g. a simplified config would say that every instance will
> > > need 4GB private memory and 2GB shared memory. In a multi process
> > > model this will result mapping only 6GB memory in each process instead
> > > of the current DPDK model where the 4GB per process private memory is
> > > mapped into all other processes resulting in unnecessary mappings, e.g.
> > > 16x4GB + 2GB in every process.
> > >
> > > What we've chosen is to use DPDK's NUMA aware allocator for this
> > > purpose, e.g. the above example for 16 instances will result
> > > allocating
> > > 17 DPDK NUMA sockets (1 default shared + 16 private) and we can
> > > selectively map a given "NUMA socket" (set of memsegs) into a process.
> > > This also opens many other possibilities to play with, e.g.
> > > - clearing of the full private memory if a process dies including
> > > memzones on it
> > > - pop-up memory support
> > > etc. etc.
> > >
> > > Other option could be to use page aligned memzones and control the
> > > mapping/permissions on a memzone level.
> > >
> > > /Laszlo
> >
> > Those enhancements sound really, really good. Do you have code for these
> > that you can share that we can start looking at with a view to pulling it in?
> >
> > /Bruce
>
> Our approach when started implementing these enhancements was to have
> an additional layer on top of DPDK, so our changes cannot just be pulled in as is
> and unfortunately we do not yet have the permission to share our code.
> However we can share ideas and start discussing what would more interest the
> community and if there is something which we can easily pull in or put on the
> DPDK roadmap.
>
> As mentioned in the presentation we implemented a new EAL layer which we
> also rely on, although this may not be necessary for all our enhancements.
> For example our named memory partition pools ("memdomains") which is the
> base of our selective memory mapping and permission control could either be
> implemented above or below the memzones or DPDK could even be just a user
> of it. Our implementation relies on our new EAL layer, but there may be another
> option to pull this in as a new library which relies on the memzone allocator.
>
> We have a whole set of features with the main goal of environment independency
> and of course performance first mainly focusing on NFV deployments.
> e.g. allowing applications to adopt different environments (without any code change)
> while still getting the highest possible performance.
> The key for this is our new split EAL layer which I think should be the first step to
> start with. This can co-exist with the current linuxapp and bsdapp and would allow
> supporting both Linux and BSD with separate publisher components which could
> be relying on the existing linuxapp/bsdapp code :)
> This new EAL layer would open up many possibilities to play with,
> e.g. expose NUMA in a non-NUMA aware VM, pretend that every CPU is in a new
> NUMA domain, emulate a multi CPU multi socket system on a single CPU etc. etc.
>
> What do you think would be the right way to start these discussions?
> We should probably need to open a new thread on this as it is now not fully related
> to the subject or should we have an internal discussion and then present and discuss
> the ideas in a community call?
> We are working with DPDK since a long time, but new to the community and need to
> understand the ways of working here...
A new thread describing the details of how you have implemented things would be
great.
Thanks,
/Bruce
* Re: [dpdk-dev] A question about hugepage initialization time
2014-12-12 9:59 ` Bruce Richardson
@ 2014-12-12 15:50 ` Thomas Monjalon
0 siblings, 0 replies; 14+ messages in thread
From: Thomas Monjalon @ 2014-12-12 15:50 UTC (permalink / raw)
To: László Vadkerti; +Cc: dev
2014-12-12 09:59, Bruce Richardson:
> On Fri, Dec 12, 2014 at 04:07:40AM +0000, László Vadkerti wrote:
> > On Thu, 11 Dec, 2014, Bruce Richardson wrote:
> > > On Wed, Dec 10, 2014 at 07:16:59PM +0000, László Vadkerti wrote:
> > > >
> > > > On Wed, 10 Dec 2014, Bruce Richardson wrote:
> > > >
> > > > > On Wed, Dec 10, 2014 at 09:29:26AM -0500, Neil Horman wrote:
> > > > >> On Wed, Dec 10, 2014 at 10:32:25AM +0000, Bruce Richardson wrote:
> > > > >>> On Tue, Dec 09, 2014 at 02:10:32PM -0800, Stephen Hemminger wrote:
> > > > >>>> On Tue, 9 Dec 2014 11:45:07 -0800 &rew
> > > > >>>> <andras.kovacs@ericsson.com> wrote:
> > > > >>>>
> > > > >>>>>> Hey Folks,
> > > > >>>>>>
> > > > >>>>>> Our DPDK application deals with very large in memory data
> > > > >>>>>> structures, and can potentially use tens or even hundreds of
> > > > >>>>>> gigabytes of hugepage memory.
> > > > >>>>>> During the course of development, we've noticed that as the
> > > > >>>>>> number of huge pages increases, the memory initialization time
> > > > >>>>>> during EAL init gets to be quite long, lasting several minutes
> > > > >>>>>> at present. The growth in init time doesn't appear to be linear,
> > > > >>>>>> which is concerning.
> > > > >>>>>>
> > > > >>>>>> This is a minor inconvenience for us and our customers, as
> > > > >>>>>> memory initialization makes our boot times a lot longer than it
> > > > >>>>>> would otherwise be. Also, my experience has been that really
> > > > >>>>>> long operations often are hiding errors - what you think is
> > > > >>>>>> merely a slow operation is actually a timeout of some sort,
> > > > >>>>>> often due to misconfiguration. This leads to two
> > > > >>>>>> questions:
> > > > >>>>>>
> > > > >>>>>> 1. Does the long initialization time suggest that there's an
> > > > >>>>>> error happening under the covers?
> > > > >>>>>> 2. If not, is there any simple way that we can shorten memory
> > > > >>>>>> initialization time?
> > > > >>>>>>
> > > > >>>>>> Thanks in advance for your insights.
> > > > >>>>>>
> > > > >>>>>> --
> > > > >>>>>> Matt Laswell
> > > > >>>>>> laswell@infiniteio.com
> > > > >>>>>> infinite io, inc.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> Hello,
> > > > >>>>>
> > > > >>>>> please find some quick comments on the questions:
> > > > >>>>> 1.) By our experience long initialization time is normal in case
> > > > >>>>> of large amount of memory. However this time depends on some
> > > > >>>>> things:
> > > > >>>>> - number of hugepages (pagefault handled by kernel is pretty
> > > > >>>>> expensive)
> > > > >>>>> - size of hugepages (memset at initialization)
> > > > >>>>>
> > > > >>>>> 2.) Using 1G pages instead of 2M will reduce the initialization
> > > > >>>>> time significantly. Using wmemset instead of memset adds an
> > > > >>>>> additional 20-30% boost by our measurements. Or, just by
> > > > >>>>> touching the pages but not cleaning them you can have still some
> > > > >>>>> more speedup. But in this case your layer or the applications
> > > > >>>>> above need to do the cleanup at allocation time (e.g. by using
> > > > >>>>> rte_zmalloc).
> > > > >>>>>
> > > > >>>>> Cheers,
> > > > >>>>> &rew
> > > > >>>>
> > > > >>>> I wonder if the whole rte_malloc code is even worth it with a
> > > > >>>> modern kernel with transparent huge pages? rte_malloc adds very
> > > > >>>> little value and is less safe and slower than glibc or other
> > > > >>>> allocators. Plus you lose the ability to get all the benefit out of
> > > > >>>> valgrind or electric fence.
> > > > >>>
> > > > >>> While I'd dearly love to not have our own custom malloc lib to
> > > > >>> maintain, for DPDK multiprocess, rte_malloc will be hard to
> > > > >>> replace as we would need a replacement solution that similarly
> > > > >>> guarantees that memory mapped in process A is also available at
> > > > >>> the same address in process B. :-(
> > > > >>>
> > > > >> Just out of curiosity, why even bother with multiprocess support?
> > > > >> What you're talking about above is a multithread model, and you're
> > > > >> shoehorning multiple processes into it.
> > > > >> Neil
> > > > >>
> > > > >
> > > > > Yep, that's pretty much what it is alright. However, this
> > > > > multiprocess support is very widely used by our customers in
> > > > > building their applications, and has been in place and supported
> > > > > since some of the earliest DPDK releases. If it is to be removed, it
> > > > > needs to be replaced by something that provides equivalent
> > > > > capabilities to application writers (perhaps something with more
> > > > > fine-grained sharing
> > > > > etc.)
> > > > >
> > > > > /Bruce
> > > > >
> > > >
> > > > It is probably time to start discussing how to pull in our multi
> > > > process and memory management improvements we were talking about in
> > > > our DPDK Summit presentation:
> > > > https://www.youtube.com/watch?v=907VShi799k#t=647
> > > >
> > > > Multi-process model could have several benefits mostly in the high
> > > > availability area (telco requirement) due to better separation,
> > > > controlling permissions (per process RO or RW page mappings), single
> > > > process restartability, improved startup and core dumping time etc.
> > > >
> > > > As a summary of our memory management additions, it allows an
> > > > application to describe their memory model in a configuration (or via
> > > > an API), e.g. a simplified config would say that every instance will
> > > > need 4GB private memory and 2GB shared memory. In a multi process
> > > > model this will result mapping only 6GB memory in each process instead
> > > > of the current DPDK model where the 4GB per process private memory is
> > > > mapped into all other processes resulting in unnecessary mappings, e.g.
> > > > 16x4GB + 2GB in every process.
> > > >
> > > > What we've chosen is to use DPDK's NUMA aware allocator for this
> > > > purpose, e.g. the above example for 16 instances will result
> > > > allocating
> > > > 17 DPDK NUMA sockets (1 default shared + 16 private) and we can
> > > > selectively map a given "NUMA socket" (set of memsegs) into a process.
> > > > This also opens many other possibilities to play with, e.g.
> > > > - clearing of the full private memory if a process dies including
> > > > memzones on it
> > > > - pop-up memory support
> > > > etc. etc.
> > > >
> > > > Other option could be to use page aligned memzones and control the
> > > > mapping/permissions on a memzone level.
> > > >
> > > > /Laszlo
> > >
> > > Those enhancements sound really, really good. Do you have code for these
> > > that you can share that we can start looking at with a view to pulling it in?
> > >
> > > /Bruce
> >
> > Our approach when started implementing these enhancements was to have
> > an additional layer on top of DPDK, so our changes cannot just be pulled in as is
> > and unfortunately we do not yet have the permission to share our code.
> > However we can share ideas and start discussing what would more interest the
> > community and if there is something which we can easily pull in or put on the
> > DPDK roadmap.
> >
> > As mentioned in the presentation we implemented a new EAL layer which we
> > also rely on, although this may not be necessary for all our enhancements.
> > For example our named memory partition pools ("memdomains") which is the
> > base of our selective memory mapping and permission control could either be
> > implemented above or below the memzones or DPDK could even be just a user
> > of it. Our implementation relies on our new EAL layer, but there may be another
> > option to pull this in as a new library which relies on the memzone allocator.
> >
> > We have a whole set of features with the main goal of environment independency
> > and of course performance first mainly focusing on NFV deployments.
> > e.g. allowing applications to adopt different environments (without any code change)
> > while still getting the highest possible performance.
> > The key for this is our new split EAL layer which I think should be the first step to
> > start with. This can co-exist with the current linuxapp and bsdapp and would allow
> > supporting both Linux and BSD with separate publisher components which could
> > be relying on the existing linuxapp/bsdapp code :)
> > This new EAL layer would open up many possibilities to play with,
> > e.g. expose NUMA in a non-NUMA aware VM, pretend that every CPU is in a new
> > NUMA domain, emulate a multi CPU multi socket system on a single CPU etc. etc.
> >
> > What do you think would be the right way to start these discussions?
> > We should probably need to open a new thread on this as it is now not fully related
> > to the subject or should we have an internal discussion and then present and discuss
> > the ideas in a community call?
> > We are working with DPDK since a long time, but new to the community and need to
> > understand the ways of working here...
>
> A new thread describing the details of how you have implemented things would be
> great.
+1
Please also explain which problems you are trying to solve.
Some of your constraints may not apply here, so the implementation
could be different.
If your work can be split into different features, it may be easier to discuss
each feature in a separate thread.
Thank you
--
Thomas
Thread overview: 14+ messages
2014-12-09 16:33 [dpdk-dev] A question about hugepage initialization time Matt Laswell
2014-12-09 16:50 ` Burakov, Anatoly
2014-12-09 19:06 ` Matthew Hall
2014-12-09 22:05 ` Matt Laswell
2014-12-09 19:45 ` &rew
2014-12-09 22:10 ` Stephen Hemminger
2014-12-10 10:32 ` Bruce Richardson
2014-12-10 14:29 ` Neil Horman
2014-12-10 14:35 ` Bruce Richardson
2014-12-10 19:16 ` László Vadkerti
2014-12-11 10:14 ` Bruce Richardson
2014-12-12 4:07 ` László Vadkerti
2014-12-12 9:59 ` Bruce Richardson
2014-12-12 15:50 ` Thomas Monjalon