DPDK usage discussions
* Increase DPDK's virtual memory to more than 512 GiB
@ 2025-02-19 14:22 Lucas
  2025-02-19 15:03 ` Stephen Hemminger
  2025-02-19 22:28 ` Dmitry Kozlyuk
  0 siblings, 2 replies; 8+ messages in thread
From: Lucas @ 2025-02-19 14:22 UTC (permalink / raw)
  To: users

[-- Attachment #1: Type: text/plain, Size: 1247 bytes --]

Hello,

I am creating an application where I want to cache packets in a ring for a
certain duration. To do so at high speed (100 Gb/s) for tens of seconds, I
would like to create a mempool with a size bigger than 500 GiB.
I know that in rte_config.h, we have this excerpt:
```
#define RTE_MAX_MEMSEG_LISTS 128
...
#define RTE_MAX_MEM_MB_PER_LIST 32768
```
That gives 4'194'304 MiB of addressable memory, which is more than enough.
However, there is a virtual memory limit on DPDK processes on x86_64, which
appears to be 512 GiB.
At https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html it is
listed that RTE_MAX_MEM_MB controls that limit.
In the meson.build file in the config directory, we have:
```
if dpdk_conf.get('RTE_ARCH_64')
    dpdk_conf.set('RTE_MAX_MEM_MB', 524288)
else # for 32-bit we need smaller reserved memory areas
    dpdk_conf.set('RTE_MAX_MEM_MB', 2048)
endif
```
Changing this value does not seem to change the amount of virtual memory
that DPDK allocates. It appears that no headers or C-files actually
reference this value.
What would I need to change to allow more virtual memory than 512 GiB and
be able to allocate a mempool with a size bigger than that?

Thank you in advance for any pointers!

Kind regards,
Lucas Crijns

[-- Attachment #2: Type: text/html, Size: 1634 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Increase DPDK's virtual memory to more than 512 GiB
  2025-02-19 14:22 Increase DPDK's virtual memory to more than 512 GiB Lucas
@ 2025-02-19 15:03 ` Stephen Hemminger
  2025-02-19 22:28 ` Dmitry Kozlyuk
  1 sibling, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2025-02-19 15:03 UTC (permalink / raw)
  To: Lucas; +Cc: users

On Wed, 19 Feb 2025 15:22:46 +0100
Lucas <lucascrijns@gmail.com> wrote:

> Hello,
> 
> I am creating an application where I want to cache packets in a ring for a
> certain duration. To do so at high speed (100 Gb/s) for tens of seconds, I
> would like to create a mempool with a size bigger than 500 GiB.
> I know that in rte_config.h, we have this excerpt:
> ```
> #define RTE_MAX_MEMSEG_LISTS 128
> ...
> #define RTE_MAX_MEM_MB_PER_LIST 32768
> ```
> That gives 4'194'304 MiB of addressable memory, which is more than enough.
> However, there is a virtual memory limit on DPDK processes on x86_64, which
> appears to be 512 GiB.
> At https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html it is
> listed that RTE_MAX_MEM_MB controls that limit.
> In the meson.build file in the config directory, we have:
> ```
> if dpdk_conf.get('RTE_ARCH_64')
>     dpdk_conf.set('RTE_MAX_MEM_MB', 524288)
> else # for 32-bit we need smaller reserved memory areas
>     dpdk_conf.set('RTE_MAX_MEM_MB', 2048)
> endif
> ```
> Changing this value does not seem to change the amount of virtual memory
> that DPDK allocates. It appears that no headers or C-files actually
> reference this value.
> What would I need to change to allow more virtual memory than 512 GiB and
> be able to allocate a mempool with a size bigger than that?
> 
> Thank you in advance for any pointers!
> 
> Kind regards,
> Lucas Crijns

Mempools come from huge pages. Do you have that many huge pages available?


* Re: Increase DPDK's virtual memory to more than 512 GiB
  2025-02-19 14:22 Increase DPDK's virtual memory to more than 512 GiB Lucas
  2025-02-19 15:03 ` Stephen Hemminger
@ 2025-02-19 22:28 ` Dmitry Kozlyuk
       [not found]   ` <CADTTPn-zDTotOsCsnkuEQiRDdEfuLGVDo1hmuXetgGLd+TTP6Q@mail.gmail.com>
  1 sibling, 1 reply; 8+ messages in thread
From: Dmitry Kozlyuk @ 2025-02-19 22:28 UTC (permalink / raw)
  To: Lucas; +Cc: users

2025-02-19 15:22 (UTC+0100), Lucas:
> I am creating an application where I want to cache packets in a ring for a
> certain duration. To do so at high speed (100 Gb/s) for tens of seconds, I
> would like to create a mempool with a size bigger than 500 GiB.
> I know that in rte_config.h, we have this excerpt:
> ```
> #define RTE_MAX_MEMSEG_LISTS 128
> ...
> #define RTE_MAX_MEM_MB_PER_LIST 32768
> ```
> That gives 4'194'304 MiB of addressable memory, which is more than enough.
> However, there is a virtual memory limit on DPDK processes on x86_64, which
> appears to be 512 GiB.
> At https://doc.dpdk.org/guides/prog_guide/env_abstraction_layer.html it is
> listed that RTE_MAX_MEM_MB controls that limit.
> In the meson.build file in the config directory, we have:
> ```
> if dpdk_conf.get('RTE_ARCH_64')
>     dpdk_conf.set('RTE_MAX_MEM_MB', 524288)
> else # for 32-bit we need smaller reserved memory areas
>     dpdk_conf.set('RTE_MAX_MEM_MB', 2048)
> endif
> ```
> Changing this value does not seem to change the amount of virtual memory
> that DPDK allocates. It appears that no headers or C-files actually
> reference this value.

https://elixir.bootlin.com/dpdk/v24.11.1/source/lib/eal/common/eal_common_dynmem.c#L113

> What would I need to change to allow more virtual memory than 512 GiB and
> be able to allocate a mempool with a size bigger than that?

Increase RTE_MAX_MEM_MB_PER_LIST and RTE_MAX_MEM_MB_PER_TYPE to 512 GB.
See the big comment in the linked function code.
Note that RTE_MAX_MEM_MB must accommodate all memory types,
e.g. for a 2-processor x86_64 system with 2M and 1G hugepages supported
it must be 2 NUMA nodes x 2 hugepage sizes x 512 GB = 2048 GB.


* Re: Increase DPDK's virtual memory to more than 512 GiB
       [not found]   ` <CADTTPn-zDTotOsCsnkuEQiRDdEfuLGVDo1hmuXetgGLd+TTP6Q@mail.gmail.com>
@ 2025-02-20 11:21     ` Dmitry Kozlyuk
  2025-02-20 14:19       ` Lucas
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Kozlyuk @ 2025-02-20 11:21 UTC (permalink / raw)
  To: Lucas; +Cc: users

Hi Lucas,

Please don't send private replies to discussions in the mailing list.

2025-02-20 12:00 (UTC+0100), Lucas:
> Hi Dmitry,
> 
> Thank you for your detailed instructions.
> I have followed them in the following way:
> 
>    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_LIST to 524288
>    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_TYPE to 524288
>    - To change RTE_MAX_MEM_MB, I have to change the configure set in the
>    meson build system. To do so, I changed
>    https://elixir.bootlin.com/dpdk/v24.11.1/source/config/meson.build#L362
>    I replaced "dpdk_conf.set('RTE_MAX_MEM_MB', 524288)" with
>    "dpdk_conf.set('RTE_MAX_MEM_MB', 8388608)". As I have 8 NUMA nodes, with 2
>    hugepage sizes: 8 NUMA nodes × 2 hugepage sizes × 512 GiB = 8192 GiB
> 
> With these changes, my program can create a mempool with  275'735'294
> MBUFS, comprising 2'176 (bytes of MBUF size) × 275'735'294 =
> 599'999'999'744 bytes ~ 600 GB, but fails later, as I also need an extra
> rte_ring to hold pointers to the MBUFs. In HTOP, a virtual memory size of
> 4'097G is reported.
> However, with a smaller amount, 229'779'411 MBUFs, it works (i.e. 500 GB).
> And I can also allocate a ring of the same size (I use RING_F_EXACT_SZ, so
> in reality it is more).
> 
> I have tried increasing the limits further to be able to allocate more:
> 
>    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_LIST to 1048576
>    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_TYPE to 1048576
>    - Set RTE_MAX_MEM_MB = 16'777'216
> 
> The virtual memory allocated is now 8192G, 8192G, and I can allocate 600 GB
> for the mempool (275'735'294 MBUFs), but the ring allocation fails (fails
> with 'Cannot allocate memory'). Allocation of the mempool now takes seven
> minutes.

It would be interesting to profile this once your issue is resolved.
I expect that hugepage allocation takes only about 10 seconds of this,
while the rest is mempool initialization.

> How would I be able to also allocate a ring of the same size?
> I verified that the number of MBUFs I need rounded up to the next power of
> 2 is still smaller than the max size of an unsigned int on my platform
> (x86_64, so 32 bit unsigned int).
> I have 670 hugepages of 1 GB available. Is this too little? In principle
> the ring takes 64 bits × # entries = memory. In this case, that would be:
> next power of 2 is 536'870'912 × 8 bytes (64 bits) = 4.3 GB.
> With 670 GB available, and roughly 600 GB for the mempool, this should fit.
> Could it be that supporting structures take the rest of the memory?

Mempool adds headers and padding to objects within,
so it probably takes more memory than calculated.
You can use rte_mempool_mem_iter() and rte_mempool_calc_obj_size() to check.
You can check exact memory usage with rte_malloc_get_socket_stats().


* Re: Increase DPDK's virtual memory to more than 512 GiB
  2025-02-20 11:21     ` Dmitry Kozlyuk
@ 2025-02-20 14:19       ` Lucas
  2025-02-20 14:55         ` Dmitry Kozlyuk
  0 siblings, 1 reply; 8+ messages in thread
From: Lucas @ 2025-02-20 14:19 UTC (permalink / raw)
  To: users

[-- Attachment #1: Type: text/plain, Size: 4228 bytes --]

Hi Dmitry,

My apologies for the private reply. I am quite new to the mailing list.
I will do a profile later of the allocation.

As for the issue, at 550 GB so 252'757'352 MBUFs (using default mbuf buf
size), it works now, with ring allocation. However, the physical memory
usage now goes up a lot and I end up swapping. It appears that not all
memory is in hugepages (not all pages are filled) and that perhaps the
kernel also allocates more memory. I have 755 GiB RAM available, so 600 GB
of mempool is pushing it.
I realise now that I also have some private data in the mempool, so the
figure of 550 GB is plainly wrong. In reality, one object is:

   - rte_mempool_calc_obj_size gives: total_size = 2240 bytes
   - Private data per mbuf (alignment included) is:  48 bytes

So actual memory consumption is: 252'757'352 MBUFs × (48 + 2240 bytes) =
578'308'821'376 bytes ~ 578 GB
That is at least 28 GB more.

I now fixed my program to address this issue and when requesting 500 GB, it
will take the private data and headroom into account.

I will update later with some memory statistics and a profile.


On Thu, Feb 20, 2025 at 12:21 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
wrote:

> Hi Lucas,
>
> Please don't send private replies to discussions in the mailing list.
>
> 2025-02-20 12:00 (UTC+0100), Lucas:
> > Hi Dmitry,
> >
> > Thank you for your detailed instructions.
> > I have followed them in the following way:
> >
> >    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_LIST to 524288
> >    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_TYPE to 524288
> >    - To change RTE_MAX_MEM_MB, I have to change the configure set in the
> >    meson build system. To do so, I changed
> >
> https://elixir.bootlin.com/dpdk/v24.11.1/source/config/meson.build#L362
> >    I replaced "dpdk_conf.set('RTE_MAX_MEM_MB', 524288)" with
> >    "dpdk_conf.set('RTE_MAX_MEM_MB', 8388608)". As I have 8 NUMA nodes,
> with 2
> >    hugepage sizes: 8 NUMA nodes × 2 hugepage sizes × 512 GiB = 8192 GiB
> >
> > With these changes, my program can create a mempool with  275'735'294
> > MBUFS, comprising 2'176 (bytes of MBUF size) × 275'735'294 =
> > 599'999'999'744 bytes ~ 600 GB, but fails later, as I also need an extra
> > rte_ring to hold pointers to the MBUFs. In HTOP, a virtual memory size of
> > 4'097G is reported.
> > However, with a smaller amount, 229'779'411 MBUFs, it works (i.e. 500
> GB).
> > And I can also allocate a ring of the same size (I use RING_F_EXACT_SZ,
> so
> > in reality it is more).
> >
> > I have tried increasing the limits further to be able to allocate more:
> >
> >    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_LIST to 1048576
> >    - In config/rte_config.h set RTE_MAX_MEM_MB_PER_TYPE to 1048576
> >    - Set RTE_MAX_MEM_MB = 16'777'216
> >
> > The virtual memory allocated is now 8192G, 8192G, and I can allocate 600
> GB
> > for the mempool (275'735'294 MBUFs), but the ring allocation fails (fails
> > with 'Cannot allocate memory'). Allocation of the mempool now takes seven
> > minutes.
>
> It would be interesting to profile this once your issue is resolved.
> I expect that hugepage allocation takes only about 10 seconds of this,
> while the rest is mempool initialization.
>
> > How would I be able to also allocate a ring of the same size?
> > I verified that the number of MBUFs I need rounded up to the next power
> of
> > 2 is still smaller than the max size of an unsigned int on my platform
> > (x86_64, so 32 bit unsigned int).
> > I have 670 hugepages of 1 GB available. Is this too little? In principle
> > the ring takes 64 bits × # entries = memory. In this case, that would be:
> > next power of 2 is 536'870'912 × 8 bytes (64 bits) = 4.3 GB.
> > With 670 GB available, and roughly 600 GB for the mempool, this should
> fit.
> > Could it be that supporting structures take the rest of the memory?
>
> Mempool adds headers and padding to objects within,
> so it probably takes more memory than calculated.
> You can use rte_mempool_mem_iter() and rte_mempool_calc_obj_size() to
> check.
> You can check exact memory usage with rte_malloc_get_socket_stats().
>

[-- Attachment #2: Type: text/html, Size: 5219 bytes --]


* Re: Increase DPDK's virtual memory to more than 512 GiB
  2025-02-20 14:19       ` Lucas
@ 2025-02-20 14:55         ` Dmitry Kozlyuk
  2025-03-03 14:52           ` Lucas
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Kozlyuk @ 2025-02-20 14:55 UTC (permalink / raw)
  To: Lucas; +Cc: users

2025-02-20 15:19 (UTC+0100), Lucas:
> As for the issue, at 550 GB so 252'757'352 MBUFs (using default mbuf buf
> size), it works now, with ring allocation. However, the physical memory
> usage now goes up a lot and I end up swapping. It appears that not all
> memory is in hugepages (not all pages are filled) and that perhaps the
> kernel also allocates more memory. I have 755 GiB RAM available, so 600 GB
> of mempool is pushing it.

How is your system configured to reserve hugepages?
DPDK always allocates hugepages; if none are available, allocation fails.
Hugepages never swap AFAIK.


* Re: Increase DPDK's virtual memory to more than 512 GiB
  2025-02-20 14:55         ` Dmitry Kozlyuk
@ 2025-03-03 14:52           ` Lucas
  2025-03-03 22:05             ` Dmitry Kozlyuk
  0 siblings, 1 reply; 8+ messages in thread
From: Lucas @ 2025-03-03 14:52 UTC (permalink / raw)
  To: users


[-- Attachment #1.1: Type: text/plain, Size: 1327 bytes --]

Hi Dmitry,

Excuse my late reply.
My system is configured with 1G huge page size upon boot and then later on
I issue `sysctl -w vm.nr_hugepages=700`
It appears that the swapping was caused by some daemon that decided to
start up and allocate quite some memory. I no longer observe this effect.
I included a FlameGraph (recorded with perf) from program startup that does
a memory allocation for 500 GB (MBUFS + priv data): mempool allocation
with 218'531'468 MBUFs.
Most time is spent in mmap and in memsets.

Let me know what you think and perhaps if there are ways to improve the
loading time.

On Thu, Feb 20, 2025 at 3:55 PM Dmitry Kozlyuk <dmitry.kozliuk@gmail.com>
wrote:

> 2025-02-20 15:19 (UTC+0100), Lucas:
> > As for the issue, at 550 GB so 252'757'352 MBUFs (using default mbuf buf
> > size), it works now, with ring allocation. However, the physical memory
> > usage now goes up a lot and I end up swapping. It appears that not all
> > memory is in hugepages (not all pages are filled) and that perhaps the
> > kernel also allocates more memory. I have 755 GiB RAM available, so 600
> GB
> > of mempool is pushing it.
>
> How is your system configured to reserve hugepages?
> DPDK always allocates hugepages; if none are available, allocation fails.
> Hugepages never swap AFAIK.
>

[-- Attachment #1.2: Type: text/html, Size: 1775 bytes --]

[-- Attachment #2: startup_flamegraph.svg --]
[-- Type: image/svg+xml, Size: 71899 bytes --]


* Re: Increase DPDK's virtual memory to more than 512 GiB
  2025-03-03 14:52           ` Lucas
@ 2025-03-03 22:05             ` Dmitry Kozlyuk
  0 siblings, 0 replies; 8+ messages in thread
From: Dmitry Kozlyuk @ 2025-03-03 22:05 UTC (permalink / raw)
  To: Lucas; +Cc: users

2025-03-03 15:52 (UTC+0100), Lucas:
> Hi Dmitry,
> 
> Excuse my late reply.
> My system is configured with 1G huge page size upon boot and then later on
> I issue `sysctl -w vm.nr_hugepages=700`
> It appears that the swapping was caused by some daemon that decided to
> start up and allocate quite some memory. I no longer observe this effect.
> I included a FlameGraph (recorded with perf) from program startup that does
> a memory allocation for 500 GB (MBUFS + priv data): mempool allocation
> with 218'531'468 MBUFs.
> Most time is spent in mmap and in memsets.
> 
> Let me know what you think and perhaps if there are ways to improve the
> loading time.

Thanks for the profile, Lucas.

[1] suggests that allocation would take about (500 / 32) * 2.16 = 33.75 sec.
You can run "malloc_perf_autotest" in "dpdk-test" to estimate for your HW.
I don't know any simple and secure way to speed up the kernel part.
Restart can be made faster as [1] suggests, but not the first start.

Try investigating other parts of the profile that take place in user space.

[1]: https://inbox.dpdk.org/dev/20211230143744.3550098-1-dkozlyuk@nvidia.com/


end of thread, other threads:[~2025-03-03 22:05 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
2025-02-19 14:22 Increase DPDK's virtual memory to more than 512 GiB Lucas
2025-02-19 15:03 ` Stephen Hemminger
2025-02-19 22:28 ` Dmitry Kozlyuk
     [not found]   ` <CADTTPn-zDTotOsCsnkuEQiRDdEfuLGVDo1hmuXetgGLd+TTP6Q@mail.gmail.com>
2025-02-20 11:21     ` Dmitry Kozlyuk
2025-02-20 14:19       ` Lucas
2025-02-20 14:55         ` Dmitry Kozlyuk
2025-03-03 14:52           ` Lucas
2025-03-03 22:05             ` Dmitry Kozlyuk
