DPDK patches and discussions
Subject: Azure(hyperv) hugepages issues with Mellanox NICs(mlx5)
Date: 2024-02-07 12:40 UTC
From: Vladimir Ratnikov
  To: dev


Hello!

We are observing a problem with hugepages in an Azure environment with
Mellanox devices (currently MT27710 ConnectX-4) that use the mlx5 PMD.
We use a DPDK-based application, but we see exactly the same issue when
running testpmd for debugging: after each restart of the process, two
hugepages (we have two Mellanox NICs) disappear from the pool, until
the pool is completely exhausted. We track this via /proc/meminfo.
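
For reference, this is roughly how we snapshot the counters around each
restart (a minimal sketch; it just prints the HugePages_* lines from
/proc/meminfo):

  /* Minimal sketch: dump the HugePages_* counters from /proc/meminfo.
   * We take this before and after each application restart. */
  #include <stdio.h>
  #include <string.h>

  static void dump_hugepage_counters(void)
  {
      char line[256];
      FILE *f = fopen("/proc/meminfo", "r");

      if (f == NULL)
          return;
      while (fgets(line, sizeof(line), f) != NULL) {
          if (strncmp(line, "HugePages_", 10) == 0)
              fputs(line, stdout); /* Total/Free/Rsvd/Surp */
      }
      fclose(f);
  }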

This is very strange behavior: when a user-space application exits, its
hugepages should be freed in any case (except when files are kept on
the hugetlbfs mount). This suggests that some kernel process may still
be holding those pages. We tried rmmod/modprobe of all related modules,
which did not help, and at this point there is nothing left on the
hugetlbfs mount.
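
For completeness, this is the kind of check we run after the
application exits (a sketch; the /dev/hugepages path is an assumption,
adjust it to your actual mount point). In our case the directory is
empty, yet HugePages_Free still shrinks:

  /* Sketch: list anything left behind on the hugetlbfs mount after the
   * application exits, e.g. list_hugetlbfs_leftovers("/dev/hugepages").
   * The /dev/hugepages path is an assumption. */
  #include <dirent.h>
  #include <stdio.h>

  static void list_hugetlbfs_leftovers(const char *mnt)
  {
      DIR *d = opendir(mnt);
      struct dirent *de;

      if (d == NULL)
          return;
      while ((de = readdir(d)) != NULL) {
          if (de->d_name[0] != '.') /* skip "." and ".." */
              printf("leftover: %s/%s\n", mnt, de->d_name);
      }
      closedir(d);
  }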

We found that one of the places where the issue appears is the
mlx5_malloc() function, where ~2 MB of memory is allocated while the
device is being created.

Stack trace is:

#0  mlx5_malloc (flags=4, size=2097168, align=64, socket=-1) at ../src-dpdk/drivers/common/mlx5/mlx5_malloc.c:174
#1  0x00007fffae258cad in _mlx5_ipool_malloc_cache (pool=0xac036ac40, cidx=0, idx=0x7fffa8759e90) at ../src-dpdk/drivers/net/mlx5/mlx5_utils.c:410
#2  0x00007fffae258e42 in mlx5_ipool_malloc_cache (pool=0xac036ac40, idx=0x7fffa8759e90) at ../src-dpdk/drivers/net/mlx5/mlx5_utils.c:441
#3  0x00007fffae259208 in mlx5_ipool_malloc (pool=0xac036ac40, idx=0x7fffa8759e90) at ../src-dpdk/drivers/net/mlx5/mlx5_utils.c:521
#4  0x00007fffae2593d0 in mlx5_ipool_zmalloc (pool=0xac036ac40, idx=0x7fffa8759e90) at ../src-dpdk/drivers/net/mlx5/mlx5_utils.c:575
#5  0x00007fffabbb1b73 in flow_dv_discover_priorities (dev=0x7fffaf97ae80 <rte_eth_devices>, vprio=0x7fffaf1e492e <vprio>, vprio_n=2) at ../src-dpdk/drivers/net/mlx5/mlx5_flow_dv.c:19706
#6  0x00007fffabb4d126 in mlx5_flow_discover_priorities (dev=0x7fffaf97ae80 <rte_eth_devices>) at ../src-dpdk/drivers/net/mlx5/mlx5_flow.c:11963
#7  0x00007fffae52cf48 in mlx5_dev_spawn (dpdk_dev=0x55555575bca0, spawn=0xac03d7340, eth_da=0x7fffa875a1d0, mkvlist=0x0) at ../src-dpdk/drivers/net/mlx5/linux/mlx5_os.c:1649
#8  0x00007fffae52f1a3 in mlx5_os_pci_probe_pf (cdev=0xac03d9d80, req_eth_da=0x7fffa875a300, owner_id=0, mkvlist=0x0) at ../src-dpdk/drivers/net/mlx5/linux/mlx5_os.c:2348
#9  0x00007fffae52f91f in mlx5_os_pci_probe (cdev=0xac03d9d80, mkvlist=0x0) at ../src-dpdk/drivers/net/mlx5/linux/mlx5_os.c:2497
#10 0x00007fffae52fd09 in mlx5_os_net_probe (cdev=0xac03d9d80, mkvlist=0x0) at ../src-dpdk/drivers/net/mlx5/linux/mlx5_os.c:2578
#11 0x00007fffa93f297b in drivers_probe (cdev=0xac03d9d80, user_classes=1, mkvlist=0x0) at ../src-dpdk/drivers/common/mlx5/mlx5_common.c:937
#12 0x00007fffa93f2c95 in mlx5_common_dev_probe (eal_dev=0x55555575bca0) at ../src-dpdk/drivers/common/mlx5/mlx5_common.c:1027
#13 0x00007fffa94105b3 in mlx5_common_pci_probe (pci_drv=0x7fffaf492680 <mlx5_common_pci_driver>, pci_dev=0x55555575bc90) at ../src-dpdk/drivers/common/mlx5/mlx5_common_pci.c:168
#14 0x00007fffa9297950 in rte_pci_probe_one_driver (dr=0x7fffaf492680 <mlx5_common_pci_driver>, dev=0x55555575bc90) at ../src-dpdk/drivers/bus/pci/pci_common.c:312
#15 0x00007fffa9297be4 in pci_probe_all_drivers (dev=0x55555575bc90) at ../src-dpdk/drivers/bus/pci/pci_common.c:396
#16 0x00007fffa9297c6d in pci_probe () at ../src-dpdk/drivers/bus/pci/pci_common.c:423
#17 0x00007fffa9c8f551 in rte_bus_probe () at ../src-dpdk/lib/eal/common/eal_common_bus.c:78
#18 0x00007fffa9cd80c2 in rte_eal_init (argc=7, argv=0x7fffbba74ff8) at ../src-dpdk/lib/eal/linux/eal.c:1300
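
For context, our rough reading of this allocation path (a simplified
sketch of the idea only, not the actual mlx5_malloc.c code) is that
with the default sys_mem_en=0 such allocations are served from RTE
memory, i.e. from the hugepage pool, which is why we started looking
at this call site:

  /* Simplified sketch of how we understand the allocation is dispatched
   * (not the actual drivers/common/mlx5/mlx5_malloc.c code): with system
   * memory disabled (the default), memory comes from RTE, i.e. from the
   * hugepage pool; otherwise from plain system memory (zeroing omitted
   * here for brevity). */
  #include <stdlib.h>
  #include <rte_malloc.h>

  static int sys_mem_enabled; /* corresponds to the sys_mem_en devarg */

  static void *alloc_like_mlx5(size_t size, unsigned int align, int socket)
  {
      void *p = NULL;

      if (!sys_mem_enabled)
          return rte_zmalloc_socket(NULL, size, align, socket);
      if (posix_memalign(&p, align ? align : sizeof(void *), size) != 0)
          return NULL;
      return p;
  }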


We found that the mlx5 PMD has a devarg, `sys_mem_en=1`, which makes
the PMD allocate from system memory instead of RTE memory. It helped
somewhat: now only one hugepage goes missing after each restart.
We also tried to reproduce the issue on bare-metal hardware, and it is
fine there (slightly different NICs, MT27800, but the same mlx5 PMD).
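
For reference, we enable it through the EAL allow-list devargs, roughly
as below (a sketch; the PCI address is a placeholder for our Azure VF).
With testpmd the equivalent is passing `-a <bdf>,sys_mem_en=1`:

  /* Sketch: how we pass sys_mem_en=1 to the mlx5 PMD through the EAL
   * allow-list. The PCI address below is a placeholder, not our real
   * device. */
  #include <rte_eal.h>

  int main(void)
  {
      char *argv[] = {
          "app", "-l", "0-1",
          "-a", "0002:00:02.0,sys_mem_en=1", /* placeholder PCI address */
      };
      int argc = (int)(sizeof(argv) / sizeof(argv[0]));

      if (rte_eal_init(argc, argv) < 0)
          return 1;
      /* ... probe and run as usual, then tear down on exit ... */
      return rte_eal_cleanup();
  }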

So we suspect this is somehow related to the Azure (Hyper-V?)
environment. Has anyone observed the same issue?

Many thanks!
