Hi Michal,

I'll "top post" on this reply as the content is in HTML format below. In future, please try to send plain-text emails to DPDK mailing lists.

Regarding the issue you're having, it's interesting that allocating from hugepage-backed memory "solves" the problem, even when going back to the lower traffic rate.

The main difference for a CPU between accessing hugepage-backed and 4K-page-backed memory is the DTLB[1] pressure. In your scenario, both page sizes work equally well at the start (no drops). This is likely because all buffers are being accessed linearly and there are no packet drops, resulting in good re-use of buffers.

Let's discuss the 4K page scenario:

When the rate is turned up, the CPU(s) cannot keep up and packets are dropped. This results in NIC rx descriptor rings being totally full of used packets, and the mempools that contain the buffers become more "fragmented", in that the buffers in use no longer sit together on the same few 4K pages. In the worst case, each mbuf could be on a _different_ 4K page!

I think that when turning down the rate again, the fragmentation of mbufs in the mempool remains, resulting in continued loss of packets.

Estimating and talking is never conclusive, so let's measure using the Linux "perf" tool. Run this command 3x, once for each of the drop stats you posted below. I expect to see lower dTLB-load-misses on the first run (no drops, 10 mpps), and higher dTLB misses for 15 mpps *and* for 10 mpps again afterwards.

    perf stat -e cycles,dTLB-load-misses -C -- sleep 1

Please try the commands, and report back your findings!

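To make the fragmentation argument a bit more concrete, here is a minimal sketch that counts how many distinct pages a burst of buffers starts on. It is plain C++ rather than DPDK code: count_pages() and make_burst() are hypothetical helpers, the 2176-byte element size assumes RTE_MBUF_DEFAULT_BUF_SIZE is 2048 bytes of data room plus 128 bytes of headroom, and the strided burst is only a stand-in for a "fragmented" pool, not a model of what the mempool really does.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Count the distinct pages that contain the first byte of each buffer,
// assuming buffer i starts at offset i * elt_size in one contiguous region.
std::size_t count_pages(const std::vector<std::size_t>& buf_indices,
                        std::size_t elt_size, std::size_t page_size)
{
    assert(page_size != 0);
    std::unordered_set<std::size_t> pages;
    for (std::size_t idx : buf_indices)
        pages.insert(idx * elt_size / page_size);
    return pages.size();
}

// Build a burst of n buffer indices: 0, stride, 2 * stride, ...
// stride == 1 models linear re-use; a large stride models scattered buffers.
std::vector<std::size_t> make_burst(std::size_t n, std::size_t stride)
{
    std::vector<std::size_t> burst(n);
    for (std::size_t i = 0; i < n; ++i)
        burst[i] = i * stride;
    return burst;
}
```

Calling count_pages(make_burst(32, 1), 2176, 4096) versus count_pages(make_burst(32, 256), 2176, 4096) shows the scattered burst starting on roughly twice as many 4K pages as the linear one. Repeating both calls with a page size of 2097152 (2 MiB) shows far fewer pages in either case, which is the effect the perf counters should confirm or refute.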
Hope that helps, -Harry

[1] TLB & DPDK resources:
https://en.wikipedia.org/wiki/Translation_lookaside_buffer (DTLB just means Data-TLB, as opposed to instruction-TLB)
https://stackoverflow.com/questions/52077230/huge-number-of-dtlb-load-misses-when-dpdk-forwarding-test
https://www.dpdk.org/wp-content/uploads/sites/35/2018/12/LeiJiayu_Revise-4K-Pages-Performance-Impact-For-DPDK-Applications.pdf

From: Michał Niciejewski
Sent: Wednesday, December 22, 2021 9:57 AM
To: users@dpdk.org
Subject: Unexpected behavior when using mbuf pool with external buffers

Hi,

recently I stumbled upon a problem with mbuf pool with external buffers. I allocated some memory with aligned_alloc(), registered it, DMA mapped the memory, and created mbuf pool:

    // Page-aligned region large enough for every buffer of every queue
    size_t mem_size = RTE_ALIGN_CEIL(MBUFS_NUM * QUEUE_NUM * RTE_MBUF_DEFAULT_BUF_SIZE, 4096);
    auto mem = aligned_alloc(4096, mem_size);
    mlock(mem, mem_size);

    rte_pktmbuf_extmem ext_mem = {
        .buf_ptr = mem,
        .buf_iova = (uintptr_t)mem,
        .buf_len = mem_size,
        .elt_size = RTE_MBUF_DEFAULT_BUF_SIZE,
    };

    // Register the region with DPDK using a 4K page size
    if (rte_extmem_register(ext_mem.buf_ptr, ext_mem.buf_len, nullptr, 0, 4096) != 0)
        throw runtime_error("Failed to register DPDK external memory");

    // Make the region accessible to the device for DMA
    if (rte_dev_dma_map(dev, ext_mem.buf_ptr, ext_mem.buf_iova, ext_mem.buf_len) != 0)
        throw runtime_error("Failed to DMA map external memory");

    mp = rte_pktmbuf_pool_create_extbuf("ext_mbuf_pool", MBUFS_NUM * QUEUE_NUM, 0, 0,
                                        RTE_MBUF_DEFAULT_BUF_SIZE, rte_eth_dev_socket_id(0),
                                        &ext_mem, 1);
    if (mp == nullptr)
        throw runtime_error("Failed to create external mbuf pool");

The main loop of the program works like normal l2fwd: it receives packets and sends them to another port.

    std::vector<rte_mbuf *> mbufs(MAX_PKT_BURST);
    while (true) {
        auto rx_num = rte_eth_rx_burst(0, queue, mbufs.data(), MAX_PKT_BURST);
        if (!rx_num)
            continue;
        // ...
        auto tx_num = rte_eth_tx_burst(1, queue, mbufs.data(), rx_num);
        // Free the mbufs that the TX queue did not accept
        rte_pktmbuf_free_bulk(mbufs.data() + tx_num, rx_num - tx_num);
    }

Every second, the program prints some info about the packets received in this second and some stats regarding rte_eth_tx_burst calls. For example, logs printed while receiving and sending 10mpps:

    Number of all rx burst calls: 12238365
    Number of non-zero rx burst calls: 966834
    Avg pkt nb received per rx burst: 0.816879
    All received pkts: 9997264
    All sent pkts: 9997264
    All dropped pkts: 0

For lower traffic, everything looks fine. But when I start sending more packets some unexpected behavior occurs. When I increase traffic to 15mpps most of the packets are dropped on TX:

    Queue: 0
    Number of rx burst calls: 4449541
    Number of non-zero rx burst calls: 1616833
    Avg pkt nb received per rx burst: 3.36962
    All received pkts: 14993272
    All sent pkts: 5827744
    All dropped pkts: 9165528

After that, I checked again the results for 10mpps. Even though previously the application didn't have any troubles in managing 10mpps, now it does:

    Queue: 0
    Number of all rx burst calls: 8722385
    Number of non-zero rx burst calls: 1447741
    Avg pkt nb received per rx burst: 1.14617
    All received pkts: 9997316
    All sent pkts: 8194416
    All dropped pkts: 1802900

So basically it looks like sending too many packets breaks something and starts causing problems when sending fewer packets.

I also tried allocating huge pages for mbuf pool instead of memory returned from aligned_alloc:

    auto mem = mmap(0, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

And actually, it solved the problems - too big traffic doesn't affect lower traffic management. But I still want to know why memory allocated using aligned_alloc causes problems because in the place where I want to use mbuf pools with external buffers huge pages cannot be used like that.

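For anyone comparing the three runs, the per-second counters above relate to each other by simple arithmetic, sketched below. SecondStats and its helper functions are hypothetical names made up for this illustration and are not taken from the linked test program; the only assumption checked against the logs is that the reported average is received packets divided by all rx burst calls.

```cpp
#include <cstdint>

// One second of counters, named after the log lines above (hypothetical struct).
struct SecondStats {
    std::uint64_t rx_calls;  // number of all rx burst calls
    std::uint64_t rx_pkts;   // all received pkts
    std::uint64_t tx_pkts;   // all sent pkts
};

// Packets received but not accepted by the TX queue.
std::uint64_t dropped(const SecondStats& s)
{
    return s.rx_pkts - s.tx_pkts;
}

// Average packets per rx burst call, counting empty polls as well.
double avg_per_burst(const SecondStats& s)
{
    return s.rx_calls ? static_cast<double>(s.rx_pkts) / s.rx_calls : 0.0;
}

// Fraction of received packets that were dropped on TX.
double drop_ratio(const SecondStats& s)
{
    return s.rx_pkts ? static_cast<double>(dropped(s)) / s.rx_pkts : 0.0;
}
```

Fed with the 15mpps counters, drop_ratio() comes out at a little over 0.6, and with the second 10mpps run at a little under 0.2, which makes the "after overload" regression easy to compare against the perf numbers.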
The full code used for testing: https://gist.github.com/tropuq/22625e0e5ac420a8ff5ae072a16f4c06

NIC used: Supermicro AOC-S25G-I2S-O Std Low Profile 25G Dual Port SFP28, based on Intel XXV710

Did anyone have similar issues or know what could cause such behavior? Is this allocation of the mbuf pool correct or am I missing something?

Thanks in advance

--
Michał Niciejewski
Junior Software Engineer