Dmitry,

"On the contrary: rte_pktmbuf_pool_create() takes the amount of usable memory (dataroom) and adds space for rte_mbuf and the headroom. Furthermore, the underlying rte_mempool_create() ensures element (mbuf) alignment, may spread the elements between pages, etc."

Thanks. This is a crucial correction to my erroneous statement.

I'd like to press on with one of my questions which, after some additional thought, is answered, if only implicitly. For the benefit of other programmers who are new to this work, I'll explain. If wrong, please hammer on it.

The other crucial insight is: so long as memory is allocated on the same NUMA node as the lcore running the RXQ/TXQ that ultimately uses it, there is only marginal performance advantage to having per-core caching of mbufs in a mempool, as provided by the cache_size formal argument in rte_mempool_create() here:

https://doc.dpdk.org/api/rte__mempool_8h.html#a503f2f889043a48ca9995878846db2fd

In fact the API doc should really point out the advantage; perhaps it eliminates some cache sloshing to get the last few percent of performance. It is probably not a major factor in latency or bandwidth whether or not cache_size == 0. Memory access from an lcore x (aka H/W thread, vCPU) on NUMA node N is much the same as from any other lcore y != x, provided y also runs on N *and the memory was allocated for N*. Therefore, lcore affinity to a mempool is pretty much a red herring.

Consider this code, which I originally took as indicative of good mempool creation but which, on further thinking, got me confused:

https://github.com/erpc-io/eRPC/blob/master/src/transport_impl/dpdk/dpdk_init.cc#L76

    for (size_t i = 0; i < kMaxQueuesPerPort; i++) {
      const std::string pname = get_mempool_name(phy_port, i);
      rte_mempool *mempool =
          rte_pktmbuf_pool_create(pname.c_str(), kNumMbufs, 0 /* cache */,
                                  0 /* priv size */, kMbufSize, numa_node);

This has the appearance of creating one mempool for each RXQ and each TXQ. And in fact this is what it does.
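
To make the correction quoted at the top concrete for a call like the one above, here is a minimal sketch of the per-element arithmetic as I understand it from the rte_pktmbuf_pool_create() documentation. If I read it right, each element holds the rte_mbuf header, the per-mbuf private area, and data_room_size bytes, and data_room_size is itself documented as including RTE_PKTMBUF_HEADROOM. The two constants and the sketch_* helper names are my own stand-ins, not DPDK definitions. The value 128 for each is an assumption that matches a typical 64-byte-cache-line build. The mempool's own per-object header, trailer and alignment padding are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTIONS: hard-coded stand-ins for illustration only. A real program
 * would use sizeof(struct rte_mbuf) and RTE_PKTMBUF_HEADROOM from DPDK. */
#define ASSUMED_SIZEOF_RTE_MBUF  128u
#define ASSUMED_PKTMBUF_HEADROOM 128u

/* Bytes per pool element, before any mempool-level header, trailer or
 * alignment padding is added (those are not modelled here). */
static size_t sketch_elt_size(size_t priv_size, size_t data_room_size)
{
    return ASSUMED_SIZEOF_RTE_MBUF + priv_size + data_room_size;
}

/* Bytes left for packet data once the headroom at the front of the data
 * room is set aside. Whatever the packet does not fill is tail room. */
static size_t sketch_payload_room(size_t data_room_size)
{
    assert(data_room_size >= ASSUMED_PKTMBUF_HEADROOM);
    return data_room_size - ASSUMED_PKTMBUF_HEADROOM;
}
```

Under these assumptions, sketch_elt_size(0, 2176) comes to 2304 bytes per element and sketch_payload_room(2176) to 2048 bytes of packet data. In other words, the caller asks for data room and the library adds the rest.
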

The programmer here ensures the numa_node passed in as the last argument is the same NUMA node on which the RXQ/TXQ eventually runs. Since each lcore has its own mempool, and because the call above passes cache_size == 0, which rte_pktmbuf_pool_create() hands through to the underlying mempool, per-lcore caching doesn't arise. (I briefly checked mbuf/rte_mbuf.c to confirm.) Indeed *lcore v. mempool affinity is irrelevant* provided the RXQ for a given mempool runs on the same numa_node as specified in the last argument to rte_pktmbuf_pool_create().

Let's turn then to a larger issue: what happens if different RXQ/TXQs have radically different needs? As the code above illustrates, one merely allocates a size appropriate to an individual RXQ/TXQ by changing the count and size of mbufs, which is as simple as it can get. You have 10 queues, each with its own memory needs? OK, then allocate one memory pool for each. None of the other 9 queues will have that mempool pointer. Each queue will use only the mempool that was specified for it. To beat a dead horse: just make sure the numa_node in the allocation and the NUMA node which will ultimately run the RXQ/TXQ are the same.

On Sat, Jan 29, 2022 at 8:23 PM Dmitry Kozlyuk wrote:

> 2022-01-29 18:46 (UTC-0500), fwefew 4t4tg:
> [...]
> > 1. Does cache_size include or exclude data_room_size?
> > 2. Does cache_size include or exclude sizeof(struct rte_mbuf)?
> > 3. Does cache size include or exclude RTE_PKTMBUF_HEADROOM?
>
> Cache size is measured in the number of elements, irrespective of their size.
> It is not a memory size, so the questions above are not really meaningful.
>
> > 4. What lcore is the allocated memory pinned to?
>
> Memory is associated with a NUMA node (DPDK calls it "socket"), not an lcore.
> Each lcore belongs to one NUMA node, see rte_lcore_to_socket_id().
>
> > The lcore of the caller when this method is run? The answer here is important.
> > If it's the lcore of the caller when called, this routine should be called
> > in the lcore's entry point so it's on the right lcore the memory is intended.
> > Calling it on the lcore that happens to be running main, for example, could
> > have a bad side effect if it's different from where the memory will be
> > ultimately used.
>
> The NUMA node is controlled by "socket_id" parameter.
> Your considerations are correct, often you should create separate mempools
> for each NUMA node to avoid this performance issue. (You should also
> consider which NUMA node each device belongs to.)
>
> > 5. Which one of the formal arguments represents tail room indicated in
> > https://doc.dpdk.org/guides/prog_guide/mbuf_lib.html#figure-mbuf1
> [...]
> > 5. Unknown. Perhaps if you want private data which corresponds to tail room
> > in the diagram above one has to call rte_mempool_create() instead and focus
> > on private_data_size.
>
> Incorrect; tail room is simply an unused part at the end of the data room.
> Private data is for the entire mempool, not for individual mbufs.
>
> > Mempool creation is like malloc: you request the total number of absolute
> > bytes required. The API will not add or remove bytes to the number you
> > specify. Therefore the number you give must be inclusive of all needs
> > including your payload, any DPDK overhead, headroom, tailroom, and so on.
> > DPDK is not adding to the number you give for its own purposes. Clearer?
> > Perhaps ... but what needs? Read on ...
>
> On the contrary: rte_pktmbuf_pool_create() takes the amount
> of usable memory (dataroom) and adds space for rte_mbuf and the headroom.
> Furthermore, the underlying rte_mempool_create() ensures element (mbuf)
> alignment, may spread the elements between pages, etc.
>
> [...]
> > No. I might not.
> > I might have half my TXQ and RXQs dealing with tiny mbufs/packets, and the
> > other half dealing with completely different traffic of a completely
> > different size and structure. So I might want memory pool allocation to be
> > done on a smaller scale e.g. per RXQ/TXQ/lcore. DPDK doesn't seem to permit
> > this.
>
> You can create different mempools for each purpose
> and specify the proper mempool to rte_eth_rx_queue_setup().
> When creating them, you can and should also take NUMA into account.
> Take a look at init_mem() function of examples/l3fwd.
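
For newcomers, the advice just quoted (one pool per purpose, created on the right NUMA node) can be sketched without DPDK at all. This is only a planning model, not the l3fwd code. The struct names, plan_pools(), and the "pool_q" naming scheme are all hypothetical, and the lcore_to_socket[] table is my stand-in for what rte_lcore_to_socket_id() would report on a real system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical description of what one RX/TX queue needs from its pool. */
struct queue_need {
    unsigned lcore;     /* lcore that will poll this queue */
    unsigned n_mbufs;   /* number of mbufs wanted for this queue */
    unsigned data_room; /* data room per mbuf, in bytes */
};

/* Hypothetical record of the arguments one would later hand to
 * rte_pktmbuf_pool_create() for that queue. */
struct pool_plan {
    char name[32];
    unsigned n_mbufs;
    unsigned data_room;
    int socket_id;      /* NUMA node the pool should be created on */
};

/* Fill out[i] for every queue. lcore_to_socket[] is an ASSUMED lookup table
 * standing in for rte_lcore_to_socket_id(). Returns 0 on success, or -1 if
 * a queue names an lcore outside the table. */
static int plan_pools(const struct queue_need *needs, size_t n_queues,
                      const int *lcore_to_socket, size_t n_lcores,
                      struct pool_plan *out)
{
    assert(needs != NULL && lcore_to_socket != NULL && out != NULL);
    for (size_t i = 0; i < n_queues; i++) {
        if (needs[i].lcore >= n_lcores)
            return -1;
        snprintf(out[i].name, sizeof(out[i].name), "pool_q%zu", i);
        out[i].n_mbufs = needs[i].n_mbufs;
        out[i].data_room = needs[i].data_room;
        /* The pool follows the NUMA node of the lcore that uses it. */
        out[i].socket_id = lcore_to_socket[needs[i].lcore];
    }
    return 0;
}
```

In a real program each pool_plan entry would become one rte_pktmbuf_pool_create() call, and the resulting pointer would be given to rte_eth_rx_queue_setup() for that queue only.
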